<#import "/templates/guide.adoc" as tmpl>
<#import "/templates/links.adoc" as links>

<@tmpl.guide
title="Concepts for sizing CPU and memory resources"
summary="Understand these concepts to avoid resource exhaustion and congestion"
tileVisible="false" >

Use this as a starting point to size a production environment.
Adjust the values for your environment as needed based on your load tests.

== Performance recommendations

[WARNING]
====
* Performance will be lowered when scaling to more Pods (due to additional overhead) and when using a cross-datacenter setup (due to additional traffic and operations).

* Increased cache sizes can improve the performance when {project_name} instances run for a longer time.
This will decrease response times and reduce IOPS on the database.
Still, those caches need to be filled when an instance is restarted, so do not set resources too tightly based on the stable state measured once the caches have been filled.

* Use these values as a starting point and perform your own load tests before going into production.
====

Summary:

* The used CPU scales linearly with the number of requests up to the tested limit below.
* The used memory scales linearly with the number of active sessions up to the tested limit below.

Recommendations:

* The base memory usage for an inactive Pod is 1000 MB of RAM.

* For each 100,000 active user sessions, add 500 MB per Pod in a three-node cluster (tested with up to 200,000 sessions).
+
This assumes that each user connects to only one client.
Memory requirements increase with the number of client sessions per user session (not tested yet).

* In containers, {project_name} allocates 70% of the memory limit for heap-based memory and uses approximately 300 MB of non-heap memory.
To calculate the memory requested, use the calculation above.
To calculate the memory limit, subtract the non-heap memory from the value above and divide the result by 0.7; see the sizing sketch after this list.

* For each 30 user logins per second, allow 1 vCPU per Pod in a three-node cluster (tested with up to 300 per second).
+
{project_name} spends most of the CPU time hashing the password provided by the user.

* For each 450 client credential grants per second, allow 1 vCPU per Pod in a three-node cluster (tested with up to 2000 per second).
+
Most CPU time goes into creating new TLS connections, as each client runs only a single request.

* For each 350 refresh token requests per second, allow 1 vCPU per Pod in a three-node cluster (tested with up to 435 refresh token requests per second).

* Leave 200% extra headroom for CPU usage to handle spikes in the load.
This ensures a fast startup of the node and sufficient capacity to handle failover tasks such as re-balancing Infinispan caches when one node fails.
Performance of {project_name} dropped significantly when its Pods were throttled in our tests.

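The recommendations above can be combined into a small sizing calculator.
The following Python sketch is an illustration only and not part of {project_name}; the constants mirror the numbers in this list, and the function names are hypothetical.

[source,python]
----
# Hypothetical sizing helpers derived from the recommendations above.
# Constants come from this guide; names and structure are illustrative.

BASE_MEMORY_MB = 1000        # base memory usage of an inactive Pod
MB_PER_100K_SESSIONS = 500   # extra memory per Pod in a three-node cluster
NON_HEAP_MB = 300            # approximate non-heap memory in containers
HEAP_FRACTION = 0.7          # share of the memory limit used for heap
CPU_HEADROOM_FACTOR = 3      # 200% extra headroom for spikes and failover

def memory_request_mb(active_sessions: int) -> float:
    """Expected memory usage per Pod for a number of active user sessions."""
    return BASE_MEMORY_MB + MB_PER_100K_SESSIONS * active_sessions / 100_000

def memory_limit_mb(active_sessions: int) -> float:
    """Memory limit: subtract non-heap memory from the request, divide by 0.7."""
    return (memory_request_mb(active_sessions) - NON_HEAP_MB) / HEAP_FRACTION

def cpu_request(logins_per_s: float, grants_per_s: float, refreshes_per_s: float) -> float:
    """vCPUs requested per Pod in a three-node cluster."""
    return logins_per_s / 30 + grants_per_s / 450 + refreshes_per_s / 350

def cpu_limit(logins_per_s: float, grants_per_s: float, refreshes_per_s: float) -> float:
    """vCPU limit: three times the request for peaks, startup, and failover."""
    return CPU_HEADROOM_FACTOR * cpu_request(logins_per_s, grants_per_s, refreshes_per_s)
----
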
=== Calculation example

Target size:

* 50,000 active user sessions
* 30 logins per second
* 450 client credential grants per second
* 350 refresh token requests per second

Limits calculated:

* CPU requested: 3 vCPU
+
(30 logins per second = 1 vCPU, 450 client credential grants per second = 1 vCPU, 350 refresh token requests per second = 1 vCPU)

* CPU limit: 9 vCPU
+
(Allow for three times the CPU requested to handle peaks, startups, and failover tasks)

* Memory requested: 1250 MB
+
(1000 MB base memory plus 250 MB RAM for 50,000 active sessions)

* Memory limit: 1360 MB
+
(1250 MB expected memory usage minus 300 MB non-heap usage, divided by 0.7, rounded up)

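As a cross-check, the same numbers fall out of plain arithmetic; a minimal sketch, assuming the constants from the recommendations above:

[source,python]
----
# Reproducing the calculation example; constants are from the recommendations above.
cpu_requested = 30 / 30 + 450 / 450 + 350 / 350   # 3.0 vCPU
cpu_limit = 3 * cpu_requested                     # 9.0 vCPU
memory_requested = 1000 + 500 * 50_000 / 100_000  # 1250.0 MB
memory_limit = (memory_requested - 300) / 0.7     # ~1357 MB, rounded up to 1360 MB
----
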
== Reference architecture

The following setup was used to retrieve the settings above, running tests of about 10 minutes for different scenarios:

* OpenShift 4.14.x deployed on AWS via ROSA.
* Machinepool with `m5.4xlarge` instances.
* {project_name} deployed with the Operator and 3 pods.
* Default user password hashing with PBKDF2(SHA512) and 210,000 hash iterations.
* Client credential grants don't use refresh tokens (which is the default).
* Database seeded with 100,000 users and 100,000 clients.
* Infinispan caches at default of 10,000 entries, so not all clients and users fit into the cache, and some requests will need to fetch the data from the database.
* All sessions in distributed caches as per default, with two owners per entry, allowing one failing Pod without losing data.
* OpenShift's reverse proxy running in passthrough mode, where the TLS connection of the client is terminated at the Pod.
* PostgreSQL deployed inside the same OpenShift with ephemeral storage.
+
Using a database with persistent storage will result in longer database latencies, which might lead to longer response times; still, the throughput should be similar.

</@tmpl.guide>