<#import "/templates/guide.adoc" as tmpl>
<#import "/templates/links.adoc" as links>

<@tmpl.guide
title="Concepts for active-passive deployments"
summary="Understanding an active-passive deployment with synchronous replication" >

This topic describes a highly available active/passive setup and the behavior to expect. It outlines the requirements of the high availability active/passive architecture and describes the benefits and tradeoffs.

== When to use this setup

Use this setup to fail over automatically in the event of a site failure, which reduces the likelihood of losing data or sessions. Manual intervention is usually required to restore the redundancy after the failover.

== Deployment, data storage and caching

Two independent {project_name} deployments running in different sites are connected with a low latency network connection.
Users, realms, clients, offline sessions, and other entities are stored in a database that is replicated synchronously across the two sites.
The data is also cached in the {project_name} embedded {jdgserver_name} as local caches.
When the data is changed in one {project_name} instance, that data is updated in the database, and an invalidation message is sent to the other site using the replicated `work` cache.
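
The write path described above, update the database, refresh the local cache, and invalidate the peer site's cache via the replicated `work` cache, can be sketched as follows. This is a simplified illustration with hypothetical class and method names, not Keycloak's actual implementation:

```python
# Simplified sketch of the update-then-invalidate pattern described above.
# All names here are illustrative; Keycloak's actual classes differ.

class Site:
    """One site with a synchronously replicated database and a local cache."""

    def __init__(self, name, database):
        self.name = name
        self.database = database   # shared, synchronously replicated store
        self.local_cache = {}      # embedded local cache of entities
        self.peer = None           # the other site, reachable via the work cache

    def read(self, key):
        # Serve from the local cache; fall back to the database on a miss.
        if key not in self.local_cache:
            self.local_cache[key] = self.database.get(key)
        return self.local_cache[key]

    def write(self, key, value):
        # 1. Update the synchronously replicated database.
        self.database[key] = value
        # 2. Refresh the local cache.
        self.local_cache[key] = value
        # 3. Send an invalidation message to the other site
        #    (standing in for the replicated `work` cache).
        if self.peer is not None:
            self.peer.invalidate(key)

    def invalidate(self, key):
        # Drop the stale entry; the next read re-fetches from the database.
        self.local_cache.pop(key, None)


database = {}
site_a, site_b = Site("A", database), Site("B", database)
site_a.peer, site_b.peer = site_b, site_a

site_a.write("realm/acme", {"enabled": True})
print(site_b.read("realm/acme"))   # both sites observe the update
```

The key point of the pattern is step 3: without the invalidation message, the peer site would keep serving a stale local cache entry even though the shared database already holds the new value.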

Session-related data is stored in the replicated caches of the embedded {jdgserver_name} of {project_name} and forwarded to the external {jdgserver_name}, which replicates it synchronously to the external {jdgserver_name} running in the other site.
As session data of the external {jdgserver_name} is also cached in the embedded {jdgserver_name}, invalidation messages sent via the replicated `work` cache keep those local caches consistent.

In the following paragraphs and diagrams, references to deploying {jdgserver_name} apply to the external {jdgserver_name}.

image::high-availability/active-passive-sync.dio.svg[]

== Causes of data and service loss

While this setup aims for high availability, the following situations can still lead to service or data loss:

* Network failures between the sites or failures of components can lead to short service downtimes while those failures are detected.
The service will be restored automatically.
The system is degraded until the failures are detected and the backup cluster is promoted to service requests.

* Once failures occur in the communication between the sites, manual steps are necessary to re-synchronize a degraded setup.

* Degraded setups can lead to service or data loss if additional components fail.
Monitoring is necessary to detect degraded setups.

== Failures which this setup can survive

[%autowidth]
|===
| Failure | Recovery | RPO^1^ | RTO^2^

| Database node
| If the writer instance fails, the database can promote a reader instance in the same or other site to be the new writer.
| No data loss
| Seconds to minutes (depending on the database)

| {project_name} node
| Multiple {project_name} instances run in each site. If one instance fails, it takes a few seconds for the other nodes to notice the change, and some incoming requests might receive an error message or be delayed for some seconds.
| No data loss
| Less than one minute

| {jdgserver_name} node
| Multiple {jdgserver_name} instances run in each site. If one instance fails, it takes a few seconds for the other nodes to notice the change. Sessions are stored in at least two {jdgserver_name} nodes, so a single node failure does not lead to data loss.
| No data loss
| Less than one minute

| {jdgserver_name} cluster failure
| If the {jdgserver_name} cluster fails in the active site, {project_name} will not be able to communicate with the external {jdgserver_name}, and the {project_name} service will be unavailable.
The loadbalancer will detect the situation as `/lb-check` returns an error, and will fail over to the other site.

The setup is degraded until the {jdgserver_name} cluster is restored and the session data is re-synchronized to the primary site.
| No data loss^3^
| Seconds to minutes (depending on load balancer setup)

| Connectivity {jdgserver_name}
| If the connectivity between the two sites is lost, session information cannot be sent to the other site.
Incoming requests might receive an error message or be delayed for some seconds.
The primary site marks the secondary site offline, and will stop sending data to the secondary.
The setup is degraded until the connection is restored and the session data is re-synchronized to the secondary site.
| No data loss^3^
| Less than one minute

| Connectivity database
| If the connectivity between the two sites is lost, the synchronous replication will fail, and it might take some time for the primary site to mark the secondary offline.
Some requests might receive an error message or be delayed for a few seconds.
Manual operations might be necessary depending on the database.
| No data loss^3^
| Seconds to minutes (depending on the database)

| Primary site
| If none of the {project_name} nodes are available, the loadbalancer will detect the outage and redirect the traffic to the secondary site.
Some requests might receive an error message while the loadbalancer has not yet detected the primary site failure.
The setup will be degraded until the primary site is back up and the session state has been manually synchronized from the secondary to the primary site.
| No data loss^3^
| Less than one minute

| Secondary site
| If the secondary site is not available, it will take a moment for the primary {jdgserver_name} and database to mark the secondary site offline.
Some requests might receive an error message while the detection takes place.
Once the secondary site is up again, the session state needs to be manually synced from the primary site to the secondary site.
| No data loss^3^
| Less than one minute

|===

.Table footnotes:
^1^ Recovery point objective, assuming all parts of the setup were healthy at the time this occurred. +
^2^ Recovery time objective. +
^3^ Manual operations needed to restore the degraded setup.

The statement "`No data loss`" depends on the setup not being degraded from previous failures, which includes completing any pending manual operations to resynchronize the state between the sites.
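
Several of the recovery paths above hinge on the load balancer probing the `/lb-check` endpoint of each site and routing traffic accordingly. A minimal sketch of that routing decision follows; the site names are hypothetical, and the real probe interval, timeouts, and configuration depend on your load balancer:

```python
# Illustrative failover decision for a load balancer that probes each site's
# `/lb-check` endpoint. Site names and the `healthy` mapping are stand-ins.

PRIMARY, SECONDARY = "site-a", "site-b"

def choose_active_site(healthy):
    """Route to the primary while it is healthy; otherwise fail over.

    `healthy` maps a site name to the boolean result of its last
    `/lb-check` probe. Returns None when neither site can serve traffic.
    """
    if healthy.get(PRIMARY):
        return PRIMARY
    if healthy.get(SECONDARY):
        return SECONDARY
    return None

print(choose_active_site({PRIMARY: True, SECONDARY: True}))    # site-a
print(choose_active_site({PRIMARY: False, SECONDARY: True}))   # site-b (failover)
```

Note that in an active-passive setup the decision is deliberately asymmetric: traffic always prefers the primary, and the secondary only receives requests once the primary's health check fails.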

== Known limitations

Upgrades::
* On {project_name} or {jdgserver_name} version upgrades (major, minor and patch), all session data (except offline sessions) will be lost as neither supports zero downtime upgrades.

Failovers::
* A successful failover requires a setup not degraded from previous failures.
All manual operations like a re-synchronization after a previous failure must be complete to prevent data loss.
Use monitoring to ensure degradations are detected and handled in a timely manner.

Switchovers::
* A successful switchover requires a setup not degraded from previous failures.
All manual operations like a re-synchronization after a previous failure must be complete to prevent data loss.
Use monitoring to ensure degradations are detected and handled in a timely manner.

Out-of-sync sites::
* The sites can become out of sync when a synchronous {jdgserver_name} request fails.
This situation is currently difficult to monitor, and it would need a full manual re-sync of {jdgserver_name} to recover.
Monitoring the number of cache entries in both sites and the {project_name} log file can show when a re-synchronization becomes necessary.

Manual operations::
* Manual operations that re-synchronize the {jdgserver_name} state between the sites will issue a full state transfer which will put stress on the system (network, CPU, Java heap in {jdgserver_name} and {project_name}).

== Questions and answers

Why synchronous database replication?::
A synchronously replicated database ensures that data written in the primary site is always available in the secondary site on failover and no data is lost.

Why synchronous {jdgserver_name} replication?::
A synchronously replicated {jdgserver_name} ensures that sessions created, updated and deleted in the primary site are always available in the secondary site on failover and no data is lost.

Why is a low-latency network between sites needed?::
Synchronous replication defers the response to the caller until the data is received at the secondary site.
For synchronous database replication and synchronous {jdgserver_name} replication, low latency is necessary because each request can involve multiple interactions between the sites when data is updated, which amplifies the latency.
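
As a back-of-the-envelope illustration of this amplification (the interaction counts and round-trip times below are made-up values, not measurements), the added latency is simply multiplicative:

```python
# Back-of-the-envelope latency amplification for synchronous replication.
# The numbers used here are illustrative, not measured values.

def added_latency_ms(rtt_ms, cross_site_interactions):
    """Extra latency a request pays when each of its cross-site writes must
    be acknowledged by the secondary site before the caller gets a response."""
    return rtt_ms * cross_site_interactions

# A request that updates data several times performs several synchronous
# cross-site round trips, so the site-to-site latency is amplified:
for rtt in (1, 10, 50):
    print(f"RTT {rtt} ms, 4 interactions -> +{added_latency_ms(rtt, 4)} ms")
```

This is why a 50 ms site-to-site round trip, harmless for a single call, becomes a noticeable per-request delay once several synchronous interactions are chained.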

Why active-passive?::
Some databases support a single writer instance with a reader instance which is then promoted to be the new writer once the original writer fails.
In such a setup, it is beneficial for the latency to have the writer instance in the same site as the currently active {project_name}.
Synchronous {jdgserver_name} replication can lead to deadlocks when entries in both sites are modified concurrently.

Is this setup limited to two sites?::
This setup could be extended to multiple sites, and there are no fundamental changes necessary to have, for example, three sites.
Once more sites are added, the overall latency between the sites increases, and the likeliness of network failures, and therefore short downtimes, increases as well.
Therefore, such a deployment is expected to have worse performance and lower availability.
For now, it has been tested and documented with blueprints only for two sites.

Is a synchronous cluster less stable than an asynchronous cluster?::
An asynchronous setup would handle network failures between the sites gracefully, while the synchronous setup delays requests and throws errors to the caller where the asynchronous setup would have deferred the writes to {jdgserver_name} or the database in the secondary site.
However, as the secondary site would never be fully up-to-date with the primary site, this setup could lead to data loss during failover.
This would include:
+
--
* Lost logouts, meaning sessions are logged in the secondary site although they are logged out in the primary site at the point of failover when using an asynchronous {jdgserver_name} replication of sessions.
* Lost changes leading to users being able to log in with an old password because database changes are not replicated to the secondary site at the point of failover when using an asynchronous database.
* Invalid caches leading to users being able to log in with an old password because invalidating caches are not propagated at the point of failover to the secondary site when using an asynchronous {jdgserver_name} replication.
--
+
Therefore, tradeoffs exist between high availability and consistency. The focus of this topic is to prioritize consistency over availability with {project_name}.

== Next steps

Continue reading in the <@links.ha id="bblocks-active-passive-sync" /> {section} to find blueprints for the different building blocks.

</@tmpl.guide>
