Symas OpenLDAP Knowledge Base

Replication Primer

Introduction

Replication is all about propagating updates from one server to other servers in a timely manner. There are several possible architectures for replication, mainly:

  • master-slave
  • multi-master
  • a combination of both

Replication can also follow one of two possible algorithms:

  • full replication (send the entire entry on every modification)
  • delta-syncrepl (only transmit the changed attributes of the entry on modifications)

In current releases of OpenLDAP (since OpenLDAP 2.3), replication is based on a pull model, where servers wishing to receive replication issue search requests to another server (the “Master” or, in more contemporary language, the “Producer” or “Provider”). Last but not least, there are different modes of activation:

  • refresh - requests the replication information accumulated since the last time the server asked. This is intermittent replication, and data will probably be stale for much of the time.
  • refresh and persist - requests that the provider send replication information as soon as it is posted to the provider’s database. This results in very rapid replication, without large surges of replication traffic or long delays before data is updated.

In any case, replication is fully specified in RFC 4533 (https://tools.ietf.org/html/rfc4533) and the LDAP LOG Schema draft (https://tools.ietf.org/html/draft-chu-ldap-logschema-00), which describes the format used to store the updates locally.

A bit of history

Replication in LDAP has been around since the beginning: it was first included in the X.500 specification, LDAP’s predecessor. It has evolved somewhat in OpenLDAP. Earlier versions (2.3 and before) offered slurpd, a distinct process that read the server’s change log and pushed updates to remote servers. OpenLDAP 2.2 introduced syncrepl, a different implementation using a pull model (it was rewritten in 2.3). In syncrepl, the consumers initiate the replication and are responsible for replication state management. Delta-syncrepl was introduced in 2.4.

How is it supposed to work?

From a theoretical point of view, and if we simply put aside all the possible hiccups, replication is about propagating updates across many servers as quickly as possible. It must provide a guarantee that if one server gets disconnected from the others, it can catch up as soon as it has reconnected. Last but not least, replication should also handle collisions: when the same entry is updated more or less at the same time on two or more servers, the updates must always be propagated in a way that guarantees the network converges at the end of the propagation (i.e., we end up with one single version of this updated entry across all the servers).

Propagation in an MMR architecture

Let’s talk about how updates are propagated across multiple servers. First, there may be many servers, but the more servers we have, the more data traffic flows between them. Let’s see a few scenarios.

2 servers, MMR:

+----+            +----+
|    | ---------> |    |
| S1 |            | S2 |
|    | <--------- |    |
+----+            +----+

Here, any update applied to S1 will be sent to S2, and vice-versa. Of course, we should *not* send an update back - ever. We avoid it by adding a tag containing the origin of the update (the server ID). So any update sent by S1 will contain the server ID (1), and S2 will not send to S1 an update which contains the same ID. This is pretty straightforward.

3 servers, MMR :

+----+            +----+
|    | ---------> |    |
| S1 |            | S2 |
|    | <--------- |    |
+----+            +----+
 |  ^              ^  |
 |  |              |  |
 |  |              |  |
 |  |              |  |
 |  |              |  |
 |  |    +----+    |  |
 |  +--- |    | ---+  |
 |       | S3 |       |
 +-----> |    | <-----+
         +----+

Updates made on S1 will be propagated to S2 and S3. Each server, when it receives an update, will propagate it to all the others. Of course, it’s very likely that one or more servers will receive duplicate update requests. For example, as S1 is updated, it sends the update to S2 and S3. Next, S3 will send the same update to S2, and vice versa, each not knowing that the other has already received the update from S1. These duplicate updates must be discarded and are a normal part of the synchronization processing. Let’s see how the propagation flow works:

           +---> S2 (applied) ---> S3 (discarded)
          /
U -> S1 -+
          \
           +---> S3 (applied) ---> S2  (discarded)

Of course, we can have another scenario, where the propagation happens in a different order:

           +---> S2 (applied) ---> S3 (applied)
          /
U -> S1 -+
          \
           +-------------------------------> S3 (discarded)

This is possible if the propagation from S1 to S3 is slower than from S1 -> S2 -> S3. In any case, the result is the same.

4 servers, MMR:

+----+            +----+
|    | ---------> |    |
| S1 | <--+  +--> | S2 |
|    | <---\/---- |    |
+----+     /\     +----+
 |  ^ \   /  \   / ^  |
 |  |  \ /    \ /  |  |
 |  |   X      X   |  |
 |  |  / \    / \  |  |
 |  | /   \  /   \ |  |
+----+     \/     +----+
|    | ----/\---> |    |
| S3 | <--+  +--> | S4 |
|    | <--------- |    |
+----+            +----+

This is more complicated: each server is connected to three others, and we have arrows in all directions! In fact, we have N x (N-1) connections if we have N nodes in a complete multi-master architecture: as N grows, the number of connections grows quadratically, in O(N^2).

This schema is a worst case scenario, where all the servers are connected to each other. It will cause a lot of unnecessary, non-productive replication traffic. We could have a simpler scenario where we don’t have a cross connection between S1 and S4, or between S2 and S3.

What about the propagation of updates?

Worst case: an update is propagated to N-1 nodes, and each of those N-1 nodes must propagate the update to its N-2 related nodes (one less, because we don’t propagate back to the node that updated us), which is where it stops, because the update has already been received by those related nodes. Bottom line: we propagate the update (N-1) + (N-1) x (N-2) = (N-1)^2 times. We are still looking at an O(N^2) kind of operation… All in all, that means that once we have a not-so-big number of nodes in the MMR architecture, the number of updates being exchanged on the network is way too big, and you can expect the network to get congested pretty quickly. Four nodes in an MMR architecture is already a lot: each updated entry has to be sent nine times! Here is a table to provide some perspective on what to expect:

+-------------+-----------------------+------------------------------+
| Nb of nodes | Number of connections | Number of entries being sent |
|             | N x (N-1)             | (N-1) x (N-1)                |
+-------------+-----------------------+------------------------------+
| 2           | 2                     | 1                            |
| 3           | 6                     | 4                            |
| 4           | 12                    | 9                            |
| 5           | 20                    | 16                           |
| 6           | 30                    | 25                           |
| 7           | 42                    | 36                           |
| 8           | 56                    | 49                           |
| 9           | 72                    | 64                           |
| 10          | 90                    | 81                           |
| 11          | 110                   | 100                          |
| ...         | ...                   | ...                          |
| 100         | 9900                  | 9801                         |
+-------------+-----------------------+------------------------------+

As you can see, the growth is quadratic.

Also note that we have only counted the entries being propagated; the CSNs are exchanged as well, increasing the traffic further.

How can you minimize propagation?

You should direct client applications that do database updates (“writes”) to as few master servers as possible. One is ideal. The same applies when configuring “chaining” of updates arriving on replica servers (consumers): the chaining configuration should concentrate the updates onto the master taking the primary update load.
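As an illustration, here is a minimal, hedged sketch of what that might look like in slapd.conf; the host names, DNs and credentials are placeholders, and the exact layout depends on your setup. The consumer refers write operations to the primary with updateref, and the chain overlay can optionally chase that referral on the client’s behalf:

# --- On a consumer (replica) - illustrative slapd.conf fragments ---

# Global / frontend section: chase write referrals instead of returning them
overlay             chain
chain-uri           "ldap://provider.example.com"
chain-idassert-bind bindmethod=simple
                    binddn="cn=replicator,dc=example,dc=com"
                    credentials=secret
                    mode=self
chain-return-error  TRUE

# In the replicated database section: refer writes to the primary provider
database  mdb
suffix    "dc=example,dc=com"
updateref ldap://provider.example.com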

How do we stop the propagation?

One common question when using an MMR replication topology is: “how do we stop propagating updates?”. For instance, with 3 nodes S1, S2 and S3, an update on S1 will be propagated to S2 and S3, but S2 will also propagate the update to S3, and S3 to S2. We need to stop propagation at some point…

The way it works is that we have two rules controlling propagation:

  • don’t propagate to the originating server (the CSN records where the update comes from, as it contains the originator’s SID)
  • don’t propagate an update you have already received.

Those 2 rules are sufficient to guarantee that propagation will stop as fast as possible.

Propagation in a Master-Slave architecture

This is *way* simpler, because the slave doesn’t propagate anything: it only receives updates and applies them. However, if a slave is connected to more than one master, it must deal with potential collisions (but this is unrelated to how updates propagate). So an MS architecture always propagates an entry in O(N): one master propagates an update to all its slaves, and no more.

Other architectures

Of course, we can imagine other architectures, with more than a single master and more than a single slave. We have to think of them as combinations of the two previous use cases.

When things go wild

Ok, we have described a system where the servers remain synchronized, and where the updates propagate successfully. What happens when this is not the case? There are a few scenarios to consider:

  • when servers get disconnected and reconnected
  • when servers aren’t time synchronized
  • when a server is added to a network of servers
  • when a server is removed from a network of servers

Disconnected server

This is the most likely scenario. Here the network isn’t working correctly: servers get disconnected for an indeterminate period of time, and reconnect later. There are four cases that fall under this scenario:

  1. no updates are received over the network to the server
  2. some updates are received over the network to the server
  3. some updates occur on a server that is disconnected from the other servers
  4. some updates occur over the network and on a server

It’s important to understand that a ‘disconnected’ server may still be visible to a client connected to it. Imagine a scenario where you have a server in New York and a server in Paris, with the connection between the two cities interrupted due to a submarine cable being cut.

Here are the actions to take for each of those use cases:

  1. The best possible scenario: we have nothing to do. Still, we have to study how the servers know that they are in sync…
  2. When the server reconnects to the network, it lags behind, and should receive *every* update applied on the network, up to the point where it is at the same level.
  3. Same thing, but the other way around: the server will have to send its updates to the network, up to the point where all servers are at the same level.
  4. Now we need to deal with possible collisions… Some updates might have been made on the very same entries, and we have to decide which update is kept, and which one is not.

The key here is that updates are done one after the other, following a time sequence. A second important aspect is that each entry is uniquely identified across all the servers by its UUID. That means that if one entry is updated on two different servers, we will have to consider the times at which it was updated to know which update has to be applied. Things can quickly become complex when entries are deleted on one server and updated on another…

Anyway, in order to understand how replication works, one has to consider the network of servers as one single server holding all the data, which gets updated one modification after the other. It does not matter if, inside this network, one server is updated before another one, because in the end what counts is the date of the initial update.

That being said, once a disconnected server reconnects, we have to deal with the potential differences between the two servers. Here, we will only consider the differences from a local server to a remote server where the local server has been shut down and has therefore received no updates at all. In this case, we just have to catch up with the updates done on the remote server. There are two phases:

  • refresh: this is when the local server catches up with the updates done on the remote server
  • persist: this is an optional phase where the local server remains connected to the remote server indefinitely, receiving updates as soon as they have been applied on the remote server

If the persist phase is not activated, the local server will have to periodically poll the remote server to get updates.

Note: both syncrepl and delta-syncrepl are seen as pull operations, while the antiquated slurpd was a push operation; the difference is that in pull mode the client initiates the operation, while in push mode the server does. The terminology is a bit odd for refresh-and-persist, because once the initial phase (re-synchronisation) is done, the client never asks again for additional updates: the provider sends them as they occur. In refresh-only mode, where the client has to regularly ask the server for new information, we are clearly in pull mode.

Syncrepl or delta-syncrepl?

At this point, we have to talk about the updates and their content. The initial syncrepl protocol transfers full entries, leaving it up to the receiver to deal with the differences (in other words, applying the update is an implementation detail). The modified protocol, delta-syncrepl, makes an update a container that stores the modified elements - which can be a full entry, or much less, for instance if we just modified one attribute. Delta-syncrepl has many advantages:

  • it’s less verbose: we don’t transmit full entries, which may be big - especially if they contain pictures
  • it’s faster, as we transmit less information (except when it comes to the addition of entries)
  • it’s way faster on reconnect, because we can keep a log of many modifications (tens of thousands, or even millions), so we don’t need to go through the present phase
  • deleted entries don’t have to be discovered on reconnection: they are logged

It also has a few drawbacks:

  • every update requires an extra write in the accesslog
  • managing conflicts is slightly more complicated

All in all, it’s the best possible choice for replication.
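As an illustration, here is a hedged sketch of the consumer-side syncrepl directive for delta-syncrepl, in slapd.conf syntax (host name, DNs and credentials are placeholders); the logbase, logfilter and syncdata parameters are what distinguish it from plain syncrepl:

# --- Consumer side, delta-syncrepl - illustrative slapd.conf fragment ---
database mdb
suffix   "dc=example,dc=com"

syncrepl rid=001
         provider=ldap://provider.example.com
         bindmethod=simple
         binddn="cn=replicator,dc=example,dc=com"
         credentials=secret
         searchbase="dc=example,dc=com"
         type=refreshAndPersist
         retry="60 +"
         # delta-syncrepl specifics: read the changes from the provider's accesslog
         logbase="cn=accesslog"
         logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
         syncdata=accesslog

The provider-side configuration (accesslog and syncprov overlays) is sketched further down, in the Syncprov section.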

Replication steps

Replication is done in two steps:

  • a refresh phase, where the consumer reconnects and re-synchronizes with the producer
  • a persistent phase, where the producer sends all the updates to the consumer

We will describe those steps in this part of the documentation.

Refresh phase

The refresh phase is about re-syncing two servers that were disconnected. We have two options here, depending on whether we are using delta-syncrepl or not:

  • with delta-syncrepl: in this case, we have a log where updates are stored. This makes things way simpler, as we just have to apply those updates on the consumer, in the order they were applied on the producer. Usually we don’t have that many updates, so it’s pretty fast. There is a caveat though: the log *must* contain all the updates since the consumer disconnected… If that is not the case, then we have to go through a more complex refresh phase (which is, actually, the same mechanism as the one used without delta-syncrepl).
  • without delta-syncrepl: it’s a bit more complex. First of all, the producer does not keep track of deleted entries (well, it can keep a list of deleted entries in memory, if you define a syncprov-sessionlog). We have to detect which entries are present on both sides, which ones have been modified, and which ones have been deleted.

Let’s see in detail how the sync works now… (Keep in mind that we may be in a multi-master configuration, so when two servers get disconnected, we might have updates on both servers. We will first focus on a simple use case: the producer gets all the updates, and there is no update on the consumer. The real scenario - i.e., both servers can be updated - will be analysed later on.)

Delta-syncrepl refresh phase

So we have a change log that registers all updates (additions, modifications, deletions) on the provider. Here is a timeline exposing the two possible cases:

                 U0  U1  U2      U3
                 |   |   |       |   
              T1 v   v   v    T2 v   T3
T0...------------+------------+~~~~~~+------> Tn
                 ^            ^      ^
                 |            |      |
                 |            |      +---- consumer reconnected
                 |            |
                 |            +---- consumer disconnected
                 |
                 +---- Oldest update in the log (U0)

Here, the disconnection and reconnection are both more recent than the oldest update stored in the log: we will be able to replay on the consumer all the updates made between T2 and T3. This is the best possible scenario. U1 and U2 have already been propagated to the consumer, and U3 will be propagated when the consumer reconnects. We are good: syncing up the two servers is simple. The second case is when the consumer has been disconnected for too long, and we no longer have all the needed updates in the log:

            U0       U1  U2     U3   U4
            |        |   |      |    |   
            v    T1  v   v   T2 v    v T3
T0...-------+----+~~~~~~~~~~~~~~+~~~~~~+------> Tn
            |    ^              ^      ^
            |    |              |      |
            |    |              |      +---- Consumer reconnected
            |    |              |
            |    |              +---- Oldest update in the log
            |    |
            |    +---- Consumer disconnected
            |
            +--- latest update the consumer got.

Here, the disconnection happened before the oldest update stored in the log. We are pretty unlucky: the updates made between T1 and T2 have been lost (here, U1 and U2 are no longer present in the log)… We have to do a full resync. Basically, we fall back to a full syncrepl refresh with no update in memory; see the next section.
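The practical consequence is that the change log retention on the provider must be sized to cover the longest consumer outage you expect. A hedged example in slapd.conf syntax (the retention values are arbitrary placeholders; adjust them to your own outage tolerance):

# accesslog overlay on the provider's replicated database
overlay    accesslog
logdb      cn=accesslog
logops     writes
logsuccess TRUE
# keep 30 days of changes so a consumer can be offline for up to a month
# and still resync from the log; the purge job runs once a day
logpurge   30+00:00 01+00:00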

Full-syncrepl refresh phase

This is where things start to be interesting… Remember that we also fall back to this mechanism when using delta-syncrepl, if we don’t have enough data in the logs. The very first step is to determine where to start on the producer. The consumer first sends a cookie containing the latest update it received from the producer; it is then up to the producer to fetch all the entries younger than that date.

                       +--- The consumer disconnects
                       |
                U1     |   U2     U3
                |      |   |      |
                v      v   v      v    T5
Producer T0...---------+~~~~~~~~~~~~~~~+-----------> Tn
             T1 ^     T2   T3     T4   ^
                |                      |
                |                      +--- The consumer reconnects
                |
                 +--- Latest update received by the consumer

Here, we see that when the consumer reconnects at T5, it sends the timestamp of the latest update it received, T1 (for the U1 update). The producer then selects all the updates more recent than T1, i.e., U2 and U3. Those entries are sent to the consumer, where they are applied.

Now, we may have had some entry deletions on the producer since the consumer disconnected. Deleted entries are no longer visible, so we are facing an issue: how do we detect deleted entries and reflect those deletions on the consumer? The syncrepl protocol anticipated this issue.

So, first, the consumer sends the timestamp of the latest update it received (here, T1). The producer returns to the consumer every entry whose timestamp is younger than that (the timestamp is stored either in the createTimestamp or in the modifyTimestamp attribute); those are all the modified or added entries. The provider also sends an empty entry (with no attributes) for every entry that has been neither added nor modified (and there may be a lot of them), so that the consumer can work out which of its entries no longer exist on the provider and should be deleted.

Let’s see an example:

At Tn, the producer contains {E1(Ta), E2(Tf), E3(Tc), E4(Th), E5(Tb)} (each entry has a timestamp Ti, which is the date it was added or last modified; each of those timestamps is older than Tn). The consumer contains the exact same entries with the exact same timestamps:

Producer : {E1(Ta), E2(Tf), E3(Tc), E4(Th), E5(Tb)}, or {E1(Ta), E5(Tb), E3(Tc), E2(Tf), E4(Th)} ordered chronologically
Consumer : {E1(Ta), E2(Tf), E3(Tc), E4(Th), E5(Tb)}, or {E1(Ta), E5(Tb), E3(Tc), E2(Tf), E4(Th)} ordered chronologically

The latest update received by the consumer is E4(Th). The consumer disconnects at Tn. At Tn+1, E2 is deleted on the producer. At Tn+2, E5 is modified on the producer. Now, we have these contents:

Producer : {E1(Ta), E3(Tc), E4(Th), E5(Tn+2)}, or {E1(Ta), E3(Tc), E4(Th), E5(Tn+2)} ordered chronologically
Consumer : {E1(Ta), E2(Tf), E3(Tc), E4(Th), E5(Tb)}, or {E1(Ta), E5(Tb), E3(Tc), E2(Tf), E4(Th)} ordered chronologically

Then the consumer reconnects.

The producer sends E5 fully to the consumer (the modified entry), and sends E1, E3 and E4 with no attributes (the unchanged entries). The consumer replaces its local version of E5 with the new one, and it can delete every entry it holds that is not in the list of received entries: here, that’s E2: {E1, E2, E3, E4, E5} - {E5 (updated)} - {E1, E3, E4} = {E2}

And we are done!

As you have already realized, doing so requires the provider to send a response for *each* entry it contains, and that can be millions… The refresh phase is expected to take quite a while in this case! It is also very demanding on the consumer side, which has to compare the received entries with the local entries (at least, their entryUUIDs).

There is another way to process the deleted entries: assuming we keep track of the deleted entryUUIDs on the provider in some log, it’s enough to send this list to the consumer. That clearly speeds up the sync phase, especially if the consumer’s last update is more recent than the oldest deletion recorded on the provider.

Last but not least, if the number of updates is too big, it might be better to just scratch the consumer database entirely and simply re-inject all the entries.

It’s also important to know that there are two options when using syncrepl: refresh-only and refresh-and-persist. The first option simply gets the two servers synchronized, periodically: if some updates occur on one of the servers, they are only propagated when the next refresh occurs. The refresh-and-persist option not only synchronizes the servers, it also keeps the connection open so that every update gets propagated immediately. This is the preferred option when setting up replication.
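To make the difference concrete, here is a hedged sketch of the two alternative consumer-side forms in slapd.conf syntax (host name, DNs and credentials are placeholders; you would use one or the other, not both):

# refreshOnly: poll the provider periodically (interval=dd:hh:mm:ss, here once a day)
syncrepl rid=001
         provider=ldap://provider.example.com
         type=refreshOnly
         interval=01:00:00:00
         searchbase="dc=example,dc=com"
         bindmethod=simple
         binddn="cn=replicator,dc=example,dc=com"
         credentials=secret

# refreshAndPersist: stay connected and receive updates as they happen;
# if the connection drops, retry every 60 seconds, forever
syncrepl rid=001
         provider=ldap://provider.example.com
         type=refreshAndPersist
         retry="60 +"
         searchbase="dc=example,dc=com"
         bindmethod=simple
         binddn="cn=replicator,dc=example,dc=com"
         credentials=secret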

Miscellaneous

CSN

A CSN - Change Sequence Number - is a value that carries information about the time and provenance of an entry’s modification or addition. Every update needs to be ‘timestamped’, and this timestamp has to be unique; this is where the CSN comes into play. The CSN is described in https://tools.ietf.org/html/draft-sermersheim-ldap-csn-02. The CSN syntax is:

<CSN>            ::= <timestamp> # <changeCount> # <serverID> # <modifierNumber>
<timestamp>      ::= a GMT-based time, YYYYmmddHHMMSS.uuuuuuZ
<changeCount>    ::= [000000-ffffff]
<serverID>       ::= [000-fff]
<modifierNumber> ::= [000000-ffffff]

It’s critical to note that the CSN contains information about the server where the associated entry was originally updated. This information is used to avoid sending the update back to the originating server, breaking a potential infinite loop.
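For example, a made-up CSN stamped by the server with serverID 001 would break down as follows:

entryCSN: 20240612143005.123456Z#000000#001#000000
          |                      |      |   |
          |                      |      |   +-- modifierNumber
          |                      |      +------ serverID (SID), here 001
          |                      +------------- changeCount
          +------------------------------------ timestamp (GMT, microsecond precision)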

The contextCSN

At some point, when a consumer reconnects, it has to send the latest update it has received. This information is present in the latest entryCSN (each entry has a CSN), but searching for it would be a costly operation, and so would the equivalent operation on the provider (we need to compare both CSNs anyway). In order to avoid this search, we store a contextCSN directly at the root of the database. Having this information quickly available is convenient: in order to know whether two servers are synchronized, it’s enough to compare their contextCSNs. If they are not equal, the servers are not in sync. Note that the converse does not hold: just because they are equal does not mean the two servers are in sync, as there is a slight possibility that some missing updates have gone unnoticed. This is discussed in another document.

Side note: it’s dubious that storing this information at the root is strictly necessary: we could compute this value at startup and keep it in memory. Doing that would save a write operation when applying an update.

It’s also important to remember that we may have more than one contextCSN: this piece of information is related to a backend, not to a server. We may have many backends, and several contextCSN values will be present in each backend root: one per server taking part in the replication architecture (remember that the CSN contains the serverID). One clear thing is that you can’t have two different replication areas for a single backend. Let’s dig a bit deeper into these concepts: when one defines a syncrepl directive, some parameters describe the area being replicated - a search base, a search filter, and optionally the attributes that should be replicated. That does not necessarily cover all the entries in the backend. What you can’t have is a different syncrepl directive covering a different part of the backend: that will not work, because a replication pattern is already defined on this backend.
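A common way to check whether two servers are in sync is therefore to read the contextCSN from the suffix entry of each of them and compare the values. A hedged example (host names and suffix are placeholders):

ldapsearch -x -H ldap://server1.example.com -s base \
           -b "dc=example,dc=com" contextCSN
ldapsearch -x -H ldap://server2.example.com -s base \
           -b "dc=example,dc=com" contextCSN

# Typical (made-up) output - one contextCSN value per serverID taking part in MMR:
# contextCSN: 20240612143005.123456Z#000000#001#000000
# contextCSN: 20240612143112.654321Z#000000#002#000000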

Syncprov

This overlay is used on the producer side. It is responsible for sending entries or updates to the consumer. The key here is: syncprov sends entries.

How does it work? Actually, it depends on the type of replication you use: syncrepl or delta-syncrepl.

In syncrepl mode, syncprov keeps a set of updates in memory (the number of updates we can keep is limited, and defined by the syncprov-sessionlog parameter), which will be sent to the consumer when it reconnects. If the consumer has been disconnected for too long, we may run out of those stored updates and fall back to a full refresh mode, where entryUUIDs are exchanged to work out the added or deleted entries.
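For plain syncrepl, the producer side might look like this hedged sketch (slapd.conf syntax, placeholder suffix): syncprov-sessionlog sets how many write operations the in-memory session log remembers, and syncprov-checkpoint controls how often the contextCSN is written back to the database:

# --- Producer side, plain syncrepl - illustrative slapd.conf fragment ---
database mdb
suffix   "dc=example,dc=com"

overlay  syncprov
# write the contextCSN to the database every 100 operations or every 10 minutes
syncprov-checkpoint 100 10
# remember the last 10000 write operations in memory, for resyncing consumers
syncprov-sessionlog 10000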

In delta-syncrepl mode, it’s more complex: we use the accesslog database to store the updates, except when the consumer has been disconnected for too long and we have to fall back to the refresh mode. The consequence is that we need to configure syncprov in two places: on the replicated database and on the accesslog database… And here, it’s important to understand that the syncprov overlay sends entries read either from the accesslog database or from the application database.
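On the provider, a delta-syncrepl setup therefore ends up looking roughly like this hedged sketch (slapd.conf syntax; suffixes, directories and tuning values are placeholders): syncprov is instantiated both on the accesslog database and on the replicated database, and the accesslog overlay shown earlier (in the delta-syncrepl refresh phase section) feeds the log from the replicated database:

# --- Provider side, delta-syncrepl - illustrative slapd.conf fragment ---

# The accesslog database, holding the change log read by the consumers
database  mdb
suffix    "cn=accesslog"
directory /var/lib/ldap/accesslog
index     default eq
index     entryCSN,objectClass,reqEnd,reqResult,reqStart
overlay   syncprov
syncprov-nopresent  TRUE
syncprov-reloadhint TRUE

# The replicated (application) database: syncprov here as well,
# plus the accesslog overlay (logdb cn=accesslog) shown earlier
database  mdb
suffix    "dc=example,dc=com"
directory /var/lib/ldap/data
overlay   syncprov
syncprov-checkpoint 100 10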

What if …?

Ok, this is supposed to work just fine. However, from time to time, we may run into trouble. This part of the documentation lists the possible pitfalls and explains how to figure out what’s going on, and how to fix replication.

ServerID

Each serverID is associated with a specific server. Once assigned, it should not be changed. Neatness counts at design time; once the cluster has been up and replicating, it is too late for neatness.

When a server is taken out of service, its serverID should be retired.
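A hedged example of serverID assignment in slapd.conf for a two-node MMR setup (URLs are placeholders); the same two lines appear on both nodes, and each server picks the serverID whose URL matches one of its own listeners:

# Global section of slapd.conf, identical on both MMR nodes
serverID 1 ldap://ldap1.example.com
serverID 2 ldap://ldap2.example.com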

RID

The RID is assigned by the consumer for its own replication housekeeping. It is reported in the slapd logs, so you can figure out which of the possible providers a change was pulled from. Many consumers can use the same RIDs without OpenLDAP having an issue. However, it makes life simpler if you assign RIDs so that each producer is represented by the same RID on every consumer.
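For example, a hedged convention (placeholder host names) is to always use the same rid for the same producer on every consumer:

# On every consumer, rid=001 always points at ldap1 and rid=002 at ldap2
syncrepl rid=001
         provider=ldap://ldap1.example.com
         # remaining syncrepl parameters omitted for brevity
syncrepl rid=002
         provider=ldap://ldap2.example.com
         # remaining syncrepl parameters omitted for brevity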

CSN too old

“CSN too old” tells you that a syncrepl message carries a CSN that is older than the current contextCSN. That means the change has already been processed, and the message can be (and is) ignored. Think of this as a REPEATed update… a non-productive replication action. In MMR environments, small numbers of these are to be expected. Large numbers of REPEATs (CSN too old) should trigger a review of the replication pairing, with a view to reducing the non-productive activity.

Best practices

Here are a few best-practices that are to be followed when setting up replication.

  • Use Delta-syncrepl, with MDB.
  • Keep it simple. You usually don’t need more than 2 servers using MMR. Adding a third in “hot-standby” (disaster recovery) mode is OK so long as it doesn’t get write requests during normal operation.
  • Avoid complex topologies in which a server can be updated by two producers that are themselves replicating each other: that results in warnings in the log (CSN too old), because the same update flows in from both producers when it has already been applied.
  • Sync logs are verbose. However, this is the only way to get information about what’s going on. You want to have them enabled.
  • NTP. syncrepl simply does not work correctly unless the server clocks are VERY closely synchronized; that’s the basis of the protocol. NTP should be running on all servers.