Tracing Replication Issues in slapd Logs
THIS HAS TO BE BROKEN UP INTO MANY LINKED PAGES. PROBABLY GROSSLY TOO DETAILED
A few items to note:
- Log files with synchronization messages can be extremely large and
difficult to work with.
- Familiarity with text processing utilities like grep, sed, sort,
etc. can help filter out log messages unrelated to replication.
- Terminal-based text editors like vim or emacs tend to work better than GUI-based editors when working with extremely large log files.
- Familiarity with text processing utilities like grep, sed, sort,
etc. can help filter out log messages unrelated to replication.
- Most syncrepl client connections last for several days and log
hundreds of thousands of messages. If a syncrepl client connection spans
several days, log rotation will cause the connection messages to span
several archive files. This is normal.
- To recombine archived and compressed slapd logs, use
zcat file1.gz file2.gz [file...] > newlog.log
- To recombine archived and compressed slapd logs, use
Effective use of RIDs
Some syncrepl log messages are not recorded with an associated connection ID, but, they record the RID (Replica ID) used by the syncrepl client. RIDs may be an integer between 000 and 999.
In MMR environments a DIT being replicated may have one or more syncrepl client configurations (one for each provider). Each syncrepl client for that DIT must have a unique RID. Consumers do not share their RIDs with other consumers. This means consumers can use the same RIDs for their syncrepl clients as the other consumers. In an example MMR environment:
- There are two providers (P1 and P2) and two consumers (C1 and C2)
- C1 uses RIDs 001 for P1 and uses 002 for P2
- C2 uses RIDs 001 for P1 and uses 002 for P2
- P1 uses RID 001 for P2, and P2 uses RID 001 for P1
Having identical RIDs on each consumer is commonly used and does not create a conflict. However, using identical RIDs for a provider on several consumers can make troubleshooting difficult as identical RIDs make it difficult to tell from the provider side which consumer is associated with a replication issue. It is recommended that a numbering scheme is used to help differentiate syncrepl clients. For example:
- C1 uses RID 011 for P1 and uses 012 for P2
- C2 uses RID 021 for P1 and uses 022 for P2
- P1 uses RID 302 for P2, and P2 uses RID 301 for P1
Log Entry Format
When slapd is run in regular mode (no -d
The general pattern for a slapd log entry is as follows:
timestamp<space>hostname<space>slapd[process-id]:<space>message
Note: Timestamp formats vary depending on the OS and sylog configuration.
When running slapd in debug mode (-d
timestamp(epoch in hexadecimal)<space>message
This pattern prints the timestamp as a less than human readable Unix epoch timestamp in hexadecimal format. Debug log messages also exclude the hostname, process name and process ID number. If a human-readable timestamp is needed from a debug log file, the following perl one-liner will convert the timestamp:
perl -pi -e 'use POSIX "strftime"; s/^([[:xdigit:]]{8})/POSIX::strftime("%F %T", gmtime(hex $1))/ge' <infile>.log
Identifying and Following Replication Sessions
Identifying Replication Connections
When consumers are connected to providers, the normally use the sync type “refreshAndPersist” (only in very specific cases should consumers use “refreshOnly”). The “refreshAndPersist” sync type maintains a continual ldapsearch on the provider to monitor for and process changes from the provider. In the provider logs, the messages for these connections look like any other ldapsearch log messages. There are a couple of ways to find syncrepl connections in the log files.
- Look for connections with large “
op=NNN
” values. The “op=NNN
” is the sequence number of the operations performed during an individual connection. (Replication sessions typically last forever while other sessions are short-time lived) - Look for connections that begin with
‘
SRCH base="<searchbase>" scope=2 deref=0 filter="(objectClass=\*)"
’ and are followed by messages with ‘SRCH attr=* +
’ This is the exact search filter used by syncrepl, most of the time, unless the syncrepl configuration specified a different filter - and in this case, try to match it in the logs).
Following a Replication Session: Single-Master
The following entry and log examples follow an LDAP replication session in a single-master environment from the start of a syncrepl connection to the completion of one modification. In the example, the provider is “P1” and the consumer is “C1” using RID=001.
Pre-update contextCSN and entryCSN:
– P1 –
dn: dc=example,dc=com
contextCSN: 20161117195107.062105Z#000000#000#000000
dn: cn=foo,dc=example,dc=com
entryCSN: 20161117193250.972569Z#000000#000#000000
– C1 –
dn: dc=example,dc=com
contextCSN: 20161117195107.062105Z#000000#000#000000
dn: cn=foo,dc=example,dc=com
entryCSN: 20161117193250.972569Z#000000#000#000000
Provider Log:
Consumer connects to provider and starts persistent search:
01: Nov 17 11:56:33 P1 slapd[3011]: conn=1000 fd=10 ACCEPT from IP=127.0.0.1:54514 (IP=0.0.0.0:389)
02: Nov 17 11:56:33 P1 slapd[3011]: conn=1000 op=0 BIND dn="cn=replicator,dc=example,dc=com" method=128
03: Nov 17 11:56:33 P1 slapd[3011]: conn=1000 op=0 BIND dn="cn=replicator,dc=example,dc=com" mech=SIMPLE ssf=0
04: Nov 17 11:56:33 P1 slapd[3011]: conn=1000 op=0 RESULT tag=97 err=0 duration=0.229ms text=
05: Nov 17 11:56:33 P1 slapd[3011]: conn=1000 op=1 SRCH base="dc=example,dc=com" scope=2 deref=0 filter="(objectClass=*)"
06: Nov 17 11:56:33 P1 slapd[3011]: conn=1000 op=1 SRCH attr=* +
Client connects to provider and requests a MOD operation:
07: Nov 17 11:56:41 P1 slapd[3011]: conn=1001 fd=25 ACCEPT from IP=127.0.0.1:59724 (IP=0.0.0.0:389)
08: Nov 17 11:56:41 P1 slapd[3011]: conn=1001 op=0 BIND dn="dc=example,dc=com" method=128
09: Nov 17 11:56:41 P1 slapd[3011]: conn=1001 op=0 BIND dn="dc=example,dc=com" mech=SIMPLE ssf=0
10: Nov 17 11:56:41 P1 slapd[3011]: conn=1001 op=0 RESULT tag=97 err=0 duration=0.068ms text=
11: Nov 17 11:56:45 P1 slapd[3011]: conn=1001 op=1 MOD dn="cn=foo,dc=example,dc=com"
12: Nov 17 11:56:45 P1 slapd[3011]: conn=1001 op=1 MOD attr=sn
Provider queues a new entryCSN for the entry being modified:
13: Nov 17 11:56:45 P1 slapd[3011]: slap_queue_csn: queueing 0x7fa558106980 20161117195645.269065Z#000000#000#000000
Modification is written successfully:
14: Nov 17 11:56:45 P1 slapd[3011]: conn=1001 op=1 RESULT tag=103 err=0 duration=0.513ms text=
Provider commits the new entryCSN:
15: Nov 17 11:56:45 P1 slapd[3011]: slap_graduate_commit_csn: removing 0x7fa558106980 20161117195645.269065Z#000000#000#000000
Provider notifies the consumer of new contextCSN:
16: Nov 17 11:56:45 P1 slapd[3011]: syncprov_sendresp: cookie=rid=001,csn=20161117195645.269065Z#000000#000#000000
Client disconnects:
17: Nov 17 11:56:48 P1 slapd[3011]: conn=1001 op=2 UNBIND
18: Nov 17 11:56:48 P1 slapd[3011]: conn=1001 fd=25 closed
The persistent search from the consumer continues to listen for changes until the connection is severed or slapd is shut down
Consumer Log:
Consumer has connected with provider and is waiting for changes:
01: Nov 17 11:56:33 C1 slapd[4031]: do_syncrep2: rid=001 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
Provider has notified (“sent sync cookie”) to consumer with the new contextCSN:
02: Nov 17 11:56:45 C1 slapd[4031]: do_syncrep2: rid=001 cookie=rid=001,csn=20161117195645.269065Z#000000#000#000000
Syncrepl has a message to modify the cn=coo,dc=example,dc=com entry:
03: Nov 17 11:56:45 C1 slapd[4031]: syncrepl_message_to_entry: rid=001 DN: cn=foo,dc=example,dc=com, UUID: f9c248aa-3a99-1036-9792-57a50c7ece6f
Syncrepl searches for the entry to be modified:
04: Nov 17 11:56:45 C1 slapd[4031]: syncrepl_entry: rid=001 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY)
Entry found (“(0)” = Success):
05: Nov 17 11:56:45 C1 slapd[4031]: syncrepl_entry: rid=001 be_search (0) 06: Nov 17 11:56:45 C1 slapd[4031]: syncrepl_entry: rid=001 cn=foo,dc=example,dc=com
Slapd queues and commits the new CSN on the entry:
07: Nov 17 11:56:45 C1 slapd[4031]: slap_queue_csn: queueing 0x7f51f4109690 20161117195645.269065Z#000000#000#000000 08: Nov 17 11:56:45 C1 slapd[4031]: slap_graduate_commit_csn: removing 0x7f51f4109690 20161117195645.269065Z#000000#000#000000
Syncrepl successfully performs entry modification:
09: Nov 17 11:56:45 C1 slapd[4031]: syncrepl_entry: rid=001 be_modify cn=foo,dc=example,dc=com (0)
Slapd queues and commits the new contextCSN for the replica:
10: Nov 17 11:56:45 C1 slapd[4031]: slap_queue_csn: queueing 0x7f51f410cd80 20161117195645.269065Z#000000#000#000000
11: Nov 17 11:56:45 C1 slapd[4031]: slap_graduate_commit_csn: removing 0x7f51f410cd80 20161117195645.269065Z#000000#000#000000
Post-update contextCSN and entryCSN:
- On P1
dn: dc=example,dc=com
contextCSN: 20161117195645.269065Z#000000#000#000000
dn: cn=foo,dc=example,dc=com
entryCSN: 20161117195645.269065Z#000000#000#000000
- on C1
dn: dc=example,dc=com
contextCSN: 20161117195645.269065Z#000000#000#000000
dn: cn=foo,dc=example,dc=com
entryCSN: 20161117195645.269065Z#000000#000#000000
Following a Replication Session: Multi-Master
IN PROGRESS
Demo Server Environment
- MMR environment using two providers and two consumers
- Each provider replicates from the opposite provider
- Each consumer replicates from both providers
Name | Type | serverID | RID | Provider |
---|---|---|---|---|
p1 | Provider | 1 | 111 | ldap://p2/ |
p2 | Provider | 2 | 121 | ldap://p1/ |
c1 | Consumer | N/A | 211 | ldap://p1/ |
” | ” | ” | 212 | ldap://p2/ |
c2 | Consumer | N/A | 221 | ldap://p1/ |
” | ” | ” | 222 | ldap://p2/ |
In this demonstration, ldapmodify
applies the following
update:
dn: cn=te1,dc=example,dc=com"
changetype: modify
replace: sn
sn: Flenderson
Pre-update Replication State: All contextCSNs for the dc=example,dc=com and cn=accesslog backends are synchronized. The “0s” values indicate the time difference between CSNs.
ContextCSN values for: dc=example,dc=com
| RID| p1:389| p2:10389| c1:20389| c2:30389|
| 000| 20161122083007| 20161122083007| 20161122083007| 20161122083007|
| 001| 20161122104905| 0s| 0s| 0s|
| 002| 20161122105116| 0s| 0s| 0s|
ContextCSN values for: cn=accesslog
| RID| p1:389| p2:10389|
| 000| | 20161122083007|
| 001| 20161122104905| 0s|
| 002| 20161122105116| 0s|
Provider 1:
# Client binds to provider, makes MOD operation to change the sn attribute of "cn=te1,dc=example,dc=com".
00: Nov 22 03:00:08 p1 slapd[342]: conn=1005 fd=10 ACCEPT from IP=127.0.0.1:54514 (IP=0.0.0.0:389)
01: Nov 22 03:00:08 p1 slapd[342]: conn=1005 op=1 MOD dn="cn=te1,dc=example,dc=com"
02: Nov 22 03:00:08 p1 slapd[342]: conn=1005 op=1 MOD attr=sn
# Provider queues new CSNs for the entry and context
03: Nov 22 03:00:08 p1 slapd[342]: slap_queue_csn: queueing 0x7fed6810f650 20161122110008.590508Z#000000#001#000000
04: Nov 22 03:00:08 p1 slapd[342]: slap_queue_csn: queueing 0x7fed6810e4f0 20161122110008.590508Z#000000#001#000000
# MOD is successful, new CSN is applied to the entry and context
05: Nov 22 03:00:08 p1 slapd[342]: slap_graduate_commit_csn: removing 0x7fed6810e4f0 20161122110008.590508Z#000000#001#000000
07: Nov 22 03:00:08 p1 slapd[342]: slap_graduate_commit_csn: removing 0x7fed6810f650 20161122110008.590508Z#000000#001#000000
# Provider returns LDAP_SUCCESS message to client
06: Nov 22 03:00:08 p1 slapd[342]: conn=1005 op=1 RESULT tag=103 err=0 duration=0.805ms text=
# Syncprov sends sync cookie to p2 (serverID=2). Cookie is for replica ID 121 (on p2) from serverID=1 with new contextCSN
08: Nov 22 03:00:08 p1 slapd[342]: syncprov_sendresp: to=002, cookie=rid=121,sid=001,csn=20161122110008.590508Z#000000#001#000000
# Syncprov sends sync cookies to consumers c1 (rid=211) and c2 (rid=221) with new contextCSN
09: Nov 22 03:00:08 p1 slapd[342]: syncprov_sendresp: cookie=rid=211,sid=001,csn=20161122110008.590508Z#000000#001#000000
10: Nov 22 03:00:08 p1 slapd[342]: syncprov_sendresp: cookie=rid=221,sid=001,csn=20161122110008.590508Z#000000#001#000000
# Client unbinds from provider
11: Nov 22 03:00:10 p1 slapd[342]: conn=1005 op=2 UNBIND
<sync complete on provider>
Provider 2:
# Provider p2 receives the sync cookie from p1 with new contextCSN
01: Nov 22 03:00:08 p2 slapd[442]: do_syncrep2: rid=121 cookie=rid=121,sid=001,csn=20161122110008.590508Z#000000#001#000000
# Provider p2 queues the new CSN for the entry and context
02: Nov 22 03:00:08 p2 slapd[442]: slap_queue_csn: queueing 0x7f03e0101950 20161122110008.590508Z#000000#001#000000
03: Nov 22 03:00:08 p2 slapd[442]: slap_queue_csn: queueing 0x7f03e0105bd0 20161122110008.590508Z#000000#001#000000
# Syncprov skips sending updates to p1 (serverID=1), as it is the source of the MOD op
04: Nov 22 03:00:08 p2 slapd[442]: syncprov_matchops: skipping original sid 001
# Update from p1 is good, p2 commits the new CSN to the entry and context
05: Nov 22 03:00:08 p2 slapd[442]: slap_graduate_commit_csn: removing 0x7f03e0105bd0 20161122110008.590508Z#000000#001#000000
06: Nov 22 03:00:08 p2 slapd[442]: slap_graduate_commit_csn: removing 0x7f03e0101950 20161122110008.590508Z#000000#001#000000
# Indicates a successful MOD operation on the entry
07: Nov 22 03:00:08 p2 slapd[442]: syncrepl_message_to_op: rid=121 be_modify cn=te1,dc=example,dc=com (0)
08: Nov 22 03:00:08 p2 slapd[442]: slap_queue_csn: queueing 0x7f03e01019a0 20161122110008.590508Z#000000#001#000000
09: Nov 22 03:00:08 p2 slapd[442]: slap_queue_csn: queueing 0x7f03e010af80 20161122110008.590508Z#000000#001#000000
10: Nov 22 03:00:08 p2 slapd[442]: slap_graduate_commit_csn: removing 0x7f03e010af80 20161122110008.590508Z#000000#001#000000
11: Nov 22 03:00:08 p2 slapd[442]: slap_graduate_commit_csn: removing 0x7f03e01019a0 20161122110008.590508Z#000000#001#000000
# Syncprov sends sync cookies to c1 (rid=212) and c2 (rid=222)
12: Nov 22 03:00:08 p2 slapd[442]: syncprov_sendresp: cookie=rid=212,sid=002,csn=20161122110008.590508Z#000000#001#000000
13: Nov 22 03:00:08 p2 slapd[442]: syncprov_sendresp: cookie=rid=222,sid=002,csn=20161122110008.590508Z#000000#001#000000
<sync complete on provider>
Consumer 1:
# Syncrepl consumer rid=211 gets sync cookie for MOD from p1 (serverID=1) and queues CSN
01: Nov 22 03:00:08 c1 slapd[482]: do_syncrep2: rid=211 cookie=rid=211,sid=001,csn=20161122110008.590508Z#000000#001#000000
02: Nov 22 03:00:08 c1 slapd[482]: slap_queue_csn: queueing 0x7f68dc10cc40 20161122110008.590508Z#000000#001#000000
# Syncrepl consumer rid=212 gets sync cookie for MOD from p2 (serverID=2)
03: Nov 22 03:00:08 c1 slapd[482]: do_syncrep2: rid=212 cookie=rid=212,sid=002,csn=20161122110008.590508Z#000000#001#000000
# Update from p1 is written to c1 and queued CSN is committed
04: Nov 22 03:00:08 c1 slapd[482]: slap_graduate_commit_csn: removing 0x7f68dc10cc40 20161122110008.590508Z#000000#001#000000
05: Nov 22 03:00:08 c1 slapd[482]: syncrepl_message_to_op: rid=211 be_modify cn=te1,dc=example,dc=com (0)
06: Nov 22 03:00:08 c1 slapd[482]: slap_queue_csn: queueing 0x7f68dc10cee0 20161122110008.590508Z#000000#001#000000
07: Nov 22 03:00:08 c1 slapd[482]: slap_graduate_commit_csn: removing 0x7f68dc10cee0 20161122110008.590508Z#000000#001#000000
# Syncrepl consumer ignores update from rid=212, as it has already been saved via rid=211
08: Nov 22 03:00:08 c1 slapd[482]: do_syncrep2: rid=212 CSN pending, ignoring 20161122110008.590508Z#000000#001#000000 (reqStart=20161122110008.000001Z,cn=accesslog)
<sync complete on consumer>
Consumer 2:
# Syncrepl consumer rid=222 gets sync cookie for MOD from p2 (serverID=2), queues CSN
01: Nov 22 03:00:08 c2 slapd[593]: do_syncrep2: rid=222 cookie=rid=222,sid=002,csn=20161122110008.590508Z#000000#001#000000
02: Nov 22 03:00:08 c2 slapd[593]: slap_queue_csn: queueing 0x7fba68109960 20161122110008.590508Z#000000#001#000000
# Entry is updated, CSN is committed and success message is applied
03: Nov 22 03:00:08 c2 slapd[593]: slap_graduate_commit_csn: removing 0x7fba68109960 20161122110008.590508Z#000000#001#000000
04: Nov 22 03:00:08 c2 slapd[593]: syncrepl_message_to_op: rid=222 be_modify cn=te1,dc=example,dc=com (0)
05: Nov 22 03:00:08 c2 slapd[593]: slap_queue_csn: queueing 0x7fba6810ca50 20161122110008.590508Z#000000#001#000000
06: Nov 22 03:00:08 c2 slapd[593]: slap_graduate_commit_csn: removing 0x7fba6810ca50 20161122110008.590508Z#000000#001#000000
# Update form rid=221 is ignored because the CSN is less than or equal to the current CSN
07: Nov 22 03:00:08 c2 slapd[593]: do_syncrep2: rid=221 cookie=rid=221,sid=001,csn=20161122110008.590508Z#000000#001#000000
08: Nov 22 03:00:08 c2 slapd[593]: do_syncrep2: rid=221 CSN too old, ignoring 20161122110008.590508Z#000000#001#000000 (reqStart=20161122110008.000000Z,cn=accesslog)
<sync complete on consumer>
Post-Update Sync State:
ContextCSN values for: dc=example,dc=com
| RID| p1:389| p2:10389| c1:20389| c2:30389|
| 000| 20161122083007| 20161122083007| 20161122083007| 20161122083007|
| 001| 20161122110008| 0s| 0s| 0s|
| 002| 20161122105116| 0s| 0s| 0s|
ContextCSN values for: cn=accesslog
| RID| p1:389| p2:10389|
| 000| | 20161122083007|
| 001| 20161122110008| 0s|
| 002| 20161122105116| 0s|