Environment
eDirectory 8.8.X
Situation
Several servers in the replica ring (8)
Many partitions (16)
Mixed eDirectory versions - 2 eDir 8.8.7.1, 6 eDir 8.8.8.4
All other partitions in the tree are synchronizing without any problem or error.
A single partition reports the errors:
ndstrace log segment with the TAGS, TIME, SKLK, CBUF, LOCK, VCLN and SYDL tags enabled:
3425015552 CBUF: [2019/10/01 16:45:17.592] DEBUG: Client Reply - Context: 202c0010, DSAUpdateReplica, Size:0, failed, transport failure (-625)
3425015552 SKLK: [2019/10/01 16:45:17.592] DEBUG: DCRequest failed, transport failure (-625).
3425015552 LOCK: [2019/10/01 16:45:17.592] DEBUG: Exclusive Lock Obtained(autolock=true:
3425015552 LOCK: [2019/10/01 16:45:17.592] DEBUG: 1 [_SkulkerWorkerProc (Outbound Replication)]
3425015552 SKLK: [2019/10/01 16:45:17.593] DEBUG: Sending [00b4acc4] <.AB-12-34-56-78-99.foo.D.TREE.> from Sync Point 2 failed, transport failure (-625)
3425015552 SKLK: [2019/10/01 16:45:17.593] DEBUG: Send Partition Updates completed in Seconds 773, in MilliSeconds 39 - Total objects 51 Total Changes 1244, 3 Packet(s) Sent
3425015552 SKLK: [2019/10/01 16:45:17.593] DEBUG: Sync - objects: 51, total changes: 1244, sent to server <.ServerA.Servers.TREE.> for .foo.D.TREE..
3425015552 SKLK: [2019/10/01 16:45:17.593] DEBUG: Sync - Process: Send updates to <.ServerA.Servers.TREE.> for .foo.D.TREE. failed, transport failure (-625).
3425015552 CBUF: [2019/10/01 16:45:17.593] DEBUG: Client Request - Context: 202c0010, DSAEndUpdateReplica, Seqment: 0, Size:a8
3425015552 CBUF: [2019/10/01 16:45:17.593] DEBUG: Client Reply - Context: 202c0010, DSAEndUpdateReplica, Size:0, failed, transport failure (-625)
3425015552 VCLN: [2019/10/01 16:45:17.593] DEBUG: request DSAEndUpdateReplica by context 202c0010 ,cFlags=00000387 , scflags=00000000 failed, transport failure (-625)
3425015552 LOCK: [2019/10/01 16:45:17.593] DEBUG: Exclusive Lock Obtained(autolock=true:
3425015552 LOCK: [2019/10/01 16:45:17.593] DEBUG: 1 [_SkulkerWorkerProc (Outbound Replication)]
3425015552 SKLK: [2019/10/01 16:45:17.593] DEBUG: Sync - Process: End Update failed, transport failure (-625).
3425015552 SKLK: [2019/10/01 16:45:17.593] DEBUG: Sync - Start outbound sync with (#=3, state=0, type=1 partition .foo.D.TREE.) .ServerA.Servers.TREE..
The change cache for the partition contains over 200,000 objects.
The SYDL tag shows about 15 to 20 minutes' worth of objects reported as not in window.
The OBJP tag shows the same objects from SYDL with the message (already sent).
Neither the partition nor the tree has any objects with high value counts.
The -625 error coincides with SYDL showing the last objects in the change cache.
The servers reporting -625 errors change depending on which server is the sending server.
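Because the failing target changes with the sending server, it can help to summarize on each sender which receivers it failed to update. The following is a minimal sketch, not a supported tool: the function name count_625_by_server is hypothetical, and it assumes the trace was captured with the SKLK tag so that it contains "Send updates to <server> ... failed, transport failure (-625)" lines like the one in the log segment above.

```shell
# Count "Send updates to <server> ... failed, transport failure (-625)"
# lines per target server, most frequent first.
# $1 = path to a saved ndstrace log (hypothetical helper, not part of eDirectory)
count_625_by_server() {
    grep -F 'transport failure (-625)' "$1" \
        | sed -n 's/.*Send updates to <\([^>]*\)>.*/\1/p' \
        | sort | uniq -c | sort -rn
}

# Usage (the log path is an assumption; adjust it to where ndstrace writes on your server):
#   count_625_by_server /var/opt/novell/eDirectory/log/ndstrace.log
```

Each output line is a count followed by the distinguished name of the receiving server. Running it on every server in the ring shows whether the failures follow a particular receiver or move with the sender.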
Resolution
Increase the NCP client timeout on all the servers in the replica ring.
For init.d OSs (RH 6.X / SLES 11.X):
Add the following to /etc/opt/novell/eDirectory/sbin/pre_ndsd_start and restart ndsd:
NCPCLIENT_REQ_TIMEOUT=1800
export NCPCLIENT_REQ_TIMEOUT
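Since the change has to be made on every server in the replica ring, a small helper that appends the two lines only when they are missing avoids duplicate entries. This is a sketch under assumptions: the function name add_ncp_timeout is hypothetical, it keeps a .bak copy of the file, and it does not restart ndsd for you.

```shell
# Append the NCP client timeout to a startup script unless it is already set.
# $1 = script to edit, $2 = timeout value (hypothetical helper)
add_ncp_timeout() {
    file=$1
    value=$2
    if grep -q '^NCPCLIENT_REQ_TIMEOUT=' "$file"; then
        echo "NCPCLIENT_REQ_TIMEOUT already set in $file; file not modified"
        return 0
    fi
    cp -p "$file" "$file.bak"              # keep a backup of the original
    if [ -n "$(tail -c1 "$file")" ]; then  # last line lacks a trailing newline
        echo >> "$file"
    fi
    printf 'NCPCLIENT_REQ_TIMEOUT=%s\nexport NCPCLIENT_REQ_TIMEOUT\n' "$value" >> "$file"
}

# Usage, run as root, followed by a restart of ndsd:
#   add_ncp_timeout /etc/opt/novell/eDirectory/sbin/pre_ndsd_start 1800
```

Running the helper a second time leaves the file unchanged, so it is safe to repeat across the ring.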
For systemd OSs (RH 7.X / SLES 12.X):
Add the following to /etc/opt/novell/eDirectory/conf/env and restart ndsd:
NCPCLIENT_REQ_TIMEOUT=1800
NOTE: This is a high timeout value. If it is still not long enough for the communication to complete, the change only postpones the -625 error rather than eliminating it.
Additionally, patching the servers to eDir 8.8.8.11 (the current version at the time) brought in synchronization performance improvements that also helped the servers complete their communication and resolve the problem.
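After ndsd has been restarted, one way to confirm that the running daemon actually picked up the variable is to read its environment from /proc. This is a sketch only: the function name show_ncp_timeout is hypothetical, and the usage line assumes a single ndsd instance and that the pidof utility is available.

```shell
# Print the NCPCLIENT_REQ_TIMEOUT entry from a process environment file.
# $1 = path to an environ file such as /proc/<pid>/environ (NUL-separated entries)
show_ncp_timeout() {
    tr '\0' '\n' < "$1" | grep '^NCPCLIENT_REQ_TIMEOUT='
}

# Usage on a live server, run as root:
#   show_ncp_timeout "/proc/$(pidof ndsd)/environ"
```

If nothing is printed, the daemon was started without the variable, which usually means the file was edited but ndsd was not restarted, or the setting went into the file for the other init system.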
Cause
The connection between the sending server and the receiving server is closing or timing out before objects are sent over the connection.