Error: -625 only in one partition

  • 7024163
  • 02-Oct-2019
  • 02-Oct-2019

Environment

eDirectory 8.8.X

Situation

Several servers in the replica ring (8)
Many partitions (16)
Mixed eDirectory versions - 2 eDir 8.8.7.1, 6 eDir 8.8.8.4

All other partitions in the tree are synchronizing without any problem or error.
A single partition reports the errors:

ndstrace log segment captured with the TAGS, TIME, SKLK, CBUF, LOCK, VCLN, and SYDL flags enabled:

3425015552 CBUF: [2019/10/01 16:45:17.592] DEBUG: Client Reply - Context: 202c0010, DSAUpdateReplica, Size:0, failed, transport failure (-625)
3425015552 SKLK: [2019/10/01 16:45:17.592] DEBUG: DCRequest failed, transport failure (-625).
3425015552 LOCK: [2019/10/01 16:45:17.592] DEBUG: Exclusive Lock Obtained(autolock=true:
3425015552 LOCK: [2019/10/01 16:45:17.592] DEBUG: 1 [_SkulkerWorkerProc (Outbound Replication)]
3425015552 SKLK: [2019/10/01 16:45:17.593] DEBUG: Sending [00b4acc4] <.AB-12-34-56-78-99.foo.D.TREE.> from Sync Point 2 failed, transport failure (-625)
3425015552 SKLK: [2019/10/01 16:45:17.593] DEBUG: Send Partition Updates completed in Seconds 773, in MilliSeconds 39 - Total objects 51 Total Changes 1244, 3 Packet(s) Sent 
3425015552 SKLK: [2019/10/01 16:45:17.593] DEBUG: Sync - objects: 51, total changes: 1244, sent to server <.ServerA.Servers.TREE.> for .foo.D.TREE..
3425015552 SKLK: [2019/10/01 16:45:17.593] DEBUG: Sync - Process: Send updates to <.ServerA.Servers.TREE.> for .foo.D.TREE. failed, transport failure (-625).
3425015552 CBUF: [2019/10/01 16:45:17.593] DEBUG: Client Request - Context: 202c0010, DSAEndUpdateReplica, Seqment: 0, Size:a8
3425015552 CBUF: [2019/10/01 16:45:17.593] DEBUG: Client Reply - Context: 202c0010, DSAEndUpdateReplica, Size:0, failed, transport failure (-625)
3425015552 VCLN: [2019/10/01 16:45:17.593] DEBUG: request DSAEndUpdateReplica by context 202c0010 ,cFlags=00000387 , scflags=00000000 failed, transport failure (-625)
3425015552 LOCK: [2019/10/01 16:45:17.593] DEBUG: Exclusive Lock Obtained(autolock=true:
3425015552 LOCK: [2019/10/01 16:45:17.593] DEBUG: 1 [_SkulkerWorkerProc (Outbound Replication)]
3425015552 SKLK: [2019/10/01 16:45:17.593] DEBUG: Sync - Process: End Update failed, transport failure (-625).
3425015552 SKLK: [2019/10/01 16:45:17.593] DEBUG: Sync - Start outbound sync with (#=3, state=0, type=1 partition .foo.D.TREE.) .ServerA.Servers.TREE..
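
The "Send Partition Updates completed in Seconds ..." line in the trace above reports how long the outbound send ran (773 seconds in this excerpt). The following is a minimal sketch for pulling the largest such value out of a saved trace, which may help when judging how large a timeout is needed. The function name max_send_seconds and the log path in the example comment are assumptions for illustration, not part of eDirectory.

```shell
# max_send_seconds: hypothetical helper, not an eDirectory tool.
# Prints the largest "Seconds N" value found on SKLK
# "Send Partition Updates completed" lines in the given trace file.
max_send_seconds() {
    sed -n 's/.*Send Partition Updates completed in Seconds \([0-9][0-9]*\),.*/\1/p' "$1" |
        sort -n | tail -n 1
}

# Example (the log path is an assumption; use wherever the trace was saved):
# max_send_seconds /var/opt/novell/eDirectory/log/ndstrace.log
```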

Change cache for the partition contains over 200,000 objects.  

The SYDL tag shows about 15 to 20 minutes' worth of objects reported as not in window.
The OBJP tag shows the same objects from SYDL with the message (already sent).

Neither the partition nor the rest of the tree has objects with high value counts.

The -625 error coincides with SYDL showing the last objects in the change cache.

The servers reporting -625 errors change depending on which server is the sending server.
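
To see which receiving servers fail from a given sender, the failed "Send updates to <server>" lines can be tallied per target. This is a rough sketch that assumes the SKLK line format shown in the trace above; count_625_by_server is a hypothetical helper name, and the trace file is whatever file the ndstrace output was saved to.

```shell
# count_625_by_server: hypothetical helper, not an eDirectory tool.
# Counts SKLK "Send updates to <server> ... failed, transport failure (-625)"
# lines per target server, most frequent first.
count_625_by_server() {
    grep 'Send updates to <' "$1" |
        grep -F -- 'transport failure (-625)' |
        sed -e 's/.*Send updates to <\([^>]*\)>.*/\1/' |
        sort | uniq -c | sort -rn
}
```

Running it against the trace from each sending server in turn shows whether the failing targets really do move around with the sender.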

Resolution

Increase the NCP client timeout on every server in the replica ring.

For init.d-based OSs (RH 6.X / SLES 11.X):
Add the following to /etc/opt/novell/eDirectory/sbin/pre_ndsd_start and restart ndsd:
NCPCLIENT_REQ_TIMEOUT=1800
export NCPCLIENT_REQ_TIMEOUT
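
Because the change has to be made on every server in the ring, it may help to script it. The sketch below appends the two lines only if the variable is not already present and keeps a backup copy first. add_ncp_timeout is a hypothetical helper name, and the restart command in the usage line assumes the standard ndsd init script.

```shell
# add_ncp_timeout: hypothetical helper, not an eDirectory tool.
# Appends the timeout assignment and export to a pre_ndsd_start-style
# script unless the variable is already assigned somewhere in the file.
add_ncp_timeout() {
    file=$1
    value=${2:-1800}
    if grep -q 'NCPCLIENT_REQ_TIMEOUT=' "$file"; then
        echo "NCPCLIENT_REQ_TIMEOUT already set in $file; edit it by hand" >&2
        return 0
    fi
    cp -p "$file" "$file.bak" || return 1   # keep a backup of the original
    printf '\nNCPCLIENT_REQ_TIMEOUT=%s\nexport NCPCLIENT_REQ_TIMEOUT\n' "$value" >> "$file"
}
```

Usage, as root: add_ncp_timeout /etc/opt/novell/eDirectory/sbin/pre_ndsd_start && /etc/init.d/ndsd restart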

For systemd-based OSs (RH 7.X / SLES 12.X):
Add the following to /etc/opt/novell/eDirectory/conf/env and restart ndsd:

NCPCLIENT_REQ_TIMEOUT=1800
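
If the env file might already carry an older value for the variable, replacing the entry is safer than appending a second one. The following sketch does a replace-or-append on a NAME=VALUE file. set_env_entry is a hypothetical helper name, and the usage line assumes the service unit is named ndsd.

```shell
# set_env_entry: hypothetical helper, not an eDirectory tool.
# Ensures FILE contains exactly one NAME=VALUE line for NAME,
# replacing any existing assignment and leaving other lines untouched.
set_env_entry() {
    file=$1
    name=$2
    value=$3
    [ -f "$file" ] || return 1
    tmp=$(mktemp) || return 1
    # Copy everything except existing assignments of NAME (grep exits
    # non-zero when nothing is left to copy, which is fine here).
    grep -v "^${name}=" "$file" > "$tmp" || true
    printf '%s=%s\n' "$name" "$value" >> "$tmp"
    # Write back in place so the file keeps its owner and permissions.
    cat "$tmp" > "$file"
    rc=$?
    rm -f "$tmp"
    return $rc
}
```

Usage, as root: set_env_entry /etc/opt/novell/eDirectory/conf/env NCPCLIENT_REQ_TIMEOUT 1800 && systemctl restart ndsd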

NOTE: This is a high timeout value. If it is still not long enough for the communication to complete, the setting only postpones the -625 timeout rather than preventing it.
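
Before judging whether the higher timeout helps, it is worth confirming that the restarted ndsd process actually picked up the variable. On Linux this can be read from the process environment under /proc. The sketch below is generic, not eDirectory-specific; proc_env_value is a hypothetical helper name, and reading another process's environment normally requires root.

```shell
# proc_env_value: hypothetical helper, not an eDirectory tool.
# Prints the value of environment variable NAME as seen by process PID,
# read from the NUL-separated /proc/PID/environ (Linux only).
proc_env_value() {
    tr '\0' '\n' < "/proc/$1/environ" | sed -n "s/^$2=//p"
}

# Example, assuming a single ndsd process is running:
# proc_env_value "$(pidof ndsd)" NCPCLIENT_REQ_TIMEOUT
```

An empty result suggests the setting was not in effect when ndsd started, in which case the file edited and the restart should be rechecked.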

Additionally, patching the servers to eDir 8.8.8.11, the current version at the time of writing, brought in synchronization performance improvements that also improved communication between the servers and helped resolve the problem.

Cause

The connection between the sending server and the receiving server is closing or timing out before objects are sent over the connection.