Intermittent -625 errors that get cleared when restarting ndsd

  • 7018286
  • 17-Nov-2016
  • 17-Nov-2016

Environment

NetIQ eDirectory 8.8 SP8 running on RHEL 6.6

Situation

These are some symptoms of this particular problem:
 - Replica synchronization reports -625 errors for some servers, even though those servers are reachable.
 - In iMonitor -> Agent Activity, you can see that a thread takes the Write lock and never releases it.
 - Requests to the affected server may work or may get stuck.
 - The affected server becomes unresponsive.
 - After restarting ndsd, the problem is resolved, at least for some time (a few hours or a few days).
 - If you run the gstack utility to list the running threads, the lock is released and the server goes back to normal.

Resolution

The error -625 indicates that a server failed to respond in a timely manner. Other conditions can also cause this error, such as high utilization or a server trying to write a very large number of attributes for a particular object. In those scenarios, though, the issue reappears soon after the ndsd process is restarted.

This particular problem is caused by a bug in the Linux kernel that mostly affects Red Hat Enterprise Linux 6.6, 7.0 and 7.1 (running kernel versions 2.6.32-504 up to and including 2.6.32-504.12.2), in particular servers on version 

For more information from Red Hat:
https://access.redhat.com/solutions/1386323

To avoid this issue, make sure that the latest patches are applied to your Red Hat Linux server and that the kernel version is later than the versions mentioned above.
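
As a quick way to check a server against the kernel range quoted above, the following is a minimal Python sketch. It only compares the numeric part of a kernel release string (as printed by uname -r) with the range 2.6.32-504 through 2.6.32-504.12.2 given in this document. The function names parse_release and in_affected_range are hypothetical, and the check does not cover any other kernel series; refer to the Red Hat article for the authoritative list.

```python
import platform
import re

# Range quoted in this document: 2.6.32-504 up to and including 2.6.32-504.12.2
AFFECTED_MIN = (2, 6, 32, 504)
AFFECTED_MAX = (2, 6, 32, 504, 12, 2)


def parse_release(release):
    """Turn e.g. '2.6.32-504.12.2.el6.x86_64' into (2, 6, 32, 504, 12, 2).

    Hypothetical helper: only the leading numeric version and the numeric
    part of the build after the first '-' are kept.
    """
    m = re.match(r"(\d+(?:\.\d+)*)(?:-(\d+(?:\.\d+)*))?", release)
    if not m:
        raise ValueError("unrecognized kernel release: %r" % release)
    version = tuple(int(x) for x in m.group(1).split("."))
    build = tuple(int(x) for x in m.group(2).split(".")) if m.group(2) else ()
    return version + build


def in_affected_range(release):
    """Return True if the release falls inside the range quoted above."""
    return AFFECTED_MIN <= parse_release(release) <= AFFECTED_MAX


if __name__ == "__main__":
    running = platform.release()  # same string as `uname -r` on Linux
    try:
        if in_affected_range(running):
            print("%s is inside the affected range; update the kernel." % running)
        else:
            print("%s is outside the range quoted in this document." % running)
    except ValueError as exc:
        print(exc)
```

Tuple comparison is used so that a release such as 2.6.32-504.16.2 is correctly treated as later than 2.6.32-504.12.2, which a plain string comparison would not guarantee.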