Environment
Novell Open Enterprise Server 11 (OES 11) Linux Support Pack 1
January 2014 Scheduled Maintenance update
Situation
During regular day-to-day operation, the server was observed to crash intermittently in NDSD.
The crashes could not be related to a specific series of actions or events, and appeared to occur at random times during the day.
A number of these crashes appeared in /var/log/messages as shown below:
ndsd[10607]: segfault at 58 ip 00007f312b5bab81 sp 00007f3119998bc0 error 4 in libncpengine.so.0.0.0[7f312b54d000+109000]
ndsd[29717]: segfault at 58 ip 00007f5eb3d03b81 sp 00007f5e9cc31bc0 error 4 in libncpengine.so.0.0.0[7f5eb3c96000+109000]
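The log lines above can be reduced to an offset inside the library by subtracting the mapping base (the first number in the square brackets) from the instruction pointer. The following is a minimal sketch of that arithmetic, not part of any Novell tooling; the names SegfaultInfo and parse_segfault are hypothetical, and the parser assumes the kernel message layout seen in the two lines above.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical container for the fields of one kernel "segfault at" line.
struct SegfaultInfo {
    bool ok = false;               // true if the line was parsed
    std::uint64_t fault_addr = 0;  // address whose access faulted
    std::uint64_t ip = 0;          // instruction pointer at the fault
    std::uint64_t base = 0;        // load address of the mapped library
    std::uint64_t offset = 0;      // ip - base: offset inside the library
};

// Hypothetical helper: parses a line such as
//   "ndsd[PID]: segfault at ADDR ip IP sp SP error N in LIB[BASE+SIZE]"
SegfaultInfo parse_segfault(const std::string& line) {
    SegfaultInfo info;
    unsigned long long addr = 0, ip = 0, base = 0;

    std::size_t head = line.find("segfault at ");
    std::size_t bracket = line.rfind('[');  // last '[' starts "[BASE+SIZE]"
    if (head == std::string::npos || bracket == std::string::npos) {
        return info;
    }
    if (std::sscanf(line.c_str() + head, "segfault at %llx ip %llx",
                    &addr, &ip) != 2) {
        return info;
    }
    if (std::sscanf(line.c_str() + bracket, "[%llx+", &base) != 1) {
        return info;
    }
    if (ip < base) {
        return info;  // instruction pointer is not inside this mapping
    }

    info.ok = true;
    info.fault_addr = addr;
    info.ip = ip;
    info.base = base;
    info.offset = ip - base;
    return info;
}
```

For both log lines above this yields an offset of 0x6db81 into libncpengine.so.0.0.0. The faulting address of 0x58 is what one would typically see when a field at a small offset is read through a NULL pointer, which fits the NULL "ss->receiveBuffer" described under Cause, although the log line alone does not prove it.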
Analysis of the core showed that the crash occurred in the function "INCP::ServiceStreamGroupConnections(StreamGroupStruct*) ()".
Multiple core files were analyzed, and the crashes turned out to occur at a few different code offsets within the same function.
Back traces of two of the cores:
#bt
#0 0x00007f5eb3d03b81 in INCP::ServiceStreamGroupConnections(StreamGroupStruct*) ()
from /opt/novell/eDirectory/lib64/nds-modules/libncpengine.so
#1 0x00007f5eb3d042ba in NCPPollerThread(StreamGroupStruct*) () from
/opt/novell/eDirectory/lib64/nds-modules/libncpengine.so
#2 0x000000000041737c in ?? ()
#3 0x00007f5eb6e6b7f6 in sigcancel_handler () from /lib64/libpthread.so.0
#4 0x0000000000000000 in ?? ()
#
#bt
#0 0x00007ff7f9a52e71 in INCP::ServiceStreamGroupConnections(StreamGroupStruct*) ()
from /opt/novell/eDirectory/lib64/nds-modules/libncpengine.so
#1 0x00007ff7f9a535aa in NCPPollerThread(StreamGroupStruct*) () from
/opt/novell/eDirectory/lib64/nds-modules/libncpengine.so
#2 0x000000000041737c in ?? ()
#3 0x00007ff7fcab87f6 in start_thread () from /lib64/libpthread.so.0
#4 0x00007ff7fc07af8d in clone () from /lib64/libc.so.6
#5 0x0000000000000000 in ?? ()
#
Resolution
Locking around the global variable has been improved so that the variable cannot accidentally be cleared while it is in use by multiple threads.
Cause
A global variable named "ss->receiveBuffer", which is used per connection, unexpectedly became NULL, and this caused NDSD to crash.
In some locations in the code this variable was properly protected by a lock, but a few other locations were found where it was not. One of these unprotected locations was on the code path that was hit when the system crashed.
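The pattern described above can be illustrated with a minimal sketch. This is not the NCP engine source: StreamState, attach_buffer, release_buffer and service_connection are hypothetical names, and only the member name receiveBuffer is taken from the description above. The point is that every access to the buffer, including the NULL check and the code that clears it, takes the same lock, so one thread cannot clear the pointer while another is between its check and its use.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical stand-in for per-connection stream state.
struct StreamState {
    std::mutex lock;                                   // guards receiveBuffer
    std::unique_ptr<std::vector<char>> receiveBuffer;  // null when detached
};

// Allocates the buffer under the lock.
void attach_buffer(StreamState& ss, std::size_t size) {
    std::lock_guard<std::mutex> guard(ss.lock);
    ss.receiveBuffer = std::make_unique<std::vector<char>>(size);
}

// Clears the buffer under the same lock, so it cannot disappear
// while another thread is using it inside service_connection().
void release_buffer(StreamState& ss) {
    std::lock_guard<std::mutex> guard(ss.lock);
    ss.receiveBuffer.reset();
}

// Services one connection. The null check and the use of the buffer
// happen inside a single critical section; returns false if there is
// no usable buffer instead of dereferencing a null pointer.
bool service_connection(StreamState& ss) {
    std::lock_guard<std::mutex> guard(ss.lock);
    if (!ss.receiveBuffer || ss.receiveBuffer->empty()) {
        return false;
    }
    (*ss.receiveBuffer)[0] = 0;  // placeholder for real receive work
    return true;
}
```

If service_connection() read receiveBuffer without taking the lock, a concurrent release_buffer() could clear the pointer after the check but before the use, which is the kind of race described in this section.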