CORE: eDirectory does not create good/valid cores

  • 7004715
  • 21-Oct-2009
  • 27-Apr-2012

Environment

Novell eDirectory 8.7.3 for Linux
Novell eDirectory 8.8 for Linux
Novell Open Enterprise Server 1 (OES 1) Linux
Novell Open Enterprise Server 2 (OES 2) Linux
SUSE Linux Enterprise Server 9
SUSE Linux Enterprise Server 10

Situation

Whenever eDirectory cores, the core file created is invalid or corrupted, which makes troubleshooting based on the core file impossible.
 
Here are some examples of what a stack taken from cores in this situation looks like:
[Stack #1]
Failed to read a valid object file image from memory.

Core was generated by `/opt/novell/eDirectory/sbin/ndsd'.
#0 0xffffe410 in ?? ()
(gdb) where
#0 0xffffe410 in ?? ()
#1 0xb7b7d41c in ?? ()
#2 0x00000033 in ?? ()
#3 0x00000000 in ?? ()
(gdb)
[Stack #2]
Failed to read a valid object file image from memory.
Core was generated by `/opt/novell/eDirectory/sbin/ndsd'.
#0 0xffffe410 in ?? ()
(gdb) where
#0 0xffffe410 in ?? ()
#1 0xb7ba340c in ?? ()
#2 0x00050493 in ?? ()
#3 0x00000000 in ?? ()
(gdb)
[Stack #3]
Failed to read a valid object file image from memory.
Core was generated by `/opt/novell/eDirectory/sbin/ndsd'.
#0 0xffffe410 in _start ()
(gdb) where
#0 0xffffe410 in _start ()
#1 0x55c8b3ec in ?? ()
#2 0x0007441f in ?? ()
#3 0x00000000 in ?? ()
(gdb)
[Stack #4]
Failed to read a valid object file image from memory.
Core was generated by `/opt/novell/eDirectory/sbin/ndsd'.
#0 0xffffe410 in _start ()
(gdb) where
#0 0xffffe410 in _start ()
#1 0x55c8b3ec in ?? ()
#2 0x00000045 in ?? ()
#3 0x00000000 in ?? ()
(gdb)
Here is an example of what a good stack looks like:
[Stack #5]
Core was generated by `//opt/novell/eDirectory/sbin/ndsd'.
Program terminated with signal 6, Aborted.
#0 0xffffe410 in __kernel_vsyscall ()
#where
#0 0xffffe410 in __kernel_vsyscall ()
#1 0xf7b3a8d0 in raise () from /.../lib/libc.so.6
#2 0xf7b3bff3 in abort () from /.../lib/libc.so.6
#3 0xf7b75cd9 in malloc_printerr () from /.../lib/libc.so.6
#4 0xf7b772c5 in free () from /.../lib/libc.so.6
#5 0xf7dc6e1d in SAL_free () from /.../lib/libsal.so.1
#6 0xda6d2bcc in PM_Free () from /.../lib/nds-modules/libnldap.so
#7 0xda6f69b1 in op_freeParams(int, slapi_operation_parameters*) () from /.../lib/nds-modules/libnldap.so
#8 0xda6d8bb6 in DoLBURPOperation () from /.../lib/nds-modules/libnldap.so
#9 0xd931bd0e in bulkLoadProcessing () from /.../lib/liblburp.so
#10 0xd931c750 in LBURPOpsProcessing () from /.../lib/liblburp.so
#11 0xd931c902 in LBURPMessageProcessing () from /.../lib/liblburp.so
#12 0xd931cc66 in LBURPOperationRequest () from /.../lib/liblburp.so
#13 0xd931cee8 in lburpExtensionHandler () from /.../lib/liblburp.so
#14 0xda6f1300 in DoExtended(slapi_pblock*) () from /.../lib/nds-modules/libnldap.so
#15 0xda6dde3e in OperationThread(op*) () from /.../lib/nds-modules/libnldap.so
#16 0xda6d35c9 in TPSetAvailableWorkInfo () from /.../lib/nds-modules/libnldap.so
#17 0x08058bed in PoolWorker(void*) ()
#18 0xf7c5e2ab in start_thread () from /.../lib/libpthread.so.0
#19 0xf7bd0a4e in clone () from /.../lib/libc.so.6
(gdb)
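
The difference between stacks #1 to #4 and stack #5 can be checked mechanically. The following is a minimal sketch, not a definitive test: `core_output_looks_invalid` is a hypothetical helper name, and it assumes that the warning text shown in the bad stacks above is a sufficient indicator. It reads GDB output on standard input and succeeds when the warning is present.

```shell
# Hypothetical helper: reads GDB output on stdin and returns 0 when it
# contains the warning seen in stacks #1 to #4 above.
core_output_looks_invalid() {
    grep -q 'Failed to read a valid object file image from memory'
}

# Example use (the core file path and the gdb.cmds file are assumptions):
#   echo where > gdb.cmds
#   gdb -batch -x gdb.cmds /opt/novell/eDirectory/sbin/ndsd /path/to/core 2>&1 \
#       | core_output_looks_invalid && echo "core looks invalid"
```

GDB output is redirected with `2>&1` in the example so that warnings written to standard error are also inspected.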
  

Resolution

To resolve this problem, update the kernel to a version *newer* than the following:
  • SLES 9 SP4:    2.6.5-7.319
  • SLES 10 SP2:  2.6.16.60-0.42.5
  • SLES 10 SP3:  2.6.16.60-0.54.5
The SLES 11 Kernel (2.6.27.x and newer) does not show the problem in question.
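
The comparison against the kernel versions listed above can be scripted. The following is a minimal sketch under stated assumptions: `version_newer` is a hypothetical helper name, versions are compared field by field after splitting on `.` and `-`, and non-numeric fields (for example a flavor suffix such as `-default` or `-smp` in the output of `uname -r`) are treated as 0. The default threshold is the SLES 10 SP3 value from the list; set `threshold` to the value that matches the installed release.

```shell
# Hypothetical helper: returns 0 if version $1 is strictly newer than $2.
# Written for POSIX sh, so the working variables are global.
version_newer() {
    a=$1
    b=$2
    while [ -n "$a" ] || [ -n "$b" ]; do
        # Take the next field of each version (fields end at '.' or '-').
        x=${a%%[.-]*}
        y=${b%%[.-]*}
        # Drop the consumed field; empty the string when no separator is left.
        case $a in *[.-]*) a=${a#*[.-]} ;; *) a= ;; esac
        case $b in *[.-]*) b=${b#*[.-]} ;; *) b= ;; esac
        # Assumption: empty or non-numeric fields count as 0.
        case $x in ''|*[!0-9]*) x=0 ;; esac
        case $y in ''|*[!0-9]*) y=0 ;; esac
        [ "$x" -gt "$y" ] && return 0
        [ "$x" -lt "$y" ] && return 1
    done
    # All fields are equal, so $1 is not strictly newer.
    return 1
}

# Compare the running kernel with the chosen threshold.
threshold=${threshold:-2.6.16.60-0.54.5}
if version_newer "$(uname -r)" "$threshold"; then
    echo "kernel $(uname -r) is newer than $threshold"
else
    echo "kernel $(uname -r) is not newer than $threshold"
fi
```

The comparison is strict on purpose, because the versions in the list are the ones that must be exceeded.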
 
 

Additional Information

First and foremost, it is important to understand that this problem is not related to or caused by eDirectory. It is a Linux kernel problem and can therefore affect any Linux application or service that is debugged with GDB. GDB then tries to access the vsyscall memory region using an incorrect ending address and incorrect permissions.
 
The problem appears to occur only when the core is generated with GDB (gcore) while the process or application is still running. It does not occur when the core is created automatically because of a crash. If a running process or application is killed with:  kill -SIGABRT {pid}  the core file created should be valid.
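
The signal-based approach can be sketched as follows. `abort_for_core` is a hypothetical helper name, and `kill -s ABRT` is the POSIX spelling of the `kill -SIGABRT` command mentioned above. Note that SIGABRT terminates the process, so the service stops, and a core file is written only if core dumps are enabled for that process.

```shell
# Hypothetical helper: sends SIGABRT to the given process ID so that the
# kernel, rather than GDB (gcore), writes the core file.
abort_for_core() {
    pid=$1
    # kill -0 sends no signal; it only checks that the process exists
    # and that the caller is allowed to signal it.
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "no such process or permission denied: $pid" >&2
        return 1
    fi
    kill -s ABRT "$pid"
}
```

It would be called as `abort_for_core {pid}`, where {pid} is the process ID of the running ndsd process.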
 
Several public discussion threads cover this problem. The fix was checked in for SLES 11, so the problem should not occur there. Older versions of SLES (9 and 10), however, require kernel updates to fix it.