Linux Access Gateway randomly crashing every few weeks in nkEnterDebugger()

  • Document ID: 7009774
  • Creation Date: 21-Nov-2011
  • Modified Date: 26-Apr-2012

Environment

Novell Access Manager 3.1 Linux Access Gateway
Novell Access Manager 3.1 Support Pack 3 applied

Situation

Access Manager is set up and working well: users can successfully access all Linux Access Gateway (LAG) protected resources after authenticating to the Identity (IDP) server. Every few weeks, however, the LAG proxy restarts, and the administrator is notified by email because of the level of logging enabled.

By forcing a coredump when the event happens (this simply requires creating the /tmp/.dumpcore file on the LAG), one can examine the backtrace to see the call stack at the time of the crash. The following commands produced the backtrace shown below:

# cd /chroot/lag/
# gdb opt/novell/bin/ics_dyn core.1234          (1234 being the processID of the ics_dyn process that crashed)

(gdb) bt
#0 nkEnterDebugger () at nksutil.c:693
#1 0x887bbf15 in SCacheFree (pCache=0x887d7484, pObject=0x955cb0c0) at scache.cc:2268
#2 0x887bc0f9 in NWUtilFree (pMem=0x955cb0c4) at scache.cc:2408
#3 0x87603579 in ~VccRequestHandle (this=0xa2e9a024) at /home/vitta/313-656868/vcp/s_vcc/VccService.cpp:1087
#4 0x876035bf in VccHandleEvent::handleDelete (this=0x9c692e64,vt=0xac23b924) at /home/vitta/313-656868/vcp/s_vcc/VccService.cpp:458
#5 0x87606d82 in VccHandleEvent::handleDelete (event=0xffffffff,eventQueue=0x90f7df38) at /home/vitta/313-656868/vcp/s_vcc/VccService.cpp:395
#6 0xb5efb463 in active_serial_mainloop (dispset=0x8069cc0) at /opt/novell/include/vxe/evtinfra/evtsched.h:88
#7 0xb7fcddf3 in threadMain (args=0xb0003870) at nksthread.c:156
#8 0xb7d931b5 in start_thread () from /lib/libpthread.so.0
#9 0xb7e783be in ?? () from /lib/libc.so.6
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
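
The interactive gdb session above can also be run non-interactively, which is convenient when the crash only happens every few weeks. The following is a minimal sketch, not a supported tool: the /chroot/lag directory, the opt/novell/bin/ics_dyn path and the core.<pid> naming come from this article, while the function names, the LAG_ROOT override and the .bt.txt output suffix are illustrative assumptions.

```shell
#!/bin/sh
# Sketch only: collect a backtrace from the newest ics_dyn core file.
# LAG_ROOT and ICS_BINARY use the paths quoted in this article;
# LAG_ROOT can be overridden from the environment (an assumption for testing).
LAG_ROOT="${LAG_ROOT:-/chroot/lag}"
ICS_BINARY="opt/novell/bin/ics_dyn"

# Print the path of the most recently modified core.* file in LAG_ROOT.
# Prints nothing if no core file exists.
latest_core() {
    ls -t "$LAG_ROOT"/core.* 2>/dev/null | head -n 1
}

# Run gdb in batch mode against the given core file and save the
# output of "bt" next to it as <core>.bt.txt (hypothetical naming).
save_backtrace() {
    core="$1"
    ( cd "$LAG_ROOT" && gdb -batch -ex "bt" "$ICS_BINARY" "$core" ) \
        > "$core.bt.txt" 2>&1
}
```

After sourcing the file, a typical use would be `core=$(latest_core); [ -n "$core" ] && save_backtrace "$core"`, which leaves the backtrace in a text file that can be attached to a service request.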


Resolution

Manually restart the LAG after a large number of changes have been applied. The issue is caused by a small memory leak that accumulates as changes are applied over time. The crash is not usually seen within a three to four week period, and the impact on users is minimal in a fault-tolerant environment because the server usually comes back up within seconds (unless the administrator forces a coredump with the .dumpcore touch file).
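
Because the .dumpcore touch file lengthens recovery when a crash occurs, it may be worth enabling it only while a core is actually needed. The helper functions below are a hedged sketch: the /tmp/.dumpcore path is the one given in this article, whereas the function names and the DUMPCORE_FLAG override are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Sketch only: manage the touch file that makes the LAG write a core on crash.
# /tmp/.dumpcore is the path quoted in this article; DUMPCORE_FLAG can be
# overridden from the environment (an assumption, useful for dry runs).
DUMPCORE_FLAG="${DUMPCORE_FLAG:-/tmp/.dumpcore}"

# Create the (empty) touch file so the next crash produces a core file.
enable_coredump() {
    touch "$DUMPCORE_FLAG"
}

# Remove the touch file once a core has been captured.
disable_coredump() {
    rm -f "$DUMPCORE_FLAG"
}

# Report whether the touch file is currently present.
coredump_status() {
    if [ -e "$DUMPCORE_FLAG" ]; then
        echo "enabled"
    else
        echo "disabled"
    fi
}
```

After sourcing the file, `enable_coredump` followed by `coredump_status` should report `enabled`; running `disable_coredump` after the core has been collected returns the LAG to its normal restart behaviour.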

Should the impact of the crash be a major concern, the LAG high availability option can be enabled. The feature, described at https://www.novell.com/documentation/novellaccessmanager31/accessgatewayhelp/?page=/documentation/novellaccessmanager31/accessgatewayhelp/data/brpp00h.html, allows the administrator to run multiple instances of the ics_dyn process on the same LAG host. If one of the ics_dyn processes dies, another process configured on the same host takes over with no downtime for users.

The issue has been reported to engineering and will not be a problem in the Access Manager 3.2 release (the AGS does not have the same interface).