Linux Access Gateway appliance randomly crashes and restarts

  • 7009914
  • 20-Dec-2011
  • 07-Jun-2013

Environment

Novell Access Manager 3.1 Linux Access Gateway SP3 IR2

Situation

Symptoms:

The Linux Access Gateway (LAG) appliance randomly suffers very short outages, usually lasting less than a minute, after which it comes back up and runs again without manual intervention.

Resolution

This is fixed in NAM 3.1 SP4 IR1.

This type of symptom can have multiple root causes; see the Additional Information section for details. In the specific scenario where this issue was reported, the following steps fixed the problem:

1) Make sure that the LAG is patched to at least 3.1 SP3 IR2, then create the following touch files and restart the LAG:

# touch /var/novell/.releaseclosewait
# touch /var/novell/.fixCloseWait
# touch /var/novell/.releasetimedoutbrowserconn
# /etc/init.d/novell-vmc stop
# rm /var/novell/.~newInstall
# /etc/init.d/novell-vmc start

These touch files enable a thread that walks through all TCP connections that have remained in the CLOSE_WAIT state for a period of time, resets them, and cleans up their resources.

2) After the touch files in step 1) were enabled, the issue occurred less frequently, but some specific resources still triggered the problem. An issue was found in the rewriter code, related to dynamic pages and chunked responses; it was reported to Engineering and fixed in SP4 IR1.
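
To judge whether the CLOSE_WAIT cleanup from step 1) is having an effect, it can help to watch how many connections sit in that state over time. The following is a minimal sketch, not part of the original fix: the helper name count_close_wait is hypothetical, and it assumes netstat-style output in which the first column is the protocol and the sixth column is the connection state.

```shell
# Hypothetical helper: count TCP connections in CLOSE_WAIT.
# Reads netstat-style lines on stdin (assumed layout: Proto Recv-Q Send-Q
# Local-Address Foreign-Address State) and prints a single number.
count_close_wait() {
    awk '$1 ~ /^tcp/ && $6 == "CLOSE_WAIT" { n++ } END { print n + 0 }'
}

# Possible usage on the appliance (assumes netstat is installed):
#   netstat -ant | count_close_wait
```

A count that keeps growing between checks would suggest that connections are still accumulating in CLOSE_WAIT; a stable or falling count would be consistent with the cleanup thread doing its work.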

Additional Information

Further analysis revealed that the ics_dyn process was crashing and automatically restarting. With the touch file "/var/novell/.dumpcore" in place, it was possible to collect and analyze the core dump produced at every crash.

The back trace obtained from the core dumps follows. Note that obtaining this exact back trace requires analyzing the core with the proper symbols loaded:

Core was generated by `/opt/novell/bin/ics_dyn -t 0 -i 0 -m 25 -d -l -C 1 -M'.
Program terminated with signal 11, Segmentation fault.
#0 SCacheFree (pCache=0xfd64e4, pObject=0x9a1fbc00) at scache.cc:2366
2366 scache.cc: No such file or directory.
in scache.cc

#0 SCacheFree (pCache=0xfd64e4, pObject=0x9a1fbc00) at scache.cc:2366
#1 0x87fbb8c5 in NWUtilFree (pMem=0x9a1fbc04) at scache.cc:2522
#2 0x87411a86 in SendUrl (wpRb=0x91887024) at /home/schoi/build_sles11/LinuxAccessGateway~AccessManager3.1_SP3_IR/LinuxAccessGateway/legacy/s_proxy/bmwps.c:3483
#3 0x87412755 in WpParseLoop (wpRb=0x91887024) at /home/schoi/build_sles11/LinuxAccessGateway~AccessManager3.1_SP3_IR/LinuxAccessGateway/legacy/s_proxy/bmwps.c:3978
#4 0x87412848 in BMWPSParseSendData (segment=0xfd64e4, rb=0x91887024) at /home/schoi/build_sles11/LinuxAccessGateway~AccessManager3.1_SP3_IR/LinuxAccessGateway/legacy/s_proxy/bmwps.c:4343
#5 0x8741292d in WPFilterRoutineCB (filterContext=0x92041a24, dataBuffer=0x9aaaaed0) at /home/schoi/build_sles11/LinuxAccessGateway~AccessManager3.1_SP3_IR/LinuxAccessGateway/legacy/s_proxy/bmwps.c:493
#6 0x87465a40 in CachedWebItem::writeData (this=0xa58f71cc, pBufSeg=0x9aaaaed0) at /home/schoi/build_sles11/LinuxAccessGateway~AccessManager3.1_SP3_IR/LinuxAccessGateway/legacy/s_proxy/s_cos/webcache.cpp:5016
#7 0x85b5ec16 in HttpFillResponseDataStreamManager::finishProcessData (context=0x9d4a9724, event=0x9cff3444) at /home/schoi/build_sles11/LinuxAccessGateway~AccessManager3.1_SP3_IR/LinuxAccessGateway/vcp/s_dataStream/HttpFillResponseDataStreamManager.cpp:391
#8 0x85b5fe9a in HttpDataStreamEventQueue::callback (context=0xa4ec6a24) at /home/schoi/build_sles11/LinuxAccessGateway~AccessManager3.1_SP3_IR/LinuxAccessGateway/vcp/s_dataStream/HttpDataStreamEvent.cpp:153
#9 0xb7f5f447 in _ExecuteWork (thread=0xb41fb1b0, work=0xa4ec6a24) at sysapi.c:637
#10 0xb7f5fff7 in _WorkThreadMain (param=0xb7f69780) at sysapi.c:909
#11 0xb7f5beee in threadMain (args=0xb41fb1b0) at nksthread.c:161
#12 0xb7d221b5 in start_thread () from /lib/libpthread.so.0
#13 0xb7e073be in clone () from /lib/libc.so.6
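
When comparing a core dump from another system against the back trace above, the addresses and build paths will usually differ, so it can be easier to compare only the sequence of function names. The following is a rough sketch under stated assumptions: the helper name frame_functions is hypothetical, the input is assumed to be gdb "bt" output in the format shown above, and the core file path in the usage comment is a placeholder because this document does not state where the cores are written.

```shell
# Hypothetical helper: print one function name per stack frame.
# Reads gdb backtrace lines on stdin. Frames look like either
#   #0 SCacheFree (...) at file:line
#   #1 0x87fbb8c5 in NWUtilFree (...) at file:line
# so the name is field 4 when field 3 is "in", otherwise field 2.
frame_functions() {
    awk '/^#[0-9]+/ { if ($3 == "in") print $4; else print $2 }'
}

# Possible usage (core path is a placeholder, symbols must be loaded):
#   gdb -batch -ex bt /opt/novell/bin/ics_dyn /path/to/core | frame_functions
```

If the top frames come out as SCacheFree, NWUtilFree, SendUrl, and WpParseLoop, the crash is at least consistent with the one analyzed in this document.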



As mentioned earlier, the issue in this scenario was a combination of SCACHE memory allocation errors caused by memory leaks and a rewriter problem related to dynamic pages and chunked responses.

If you believe your system is suffering from the same problem and the symptoms persist after you enable the touch files described in the Resolution section of this document, open a Service Request with Novell Technical Support and reference this TID.