Sentinel services are unresponsive

  • 7007293
  • 02-Dec-2010
  • 10-Oct-2012

Environment

Sentinel Log Manager 1.1.x
Sentinel 7.0.x

Situation

Sentinel is unresponsive: the Collector Manager is down and events cannot be queried.

Error found in the server0.0.log:
java.io.FileNotFoundException: (Too many open files)
Full text of the server0.0.log error:
Fri Jul 16 17:10:49 EEST
2010|SEVERE|IndexedLogComponent.LoggerThread|Unknown.unknown
java.io.FileNotFoundException:/opt/novell/sentinel_log_mgr_1.0_x86-64/data/eventdata/20100716_408E7E50-C02E-4325-B7C5-2B9FE4853476/events.evt/blocks
(Too many open files)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
at esecurity.base.io.IndexedDeflateOutput.<init>(IndexedDeflateOutput.java:153)
at esecurity.ccs.comp.event.indexedlog.IndexedLogger.<init>(IndexedLogger.java:81)
at esecurity.ccs.comp.event.indexedlog.IndexedLogPartitionManager.getIndexedLogger(IndexedLogPartitionManager.java:343)
at esecurity.ccs.comp.event.indexedlog.IndexedLogComponent$LoggerThread.run(IndexedLogComponent.java:396)


Resolution

SUSE Linux Enterprise Server, like all Linux distributions, enforces per-user resource limits as a security measure. In this case there are too many open files for the novell user. The default open-files limit is set relatively low, and because almost everything on Linux is represented as a file descriptor, the novell user's processes on SLES 11 can quickly exceed it.
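The per-user limits described above can be inspected with the shell built-in ulimit; a minimal sketch, run as the novell user:

```shell
# Show the current limits on open file descriptors for the
# logged-in user (run as novell to see the Sentinel limits)
ulimit -Sn   # soft limit
ulimit -Hn   # hard limit
```

The soft limit is what a process hits first; the hard limit is the ceiling to which a non-root user may raise the soft limit.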

To verify that this is the issue, first do the following:

1> Determine the PID of the server.sh process:

novell@sles11SLM:> ps -ef | grep novell

<sample output>
novell    4116  4114 41 14:23 ?        00:01:14 /opt/novell/sentinel_log_mgr/jre/bin/java -Dsrv_name=Server -server -Desecurity.home=/opt/novell/sentinel_log_mgr...
.....
.....

2> Check the open files for the novell server.sh process using the lsof command, specifying the PID obtained from the ps command above:

novell@sles11SLM:> lsof -p 4116 > /tmp/lsofout.txt

This command writes the lsof output for that PID to a temporary file called /tmp/lsofout.txt.
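If lsof is not available, the same count can be read straight from /proc; a sketch, assuming the PID 4116 from the sample ps output above:

```shell
# Count open file descriptors directly from /proc (Linux-specific);
# each entry under /proc/<pid>/fd is one open descriptor
pid=4116   # server.sh java PID from step 1
ls /proc/"$pid"/fd | wc -l
```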

3> Run a line count on the file to see how many open files the process has:

novell@sles11SLM:> wc -l /tmp/lsofout.txt
6115 /tmp/lsofout.txt

This shows 6115 open files for the novell user's server.sh process.

4> Check the ulimit settings to determine the maximum number of open files:

novell@sles11SLM:> ulimit -aH | grep 'open files'
open files                     (-n) 6115

This shows that the process is at the limit for maximum open files, which explains the error in server0.0.log.
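Putting steps 3 and 4 together, a hedged sketch that warns when descriptor usage approaches the soft limit (the 90% threshold is an illustrative choice, not from this TID):

```shell
# Compare current FD usage against the soft open-files limit
used=$(wc -l < /tmp/lsofout.txt)   # lsof line count from step 3
limit=$(ulimit -Sn)                # soft limit for this shell's user
if [ "$used" -ge $((limit * 9 / 10)) ]; then
    echo "WARNING: $used of $limit file descriptors in use"
fi
```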


To resolve this issue, raise the maximum open files setting by editing the /etc/security/limits.conf file:

1> Stop the SLM/Sentinel service as the novell user:
      novell@sles11SLM:> server.sh stop
2> Edit the /etc/security/limits.conf file and add the following lines:

novell soft nofile 65000
novell hard nofile 65000
 
3> Save the file and start the SLM/Sentinel service as the novell user:
     novell@sles11SLM:> server.sh start

The issue should now be resolved, as the new maximum open files limit for the novell user is 65000. Note that limits.conf settings are applied at login (via PAM), so the novell user must start a fresh login session before starting the service for the new limit to take effect.
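To confirm the change took effect, a quick check run as root (su - starts a fresh login session, so the new limits.conf values apply):

```shell
# Report the open-files limit seen by a new novell login shell;
# once limits.conf has the nofile entries this should print 65000
su - novell -c 'ulimit -n'
```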

The SLM processes should not exceed this limit even under heavy load, but it is strongly recommended that you consult your Linux administrator before making changes to the limits.conf file.

For more information on the limits.conf file, see the limits.conf(5) man page.

Bug Number

624095