Sentinel 7 Performance Monitoring

  • 7009554
  • 12-Oct-2011
  • 26-Apr-2012

Environment

Sentinel 7

Situation

Can Sentinel handle the event load (events per second)?

Administrators commonly want to know whether their system can handle the number of events it receives every second.

Resolution

In Sentinel 7, this can be determined fairly easily by inspecting the logs.

The most critical performance-related information is consolidated into a snapshot that is logged every 15 minutes by default.


The frequency of this snapshot is configured in a single place, in the InitializerComponent that can be found in the server.xml file in the Sentinel configuration directory (or the equivalent XML file for a remote correlation engine or collector manager).


<obj-component id="InitializerComponent">

<class>esecurity.base.ccs.comp.InitializerComponent</class>

...

<property name="throttleCheckIntervalSec">900</property>

</obj-component>


The property "throttleCheckIntervalSec" can be adjusted to have the snapshot reported more or less frequently. If no InitializerComponent is specified, the default of 900 seconds (15 minutes) is used.
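As a rough illustration of the lookup described above, the following Python sketch reads the property from configuration XML and falls back to 900 seconds when it is absent. The `<root>` wrapper element and the value 300 in the sample are assumptions made only to keep the example self-contained; the real layout of server.xml around the component is not shown in this article.

```python
import xml.etree.ElementTree as ET

DEFAULT_INTERVAL_SEC = 900  # default stated in this article


def snapshot_interval_sec(xml_text):
    """Return throttleCheckIntervalSec from the XML text, or 900 if it is not set."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so the nesting depth of the component does not matter.
    for comp in root.iter("obj-component"):
        if comp.get("id") != "InitializerComponent":
            continue
        for prop in comp.iter("property"):
            if prop.get("name") == "throttleCheckIntervalSec" and prop.text:
                return int(prop.text.strip())
    return DEFAULT_INTERVAL_SEC


# Hypothetical sample: the <root> wrapper and the value 300 are illustrative only.
sample = """
<root>
  <obj-component id="InitializerComponent">
    <class>esecurity.base.ccs.comp.InitializerComponent</class>
    <property name="throttleCheckIntervalSec">300</property>
  </obj-component>
</root>
"""

print(snapshot_interval_sec(sample))     # value taken from the sample
print(snapshot_interval_sec("<root/>"))  # no component, so the default applies
```
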


Now let's see what exactly is reported:


**************************** Performance Snapshot ****************************

For the last 15.00 min:


[EventStoreService] Event throughput capacity is at 23% for the past 15.00 min.

Relative processing time of system components:

13% (1.97 min) Correlation

5% (50.62 sec) Event Indexing and Storage

2% (25.48 sec) Active Views

0% (45 ms) Security Intelligence


[RawDataStoreService] Event throughput capacity is at 6% for the past 15.00 min.


[EventRouterServer] Event throughput capacity is at 0% for the past 15.00 min.

Relative processing time of system components:

0% (4 ms) Mapping

0% (1 ms) Routing Rules and Actions

0% (0 ms) Tagging


Number of files currently in raw data message disk cache directory = 0

Number of files currently in event message disk cache directory = 0

EPS over last 900 seconds = 4,999.984, total events = 4,500,021

Events currently in log queue 0/10,000 (0%), Events currently in index queue 76/50,000 (0%)

Events fetched from store instead of RAM for indexing 0/5,138,810 (0%)

Current number of file handles this process has open 979/16,384 (5%)

Current number of search jobs = 0

56 event(s) with source [internal] avg delay 0 secs.

899,993 event(s) with source F4E636F0-D17F-102E-B5B8-B8AC6F8D0837 avg delay 1 secs.

899,993 event(s) with source F4E636F0-D17F-102E-B52B-B8AC6F8D0837 avg delay 1 secs.

899,993 event(s) with source F4E636F0-D17F-102E-B54F-B8AC6F8D0837 avg delay 1 secs.

899,993 event(s) with source F4E636F0-D17F-102E-B59B-B8AC6F8D0837 avg delay 1 secs.

899,993 event(s) with source F4E636F0-D17F-102E-B5D9-B8AC6F8D0837 avg delay 1 secs.

******************************************************************************


First let's look at this section:


[EventStoreService] Event throughput capacity is at 23% for the past 15.00 min.

Relative processing time of system components:

13% (1.97 min) Correlation

5% (50.62 sec) Event Indexing and Storage

2% (25.48 sec) Active Views

0% (45 ms) Security Intelligence


The EventStoreService is the component responsible for consuming events. Several subcomponents must process each event, and this statistic measures both the total time spent consuming events and the relative time each subcomponent spent on them. In this case, the component was busy processing events about 23% of the time and was essentially idle for the remaining 77%. Judging by this statistic alone, the system does not appear to be overwhelmed by the event rate. Keep in mind, however, that the figure is averaged over 15 minutes: the system could have been saturated at 100% for several minutes and then almost completely idle for the remainder, which would still average out to 23% over the 15-minute period. This is why the statistic is only part of the picture, and we need to look at other statistics as well.
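To make the averaging caveat concrete, here is a minimal Python sketch that pulls the capacity figures out of snapshot text and converts an average into the total busy time it implies. The regular expression is written against the sample lines in this article; the function names are illustrative and not part of Sentinel.

```python
import re

# Pattern written against the sample snapshot lines shown in this article.
CAPACITY_RE = re.compile(
    r"\[(?P<service>\w+)\] Event throughput capacity is at "
    r"(?P<pct>\d+)% for the past (?P<minutes>[\d.]+) min"
)


def parse_capacity(snapshot):
    """Map each service name to (capacity percent, interval in minutes)."""
    return {
        m.group("service"): (int(m.group("pct")), float(m.group("minutes")))
        for m in CAPACITY_RE.finditer(snapshot)
    }


def busy_minutes(pct, minutes):
    """Total busy time implied by an average capacity figure."""
    return minutes * pct / 100.0


snapshot = """
[EventStoreService] Event throughput capacity is at 23% for the past 15.00 min.
[RawDataStoreService] Event throughput capacity is at 6% for the past 15.00 min.
[EventRouterServer] Event throughput capacity is at 0% for the past 15.00 min.
"""

capacity = parse_capacity(snapshot)
pct, minutes = capacity["EventStoreService"]
print(busy_minutes(pct, minutes))  # 3.45
```

An average of 23% over 15 minutes corresponds to 3.45 minutes of busy time in total. That is equally consistent with a steady light load and with roughly three and a half minutes at full saturation followed by idle time, which is why the figure cannot be read in isolation.
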


But first, let's look at the other sections reporting capacity statistics:


[RawDataStoreService] Event throughput capacity is at 6% for the past 15.00 min.


In addition to events, raw data also flows into the system. This statistic has no subcomponents, but it gives an idea of whether the system is able to consume the raw data quickly enough. In this case, the raw data store is at only 6% of capacity, which is one indication that raw data processing is keeping up.


[EventRouterServer] Event throughput capacity is at 0% for the past 15.00 min.

Relative processing time of system components:

0% (4 ms) Mapping

0% (1 ms) Routing Rules and Actions

0% (0 ms) Tagging


The EventRouterServer looks at time spent processing events in the collector manager. In a typical configuration where a user only has remote collector managers, these stats will be close to 0% on the Sentinel server machine. The remote collector managers will have more useful statistics that can give one indication of how loaded a given collector manager is. However, this stat may still provide useful information on the Sentinel server for internal and correlation events, especially in cases where there are excessive numbers of these events being generated.


The capacity statistics are a good place to start for understanding the system load. However, to see the full picture, we need to look at other stats as well. Let's start with these:


Number of files currently in raw data message disk cache directory = 0

Number of files currently in event message disk cache directory = 0


Both the raw data store and the event store can slow down for various reasons. For example, there could be tasks running that compress raw data, or index maintenance could be in progress for events. A temporary slowdown in these subcomponents would normally slow down the event flow. To prevent this, both the raw data and the events have a disk cache where events or raw data records are saved when the normal stores get slow. In a healthy system, these directories will typically be empty or contain a small number of files, but what is important to look for here is the trend over time. If the number of files in one of these directories continues to grow over time, this is a strong indication that the system is falling behind. These directories should grow only temporarily to absorb spikes or temporary slowdowns in the system, and then return to zero files.
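Because the trend matters more than any single value, it can help to collect the file counts from consecutive snapshots and test whether they keep climbing. The following Python sketch shows one possible check. The four-snapshot window and both example histories are assumptions chosen for illustration; they are not Sentinel defaults or measured data.

```python
import re

# Pattern written against the sample snapshot lines shown in this article.
CACHE_RE = re.compile(
    r"Number of files currently in (?P<kind>raw data|event) "
    r"message disk cache directory = (?P<count>[\d,]+)"
)


def parse_cache_counts(snapshot):
    """Return {'raw data': n, 'event': n} for one snapshot."""
    return {
        m.group("kind"): int(m.group("count").replace(",", ""))
        for m in CACHE_RE.finditer(snapshot)
    }


def is_growing(counts, min_samples=4):
    """True if the last min_samples counts never decrease and end higher than they began."""
    if len(counts) < min_samples:
        return False
    recent = counts[-min_samples:]
    non_decreasing = all(b >= a for a, b in zip(recent, recent[1:]))
    return non_decreasing and recent[-1] > recent[0]


snapshot = """
Number of files currently in raw data message disk cache directory = 0
Number of files currently in event message disk cache directory = 0
"""
print(parse_cache_counts(snapshot))

# Hypothetical histories, one value per snapshot (not taken from a real system).
print(is_growing([0, 3, 0, 2, 0]))      # spikes that drain again
print(is_growing([0, 5, 12, 40, 95]))   # sustained growth
```

A cache that spikes and drains again is expected behavior. A cache that only grows across several snapshots is the pattern to investigate.
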


The next statistic reported is self-explanatory:


EPS over last 900 seconds = 4,999.984, total events = 4,500,021


This just provides information on what the events per second (EPS) rate has been for the last interval, and how many events were processed in that interval.
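The relationship between the two numbers can be checked with a few lines of Python. This sketch parses the line and divides the total by the reported rate to recover the interval that was actually measured. The pattern is written against the sample line above, and the function name is illustrative.

```python
import re

# Pattern written against the sample EPS line shown in this article.
EPS_RE = re.compile(
    r"EPS over last (?P<secs>[\d,]+) seconds = (?P<eps>[\d,.]+), "
    r"total events = (?P<total>[\d,]+)"
)


def parse_eps(line):
    """Return (nominal interval seconds, EPS, total events) from the EPS line."""
    m = EPS_RE.search(line)
    if m is None:
        raise ValueError("not an EPS line")
    secs = int(m.group("secs").replace(",", ""))
    eps = float(m.group("eps").replace(",", ""))
    total = int(m.group("total").replace(",", ""))
    return secs, eps, total


line = "EPS over last 900 seconds = 4,999.984, total events = 4,500,021"
secs, eps, total = parse_eps(line)

# total / EPS gives the interval the rate was computed over.
measured_secs = total / eps
print(secs, eps, total)
print(round(measured_secs, 3))
```

For the sample line, the result is about 900.007 seconds, slightly more than the nominal 900, so the reported rate appears to be based on the measured interval rather than the configured one.
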


Now let's skip these statistics:


Events currently in log queue 0/10,000 (0%), Events currently in index queue 76/50,000 (0%)

Events fetched from store instead of RAM for indexing 0/5,138,810 (0%)

Current number of file handles this process has open 979/16,384 (5%)

Current number of search jobs = 0


The first two lines are really only helpful for understanding why event storage may be slow. View these as advanced statistics that you most likely will not need to look at. The next two lines provide miscellaneous information about resources in the system. These stats are not useful in determining whether Sentinel can handle the event load, but they may be useful in diagnosing other problems, such as search and reporting performance issues. The final block of statistics, however, is very useful:


56 event(s) with source [internal] avg delay 0 secs.

899,993 event(s) with source F4E636F0-D17F-102E-B5B8-B8AC6F8D0837 avg delay 1 secs.

899,993 event(s) with source F4E636F0-D17F-102E-B52B-B8AC6F8D0837 avg delay 1 secs.

899,993 event(s) with source F4E636F0-D17F-102E-B54F-B8AC6F8D0837 avg delay 1 secs.

899,993 event(s) with source F4E636F0-D17F-102E-B59B-B8AC6F8D0837 avg delay 1 secs.

899,993 event(s) with source F4E636F0-D17F-102E-B5D9-B8AC6F8D0837 avg delay 1 secs.


This provides a per-Collector Manager breakdown of event flow and the average delay between the time events were sent and the time the server received them. If these delays are no more than a few seconds on average, the system is most likely keeping up just fine. However, it is still possible for the event message disk cache to be growing while the delays stay small. This means that the real-time subcomponents of the system (Correlation, Active Views, Security Intelligence) are able to keep up, but the event store is not able to index and store events fast enough. In that case, the system is still not keeping up overall.
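The two conditions described above can be combined into a simple check. The Python sketch below parses the per-source delay lines and reports the system as keeping up only when no source exceeds a delay threshold and the event disk cache is not growing. The 5-second threshold is an assumption standing in for "a few seconds", and the function names are illustrative.

```python
import re

# Pattern written against the sample delay lines shown in this article.
DELAY_RE = re.compile(
    r"(?P<count>[\d,]+) event\(s\) with source (?P<source>\S+) "
    r"avg delay (?P<delay>\d+) secs"
)


def parse_delays(snapshot):
    """Map each source ID to (event count, average delay in seconds)."""
    return {
        m.group("source"): (
            int(m.group("count").replace(",", "")),
            int(m.group("delay")),
        )
        for m in DELAY_RE.finditer(snapshot)
    }


def slow_sources(delays, max_delay_secs=5):
    """Sources whose average delay exceeds the threshold (5 s is an assumed cutoff)."""
    return [src for src, (_, delay) in delays.items() if delay > max_delay_secs]


def keeping_up(delays, event_cache_growing, max_delay_secs=5):
    """Small delays alone are not enough: the event disk cache must not be growing."""
    return not event_cache_growing and not slow_sources(delays, max_delay_secs)


snapshot = """
56 event(s) with source [internal] avg delay 0 secs.
899,993 event(s) with source F4E636F0-D17F-102E-B5B8-B8AC6F8D0837 avg delay 1 secs.
"""
delays = parse_delays(snapshot)
print(slow_sources(delays))
print(keeping_up(delays, event_cache_growing=False))
print(keeping_up(delays, event_cache_growing=True))
```
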


One more very important point: for these delay statistics to be meaningful, the timestamps in the events must be comparable to the time on the Sentinel server. If there is significant time skew between a collector manager machine and the Sentinel server machine, or if an event source is configured to trust event source time and that time is skewed from the time on the collector manager machine, then these delay stats are meaningless for determining whether the system is keeping up. They may, however, be useful in determining whether a given collector manager is producing events with old timestamps.
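The effect of clock skew on the reported delay can be illustrated with simple arithmetic. The Python sketch below assumes the delay is the server's receive time minus the timestamp carried in the event, which matches the description above but is not confirmed against the Sentinel implementation. The dates and the 120-second skew are arbitrary illustrative values.

```python
from datetime import datetime, timedelta


def apparent_delay_secs(event_timestamp, server_receive_time):
    """Delay as the server would see it: receive time minus the event's own timestamp."""
    return (server_receive_time - event_timestamp).total_seconds()


# Arbitrary illustrative times, expressed on the Sentinel server's clock.
sent_true = datetime(2012, 4, 26, 12, 0, 0)
received = sent_true + timedelta(seconds=1)  # true transit time of 1 second

# Hypothetical skew: the collector manager clock runs 120 seconds behind the server.
skew = timedelta(seconds=120)
stamped = sent_true - skew

print(apparent_delay_secs(sent_true, received))  # 1.0, clocks in sync
print(apparent_delay_secs(stamped, received))    # 121.0, skew added to the delay
```

With synchronized clocks the delay reflects the true transit time. With skew, the same event appears to be two minutes late even though the system is keeping up.
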