How to handle Event Insertion Failures in Sentinel's DAS Binary process

  • 3019521
  • 17-Aug-2007
  • 26-Apr-2012

Environment


Sentinel 5.1.3
Sentinel 5.1.3 Sentinel Server
Sentinel 6.0.xx
Sentinel 6.0.xx Sentinel Server

Situation

During operation of Sentinel 5.1.3 or Sentinel 6, the database may receive more events in a given period than it can insert, or it may fill up and stop accepting events altogether. In either case DAS Binary, the Sentinel process that inserts events into the database, caches the events to files until the database can accept them again. The details of how this cache works, described below, are essential for troubleshooting and resolving these issues correctly.

Resolution

First, note that not all events go through the file-based buffer. The buffer is used when a set of events fails to insert or when the database is down. In Sentinel 6, it is also used when events arrive faster than the database can insert them.

Also note that the das_binary.xml file contains a configurable setting that controls how long events in the buffer are retried. The default is eight hours, but it can be changed. Events older than this setting are moved to the 'expired' directory, where they remain indefinitely unless an administrator moves or deletes them.
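The retry rule amounts to a simple age check. The sketch below is an illustration only, not Sentinel's code: the helper name `is_expired` is hypothetical, the eight-hour default comes from the paragraph above, and treating an age of exactly eight hours as not yet expired is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Assumption: eight hours is the default retry window described in the text.
# The real setting is configured in das_binary.xml; its property name is not
# reproduced here.
DEFAULT_RETRY_WINDOW = timedelta(hours=8)


def is_expired(buffered_at: datetime, now: datetime,
               retry_window: timedelta = DEFAULT_RETRY_WINDOW) -> bool:
    """Return True once a buffered batch has been retried longer than the window.

    Hypothetical helper for illustration; whether the exact boundary counts
    as expired is an assumption.
    """
    return now - buffered_at > retry_window


if __name__ == "__main__":
    start = datetime(2008, 5, 13, 9, 0, tzinfo=timezone.utc)
    print(is_expired(start, start + timedelta(hours=7)))  # False
    print(is_expired(start, start + timedelta(hours=9)))  # True
```

Raising the window in das_binary.xml corresponds to passing a larger `retry_window` here: batches stay eligible for retry longer before they are moved to 'expired'.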

After a database problem is resolved, some buffered events may be too old to be worth inserting. Likewise, if a particular buffered event is known to be corrupt and will never insert successfully, it may be worth removing it from the buffer entirely; this lets other valid events proceed without waiting for the retry period of the broken event to elapse. To do this, stop DAS Binary, then delete or move the relevant files out of the buffer directory. Conversely, events that were moved to the 'expired' directory but are still wanted can be moved back into the normal event buffer directory, again with DAS Binary stopped, so that they are reprocessed. After rearranging the files, start DAS Binary and event insertion should resume normally.
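Moving a known-bad file aside, rather than deleting it, keeps it available for later analysis. A minimal sketch, assuming DAS Binary is already stopped and that you know the file's path relative to the buffer directory; `quarantine` and the quarantine directory are hypothetical names, not part of Sentinel.

```python
import shutil
from pathlib import Path


def quarantine(buffer_dir: Path, relative_file: str, quarantine_dir: Path) -> Path:
    """Move one buffered batch file out of the buffer, keeping its relative path.

    Hypothetical helper; run it only while DAS Binary is stopped.
    """
    source = buffer_dir / relative_file
    if not source.is_file():
        raise FileNotFoundError(source)
    target = quarantine_dir / relative_file
    # Recreate the day/hour/minute directories under the quarantine root.
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(source), str(target))
    return target
```

For example, pass `$ESEC_HOME/data/events/insertErrorBuffer/data` as `buffer_dir` and the day/hour/minute path of the offending file as `relative_file`. Keeping the relative path means the file can later be moved back to exactly where it came from.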

Note that each file in the buffer directory holds multiple events. Currently, if one batch contains a single problem event, every event in that batch fails until the problem is corrected. When the retry period eventually elapses, the batch is no longer processed, so more than just the one bad event is expired. Future versions of Sentinel will try to process failed batch inserts event by event, so that as many events as possible are inserted by default.
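The per-event fallback that future versions are expected to use can be illustrated with a small sketch. This is a hypothetical example of the general technique, not Sentinel's implementation: `insert_batch` and `insert_one` stand in for whatever database calls are really used, and the fake functions simulate an all-or-nothing batch insert that rejects any event whose `severity` is missing.

```python
from typing import Callable, List, Sequence, Tuple

Event = dict  # hypothetical event representation


def insert_with_fallback(
    batch: Sequence[Event],
    insert_batch: Callable[[Sequence[Event]], None],
    insert_one: Callable[[Event], None],
) -> Tuple[int, List[Event]]:
    """Insert a batch; if the batch fails, retry each event individually.

    Returns the number of events inserted and the events that still failed.
    """
    try:
        insert_batch(batch)
        return len(batch), []
    except Exception:
        pass  # fall back to one-at-a-time inserts

    inserted, failed = 0, []
    for event in batch:
        try:
            insert_one(event)
            inserted += 1
        except Exception:
            failed.append(event)
    return inserted, failed


# Fake database used only for this demonstration.
stored: List[Event] = []


def fake_insert_one(event: Event) -> None:
    if event.get("severity") is None:
        raise ValueError("invalid value")
    stored.append(event)


def fake_insert_batch(events: Sequence[Event]) -> None:
    # All-or-nothing: one bad event rejects the whole batch.
    for event in events:
        if event.get("severity") is None:
            raise ValueError("invalid value")
    stored.extend(events)


if __name__ == "__main__":
    batch = [{"severity": 1}, {"severity": None}, {"severity": 3}]
    print(insert_with_fallback(batch, fake_insert_batch, fake_insert_one))
```

With the current all-or-nothing behavior, all three events in this batch would be held back; with the fallback, two are inserted and only the bad one remains.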

If event insertion is failing, check the das_binary*.* log files as well as any error messages specific to your database. For example, if Oracle repeatedly reports an invalid value being inserted, that error can be just as useful for resolving the issue as the corresponding error in the das_binary*.* files. The DBA should be able to provide a list of errors that occurred during operation.
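When the same error repeats many times, counting identical messages makes it stand out. The rough sketch below scans das_binary*.* files and tallies lines that contain error keywords. The log directory, the keywords, and the assumption that any timestamp appears before the keyword (so cutting the line at the keyword removes it) are all assumptions to adjust for your installation.

```python
import glob
import os
from collections import Counter
from typing import Iterable, List, Tuple


def repeated_errors(log_dir: str,
                    keywords: Iterable[str] = ("ERROR", "Exception"),
                    top: int = 10) -> List[Tuple[str, int]]:
    """Tally error messages across das_binary*.* files, most frequent first.

    Each matching line is cut at the earliest keyword so that a leading
    timestamp (assumed) does not make otherwise identical messages differ.
    """
    keywords = tuple(keywords)
    counts: Counter = Counter()
    for path in sorted(glob.glob(os.path.join(log_dir, "das_binary*.*"))):
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line in handle:
                positions = [line.find(k) for k in keywords if k in line]
                if positions:
                    counts[line[min(positions):].strip()] += 1
    return counts.most_common(top)


if __name__ == "__main__":
    # Assumption: logs live under $ESEC_HOME/log; adjust for your installation.
    log_dir = os.path.join(os.environ.get("ESEC_HOME", "."), "log")
    for message, count in repeated_errors(log_dir):
        print(f"{count:6d}  {message}")
```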

Events that expired during regular database insertion are stored in $ESEC_HOME/data/events/insertErrorBuffer/expired. To reprocess them, stop das_binary, then either rename 'expired' to 'data' or copy the contents of 'expired' into a sibling directory named 'data', and restart das_binary. das_binary can be stopped and restarted from the Sentinel Control Center (SCC), in the Servers section of the Admin tab.
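Both options can be scripted. A minimal sketch, assuming das_binary has already been stopped from the SCC; `requeue_expired` is a hypothetical helper, and it renames 'expired' to 'data' when no 'data' directory exists yet, otherwise it copies the contents across.

```python
import shutil
from pathlib import Path


def requeue_expired(buffer_root: Path) -> None:
    """Put expired events back into the 'data' directory for reprocessing.

    Mirrors the two options in the text: rename 'expired' to 'data' when no
    'data' directory exists, otherwise copy the contents of 'expired' into it.
    Run only while das_binary is stopped.
    """
    expired = buffer_root / "expired"
    data = buffer_root / "data"
    if not expired.is_dir():
        return  # nothing to requeue
    if not data.exists():
        expired.rename(data)
    else:
        # dirs_exist_ok requires Python 3.8+; files with the same relative
        # path in 'data' are overwritten by the copy from 'expired'.
        shutil.copytree(expired, data, dirs_exist_ok=True)
```

For example, call `requeue_expired(Path(os.environ["ESEC_HOME"]) / "data/events/insertErrorBuffer")` and then restart das_binary. Note that the copy branch leaves the original files in 'expired'.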

Additional Information

The contents of the 'expired' and 'data' directories are organized by time. For example, $ESEC_HOME/data/events/insertErrorBuffer/expired may contain directories named 14012, 14013, and 14014. Each of these may in turn contain directories numbered 0 through 23, and each of those contains directories numbered 0 through 59.
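Given a point in time, the matching bucket directory can be computed directly. This sketch assumes that the buckets are derived in UTC from Unix time and that directory names are not zero-padded (as the example names suggest); `bucket_path` is a hypothetical helper.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

SECONDS_PER_DAY = 86_400


def bucket_path(moment: datetime) -> PurePosixPath:
    """Return the day/hour/minute bucket path for a timezone-aware datetime.

    Assumptions: buckets are computed in UTC from Unix time, and the
    directory names are not zero-padded.
    """
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    seconds = int(moment.timestamp())
    day, remainder = divmod(seconds, SECONDS_PER_DAY)
    hour, remainder = divmod(remainder, 3_600)
    minute = remainder // 60
    return PurePosixPath(str(day), str(hour), str(minute))
```

For example, `bucket_path(datetime(2008, 5, 13, 9, 30, tzinfo=timezone.utc))` gives `14012/9/30`, one of the example day directories above.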

The first level is the day number since 1970-01-01 (the Unix epoch). It can be calculated by taking the current time in Unix time (seconds since 1970-01-01), dividing by 86,400 (the number of seconds in a day), and discarding the remainder. The second level is the hour of the day, and the third level is the minute of the hour. Each minute directory usually contains one or more .zip files. These .zip files do not need to be renamed to .dat or extracted for the system to reprocess the events they contain; the system created them and will extract them itself during processing.
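Going the other way, each bucket's day/hour/minute names can be turned back into a date and time, which helps when deciding which expired buckets are recent enough to reprocess. A sketch assuming UTC buckets; `bucket_start` and `list_buckets` are hypothetical helpers.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import List, Tuple

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bucket_start(day: int, hour: int, minute: int) -> datetime:
    """Invert the layout: epoch + days + hours + minutes (UTC assumed)."""
    return EPOCH + timedelta(days=day, hours=hour, minutes=minute)


def list_buckets(root: Path) -> List[Tuple[datetime, Path]]:
    """Return (start time, directory) for every day/hour/minute bucket, oldest first."""
    buckets = []
    for minute_dir in root.glob("*/*/*"):
        if not minute_dir.is_dir():
            continue
        try:
            day = int(minute_dir.parent.parent.name)
            hour = int(minute_dir.parent.name)
            minute = int(minute_dir.name)
        except ValueError:
            continue  # ignore anything that is not a numeric bucket directory
        buckets.append((bucket_start(day, hour, minute), minute_dir))
    return sorted(buckets, key=lambda item: item[0])
```

Pointing `list_buckets` at `$ESEC_HOME/data/events/insertErrorBuffer/expired` lists each bucket with its start time; for instance, a bucket at `14012/9/30` starts at 2008-05-13 09:30 UTC.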