NNMi Health shows StatePoller Minor Warning

  • KM03687760
  • 31-Jul-2020
  • 30-Apr-2021

Summary

2020-06-23 01:57:45,447 WARNING [com.hp.ov.nms.health.log] (pool-1-thread-40) NNMi System Health Report Hostname: hpnnm.hmc.org.qa Date: 2020-06-23 01:57:44.617 Overall Status: Minor History: [2020-06-22 21:47:44.582] 'StatePoller' has changed status from 'Normal' to 'Minor'

Error

2020-06-23 WARNING [com.hp.ov.nms.health.log] (pool-1-thread-40) NNMi System Health Report

  Hostname: 
  Date: 2020-06-23
  Overall Status: Minor

  History:
    [2020-06-22] 'StatePoller' has changed status from 'Normal' to 'Minor'

StatePoller

[Minor] Stale collections (2) has status Minor because there were between 0 and 5 stale collections in the last 5 minutes

Cause

This most commonly occurs when a node responds slowly to NNMi polling requests.

Fix

One solution is to increase the SNMP timeout for this node. Identify the typical SNMP response time from the node and configure twice that value as the SNMP timeout. Using twice the typical value accommodates some variability without triggering retransmissions.
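For example, the typical response time can be estimated with a few timed SNMP requests run from the NNMi management server. A minimal sketch using the Net-SNMP snmpget tool, in which the community string and node name are placeholders:

  # Time several SNMP requests to gauge the node's typical response time
  # (replace 'public' and 'node.example.com' with your own values;
  # 1.3.6.1.2.1.1.3.0 is sysUpTime.0)
  for i in 1 2 3 4 5; do
      time snmpget -v2c -c public node.example.com 1.3.6.1.2.1.1.3.0
  done

If the typical response is around 2 seconds, an SNMP timeout of roughly 4 seconds would follow from the rule above.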

This can be done, for example, through the NNMi console under Configuration > Communication Configuration > Specific Node Settings.

This will have multiple beneficial effects:

  • Retransmissions will be reduced, so a given request will complete more reliably and quickly.
  • The total number of interfaces to be polled will be divided into multiple collections, helping ensure each collection completes within the polling interval, and well before the stale collection timeout.
  • The calculation of the number of interfaces to include in a given SNMP collection considers the SNMP timeout, the polling interval, the number of objects (e.g. interfaces) to be polled, and the number of varbinds per object. Increasing the SNMP timeout therefore decreases the number of objects in a collection, other parameters being equal (illustrated below).
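As a rough illustration only (NNMi's actual sizing algorithm is internal and weighs more factors), a worst-case time budget shows why a larger timeout shrinks each collection:

  # Illustrative arithmetic only; not NNMi's actual algorithm
  INTERVAL=300   # polling interval: 5 minutes, in seconds
  TIMEOUT=10     # hypothetical SNMP timeout, in seconds
  RETRIES=2      # hypothetical SNMP retry count
  # Worst-case time per request, and how many such requests fit in one interval
  WORST=$(( TIMEOUT * (RETRIES + 1) ))
  echo "At most $(( INTERVAL / WORST )) worst-case requests fit in a ${INTERVAL}s interval"

Doubling the timeout halves the number of worst-case requests that fit in the interval, so each collection must be built from fewer objects.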
 
The polling interval also affects the number of objects included in a collection. If the polling interval for this node is in the range of 5 to 10 minutes, adjusting the timeout as described above should resolve the stale collection problem. A larger polling interval, however, tends to increase the number of objects in the collection to the point where it cannot complete within the stale collection timeout. In that case, resetting the polling interval to a value in the 5 to 10 minute range and changing the timeout as above should resolve the problem.

Regarding the stale collection warning itself: if the device is simply both large and slow to respond, the two properties listed in the workaround below let you increase the warning threshold from 10 to 20 minutes so the collection has time to complete. However, if something else is going on, e.g. a thread has died and the collection has stopped, increasing the threshold will not help, since the collection is never going to complete.

To investigate the stale collection issue:

1) Enable a network trace for all traffic between NNMi and the device in question; you might need to enable logging to a file set (see the capture sketch after this list).

2) Increase the number and size of the trace log files, then enable StatePoller tracing; use single-node tracing to reduce the amount of data collected:

https://support.microfocus.com/kb/kmdoc.php?id=KM01451687
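For step 1, the SNMP traffic between the NNMi server and the device can be captured with tcpdump, writing to a rotating file set so a long capture does not fill the disk. A sketch in which the interface name, device address, file location, and rotation sizes are placeholders:

  # Capture SNMP traffic (UDP port 161) to/from the device, rotating
  # across ten 100 MB capture files
  tcpdump -i eth0 -w /tmp/nnmi-snmp.pcap -C 100 -W 10 host 192.0.2.10 and port 161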

When analysing the data, look for the point at which the polling policy (probably the LAN performance policy) starts. In the network trace you can see the requests being sent to the device and the device responding. If this exchange is still going on after 10 minutes (unless you have changed the properties to another value), StatePoller reports a stale collection. This being the case, the property value needs to be increased.

The value may need to be even bigger in your environment: if the device has thousands of interfaces, it may take a very long time to get through the poll. If this is the case, you may turn off the policy for the device if it is not really needed, or restrict the interfaces discovered on the device in order to reduce the size of the policy check.


A workaround is to increase the stale collection warning timeout by adding the following two properties to the properties file below. The values shown change the threshold from 10 minutes to 20 minutes (1200000 ms), which allowed the policy poll to complete:

/var/opt/OV/shared/nnm/conf/props/nms-apa.properties

nms.statepoller.collector.staleCollectionTimeout 1200000

nms.statepoller.collector.staleBulkCollectionTimeout 1200000
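A minimal sketch of applying the workaround on Linux, assuming the default installation path and that an NNMi restart is required for the new values to take effect (confirm the restart requirement for your NNMi version):

  # Append the stale collection properties (values in milliseconds; 1200000 = 20 minutes)
  printf '%s\n' \
    'nms.statepoller.collector.staleCollectionTimeout 1200000' \
    'nms.statepoller.collector.staleBulkCollectionTimeout 1200000' \
    >> /var/opt/OV/shared/nnm/conf/props/nms-apa.properties

  # Restart NNMi so the new values are read (assumption: a restart is needed)
  ovstop -c
  ovstart -c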