How the SI-CPUBottleneckDiagnosis/Sys_CPUBottleneckDiagnosis policy works

  • KM03024200
  • 27-Nov-2017
  • 24-Jul-2018

Summary

Explanation of the policy logic

Reference

“SI-CPUBottleneckDiagnosis”

 

This article explains the “SI-CPUBottleneckDiagnosis” policy.

The policy monitors for a bottleneck condition, not just high utilization. The two terms are often mixed up, but utilization can be high without the system reaching a bottleneck condition.

Here is the calculation:

'TotalCpuUtil' > Threshold AND
'MoreReadyProcs' > 0

MoreReadyProcs is calculated internally as follows:

Data source   Platform          MoreReadyProcs
SCOPE         HPUX              GBL_CPU_QUEUE - GBL_ACTIVE_CPU
SCOPE         Other platforms   GBL_LOADAVG - GBL_ACTIVE_CPU
Not SCOPE     Any               GBL_RUN_QUEUE - GBL_ACTIVE_CPU
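
The selection rules above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the policy's own code: the function names, the `data_source` and `platform` arguments, and the sample metric values are all hypothetical, and the metrics are assumed to be supplied as a plain dictionary.

```python
def more_ready_procs(metrics, data_source="SCOPE", platform="Linux"):
    """Pick the queue metric by data source and platform, then subtract active CPUs."""
    if data_source == "SCOPE":
        if platform == "HPUX":
            queue = metrics["GBL_CPU_QUEUE"]
        else:
            queue = metrics["GBL_LOADAVG"]
    else:
        queue = metrics["GBL_RUN_QUEUE"]
    return queue - metrics["GBL_ACTIVE_CPU"]


def is_bottleneck(total_cpu_util, threshold, ready):
    """Both conditions must hold; the policy's rule code compares utilization with >=."""
    return total_cpu_util >= threshold and ready > 0


# Hypothetical sample values, for illustration only.
sample = {"GBL_LOADAVG": 6, "GBL_CPU_QUEUE": 3, "GBL_RUN_QUEUE": 5, "GBL_ACTIVE_CPU": 4}
ready = more_ready_procs(sample)          # 6 - 4 = 2
print(ready, is_bottleneck(95, 90, ready))
```

With these sample values, a load average of 6 on 4 active CPUs leaves 2 runnable processes without a CPU, so 95% utilization against a threshold of 90 counts as a bottleneck.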

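If you want alerts based on utilization even when no processes are waiting for a CPU, you can tweak the current policy logic by replacing the AND condition with an OR condition. The rule then fires when either condition is true, so high utilization alone is enough to raise an alert.
At that point, however, the policy no longer really performs a bottleneck diagnosis.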

The change is shown below for one of the rules, named "CPU bottleneck Major Threshold Rule", where the && operator has been replaced with ||:


if ( ( $Session->Value('TotalCpuUtil') >= $Session->Value('GlobalCpuUtilMajorThreshold') ) ||
( $Session->Value('MoreReadyProcs') > 0 ) )
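
The effect of swapping the operator can be checked with a small truth table. The sketch below is a hypothetical illustration in Python, not policy code; the `alerts` function, the threshold of 90 and the sample values are assumptions chosen for the example.

```python
def alerts(total_cpu_util, threshold, more_ready_procs, use_or=False):
    """Evaluate the rule condition with either AND (original) or OR (modified)."""
    high_util = total_cpu_util >= threshold
    queued = more_ready_procs > 0
    return (high_util or queued) if use_or else (high_util and queued)


# (TotalCpuUtil, MoreReadyProcs) pairs, hypothetical values.
samples = [(95, 0), (40, 3), (95, 3), (40, 0)]
for util, ready in samples:
    print(util, ready, alerts(util, 90, ready), alerts(util, 90, ready, use_or=True))
# 95 0 False True
# 40 3 False True
# 95 3 True True
# 40 0 False False
```

Note the second row: with OR, the rule also fires when utilization is low but MoreReadyProcs is positive, so the modified rule is broader than a pure utilization check.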

You can also take the ovcodautil -dumpcoda output and do the math yourself.


So the alert does not depend only on the threshold you have set. The policy also calculates an additional parameter, MoreReadyProcs. Per the logic, the bottleneck condition is reached only when CPU utilization is high and MoreReadyProcs is greater than 0. The table above shows how that value is calculated.

 

How to test this policy:

 

The policy SI-CPUBottleneckDiagnosis uses the following logic to determine whether an alert should be sent:

if ( ( $Session->Value('TotalCpuUtil') >= $Session->Value('GlobalCpuUtilCriticalThreshold') ) &&
  ( $Session->Value('MoreReadyProcs') > 0 ) )
{
 DisplayTopHoggingProcess();
 $Rule->Status(1);
}

The session variable MoreReadyProcs is calculated as:


  $LoadAvgSrc = $Policy->SourceEx("CODA\\\\SCOPE\\\\GLOBAL\\\\GBL_LOADAVG");
  if ($LoadAvgSrc && $LoadAvgSrc->DataAvailable())
  {
   $Session->Value("LoadAvg", floor($LoadAvgSrc->Value()));
  }
  else
  {
   $tracer->SendLogMessage ( "SourceEx API failed for metric: CODA\\\\SCOPE\\\\GLOBAL\\\\GBL_LOADAVG", "warning");
  }
  $MoreReadyProcs = $Session->Value('LoadAvg') - $Session->Value('NumCPUs');
  $Session->Value("MoreReadyProcs", $MoreReadyProcs);

The intention here is to alert when the CPU utilization is above a certain threshold AND there are threads in the run queue waiting for CPU shares.

The number of threads waiting for CPU shares is calculated as MoreReadyProcs = LoadAvg - NumCPUs. This should be OK on Unix, because GBL_LOADAVG represents the 1-minute load average, that is, the average number of runnable processes/threads. That count includes both the processes/threads that are currently running on a CPU and those that are waiting for CPU shares.

However, per the OA documentation, on Windows servers, GBL_LOADAVG counts the average number of threads that are in "Ready" state during the collection interval. As can be seen at https://msdn.microsoft.com/en-us/library/system.diagnostics.threadstate%28v=vs.110%29.aspx, the "Ready" state does not include threads that are currently "Running" on a CPU. Hence, for Windows servers, the number of CPUs (NumCPUs) should not be subtracted from LoadAvg to calculate MoreReadyProcs.

As a consequence, the sensitivity of policy SI-CPUBottleneckDiagnosis is very low on Windows: MoreReadyProcs becomes positive only when the number of "Ready" threads exceeds the number of CPUs, so alerts are sent only upon a severe bottleneck.
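
The difference in sensitivity can be illustrated with a short Python sketch. It mirrors the policy's arithmetic (floor of the load average minus the CPU count) and compares it with a hypothetical Windows-adjusted variant that does not subtract NumCPUs. The function names, the 8-CPU system and the sample Ready-thread averages are assumptions for illustration; the adjusted variant is not part of the shipped policy.

```python
import math


def more_ready_procs_policy(load_avg, num_cpus):
    """Mirror of the policy calculation: floor(GBL_LOADAVG) - NumCPUs."""
    return math.floor(load_avg) - num_cpus


def more_ready_procs_windows_adjusted(load_avg):
    """Hypothetical variant: Ready threads already exclude running threads."""
    return math.floor(load_avg)


num_cpus = 8  # assumed CPU count
for ready_threads in (2.5, 8.0, 9.3):  # assumed GBL_LOADAVG values on Windows
    policy_value = more_ready_procs_policy(ready_threads, num_cpus)
    adjusted_value = more_ready_procs_windows_adjusted(ready_threads)
    print(ready_threads, policy_value > 0, adjusted_value > 0)
```

On this assumed 8-CPU system, the policy formula stays at or below 0 until the floored Ready-thread average reaches 9, whereas the adjusted variant is already positive with 2 waiting threads.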