Excessive Windows Performance Counter values reported on Virtual Servers. (NETIQKB73070)

  • 7773070
  • 30-Jun-2011
  • 12-Jul-2011

Environment

NetIQ AppManager 7.0.x

Situation

The NT_CPULoaded job reports high CPU usage during idle times on Servers hosted on Virtual Machines.
The NT_MemUtil job reports high Memory usage during idle times on Servers hosted on Virtual Machines.
AppManager jobs that monitor Performance Counters via Perfmon report excessive values for Servers hosted on Virtual Machines (VMs).

Resolution

Microsoft warns that monitoring critical metrics like CPU and Memory utilization via standard Perfmon Counters on Servers hosted on Virtual Machines (VMs) is highly suspect.  The problem is that the guest OS knows only how many CPU cycles it has been allotted by the Host server, and that may not be very much (if any) during idle times.

The following VMware article explains how CPU and Memory are handled during idle time, and the effect this has on most performance monitoring software (including AppManager):

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1077

In summary:  When the system is idle, the CPU cycles allotted to it by the VMware (ESX) Host are nearly 0.  As a result, almost anything you do for a few seconds will use up all of the allotted CPU on that server, and a few seconds is all it takes for our CPU and Memory monitoring jobs to iterate.  The result is that you may be given what seems like an excessive value for CPU or Memory utilization (at or near 100%).

By the time the Host recognizes that CPU and/or Memory activity has climbed, and that additional CPU and Memory allotments therefore need to be made to the target server, the CPU or Memory monitoring jobs have already finished their iteration.

If you log in to that server to determine why such a high value is being reported, the Host boosts the CPU allotment to that VM long before you can open Perfmon to see what is going on.  By the time you configure Perfmon to show the metrics in question, the values being returned are very low, because the amount of CPU being used is now a very small percentage of the newly increased allotment of CPU cycles from the Host.
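The effect described above can be illustrated with a small arithmetic sketch. This is a simplified model, not how Perfmon or ESX actually computes anything: the function name and every MHz figure below are hypothetical, chosen only to show how the same small amount of work looks like near-total utilization against an idle allotment and like almost nothing once the Host has raised the allotment.

```python
def utilization_percent(used_mhz: float, allotted_mhz: float) -> float:
    """Utilization as the guest sees it: usage relative to its current allotment.

    Simplified, hypothetical model; capped at 100% like a standard counter.
    """
    if allotted_mhz <= 0:
        raise ValueError("allotted_mhz must be positive")
    return min(100.0, 100.0 * used_mhz / allotted_mhz)


# Hypothetical figures, for illustration only.
job_work_mhz = 45.0             # CPU consumed by a short burst of activity
idle_allotment_mhz = 50.0       # near-zero allotment while the guest is idle
boosted_allotment_mhz = 2000.0  # allotment after the Host reacts to the activity

# Same absolute work, measured against two different allotments.
during_iteration = utilization_percent(job_work_mhz, idle_allotment_mhz)
after_boost = utilization_percent(job_work_mhz, boosted_allotment_mhz)

print(f"During the job iteration: {during_iteration:.1f}%")
print(f"After the Host boosts the allotment: {after_boost:.2f}%")
```

The absolute work is identical in both calls; only the denominator changes, which is why the value seen by the monitoring job and the value seen later in Perfmon can differ so sharply.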

To work around this issue and obtain more accurate counter values for those core metrics, the NT_CPULoaded and NT_MemUtil jobs in the most current Windows OS Module for AppManager include a new option, "Use virtual machine performance counters if available".  If this option is enabled in the Values tab of those jobs, the job ignores the standard Perfmon CPU and Memory counter values and instead uses the VMware counters that are added to the server's OS when VMware Tools is installed on the server.

Using the VMware counters yields more accurate information about CPU usage. The caveat is that the VMware counters can return more than 100% for CPU utilization. The following explains how this can happen:

VMware reports CPU utilization as an overall component of MHz of capacity in use. This becomes an issue when you have a VM with multiple vCPUs, each with X MHz of capacity. VMware implements its counters so that the "base" (the 100% representation) is the capacity of a single pCPU (for example, 2393 MHz). So when a VM is busy on multiple vCPUs (each with 2393 MHz), the VM's vCPU utilization can be greater than 100%, because it reflects the total capacity used across all of those vCPUs.  The VMware counters don't factor in things like "X CPU was running at Y, and Z CPU was running at N". It's just a total value, and because there are multiple vCPUs, the total value can exceed the base.
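As a rough illustration of that explanation, the sketch below sums the MHz in use across all vCPUs and divides by a single pCPU's capacity. It is an assumption-based model of the description above, not VMware's actual counter implementation; the function name and the per-vCPU usage figures are hypothetical, and only the 2393 MHz base comes from the example in the text.

```python
from typing import Sequence


def vmware_style_cpu_percent(vcpu_used_mhz: Sequence[float], pcpu_base_mhz: float) -> float:
    """Total MHz in use across all vCPUs, relative to ONE pCPU's capacity.

    Hypothetical model of the behavior described in the text: the result is
    not capped, so it can exceed 100% when several vCPUs are busy.
    """
    if pcpu_base_mhz <= 0:
        raise ValueError("pcpu_base_mhz must be positive")
    return 100.0 * sum(vcpu_used_mhz) / pcpu_base_mhz


PCPU_BASE_MHZ = 2393.0  # the example base capacity from the text

# One vCPU fully busy: exactly the base, so 100%.
one_vcpu_percent = vmware_style_cpu_percent([2393.0], PCPU_BASE_MHZ)

# Two vCPUs, each only partly busy (hypothetical usage figures):
# the total in use is 3500 MHz, which is more than one pCPU's capacity.
two_vcpu_percent = vmware_style_cpu_percent([2000.0, 1500.0], PCPU_BASE_MHZ)

print(f"One vCPU:  {one_vcpu_percent:.1f}%")
print(f"Two vCPUs: {two_vcpu_percent:.1f}%")
```

Under this model, neither vCPU in the second case is saturated, yet the combined figure is above 100% because the denominator is a single pCPU rather than the VM's total vCPU capacity.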

Alternatively, NetIQ recommends that you use our NetIQ AppManager for VMware module to provide the most accurate CPU, Memory, and other core metric values for each Server hosted on a given ESX Host.  This module collects CPU, Memory, and other core metric usage information directly from the ESX Host server, which means that you get accurate, real-time data about the true core metric usage of each server hosted by that ESX server.

You can read more about this module here:

https://www.netiq.com/products/am/modules/vmware.asp

Feel free to contact your NetIQ Sales Team if you would like to trial this module or view a demo of it.

Cause

Standard OS Performance Counters in Perfmon (such as CPU or Memory utilization counters) do not reflect true usage values when collected from an OS that is hosted on a Virtual Machine.

Additional Information

Formerly known as NETIQKB73070