22

For a while now I've been trying to figure out why quite a few of our business-critical systems are getting reports of "slowness" ranging from mild to extreme. I've recently turned my eye to the VMware environment where all the servers in question are hosted.

I recently downloaded and installed the trial of the Veeam VMware management pack for SCOM 2012, but I'm having a hard time believing (and so is my boss) the numbers it is reporting. To convince my boss that the numbers are real, I started looking at the VMware client itself to verify the results.

I've looked at this VMware KB article, specifically for the definition of Co-Stop, which is:

Amount of time a MP virtual machine was ready to run, but incurred delay due to co-vCPU scheduling contention

Which I am translating to:

The guest OS needs time from the host but has to wait for resources to become available and therefore can be considered "unresponsive"

Does this translation seem correct?

If so, here is where I have a hard time believing what I am seeing: the host that contains the majority of the "slow" VMs is currently showing a CPU Co-Stop average of 127,835.94 milliseconds!

Does this mean that on average the VMs on this host have to wait 2+ minutes for CPU time???

This host has two 4-core CPUs, and it runs one 8-vCPU guest and fourteen 4-vCPU guests.

warren
Chuck Herrington
  • From my understanding: to avoid some problems, all virtual CPUs of a VM are scheduled to run at the same time. If there is contention, some VMs can run really slowly. Note that assigning more vCPUs to VMs to try to improve performance when this is the problem will make things worse. – Brian Feb 20 '15 at 14:23
  • This host does have two 4 core CPU's on it and it has 1x8 CPU guest and 14x4 CPU guests. – Chuck Herrington Feb 20 '15 at 14:39
  • Why do so many of the guests have 4 vCPU configurations? – ewwhite Feb 20 '15 at 15:09
  • 6
    CPU co-scheduling contention is killing you. Need to reduce vCPU counts or move some VMs off that system. – Brian Feb 20 '15 at 15:15
  • @ChuckHerrington You should follow up or mark an answer. – ewwhite Mar 19 '15 at 12:29

4 Answers

45

You state in the comments that you have a dual quad-core ESXi host, and you're running one 8-vCPU VM and fourteen 4-vCPU VMs.

If this were my environment, I would consider it grossly over-provisioned. I would put at most four to six 4-vCPU guests on that hardware (and that's assuming the VMs in question have load that justifies such a high vCPU count).

I'm assuming you don't know the golden rule: with VMware, never assign a VM more cores than it needs. Why? VMware uses fairly strict co-scheduling, which makes it hard for a VM to get CPU time unless there are as many physical cores free as the VM has vCPUs. That means a 4-vCPU VM cannot perform one unit of work unless 4 physical cores are open at the same moment. In other words, it's architecturally better to have a 1-vCPU VM with 90% CPU load than a 2-vCPU VM with 45% load per core.
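To build intuition for why this is so punishing, here is a rough toy simulation of strict co-scheduling run against the OP's layout. This is my own illustrative sketch, not VMware's actual scheduler (which is far more sophisticated, and relaxed in modern versions):

```python
import random

def simulate(host_cores, vm_vcpus, ticks=1000, seed=1):
    """Toy model of strict co-scheduling: in each tick a VM makes
    progress only if ALL of its vCPUs can land on free cores at once."""
    random.seed(seed)
    runs = [0] * len(vm_vcpus)
    for _ in range(ticks):
        free = host_cores
        order = list(range(len(vm_vcpus)))
        random.shuffle(order)          # random arrival order per tick
        for i in order:
            if vm_vcpus[i] <= free:    # needs all its cores simultaneously
                free -= vm_vcpus[i]
                runs[i] += 1
    return [r / ticks for r in runs]

# 8 physical cores; one 8-vCPU guest plus fourteen 4-vCPU guests
shares = simulate(8, [8] + [4] * 14)
# shares[0] is the fraction of ticks the 8-vCPU VM got to run at all
```

Even in this crude model, the 8-vCPU guest only runs when all 8 cores happen to be free, and every 4-vCPU guest spends most of its time waiting for a 4-core gap: exactly the Co-Stop/Ready pattern described above.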

So... ALWAYS create VMs with the minimum number of vCPUs, and only add more when it's determined to be necessary.

For your situation, use Veeam to monitor CPU usage on your guests. Reduce vCPU count on as many as possible. I would be willing to bet that you could drop to 2vCPU on almost all your existing 4vCPU guests.

Granted, if all these VMs actually have the CPU load to require the vCPU count they have, then you simply need to buy additional hardware.

jlehtinen
  • 20
    This answer, I like it, another! (smashes coffee cup on ground) – MonkeyZeus Feb 20 '15 at 16:08
  • 2
    One thing to add: set up an alert for CPU % Ready. http://www.davidklee.net/articles/sql-server-articles/cpu-overcommitment-and-its-impact-on-sql-server-performance-on-vmware/ – Stewpudaso Feb 21 '15 at 01:11
  • 1
    Shouldn't that be under-provisioning? – user253751 Feb 21 '15 at 05:34
  • 3
    Is that VMWare idiocy still in place? Hyper-V had the same - in the initial version and it got handled out as soon as possible. Now cores are independently scheduled. I can not imagine this still being the case for VmWare in the current version. – TomTom Feb 21 '15 at 07:42
  • @MonkeyZeus - you should change your nick to MonkeyThor before smashing the coffee cup =D – warren Feb 23 '15 at 21:11
  • 2
    @TomTom: according to http://serverfault.com/a/642316/58957 "strict co-scheduling" was employed in versions prior to 3.x (more than 10 years ago!), yet the internet is still full of this. Still the recommendation to only increase the number of vCPUs as necessary is sound. – Nickolay Aug 10 '16 at 08:16
17

I can describe some of the experiences I've had in this area...

I don't believe that VMware does an adequate job of educating customers (or administrators) about best practices, nor do they update former best practices as their products evolve. This question is an example of how a core concept like vCPU allocation isn't fully understood. The best approach is to start small, with a single vCPU, until you determine that the VM requires more.

For the OP, the ESXi host server has two quad-core CPUs, yielding 8 physical cores.

The virtual machine layout being described is 15 total guests: 1 x 8 vCPU and 14 x 4 vCPU systems. That is severely overcommitted, especially with a single 8-vCPU guest in the mix. It makes no sense. If you need a VM that big, you likely need a bigger server.

Please try to right-size your virtual machines. I'm fairly certain most of them can live with 2 vCPUs. Adding virtual CPUs does not make things run faster, so if that was the remedy applied to a performance problem, it was the wrong approach.

In most environments, RAM is the most constrained resource. But CPU can be a problem if there's too much contention. You have evidence of this. RAM can also be an issue if too much is allocated to individual VMs.

It's possible to monitor this. The metric you're looking for is "CPU Ready %". You can access this from the vSphere client by selecting a VM and going to Performance > Overview > CPU Graph.

  • Under 5% CPU Ready - You're fine.
  • 5-10% CPU Ready - Keep a close look at activity.
  • Over 10% CPU Ready - Not good.
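If you want to sanity-check the chart numbers by hand: the real-time charts report CPU Ready as a millisecond summation over a 20-second sample, and the usual conversion to a percentage looks like this (the helper name `cpu_ready_percent` and the `vcpus` normalization are my own; verify the interval for the chart view you're using):

```python
def cpu_ready_percent(ready_ms, interval_s=20, vcpus=1):
    """Convert a CPU Ready summation (ms) from the vSphere charts into a
    percentage. Real-time charts sample every 20 s; dividing by the vCPU
    count gives a per-vCPU figure for multi-vCPU guests."""
    return ready_ms / (interval_s * 1000 * vcpus) * 100

# A 2,000 ms summation on a 20 s sample for a 1-vCPU VM:
cpu_ready_percent(2000)   # 10% -- already in the "not good" band above
```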

Note the yellow line (CPU Ready) in the vSphere CPU performance chart. [screenshot of the CPU graph omitted]

Would you mind checking this on your problem virtual machines and reporting back?

ewwhite
  • Just looked at the graph for an Exchange server we have on that overcommitted host. My graph looks like the inverse of yours: CPU usage hovers around 25%, and CPU Ready spikes as high as 200% but averages around 100%. – Chuck Herrington Feb 24 '15 at 15:11
  • @ChuckHerrington Please reduce the resources of the 8 vCPU virtual machine and measure again. – ewwhite Feb 24 '15 at 15:13
  • The only concern with that is the 8-CPU guest is one of our main production SQL Server database servers. We had tried reducing it to 4 before and things went ... awry. Guess we'd better try again. – Chuck Herrington Feb 24 '15 at 15:14
  • You can't have an 8 vCPU virtual machine on a server with 8 total cores. – ewwhite Feb 24 '15 at 15:16
  • @ewwhite unfortunately you can, you shouldn't, but you can. – Rqomey Mar 02 '15 at 11:12
2

The 127,835.94 milliseconds is a summation; you need to divide by the sample interval to get the correct %RDY values. It looks like you are already getting correct %RDY readings now, though. You can go quite high with the vCPU-to-physical-CPU ratio, but not the way you are doing it.

You have far too many quad-vCPU VMs, and even an 8-vCPU VM. There are already some quality answers discussing right-sizing and the ramifications of not consolidating cycles onto fewer vCPUs. The one thing I want to clarify: while it is no longer true that a VM must wait for a number of free physical CPUs equal to its vCPU count before any instruction can be processed, over-provisioning of this magnitude, with this ratio of multi-vCPU VMs to physical cores, is still very detrimental. 64 vCPUs on 8 cores is way beyond the maximum 4-to-1 ratio. I assume you have HT enabled on these processors, so you have 16 logical cores? That might be OK with 1- and 2-vCPU VMs under light load, but with heavy load on the VMs it would be hard to sustain.

FYI: the HT logical processors are not counted in the CPU %-used calculation. For example, on a server with 32 logical cores (16 physical) running at 2.4 GHz, you are at 100% usage when you hit 38.4 GHz (16 x 2.4 GHz). That is why you can see load averages above 1.0.
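Sketching that arithmetic out, plus the consolidation ratio mentioned above (helper names are mine; the second example uses the OP's 8-core host):

```python
def host_ghz_capacity(physical_cores, ghz_per_core):
    """Usable CPU capacity in GHz. HT logical cores do not add to the
    denominator that the %-used figure is computed against."""
    return physical_cores * ghz_per_core

def vcpu_ratio(total_vcpus, cores):
    """Consolidation ratio: allocated vCPUs per core."""
    return total_vcpus / cores

host_ghz_capacity(16, 2.4)   # 38.4 GHz = 100% on the 32-logical-core box
vcpu_ratio(64, 8)            # 8.0 : the OP's ratio on physical cores
vcpu_ratio(64, 16)           # 4.0 : even counting HT cores, at the limit
```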

Here is an ESXi host running a 3.5-to-1 vCPU-to-physical-CPU ratio (counting HT cores) with an average %RDY of 3%.

11:13:49pm up 125 days  7:20, 1322 worlds, 110 VMs, 110 vCPUs; CPU load average: 1.34, 1.43, 1.37


  %USED    %RUN    %SYS   %WAIT %VMWAIT    %RDY   %IDLE  %OVRLP   %CSTP  %MLMTD  %SWPWT 
  13.51   15.87    0.50  580.17    0.03    4.67   66.47    0.29    0.00    0.00    0.00 
  15.24   18.64    0.43  491.54    0.04    4.65   63.70    0.43    0.00    0.00    0.00 
  13.44   16.40    0.44  494.10    0.02    4.33   66.24    0.48    0.00    0.00    0.00 
  13.75   16.30    0.51  494.26    0.32    4.32   66.06    0.35    0.00    0.00    0.00 
  17.56   20.72    0.58  489.35    0.04    4.31   60.76    0.45    0.00    0.00    0.00 
  13.82   16.43    0.50  494.12    0.07    4.31   66.26    0.26    0.00    0.00    0.00 
  13.65   16.81    0.49  493.81    0.03    4.21   65.93    0.37    0.00    0.00    0.00 
  13.73   16.51    0.42  493.63    0.09    4.06   66.24    0.29    0.00    0.00    0.00 
  13.89   16.37    0.55  580.61    0.04    3.95   66.69    0.28    0.00    0.00    0.00 
  14.02   17.00    0.33  494.11    0.03    3.93   66.10    0.29    0.00    0.00    0.00 
  13.44   15.84    0.49  495.17    0.04    3.87   67.24    0.27    0.00    0.00    0.00 
  13.59   15.84    0.50  580.27    0.04    3.81   67.24    0.44    0.00    0.00    0.00 
  17.10   19.86    0.50  490.97    0.04    3.74   62.21    0.39    0.00    0.00    0.00 
  13.32   15.77    0.50  495.34    0.03    3.73   67.47    0.27    0.00    0.00    0.00 
  13.43   16.15    0.48  494.95    0.05    3.72   67.09    0.38    0.00    0.00    0.00 
  13.44   16.47    0.49  580.88    0.04    3.72   66.81    0.40    0.00    0.00    0.00 
  13.71   17.00    0.29  494.13    0.03    3.71   66.26    0.37    0.00    0.00    0.00 
  17.34   20.41    0.39  490.50    0.05    3.70   61.70    0.37    0.00    0.00    0.00 
  13.42   16.19    0.50  495.07    0.03    3.66   67.15    0.38    0.00    0.00    0.00 
  13.56   16.23    0.48  494.97    0.03    3.60   67.12    0.30    0.00    0.00    0.00 
  14.95   17.53    0.42  578.82    0.09    3.57   65.72    0.35    0.00    0.00    0.00 
  13.44   16.07    0.56  581.14    0.04    3.54   67.34    0.40    0.00    0.00    0.00 
  17.19   21.27    0.37  575.41    0.04    3.44   61.08    0.51    0.00    0.00    0.00 
  13.57   16.99    0.30  580.64    0.01    3.37   66.69    0.38    0.00    0.00    0.00 
  13.79   16.25    0.43  495.25    0.04    3.35   67.39    0.39    0.00    0.00    0.00 
  11.90   14.67    0.30  496.86    0.02    3.31   69.00    0.36    0.00    0.00    0.00 
  17.13   19.28    0.56  491.83    0.03    3.30   63.26    0.48    0.00    0.00    0.00 
  14.01   16.17    0.50  495.56    0.01    3.30   67.66    0.39    0.00    0.00    0.00 
  16.86   20.16    0.57  491.19    0.05    3.20   62.44    0.43    0.00    0.00    0.00 
  14.94   17.46    0.42  580.05    0.08    3.16   66.24    0.40    0.00    0.00    0.00 
  14.56   16.94    0.36  494.86    0.08    3.14   66.91    0.42    0.00    0.00    0.00

......
ewwhite
mhughesnp
1

We've since installed Veeam ONE, which has shed quite a bit of light on where our performance issues are. By looking at the CPU Bottlenecks screen in Veeam ONE and then using "Troubleshooting a virtual machine that has stopped responding: VMM and Guest CPU usage comparison" as a reference, we've figured out where a lot of our "unacceptable" contention is.

One little tip I wanted to share specifically: in one case I could not eliminate CPU contention until I removed the snapshot that was on the VM. Hope this helps someone.

Chuck Herrington