VMware lockup CPU spike

Question

After a CPU usage spike, a VMware ESXi 5.5 host became unresponsive on every front: DRAC, network, and cluster membership.

The host is a Dell PowerEdge M820 blade module in a Dell M1000e chassis with 4 x Xeon E5-4620s, 128 GB RAM, and local SSDs in RAID 6.

All VMs are Server 2008 R2. There is one SQL server that uses the SSD RAID for data. Otherwise the VMs are stored on a QNAP with a 10 Gbit link.

The resources are not overcommitted.

No hardware failures have ever been logged or indicated on the blade module or the QNAP.

The server had to be cold rebooted from the M1000e DRAC in order to become functional again.

This appears to be a VMware failure of some sort that hard-locked the hardware; however, the logs from before the lockup are missing for the 3 months prior to the reboot.

Since the restart, neither VMware nor the server hardware has reported or indicated any issues.

Has anyone else experienced anything like this? Any ideas, thoughts, suggestions?

Can you tell us which network driver the Windows VMs are using? e1000? e1000e? vmxnet3? — ewwhite, Mar 07 '14 at 19:39
@mfinni [Yep](http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2059053). — ewwhite, Mar 07 '14 at 19:48

ewwhite · Answer 1 · 2014-03-30T19:26:37.433

5

This is likely a problem with your Windows VM(s). Can you tell us which network driver(s) the Windows VMs are using? Intel e1000? Intel e1000e? VMware vmxnet3?

If they're not using the VMware vmxnet3 adapter, you're running into an awful bug that manifests itself in host crashes (PSODs). See the corresponding VMware Knowledge Base article #2059053.

Here's a trace of a crash on an ESXi 5.5 host following heavy network activity between a Windows Server 2008 R2 and a Windows Server 2012 virtual machine.

The fix is to migrate to the vmxnet3 driver. This bites many people because e1000/e1000e are the defaults when creating Windows virtual machines.

Note the "e1000" references in the trace...
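One way to check which adapter type each VM is configured with is to read the VM's `.vmx` file, where the adapter type is normally recorded as `ethernetN.virtualDev`. The following is a minimal sketch under that assumption, not a VMware tool: the function names and the sample `.vmx` content are hypothetical, and an adapter with no `virtualDev` line is only reported as "unspecified" because its default depends on the guest OS type.

```python
import re

# Matches lines such as: ethernet0.present = "TRUE"
_PRESENT = re.compile(r'^\s*(ethernet\d+)\.present\s*=\s*"([^"]*)"', re.IGNORECASE)
# Matches lines such as: ethernet0.virtualDev = "e1000"
_VIRTUAL_DEV = re.compile(r'^\s*(ethernet\d+)\.virtualDev\s*=\s*"([^"]*)"', re.IGNORECASE)


def adapter_types(vmx_text):
    """Return {adapter_name: device_type} for every NIC marked present."""
    present = {}
    devices = {}
    for line in vmx_text.splitlines():
        match = _PRESENT.match(line)
        if match:
            present[match.group(1).lower()] = match.group(2).upper() == "TRUE"
            continue
        match = _VIRTUAL_DEV.match(line)
        if match:
            devices[match.group(1).lower()] = match.group(2).lower()
    # A present adapter without a virtualDev line is reported as "unspecified".
    return {
        name: devices.get(name, "unspecified")
        for name, is_present in present.items()
        if is_present
    }


def non_vmxnet3(vmx_text):
    """Return only the present adapters that are not configured as vmxnet3."""
    return {name: dev for name, dev in adapter_types(vmx_text).items() if dev != "vmxnet3"}


# Hypothetical .vmx excerpt, used purely for illustration.
SAMPLE_VMX = '''
ethernet0.present = "TRUE"
ethernet0.virtualDev = "e1000"
ethernet1.present = "TRUE"
ethernet1.virtualDev = "vmxnet3"
ethernet2.present = "FALSE"
ethernet2.virtualDev = "e1000e"
'''

if __name__ == "__main__":
    print(non_vmxnet3(SAMPLE_VMX))
```

Any adapter this reports is a candidate for migration to vmxnet3. Adapters marked as not present are ignored, since they are not attached to the VM.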

edited Mar 30 '14 at 19:26

answered Mar 07 '14 at 19:45

ewwhite


Thanks for the trace log. Every VM is using VMXNET3. We stayed away from the E1000 due to its reputation. – Steven Walker Mar 12 '14 at 14:15
@StevenWalker Well, when this happens, you need to try to get the crash dump. What was on the screen of the server before you rebooted it? – ewwhite Mar 12 '14 at 14:17
We could not view it; that was the first thing we attempted. The M820 and its DRAC were completely unresponsive. – Steven Walker Mar 12 '14 at 14:22

Call Dell. Your IPMI or out-of-band management shouldn't just lock up on you. This isn't *Supermicro* gear ;) – ewwhite Mar 12 '14 at 14:24

mfinni · Answer 2 · 2014-03-07T20:25:39.077

In your position, I would open a ticket with Dell and run all the diagnostics. They will probably direct you to upgrade all of the firmware to the latest version, if you haven't already. This is generally a good idea.

I would also open a ticket with VMware for the same issue.

You may have run into an OS bug or hardware failure. Alternatively, you could simply flag this system as "possible problem" and wait to see if it ever happens again.

/Edit - or you could listen to Ed, and/or check the VMware KB.
