Yesterday our entire infrastructure crashed because all our ESXi hosts thought it would be an amazing idea to run updates at the same time. Edit: Or at least that's what we think happened, but nobody is really sure.

Normally we never update the ESXi hosts unless we have issues with them or are informed of something that must be fixed.

Some information:

3x IBM x3650 M4 (7915D3G) configured in HA master/slave, ESXi version 5.5.0 (build 1331820), IMM v. 3.73

We're pretty baffled by the situation. Our support team gave the above as the cause of the error and attached log files containing lines such as these (the file is pretty huge, so I'll stick to this critical part):

2014-11-04T10:58:48.364Z [488A1B70 verbose 'VpxaHalCnxHostagent' opID=WFU-e04c5e84] [WaitForUpdatesDone] Starting next WaitForUpdates() call to hostd
2014-11-04T10:58:48.364Z [488A1B70 verbose 'VpxaHalCnxHostagent' opID=WFU-e04c5e84] [WaitForUpdatesDone] Completed callback
2014-11-04T10:58:48.406Z [488A1B70 verbose 'VpxaHalCnxHostagent' opID=WFU-e4a7ca00] [WaitForUpdatesDone] Received callback
2014-11-04T10:58:48.406Z [488A1B70 verbose 'VpxaHalCnxHostagent' opID=WFU-e4a7ca00] [VpxaHalCnxHostagent::ProcessUpdate] Applying updates from 3526 to 3527 (at 3526)
2014-11-04T10:58:48.406Z [488A1B70 verbose 'hostdvm' opID=WFU-e4a7ca00] [VpxaHalVmHostagent] 26: Config changed 'config.extraConfig["vmware.tools.internalversion"].value'
2014-11-04T10:58:48.407Z [488A1B70 verbose 'hostdvm' opID=WFU-e4a7ca00] [VpxaHalVmHostagent] 26: Config changed 'config.tools.toolsVersion'
2014-11-04T10:58:48.407Z [488A1B70 verbose 'hostdvm' opID=WFU-e4a7ca00] [VpxaHalVmHostagent] 26: Runtime changed 'guest.toolsVersion'
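
For anyone sifting through a log this large, a minimal Python sketch for pulling out just the property-change lines may help. It assumes every line follows the layout of the excerpt above (timestamp, bracketed thread/level/component/opID, then the message); the function name and both regular expressions are illustrative, not part of any VMware tooling. Read at face value, the lines it finds here record a VMware Tools version change reported for VM 26, which is an interpretation and not something the log states outright.

```python
import re

# Layout assumed from the excerpt:
# <timestamp> [<thread> <level> '<component>' opID=<id>] <message>
LINE_RE = re.compile(
    r"^(?P<ts>\S+) \[(?P<thread>\S+) (?P<level>\w+) '(?P<component>[^']+)'"
    r"(?: opID=(?P<opid>[^\]]+))?\] (?P<message>.*)$"
)
# "<vm id>: Config changed '<property>'" or "<vm id>: Runtime changed '<property>'"
CHANGE_RE = re.compile(
    r"(?P<vm>\d+): (?P<kind>Config|Runtime) changed '(?P<prop>[^']+)'"
)


def changed_properties(lines):
    """Yield (timestamp, vm_id, kind, property) for each property-change line."""
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not in the assumed layout, skip it
        c = CHANGE_RE.search(m.group("message"))
        if c:
            yield m.group("ts"), int(c.group("vm")), c.group("kind"), c.group("prop")


# Three lines copied from the excerpt above.
SAMPLE = """\
2014-11-04T10:58:48.406Z [488A1B70 verbose 'VpxaHalCnxHostagent' opID=WFU-e4a7ca00] [VpxaHalCnxHostagent::ProcessUpdate] Applying updates from 3526 to 3527 (at 3526)
2014-11-04T10:58:48.406Z [488A1B70 verbose 'hostdvm' opID=WFU-e4a7ca00] [VpxaHalVmHostagent] 26: Config changed 'config.extraConfig["vmware.tools.internalversion"].value'
2014-11-04T10:58:48.407Z [488A1B70 verbose 'hostdvm' opID=WFU-e4a7ca00] [VpxaHalVmHostagent] 26: Runtime changed 'guest.toolsVersion'
""".splitlines()

for ts, vm, kind, prop in changed_properties(SAMPLE):
    print(ts, vm, kind, prop)
```

To run it against the full file, pass an open file object instead of `SAMPLE`.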

Nobody in our department has touched these servers on this level - we normally only manage the VMs, not the hosts. How can this happen on its own?

All three servers crashed at the same time, at 10:50 am, without anyone doing anything specific. Our support team has been unable to find any log files indicating any kind of issue, which is very weird.

nickdnk

1 Answer

VMware host servers do not automatically update without a deliberate action triggered from vCenter via Update Manager. Please provide the answers to:

  • What specific build numbers of ESXi do you have?
  • What time did the systems reboot?
  • What is shown in the Events log inside of vCenter for the affected hosts? It should be very clear what happened.
  • What do the IBM out-of-band management tools/logs say?

Your servers likely crashed and the IBM management appears to have automatically rebooted the systems, based on the information I see here.

You need to run updates. You're likely triggering a bug with the virtual NIC adapter in your Windows guests. It should be vmxnet3 instead of Intel e1000/e1000e. Build 1331820 of ESXi is ancient and full of problems. When running vSphere in a cluster, it's extremely important to keep things updated.
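
To find which guests still use the Intel adapters, one option is to check each VM's `.vmx` file, where the adapter type of NIC N is stored under the `ethernetN.virtualDev` key. The following is a minimal Python sketch under that assumption; the function name `legacy_nics` is made up for illustration, and adapters with no explicit `virtualDev` entry are not reported.

```python
import re

# A .vmx file holds one key = "value" pair per line, for example:
#   ethernet0.virtualDev = "e1000"
NIC_RE = re.compile(
    r'^\s*(ethernet\d+)\.virtualDev\s*=\s*"([^"]*)"', re.IGNORECASE
)


def legacy_nics(vmx_text):
    """Return {adapter: type} for adapters set to e1000 or e1000e."""
    found = {}
    for line in vmx_text.splitlines():
        m = NIC_RE.match(line)
        if m and m.group(2).lower() in ("e1000", "e1000e"):
            found[m.group(1)] = m.group(2)
    return found


# Hypothetical .vmx fragment, for illustration only.
example = (
    'ethernet0.virtualDev = "e1000"\n'
    'ethernet1.virtualDev = "vmxnet3"\n'
)
print(legacy_nics(example))
```

Any adapter it reports is a candidate for replacement with vmxnet3, which requires VMware Tools in the guest for the driver.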

See:

Why is VMware ESXi 5.5 crashing?

VMware lockup CPU spike

ewwhite
  • Thank you. This is very helpful. We do not normally meddle in the dark arts of administering ESXi. I will forward your answer to our support team and have them perform necessary updates. – nickdnk Nov 05 '14 at 09:37
  • Although I find it odd that this would cause all three hosts to fail at the same time. – nickdnk Nov 05 '14 at 09:48
  • All hosts *can* crash. Were your systems running Windows guests and conducting a large data transfer (e.g. backups)? – ewwhite Nov 05 '14 at 14:33
  • Not as far as we know, no. – nickdnk Nov 06 '14 at 09:26
  • Okay - so for those curious. We updated all our hosts to build 2143827, reconfigured HA fault tolerance, upgraded Tools on all VMs and replaced E1000 ethernet with vmxnet3. Still no obvious reason *why* the servers crashed, but those are the steps taken to prevent future issues. – nickdnk Nov 06 '14 at 09:25
  • The servers crashed because of the bugs listed in the [answer above](http://serverfault.com/a/642119/13325), [VMware's knowledgebase warning](http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2059053) and probably just the right conditions and activity on your side. – ewwhite Nov 06 '14 at 10:45
  • Yeah - what I meant was that nothing has been found in the logs to support this. Not that it's wrong, because I'm sure you're right. I also assume that taking the steps in this answer has fixed the issue. – nickdnk Nov 06 '14 at 11:16