Yesterday our entire infrastructure crashed because all our ESXi hosts thought it would be an amazing idea to run updates at the same time. Edit: Or at least that's what we think happened, but nobody is really sure.
Normally we don't ever update the ESXi unless we have issues with them or somehow are informed of something that must be fixed.
Some information:
3x IBM x3650 M4 (7915D3G) configured in HA master/slave, ESXi version 5.5.0, IMM v. 3.73, Build 1331820
We're pretty baffled by the situation. Our support provided above cause of error and attached log files printing lines such as (the file is pretty huge, so I'll stick to this critical part):
2014-11-04T10:58:48.364Z [488A1B70 verbose 'VpxaHalCnxHostagent' opID=WFU-e04c5e84] [WaitForUpdatesDone] Starting next WaitForUpdates() call to hostd
2014-11-04T10:58:48.364Z [488A1B70 verbose 'VpxaHalCnxHostagent' opID=WFU-e04c5e84] [WaitForUpdatesDone] Completed callback
2014-11-04T10:58:48.406Z [488A1B70 verbose 'VpxaHalCnxHostagent' opID=WFU-e4a7ca00] [WaitForUpdatesDone] Received callback
2014-11-04T10:58:48.406Z [488A1B70 verbose 'VpxaHalCnxHostagent' opID=WFU-e4a7ca00] [VpxaHalCnxHostagent::ProcessUpdate] Applying updates from 3526 to 3527 (at 3526)
2014-11-04T10:58:48.406Z [488A1B70 verbose 'hostdvm' opID=WFU-e4a7ca00] [VpxaHalVmHostagent] 26: Config changed 'config.extraConfig["vmware.tools.internalversion"].value'
2014-11-04T10:58:48.407Z [488A1B70 verbose 'hostdvm' opID=WFU-e4a7ca00] [VpxaHalVmHostagent] 26: Config changed 'config.tools.toolsVersion'
2014-11-04T10:58:48.407Z [488A1B70 verbose 'hostdvm' opID=WFU-e4a7ca00] [VpxaHalVmHostagent] 26: Runtime changed 'guest.toolsVersion'
Nobody in our department has touched these servers on this level - we normally only manage the VMs, not the hosts. How can this happen on its own?
The servers crashed all three at the same time at 10:50 am withouth anyone doing anything specific. Our support team has been unable to find any log files indicating any kind of issue, which is very weird.