I have a Hyper-V 2012 R2 cluster: four Dell PowerEdge R620 servers connected to a Dell PowerVault MD3600F storage array via Fibre Channel. It's all pretty straightforward: all servers run WS2012R2, the cluster was freshly built a couple of months ago, all drivers and firmware are up to date, and Windows is patched to the latest available updates (even those released two days ago). There is also a SCVMM 2012 R2 server managing the whole thing, but this doesn't seem to matter for the problem at hand.
There are several VMs running on this cluster; some are generation 1 VMs running Windows Server 2008 R2, while most are generation 2 VMs running Windows Server 2012 R2. These, too, include the latest available updates; they were actually deployed from a template which was built soon after the cluster and is periodically updated whenever Microsoft releases new patches.
Everything works pretty well, but sometimes, with no discernible reason or cause, a VM will fail to boot, crashing with the dreaded INACCESSIBLE_BOOT_DEVICE error; this only happens upon booting (or rebooting): no VM has ever crashed while running.
Whenever this happens, there is no way to make the faulty VM boot again. The first occurrence was two weeks ago, with a freshly deployed VM that wasn't running any production workload yet; we were in quite a hurry to get it working, so we simply scrapped it and deployed a new one, and the root cause of the problem was never found.
Then it happened again two days ago, when we rebooted several VMs after patching them; three of them didn't come back up, while the others booted without any problem.
The faulty VMs are unable to boot even in Safe Mode; however, when booting into the Windows Recovery Environment (from the system itself, thus from the local virtual disk, not from a Windows DVD; meaning the virtual disk can indeed be accessed), everything seems to be OK: the boot manager correctly lists the system to be booted (the output of bcdedit /enum all /v is actually identical to that of a working VM), all volumes are accessible, and even chkdsk shows no errors at all. The only anomaly is that, when running bootrec /scanos or bootrec /rebuildbcd, the tool says it's unable to find any Windows installation (although the C: volume is there and is perfectly readable).
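For the record, this is roughly the sequence I ran from the WinRE command prompt on one of the faulty VMs (paraphrased from memory, so take the annotations with a grain of salt):

    REM From the WinRE command prompt of a faulty VM:

    bcdedit /enum all /v
    REM -> output identical to that of a working VM

    chkdsk C:
    REM -> no errors found

    bootrec /scanos
    REM -> "Total identified Windows installations: 0"

    bootrec /rebuildbcd
    REM -> likewise fails to find any Windows installation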
This has only happened (at least so far) with WS2012R2 generation 2 VMs, so I'm assuming it's caused by some problem in the EFI emulation and/or the EFI boot loader; however, this is only an assumption on my part.
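In case anyone wants to suggest repairs along those lines: the standard way to rebuild the EFI boot files from WinRE would be something like the sketch below. The drive letters are assumptions (C: for the Windows volume, S: for the EFI System Partition, which first needs to be given a letter with diskpart):

    REM Sketch only; drive letters are assumptions.
    REM 1) Give the EFI System Partition a drive letter:
    diskpart
        select disk 0
        list volume
        select volume <ESP volume number>
        assign letter=S
        exit
    REM 2) Recreate the EFI boot files and the BCD store from the Windows volume:
    bcdboot C:\Windows /s S: /f UEFI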
The reason I mentioned updates is that I'm aware this has happened before, and KB2919355 was responsible for it; also, Microsoft recently released another mega-update, KB3000850, and this was applied to the hosts, the virtual machines, and the WS2012R2 template alike.
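For completeness, a quick way to confirm which of the suspect updates a given machine actually has installed (run on a host or a working VM; findstr matches either KB number):

    REM Check whether the suspect updates are installed:
    wmic qfe get HotFixID | findstr /i "KB2919355 KB3000850"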
(Coincidentally, the day after this update was released, Microsoft experienced a worldwide crash of the whole Azure cloud platform, which bears a striking resemblance to what's happening to our cluster; but I'm just throwing guesses around here.)
I've already opened a support case with Microsoft, but I'm also posting here in the hope that someone can help; of course, if Microsoft provides a solution, I'll post it as soon as the VMs are back online.