
I have a Hyper-V 2012 R2 cluster: 4 Dell PowerEdge R620 servers connected to a Dell PowerVault MD3600F storage array via FC connections. It's all pretty straightforward: all servers run WS2012R2, the cluster was freshly built a couple of months ago, all drivers and firmware are up to date, and Windows is updated to the latest available patches (even those released two days ago). There is also a SCVMM 2012 R2 server managing the whole thing, but this doesn't seem to really matter for the problem at hand.

There are several VMs running on this cluster; some of them are generation 1 VMs running Windows Server 2008 R2, while most of them are generation 2 VMs running Windows Server 2012 R2; those, too, include the latest available updates; they have actually been deployed from a template which was built soon after the cluster, and is periodically updated when Microsoft releases new patches.

Everything works pretty well, but sometimes (with no discernible reason or cause) a VM will fail to boot, crashing with the dreaded INACCESSIBLE_BOOT_DEVICE error code; this only happens upon booting (or rebooting): no VM has ever crashed while running.

Whenever this happens, there is no way to make the faulty VM boot again; this happened the first time two weeks ago with a VM which was not running any production workload yet (it was freshly deployed); we were quite in a hurry to get it to work, thus we simply scratched it and deployed a new one; but the root cause of the problem was not found.

Then it happened again two days ago, when we rebooted several VMs after patching them; three of them didn't come back up, while some other ones booted without any problem.

The faulty VMs are unable to boot even in safe mode; however, when booting into Windows Recovery Environment (from the system itself, thus from the local (virtual) disk, not from a Windows DVD; meaning the virtual disk can indeed be accessed), everything seems to be ok: the boot manager correctly lists the system to be booted (the output of bcdedit /enum all /v is actually identical to that of a working VM), all volumes are accessible, and even chkdsk shows no error at all. The only anomaly is, when running bootrec /scanos or bootrec /rebuildbcd, the tool says it's unable to find any Windows installation (although the C: volume is there and it's perfectly readable).
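
For reference, these are the commands I ran from the WinRE command prompt on one of the faulty VMs; only the two bootrec ones give anomalous output (drive letters obviously depend on how WinRE maps the volumes):

    rem Compare the BCD store with that of a working VM (output is identical)
    bcdedit /enum all /v

    rem Check the system volume for corruption (no errors found)
    chkdsk C:

    rem Both of these claim no Windows installation can be found,
    rem even though C:\Windows is there and perfectly readable
    bootrec /scanos
    bootrec /rebuildbcd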

This only happened (at least so far) with WS2012R2 generation 2 VMs, thus I'm assuming it's caused by some problem in the EFI emulation and/or the EFI bootloader; however, this is only an assumption on my part.

The reason I mentioned updates is that I'm aware this happened before, and KB2919355 was responsible for it; also, Microsoft recently released another mega-update, KB3000850, which was applied to the hosts, the virtual machines and the WS2012R2 template.
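
Not that it proved anything, but for anyone who wants to check whether those specific updates are installed on their hosts and guests, a quick PowerShell check along these lines should do (the two KB numbers are the ones mentioned above):

    # List the suspect updates, if present, on this machine (run on both hosts and guests)
    Get-HotFix |
        Where-Object { $_.HotFixID -in 'KB2919355', 'KB3000850' } |
        Format-Table HotFixID, Description, InstalledOn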

(Coincidentally, the day after this update was released, Microsoft experienced a worldwide crash of the whole Azure cloud platform, which bears some striking resemblance to what's happening to our cluster; but I'm just throwing guesses around here.)

I've already opened a support case with Microsoft, but I'm also posting here, maybe someone can help; of course, if Microsoft provides a solution, I'll post it as soon as the VMs are back online.

Massimo
  • When the KB2919355 issue popped up, there was a solution that worked for our hyper-v farm. We had to export then import the machines and they would boot correctly. Have you tried this? – Reaces Dec 11 '14 at 11:59
  • Yes, and it didn't work; I've also tried attaching the virtual disk(s) to a new VM, but it fails to boot in the exact same way. – Massimo Dec 11 '14 at 12:12
  • Scary thing to try...but is the problem limited to VMs or are the hosts affected too? Have you tried rebooting one of the cluster nodes a few times to make sure it comes back up? May be better to find out now than Christmas morning... – Grant Dec 11 '14 at 13:53
  • Also, I would build a VM, apply all except the latest updates and reboot a few times to see if it breaks. Then try again with newer updates one at a time. If it is an update breaking it, you might narrow down which one that way (then tell us so we can remove it from our servers!). I was bitten by the KB2919355 fiasco. I don't want a repeat! – Grant Dec 11 '14 at 13:56
  • The hosts seem to be fine, they have been rebooted several times; the problem only affects the VMs. And it doesn't seem to be related to any specific update, because there are machines *with the exact same updates applied*, of which some are affected and some aren't. – Massimo Dec 11 '14 at 16:23
  • One thought... does the amount of RAM differ between working and broken VMs? Try cloning a working VM, then give it the VHD of a broken one. That would eliminate any weird Hyper-V config issues. Or just make sure every setting, even RAM size and such, is identical to an OK VM. – Grant Dec 11 '14 at 16:28
  • Tried this, too; also tried changing the virtual hardware config of the faulty VMs, without any result. There doesn't seem to be any relationship between the virtual hardware config and the problem happening. The faulty VMs have various configs, which are in some cases identical to those of working VMs. – Massimo Dec 11 '14 at 16:55
  • Also, the faulty VMs have definitely been rebooted before, several times. The problem seems to happen in a completely random way; but once it happens, a VM just won't boot anymore. – Massimo Dec 11 '14 at 16:56

2 Answers


We escalated the problem up to Microsoft Premier Support and got a kernel debug specialist working on it; he discovered that something uninstalled all Hyper-V drivers from the guest VMs, thus rendering them completely unable to boot; he managed to get one of them to boot by manually injecting the drivers in the file system and Registry of the VM, and we were able to get back some critical data (it was a Certification Authority); however, the VM was now in a completely unsupported state, and thus we decided to rebuild it; we also rebuilt all the other VMs, which had no critical data on them.
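
Microsoft didn't hand us a step-by-step procedure, so I can't reproduce exactly what the engineer did; but the general technique of injecting drivers into an offline VM disk looks roughly like this (the drive letter, the VHDX path and the driver source folder below are just placeholders, and the registry part is only hinted at):

    # Sketch only, not the exact procedure Microsoft used.
    # Mount the faulty VM's system disk on a working Hyper-V host
    Mount-VHD -Path 'C:\ClusterStorage\Volume1\FaultyVM\disk0.vhdx' -Passthru

    # Inject the missing driver packages (.inf files taken from a healthy
    # WS2012R2 machine) into the offline image; V: is the mounted Windows volume
    dism /Image:V:\ /Add-Driver /Driver:C:\Temp\hyperv-drivers /Recurse

    # The matching service entries live in the offline SYSTEM hive, which can be
    # loaded, edited (which entries depends on what got removed) and unloaded
    reg load HKLM\OfflineSystem V:\Windows\System32\config\SYSTEM
    reg unload HKLM\OfflineSystem

    Dismount-VHD -Path 'C:\ClusterStorage\Volume1\FaultyVM\disk0.vhdx'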

As for what actually caused the driver uninstallation, the case is still open and the cause has not been found yet; the problem was latent in the template we used, because it sooner or later affected all the VMs which had been deployed from that template; we built another template, and this one didn't show the same issue, so we are running fine now... but we still don't know what caused the problem in the first place.


Update:

After a while, we FINALLY found out what happened (I just forgot to update this answer before).

It looks like someone or something forcibly updated the Hyper-V Integration Services in the base template, which already had them, being based on the exact same OS release as the hosts; this caused a latent issue in the guest system where those drivers would be marked as duplicate and/or superseded, and thus in need of being removed; but this would only trigger after a variable time interval, when Windows ran some periodic automated cleanup process. This eventually led to the complete uninstallation of all Hyper-V drivers on each VM instantiated from that template, rendering it completely unable to boot.

As for who or what performed this update (which can't be done by inserting the Integration Services setup disk and running its setup, because the installer correctly detects the drivers are already installed and exits), we still have no clue. Either someone who should have known better did it manually using PowerShell or DISM, or SCVMM was the culprit.
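
If you want to check whether one of your own templates (or guests) is in this state, listing the non-inbox driver packages inside the guest should reveal extra copies of the integration components sitting on top of the inbox ones; something along these lines (the filter is just my guess at what to look for):

    # Run inside the guest: list third-party (oemNN.inf) driver packages;
    # forcibly updated Hyper-V integration drivers would show up here, since
    # the inbox copies shipped with WS2012R2 are not oem*.inf packages
    Get-WindowsDriver -Online |
        Where-Object { $_.ProviderName -like '*Microsoft*' } |
        Format-Table Driver, OriginalFileName, ClassName, Version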

Massimo

Export the VM and attach it to a separate Hyper-V host.

Boot the VM on the new Hyper-V host and check whether everything works.

This worked in our case; give it a try.
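
For completeness, a minimal sketch of that using the Hyper-V PowerShell cmdlets (VM name and paths are placeholders; run the export on the original host and the import on the new one):

    # On the original host: export the VM (configuration + virtual disks)
    Export-VM -Name 'FaultyVM' -Path 'D:\Export'

    # On the other host: import a copy with a new VM ID, pointing at the
    # exported configuration file under the 'Virtual Machines' folder
    Import-VM -Path 'D:\Export\FaultyVM\Virtual Machines\<GUID>.xml' -Copy -GenerateNewId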