System: Windows Server 2016 Standard, fully patched, both host and guest(s).

A strange thing happened today when I rebooted one of my Hyper-V guests (by RDP-ing to the guest and manually triggering a reboot). Here are the relevant parts of the event log:

Guest:

11:38:32 The operating system is shutting down at system time 2020-07-09T09:38:32.812302400Z.
11:41:00 The operating system started at system time 2020-07-09T09:40:59.495420000Z.

Host:

11:40:39 The operating system started at system time 2020-07-09T09:40:38.490643100Z.
11:40:39 The last shutdown's success status was false. The last boot's success status was true.
11:40:51 The previous system shutdown at 11:38:05 on 09.07.2020 was unexpected.
11:40:42 The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.
         (BugcheckCode 0, BugcheckParameter1 0x0, ...)

It appears that the guest rebooted successfully and that, in the short interval during which the guest was shut down, the host power cycled, which looks ... strange. The host's and the guest's clocks are perfectly synchronized.
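
The ordering can be double-checked by putting all four timestamps on one UTC timeline. The following Python sketch does that with the values from the logs above. The helper name `parse_event_time`, the event labels, and the UTC+2 offset used for the host's locally formatted 11:38:05 value are assumptions made for illustration (the offset is inferred from the paired local/UTC times in the log entries).

```python
from datetime import datetime, timedelta, timezone


def parse_event_time(stamp: str) -> datetime:
    """Parse an event-log timestamp such as 2020-07-09T09:38:32.812302400Z."""
    base, _, frac = stamp.rstrip("Z").partition(".")
    # The log carries more fractional digits than datetime can hold,
    # so everything beyond microseconds is truncated.
    micros = int(frac[:6].ljust(6, "0")) if frac else 0
    parsed = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(microsecond=micros, tzinfo=timezone.utc)


# Assumption: the machines' local time is UTC+2 (11:38:32 local == 09:38:32Z).
LOCAL = timezone(timedelta(hours=2))

events = [
    ("guest: shutting down", parse_event_time("2020-07-09T09:38:32.812302400Z")),
    ("guest: started", parse_event_time("2020-07-09T09:40:59.495420000Z")),
    ("host: started", parse_event_time("2020-07-09T09:40:38.490643100Z")),
    # Time the host reports for its previous (unexpected) shutdown.
    ("host: previous shutdown (reported)",
     datetime(2020, 7, 9, 11, 38, 5, tzinfo=LOCAL)),
]

# Sort everything onto a single UTC timeline.
events.sort(key=lambda event: event[1])

previous = None
for label, when in events:
    gap = "" if previous is None else f"  (+{(when - previous).total_seconds():.3f} s)"
    print(f"{when.astimezone(timezone.utc).isoformat()}  {label}{gap}")
    previous = when
```

Sorted this way, the gaps between the host's reported shutdown time, the guest shutdown, the host start, and the guest start can be read off directly.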

**Is this a known issue or an indication of some hardware fault?** If the latter, educated guesses as to the cause (SSD, RAM or system board) are welcome.


Some additional background (I don't know whether it's relevant, but I'll include it just in case): The host BSODs every few months(!) with bugcheck codes that point to faulty hardware, although I haven't yet been able to determine the culprit by examining the minidumps (there are no device drivers in the stack trace or other obvious clues). The last time this happened (two months ago) I switched the order of the RAM modules, and there has been no BSOD since. Because those BSODs occur so rarely, they are almost impossible to debug with the usual techniques (I can't, for example, run the host with half the RAM for half a year just to see if that fixes the issue). Memtest86 reported no errors, but I know that this does not necessarily mean the RAM is perfectly OK. It's an HP Microserver Gen10 with ECC RAM.

Heinzi
  • 1. Make sure the Integration Services in the guest are up to date. 2. Download and install the Debugging Tools for Windows on your workstation and configure the host to capture a mini memory dump. The next time it BSODs, grab the memory dump and analyze it with the debugging tools. - https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/debugger-download-tools – joeqwerty Jul 09 '20 at 13:29
  • @joeqwerty: Ad 1: Server 2016 has integration services built-in and my systems are fully patched. Ad 2: That's exactly what I did, but without success (*"there are no device drivers in the stack trace or other obvious clues"*). – Heinzi Jul 09 '20 at 13:45
  • Yes, Integration Services in Windows Server 2016 are updated via Windows Update. I was suggesting that you verify that they are in fact up to date. – joeqwerty Jul 09 '20 at 14:09
  • @joeqwerty: Makes sense. `HKLM\Software\Microsoft\Virtual Machine\Auto\IntegrationServicesVersion` says `10.0.14393` (which is just the Server 2016 build number) and there are no pending Windows Updates, so I guess they are up-to-date. – Heinzi Jul 09 '20 at 14:15
  • Voting to close. Seriously, you have a broken machine and you ask us how to fix it? Here is a hint: Windows servers DO NOT BSOD EVERY COUPLE OF MONTHS. Either your hardware is off, or your drivers, or your BIOS. Fix that. Until you have a working machine, what are you complaining about? My broken machine is broken? – TomTom Jul 09 '20 at 14:22
  • And Memtest on an ECC RAM server is 100% reliable to NOT show errors - but you do get every correction in the event log, so you can check whether the memory is good or bad. – TomTom Jul 09 '20 at 14:23
  • @TomTom: "*Seriously, you have a broken machine and you ask us how to fix it?*" Indeed, I did. Are questions about how to fix broken machines no longer on-topic on SF? I am aware that "hardware being off" is the most likely cause (BIOS and drivers are up-to-date), and nothing would make me happier than finally finding the faulty component and "fixing it" (by replacing it). – Heinzi Jul 09 '20 at 14:50
  • @TomTom: Re. ECC RAM: That's a helpful hint, thanks. Do you happen to know which Event Log message is logged when an ECC error correction occurs? – Heinzi Jul 09 '20 at 14:53
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/110413/discussion-between-heinzi-and-tomtom). – Heinzi Jul 09 '20 at 15:47
  • Ah, no? I am not here to give free consulting. I suggest asking or hiring an IT technician. – TomTom Jul 09 '20 at 17:47
  • @TomTom: That's ok, everyone has their own reasons for being here. Just remind me to show the same level of condescension the next time [you have a question and I know the answer](https://stackoverflow.com/q/8062946/87698). – Heinzi Jul 09 '20 at 19:00
  • Sure. Because in that case it was not an open off-topic question. Server Fault, per the site rules you love to ignore, is not a place for people not knowing how to debug their hardware. The site rules here are - interesting. Not what I would do, but hey, as a GUEST here I actually FOLLOW them. Try that. – TomTom Jul 09 '20 at 19:05
  • @TomTom: (1) I sense a level of aggression in this discussion which seems ... out of proportion. Did I do something to upset you personally? If I did, I apologize - I didn't mean to. (2) My original question was that thing in bold ("Is this a known issue..."), which is on-topic, as far as I can tell. (3) Finding the faulty component that caused a BSOD (which was *not* my question, but which developed as a side track in the comments) *would* be on-topic here as well, see [this example](https://serverfault.com/q/238/17182) or [the help center](https://serverfault.com/help/on-topic). – Heinzi Jul 09 '20 at 20:10
