Our new server has been running basically fine for a few months. Twice, however, it shut itself down for no apparent reason.
The most recent occurrence was at 11:41 PM a few days ago. The event logs show nothing untoward; the last entry is a fairly mundane audit entry in the Security log. The UPS log shows no power issues. It was after hours, so nothing in particular was running except the nightly backup, which starts at 10 PM. The backup log shows nothing interesting either; it just stops in the middle of the backup. Although the server is configured to write a kernel dump and restart, there was no memory dump and the system did not restart. It's an HP ProLiant ML330 G6 Series server.
When the server was restarted manually the next morning, these events were logged:
Log Name: System
Source: EventLog
Date: 4/16/2011 8:20:22 AM
Event ID: 6008
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: XXXXXXXX.xxxxxxxxxxxxxxxx.local
Description:
The previous system shutdown at 11:41:26 PM on 4/15/2011 was unexpected.
and
Log Name: System
Source: Microsoft-Windows-Kernel-Power
Date: 4/16/2011 8:20:00 AM
Event ID: 41
Task Category: (63)
Level: Critical
Keywords: (2)
User: SYSTEM
Computer: XXXXXXXX.xxxxxxxxxxxxxxxx.local
Description:
The system has rebooted without cleanly shutting down first. This error could be
caused if the system stopped responding, crashed, or lost power unexpectedly.
and
Log Name: System
Source: USER32
Date: 4/16/2011 8:22:34 AM
Event ID: 1076
Task Category: None
Level: Warning
Keywords: Classic
User: XXXXXXXXXXXXXXX\Administrator
Computer: XXXXXXXX.xxxxxxxxxxxxxxxx.local
Description:
The reason supplied by user XXXXXXXXXXXXXXX\Administrator for the last unexpected
shutdown of this computer is: Other Failure: System Unresponsive
Reason Code: 0x8000005
Problem ID:
Bugcheck String:
Comment:
I've spent some time researching this and found very little of use. Anyone have any ideas?
UPDATE: Here are the relevant portions of the iLO2 log:
305 04/15/2011 23:42:00 Server reset.
306 04/15/2011 23:42:00 Server power removed.
307 04/15/2011 23:42:00 iLO 2 network link down.
308 04/15/2011 23:42:00 iLO 2 network link up at 100 Mbps.
309 04/16/2011 08:17:00 Server power restored.
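To line up the iLO 2 entries with the Windows timestamps, here is a rough Python sketch. It assumes the iLO lines have the layout shown above (entry number, date, 24-hour time, message) and takes the "last alive" time from the Event ID 6008 description. The names `ILO_LOG`, `WINDOWS_LAST_ALIVE` and `parse_ilo_line` are just for illustration. It also assumes the iLO and Windows clocks roughly agree, which I haven't verified, and the iLO timestamps appear to be logged only to the minute.

```python
from datetime import datetime

# iLO 2 log lines copied from above: entry number, date, time, message.
ILO_LOG = """\
305 04/15/2011 23:42:00 Server reset.
306 04/15/2011 23:42:00 Server power removed.
307 04/15/2011 23:42:00 iLO 2 network link down.
308 04/15/2011 23:42:00 iLO 2 network link up at 100 Mbps.
309 04/16/2011 08:17:00 Server power restored.
"""

# Time of the unexpected shutdown as reported by Event ID 6008.
WINDOWS_LAST_ALIVE = datetime.strptime(
    "4/15/2011 11:41:26 PM", "%m/%d/%Y %I:%M:%S %p"
)


def parse_ilo_line(line):
    """Split one iLO log line into (entry number, timestamp, message)."""
    number, date, time, message = line.split(None, 3)
    timestamp = datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S")
    return int(number), timestamp, message


entries = [parse_ilo_line(line) for line in ILO_LOG.splitlines() if line.strip()]

# Pick out the two power events by their message text.
power_removed = next(ts for _, ts, msg in entries if msg == "Server power removed.")
power_restored = next(ts for _, ts, msg in entries if msg == "Server power restored.")

# Gap between the last Windows timestamp and the iLO power-removed entry.
gap_seconds = (power_removed - WINDOWS_LAST_ALIVE).total_seconds()
# How long the server stayed powered off.
downtime = power_restored - power_removed

print(f"Windows last alive -> iLO power removed: {gap_seconds:.0f} s")
print(f"Powered off for: {downtime}")
```

With the entries above, this gives a 34-second gap between the last Windows timestamp and the iLO "Server power removed" entry, and 8 hours 35 minutes powered off until I restarted it manually.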
UPDATE: I increased the size of the paging file to allow for full kernel dumps, so if it's really a Windows crash, I'll be able to see what happened the next time it occurs.
UPDATE: The server firmware was already up to date.
UPDATE: There were a lot of updates available for drivers and system software. I've installed most of them and now I'm just waiting to see if the problem happens again.
UPDATE 2018Jun06: After six years of trouble-free operation, this problem has returned, occurring twice in the last week or so. I'm looking into the possibility that the front panel or its wiring is faulty.
UPDATE 2018Nov30: Finally swapped out the front panel cable assembly, but the problem still occurs. Next up is the power supply.