6

Our new server has been running basically fine for a few months. Twice, however, it shut itself down for no apparent reason.

The most recent occurrence was at 11:41pm a few days ago. The event logs show nothing untoward, and the last entry is a fairly mundane audit entry in the Security log. The UPS log shows no power issues. Nothing in particular was running, as it was after hours, except of course the nightly backup, which starts at 10pm. The backup log also shows nothing interesting; it just stops in the middle of the backup. Although the server is configured to write a kernel dump and restart, there is no memory dump and the system did not restart. It's an HP ProLiant ML330 G6 server.

When the server was restarted manually the following morning, the following events were logged:

Log Name:      System
Source:        EventLog
Date:          4/16/2011 8:20:22 AM
Event ID:      6008
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      XXXXXXXX.xxxxxxxxxxxxxxxx.local
Description:
The previous system shutdown at 11:41:26 PM on ‎4/‎15/‎2011 was unexpected.

and

Log Name:      System
Source:        Microsoft-Windows-Kernel-Power
Date:          4/16/2011 8:20:00 AM
Event ID:      41
Task Category: (63)
Level:         Critical
Keywords:      (2)
User:          SYSTEM
Computer:      XXXXXXXX.xxxxxxxxxxxxxxxx.local
Description:
The system has rebooted without cleanly shutting down first. This error could be
caused if the system stopped responding, crashed, or lost power unexpectedly.

and

Log Name:      System
Source:        USER32
Date:          4/16/2011 8:22:34 AM
Event ID:      1076
Task Category: None
Level:         Warning
Keywords:      Classic
User:          XXXXXXXXXXXXXXX\Administrator
Computer:      XXXXXXXX.xxxxxxxxxxxxxxxx.local
Description:
The reason supplied by user XXXXXXXXXXXXXXX\Administrator for the last unexpected 
shutdown of this computer is: Other Failure: System Unresponsive
Reason Code: 0x8000005
Problem ID: 
Bugcheck String: 
Comment: 

I've spent some time researching this and found very little of use. Anyone have any ideas?

UPDATE: Here are the relevant portions of the iLO2 log:

305 04/15/2011 23:42:00 Server reset. 
306 04/15/2011 23:42:00 Server power removed. 
307 04/15/2011 23:42:00 iLO 2 network link down. 
308 04/15/2011 23:42:00 iLO 2 network link up at 100 Mbps. 
309 04/16/2011 08:17:00 Server power restored. 
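
For anyone wanting to line these entries up against the Windows events above, here is a minimal Python sketch. It assumes the log lines have exactly the layout shown in the excerpt (entry number, date, time, message); the helper names `parse_ilo_line` and `first_time` are made up for illustration and are not part of any HP tool. The Windows timestamp is the one reported by Event ID 6008.

```python
from datetime import datetime

# The iLO 2 excerpt from the question, copied as-is.
ILO_LINES = [
    "305 04/15/2011 23:42:00 Server reset.",
    "306 04/15/2011 23:42:00 Server power removed.",
    "307 04/15/2011 23:42:00 iLO 2 network link down.",
    "308 04/15/2011 23:42:00 iLO 2 network link up at 100 Mbps.",
    "309 04/16/2011 08:17:00 Server power restored.",
]


def parse_ilo_line(line):
    """Split one excerpt line into (entry number, timestamp, message)."""
    number, date, time, message = line.split(maxsplit=3)
    stamp = datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S")
    return int(number), stamp, message.strip()


def first_time(entries, text):
    """Return the timestamp of the first entry whose message contains text."""
    for _, stamp, message in entries:
        if text in message:
            return stamp
    return None


entries = [parse_ilo_line(line) for line in ILO_LINES]
removed = first_time(entries, "Server power removed")
restored = first_time(entries, "Server power restored")

# Last time Windows recorded before going down, per Event ID 6008.
windows_last_alive = datetime(2011, 4, 15, 23, 41, 26)

outage = restored - removed
gap = removed - windows_last_alive

print("Power off for:", outage)
print("Windows' last timestamp to iLO power removal:", gap)
```

Every timestamp in the excerpt falls on a whole minute, and the iLO and Windows clocks are not necessarily in sync, so the gap is only approximate. Even so, the two logs agree to within about a minute on when the server went down.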

UPDATE: I increased the size of the paging file to allow for full kernel dumps, so if it's really a Windows crash, I'll be able to see what happened - the next time it happens.

UPDATE: The server firmware was already up to date.

UPDATE: There were a lot of updates available for drivers and system software. I've installed most of them and now I'm just waiting to see if the problem happens again.

UPDATE 2018Jun06: after six years of trouble-free operation, this problem has returned, occurring twice in the last week or so. I'm looking into the possibility that the front panel and its wiring are faulty.

UPDATE 2018Nov30: Finally swapped out the front panel cable assembly, but the problem still occurs. Next up is the power supply.

boot13
  • dang do you have ASR enabled? – tony roth Apr 18 '11 at 20:43
  • @tony: Do you mean the Automated System Recovery built into Windows or the HP Automatic Server Recovery? – boot13 Apr 18 '11 at 21:51
  • The integrated log management should show the ASR; post that log if you can. If it doesn't show anything, I still suspect the ASR process. Do you see event ID 57 in your System log? – tony roth Apr 19 '11 at 02:11
  • @tony: Sorry, still not following. By ASR do you mean the HP 'Automatic Server Recovery'? If so, where can I find it? When you refer to the 'integrated log management' do you mean the 'Integrated Management Log' in the System Management Homepage? I _can_ confirm that there are **no** events with ID 57 in the Windows System event log. – boot13 Apr 19 '11 at 18:10
  • Sorry I wasted your time, but I think you are correct: "23:42:00 Server power removed" means, I think, that an external source disrupted the power! – tony roth Apr 19 '11 at 20:27
  • OK, I looked at an HP server. The IML log viewed via the System Management Homepage was blank, but viewed directly with the IML log viewer it showed errors. So try looking at it with the log viewer application "C:\Program Files\Compaq\Cpqimlv\cpqimlv.exe" – tony roth Apr 19 '11 at 20:41
  • @tony: Right. I see the log okay in the System Management Homepage, and it shows the same entries as the external viewer. Unfortunately, there's nothing at all interesting around the time of the problem. Hmmm. – boot13 Apr 19 '11 at 21:16
  • @tony: Finally found the ASR configuration, and it shows that ASR is definitely enabled, last reset was manual, and the ASR reset count is zero. So whatever happened appears to have skirted around ASR, or ASR is not working properly. – boot13 Apr 19 '11 at 21:18
  • W2K8 R2 does support power management features; I wonder if a setting is out of whack! – tony roth Apr 20 '11 at 18:17
  • @tony: I checked and most of it is disabled, with the only exception being the monitor turning off after a few minutes. – boot13 Apr 20 '11 at 22:21
  • Where are the minidumps? – SLY May 03 '11 at 19:40
  • @SLY: I'm not sure what you're asking. – boot13 May 03 '11 at 19:43
  • Do you see a folder C:\Windows\Minidump with .dmp files in it? – SLY May 03 '11 at 20:26
  • @SLY: Nope. FYI, the server was configured for full kernel dumps, but not enough virtual memory was assigned to allow a full kernel dump to work. When I realized this, I increased the paging file size, but the problem has not recurred since then. – boot13 May 03 '11 at 20:59
  • Is the server powered by a UPS with logs of power outages? This would be useful to know to eliminate power loss as a cause. – John Auld Jun 30 '14 at 07:53
  • @JohnAuld: Yes, as I said in the original question, it's on a UPS and I checked the UPS log. – boot13 Jun 30 '14 at 12:12

6 Answers

5

It's most likely a faulty power switch/LED cable kit. My ML310 G5 was doing the same thing, and that is what fixed the problem. Apparently, it is a known issue with HP.

459186-001-02 HEWLETT-PACKARD PROLIANT ML310 G5 SYSTEM FRONT LED TO SYS/BRD CABLE P/N: 459186-001-02 - HEWLETT-PACKARD ORIGINALS

Cole
  • I realize that your answer is from six years ago, but I was wondering if you can recall where you heard that this was a known issue with HP. After a long hiatus, this problem has returned, and I'm looking at the front panel/cables as the possible cause. – boot13 Jun 06 '18 at 12:15
2

I had this EXACT issue happening on my Server 2008 R2 box. It turns out that the Xeon 5000 series CPUs, which your machine does use, have an issue with 2008 R2 and the Hyper-V role. I'm going out on a limb here and assuming you have the Hyper-V role installed, based on the issue being identical to the one I was having.

There is a hotfix from Microsoft available HERE. I installed it on my system, and it has been trouble free since then.

DanBig
  • Dang. I thought you were onto something there, but we don't have that role installed. Can you describe the process you went through to determine that the Hyper-V role was the problem? It might help me to know where you looked. – boot13 May 03 '11 at 20:49
  • The link I provided was the answer for me. I was researching the issue, and that article came up. However, it still may be worthwhile to install the hotfix, even though it mentions the Hyper-V role and you don't have it installed. – DanBig May 03 '11 at 20:50
  • Okay, thanks. I'll keep that one in mind in case the firmware and driver updates don't fix it. – boot13 May 03 '11 at 22:24
2

I'm going to go waaaaaaay out on a limb here, and say that you might need a firmware update. Source. We had something similar with our DL380 G6 a while back.

Holocryptic
  • That looks promising. I'll be installing that firmware update shortly. Thanks! – boot13 May 03 '11 at 22:26
  • Dang. Finally scheduled a time to do this, and it turns out we're already running the latest firmware, including the SPLD. Back to the other suggestions... – boot13 May 12 '11 at 16:23
1

Is the machine overheating? Check the fans and vents for dust bunnies.

ed209
  • @ed: Nope. Brand new server, fans all running fine, temps all nominal. – boot13 Apr 18 '11 at 21:52
  • Hmm. Perhaps run a stress test tool, like Intel Burn Test, to make sure the server isn't faulty first. Google Intel Burn Test. – ed209 Apr 19 '11 at 13:52
  • @ed: Okay, thanks for the tip. I'll run that as soon as I can find a window of opportunity and post the results. – boot13 Apr 19 '11 at 18:14
1

Do you have the HP management agent software installed? You mention Windows event logs and backup logs but not the "hardware" logs. You need to look there too because spontaneous shutdowns might be related to a hardware issue that you won't be able to see info about anywhere else.

icky3000
  • @icky: Okay, I looked at the System Management Home Page and the logs therein: System Management Homepage Log, Integrated Management Log, HP Version Control Agent Log and Integrated Lights-Out 2 Log. The only one that shows anything interesting is the iLO2 log, excerpts of which I will add to my question. – boot13 Apr 18 '11 at 23:58
  • My theory was/is that the absence of a kernel dump means that it wasn't Windows that crashed so it must be hardware or power related. It's unfortunate there's nothing useful in the HP logs. How about tracing the power back - dual power supplies? cord in a place where it could snag when someone walks by? connected to a PDU that other servers are on or ? on the same circuit as other things? That kind of stuff. – icky3000 Apr 19 '11 at 03:23
  • @icky: I think you're probably right that Windows wasn't the culprit. Still, I will increase the size of the paging file as I understand that at its current size it may not allow for full kernel dumps. As for the power side, there's only one power supply, so that's a possibility; no cord issues as the server is in a (well-ventilated) closet and things are tucked away appropriately; and the server is on a solid, new UPS that is functioning perfectly and shows nothing untoward in its history. – boot13 Apr 19 '11 at 18:25
  • There's probably a web interface for the UPS too and you should look into that and its logs. There might be something in there like it purposely shut down that power connection for some reason. – icky3000 Apr 19 '11 at 19:28
  • @icky: Yeah, I checked that. It has logs and some nice graphs of the power coming into and leaving the UPS and throughout the period in question everything was smooth. Kind of disappointing, really: the previous time this occurred we didn't yet have the UPS so I pretty much assumed it was a power glitch. It could still be the server's power supply, though. – boot13 Apr 19 '11 at 21:10
  • It could be. I really would expect you'd see something in the HP logs then though. Not sure where else to suggest you look at this point. Hard to troubleshoot further without being on the actual box. – icky3000 Apr 19 '11 at 22:58
0

If that really was a system crash, you would have found an event such as this in the System log:

Level: Error
Source: Bugcheck
Event ID: 1001
Text: The computer has rebooted from a bugcheck.  The bugcheck was: [...]

Also, being configured to save a kernel dump and then reboot, the server would have done just that.

The absence of such an event and of a subsequent reboot means the shutdown was caused by an external event (power loss, hardware fault...). Also, your iLO logs seem to confirm that a power failure was the actual reason.
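
The reasoning above can be written down as a small decision rule. This is only a sketch in Python, not a diagnostic tool: the function name `classify_unexpected_shutdown` and the result labels are hypothetical, and the event sources and IDs are simply the ones quoted in this thread (Bugcheck 1001, EventLog 6008, Kernel-Power 41). It also assumes you have already pulled the (source, event ID) pairs out of the System log by whatever means you prefer.

```python
def classify_unexpected_shutdown(events, dump_found):
    """Guess the kind of shutdown from System log entries seen after restart.

    events: iterable of (source, event_id) tuples (labels as quoted in the thread).
    dump_found: True if a memory dump file was written.
    """
    seen = set(events)
    bugcheck_logged = ("Bugcheck", 1001) in seen
    unclean_shutdown = (
        ("EventLog", 6008) in seen
        or ("Microsoft-Windows-Kernel-Power", 41) in seen
    )

    if bugcheck_logged or dump_found:
        # Windows itself recorded a crash.
        return "windows-crash"
    if unclean_shutdown:
        # Windows went down without recording anything: suspect power or hardware.
        return "external"
    return "clean-or-unknown"


# The three events from the question, with no dump file on disk.
events_after_restart = [
    ("EventLog", 6008),
    ("Microsoft-Windows-Kernel-Power", 41),
    ("USER32", 1076),
]
print(classify_unexpected_shutdown(events_after_restart, dump_found=False))
```

One caveat from the comments on the question: the paging file was too small for the configured dump, so in this particular case the missing dump is weaker evidence than the missing bugcheck event.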

Massimo
  • Power does seem to have been the issue, but since the UPS didn't report anything unusual, I'm going to assume it was a faulty power switch assembly as suggested by Cole. Thanks! – boot13 Oct 21 '12 at 13:30