
I have a new HP ProLiant DL360 G7 system that is exhibiting a difficult-to-reproduce issue. The server randomly hangs at the "Power and Thermal Calibration in Progress..." screen during the POST process. This typically follows a warm-boot/reboot from the installed operating system.

[Screenshot: POST screen stalled at "Power and Thermal Calibration in Progress..."]

The system stalls indefinitely at this point. Issuing a reset or cold-start via the ILO 3 power controls makes the system boot normally without incident.

When the system is in this state, the ILO 3 interface is fully accessible and all system health indicators are fine (all green). The server is in a climate-controlled data center with power connections to a PDU. Ambient temperature is 64°F/17°C. The system was placed in a 24-hour component testing loop prior to deployment with no failures.

The primary operating system for this server is VMware ESXi 5. We initially tried 5.0 and later a 5.1 build. Both were deployed via PXE boot and kickstart. In addition, we are testing with bare-metal Windows and Red Hat Linux installations.

HP ProLiant systems have a comprehensive set of BIOS options. We've tried the default settings in addition to the Static high-performance profile. I've disabled the boot splash screen and just get a blinking cursor at that point versus the screenshot above. We've also tried some VMware "best-practices" for BIOS config. We've seen an advisory from HP that seems to outline a similar issue, but it did not fix our specific problem.

Suspecting a hardware issue, I had the vendor send an identical system for same-day delivery. The new server was a fully identical build with the exception of disks. We moved the disks from the old server to the new one and experienced the same random booting issue on the replacement hardware.

I now have both servers running in parallel. The issue hits randomly on warm-boots. Cold boots don't seem to have the problem. I am looking into some of the more esoteric BIOS settings like disabling Turbo Boost or disabling the power calibration function entirely. I could try these, but they should not be necessary.

Any thoughts?

--edit--

System details:

  • DL360 G7 - 2 x X5670 Hex-Core CPUs
  • 96GB of RAM (12 x 8GB Low-Voltage DIMMs)
  • 2 x 146GB 15k SAS Hard Drives
  • 2 x 750W redundant power supplies

All firmware is up to date as of the latest HP Service Pack for ProLiant DVD release.

Calling HP and trawling the interwebz, I've seen mentions of a bad ILO 3 interaction, but this happens with the server on a physical console, too. HP also suggested power source, but this is in a data center rack that successfully powers other production systems.

Is there any chance that this could be a poor interaction between low-voltage DIMMs and the 750W power supplies? This server should be a supported configuration.

ewwhite
  • Any way to eliminate the disks as a possible cause? Any chance you can test with some alternate SAS or SATA disks? – ErnieTheGeek Jan 10 '13 at 19:16
  • Yes, tested with a known-good set of disks in the second system. They're running in parallel. – ewwhite Jan 10 '13 at 19:20
  • The only time I've ever seen this was in a system (also a DL360 G7) where I was trying to use a non-HP card to provide storage. When I had both the SmartArray card and this other one in there, it did that. When I took either out, it passed. This is not your problem, but I pass on what I ran into. – sysadmin1138 Jan 10 '13 at 19:36
  • Possibly something network related? Try to duplicate without being connected to the network. – ErnieTheGeek Jan 10 '13 at 19:43
  • @ErnieTheGeek Unplug the host networking? What about ILO? We need that. – ewwhite Jan 10 '13 at 19:50
  • It just clicked that you aren't local to the servers. Ideally I'd say try to duplicate without the ILO connected. Maybe look into the logs on the switch and see if there's anything in there that might provide a hint. – ErnieTheGeek Jan 10 '13 at 20:15
  • I'm local to the machines. I've just obtained a THIRD unit. We've had the issue without using ILO as well. Network is necessary because we are PXE booting the OS. – ewwhite Jan 10 '13 at 20:37
  • Try this and if it works I'll change it to an "answer": Dynamic Power Capping Functionality (Default = Enabled): This BIOS option allows the user to disable the System ROM Power Calibration feature that is executed during the boot process. When disabled, the user can expect faster boot times but will not be able to enable a Dynamic Power Cap until this feature is re-enabled. – TheCleaner Jan 11 '13 at 16:18
  • @TheCleaner Disabling Dynamic Power Capping is not an option on G7 servers. It was introduced for the Gen8 ProLiant series. – ewwhite Jan 11 '13 at 16:24
  • What is the BIOS version? Do you have a case # with HP? –  Jan 11 '13 at 19:59
  • The BIOS for the G7 ProLiant series has been locked at the 5/2011 revision, so the core server BIOS is at the most recent version available: 2011.05.05 (A) – ewwhite Jan 11 '13 at 20:02
  • Any way to downgrade temporarily to a previous BIOS (the "locked" bit makes me think maybe not) and see if the issue still persists? – nedm Jan 11 '13 at 21:29
  • @nedm The BIOS on ProLiant systems can be downgraded to a backup BIOS. When I said "locked", I was referring to the fact that there haven't been any G7 HP BIOS updates since 2011. – ewwhite Jan 11 '13 at 21:36
  • Guess I'd give the downgrade a shot then, if only to rule it out. – nedm Jan 11 '13 at 21:40
  • When I got this error on my ProLiant DL380 G8, it was after a BIOS update, and all I needed to do was turn the server off and on again. A reset using ILO was not enough; it always stuck at 90%. –  Feb 08 '14 at 09:14

1 Answer


So, after bringing a third system into the mix, and experiencing the same issue, we began to question the environment. I dug up a copy of the HP ProLiant Servers Troubleshooting Guide and found the POST problems flowchart shown below.

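[Image: POST problems flowchart from the HP ProLiant Servers Troubleshooting Guide, with the "Do you have known good KVM?" node highlighted]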

Carefully running through the steps in the chart, we realized that the one constant across all of the servers was a KVM switch attached to the data center crash cart. This was a consumer-class USB-enabled KVM. When I reached the highlighted node in the flowchart, "Do you have known good KVM?", I could not answer conclusively.

So, we unplugged the servers from the KVM switch and ran an automated "boot, sleep 300; reboot" sequence from rc.local. The servers had no issues with this, regardless of standard or low-voltage DIMMs, PSU wattage, etc.
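For anyone wanting to repeat that kind of test, here is a minimal sketch of a warm-boot loop that could be called from rc.local. Only the "sleep 300; reboot" part comes from the test described above; the boot counter, the cycle cap, the ARM_REBOOT_LOOP switch, and all variable and function names are hypothetical additions, and where rc.local lives (and whether it runs at all) depends on the distribution.

```shell
#!/bin/sh
# Sketch of a warm-boot stress loop, meant to be called from rc.local.
# Everything except "sleep 300; reboot" is a hypothetical addition.

COUNT_FILE="${COUNT_FILE:-/var/tmp/warm-boot-count}"  # assumed to survive reboots
MAX_CYCLES="${MAX_CYCLES:-50}"                        # stop after this many boots
DELAY_SECS="${DELAY_SECS:-300}"                       # the "sleep 300" from the test
REBOOT_CMD="${REBOOT_CMD:-reboot}"

# Print the number of the current boot: previous count plus one.
next_count() {
    prev=$(cat "$COUNT_FILE" 2>/dev/null || echo 0)
    echo $((prev + 1))
}

# Record this boot and schedule the next warm boot, unless the cap is reached.
reboot_cycle() {
    count=$(next_count)
    if [ "$count" -gt "$MAX_CYCLES" ]; then
        echo "cycle limit reached"
        return 0
    fi
    echo "$count" > "$COUNT_FILE"
    echo "boot $count completed POST"
    # Run in the background so rc.local itself can finish.
    (sleep "$DELAY_SECS"; $REBOOT_CMD) >/dev/null 2>&1 &
}

# Do nothing unless explicitly armed, so the script is safe to source or dry-run.
if [ "${ARM_REBOOT_LOOP:-0}" = "1" ]; then
    reboot_cycle
fi
```

With ARM_REBOOT_LOOP=1 set, each successful boot bumps the counter and schedules the next reboot. A server that hangs during POST never reaches rc.local, so a counter that stops climbing before the cap points to a stalled warm boot.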

This was all the result of a poor interaction with a USB KVM switch. Because the KVM was our console, we were guaranteed to see the failure whenever we were watching for it. Self-fulfilling...

ewwhite
  • Wow, that's a good one! Glad you sussed this out. – nedm Jan 11 '13 at 22:53
  • Holy crow. +1 to question and answer. Good work; I probably would have overlooked that. "Known good" ? Of course it's known good - it's working, ain't it? – mfinni Jan 14 '13 at 16:11
  • Thank you very much!!! It was definitely the KVM. Just disconnect the video and plug the monitor in directly, and the server runs smoothly again. After the OS loaded, I plugged the KVM back in. I think the problem was caused when I accidentally touched the cables at the back of the server. The system halted and only reacted to this advice. –  May 23 '13 at 17:13
  • Any idea how a KVM would cause this? – TheLQ Oct 03 '13 at 13:23
  • @TheLQ A cheap consumer-level KVM device was the cause here. There may have also been a problem with the keyboard. – ewwhite Oct 10 '13 at 23:06
  • Same experience with DL380 G8 and a TrendNet KVM. Don't use KVM, USB works and installation proceeded accordingly. – Kevin_Kinsey Mar 29 '22 at 18:56