I have a new HP ProLiant DL360 G7 system that is exhibiting a difficult-to-reproduce issue. The server randomly hangs at the "Power and Thermal Calibration in Progress..." screen during the POST process. This typically follows a warm-boot/reboot from the installed operating system.
The system stalls indefinitely at this point. Issuing a reset or cold start via the iLO 3 power controls makes the system boot normally without incident.
When the system is in this state, the iLO 3 interface is fully accessible and all system health indicators are fine (all green). The server is in a climate-controlled data center with power connections to a PDU. Ambient temperature is 64°F/17°C. The system was placed in a 24-hour component testing loop prior to deployment with no failures.
The primary operating system for this server is VMware ESXi 5. We initially tried 5.0 and later a 5.1 build. Both were deployed via PXE boot and kickstart. In addition, we are testing with bare-metal Windows and Red Hat Linux installations.
HP ProLiant systems have a comprehensive set of BIOS options. We've tried the default settings in addition to the Static High Performance profile. I've disabled the boot splash screen, and with it off I just get a blinking cursor at that point instead of the "Power and Thermal Calibration in Progress..." message. We've also tried some VMware "best practices" for BIOS configuration. We've seen an advisory from HP that seems to outline a similar issue, but it did not fix our specific problem.
Suspecting a hardware issue, I had the vendor send an identical system for same-day delivery. The new server was an identical build with the exception of disks. We moved the disks from the old server to the new one and experienced the same random booting issue on the replacement hardware.
I now have both servers running in parallel. The issue hits randomly on warm boots. Cold boots don't seem to have the problem. I am looking into some of the more esoteric BIOS settings, like disabling Turbo Boost or disabling the power calibration function entirely. I could try these, but they should not be necessary.
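Because the hang is intermittent, a run of clean warm boots after changing a BIOS setting doesn't prove much unless the run is long enough. A rough sketch of the arithmetic, assuming each warm boot hangs independently with a fixed probability (I have not measured that rate; `hang_rate` and the function name below are hypothetical and only for illustration):

```python
import math


def clean_boots_needed(hang_rate: float, confidence: float = 0.95) -> int:
    """Number of consecutive clean warm boots needed before a clean
    streak is unlikely to be luck.

    Assumes each warm boot hangs independently with probability
    `hang_rate` (an assumed, unmeasured value). If the problem were
    still present, n clean boots in a row would happen with
    probability (1 - hang_rate) ** n; this returns the smallest n
    for which that probability drops to 1 - confidence or below.
    """
    if not 0.0 < hang_rate < 1.0:
        raise ValueError("hang_rate must be between 0 and 1 (exclusive)")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1 (exclusive)")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - hang_rate))


if __name__ == "__main__":
    # Example with an assumed hang rate of 1 in 10 warm boots.
    print(clean_boots_needed(0.10, 0.95))
```

With an assumed hang rate of 1 in 10, this works out to 29 clean warm boots in a row for 95% confidence, so a handful of successful reboots after toggling Turbo Boost or power calibration would not tell me much either way.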
Any thoughts?
--edit--
System details:
- DL360 G7 - 2 x X5670 Hex-Core CPUs
- 96GB of RAM (12 x 8GB Low-Voltage DIMMs)
- 2 x 146GB 15k SAS Hard Drives
- 2 x 750W redundant power supplies
All firmware up-to-date as of latest HP Service Pack for ProLiant DVD release.
Calling HP and trawling the interwebz, I've seen mentions of a bad iLO 3 interaction, but this happens with the server on a physical console, too. HP also suggested the power source as a possible cause, but this is in a data center rack that successfully powers other production systems.
Is there any chance that this could be a poor interaction between low-voltage DIMMs and the 750W power supplies? This server should be a supported configuration.