
I have an HP ProLiant DL380p Gen8 that is running VMware ESXi 5.5. It has been rebooting itself at seemingly random intervals for the past 24 hours. There is only a single VM running, and even if I shut it down the host will still reboot. The server is not running out of memory or disk space, and as far as I can tell it is not overheating. I've tried looking through the log files, but there is just so much to look at.

What are the most important steps in diagnosing this problem (including which settings to check, what files to look at, what specific message would indicate trouble, should I start pulling memory, is there a diagnostic CD that does all this for me, etc)?

I know this is a very broad question. I'm happy to provide log files if necessary to make this more specific to my situation.

nachito

1 Answer


Here are a few suggestions.

  • Is your iLO connected and configured? It will tell you exactly what's happening with the system. Please review the iLO 4 log.

  • View the system's Integrated Management Log (IML), available via iLO or the vSphere "hardware" tab.

  • Are there any indicators or error messages on the screen during crash or at POST?

  • Are you using the HP-specific install of ESXi (it includes additional drivers and tools)?

  • What version and build number of ESXi are you running?

  • If the virtual machine you're running is a Windows Server 2012 or 2008 guest, you may be running into a NIC driver bug.

  • Check your power connections. Do you have dual power supplies? Re-seat the power cables one at a time.

  • Look at the System Insight LED array on the front of the server to determine if there's an internal health problem.

enter image description here

ewwhite
  • And CALL THE VENDOR FOR SUPPORT, too. You can and should spend some time investigating yourself, but if this is an important server, it should be under a support agreement. – mfinni Aug 27 '14 at 13:42
  • I had not set up iLO, thank you very much for the suggestion. Once it was set up, I checked the log and found this: `System Overheating (Temperature Sensor 1, Location Ambient, Temperature 46C)`. I'll get it fixed straightaway. – nachito Aug 27 '14 at 14:13
  • This means that your server room or environment is too warm. This would also result in a RED light on the temperature LED in the image above. Depending on when you deployed this server, you may also want to run firmware updates on the system. – ewwhite Aug 27 '14 at 14:15
  • I think what's happening is that the exhaust from another rack is too close to the intake for this machine, since the room itself is a cool 72F. When I had my eye on the machine as it rebooted, I did see the OverTemp LED flash for a fraction of a second. I'm not surprised I never saw that before; if you blink at the wrong moment you miss it completely. – nachito Aug 27 '14 at 14:55
  • @nachito I hope you know that the iLO and server can email you health alerts, like this temperature condition... – ewwhite Aug 27 '14 at 14:56
  • Or you could install HP SIM, which is free. – mfinni Aug 27 '14 at 19:57
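
For anyone who ends up with a long log to search, the overheating entry quoted in the comments can be picked out with a small script. This is only a rough sketch: it assumes the log has been saved as plain text lines and that entries look like the one quoted above. The regular expression, the function names, and the sample lines are all made up for illustration, and a real iLO or IML export may be formatted differently.

```python
import re

# Pattern modeled on the single entry quoted in the comments. The real
# log export format may differ, so treat this as an assumption.
ENTRY_RE = re.compile(
    r"System Overheating \(Temperature Sensor (?P<sensor>\d+), "
    r"Location (?P<location>[^,]+), "
    r"Temperature (?P<temp_c>\d+)C\)"
)


def find_overheat_events(lines):
    """Return one dict per line that contains an overheating entry."""
    events = []
    for line in lines:
        match = ENTRY_RE.search(line)
        if match:
            events.append({
                "sensor": int(match.group("sensor")),
                "location": match.group("location").strip(),
                "temp_c": int(match.group("temp_c")),
            })
    return events


def fahrenheit_to_celsius(temp_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9


if __name__ == "__main__":
    # Hypothetical sample lines; only the first mirrors the quoted entry.
    sample = [
        "System Overheating (Temperature Sensor 1, Location Ambient, Temperature 46C)",
        "Some unrelated entry",
    ]
    room_c = fahrenheit_to_celsius(72)  # room temperature from the comments
    for event in find_overheat_events(sample):
        delta = event["temp_c"] - room_c
        print(
            f"Sensor {event['sensor']} ({event['location']}): "
            f"{event['temp_c']}C, {delta:.1f}C above the room"
        )
```

The 72F room works out to roughly 22C, so an ambient reading of 46C at the intake sensor is far above the room temperature, which fits the theory that hot exhaust from a neighboring rack is being pulled in.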