
As part of provisioning servers, we run HP's Insight Diagnostics to test the hardware. This is a manual process. Is there a way to automate running Insight Diagnostics?

There is the hpdiags utility with the option "-rd:" ("Run a diagnosis of all diagnosable devices"). In my testing this doesn't do much (it just reads the SMART info from the disks). Has anybody had better luck with it?
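For reference, here is a minimal sketch of how such a run could be wrapped from a provisioning script. It assumes hpdiags is installed and on PATH on the target host and accepts the "-rd:" switch quoted above; the "-o" output flag is an assumption and may differ between hpdiags versions.

```python
#!/usr/bin/env python3
"""Sketch: kick off hpdiags from a provisioning script.

Assumptions: hpdiags is on PATH and accepts "-rd:" as quoted above;
the "-o <file>" output flag is an assumption and may vary by version.
"""
import subprocess
import sys

def run_hpdiags(report_path="/tmp/hpdiags-report.txt"):
    # "-rd:" = "Run a diagnosis of all diagnosable devices" (per the hpdiags help text)
    cmd = ["hpdiags", "-rd:", "-o", report_path]  # "-o" is an assumed flag
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        sys.stderr.write(result.stderr)
    return result.returncode, report_path

if __name__ == "__main__":
    rc, report = run_hpdiags()
    print(f"hpdiags exited {rc}, report (if any) at {report}")
    sys.exit(rc)
```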

Hardware: HP BladeSystem c7000 enclosures with ProLiant BL460c blades, plus DL360s.

OS: ESXi and Ubuntu.

  • Short answer is that I don't bother to do this in large environments. The monitoring and onboard diagnostics are enough. But can you provide some information about the server models you're using? And maybe the operating systems involved. – ewwhite Feb 11 '15 at 18:58
  • I updated the ticket with the requested info. – Mark Wagner Feb 11 '15 at 21:04
  • Are you installing HP-specific versions of ESXi? Are you installing HP Management Agents on the Ubuntu systems? Which generation(s) are the servers? G6? G7? Gen8? – ewwhite Feb 11 '15 at 21:09
  • The HP management agents are installed on both ESXi and Ubuntu. The servers are Gen8 and will be Gen9. – Mark Wagner Feb 11 '15 at 21:15
  • `I updated the ticket with the requested info` - That made me laugh. This isn't the helpdesk. – joeqwerty Feb 12 '15 at 03:43
  • @joeqwerty sometimes it is much better – andreikashin Oct 11 '20 at 14:56

1 Answer


So, I'll pose another question:

Why is it necessary to run HP Insight hardware diagnostics on servers prior to provisioning?

In my comment above, I indicated that there's little to gain by doing this preemptively in large HP ProLiant environments. I should clarify my thoughts on that...

In order of descending frequency, let's look at the types of issues you'll typically encounter:

  • Storage array and disks: The RAID controller will report to the OS, logs, SNMP, email, ILO and light up pretty lights to indicate health.

  • RAM: The POST process will detect RAM status, as well as the system reporting to the OS, logs, SNMP, email, ILO and lighting up an LED indicator on the front panel Systems Insight Display (SID). Also, I'm not a fan of RAM burn-in processes because the error detection of these systems is already robust.

  • Thermal and fans: Server temperature and fan speed are regulated by the ILO. There are 30+ temperature sensors on these systems, so cooling can be regulated very precisely. This still reports to the OS, logs, SNMP, email and on the SID.

  • Power supply: PSU status is reported to the OS, logs, SNMP, email and on the SID, as well as by an indicator light on the power supply unit itself.

  • Overall health: This is easy to assess at a glance from the SID, in addition to the Internal Health and External Health LEDs. It is also reported to the server's logs, SNMP, email and ILO. (A polling sketch follows this list.)
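As a rough illustration of leaning on that reporting instead of offline diagnostics, here is a sketch that polls the overall health conditions the HP agents expose over SNMP. It assumes net-snmp's snmpget is installed and the agents are loaded on the target; the OIDs, the condition mapping and the community string are assumptions recalled from the Compaq health MIBs and should be verified against the MIBs shipped with the agents.

```python
#!/usr/bin/env python3
"""Sketch: poll overall health via the HP agents' SNMP tree.

Assumptions: net-snmp's snmpget is installed and the HP management agents
are running on the target. The OIDs, condition mapping and community
string below are assumptions; verify against the shipped MIBs.
"""
import subprocess

# Assumed OIDs under the Compaq/HP enterprise tree (1.3.6.1.4.1.232)
OIDS = {
    "overall health (cpqHeMibCondition)": ".1.3.6.1.4.1.232.6.1.3.0",
    "drive array (cpqDaMibCondition)":    ".1.3.6.1.4.1.232.3.1.3.0",
}
CONDITION = {"1": "other", "2": "ok", "3": "degraded", "4": "failed"}

def check_host(host, community="public"):
    for name, oid in OIDS.items():
        out = subprocess.run(
            ["snmpget", "-v2c", "-c", community, "-Oqv", host, oid],
            capture_output=True, text=True,
        )
        value = out.stdout.strip()
        print(f"{host} {name}: {CONDITION.get(value, value or out.stderr.strip())}")

if __name__ == "__main__":
    check_host("blade01.example.com")  # hypothetical hostname
```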


I can't think of any conditions that would be found pre-deployment that wouldn't/couldn't be reported during runtime or post OS install.

The diagnostics loop usually won't find anything when run on a system with no obvious prior issues, mainly because the server has to POST and boot into the diagnostics utility or the Intelligent Provisioning firmware before the diagnostics can run at all.

Put another way, any item that would be a serious "SPOF" for the server would probably prevent the system from running its self-diagnostics.

The most common failure items are still fairly robust; disks should be in RAID and are hot-swappable. Fans and power supplies are also hot-swappable. Your RAM has ECC thresholds, and there are online spare options for most ProLiant platforms. There's nothing you'll be able to do to induce failure in these components by running diagnostics. Add to that the fact that you're using HP c7000 blade enclosures, which have internal redundancies, and your incidence of failure should be pretty low.
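If you still want a scripted sanity check at provisioning time without booting into a diagnostics environment, you can query the same data online through the installed agents. A sketch, assuming hpasmcli (from the HP management agents) is present on the Ubuntu hosts and run as root; the exact command set may vary by agent version.

```python
#!/usr/bin/env python3
"""Sketch: quick online sanity check through hpasmcli (HP management agents).

Assumptions: hpasmcli is installed and run with root privileges; the
command names below may vary between agent versions.
"""
import subprocess

CHECKS = ["show server", "show dimm", "show powersupply", "show fans"]

def hpasmcli_report():
    for check in CHECKS:
        out = subprocess.run(
            ["hpasmcli", "-s", check],  # -s runs a single command non-interactively
            capture_output=True, text=True,
        )
        print(f"### {check}\n{out.stdout.strip() or out.stderr.strip()}\n")

if __name__ == "__main__":
    hpasmcli_report()
```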

  • The problem is if (a) fault is detected post OS install (i.e. server is in production), (b) repair cannot be done online or the failed component is a SPOF for the server, and (c) server is a SPOF, then you will experience downtime (either immediately or when system is taken down to repair). To prevent the conclusion you need to prevent one of the conditions. I was going for (a) by detecting the fault before production. I appreciate your thoroughness in detailing the reporting abilities but I'm looking to prevent the need to report them in the first place because they don't happen. – Mark Wagner Feb 16 '15 at 01:29
  • An HP diagnostics loop likely won't find anything, considering the server needs to [POST](http://en.wikipedia.org/wiki/Power-on_self-test) and boot into the utility or Intelligent Provisioning in order to run diagnostics. The most common failure items are pretty robust; disks, fans and power supplies are hot-swappable, RAM has ECC thresholds. There's nothing you'll be able to do to induce failure in these components. – ewwhite Feb 16 '15 at 01:57