I have a server with an HP Smart Array hardware RAID controller. To monitor its status, I use cpqarrayd. /etc/default/cpqarrayd contains DAEMON_OPTS="-t localhost:162" so that SNMP traps are sent when something happens. The traps are handled by snmptrapd, and /etc/snmp/snmptrapd.conf contains:

disableAuthorization yes
traphandle default mailx -s "SNMP Trap" admin@example.com

The e-mails received this way contain the SNMP traps, but they are not human-readable: it is impossible to tell what they are about, or even whether they were issued by cpqarrayd. Is it possible to send human-readable e-mails when the RAID status changes?

Solution

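I placed the following script in cron.hourly. It mails the current cciss_vol_status output whenever it differs from the output of the previous run: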

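#!/bin/sh
# Mail the admin whenever cciss_vol_status output differs from the previous run.

CCISS_DEVICE=/dev/cciss/c0d1
STATUS_FILE=/var/cciss_vol_status
# mktemp replaces $TMPDIR/status-$$.$RANDOM: $RANDOM is not available in
# plain sh, and $TMPDIR may be unset when run from cron.
TMP_FILE=$(mktemp "${TMPDIR:-/tmp}/status-XXXXXX") || exit 1

# Keep the previous status for comparison; on the first run there is none.
[ -f "$STATUS_FILE" ] && mv "$STATUS_FILE" "$TMP_FILE"
cciss_vol_status "$CCISS_DEVICE" >"$STATUS_FILE"

if ! cmp -s "$STATUS_FILE" "$TMP_FILE" ; then
    mailx -s "CCISS status changed" admin@example.com <"$STATUS_FILE"
fi

rm -f "$TMP_FILE"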
Michael Ivko

2 Answers

First, see: How do I get my HP servers to email me when a drive fails?

In short, the HP SNMP Management Agents, installed as part of the Service Pack for ProLiant or the Management Component Pack (Debian), provide the proper alerts for the system's health. This includes traps for disks, the array controller, fans, temperature, power supplies, iLO, NICs, etc.

This is fully supported under Debian. You will find the downloads in the HP Software Delivery Repository.

Two parts to this (configured automatically by the installer):

In your snmpd.conf file:

# Following entries were added by HP Insight Management Agents at
#      Thu Mar 18 04:14:43 PDT 2010
dlmod cmaX /usr/lib64/libcmaX64.so

That registers the HP health agents with SNMP.

And the /opt/hp/hp-snmp-agents/cma.conf file:

############################################################
#
# cma.conf: HP Insight Management Agents configuration file
#
############################################################

########################################################################
# trapemail is used for configuring email command(s) which will be
# executed whenever a SNMP trap is generated.
# Multiple trapemail lines are allowed.
# Note: any command that reads standard input can be used. For example:
#             trapemail /usr/bin/logger
#       will log trap messages into system log (/var/log/messages).
########################################################################
trapemail /bin/mail -s 'HP Insight Management Agents Trap Alarm' alerts@brazzers.com

Typical RAID alert emails will look like:

Trap-ID=3040

Accelerator Board Battery status change, slot number: 1.
Battery failed. Status: Failed..

or

Trap-ID=3034

Logical Drive Status Change: Slot 1, Drive: 2.Status is now Rebuilding.

or

Trap-ID=3034

Logical Drive Status Change: Slot 1, Drive: 1.Status is now OK.

EDIT:

It appears you're having difficulty with a 100-series ProLiant, HP Health agents and Debian. This is a supported solution, but depending on how you've installed and configured the solution, you may have problems. Given that, you can probably just install the cciss_vol_status utility and run a periodic check via cron.
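
A minimal sketch of such a periodic check, as an alternative to comparing against a saved status file. It assumes that cciss_vol_status exits non-zero when any queried logical drive is not OK (check the man page for your version before relying on this). check_raid is a hypothetical helper name; the device path and address are the ones from the question.

```shell
#!/bin/sh
# check_raid CMD [ARGS...]
# Runs the given status command and prints its output only when the
# command exits non-zero. Returns the command's exit status.
# Assumption: the status tool signals a non-OK volume via its exit code.
check_raid() {
    output=$("$@" 2>&1)
    rc=$?
    if [ "$rc" -ne 0 ]; then
        printf '%s\n' "$output"
    fi
    return "$rc"
}

# Example use from a cron script (device and address taken from the question):
#   report=$(check_raid cciss_vol_status /dev/cciss/c0d1) ||
#       printf '%s\n' "$report" | mailx -s "CCISS status not OK" admin@example.com
```

Unlike a change-detection script, this mails on every run for as long as the array is not OK, so it suits a less frequent cron interval.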

ewwhite
  • Installing hp-health fails for me with `Error: No supported management controller found`. Probably the DL180 is not supported. What's funny is that uninstalling it fails with the same error. – Michael Ivko Apr 28 '14 at 06:36
  • @MichaelIvko Please provide the specific server model and generation, plus your OS distribution and version. – ewwhite Apr 28 '14 at 10:57
  • HP Proliant DL180, Debian Wheezy – Michael Ivko Apr 28 '14 at 11:14
  • @MichaelIvko It doesn't matter. See [the following link](http://gaganonthenet.com/2012/08/22/fix-for-hp-health-on-dl100-series-running-centos6/) and adapt for your Debian purposes. – ewwhite Apr 28 '14 at 11:31
  • I have a different error, the same as in [this](http://serverfault.com/questions/576999/hp-health-install-doesnt-work-on-debian-7-wheezy) unanswered question. There's no segfault, and mcelogd isn't running. And I am wary of relying on software that cannot even be uninstalled without modifying an initscript by hand. I'll probably have to figure out another solution. – Michael Ivko Apr 28 '14 at 12:42
  • @MichaelIvko See my update. – ewwhite Apr 28 '14 at 13:06
  • Apparently hp-health works only for servers of Generation 6 or higher. In my case `sudo dmidecode | grep "Product Name"` for whatever reason doesn't output a generation, so the script fails. – Michael Ivko Apr 28 '14 at 13:08
  • @MichaelIvko I just tried on a G5 and G6 DL180... `Product Name: ProLiant DL180 G5` and `Product Name: ProLiant DL180 G6`. Hmm. – ewwhite Apr 28 '14 at 13:12
  • Well, it is an ancient server, maybe it's unsupported after all. Anyway, running cciss_vol_status in cron does the job. – Michael Ivko Apr 29 '14 at 07:23

snmptt (SNMP Trap Translator) is a great little tool for this. You can teach it typical OIDs and messages and translate them to some sensible message. Take a look and see if it's any good for your needs.

EDIT: Oh, if you don't already have one, download an SNMP MIB for your device and put it in the /usr/share/snmp/mibs directory. Then restart snmpd and snmptrapd.
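
If a full translator is more than you need, a small traphandle script can at least lay the trap out readably before mailing it. The sketch below is not snmptt; it relies on the input format that snmptrapd passes to a traphandle program on standard input (the sender's host name, then its transport address, then one "OID value" pair per line). format_trap and the script path in the comments are hypothetical names. How readable the OIDs are still depends on the MIB being installed.

```shell
#!/bin/sh
# format_trap: read a trap in snmptrapd's traphandle format from stdin
# (line 1: host name, line 2: transport address, then "OID value" lines)
# and print it as an indented, labelled message.
format_trap() {
    IFS= read -r host
    IFS= read -r addr
    printf 'SNMP trap from %s (%s)\n' "$host" "$addr"
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while read -r oid value || [ -n "$oid" ]; do
        [ -n "$oid" ] || continue
        printf '  %s = %s\n' "$oid" "$value"
    done
}

# Example use (hypothetical path), in /etc/snmp/snmptrapd.conf:
#   traphandle default /usr/local/bin/trap-mail.sh
# where trap-mail.sh defines format_trap as above and ends with:
#   format_trap | mailx -s "SNMP Trap" admin@example.com
```

With the vendor MIB in place, the variable names in each line should appear symbolically rather than as numeric OIDs, which is usually enough to tell which daemon sent the trap.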

Janne Pikkarainen