nagios check_crash || how to detect when a server has crashed and rebooted?

Question

Thanks to the Intel TCO watchdog, some servers I manage now reboot automatically after a kernel or hardware crash, and the init scripts are now even 'rebootsafe'. Sadly, this means I no longer get a notification from Nagios when a machine has crashed, because the service is back up before the checks have failed enough times to send a notification.

Is there a reliable script or Nagios check out there that will notify me if, say, the machine has crashed 3 times during the last 48 hours?

score 1 · Accepted Answer · answered Feb 20 '12 at 15:16


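How about you write one? An easy way would be to run uptime in the script and alert when the reported uptime is suspiciously short. A slightly better way would be to add an initscript that echoes the boot time to a rotating logfile. Grab the last three entries in the file and check the elapsed time since the first of them; if it is shorter than your window (48 hours in your example), raise an alert.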
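A rough sketch of the logfile variant, in case it helps. It assumes an init script appends one epoch timestamp per boot (for example the output of `date +%s`) to a file; the path `/var/log/boot-times.log` and the function names are made up for illustration. The exit codes are the usual Nagios plugin ones (0 OK, 2 CRITICAL, 3 UNKNOWN). Note that it counts every boot, planned or not.

```python
#!/usr/bin/env python3
"""Sketch: alert when the last N boots all fall inside a time window."""
import time

# Hypothetical path; an init script is assumed to append one epoch
# timestamp per boot to this file.
BOOT_LOG = "/var/log/boot-times.log"
WINDOW_SECONDS = 48 * 3600  # the 48-hour window from the question
MAX_BOOTS = 3               # alert once this many boots fit in the window

OK, CRITICAL, UNKNOWN = 0, 2, 3  # standard Nagios plugin exit codes


def parse_stamps(lines):
    """Turn log lines into a list of float timestamps, skipping junk."""
    stamps = []
    for line in lines:
        try:
            stamps.append(float(line.strip()))
        except ValueError:
            continue  # blank or malformed line
    return stamps


def boots_exceed_limit(stamps, now, max_boots=MAX_BOOTS, window=WINDOW_SECONDS):
    """True if the last `max_boots` boots all happened within `window` seconds."""
    recent = sorted(stamps)[-max_boots:]
    if len(recent) < max_boots:
        return False
    # Elapsed time since the oldest of the last `max_boots` entries.
    return now - recent[0] <= window


def main():
    """Print a Nagios status line and return the matching exit code."""
    try:
        with open(BOOT_LOG) as handle:
            stamps = parse_stamps(handle)
    except OSError as exc:
        print(f"UNKNOWN - cannot read {BOOT_LOG}: {exc}")
        return UNKNOWN
    if boots_exceed_limit(stamps, time.time()):
        print(f"CRITICAL - {MAX_BOOTS} or more boots within {WINDOW_SECONDS // 3600}h")
        return CRITICAL
    print("OK - no reboot burst detected")
    return OK
```

To use it as a plugin, end the script with `import sys; sys.exit(main())` and call it through NRPE or whatever remote-execution method you already have.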


Michael Lowman


The initscript method actually sounds worthwhile. I shall give it a try. – ZaphodB Feb 21 '12 at 08:53

score 1 · Answer 2 · answered Feb 20 '12 at 19:24


There are a number of "check_uptime" variants on Nagios Exchange. These let you catch quick reboots without lowering max_check_attempts to 1 or 2 for the host check, so you avoid the false positives that such a low setting would cause.

This one, for example, can be run via NRPE (uses uptime), but can also check via SNMP (Linux, Windows, etc.).
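For reference, the core of an uptime-based check is small. The following is only a minimal sketch of the idea, not the code of any plugin on Nagios Exchange: it assumes a Linux host where the first field of `/proc/uptime` is the uptime in seconds, and the 1800-second threshold and function names are arbitrary choices for illustration. The threshold should be longer than your check interval, or a reboot can slip through between two checks.

```python
#!/usr/bin/env python3
"""Sketch: report CRITICAL when the host has been up for less than a threshold."""

MIN_UPTIME_SECONDS = 1800  # hypothetical threshold; keep it above the check interval

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def parse_uptime(text):
    """Return uptime in seconds from the contents of /proc/uptime."""
    # The first whitespace-separated field is seconds since boot.
    return float(text.split()[0])


def evaluate_uptime(uptime_seconds, min_uptime=MIN_UPTIME_SECONDS):
    """Map an uptime value to a (exit_code, message) pair."""
    if uptime_seconds < min_uptime:
        return CRITICAL, f"CRITICAL - up for only {int(uptime_seconds)}s, recent reboot"
    return OK, f"OK - up for {int(uptime_seconds)}s"


def main():
    """Print a Nagios status line and return the matching exit code."""
    with open("/proc/uptime") as handle:
        status, message = evaluate_uptime(parse_uptime(handle.read()))
    print(message)
    return status
```

As with any plugin, the script would finish with `import sys; sys.exit(main())` so that Nagios sees the exit code.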


Keith


Ah yes, I should have searched for 'check_uptime' rather than for nagios/uptime/crash. Thank you. – ZaphodB Feb 21 '12 at 08:50
