
Thanks to the Intel TCO watchdog, some servers I manage now reboot automatically after a kernel or hardware crash, and the init scripts are now even 'reboot-safe'. Sadly, this means I no longer get a notification from Nagios when a machine has crashed, because the service is back up before enough consecutive checks have failed to trigger a notification.

Is there a reliable script or Nagios check that will notify me if, say, a machine has crashed 3 times within the last 48 hours?

ZaphodB

2 Answers


How about you write one? An easy way would be to run uptime in the script. A slightly better way would be to add an init script that echoes the boot time to a rotating log file. Grab the last three entries in the file and check the time elapsed since the first of them.
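
A minimal sketch of the log-based idea in Python, under stated assumptions: an init script appends one epoch timestamp per line on every boot (for example with `date +%s`), and the path `/var/log/boot-times.log` as well as the names `read_boot_times` and `check_boots` are made up for illustration. The return codes follow the usual Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
import time

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # usual Nagios plugin exit codes

# Hypothetical path: an init script is assumed to append one epoch
# timestamp per line to this file on every boot.
BOOT_LOG = "/var/log/boot-times.log"


def read_boot_times(path=BOOT_LOG):
    """Return the boot timestamps recorded in the log file."""
    with open(path) as handle:
        return [float(line) for line in handle if line.strip()]


def check_boots(boot_times, now, max_boots=3, window_hours=48):
    """Return (exit_code, message) for a list of boot timestamps."""
    if len(boot_times) < max_boots:
        return OK, "OK: only %d boot(s) on record" % len(boot_times)
    # Oldest of the most recent `max_boots` entries.
    oldest = sorted(boot_times)[-max_boots]
    elapsed_hours = (now - oldest) / 3600.0
    if elapsed_hours <= window_hours:
        return CRITICAL, "CRITICAL: %d boots within %.1f hours" % (
            max_boots, elapsed_hours)
    return OK, "OK: last %d boots span %.1f hours" % (max_boots, elapsed_hours)


def main():
    """Run the check against the log file and return the exit code."""
    try:
        code, message = check_boots(read_boot_times(), time.time())
    except (OSError, ValueError) as exc:
        code, message = UNKNOWN, "UNKNOWN: cannot read boot log: %s" % exc
    print(message)
    return code
```

To use it as a plugin, end the script with `sys.exit(main())` (after `import sys`) so Nagios or NRPE sees the exit code. Note that this counts every boot, planned reboots included, so the threshold may need adjusting around maintenance windows.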

Michael Lowman

There are a number of "check_uptime" variants on Nagios Exchange. They catch quick reboots without requiring you to lower max_check_attempts to 1 or 2 for the host check, which avoids the false positives such a low setting would cause.

This one, for example, can be run via NRPE (using uptime), but can also check via SNMP (Linux, Windows, etc.).
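
To illustrate the general idea behind such a check (this is not the code of any particular plugin on Nagios Exchange), here is a rough Python sketch for Linux that reads `/proc/uptime` and alerts while the uptime is below a threshold. The function names and the default thresholds of 15 and 60 minutes are assumptions chosen for the example.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # usual Nagios plugin exit codes


def parse_proc_uptime(text):
    """Extract uptime in seconds from the contents of /proc/uptime.

    The file holds two numbers; the first is seconds since boot.
    """
    return float(text.split()[0])


def check_uptime(uptime_seconds, warn_minutes=60, crit_minutes=15):
    """Return (exit_code, message); thresholds are illustrative defaults."""
    minutes = uptime_seconds / 60.0
    if minutes < crit_minutes:
        return CRITICAL, "CRITICAL: up only %.0f minute(s)" % minutes
    if minutes < warn_minutes:
        return WARNING, "WARNING: up only %.0f minute(s)" % minutes
    return OK, "OK: up %.0f minute(s)" % minutes


def main():
    """Read the local uptime and return the plugin exit code."""
    try:
        with open("/proc/uptime") as handle:
            uptime_seconds = parse_proc_uptime(handle.read())
    except (OSError, ValueError, IndexError) as exc:
        print("UNKNOWN: cannot read uptime: %s" % exc)
        return UNKNOWN
    code, message = check_uptime(uptime_seconds)
    print(message)
    return code
```

For the alert to be seen, the warning window presumably has to be longer than the service's check interval multiplied by its max_check_attempts; otherwise the uptime passes the threshold before a notification is sent.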

Keith
  • Ah yes, I should have searched for 'check_uptime' rather than for nagios/uptime/crash. Thank you. – ZaphodB Feb 21 '12 at 08:50