Nagios requires a check's notification interval to be greater than or equal to its check interval. This prevents Nagios from sending false-alarm notifications should a service return to an UP status between checks. I understand the reasoning behind that.

We have a number of checks that run every 30 minutes. This means that if a check fails, only one notification is sent each time the service is checked once the retries are used up.

What I need is to keep pestering the duty admin's pager every two minutes after a check has gone HARD DOWN/CRITICAL. I can't do this because the next notification only goes out on the next check, i.e. in another 30 minutes.

A feature we had on our old monitoring system was to set a new lower check interval as soon as the check had gone HARD DOWN/CRITICAL. This meant we could keep rechecking every two minutes (and sending alerts) until the alert was acknowledged by a human or changed its status to UP, after which the check interval would revert to 30 minutes.

Is there a way to do this in Nagios?

I've had some thoughts about writing an event handler which will reschedule a check for two minutes in the future after a check has gone HARD DOWN/CRITICAL (by directly sending a command to Nagios).
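For illustration, the rescheduling idea could look roughly like the sketch below. It only builds the external command line; `SCHEDULE_FORCED_SVC_CHECK` is a standard Nagios external command, but the helper name `forced_check_line`, the `NAGIOS_CMD_FILE` variable, the command-file path and the host/service names are assumptions made for this example.

```shell
#!/bin/bash
# Assumed command-file location; adjust it for your installation.
commandfile="${NAGIOS_CMD_FILE:-/usr/local/nagios/var/rw/nagios.cmd}"

# Print a SCHEDULE_FORCED_SVC_CHECK line.
#   $1 host name, $2 service description,
#   $3 delay in seconds, $4 current Unix timestamp
forced_check_line() {
    printf '[%s] SCHEDULE_FORCED_SVC_CHECK;%s;%s;%s\n' \
        "$4" "$1" "$2" "$(( $4 + $3 ))"
}

# Example with hypothetical names, rechecking in two minutes:
# forced_check_line host1 service1 120 "$(date +%s)" > "$commandfile"
```

Taking the timestamp as an argument keeps the helper easy to check by hand before pointing it at the real command file.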

I'm wondering if anyone else has had to do a similar thing?

I'm running Nagios Core 3.2.3.

Kev

1 Answer

You can do it with the external commands CHANGE_NORMAL_SVC_CHECK_INTERVAL and CHANGE_NORMAL_HOST_CHECK_INTERVAL.

Add an event handler for your service:

define service {
    host_name              ...
    service_description    ...
    check_command          ...
    contact_groups         ...
    event_handler          change_check_interval
}

The change_check_interval command is defined in commands.cfg:

define command {
    command_name    change_check_interval
    command_line    $USER1$/eventhandlers/change_check_interval.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $HOSTADDRESS$
}

The content of change_check_interval.sh:

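#!/bin/bash
# Event handler: shorten the check interval once the service goes CRITICAL.
# Arguments (from the command definition): $1 state, $2 state type,
# $3 attempt number, $4 host address. Only $1 is used here.

now=$(date +%s)
commandfile='/usr/local/nagios/var/rw/nagios.cmd'

case "$1" in
    OK)
        ;;
    WARNING)
        ;;
    UNKNOWN)
        ;;
    CRITICAL)
        # host1 and service1 are placeholders for the real host and service names.
        # The final field sets the new normal check interval to 2 time units.
        /bin/printf "[%lu] CHANGE_NORMAL_SVC_CHECK_INTERVAL;host1;service1;2\n" "$now" > "$commandfile"
        ;;
esac

exit 0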

Make sure that external commands are enabled in nagios.cfg:

check_external_commands=1
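
If the handler seems to do nothing, it may be worth confirming that the command file exists as a named pipe and is writable by the user running the handler. A minimal sketch, assuming the same command-file path as in the script above (the function name `cmdfile_ready` is made up for this example):

```shell
#!/bin/bash
# Return 0 if the given path is a named pipe (FIFO) that the current
# user can write to, non-zero otherwise.
cmdfile_ready() {
    [ -p "$1" ] && [ -w "$1" ]
}

# Example, using the path from the handler script:
# cmdfile_ready /usr/local/nagios/var/rw/nagios.cmd || echo "command file not usable"
```

Run it as the same user Nagios runs as, since that is the user that executes event handlers.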
quanta
  • This is what I was thinking about when I toddled off to bed after asking that question last night. Will go and try this now and report back. – Kev Sep 07 '11 at 07:55
  • With the inclusion of a couple of macro values in my service templates - `ALERTINTERVAL` and `ORIGINALINTERVAL` - which are passed along with the host/service names - this is working out very nicely. I reset the check interval back to the original value (`ORIGINALINTERVAL`) when the service goes HARD/OK. Nagios is very cool. – Kev Sep 07 '11 at 17:10
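
The variant described in the last comment, with the host and service names and both intervals passed in, and the interval reset on HARD/OK, might look something like the sketch below. The argument order, the function names and the `NAGIOS_CMD_FILE` variable are assumptions for illustration; how the values reach the script (for example via `$HOSTNAME$`, `$SERVICEDESC$` and custom macros in the command definition) is left to the configuration.

```shell
#!/bin/bash
# Sketch only. Assumed argument order:
#   $1 service state      $2 state type (SOFT or HARD)
#   $3 host name          $4 service description
#   $5 alert interval     $6 original interval
commandfile="${NAGIOS_CMD_FILE:-/usr/local/nagios/var/rw/nagios.cmd}"

# Print one CHANGE_NORMAL_SVC_CHECK_INTERVAL command line.
#   $1 host name, $2 service description, $3 new interval
interval_line() {
    printf '[%s] CHANGE_NORMAL_SVC_CHECK_INTERVAL;%s;%s;%s\n' \
        "$(date +%s)" "$1" "$2" "$3"
}

# Act only on HARD states: shorten the interval on CRITICAL,
# restore it on OK, and ignore everything else.
handle_event() {
    case "$1/$2" in
        CRITICAL/HARD) interval_line "$3" "$4" "$5" > "$commandfile" ;;
        OK/HARD)       interval_line "$3" "$4" "$6" > "$commandfile" ;;
    esac
}

handle_event "$@"
```

Matching on the state and state type together means SOFT failures during the retries leave the normal 30-minute interval untouched.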