
This might be a very basic question, but I am not very familiar with the exact features of Nagios versus Munin versus other monitoring tools.

Let's say we have a process that needs to run daily for some very important infrastructure reasons. We've had cases where the process did not run or was otherwise down for a number of days before anyone noticed.

I'd like to set up a system that will enable me to easily know when the daily run did not take place for some reason.

I can set up this process to send an email on every successful run (or every failed run), but I do not trust that the people receiving this email would notice an absence of an "I'm OK" message.

What I am envisioning is some type of "tripwire" service which this V.I.P. (very-important-process) can send a status message to each time it runs, whether successfully or not; and if the "tripwire" service has not received any word from the VIP within a configurable amount of time, it can then send an alert to someone.

(The difference between what I envision and the first approach I outlined is a service that sends a message only in abnormal conditions, rather than a service that sends messages each day that the status is normal/OK).

Can Nagios be set up to send an alert like this, if it has not heard from a certain service/device/process in N days? Is there another tool out there which does have this feature?

matt b

6 Answers


Nagios supports exactly what you want. Take a look at passive checks and freshness. Basically, you define a host and service for your job, and tell Nagios that the service is passive and has a specific freshness threshold (e.g. 26 hours). Whenever your process runs, have it submit an "OK" result to Nagios. Nagios keeps track of when results are submitted, and if none arrives within 26 hours, it runs the service's check command; if that command reports a problem (which you can make it do unconditionally), Nagios sends a notification.

There's an example at that page.
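If the job happens to run on the Nagios server itself, it can submit the passive result straight through Nagios's external command file instead of going over the network. Below is a minimal sketch under that assumption: the script name, the command-file path (a common default for source installs; packages differ), and the `VIP_Host_Name`/`VIP_Health` names are placeholders, and `check_external_commands=1` must be enabled in `nagios.cfg`.

```shell
#!/bin/bash
# submit_passive.sh: hypothetical helper. Assumes the VIP runs on the Nagios
# server and may write to the external command file. The default path is an
# assumption (typical for a source install); adjust it for your setup.
NAGIOS_CMD_FILE="${NAGIOS_CMD_FILE:-/usr/local/nagios/var/rw/nagios.cmd}"

# submit_passive HOST SERVICE RETURN_CODE OUTPUT
# Appends one PROCESS_SERVICE_CHECK_RESULT external command. Each accepted
# result updates the service's last check time, restarting the freshness clock.
submit_passive() {
  local host="$1" service="$2" code="$3" output="$4"
  printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
    "$(date +%s)" "$host" "$service" "$code" "$output" >> "$NAGIOS_CMD_FILE"
}

# When executed with a command line, run it and report OK (0) or CRITICAL (2).
if [[ "${BASH_SOURCE[0]}" == "$0" && $# -gt 0 ]]; then
  if "$@"; then
    submit_passive VIP_Host_Name VIP_Health 0 "OK: VIP completed"
  else
    submit_passive VIP_Host_Name VIP_Health 2 "CRITICAL: VIP failed"
  fi
fi
```

For example, `./submit_passive.sh /usr/local/bin/vip-job` (hypothetical job path) runs the job and records OK or CRITICAL. If cron never starts the job at all, nothing is written, and the freshness threshold is what raises the alert.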

jon

Nagios just runs a command and looks at the result code. This means that Nagios can monitor just about anything, assuming you can write a command that will return the appropriate status.
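To make that contract concrete, here is a minimal, hypothetical plugin. Nagios treats exit code 0 as OK, 1 as WARNING, 2 as CRITICAL and 3 as UNKNOWN, and shows the first line of output as the status text; the script name and messages are only illustrative.

```shell
#!/bin/bash
# check_wrapper.sh: hypothetical plugin showing the contract Nagios relies on.
# Exit code selects the state; the first stdout line is the status text.
STATE_NAMES=(OK WARNING CRITICAL UNKNOWN)

# emit CODE MESSAGE: print "STATE: MESSAGE" and return CODE.
emit() {
  echo "${STATE_NAMES[$1]}: $2"
  return "$1"
}

# check_with COMMAND...: OK if the command succeeds, CRITICAL if it fails,
# UNKNOWN if no command was given.
check_with() {
  if [[ $# -eq 0 ]]; then
    emit 3 "no command given"
    return
  fi
  if "$@" >/dev/null 2>&1; then
    emit 0 "'$*' succeeded"
  else
    emit 2 "'$*' failed"
  fi
}

if [[ "${BASH_SOURCE[0]}" == "$0" && $# -gt 0 ]]; then
  check_with "$@"
  exit $?
fi
```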

In your case, if your process can write to a file, you can use the stock Nagios check_file_age plugin, which will alert if a file has not been modified in a certain amount of time.
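A rough sketch of what such a file-age check does, assuming the VIP ends each successful run with `touch /var/tmp/vip.stamp` (an assumed path) and that GNU coreutils is available for `stat -c %Y`. This is a hypothetical stand-in to show the logic, not the stock plugin's code.

```shell
#!/bin/bash
# check_stamp_age.sh: hypothetical plugin approximating check_file_age.
# Assumes GNU coreutils: `stat -c %Y FILE` prints the mtime in epoch seconds.

# check_stamp_age FILE WARN_SECS CRIT_SECS
check_stamp_age() {
  local file="$1" warn="$2" crit="$3" mtime age
  if [[ ! -e "$file" ]]; then
    echo "CRITICAL: $file does not exist"
    return 2
  fi
  mtime=$(stat -c %Y "$file") || { echo "UNKNOWN: cannot stat $file"; return 3; }
  age=$(( $(date +%s) - mtime ))
  if (( age > crit )); then
    echo "CRITICAL: $file is $age seconds old"
    return 2
  elif (( age > warn )); then
    echo "WARNING: $file is $age seconds old"
    return 1
  fi
  echo "OK: $file is $age seconds old"
}

if [[ "${BASH_SOURCE[0]}" == "$0" && $# -eq 3 ]]; then
  check_stamp_age "$@"
  exit $?
fi
```

With the stock plugin the equivalent is roughly `check_file_age -w 90000 -c 172800 -f /var/tmp/vip.stamp` (thresholds in seconds); check the plugin's help output, since options can vary between versions.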

You could, of course, also have Nagios check a mailbox and generate an alert if a message wasn't received periodically.
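A minimal sketch of the mailbox variant, assuming the VIP's "I'm OK" mails are delivered to a Maildir readable by the Nagios user and carry a fixed Subject text (both assumptions; the script name is hypothetical). Delivery time is approximated by each message file's modification time.

```shell
#!/bin/bash
# check_mail_heartbeat.sh: hypothetical plugin for a Maildir-delivered heartbeat.

# check_mail_heartbeat MAILDIR SUBJECT_TEXT MAX_AGE_MINUTES
check_mail_heartbeat() {
  local maildir="$1" subject="$2" minutes="$3" hits
  # Count message files modified within the window that contain the text
  # "Subject: SUBJECT_TEXT" (a crude match; it is not limited to the header).
  hits=$(find "$maildir/new" "$maildir/cur" -type f -mmin "-$minutes" \
           -exec grep -l -F "Subject: $subject" {} + 2>/dev/null | wc -l)
  if (( hits > 0 )); then
    echo "OK: $hits heartbeat message(s) in the last $minutes minutes"
    return 0
  fi
  echo "CRITICAL: no '$subject' message in the last $minutes minutes"
  return 2
}

if [[ "${BASH_SOURCE[0]}" == "$0" && $# -eq 3 ]]; then
  check_mail_heartbeat "$@"
  exit $?
fi
```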

larsks

To elaborate on what Jon said, you can use Nagios "passive" service checks together with a freshness check to accomplish this. Passive service checks are analogous to SNMP traps in that they are sent to the Nagios server asynchronously.

There is an NSCA (Nagios Service Check Acceptor) addon for Nagios to send and receive these passive service checks from remote hosts: http://exchange.nagios.org/directory/Addons/Passive-Checks/NSCA-%252D-Nagios-Service-Check-Acceptor/details

At the end of a successful run, your VIP could be set up to run send_nsca with a tab-delimited message like:

printf "VIP_Host_Name\tVIP_Health\t0\tOK\n" | send_nsca -H nagios_server

If your VIP had some kind of issue, you could instead send an appropriate alert (return code 1 is WARNING, 2 is CRITICAL):

printf "VIP_Host_Name\tVIP_Health\t1\tUseful Warning Message\n" | send_nsca -H nagios_server
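A hypothetical wrapper that does both in one place: it runs the job given as arguments and always reports the outcome through send_nsca, so a failed run alerts immediately while a run that never happens is caught by the freshness check. It reuses the answer's placeholder host, service and server names; mapping any non-zero exit to CRITICAL (rather than WARNING) is a design choice of this sketch.

```shell
#!/bin/bash
# run_vip.sh: hypothetical wrapper; names below are placeholders.
NAGIOS_SERVER="${NAGIOS_SERVER:-nagios_server}"
VIP_HOST="${VIP_HOST:-VIP_Host_Name}"
VIP_SERVICE="${VIP_SERVICE:-VIP_Health}"

# report CODE MESSAGE: send one tab-delimited passive result
# (host<TAB>service<TAB>return_code<TAB>plugin_output) via send_nsca.
report() {
  printf '%s\t%s\t%s\t%s\n' "$VIP_HOST" "$VIP_SERVICE" "$1" "$2" \
    | send_nsca -H "$NAGIOS_SERVER"
}

# run_and_report COMMAND...: run the job, map exit 0 to OK, anything else
# to CRITICAL, and return the job's own exit status.
run_and_report() {
  local rc
  "$@"
  rc=$?
  if (( rc == 0 )); then
    report 0 "OK: VIP completed"
  else
    report 2 "CRITICAL: VIP exited with status $rc"
  fi
  return "$rc"
}

if [[ "${BASH_SOURCE[0]}" == "$0" && $# -gt 0 ]]; then
  run_and_report "$@"
  exit $?
fi
```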

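On your Nagios server, add configuration like:

define service {
    service_description     VIP_Health
    active_checks_enabled   0       ; no scheduled active checks
    passive_checks_enabled  1       ; accept results submitted via NSCA
    host_name               VIP_Host_Name
    check_freshness         1       ; watch for stale results
    freshness_threshold     99000   ; seconds (27.5 hours) without a result
    check_command           vip_overdue ; run when the result goes stale
}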

To raise an alert unconditionally whenever the freshness_threshold is exceeded (the number of seconds since any result was last received for that service), define a new Nagios command called vip_overdue, used as the check_command above, that always exits with a CRITICAL status and a relevant error message, such as:

#!/bin/bash
# Nagios runs this only when no passive result arrived within freshness_threshold.
echo "CRITICAL: VIP is overdue"
exit 2   # 2 = CRITICAL

Yes, Nagios can. Just have a plugin that checks for output from the important process that is less than a day old. Not present? Service degraded!
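One way to write such a plugin, sketched under assumptions: the VIP writes a one-line status file at the end of every run in a made-up format, `<unix-epoch-seconds> <OK|FAILED>`, and the script name and path are hypothetical. Reading the run time from the file's content rather than its mtime means an accidental `touch` or a restored backup cannot make a stale run look fresh.

```shell
#!/bin/bash
# check_vip_output.sh: hypothetical plugin. Expects a status file written by
# the VIP, e.g.  echo "$(date +%s) OK" > /var/tmp/vip.status  (assumed path).

# check_vip_output FILE MAX_AGE_SECS
check_vip_output() {
  local file="$1" max_age="$2" stamp status age
  if [[ ! -s "$file" ]]; then
    echo "CRITICAL: no output found in $file"
    return 2
  fi
  read -r stamp status < "$file"
  if [[ ! "$stamp" =~ ^[0-9]+$ ]]; then
    echo "UNKNOWN: cannot parse $file"
    return 3
  fi
  # 10# forces base 10 so a leading zero is not read as octal.
  age=$(( $(date +%s) - 10#$stamp ))
  if (( age > max_age )); then
    echo "CRITICAL: last VIP output is $age seconds old"
    return 2
  fi
  if [[ "$status" != "OK" ]]; then
    echo "CRITICAL: last VIP run reported '$status'"
    return 2
  fi
  echo "OK: VIP last ran $age seconds ago"
}

if [[ "${BASH_SOURCE[0]}" == "$0" && $# -eq 2 ]]; then
  check_vip_output "$@"
  exit $?
fi
```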

Michael Lowman

Just to give you an alternative to Nagios, take a look at Zabbix. It can use the same approach larsks described for Nagios, but I think Zabbix is more user-friendly and easier to configure.

Take a look at this post and this one.
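For the push style of check in Zabbix, the job can report through `zabbix_sender` to an item of type "Zabbix trapper"; a trigger on a non-zero value catches failures, and one using the `nodata()` function catches silence (trigger syntax differs between Zabbix versions). The sketch below assumes such an item with the hypothetical key `vip.status` on host `VIP_Host_Name`, and a placeholder server name.

```shell
#!/bin/bash
# Hypothetical wrapper: run the VIP, then push its exit status to Zabbix.
ZABBIX_SERVER="${ZABBIX_SERVER:-zabbix.example.com}"   # placeholder

run_and_report_zabbix() {
  local rc
  "$@"
  rc=$?
  # -z: Zabbix server, -s: host name as configured in Zabbix,
  # -k: item key (assumed trapper item), -o: value (the job's exit status).
  zabbix_sender -z "$ZABBIX_SERVER" -s VIP_Host_Name -k vip.status -o "$rc"
  return "$rc"
}

if [[ "${BASH_SOURCE[0]}" == "$0" && $# -gt 0 ]]; then
  run_and_report_zabbix "$@"
  exit $?
fi
```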

Bob Rivers

You can also set up monit to do exactly this.

Monit can test the contents of a file[1] or its timestamp[2].

You can then monitor this via its web interface and also receive emails when a test fails.

In my opinion, monit is quite easy to set up compared to Nagios. I just did what I described for a process I run, and it took me no more than 15 minutes from downloading monit to having it running.

[1] http://mmonit.com/monit/documentation/monit.html#file_content_testing

[2] http://mmonit.com/monit/documentation/monit.html#timestamp_testing

edmz