Tool to monitor for success or failure within a given amount of time

Question

My team is trying to solve a monitor challenge when it comes to backups.

The backup is running fine. Our current challenge is to monitor these backups so that we know they actually happen.

We can send a mail on both failure and success. We now want to check for these mails and:

- alert if the mail reports a failure
- alert if the success mail wasn't received within, say, a day (configurable)

This way we are in the know if the backup failed or if the mail could not be sent at all. That is why we also send the success mail: to prove that mail is actually being sent.

I imagine this idea to be somewhat like a heartbeat that is being actually checked instead of passively waiting for failures.
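To make the two alert rules above concrete, here is a minimal Python sketch of the check itself. It assumes the mails have already been fetched and reduced to (timestamp, status) pairs; the function name, the status strings, and the one-day default are illustrative assumptions, not part of any particular tool.

```python
from datetime import datetime, timedelta


def evaluate_backup_mails(reports, now, max_silence=timedelta(days=1)):
    """Apply the two alert rules to already-parsed backup mails.

    reports: iterable of (received_at, status) pairs, where status is the
             hypothetical string "success" or "failure".
    """
    # Only mails received inside the expectation window count.
    recent = [(t, s) for t, s in reports if now - t <= max_silence]

    # Rule 1: any failure mail inside the window raises an alert.
    if any(status == "failure" for _, status in recent):
        return "ALERT: backup reported a failure"

    # Rule 2: silence is also a failure, because no success mail arrived.
    if not any(status == "success" for _, status in recent):
        return "ALERT: no success mail within the expected window"

    return "OK"
```

The second rule is what turns the success mail into a heartbeat: a missing mail is treated the same way as a reported failure.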

Which tool can help us?

I suspect this kind of tool allows us to enter expectations that need to happen, for example a mail should be received in the last day, be it success or failure.

The tool would be even better if it could go directly to the disk and check for the presence of the backup files, but we would like to support the mail case as well, since other systems currently report this way.
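The on-disk variant could be sketched roughly like this in Python. The directory layout, the `*.tar.gz` pattern, and the one-day limit are assumptions for illustration only; a real check would use whatever naming the backup job actually produces.

```python
import time
from pathlib import Path


def newest_backup_age(directory, pattern="*.tar.gz", now=None):
    """Return the age in seconds of the newest matching file, or None if none exist."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(directory).glob(pattern) if p.is_file()]
    if not mtimes:
        return None
    return now - max(mtimes)


def backup_is_fresh(directory, max_age_seconds=24 * 3600, pattern="*.tar.gz", now=None):
    """True only if a matching backup file exists and is recent enough."""
    age = newest_backup_age(directory, pattern, now)
    return age is not None and age <= max_age_seconds
```

A missing file and an old file both count as "not fresh", so the check fails closed when the backup job never ran.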

It seems that you already have an alert system in that emails are sent out by your backup system reporting what happened - why go further? Otherwise, can you specify what does your backups? If you want to monitor it by nagios, most likely forget the built in email, and see if it talks SNMP etc. You tagged this with a bunch of monitoring systems - if you are trying to ask which is best, this question will get closed here. — dunxd, Apr 04 '13 at 16:22

score 0 · Answer 1 · answered Apr 04 '13 at 15:48

This is perilously close to a shopping question, but I'll bite anyway.

I use NAGIOS a lot to do that sort of thing (because I use NAGIOS a lot anyway, so it's nice to have all my status and notifications in the same place). I have agents report in using send_nsca, and the services are configured to go STALE and alert if they receive no updates for, say, 36 hours.

Services that detect failure can report it using send_nsca; those that are sure they've succeeded can report that. Services that fail so badly they report nothing get caught by the freshness test above.
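As a rough sketch of the reporting side, a backup script could hand its result to send_nsca like this. It assumes the send_nsca binary is on the PATH and uses its default tab-delimited input (host, service description, return code, plugin output); the config path, host names, and service name are placeholders, not values from my setup.

```python
import subprocess

# Standard Nagios return codes for check results.
NAGIOS_STATES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def nsca_line(host, service, state, message):
    """Build one passive service check result in send_nsca's default input format."""
    return f"{host}\t{service}\t{NAGIOS_STATES[state]}\t{message}\n"


def submit(line, nsca_server, config="/etc/send_nsca.cfg"):
    """Pipe the result to send_nsca on stdin (config path is an assumption)."""
    subprocess.run(
        ["send_nsca", "-H", nsca_server, "-c", config],
        input=line,
        text=True,
        check=True,
    )
```

For example, `submit(nsca_line("db01", "Nightly Backup", "OK", "backup finished"), "nagios.example.org")` at the end of a successful run, and the same call with `"CRITICAL"` in the failure path.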

score 0 · Answer 2 · answered Apr 04 '13 at 16:29

This sounds a bit odd to me;

So you get an email to say if the backup was successful or not. Now you want to check if you get an email, and be alerted if the backup was a failure or even if it was a success but the mail didn't come through.

Sounds to me like you should drop the email part and just use a direct monitoring solution. You could script this, I have seen it many times before. However, how would you have it alert you, via email? You already have a monitoring solution in place that does that!

The problem here seems to be that you need to monitor if the email came through or not, so you are attaching a monitoring system to your monitoring system. If emails aren't reliable don't report the success of the backup via email in the first place.

It's hard to comment on a recommendation, without knowing what you are backing up or how, but it seems to me like you have the order/logic of the situation all muddled up here.

score 0 · Answer 3 · answered Apr 04 '13 at 16:43

Agreed that the best solution is one that you trust to work every time and that only alerts you when there is something you need to fix. Alerting for success causes system administrator email overload and is unsustainable as you get more systems and sys admins.

The tool would be even better if it could directly go to the disk and check for the presence of the backup files

Yes, you already know the right solution. That's the way it's normally done.

As for your email problems maybe you could dig into what's going on there and fix them separately so you're not trying to fix broken email with your backup monitoring system.

score 0 · Answer 4 · answered Apr 04 '13 at 16:59

Nagios freshness check.

http://nagios.sourceforge.net/docs/3_0/freshness.html

An example of a service that might require freshness checking might be one that reports the status of your nightly backup jobs. Perhaps you have an external script that submits the results of the backup job to Nagios once the backup is completed. In this case, all of the checks/results for the service are provided by an external application using passive checks. In order to ensure that the status of the backup job gets reported every day, you may want to enable freshness checking for the service. If the external script doesn't submit the results of the backup job, you can have Nagios fake a critical result by doing something like this...

Here's what the definition for the service might look like (some required options are omitted)...

define service{
    host_name               backup-server
    service_description     ArcServe Backup Job
    active_checks_enabled   0       ; active checks are NOT enabled
    passive_checks_enabled  1       ; passive checks are enabled (this is how results are reported)
    check_freshness         1
    freshness_threshold     93600   ; 26 hour threshold, since backups may not always finish at the same time
    check_command           no-backup-report    ; this command is run only if the service results are "stale"
    ...other options...
    }

Notice that active checks are disabled for the service. This is because the results for the service are only made by an external application using passive checks. Freshness checking is enabled and the freshness threshold has been set to 26 hours. This is a bit longer than 24 hours because backup jobs sometimes run late from day to day (depending on how much data there is to back up, how much network traffic is present, etc.). The no-backup-report command is executed only if the results of the service are determined to be stale. The definition of the no-backup-report command might look like this...
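The following is not the definition from the Nagios documentation. It is only a hypothetical Python sketch of what a stale-result command could execute, assuming the standard Nagios plugin convention that exit code 2 means CRITICAL. The function names and the message text are made up for illustration.

```python
CRITICAL = 2  # Nagios plugin exit code for the CRITICAL state


def stale_result(message="CRITICAL: backup job did not report its results"):
    """Return the (exit_code, output_line) pair for a stale backup service."""
    return CRITICAL, message


def main():
    """Print the status line Nagios will display and return the exit code."""
    code, text = stale_result()
    print(text)
    return code
```

Saved as a script that ends with `sys.exit(main())`, this would always hand a CRITICAL status back to Nagios, which is all the stale-result command needs to do: it only ever runs when no passive result arrived in time.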
