Best tool for monitoring backups, etc. and trending statistics from that data

Question

I have done some research on Nagios, OpenNMS, and Zenoss but am not confident that I have found what I am looking for.

The main driving force for me right now is being able to monitor backups. This includes MySQL, MSSQL, and eventually some file system backups.

We have a tool that wraps the backup process for these different systems and collects statistics. So, items like:

number of databases backed up
size of db backup file
size of db backup file compressed
time to make backup
time to zip file

I want to be able to: A) get notifications if the jobs do not run on schedule, B) set thresholds on the statistics that trigger notifications, and C) trend and graph the statistics.

I am planning on sending this information to the monitoring application through an HTTP POST. Or, the monitoring application could pull it from a log file as well.

However, we will have other processes with other "arbitrary" (from the monitoring system's perspective) statistics that we will want to monitor and trend, so flexibility is very important.

The tool or tools should also be able to do general monitoring and trending of network interfaces, server load, etc. Once we get the backup monitoring in place, we will want to include those items as well.

Thanks.

Follow-up:

I have decided to try the following in the given order:

Zabbix: seemed more of a "one stop shop" than the others and was easy to install in Ubuntu Lucid RC
opsview
Nagios w/ nagvis, pnp4nagios, nagiosgraph
cacti w/ npc plugin
Munin: a little scared of the simplicity, but this might prove to be a blessing in the long run

I will post back once I have made a decision; it may be a while until that happens.

score 4 · Answer 1 · answered Apr 23 '10 at 16:45

Rather than writing your own monitoring solution, I strongly recommend that you use an existing tool so that all the basic monitoring and alerting functionality is already implemented. If you pick Nagios, you'll get the basic monitoring of server and network resources for free, and the following plugins should give you most of the rest of what you need:

check_file_ages_in_dirs will tell you whether the backup files exist; here's a blog post I wrote with some basic examples.

check_file can monitor file size and contents (using regexes), so you can output your backup statistics to a file and monitor them.
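To illustrate the second idea, here is a minimal sketch of a Nagios-style check that reads one statistic from a file and compares it against thresholds. It is not the check_file plugin itself. The `key=value` file format, the function name `check_stat`, and the threshold numbers are assumptions made up for this example; only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the usual Nagios plugin convention.

```shell
# Hypothetical check: read "key=value" lines from a stats file and compare
# one numeric value against warning/critical thresholds.
# Returns Nagios plugin codes: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
check_stat() {
    local file="$1" key="$2" warn="$3" crit="$4" value
    if [ ! -r "$file" ]; then
        echo "UNKNOWN - cannot read $file"
        return 3
    fi
    # Keep the last occurrence of the key, so append-only files also work.
    value=$(awk -F= -v k="$key" '$1 == k { v = $2 } END { print v }' "$file")
    case "$value" in
        ''|*[!0-9]*)
            echo "UNKNOWN - no numeric value for $key"
            return 3 ;;
    esac
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - $key=$value"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - $key=$value"
        return 1
    fi
    echo "OK - $key=$value"
    return 0
}
```

A call such as `check_stat /path/to/stats.txt backup_seconds 1800 3600` (path and key are placeholders) would warn once the backup takes 30 minutes and go critical at one hour.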

The one thing you won't get from Nagios is trending and graphing; I recommend looking at Munin for that, as it's simple to set up and, like Nagios, has stacks of contributed plugins.

Just for clarification, I wouldn't be writing my own monitoring tool. The question is to get recommendation for monitoring/trending tools that will integrate with the backup/script-running framework I have built. — Randy Syring, Apr 26 '10 at 19:11

score 4 · Answer 2 · answered Apr 30 '10 at 07:49

this should be pretty easy to set up with zabbix.

setting custom (and very powerful) thresholds is easy - you can write any expression you like, so something like "notify me if more than 3 of these 5 servers did not have a successful backup" is possible. you can also use 6 different severity levels and escalations to achieve flexible notification and alerting.

zabbix has bundled data storage and visualisation capabilities - all data is stored in a database, and to graph a single metric you do not need any configuration - you just get a graph for it "for free". for long term storage & trending one hour averages are computed.

as for getting your data about backups into zabbix, there are multiple possibilities. you can read it from files, you can launch custom commands, you can push it from the monitored machine using the commandline utility zabbix_sender... and there might be a few more possible approaches.
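As a rough sketch of the push approach, a backup wrapper could call zabbix_sender once per statistic. The server name, host name and item key below are placeholders, and `send_stat` and the `DRY_RUN` switch are made up for this example. On the Zabbix side the item would need to be a trapper item for the pushed value to be accepted.

```shell
# Hypothetical helper: push one backup statistic with zabbix_sender.
# Set DRY_RUN=1 to print the command instead of running it.
send_stat() {
    local server="$1" host="$2" key="$3" value="$4"
    # -z Zabbix server, -s host name as configured in Zabbix,
    # -k item key, -o value to send
    ${DRY_RUN:+echo} zabbix_sender -z "$server" -s "$host" -k "$key" -o "$value"
}

# Example (placeholder names):
#   send_stat zabbix.example.com db01 backup.size 1048576
```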

extending is easy - any custom command that returns data can be used to gather, store and visualise that data.

of course, general monitoring of operating systems, applications, snmp and ipmi devices and so on is possible.

score 1 · Answer 3 · answered Apr 23 '10 at 18:20

execution

backups get orchestrated by backupninja. i use it just as a wrapper for my bash scripts, to have a single backup log. each script starts with

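 # report the failure; error is the logging helper backupninja gives its scripts
 function handle {
         echo Error
         error "problem occurred"
 }
 # stop at the first failing command and call the handler when that happens
 set -e
 trap handle ERR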

so i get error in logs whenever any of the commands [ eg mysqldump or rsync ] fails.

all backups end up in rdiff repository so i have n days of increments.

all backups are transmitted using rsync to central storage server.

on storage server all backups are verified daily and after successful verification of data on local disk they get copied to external usb drive.

verification

backupninja.log on all servers is monitored by nagios. i check if they contain only DEBUG and INFO messages. anything else triggers alert.

every backup 'touches' a test file, presence and freshness of which is monitored on central backup repository server with nagios.
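A freshness check like this can be sketched in a few lines of shell. This is only an illustration, not the plugin used above: the function name `check_fresh` and the paths are hypothetical, and it assumes a `find` that supports `-mmin` (GNU and BSD find do).

```shell
# Hypothetical freshness check: OK only if the marker file exists and was
# modified less than max_minutes minutes ago. Returns Nagios-style codes.
check_fresh() {
    local file="$1" max_minutes="$2"
    if [ ! -e "$file" ]; then
        echo "CRITICAL - $file is missing"
        return 2
    fi
    # find prints the path only when it is newer than max_minutes.
    if [ -n "$(find "$file" -mmin "-$max_minutes")" ]; then
        echo "OK - $file is fresh"
        return 0
    fi
    echo "CRITICAL - $file is older than $max_minutes minutes"
    return 2
}

# Example (placeholder path), allowing 25 hours for a nightly job:
#   check_fresh /backups/db01/.last-run 1500
```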

additionally more critical sql dumps get checked for their size [not just freshness] and completeness [eg at the end of mysql dumps i expect a fresh timestamp in the last line]:

-- Dump completed on 2010-04-22 23:21:02
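The completeness test could look roughly like the sketch below. The function name `check_dump_complete` is made up here, and it assumes the dump was written with mysqldump's default comments enabled, so that the footer line with the date is actually present.

```shell
# Hypothetical completeness check: the last line of the dump should be
# "-- Dump completed on <date> ..." and carry the expected date.
check_dump_complete() {
    local dump="$1" expected_date="$2" last
    last=$(tail -n 1 "$dump" 2>/dev/null) || last=""
    case "$last" in
        "-- Dump completed on $expected_date"*)
            echo "OK - dump completed on $expected_date"
            return 0 ;;
        "-- Dump completed on "*)
            echo "CRITICAL - stale dump: $last"
            return 2 ;;
        *)
            echo "CRITICAL - dump looks truncated"
            return 2 ;;
    esac
}

# Example (placeholder path): check_dump_complete /backups/db.sql 2010-04-22
```

Which date counts as "fresh" is up to the caller, since a dump finished shortly before midnight will carry the previous day's date when checked the next morning.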

all rdiff archives are verified daily before data gets synced to USB drive and then again after they get synced. so even if nightly transfer is interrupted i will have consistent repository just on USB disk. result of checking is logged to a file whose content and freshness are checked by nagios.

usb disks get rotated weekly and are stored offline, just in case. this might be overkill for bigger amounts of data, but works fine for ~300GB of slowly changing files/dumps.

trends

i use simple custom munin plugin to plot size of diff/data for each rdiff repository.
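For readers who have not written a munin plugin: it is a script that prints graph metadata when called with `config` and prints `field.value N` otherwise. The sketch below is not the plugin described above. It reports the total size of a single directory, and `REPO_DIR`, the field name `repo` and the function name are placeholders.

```shell
# Hypothetical munin plugin body: report the size of one backup repository.
# REPO_DIR is a placeholder; override it in the environment as needed.
REPO_DIR="${REPO_DIR:-/var/backups/rdiff/example}"

backup_size_plugin() {
    local kb
    if [ "${1:-}" = "config" ]; then
        # Graph description, requested by munin with the "config" argument.
        echo "graph_title Backup repository size"
        echo "graph_vlabel bytes"
        echo "graph_category backup"
        echo "repo.label $REPO_DIR"
        return 0
    fi
    # du -sk prints kilobytes; convert to bytes. "U" means unknown to munin.
    kb=$(du -sk "$REPO_DIR" 2>/dev/null | awk '{ print $1 }')
    if [ -z "$kb" ]; then
        echo "repo.value U"
    else
        echo "repo.value $((kb * 1024))"
    fi
}

# A real plugin file would end with: backup_size_plugin "$@"
```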

time it takes to execute can be checked in backupninja logs but for now i don't bother about it.

Thanks for the answer. I already have a framework that handles running backups (and other tasks), which collects statistics, so backupninja would be overkill. Nagios seems to be a consensus and then munin or cacti to trend. — Randy Syring, Apr 26 '10 at 19:28

score 1 · Answer 4 · answered Apr 27 '10 at 19:26

Nagios can do trending, but your plugin needs to output perfdata (http://nagios.sourceforge.net/docs/1_0/perfdata.html). If you use pnp4nagios (http://docs.pnp4nagios.org/pnp-0.4/start), everything will be graphed for you.
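Perfdata is simply extra text after a `|` in the plugin's output line, in the form `label=value[UOM];warn;crit`. A minimal sketch of a helper that formats such a line (the name `perfdata_line` and the example numbers are made up for illustration):

```shell
# Hypothetical helper: format one line of plugin output with perfdata.
# Everything after the "|" is what graphing add-ons pick up.
perfdata_line() {
    local status="$1" label="$2" value="$3" uom="$4" warn="$5" crit="$6"
    echo "$status - $label is $value$uom | $label=$value$uom;$warn;$crit"
}

# Example: a backup that took 420 seconds, with thresholds of 1800 and 3600.
#   perfdata_line OK backup_time 420 s 1800 3600
```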

I have found that using Opsview (http://www.opsview.org/) is way easier than configuring Nagios and pnp4nagios, especially if you are the only Linux-savvy admin at work. Opsview is Nagios with a great web UI that allows almost all actions from the web browser. Because it is Nagios, you can use all the Nagios plugins you have been using in the past. Great tool.

Thanks for the comment, I think I had ruled out opsview for some reason, but based on your recommendation, I may end up trying it before I jump into nagios proper. — Randy Syring, Apr 27 '10 at 19:32

score 1 · Answer 5 · answered May 13 '10 at 19:26

I recommend OpenNMS. The package is completely open source, actively supported and regularly enhanced. For reference, I found on their wiki configuration info to monitor Symantec Backup Exec.

From their website:

OpenNMS is the world's first enterprise grade network management platform developed under the open source model. It consists of a community supported open-source project as well as a commercial services, training, and support organization.

Disclosure: I have no commercial interest here, but the owner of The OpenNMS Group, the "commercial services, training and support organization" mentioned above, is a friend of mine.

score 0 · Answer 6 · answered Apr 23 '10 at 16:59 · edited Apr 23 '10 at 18:10 · solefald

Nagios for alerting and Cacti for graphing, plus some shell or Perl scripts, will do exactly what you want. Combined, they can do pretty much anything, depending on the amount of effort you are willing to put in.

Do you think it would be better to "push" stats to nagios over HTTP or let it pull stats from log files? – Randy Syring Apr 26 '10 at 19:29

score 0 · Answer 7 · answered May 14 '10 at 00:22 · obfuscurity

This could be done easily with Circonus (http://circonus.com/). We routinely import metrics like this with the Resmon XML DTD.
