9

I've got a nagios server set up to monitor ~30 Windows servers. I want to add some trending charts. I've read that nagios graphing plugins are simple and that many people use separate, standalone charting/trending tools.

What are the restrictions of the nagios graphing plugins vs standalone products like ganglia/munin/cacti?

I'm interested in specific features and advantages that standalone packages offer and nagios graphing plugins don't.

sumek
  • You should also consider Zabbix... http://serverfault.com/q/109595/2039 – sebthebert Jan 25 '11 at 17:45
  • Try opsview community edition, based on nagios. You can install it on different linux flavors or download a VM. http://www.opsview.com/downloads/download-opsview-community – Matias Dominoni Feb 04 '11 at 17:26
  • For the record: I've tried out nagiosgraph and then stuck with it. I'm pretty happy with what it offers – sumek Feb 05 '11 at 20:45

6 Answers

13

I concur with lynxman. NAGIOS is for immediate qualitative data (is X OK or not?); munin is for historical quantitative data (how full is X now, and how full has it been this year?). All my NAGIOS installations, some of which monitor several hundred services, are linked to munin systems to do the quantitative monitoring.

Note also that munin has specific hooks for feeding data into NAGIOS. It understands the concept of WARNING and CRITICAL thresholds, and where notification (and a view on the NAGIOS "big board") is required it's very very easy to have a single munin variable inform the state of a single NAGIOS service.
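To make the threshold idea concrete, here is a minimal sketch of how a single numeric reading could be mapped onto a NAGIOS service state. It is not munin's actual code: the function names and the "higher is worse" rule are assumptions for illustration. Only the numeric state codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard NAGIOS plugin return codes.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def state_for(value, warning, critical):
    """Map a numeric reading to a state, assuming higher values are worse."""
    if value is None:
        return UNKNOWN  # no reading available
    if value >= critical:
        return CRITICAL
    if value >= warning:
        return WARNING
    return OK


def status_line(label, value, warning, critical):
    """Return (state code, one-line status text) for a single variable."""
    state = state_for(value, warning, critical)
    return state, f"{STATE_NAMES[state]} - {label} is {value}"
```

For example, `status_line("disk_used_pct", 85, 80, 90)` yields state 1 (WARNING), which is the kind of result a single graphed variable would hand over to one NAGIOS service.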

The usual workflow is that no one looks at the munin graphs until NAGIOS alerts that a threshold has been breached, but then the munin graphs become invaluable for finding out whether something has been slowly ramping up over time, or this is an out-of-the-blue increase, or we have a weekly up-and-down cycle which is slowly increasing in amplitude, or what.

As lynxman says, the UNIX way is "one task, one tool". Making a toolchain of munin and NAGIOS works very well for me to provide quantitative and qualitative monitoring as well as notifications. It also has the distinct advantage of keeping the interfaces clean: when you look at NAGIOS, you see a simple view of how well things are working right now, with no historical data cluttering up the view; when you look at munin, you see historical information pertinent to the issue ready for your analysis, without "host is down" or "sshd won't talk to me" errors cluttering the view.

MadHatter
7

given that you already have a nagios installation, consider nagiosgraph or pnp4nagios.

nagiosgraph and pnp4nagios do a pretty nice job of plotting nagios performance data. nagiosgraph has a parameter-based approach to configuration, pnp4nagios has a template-based approach.

  • both automatically detect new hosts/services whenever the nagios configuration changes
  • both do graph zooming
  • both provide graphs when you mouseover specific hosts/services
  • both provide many ways to slice and dice your data
  • both detect and graph the critical and warning levels you have already defined in nagios
  • both can be embedded directly into the nagios frame for seamless, uncluttered navigation from current status to history and back
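The warning and critical levels mentioned above reach the graphing tools through the nagios performance data string, whose items look like `label=value[UOM];warn;crit;min;max`. The following is a simplified sketch of parsing that string, not the parser that nagiosgraph or pnp4nagios actually use. The names `Metric` and `parse_perfdata` are made up for illustration, and the sketch assumes labels without spaces and plain numeric thresholds (range-style thresholds are returned as `None`).

```python
import re
from typing import NamedTuple, Optional


class Metric(NamedTuple):
    label: str
    value: float
    unit: str
    warn: Optional[float]
    crit: Optional[float]


# First field of an item: label=value followed by an optional unit.
_ITEM = re.compile(r"^(?P<label>[^=]+)=(?P<value>-?\d+(?:\.\d+)?)(?P<unit>[^;]*)$")


def _to_float(text):
    """Return a float, or None for empty or non-numeric thresholds."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_perfdata(perfdata):
    """Parse a whitespace-separated perfdata string into Metric tuples."""
    metrics = []
    for item in perfdata.split():
        fields = item.split(";")
        match = _ITEM.match(fields[0])
        if not match:
            continue  # skip items this simplified parser cannot read
        fields += [""] * (3 - len(fields))  # make sure warn and crit slots exist
        metrics.append(Metric(
            label=match["label"].strip("'"),
            value=float(match["value"]),
            unit=match["unit"],
            warn=_to_float(fields[1]),
            crit=_to_float(fields[2]),
        ))
    return metrics
```

Given `"load1=0.42;5.0;10.0;0; mem=512MB;900;1000"`, this returns two metrics, each carrying the thresholds that a graph could draw as horizontal lines.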

slicing and dicing the data are pretty important, imho. for example, you can view all services on a single host, or view all hosts with a specific service, or view arbitrary collections of graphs for arbitrary hosts and services.

installation is not trivial, but not difficult. a lot depends on how much you want to customize things. for example, nagiosgraph is 'install.pl' or 'rpm -i nagiosgraph.rpm' or 'dpkg -i nagiosgraph.deb'. pnp4nagios is './configure; make; make install'.

n2rrd can do some of these things as well, but it is not as polished and requires more work to configure.

rrdtool has quirks wrt data storage, and any system will have sampling issues. rrdtool does some data smoothing by default, but you can capture (and graph) maximums and/or minimums in addition to averages if necessary.
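as a rough illustration of why keeping maximums and minimums matters, here is a plain-python sketch of consolidating raw samples into coarser buckets. it is not rrdtool's implementation; the `consolidate` function is a hypothetical helper that only mimics the idea behind the AVERAGE, MIN and MAX consolidation functions.

```python
def consolidate(samples, points_per_bucket):
    """Collapse raw samples into buckets, keeping average, min and max for each."""
    buckets = []
    for start in range(0, len(samples), points_per_bucket):
        chunk = samples[start:start + points_per_bucket]
        buckets.append({
            "average": sum(chunk) / len(chunk),
            "min": min(chunk),
            "max": max(chunk),
        })
    return buckets


# One short spike among otherwise flat readings.
samples = [10, 10, 100, 10, 10, 10]
for bucket in consolidate(samples, 3):
    print(bucket)
```

in the first bucket the average comes out at 40 while the maximum still records the spike of 100, which is why storing maximums alongside averages can be worth the extra disk space.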

every rrdtool-based approach suffers from data/graph staleness since the schema in each rrd file is static and most systems use the rrd filename to identify the data. data are typically never lost when a hostname or service name changes; the rrd files still exist on disk. but some user interfaces provide ways to see 'stale' rrd files, others require manual housekeeping via command line. on many installations this is only an issue when initially configuring the system, but in dynamic environments (e.g. monitoring virtual machines whose lifetime is only a few months) it can become tedious.
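for the manual housekeeping case, a small script can at least list the candidates. this is a sketch under assumptions: the directory you pass in is wherever your graphing tool keeps its files, the 30-day cutoff is arbitrary, and "not modified recently" is used as a stand-in for "stale".

```python
import time
from pathlib import Path


def find_stale_rrds(rrd_dir, max_age_days=30, now=None):
    """Return .rrd files under rrd_dir not modified within max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    return sorted(
        path for path in Path(rrd_dir).rglob("*.rrd")
        if path.stat().st_mtime < cutoff
    )
```

the function only reports files; deciding whether to archive or delete them is left to the operator.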

one final note. there are actually two parts to trending: data collection and data display. if you go with a standalone graphing system rather than extending your existing nagios installation, then you might have to install additional components on your windows machines in order to collect the data.

3

Nagios graphing plugins, as you say, are very restricted. They offer a very basic rrdtool interface and the UI design is a bit counterintuitive. It's basically a hack on top of nagios; I tried it just for fun, but it broke several times without warning.

Going for a standalone product (especially munin or ganglia) offers you a big range of services that nagios can't accomplish. As the unix mantra goes, it's better to be good at just one thing than to try to be good at many: nagios is amazing for monitoring and munin/ganglia/cacti are amazing at graphing.

lynxman
  • So what is inside this _a big range of services that nagios can't accomplish_? This is what I'm interested in. – sumek Jan 21 '11 at 11:16
  • With nagios it's very easy to lose your graph data, it's also very easy for the plugin to stop graphing data at any given time, it doesn't give you any possibility of zooming into a specific time (which all the others do), it doesn't give you the possibility to do complicated aggregative graphs, and that's just for starters :) – lynxman Jan 21 '11 at 11:24
  • What do you mean by _easy to lose your graph data_? A quick google shows that all 5 mentioned solutions (ganglia, munin, cacti, pnpgraph, nagiosgraph) use rrdtool for storing graph data. – sumek Jan 21 '11 at 11:34
  • Yes sumek, what I'm referring to is that, again, the graphing tool on nagios is a hack, and whenever there's a mismatch between the rrd file name and the graph info it'll break. If you want, try it, suffer the pain as I did and then move to a real solution like munin :) – lynxman Jan 21 '11 at 12:30
2

At Stack Overflow we use n2rrd, which is a Nagios plugin for graphing performance data. To an extent I would agree with lynxman that it does have a bit of a hackish feel.

However:

  • With n2rrd you can have Cacti do the graphing based on the data instead of the rrd2graph.cgi that comes with n2rrd
  • n2rrd with the rrd2graph.cgi does support zooming
  • As far as complicated aggregate graphs -- you basically manipulate the rrd graphs by hand and can do whatever you want with them.

The rrd graphs are stored according to the server names, so if you change the name of something you sort of lose the data... You could always just rename the files or symlink them, though, and you won't lose the data.
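A minimal sketch of the symlink approach follows. The `<name>.rrd` file layout and the `link_renamed_rrd` helper are assumptions for illustration, so check how your own n2rrd setup actually names its files before trying anything like this.

```python
import os
from pathlib import Path


def link_renamed_rrd(rrd_dir, old_name, new_name):
    """Make <new_name>.rrd a symlink to the existing <old_name>.rrd."""
    rrd_dir = Path(rrd_dir)
    old_file = rrd_dir / f"{old_name}.rrd"
    new_file = rrd_dir / f"{new_name}.rrd"
    if not old_file.exists():
        raise FileNotFoundError(old_file)
    if new_file.exists() or new_file.is_symlink():
        raise FileExistsError(new_file)  # never clobber data already collected
    # Relative target, so the link survives moving the whole directory.
    os.symlink(old_file.name, new_file)
    return new_file
```

New samples written under the new name then land in the old file, so the history stays in one place.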

I have some examples of these graphs in my recent Server Fault Blog post, "Some Tips for Better RRD Graphs". Also, the n2rrd page includes both the cacti demo as well as rrd2graph.

I think the bottom line is that going the Nagios route might be lacking in a feature or two but is pretty complete if you don't mind getting your hands dirty with the details of writing rrd templates yourself*. It is probably going to take more of your time but it will encourage you to develop more expertise in rrd.

Kyle Brandt
  • * [ unreferenced footnote error ] : what were you going to add there, kyle; enquiring minds need to know! – MadHatter Jan 21 '11 at 14:01
0

I demand accurate data and rrd's data display is not accurate - it's normalized! For most users this is fine because they're not using very accurate data to begin with. They're using data whose sampling intervals are often a minute or more, and that isn't going to give you a very accurate description of what is happening. This also means that if you have a spike in your data somewhere you may never see it.

Consider this - say your Gb network is humming along at about 10MB/sec and all of a sudden there is a spike of 100MB/sec for a couple of minutes. If you look at the data for the day, that 'spike' may only show up as 15MB/sec, though the actual value depends on a number of other factors as well. Also note that if it was only a 30-second spike, you might not see it at all at sampling intervals of a few minutes. It's also quite likely you'll assume your network is happy when it isn't!

What's even more frustrating for me is that the data is normalized to the physical width of the graph and the range of the x-axis. What this means is that spike I mentioned you didn't see? If you zoom in it magically appears! I'll stick to gnuplot - the graphs may not be as pretty but they're rock solid and gnuplot never modifies the data before displaying it.
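Here is a small worked example of that effect. It is a simplified model of averaging samples into display buckets, not rrdtool's actual algorithm, and the one-sample-per-minute data and the bucket widths are assumptions picked to match the 10MB/sec and 100MB/sec figures above.

```python
def displayed_peak(samples, samples_per_bucket):
    """Highest value left after averaging samples into fixed-size buckets."""
    averages = []
    for start in range(0, len(samples), samples_per_bucket):
        chunk = samples[start:start + samples_per_bucket]
        averages.append(sum(chunk) / len(chunk))
    return max(averages)


# One sample per minute for a day: 10 MB/sec with a two-minute burst of 100 MB/sec.
day = [10.0] * 1440
day[600:602] = [100.0, 100.0]

for width in (1, 30, 1440):
    print(width, "minute buckets -> peak", displayed_peak(day, width), "MB/sec")
```

With one-minute buckets the peak is the full 100MB/sec, with 30-minute buckets it drops to 16MB/sec, and averaged over the whole day it is barely above the baseline. That is the zoomed-out view hiding the spike and the zoomed-in view revealing it.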

-mark

mark seger
0

I find using pnp4nagios works quite well for graphing. It supports zoom as well. It is not the easiest to implement, but nothing with nagios ever is.