8

I have used Munin on multiple servers for many years with great success. However, with more than 100 munin-nodes, and when the clients are under load, the processing times out.

I have made some scaling changes to the cron job and the number of client processes, and reduced the number of plugins running, but I have decided to look for an alternative with a more scalable architecture.

Any suggestions or experiences would be welcome. I am basically interested in server metrics which can be used for capacity planning and diagnosing resource usage. (We have Nagios for alerting.)

Tom
  • possible duplicate of [What tool do you use to monitor your servers?](http://serverfault.com/questions/44/what-tool-do-you-use-to-monitor-your-servers) – Ben Pilbrow Apr 18 '11 at 22:33

5 Answers

8

It sounds like you may have two problems:

  1. On your monitoring server, recording the metrics for lots of servers requires more random i/o than your storage can provide. Even if all your metrics are being written to disk, the server may be too overloaded to actually generate graphs from them.
  2. On your clients being monitored, the plugins which collect the metrics are too CPU and memory intensive and don't finish gathering data in time when the clients are experiencing heavy load.
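To get a feel for the first problem, here is a rough back-of-the-envelope sketch. The node, plugin, and file counts are hypothetical numbers chosen for illustration, not measurements from any real Munin install. Each update goes to a different small file, which is why the load shows up as random i/o rather than sequential writes.

```python
def rrd_updates_per_second(nodes, plugins_per_node, files_per_plugin, interval_s):
    """Average number of RRD file updates the monitoring server must absorb."""
    return nodes * plugins_per_node * files_per_plugin / interval_s


# Hypothetical sizing: 100 nodes, 30 plugins each, 4 RRD files per plugin,
# one collection run every 300 seconds.
rate = rrd_updates_per_second(nodes=100, plugins_per_node=30,
                              files_per_plugin=4, interval_s=300)
print(f"{rate:.0f} RRD updates per second on average")
```

This is only an average. If all updates arrive in one burst per collection run, the peak rate is considerably higher.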

I've used Munin in the past, but I am currently using collectd. The authors of collectd have put a lot of thought and effort into solving these problems. They have a well-designed system for writing the data to RRD files that ensures you don't lose data and can generate up-to-date graphs. There's also support for RRDCacheD. The daemon and the official plugins are written in C, so they use little memory or CPU time. On my client systems it's using less than 2MB of RAM and about a quarter of a second of CPU time every minute. On my monitoring server it is using 20MB of RAM and two-thirds of a second of CPU time every minute. Keep in mind that all my metrics are being gathered and sent to my monitoring server every ten seconds, rather than at intervals of minutes like Munin.

sciurus
  • munin now has preliminary support for rrdcached. It requires a little more effort than the default install. This is not a vote for or against munin/collectd; I am only adding this to help anyone struggling with a munin setup and no leeway to change systems. – dfc Jan 13 '14 at 04:04
3

Although they are great tools, Munin and other RRDTool frontends (such as Cacti or Ganglia) have known i/o issues and are difficult to scale when you monitor hundreds of nodes.

There are some techniques to deal with this i/o bottleneck, though. One of these techniques is to spread writes across a large number of disks to reduce i/o on each disk. Alternatively, many sysadmins use tmpfs filesystems to deal with this problem. RRDCached is also a recent and good option, and I'd recommend you take a look at these slides.

I'm not that familiar with Munin, but Cacti has a Boost plugin. This plugin caches data in memory and performs mass and on-demand updates to disk, instead of individual writes, thus reducing i/o. I'm pretty sure that Munin also has something like this.
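The general idea of caching updates in memory and writing them out in batches can be sketched roughly as follows. This is a minimal illustration, not how Boost or RRDCached are actually implemented; the `BatchingWriter` class, its `max_pending` threshold, and the `write_batch` callback are all hypothetical names made up for the example.

```python
from collections import defaultdict


class BatchingWriter:
    """Buffer per-file updates in memory and write them out in batches."""

    def __init__(self, write_batch, max_pending=10):
        self.write_batch = write_batch    # callable(path, list_of_updates)
        self.max_pending = max_pending    # flush a file after this many updates
        self.pending = defaultdict(list)  # path -> buffered (timestamp, value)

    def update(self, path, timestamp, value):
        """Queue one sample; touch the disk only when the buffer is full."""
        self.pending[path].append((timestamp, value))
        if len(self.pending[path]) >= self.max_pending:
            self.flush(path)

    def flush(self, path):
        """Write everything buffered for one file (e.g. before graphing it)."""
        updates = self.pending.pop(path, [])
        if updates:
            self.write_batch(path, updates)

    def flush_all(self):
        """Write out every buffer, e.g. on shutdown."""
        for path in list(self.pending):
            self.flush(path)


# Demo: record each "disk write" instead of touching a real RRD file.
disk_writes = []
writer = BatchingWriter(lambda path, updates: disk_writes.append((path, len(updates))))
for t in range(0, 300, 10):  # 30 samples, one every 10 seconds
    writer.update("load.rrd", t, 0.5)
writer.flush_all()
print(f"{len(disk_writes)} disk writes for 30 samples")
```

The trade-off is that buffered samples are lost if the process dies before a flush, so the batch size is a balance between i/o savings and how much data you are willing to risk.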

If you can afford them, SSDs are also a good option.

Last but not least, you can also take a look at Reconnoiter. Reconnoiter is a brand new fault detection and graphing/trending tool. Unlike most trending tools, Reconnoiter is not RRDTool based and tries to solve this specific issue. I'm not using Reconnoiter in production, but I've run some tests, and despite still being a little "green", it looks really promising, especially regarding its scalability.

Hope this helps!

Marco Ramos
  • Zabbix also doesn't use RRD, it uses a backend like MySQL or Postgres. If you get your templates right and don't monitor useless stuff, you can easily scale. – coredump Apr 19 '11 at 01:19
2

Check out Zabbix. It is one of the best Open Source performance monitoring tools out there. It scales well and has been used in environments with thousands of computers.

Red Tux
0

Marco Ramos gives some solid advice. I want to add some clarification, however: the big problem with munin is its fixed 5-minute collection schedule. If all the nodes don't return results within the 5-minute window, you start getting dropouts. This is the biggest problem with munin.
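A rough way to see when the window gets blown is to add up per-node fetch times. The sketch below is a simplified model, not Munin's actual scheduler: the `run_duration` helper, the fetch times, and the worker counts are all hypothetical, and it simply hands each node to whichever poller becomes free first.

```python
import heapq


def run_duration(fetch_seconds, workers=1):
    """Wall-clock time to poll every node with `workers` parallel pollers."""
    finish = [0.0] * workers  # time at which each poller becomes free
    heapq.heapify(finish)
    for seconds in fetch_seconds:
        start = heapq.heappop(finish)            # earliest free poller
        heapq.heappush(finish, start + seconds)  # busy until the fetch ends
    return max(finish)


WINDOW = 300  # the fixed 5-minute collection window, in seconds

# Hypothetical fleet: 100 nodes answering in 2 s, 20 loaded nodes taking 8 s.
fetch = [2.0] * 100 + [8.0] * 20

for workers in (1, 4):
    duration = run_duration(fetch, workers)
    verdict = "fits" if duration <= WINDOW else "dropouts likely"
    print(f"{workers} poller(s): {duration:.0f} s -> {verdict}")
```

In this model a handful of slow nodes is enough to push a single sequential run past the window, while the same fleet fits comfortably once the work is spread over several pollers.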

Other rrdtool-based tools like Ganglia aren't locked into this same 5-minute update window because they don't poll all the data sources in the same sequential way that munin does.

I would recommend you look at Ganglia because it generally seems to scale well (although you do need to turn off the multicast data collection for a large Ganglia installation). I suspect you can go quite a long way with Ganglia before you need to start worrying about rrdtool being the choke point. At that point you can do the sorts of things that Marco suggests, like using SSD drives.

Phil Hollenback
0

I'm replacing Munin with Ganglia. Munin kills my server, so I'll give Ganglia a try and see how it scales.

sdot257
  • How did it go? I am interested in such a replacement myself ... – thanasisk Mar 13 '14 at 09:39
  • I prefer Munin's graphs, but Ganglia worked well. I've since left the job, but before I left I did replace Munin with Ganglia. With the latest release of Munin, I'm inclined to think that they tweaked the memory usage. I wouldn't hesitate to use either; it's a matter of preference, I guess. – sdot257 Mar 13 '14 at 18:52