Greetings,
I'd like to ask the collective's opinion on distributed monitoring systems: what do you use, and what are you aware of that might tick my boxes?
The requirements are quite complex:
No single point of failure. Really. I'm dead serious! It needs to tolerate the failure of one or more nodes, both 'master' and 'worker'. You may assume that no monitoring location ("site") has multiple nodes in it and that no two nodes are on the same network. This probably rules out traditional HA techniques such as DRBD or Keepalived.
Distributed logic. I would like to deploy 5+ nodes across multiple networks, in multiple datacentres and on multiple continents. I want the "bird's eye" view of my network and applications from the perspective of my customers. Bonus points for the monitoring logic not getting bogged down when you have 50+ nodes, or even 500+ nodes.
Needs to be able to handle a fairly reasonable number of host/service checks, a la Nagios. For ballpark figures, assume 1500-2500 hosts and 30 services per host (roughly 45,000-75,000 service checks in total). It'd be really nice if adding more monitoring nodes allowed you to scale relatively linearly; in 5 years' time I might be looking to monitor 5000 hosts and 40 services per host! Following on from my note above about 'distributed logic', it'd be nice to say:
- In normal circumstances, these checks must run on $n or n% of monitoring nodes.
- If a failure is detected, run checks on another $n or n% of nodes, correlate the results and then use them to decide whether the criteria have been met to issue an alert.
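To make the two rules above a bit more concrete, here's a rough Python sketch of the node selection I have in mind. Everything in it is my own illustration rather than a feature of any existing tool: the `pick_nodes` function, the node and check names, and the choice of rendezvous-style hashing to get a stable subset per check are all assumptions.

```python
import hashlib
import math


def pick_nodes(check_id, nodes, count=None, percent=None, exclude=()):
    """Pick a stable subset of monitoring nodes to run one check.

    Hypothetical helper: ranks nodes by a hash of (check, node), so each
    check lands on the same nodes every time without central coordination.
    """
    if count is None and percent is None:
        raise ValueError("give either count ($n) or percent (n%)")
    candidates = [n for n in nodes if n not in exclude]
    if count is None:
        # n% is taken of the whole fleet, rounded up
        count = math.ceil(len(nodes) * percent / 100)
    count = min(count, len(candidates))
    ranked = sorted(
        candidates,
        key=lambda n: hashlib.sha256(f"{check_id}|{n}".encode()).hexdigest(),
    )
    return ranked[:count]


nodes = [f"node{i}" for i in range(10)]

# Normal circumstances: the check runs on 3 nodes.
primary = pick_nodes("core-router/ping", nodes, count=3)

# A failure was reported: re-run on another 50% of the fleet,
# skipping the nodes that have already reported.
secondary = pick_nodes("core-router/ping", nodes, percent=50, exclude=primary)
```

The results from `primary` and `secondary` would then be correlated before anything is sent to a human.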
Graphs and management-friendly features. We need to track our SLAs, and knowing whether our 'highly available' applications are up 24x7 is somewhat useful. Ideally your proposed solution should do reporting "out of the box" with minimal faff.
Must have a solid API or plugin system for developing bespoke checks.
Needs to be sensible about alerts. I don't necessarily want to know (via SMS, at 3am!) that one monitoring node reckons my core router is down. I do want to know if a defined percentage of them agree that something funky is going on ;) Essentially what I'm talking about here is "quorum" logic, or the application of sanity to distributed madness!
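As a minimal sketch of the quorum rule I mean, assuming each node simply reports OK or failed for a check: the `should_alert` name, the 60% threshold and the minimum of three reports are made-up values for illustration, not anything a particular product offers.

```python
def should_alert(results, quorum_percent=60, min_reports=3):
    """Decide whether to page someone for one check.

    results maps node name -> True (check passed) or False (check failed).
    Hypothetical rule: alert only when enough nodes have reported and at
    least quorum_percent of them agree the check is failing.
    """
    if len(results) < min_reports:
        # A lone node's opinion is never enough to wake anyone up.
        return False
    failures = sum(1 for ok in results.values() if not ok)
    return failures * 100 >= quorum_percent * len(results)


# One node out of three thinks the router is down: stay quiet.
quiet = should_alert({"lon": False, "nyc": True, "syd": True})

# Two out of three agree: send the SMS.
loud = should_alert({"lon": False, "nyc": False, "syd": True})
```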
I'm willing to consider both commercial and open source options, although I'd prefer to steer clear of software costing millions of pounds :-) I'm also willing to accept there may be nothing out there which ticks all those boxes, but I wanted to ask the collective anyway.
When thinking about monitoring nodes and their placement, bear in mind that most of these will be dedicated servers on random ISPs' networks and thus largely out of my sphere of control. Solutions which rely on BGP feeds and other complex networking antics likely won't suit.
I should also point out that I've evaluated, deployed or heavily used/customized most of the open source flavours in the past, including Nagios, Zabbix and friends -- they're really not bad tools, but they fall flat on the whole "distributed" aspect, particularly with regard to the logic discussed in my question and 'intelligent' alerts.
Happy to clarify any points required. Cheers guys and gals :-)