Geographically distributed, fault-tolerant and "intelligent" application/host monitoring systems

Question

Greetings,

I'd like to ask the collective's opinion on distributed monitoring systems: what do you use, and what are you aware of that might tick my boxes?

The requirements are quite complex:

No single point of failure. Really. I'm dead serious! It needs to tolerate single or multiple node failures, of both 'master' and 'worker' nodes, and you may assume that no monitoring location ("site") has multiple nodes in it and that no two nodes are on the same network. This probably rules out traditional HA techniques such as DRBD or Keepalived.
Distributed logic. I would like to deploy 5+ nodes across multiple networks, in multiple datacentres and on multiple continents. I want the bird's-eye view of my network and applications from the perspective of my customers. Bonus points for the monitoring logic not getting bogged down when you have 50+ nodes, or even 500+ nodes.
Needs to handle a fairly reasonable number of host/service checks, a la Nagios; for ballpark figures assume 1500-2500 hosts and 30 services per host. It'd be really nice if adding more monitoring nodes allowed you to scale relatively linearly; perhaps in 5 years' time I might be looking to monitor 5000 hosts and 40 services per host! Following on from my note above about 'distributed logic', it'd be nice to say:
- In normal circumstances, these checks must run on $n or n% of monitoring nodes.
- If a failure is detected, run checks on another $n or n% of nodes, correlate the results and then use them to decide whether the criteria have been met to issue an alert.
Graphs and management friendly features. We need to track our SLAs and knowing whether our 'highly available' applications are up 24x7 is somewhat useful. Ideally your proposed solution should do reporting "out of the box" with minimal faff.
Must have a solid API or plugin system for developing bespoke checks.
Needs to be sensible about alerts. I don't want to necessarily know (via SMS, at 3am!) that one monitoring node reckons my core router is down. I do want to know if a defined percentage of them agree that something funky is going on ;) Essentially what I'm talking about here is "quorum" logic, or the application of sanity to distributed madness!
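The "quorum" idea in the requirements above could be sketched roughly as follows. This is only an illustration, not a feature of any existing tool: the `CheckResult` type, the `quorum_alert` function, and the `threshold` and `min_reports` parameters are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CheckResult:
    node: str   # monitoring node that ran the check (hypothetical name)
    ok: bool    # True if the check passed from that node's point of view


def quorum_alert(results: Iterable[CheckResult],
                 threshold: float = 0.5,
                 min_reports: int = 3) -> bool:
    """Return True only when more than `threshold` of reporting nodes see a failure.

    Assumption: with fewer than `min_reports` reports there is no quorum,
    so no alert is raised on the strength of one or two nodes.
    """
    results = list(results)
    if len(results) < min_reports:
        return False
    failures = sum(1 for r in results if not r.ok)
    return failures / len(results) > threshold


reports = [
    CheckResult("london", ok=False),
    CheckResult("new-york", ok=True),
    CheckResult("singapore", ok=True),
]
# One node out of three disagrees, so this does not reach the 50% threshold.
should_page = quorum_alert(reports)
```

A single node with a flaky uplink then stays quiet at 3am, while a failure seen by most nodes still pages someone.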

I'm willing to consider both commercial and open source options, although I'd prefer to steer clear of software costing millions of pounds :-) I'm also willing to accept there may be nothing out there which ticks all those boxes, but wanted to ask the collective that.

When thinking about monitoring nodes and their placement, bear in mind most of these will be dedicated servers on random ISPs' networks and thus largely out of my sphere of control. Solutions which rely on BGP feeds and other complex networking antics likely won't suit.

I should also point out that I've either evaluated, deployed or heavily used/customized most of the open source flavours in the past including Nagios, Zabbix and friends -- they're really not bad tools but they fall flat on the whole "distributed" aspect, particularly with regards to the logic discussed in my question and 'intelligent' alerts.

Happy to clarify any points required. Cheers guys and gals :-)

That's really strange, I was about to ask a similar question. This week we had some customer complaints about site outages, but only from certain locations. Our alert systems did not detect these problems. We contacted our provider and they confirmed that they had some backbone problems. So I'm also interested in a solution. Thanks! – splattne, Jul 04 '09 at 16:37

pQd · Accepted Answer · 2009-07-04T16:56:38.313

4

not an answer really, but some pointers:

definitely take a look at the presentation about nagios @ goldman sachs. they faced the problems you mention - redundancy, scalability: thousands of hosts, also automated configuration generation.
i had a redundant nagios setup but at a much smaller scale - 80 servers, ~1k services in total. one dedicated master server, one slave server pulling configuration from the master at regular intervals, a few times a day. both servers monitored the same machines and cross-checked each other's health. i used nagios mostly as a framework for invoking custom product-specific checks [ a bunch of cron jobs executing scripts doing 'artificial flow controls'; results were logged to sql, and nrpe plugins were checking for successful / failed executions of those in the last x minutes ]. all worked very nicely.
your quorum logic sounds good - a bit similar to my 'artificial flows' - basically go on, implement it yourself ;-]. and have nrpe just check some kind of flag [ or sql db with timestamp-status ] to see how things are doing.
you'll probably want to build some hierarchy to scale - you'll have some nodes that gather an overview of other nodes; do look at the presentation from the first point. default nagios forking for every single check is overkill at a higher number of monitored services.
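The "have nrpe just check some kind of flag" idea could look something like the sketch below. It assumes a hypothetical flag format of one line, `<epoch_seconds> <status>`, written by the cron job; the function name `evaluate_flag` and the 300-second freshness limit are made up for illustration. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention.

```python
from typing import Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_flag(line: str, now: float, max_age: float = 300.0) -> Tuple[int, str]:
    """Map a '<epoch_seconds> <status>' flag line to (exit code, message).

    Assumption: the flag format and the 300s freshness limit are invented
    for this example; a real flow would define its own.
    """
    try:
        stamp_text, status = line.split(maxsplit=1)
        stamp = float(stamp_text)
    except ValueError:
        # Covers an empty line, a missing status, or a non-numeric timestamp.
        return UNKNOWN, "UNKNOWN - flag is malformed"
    status = status.strip()
    age = now - stamp
    if age > max_age:
        # A stale flag means the cron job itself has stopped reporting.
        return CRITICAL, f"CRITICAL - flag is stale ({age:.0f}s old)"
    if status.upper() == "OK":
        return OK, "OK - last flow succeeded"
    return CRITICAL, f"CRITICAL - last flow reported {status}"
```

A small wrapper would read the flag file, call `evaluate_flag(line, time.time())`, print the message and exit with the returned code, which is all NRPE needs from a plugin.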

to answer some questions:

in my case environment monitored was typical master-slave setup [ primary sql or app server + hot standby ], no master-master.
my setup involved a 'human filtering factor' - a resolver group who were a 'backup' for sms notification. there was already a paid group of technicians who for other reasons had 24/5 shifts; they got 'checking nagios mails' as an additional task that did not put too much load on them. and they were in charge of making sure that db-admins / it-ops / app-admins were actually getting up and fixing problems ;-]
i've heard lots of good things about zabbix - for alerting and plotting trends, but never used it. for me munin does the trick; i have hacked a simple nagios plugin checking if there is 'any red' [ critical ] color on the munin list of servers - just an additional check. you can as well read values from munin rrd-files to decrease the number of queries you send to the monitored machine.

edited Jul 04 '09 at 16:56

answered Jul 04 '09 at 16:13


Yeah, I have a substantial amount of experience deploying Nagios into reasonably big environments. Automation of the configuration is possible by using a scripting language to yank relevant data out of a directory service like OpenLDAP. Unfortunately what Nagios lacks is the distributed awareness, the ability for all nodes to understand "state" everywhere in the 'cluster' at any given point, and thus make intelligent decisions about when to trigger alerts. We actually use Nagios at the moment hence moaning about 3am text messages for the wrong reasons ;) – nixgeek Jul 04 '09 at 16:16
What happens when your master goes **SPLAT** and does that make the slave a bit unhappy? What about if SQL goes away, did you end up looking at multi-master MySQL or PostgreSQL replicated configurations? How do you configure the solution when your 'master' isn't available? – nixgeek Jul 04 '09 at 16:19
@astinus - well for sensible alerts i used a custom notification script. instead of relying on nagios notify by mail/pager i stored the message in a fifo queue and had a consumer that dispatched the message based on custom logic [ based on a quite flexible on-call schedule etc ]; additionally there was some limit on msgs sent per hour so one does not get 50 smses in a short while. i see similar approaches at larger scales - nagios is just a skeleton and people script around it and actually use less and less of its features. – pQd Jul 04 '09 at 16:21
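A rough sketch of that queue-plus-consumer pattern with an hourly cap, for illustration only. The class name `RateLimitedDispatcher`, its methods and the `max_per_hour` default are hypothetical, and holding excess messages in the queue (rather than dropping them) is an assumption, not something stated in the comment above.

```python
from collections import deque
from typing import Callable, Deque


class RateLimitedDispatcher:
    """FIFO notification queue whose consumer sends at most N messages per hour."""

    def __init__(self, send: Callable[[str], None], max_per_hour: int = 5):
        self.send = send                      # e.g. an SMS gateway call (not shown)
        self.max_per_hour = max_per_hour
        self.queue: Deque[str] = deque()      # pending messages, oldest first
        self.sent_at: Deque[float] = deque()  # timestamps of recent sends

    def enqueue(self, message: str) -> None:
        self.queue.append(message)

    def drain(self, now: float) -> int:
        """Send as many queued messages as the hourly budget allows."""
        # Forget sends that are an hour old or more.
        while self.sent_at and now - self.sent_at[0] >= 3600:
            self.sent_at.popleft()
        delivered = 0
        while self.queue and len(self.sent_at) < self.max_per_hour:
            self.send(self.queue.popleft())
            self.sent_at.append(now)
            delivered += 1
        return delivered
```

The notification command would only call `enqueue`; a separate consumer loop calls `drain(time.time())` periodically, and that is where on-call schedule logic could be added.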
With regards to hierarchy, what I have at the moment is an entirely "modular" Nagios setup where your etc/ directory contains a 'core' configuration which is shared (and identical) on all hosts and then etc/modules/$NAME (ie: Mail, Web, Network, DNS) which is 100% portable between servers. Include with cfg_dir =) You put in any module-specific commands, plugins and **everything** to that directory. Making >1 server run those checks is pretty easy as you just copy the module to as many Nagios boxes as required, however once again, the alert logic causes problems :-) – nixgeek Jul 04 '09 at 16:22
@astinus#2. in my case config replication master->slave occurs every 6h. if the master just dies [power outage etc] - the slave will alert everyone about the master being dead [ crosscheck between servers ]. one can imagine another scenario - when the master dies because of misconfiguration. if that happens up to 5 min before config sync to slave - there will be notification. if it's just before config sync - unfortunately we end up not having a monitoring system. 'who will watch the watchman'? well maybe yet another very simple nagios. – pQd Jul 04 '09 at 16:25
@pQd - interesting, I do agree that implementing the logic in custom notification scripts is probably the way to go. However it gets pretty tricky to avoid duplicate notifications from 2+ hosts, when you have say 50 monitoring hosts, and I've yet to see anyone (in public) put their shared logic into a proper 'message' passing system like Rabbit or Amazon SQS. – nixgeek Jul 04 '09 at 16:26
@astinus#3 in my case it was 'Level 8' [of iso osi model] solution: primary nagios was sending sms'es to people on call + mails to 24/5 'resolver group', while 2ndary nagios was only mailing 'resolver group'. it was up to that group to filter duplicates before escalating; – pQd Jul 04 '09 at 16:28
@pQd: Replication every six hours is a neat idea, but may not *quite* tick my boxes, as I'd be concerned about potential problems replicating to "more" hosts, ie: a 1 -> 50 relationship of sorts. There's also a desire to not care if an individual node dies, we put them with *cheap* dedicated providers around the world who sometimes do take 1-2 days to fix hardware faults. – nixgeek Jul 04 '09 at 16:30
@pQd: I think the problem with leaving it to someone else to escalate is you have to employ them ;) I'd prefer to put that logic in software if possible, as humans make mistakes and cost cash. – nixgeek Jul 04 '09 at 16:31
@astinus#4 sure - have two types of checks: for each server [ these do not generate sms'es/escalations ] - but are useful during the day, and for 'services' [ which are delivered by HA clusters rather than single nodes ]. if a 'service' is down [ eg no ping via firewall, no answer on port 80 of http proxy/cluster ] - then you want to wake someone up. – pQd Jul 04 '09 at 16:32
@pQd: Agreed on the checks point. Now how do you stop Nagios sending text messages when the provider network of **one** box has issues? Even with 10 boxes you can reasonably assume that false alerts will happen, and those need to be eliminated as far as is possible. – nixgeek Jul 04 '09 at 16:33
@astinus#5 i'm not sure if i understand the question right, but to avoid false alerts i always check 2-3 times before there is a nagios notification. this obviously delays the actual notification [ i get it 2-3 minutes after the critical event ], but it also filters out 'noise' caused by temporary glitches. at the same time for critical things my artificial flow was counting the number of HTTP/500 responses in the last 30 minutes and if every 2nd execution of AF ended with that - was sending an alert anyway - via another check. – pQd Jul 04 '09 at 16:36
@pQd; RE: "multiple checks" -- unfortunately checking 2-3 times before there is a Nagios notification doesn't always hit the sweetspot, in the event *one* offsite provider has a network problem you'll sometimes get false alerts screaming your core network is down. Unfortunately staff get bored of these after a while when 99% of the time they are false alerts ;) – nixgeek Jul 04 '09 at 16:45
@astinus well then - custom checks again. on cron - ping providers router, ping your hosts in given location, store results in sql. have nrpe looking at last 5 min results, and return OK/CRITICAL. it is again pushing logic out of nagios, but i'm afraid that's the only solution. – pQd Jul 04 '09 at 16:51
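The "store results in sql, have nrpe look at the last 5 min" approach might be sketched like this with SQLite. The table name `ping_results`, its columns, and the policy "CRITICAL only if every probe in the window failed" are assumptions made for the example; the comment above only says to return OK/CRITICAL from recent results.

```python
import sqlite3

# Nagios plugin exit codes used here.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical table that the cron-driven ping job writes into.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ping_results ("
        " target TEXT NOT NULL,"
        " checked_at REAL NOT NULL,"   # epoch seconds
        " success INTEGER NOT NULL)"   # 1 = reply received, 0 = no reply
    )


def evaluate_window(conn: sqlite3.Connection, target: str,
                    now: float, window: float = 300.0) -> int:
    """Return a Nagios exit code from the probes recorded in the last `window` seconds."""
    total, successes = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(success), 0) FROM ping_results"
        " WHERE target = ? AND checked_at >= ?",
        (target, now - window),
    ).fetchone()
    if total == 0:
        return UNKNOWN  # no data: the cron job may not be running
    # Assumed policy: one successful probe in the window is enough for OK.
    return OK if successes > 0 else CRITICAL
```

Probing the provider's router alongside your own hosts, as suggested, gives the plugin the data it would need to tell "my site is down" apart from "this monitoring node's uplink is down"; that correlation step is not shown here.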

score 1 · Answer 2 · answered Mar 18 '12 at 18:27

What you are asking for sounds a lot like what Shinken has done for Nagios.

Shinken is a Nagios rewrite.

Modern language (Python)
Modern distributed programming framework (Pyro)
Monitoring Realms (multi-tenancy), HA, spares
Livestatus API
Nagios plugin compatible
Native NRPE execution
Business criticality of objects
Business rules can be applied to the state of objects (managing cluster or pool availability)
Graphing can use Graphite or RRDtool based PNP4nagios
Stable and being deployed in large environments
Big deployments can consider pairing it with Splunk for reporting or look into Graphite where RRDtool is not a good fit.

This should be food for thought.

Cheers
