Computers are far better than I at analyzing data. I personally prefer systems like OpsView that digest situations and offer a multifaceted interface. Monitoring stats are filtered for abnormal conditions, and individual alerts are delivered to admins responsible for the system. There's an overall health dashboard that's viewable by helpdesk and management that gives an impression of how bad an outage is and whether anyone who can fix it is working on it yet. They put it on rotation on the big screen as something you can see at a glance, not something you stare at all day. Scrolling text and flashing lights aren't how salaried employees should interface with your monitoring systems.
Conrad Albrecht-Buehler has a Google Techtalk ("Making Monitoring Suck Less") that discusses the merits and shortcomings he sees in current dashboard UI design, and proposes some improvements. I don't know if he's published code or even his thesis. The general idea is simple:
- You define situation monitoring as capturing a set of signals about a state: load, free disk space, network traffic, or even higher-level things like forum posts per hour.
- Then you define a heed function that maps the wide input signal onto a range from 0 to 1, with 0 being "ignore" and 1 being "zomg!". In terms of Nagios, he replaces the discrete WARNING state with a numeric WARNING level.
- Finally you define an aggregator to summarize and prioritize those WARNING signals.
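To make those three steps concrete, here's a rough sketch in Python. It isn't taken from the talk: the linear mapping, the `Signal` class, the threshold values, and the "worst signal wins" aggregation are all my own assumptions, picked because they're the simplest thing that fits the description.

```python
from dataclasses import dataclass


def linear_heed(value, ignore_at, alarm_at):
    """Map a raw signal onto [0, 1]: 0 at ignore_at, 1 at alarm_at.

    Assumed linear mapping. It also works for "lower is worse" signals
    such as free disk space: just pass ignore_at > alarm_at.
    """
    if ignore_at == alarm_at:
        raise ValueError("ignore_at and alarm_at must differ")
    score = (value - ignore_at) / (alarm_at - ignore_at)
    return max(0.0, min(1.0, score))  # clamp to the 0..1 range


@dataclass
class Signal:
    """One captured signal plus the (hypothetical) thresholds for its heed."""
    name: str
    value: float
    ignore_at: float
    alarm_at: float

    @property
    def heed(self):
        return linear_heed(self.value, self.ignore_at, self.alarm_at)


def aggregate(signals, top=3):
    """Summarize: overall heed is the worst signal; also return the top few."""
    ranked = sorted(signals, key=lambda s: s.heed, reverse=True)
    overall = ranked[0].heed if ranked else 0.0
    return overall, ranked[:top]


signals = [
    Signal("load", 6.0, ignore_at=2.0, alarm_at=10.0),
    Signal("free_disk_pct", 10.0, ignore_at=50.0, alarm_at=5.0),
    Signal("forum_posts_per_hour", 40.0, ignore_at=100.0, alarm_at=0.0),
]
overall, worst = aggregate(signals)
print(f"overall heed {overall:.2f}, look first at: {[s.name for s in worst]}")
```

Taking the maximum is only one choice of aggregator; a weighted sum or a count of signals above some level would fit the same shape.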
As for specific tools you could use to write your own monitoring system: Nagios check scripts have a decent interface (this is probably where you'd glue in a heed mapping if you like the idea). You can store signals with rrdtool and generate graphs from that, and there's a Django app called Graphite that renders rrd databases. There's also NagVis:
NagVis is a visualization addon for the well known network management system Nagios.
NagVis can be used to visualize Nagios Data, e.g. to display IT processes like a mail system or a network infrastructure.
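Coming back to the Nagios glue point: a check script reports its state through its exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and one line of output, with optional performance data after a `|`. A minimal sketch of wrapping a heed value in that interface might look like the following. The function names and the 0.5 / 0.9 cut-offs are made up for illustration; only the exit codes and the output layout come from the plugin convention.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def heed_to_status(heed, warn_at=0.5, crit_at=0.9):
    """Collapse a 0..1 heed value back into a Nagios state.

    warn_at and crit_at are assumed cut-offs, not anything Nagios defines.
    """
    if not 0.0 <= heed <= 1.0:
        return UNKNOWN  # out of range (or NaN): the heed function is broken
    if heed >= crit_at:
        return CRITICAL
    if heed >= warn_at:
        return WARNING
    return OK


def plugin_line(service, heed, warn_at=0.5, crit_at=0.9):
    """Build the exit code and the single output line a check would emit."""
    status = heed_to_status(heed, warn_at, crit_at)
    text = f"{service} {LABELS[status]} - heed={heed:.2f}"
    # Perfdata layout: label=value;warn;crit;min;max
    perfdata = f"heed={heed:.2f};{warn_at};{crit_at};0;1"
    return status, f"{text} | {perfdata}"


def main(heed):
    """A real check script would finish with sys.exit(main(...))."""
    status, line = plugin_line("LOAD", heed)
    print(line)
    return status
```

The perfdata part is what lets the raw heed value survive: Nagios only acts on the state, but whatever graphs your performance data can still see the number behind it.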