For a more comprehensive list of monitoring tools and their features, check out this Wikipedia page.
As the question states, what are the most commonly used tools for this task, and what are their strengths and weaknesses?
I've used Nagios in the past with success. It's very extensible (over 200 add-ons), relatively easy to use, and offers lots of reports. A negative would be the initial setup.
Cacti is a very good web-based frontend to RRDTool, providing very handy graphs and stats. Cacti's poller gathers a wide range of technical data from multiple systems, and RRDTool is the part that stores that data and draws the graphs.
We're using that Cacti/RRDTool solution to monitor Unix and Windows systems. We get a lot of useful metrics including load, CPU/RAM usage, HD space, users logged in, network traffic, running processes, and so on.
You will find more information on cacti on the What is Cacti? page.
Personally, I love Munin, which is very easy to install and to write plugins for, as it has a very straightforward architecture. There are already plenty of plugins around for almost any purpose you could imagine, so you probably won't even have to write plugins in the first place.
It also provides beautiful graphs and the option to configure (very basic) alerts.
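To give an idea of how small a Munin plugin can be, here is a minimal sketch in Python. It relies on the usual plugin convention: when called with the argument "config" the plugin prints the graph description, and otherwise it prints one "<field>.value <number>" line per field. The field name "load" and the graph labels are made up for illustration, and os.getloadavg() is only available on Unix-like systems.

```python
#!/usr/bin/env python3
"""Hypothetical Munin plugin that reports the 1-minute load average."""
import os
import sys


def config_lines():
    # Printed when Munin calls the plugin with the "config" argument.
    return [
        "graph_title Load average (1 min)",
        "graph_vlabel load",
        "graph_category system",
        "load.label load",
    ]


def value_lines(load=None):
    # Printed on a normal run: one "<field>.value <number>" line per field.
    if load is None:
        load = os.getloadavg()[0]  # Unix only
    return [f"load.value {load:.2f}"]


def main(argv):
    lines = config_lines() if argv[1:2] == ["config"] else value_lines()
    print("\n".join(lines))


if __name__ == "__main__":
    main(sys.argv)
```

Dropped into the node's plugin directory and made executable, a script like this should be picked up like any other plugin, though the exact directory depends on the distribution.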
Zabbix. It's open-source, and reasonably simple to set up and customise. We have a lot of custom monitoring scripts that feed into the Zabbix server, and it takes care of centralising that data, displaying it appropriately, sending notifications (email, IM, SMS, Twitter, etc.), and so forth.
I have been doing rollouts of Spiceworks at our company, and we are finding it to be a great tool not just for monitoring servers but for everything else on the network.
It does things like automatic inventory and custom monitoring that sends you emails when there is a problem (e.g. a printer is down to 10% ink, or a server's hard drive has only 20% left).
Its downside is probably the density of information per computer. Don't get me wrong, it has A LOT of data per machine, but for things like servers, where you might want a lot of stats, you might need to use another tool.
EDIT: Oh, did I mention its business model is based around it being free forever?
Smokeping not only checks the availability of various servers and services but also keeps track of their latency while providing easy to use, nice looking, and quick to display graphs.
A wide range of latency measurement plugins is available out of the box. If you know some Perl, it is easy to create your own for any exotic needs.
Large installations will benefit from the Master/Slave system for distributed measurement.
A highly configurable alerting system will help you notice issues before they start affecting users or evolve into a major outage.
Smokeping is free, open source software written in Perl by Tobi Oetiker, the creator of MRTG and RRDtool.
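To illustrate the kind of numbers a latency tracker works with, here is a rough Python sketch that reduces one round of probes to a median latency and a loss percentage. It is not Smokeping's code; the function name and the convention of using None for a lost probe are assumptions made for this example.

```python
import statistics


def summarize(samples):
    """Summarize one round of probes.

    samples: round-trip times in milliseconds, with None for a lost probe.
    """
    if not samples:
        raise ValueError("need at least one probe")
    answered = [s for s in samples if s is not None]
    loss = 100.0 * (len(samples) - len(answered)) / len(samples)
    median = statistics.median(answered) if answered else None
    return {"median_ms": median, "loss_percent": loss}
```

For example, summarize([10.0, 12.0, None, 11.0]) reports a median of 11.0 ms with 25% loss. Graphing those two values over time is what makes slow degradation visible before a service goes down completely.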
OpenNMS is used where I work to monitor more than a thousand Linux machines. We monitor the hardware of each machine and the applications running on them.
Zenoss Core is of some use. We have been using it for about a year for lightweight monitoring of servers, network switches and UPSs.
Zenoss Core is an award-winning open source IT monitoring product that effectively manages the configuration, health and performance of networks, servers and applications through a single, integrated software package.
Nagios is great since it's free and there are plenty of plugins for it. However, the UI and config are very difficult.
Its exact opposite in pros and cons, which is also great, is Microsoft System Center Operations Manager (SCOM): it is not free and has fewer plugins, but setup and config are brilliant and easy.
I must admit that if I was in a primarily Microsoft company, had very high reliability requirements (i.e. can't afford for monitoring to break) or had to think about getting developers to work with it, then SCOM would be my recommendation over Nagios.
I've used:
We have been using AlertFox for a few weeks and are very happy with it. It not only checks our uptime and performance, but also monitors the shopping cart, user login and other critical parts of the website via transaction scripts (iMacros based).
For our internal monitoring (disk space etc) we use Nagios.
PRTG Network Monitor - can't say enough great things about it. Awesome web front end and especially great for monitoring routers (bandwidth etc) and other devices through SNMP and measuring uptime for SLA's, etc.
www.paessler.com
As a Windows person, MOM. We're looking to upgrade to System Center Operations Manager (SCOM) but won't need to until we start deploying Windows 2008.
I'm surprised nobody has mentioned logwatch or logcheck for Linux servers - they save a tonne of time reading logs!!
For monitoring statistics (memory usage, load, MySQL activity, Apache activity, etc.) I use Munin. Out of the box it already tracks a lot of things and plots graphs for different time intervals (last 24 hours, last 7 days, last month, last year). Through plugins even more things can be monitored. Its output is a set of HTML pages with pretty graphs.
Munin has a master/node architecture: nodes gather statistics on a server and the master stores the data and produces HTML and graphs.
I use Monit to keep track of running processes and to restart them or alert me when certain configurable conditions arise (high CPU load, high memory usage, no HTTP response, etc.). Monit can also monitor more general things about a server, such as CPU load, memory usage, hard disk status or disk usage.
Monit needs to be configured for every service or piece of hardware you want to monitor, including how to respond when something goes wrong. The most used options are to do nothing, send an alert email or restart the service.
Monit is great when it works, but sometimes it fails to start, stop or restart a service and there is not a lot of diagnostic information available to tell you what went wrong. This means you don't know if the problem was with your service or with the Monit configuration, which runs with a cron-like minimal environment.
Both tools are available by default on most Linux distributions.
Our project uses Ganglia for our 100+ node clusters. One reason we use it is that it's the monitoring tool that comes with Rocks.
It's important for us to have very low overhead on each node so that as many resources as possible are available for computation. Ganglia gives us a good overview of the cluster and allows us to drill down to individual nodes if needed. Besides knowing what's going on right now, we can get a pretty good look at what's happened over the last hour, day, week, month, and year. The graphs of various statistics are basic and functional.
I'm part of an operational monitoring upgrade project. We've had various vendors come onsite to present a few big-dollar systems, and mixed in some cheaper alternatives to compare.
One of these is Hyperic, which is also available as a free open source solution. I was impressed with its delivered capabilities and extensibility for custom agents.
I use Pingdom for monitoring my server. It sends me an SMS message when the server is unreachable.
It all depends what you mean by "monitor"!
A new entrant on the scene worth checking out as a competitor to Cacti and the RRDTool-based solutions is Graphite (http://graphite.wikidot.com/).
RRDTool is replaced with a backing store called Whisper. The docs give a pretty good overview of why it differs and I really like the CLI for ad hoc graphing when investigating something.
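Feeding data into Graphite is also simple. As far as I know, its Carbon daemon accepts a plaintext protocol of one "<metric path> <value> <timestamp>" line per data point, normally on TCP port 2003. The sketch below assumes that setup; the host name and metric path are placeholders, so check your own Carbon configuration before relying on it.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one plaintext line: path, value and Unix timestamp, newline-terminated."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(line, host="graphite.example.com", port=2003):
    # host is a placeholder; 2003 is assumed to be the plaintext listener port.
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
```

A cron job could then call send_metric(format_metric("servers.web01.load", 0.42)) once a minute.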
Hobbit - it's a faster, better version of Big Brother (which seems to be alarmingly commercial these days).
We use (and like) WhatsUp from Ipswitch for our relatively small Windows network. It is easy to set up, relatively easy to manage, and knows how to deal with Windows servers as well as standard stuff.
For larger networks, non-Windows-oriented networks, or networks with lots of varied stuff, I heartily recommend OpenNMS. The OpenNMS software is free, and the company is more than happy to sell support and implementation services. It also happens to be run by a very sharp friend of mine from college!
If you're in a hurry and want a quick tool to monitor your MS server, use Performance Monitor for Windows: set up a counter log with a custom monitoring template and a custom schedule (e.g. collect data for 5 minutes every hour). Then download Microsoft's LogParser and Codeplex's Performance Analysis of Logs (PAL) Tool (http://pal.codeplex.com/) to crunch your counter log. PAL will generate a well-documented report with links to documents and tools that may help solve the issues it finds.
For those who don't like the Nagios web interface there is NPC, a plugin for Cacti that makes the Nagios UI available from within Cacti, but with better looks (ajax etc.).
It reads from a database provided by NDO2DB, which is a great way to have your infrastructure available from within a database for use in scripts and other tools.
Currently we use PRTG from Paessler. It's excellent. No agents required, excellent Ajax web interface, historical logging, graphing, WMI, etc etc. There's a 10 sensor version available for free but we plonked down a couple of grand for the enterprise version. Money well spent.
Zabbix (http://www.zabbix.com) is good too and easier to set up than Nagios.
I use a combination of Solarwinds, VMware server performance tabs, and custom scripts.
Solarwinds Orion Network Performance Monitor is what I use with our Windows sys. admins on my web servers. Still getting some useful app metrics running on it, but it has good information on basic box level stuff (disk, network, CPU).
For my VMware guests, I love the performance tabs.
For my Sun servers, when I need something that isn't available in Solarwinds (because our admin hasn't added it or what), I write custom scripts (usually in Perl) to monitor things like mirror health, swap usage, etc.
I'd like to get more onto Solarwinds, but there's only like 26 hours in a day (or so my boss believes) so I find that can be a tad limiting...
Sorry to say but I've ended up using lots of custom scripts. While far from ideal I doubt there's a more common solution.
We've written our own monitoring software. Our code isn't nearly as sophisticated as a commercial package, but we didn't need much functionality. It was easier to write our own than to investigate other packages and learn how to use them. The code does just what we want and it's easy to extend.
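A home-grown check of the kind described in these answers can be very small. The following Python sketch reports disk usage for one path and flags it above a threshold. The 80% threshold, the output format and the function names are invented for illustration and are not taken from anyone's actual scripts.

```python
import shutil


def disk_usage_percent(path="/"):
    # shutil.disk_usage returns total, used and free space in bytes.
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_disk(path="/", warn_at=80.0, percent=None):
    # percent can be passed in directly, which makes the check easy to test.
    if percent is None:
        percent = disk_usage_percent(path)
    status = "WARN" if percent >= warn_at else "OK"
    return f"{status} {path} {percent:.1f}% used"


if __name__ == "__main__":
    print(check_disk("/"))
```

Run from cron with the output mailed on "WARN", this already covers a surprising amount of day-to-day monitoring. The trade-off is that history, graphs and escalation all have to be built by hand.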
I'm using PA Server Monitor. It's primarily Windows focused (event logs, performance counters, services, etc.), although it is getting better with other systems now that some limited SNMP support has been added. The thing I like best is that it's easy to configure compared to a lot of apps (no config files, no command lines, etc.). I wouldn't recommend it for a heavy *nix environment though.
Oh, it's not free, but less expensive than some competitors.
I use Polymon and love it.
http://www.codeplex.com/polymon
It's fantastic for monitoring anything that can be communicated by TCP Port, SNMP, Powershell, WMI, SQL, HTTP, Perfmon, or Ping.
I don't monitor anything *nix, so I can't speak to that. But for the Windows world it's very simple to set up, extremely intuitive, and extremely flexible. It has a very nice built-in dashboard display, SMS or email notification, etc.
I've worked with Pandora FMS, and I like it mainly because it's very flexible and easy to configure for the average sysadmin. I also like the web interface with all the reports, and the extensive documentation. Not very useful for a single datacenter, but very cool, is the geolocation interface that shows the position of the monitored agents.
I've also tried Nagios, and I like all the plugins it has and that it's well known among sysadmins.
Note: I've been one of the developers of Pandora FMS for some time.
For HP servers you can't beat their Systems Insight Manager (SIM), lots of lovely low-level counters and alerts etc., not a bad GUI either and the link to your support contract is worth the effort on its own.
We needed something customisable as we need to monitor some systems which are not online all the time, but can send mail or be dialled in.
We tried Nagios (maze of scripts), AppManager (nice, but not adaptable), Zenoss (nice, but when you mention Oracle, the price gets hefty multipliers) and landed on Zabbix, which has an open protocol and an open database structure; heck, I can write a plugin on every level in an hour. It's nicely compartmentalised (server, client, database, ...). And its web frontend is quite nice and customisable.
YMMV, for us the monitoring of "offline" systems is important and it is usually not covered by such software.
We use WhatsUp from Ipswitch. It's very easy to set up for small networks, it can autodiscover networks by port scan, and it can use Windows and SNMP credentials.
To monitor statistics like CPU, memory, and disk, we need to set up SNMP. WhatsUp supports SNMP v1, v2, and v3.
WhatsUp has passive monitors for syslog (Unix), the event viewer (Windows) and SNMP traps.
It has a nice Ajax web interface with custom users and custom workspaces.
P.S. Sorry for my bad English.
I've used hobbit, big brother and nagios when working for poorer (read cheaper) organizations. Of the three I prefer hobbit because it's simple and bulletproof. I've always felt that nagios is trying to be an open source version of openview or tivoli, and frankly if I have the time to spend configuring a framework like openview or tivoli then monitoring is probably my entire job and my organization can probably afford to buy openview, so why use nagios?
We've just started using "Servers Alive", which is very inexpensive. It isn't too pretty, but it supports a tonne of different checks, can alert in several ways, and handles technician scheduling/rosters etc. for any notifications. You can also make checks rely on others, i.e. "this" system requires "that" to be up/running.
For Windows: Admin Arsenal (but that's a given in that we own the product)
For Unix - IBM Tivoli
We use Orca to monitor our systems. It's not super pretty, but it gives a ton of low level details other monitoring systems don't use.
I use a combination of Nagios, Cacti, custom scripts and one of my own projects -- System Health Monitor. I like having external service monitoring as well as graphs of system resources so you can do post-mortem analysis of system problems or quickly check the graphs to see if things look 'normal' compared to their historical values.
Nagios combined with NagVis (graphics to show off monitoring),
linked to mail, Google Talk and Twitter... so you can't escape the monitoring.
It's even got a great Firefox plugin.
I am using nagios and hobbit (the open source implementation of bigbrother) independently and have found that both have positive and negative qualities.
nagios:
pro: has a nice sub-minute scheduler for running tasks at regular intervals and has an embedded perl interpreter to boot.
con: config insists on having a 'server' for every test, when sometimes you just want to run a test that is based on an application 'feature' but not necessarily isolated to a single host. I resort to a meta-config that generates the actual nagios config to overcome this.
hobbit:
pro: opensource compiled server instead of the massive scripts used by original big-brother
easy integration with the bb client 'dboard' command to poll data.
con: also stuck in a 'server-oriented' mentality, which fits most folks, but not me.
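The meta-config idea mentioned in the nagios 'con' above can be sketched roughly as follows: describe the checks per application feature, then generate one nagios service definition per host that provides the feature. This is only an illustration in Python. The feature names, host names and check commands are invented, and the "generic-service" template is assumed to exist in your nagios configuration.

```python
# Hypothetical feature-oriented description of what should be checked.
FEATURES = {
    # feature name -> (check command, hosts that provide the feature)
    "checkout": ("check_http", ["web01", "web02"]),
    "search": ("check_tcp!9200", ["search01"]),
}


def service_block(feature, command, host):
    # One nagios object definition tying the feature check to a single host.
    return "\n".join([
        "define service {",
        "    use                  generic-service",
        f"    host_name            {host}",
        f"    service_description  {feature}",
        f"    check_command        {command}",
        "}",
    ])


def render(features):
    blocks = [
        service_block(name, command, host)
        for name, (command, hosts) in sorted(features.items())
        for host in hosts
    ]
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(render(FEATURES), end="")
```

The generated text would be written to a file that the main nagios config includes, so the feature table stays the only thing edited by hand.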
Currently using Groundworks Open Source Community Edition 5.3 - although support has fallen by the wayside on that version now. May upgrade to GWOS 6 or perhaps jump ship to Zabbix or similar Open Source system. I tend to favour those based on Nagios, but wouldn't go for vanilla Nagios due to the nightmare of managing all those interdependent config files.
Groundworks' WMI Monitoring plugins for NRPE work pretty well. Nagios triggers a WMI service check on a windows box using NRPE, which then does the WMI querying of your other windows boxes. This gets around the requirement to have NRPE agents on your windows boxes, and also the nightmare of trying to get Nagios running on *Nix to authenticate on Windows.
Another nice option is to set up SNMP on your windows boxes as part of your base build. There are some options out there to expose WMI checks via SNMP (SNMPTools) (although you need to install this on each Windows box, making it not agentless).
There are a number of Windows tools which can monitor Windows logs and send an SNMP trap when certain events occur.
We're using AlertGrid; it's ideal for web apps. Unlike millions of typical dotcom monitors it does not monitor performance (response time etc.) from outside, but it lets you trace the execution of your code and all your custom metrics/statistics by sending events from inside your app. Once you start sending events from your app to AlertGrid, everything is configurable using a nice visual editor (100% web), and non-technical people can easily create their own alerting rules. Email, SMS, phone and webhook alerts are available.
It has a plugin for simple server monitoring (Windows), which installs as a service, runs in the background and emits events about CPU usage, % free RAM, and running processes. It takes half a minute to set up, and it works! The only caveat is that the machine must have an internet connection.
We started using Bijk.com server monitoring (http://www.bijk.com) several weeks ago.
We are happy with the simple installation and the very easy GUI and maintenance; free mail and SMS alerts are good for us.
I use 10-Strike Network Monitor.
It works as a service 24/7 and monitors all devices in the network by periodically polling each device within the LAN. I can also set up the program's response to particular events, for example a device or service going on or off. The program can display a message, play a sound, run external programs, write a record to a log, send an SMS, restart or shut down a service or a computer, and so on.
We use IP Check, which has been renamed PRTG. It allows for a wide range of sensors that can monitor all sorts of different activity.
Someone should mention Netgong for a simple on/off monitoring tool via ping intervals.
I use NetGain Enterprise Manager from NetGain Systems. It takes just a few minutes to install and get it up and monitoring. Best of all, it's free. Check out http://www.netgain-systems.com
The very VERY excellent multitail to keep an eye on logfiles. Nagios to keep my eye on service uptime. RRDtool to keep my eye on bandwidth.
OPManager (Ports, HTTP Get Requests, ICMP, SNMP (Disk/Memory/CPU)) (personal favourite!) http://www.manageengine.com/network-monitoring/
OpManager is an award winning network monitoring software that helps administrators discover, map, monitor and manage complete IT infrastructure.
Cacti (SNMP Graphing, Traffic, Disk Usage, CPU Utilisation etc) (http://www.cacti.net)
About Cacti. Cacti is a complete network graphing solution designed to harness the power of RRDTool's data storage and graphing functionality.
PRTG (Paessler, no longer available unfortunately)
SmokePing: (packet loss & latency) http://oss.oetiker.ch/smokeping/
Pingdom: http://www.pingdom.com
I've worked with a lot of monitoring systems at a lot of places. Most of them have already been mentioned. Here are a few that haven't been:
SMARTS - now owned by EMC. Really is the best thing ever for root cause. It's not cheap and support may not be good anymore as it's owned by EMC. We were lucky enough to work with the founders of the company to get it implemented.
Big Brother. Nice and simple, but a bad license. It's also the ugliest web gui I've ever seen, so I had to rewrite it. Never got Big Sister to work.
HP Openview, when engineered, installed and run by a competent engineer can be good. However I've only seen it done right once and wrong more often than I can remember. I would never choose to use it.
BMC Patrol. Just awful. Die, die!
And finally, for logs and tracking down problems you just have to use Splunk. If this had been around 10 years ago I would have saved myself a lot of wasted time.
EventLog Analyzer is a web-based, real-time, agentless tool for monitoring and managing event logs and application logs. It collects, analyzes, reports on, and archives event logs from distributed Windows hosts; syslog from distributed Unix hosts, routers, switches, and other syslog devices; and application logs from IIS web servers, IIS FTP servers, MS SQL Server, Oracle database servers, and DHCP servers on Windows and Linux. It generates graphs and reports that help in analyzing system problems with minimal impact on network performance.
Try GroundWork. It uses Nagios, so it has all the features of Nagios, and you can edit the monitoring setup graphically through a web interface, which is not possible with Nagios alone. https://kb.groundworkopensource.com/display/SUPPORT/Home
Please check Verax NMS. Advantages:
I've used Activexperts Network Monitor with great success (on a mostly Windows network, but it had some Unix and Linux hosts, printers of various brands and so forth that were also monitored with it).
It's really easy to set up and learn, rather cheap for what you get (it was $500 for a site/enterprise license) and supports VBScript and remote Unix commands. If the network is small (a few hundred nodes at most) I think this is much more intuitive than System Center Operations Manager, which feels more directed at huge Windows networks only.
Network Monitor comes with a lot of predefined scripts for monitoring things like e-mail servers (including various Exchange versions and all their services), HTTP servers with expected responses, event logs, SQL queries with expected responses, and so on. Dependencies are easy to configure ("all these depend on this router, so if it fails to respond to ping and SNMP, don't bother alarming us about all the stuff behind it that's not responding"). It supports SMS via a gateway or a local GSM modem, and all rules can of course have actions like service restart, server restart or a custom script to fix recurring problems for you (that's important I think, kinda like regression testing is for development).
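The dependency rule quoted above (don't alarm about everything behind a dead router) boils down to a small piece of logic. Here is a hedged Python sketch of it. The function name and the node names are invented, and this is not how Network Monitor itself is implemented.

```python
def alerts_to_send(failed, depends_on):
    """Return the failed nodes that have no failed node upstream of them.

    failed: iterable of node names that are currently not responding.
    depends_on: dict mapping a node to the node it sits behind (or None).
    """
    failed = set(failed)

    def has_failed_ancestor(node):
        seen = set()  # guards against accidental loops in the dependency map
        parent = depends_on.get(node)
        while parent is not None and parent not in seen:
            if parent in failed:
                return True
            seen.add(parent)
            parent = depends_on.get(parent)
        return False

    return sorted(n for n in failed if not has_failed_ancestor(n))
```

With depends_on = {"mail01": "router1", "web01": "router1", "router1": None}, a failure of all three produces a single alert for router1, while a failure of mail01 alone still alerts on mail01.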
...I've also tried to tame a Hobbit and didn't really enjoy it at all (nor the bloated Windows agent) - but it was set up for Windows server monitoring and it really blows at that - most likely more suited for a linux or unix-centric network.
We use Hyperic - it has both an open source version and a commercial one.
It monitors the operating system (RHES 3, 4 and 5 + Ubuntu), Apache, MySQL, JBoss, Tomcat, mail servers and memcached, and it can probably monitor more applications. No special configuration is needed; all servers were found by the auto discovery, even if they were installed in an untraditional place. It is very easy to use and configure, and you can control your services (start/stop etc.) and define alerts.
Minuses - You need to configure it to run on boot (we are using cron to do that).
Nagios with groundwork on top of it.
I'm not sure if groundwork helps or hinders, but nagios is definitely good.
We use Level Platforms for this task. Provides a ton of useful information without overloading the sysadmins, and makes it extremely easy to handle all of the hardware in our server room (as well as many of our clients').
Ipswitch's WhatsUp Gold
We've tried Applications Manager. It runs on Java and MySQL. It's really powerful and easy to configure from the browser. It's not that expensive either.
Currently we use SCOM from MS. I wouldn't recommend it to anyone!
Also take a look at Argent Guardian. It's cross-platform, can function as a syslog server, they'll give you the database schema to do your own reporting, if you need that, and you can import your own images as "maps" to give visual alerts.
We use Ipswitch WhatsUp Gold 12 for monitoring about 2000 devices, both Windows and Linux, with performance monitors as well as TCP/IP- and WMI-based monitors. The good thing about it is that it is easy to use and configure, has bulk change options and autodiscovery, and offers multiple notification methods. The bad side: it seems to have a limit of about 2000 devices, after which performance gets slow, and it only runs on Windows. The distributed version doesn't really deserve the name or the price tag. We evaluated Nagios (setup too complex for a dynamic environment) and Zenoss (no bulk change or autodiscovery, too limited for a dynamic environment), and we are currently looking at Zabbix, which seems the most promising, with all the nice features WhatsUp has and more, such as a fully distributed architecture with probes and server, relatively simple setup, open source backend (MySQL, Apache)...
I've been using Sysmon for a number of years. There are a few modern services that it doesn't monitor, but it compiles easily on most *nix platforms, has almost no dependencies, is extremely light-weight, can monitor very large numbers of devices and services with ease, and can handle complex network layouts (incl. ring topologies) and failover monitoring. It's basically a config file deal, but the format is pretty easy (based on plist/css).
Nagios and HP OpenView are the two that I am familiar with and have experience in. Both are good choices, although for the latter I'll echo other posters that it needs someone who knows how to do it right. Then again, the only place I saw it running was when I was with HP, so that might have helped my perception.
For the status of servers and services (whether they are up or down, and sending warnings if they go down) and for yes/no questions ("has a backup been done in the last 24 hours?") we use nagios. It is hard to set up, but it is immensely configurable. Custom scripts can be run on remote computers. Alerts can send emails, send text messages or even run custom scripts.
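A yes/no check like the backup question above maps neatly onto the usual nagios plugin convention: print one status line and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The Python sketch below assumes that "a backup has been done" can be judged from the modification time of a backup file. That assumption, the 24-hour limit and the function names are illustrative only.

```python
import os
import time

# Conventional nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_backup_age(mtime, now, max_age_hours=24):
    # Compare the age of the newest backup against the allowed maximum.
    age_hours = (now - mtime) / 3600.0
    if age_hours <= max_age_hours:
        return OK, f"BACKUP OK - last backup {age_hours:.1f}h ago"
    return CRITICAL, f"BACKUP CRITICAL - last backup {age_hours:.1f}h ago"


def main(path):
    # Returns the exit code; the caller is expected to pass it to sys.exit().
    try:
        mtime = os.path.getmtime(path)
    except OSError as exc:
        print(f"BACKUP UNKNOWN - cannot stat {path}: {exc}")
        return UNKNOWN
    code, message = check_backup_age(mtime, time.time())
    print(message)
    return code
```

To use it as a plugin, a small wrapper script would call sys.exit(main(path_to_backup_file)), and nagios would then treat the exit code as the service state.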
For the health of servers we use munin - it provides nice graphs of memory usage, cpu usage, network usage etc. Pretty easy to set up on linux at least (I have not tried with Windows).
ServersAlive is a relatively cheap, simple tool for all sorts of polling, including TCP services, Windows services, your own custom scripts, whatever. The response from the developer on his mailing list is rapid and personal.
I used it at a previous job for service monitoring and it was reliable, customisable and cheap.
MSP Center (the former OpManager) is really frustrating to use and I can't recommend it. The interface is entirely web-based which means zero feedback and an arbitrarily limited set of choices any time you want to do something. Their website seems full of tips and documentation, but it's a bit like Outlook - it promises a whole bunch of power but is hamstrung by some developer's limited imagination.
If you're looking for a zero-config solution for your helpdesk, well maybe, but it's not any sort of power tool. If you have time to tune your monitoring to meet your needs then there are other solutions that would reward your efforts more.