4

This is one of those generic questions whose only correct answer is "it depends". What are the criteria?

  • What is monitored?
    • Reachability, availability? e.g. is a link up/down, does host respond to ICMP, etc.
    • Services? e.g. is something listening at the right port, is a named service running, etc.
    • Resources? CPU usage? e.g. % of total possible, cumulative time, total or per process. Disk usage? Network usage? e.g. bytes or packets moved in or out.
    • Service or application specific metrics? e.g. DB transactions per second, SMTP messages sent or received, etc.
  • How are monitored elements discovered/added/setup/configured? Is there auto discovery? Manual setup?
  • How are particular elements monitored?
    • a local agent? e.g. to do periodic "df" or "ps" or "ping"
    • SNMP?
    • JMX?
    • Windows performance counters?
  • How is notification done? e.g. console, email, pager, SMS, IM, etc.
  • How are elements and notifications grouped and prioritized?
    • e.g. will a link failure set off notifications for all the service or reachability elements behind that link? Or just one? Or is it configurable?
    • e.g. will a host failure set off notifications for all the services or applications hosted there and for lack of resource monitoring data?
    • is there automatic case/ticket/issue creation in tracking system?
  • How is tracking to SLA metrics done?
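The grouping question above (does a link or host failure flood you with notifications for everything behind it?) can be pictured with a small sketch. It assumes each monitored element has at most one parent it depends on. The topology and the names `PARENTS` and `root_causes` are hypothetical and not taken from any particular tool.

```python
# Hypothetical topology: element -> the parent it sits behind (None = top level).
PARENTS = {
    "link-a": None,
    "host-1": "link-a",
    "host-2": "link-a",
    "smtp@host-1": "host-1",
    "db@host-2": "host-2",
}


def root_causes(down, parents):
    """Return the failed elements that have no failed ancestor.

    Only these would trigger a notification; failures that sit behind
    another failed element are treated as consequences and suppressed.
    """
    causes = set()
    for element in down:
        ancestor = parents.get(element)
        suppressed = False
        # Walk up the dependency chain looking for a failed ancestor.
        while ancestor is not None:
            if ancestor in down:
                suppressed = True
                break
            ancestor = parents.get(ancestor)
        if not suppressed:
            causes.add(element)
    return causes
```

With this rule, a failed link produces one notification rather than one per host and service behind it. Whether that is the behaviour you want is exactly the kind of thing a tool should let you configure.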
sysadmin1138
dsimms
  • 1
    This should probably be made community wiki. – Zoredache Jun 01 '09 at 17:25
  • See: http://serverfault.com/questions/44/what-tool-do-you-use-to-monitor-your-servers – Zoredache Jun 01 '09 at 17:25
  • From the ServerFault FAQ: "Avoid asking questions that are subjective, argumentative, or require extended discussion. This is not a discussion board, this is a place for questions that can be answered!" – sh-beta Jun 01 '09 at 19:16
  • True that it's subjective to a fair degree, but it's a question that needs to be asked. Can't we ask (and answer) big questions too? – dsimms Jun 01 '09 at 19:34
  • 1
    The big questions are interesting and necessary, but I don't think they're what Serverfault is intended for. Maybe I'm wrong, but if so I'd like to see the FAQ updated to reflect it. Personally, I'm tired of seeing dozens of questions about "the best X" when the answer is always, always, always "It depends." – sh-beta Jun 02 '09 at 13:31

7 Answers

6

Anything that relies on SNMP to monitor servers is a failure. There are fundamental issues with SNMP making it impossible to properly monitor a server. Furthermore, most SNMP agents suck. Net-SNMP sucks really bad.

Usually issues like this are ignored, as long as pretty graphs are produced. I've told development managers that the data they're looking at is useless, that we're only doing it to satisfy a mandate to produce pretty graphs, and they were OK with that, and continued to ask questions about the graph.

For example, it takes about 20 SNMP requests to get information about a single thread. On a system with a million threads that needs polling once per minute, that's 20 million requests per minute (plus their responses) just for monitoring! I realize a million threads is a lot and not everyone needs per-minute polling, but it's also not unreasonable and many people need more.
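The arithmetic behind that estimate, as a quick sketch. The figures of 20 requests per thread and one million threads come from the paragraph above; the assumption that every request is answered by exactly one response packet is mine.

```python
def snmp_requests_per_minute(threads, requests_per_thread, polls_per_minute=1):
    """Number of SNMP requests needed to poll every thread."""
    return threads * requests_per_thread * polls_per_minute


requests = snmp_requests_per_minute(threads=1_000_000, requests_per_thread=20)

# Assumption: each request gets exactly one response packet back.
packets = 2 * requests

# Average request rate the agent must sustain, rounded down.
requests_per_second = requests // 60
```

That works out to 20 million requests per minute, about 40 million packets on the wire under the one-response assumption, and over 300,000 requests every second.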

Usually the meaning of "free" memory is confused. I've seen this ignored because it allows for the purchase of extra memory - quite beneficial in a financial environment where a busy day could result in 3x normal memory usage and where management refuses to size for those peaks. Essentially the lies cancel out.

Often monitoring tools meant to monitor switches/routers will get per-CPU statistics via SNMP for a server, and report the data prominently. Many people don't want to hear that per-CPU statistics are not what they want and that per-thread statistics are.

Regardless of how the data is retrieved, many common problems require sub-minute or even sub-second polling to understand. Luckily, sar on Linux can sample data at 1-second intervals with no problem. It doesn't save all the data that iostat does, which can turn understanding a storage bottleneck into guesswork, so I save "iostat -x 1" output as well. For example, if a user complains about sub-second freezes (or a customer complains that transactions that normally take 10ms occasionally take 200ms), sub-second polling of all process/thread statistics is useful. Sadly, few kernels provide a reasonable mechanism to do this. (There's no legitimate reason why I can't pull this data down in a structured way in one system call, and I shouldn't have to deal with conversion of the data to decimal in the kernel, and from decimal in my application, along with other silly overhead.)
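To make the sub-second sampling idea concrete, here is a minimal sketch that assumes a Linux system exposing /proc/stat. It computes the busy CPU fraction between two closely spaced samples, and it also shows the decimal-text parsing complained about above. The function names and the 0.2-second default interval are illustrative choices, not part of any tool.

```python
import time


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into integer tick counts."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def busy_fraction(before, after):
    """Fraction of CPU time spent non-idle between two samples.

    Field order is user, nice, system, idle, iowait, irq, softirq, steal,
    guest, guest_nice. Guest time is already counted inside user time, so
    only the first eight fields are summed.
    """
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    idle = deltas[3] + deltas[4]  # idle + iowait
    return (total - idle) / total if total else 0.0


def sample(interval=0.2, path="/proc/stat"):
    """Take two samples `interval` seconds apart and return the busy fraction."""
    def read():
        with open(path) as handle:
            return parse_cpu_line(handle.readline())

    first = read()
    time.sleep(interval)
    return busy_fraction(first, read())
```

Calling `sample()` in a loop gives a crude sub-second CPU trace. Doing the same for every process or thread means reading and parsing one /proc file per thread per sample, which is where the overhead becomes painful.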

Failure to save disk performance stats in a reasonable manner is a common oversight.

Failure to have well-synchronized clocks is a common problem. The fact that NTP is always required is lost on many people. An improper NTP configuration can mean you don't know how well synchronized two clocks are. The fact that a serious business should spend the money on a GPS clock of its own is often missed. For companies involved in NASDAQ trading, I point to the regulations, write up an explanation for our customers about what time accuracy to expect (they frequently ask), and when asking for approval of this explanation, describe what setup we need to obey the regulations, obey our promises to our customers, and troubleshoot problems with vendors that rely on time synchronization.
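Knowing how synchronized two clocks are comes down to the standard NTP offset and delay calculation from four timestamps. A minimal sketch follows; the function name and the sample values are made up for illustration.

```python
def ntp_offset_and_delay(t0, t1, t2, t3):
    """Estimate clock offset and round-trip delay from one NTP exchange.

    t0: client clock when the request was sent
    t1: server clock when the request arrived
    t2: server clock when the reply was sent
    t3: client clock when the reply arrived
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2.0
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay


# Made-up timestamps in seconds: the server is about 100 ms ahead of the client.
offset, delay = ntp_offset_and_delay(10.000, 10.105, 10.106, 10.011)
```

The offset estimate assumes the network path is symmetric. In the worst case the true offset can be off by half the round-trip delay, which is why a nearby, low-latency time source matters if you have accuracy promises to keep.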

Delivery of alerts is a common problem. Basically you need to make sure that a person will respond to alerts, that a person is accountable for alerts they acknowledge, and that an alert will be re-sent via another pathway or to another person if it is not acknowledged. If people are receiving bogus alerts that prevent them from treating pages seriously, the monitoring system needs to receive attention.
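The escalation rule described above can be sketched as follows. This is an illustration only: `Alert`, `escalate`, and the `send` and `is_acknowledged` callbacks are hypothetical names, and a real system would wait for a timeout between steps rather than checking immediately.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Alert:
    message: str
    acknowledged_by: Optional[str] = None
    attempts: list = field(default_factory=list)


def escalate(alert, chain, send, is_acknowledged):
    """Walk the escalation chain until someone acknowledges the alert.

    chain: ordered list of (person, pathway) pairs, e.g. ("alice", "sms").
    send: callback that delivers the message over the given pathway.
    is_acknowledged: callback that reports whether that person acknowledged.
    Returns True if the alert was acknowledged, False if the chain ran out.
    """
    for person, pathway in chain:
        send(person, pathway, alert.message)
        alert.attempts.append((person, pathway))  # audit trail for accountability
        if is_acknowledged(person):
            alert.acknowledged_by = person
            return True
    return False
```

Recording every attempt and who finally acknowledged is what makes a person accountable for an alert. A `False` return value is itself something that needs attention, because it means nobody responded.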

Understanding the difference between trending and error alerting is important.

Reporting errors in syslog is important, as is having a mechanism to identify new types of errors even if it is not timely.
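One simple way to spot new types of errors is to reduce each log message to a signature and flag signatures that have not been seen before. A minimal sketch, assuming that replacing numbers, IP addresses and hex values with a placeholder is enough to group similar messages. The regular expression and function names are illustrative.

```python
import re

# Matches hex values such as 0x1f and decimal or dotted numbers such as 10.0.0.5.
_VARIABLE = re.compile(r"\b(?:0x[0-9a-fA-F]+|\d+(?:\.\d+)*)\b")


def signature(message):
    """Collapse the variable parts of a message so similar messages match."""
    return _VARIABLE.sub("#", message)


def new_error_types(messages, known):
    """Return signatures not seen before, adding them to the `known` set."""
    fresh = []
    for message in messages:
        sig = signature(message)
        if sig not in known:
            known.add(sig)
            fresh.append(sig)
    return fresh
```

Run once a day over the previous day's syslog with a persisted `known` set, this gives a short list of message types nobody has looked at yet. It is not timely, but it catches the errors that no alert rule was written for.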

I've touched on some really important stuff here. But nothing is so important as this - no matter what monitoring/trending/alerting solution you buy, it will have a significant cost to set up and customize for your environment. There is no solution available that makes the setup/maintenance cost significantly lower. A common failure is to keep purchasing new monitoring systems, leave them in a default setup, and allow them to be useless.

Promises from a vendor that they will help customize for free are useless unless you have them clearly in writing. Promises from a vendor that they will sell you expensive customization services are useless too - you can't trust that they will do so competently.

If you have critical custom in-house applications and your developers refuse to add instrumentation, logging, and other assistance for monitoring to their application, you have a problem: negligent developers who don't care about the operational aspects of their software. On the other hand, the developers need to be involved in a discussion about what aspects of their software to monitor, so a convenient method of exposing this can be designed. They may be under pressure to add features and not consider reliability or alerting of problems.

carlito
  • 5
    "Anything that relies on SNMP to monitor servers is a failure." That seems a ludicrous exaggeration. Many installations (mine included) use SNMP successfully. It certainly has problems, but putting it like this is plain wrong. – sleske Jun 10 '09 at 01:46
  • 3
    Yeah, I have to completely disagree with your statement that "Anything that relies on SNMP to monitor servers is a failure.", too. Yes, there are some definite limitations to SNMP and they need to be well understood, but that doesn't make it a failure. There are situations where SNMP is by far and away the best (if not only) tool that will get the job done. Especially when it comes to network devices and 'non-traditional' devices (UPS's, generators, printers, etc). – Christopher Cashell Jul 01 '09 at 20:39
  • Nobody should make a decision based on this answer; its main point about SNMP is plainly incorrect. Now I agree about Net-SNMP being a nasty implementation. Unfortunately it's the only practical GNU/Linux choice, but it's significantly better than avoiding SNMP. – J. M. Becker Apr 07 '13 at 21:05
5

Nagios used to be a smaller, lower-end system, but I'd say that the most recent versions have truly been "enterprise class". SNMP-based, open source, integrates with everything from Cacti to RRDTool. You'll need to spend time configuring and building custom reporting scripts, but to be honest that's the case for the commercial tools as well.

Traverse (was NetVigil) is a commercial tool that is bigger than "old Nagios" and on a par if not slightly better than current Nagios.

There are lots of mid-range monitoring systems.

At the high end you have HP OpenView, IBM Tivoli, CA Unicenter and many others. The price tag can run into millions of US dollars for licensing and for the implementation consulting, which is a requirement.

No matter where you are in the spectrum, monitoring software will require an investment in your time. It can easily be a full-time job for the care and feeding of a monitoring system in a larger shop.

tep
2

We've recently begun evaluating Zenoss with various Nagios plugins. It seems to be quite configurable. We had tried Nagios about a year ago & ran into configuration issues. Zenoss seemed a little easier to use.

We had also debated about "The dude" but wanted a *nix based server.

I also recently ran across an infoworld article detailing some open source monitoring tools that are quite valuable.

Pete
1

I used a product by Castle Rock called SNMPc - it's not the most polished of tools, but it does everything that you could want and won't break the bank.

It's basically an SNMP statistics collation tool that can baseline and warn if baselines are deviated from. It can be given thresholds for growth and decline warnings and works well with any SNMP-capable device.
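As an illustration of what warning on a baseline deviation can mean, here is a minimal sketch that flags a value lying more than a few standard deviations from the mean of recent samples. This is a generic approach and an assumption on my part, not a description of how SNMPc computes its baselines. The function name and the default of 3 sigmas are hypothetical.

```python
from statistics import mean, stdev


def deviates(history, value, sigmas=3.0):
    """Return True if `value` is far outside the baseline formed by `history`."""
    if len(history) < 2:
        return False  # not enough data to form a baseline yet
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return value != baseline
    return abs(value - baseline) > sigmas * spread


# Example: interface traffic in Mbit/s polled over the last five intervals.
recent = [100, 102, 98, 101, 99]
```

With the `recent` samples above, a reading of 103 is within the baseline while a reading of 110 is flagged. Growth and decline thresholds can be handled separately by checking the sign of `value - baseline`.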

Enabling SNMP in *nix is simple, as it is within Windows. Extensibility of SNMP is quite easy too (at least on *nix)

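SNMP is free - there are 3 versions, and the main difference between them is security. SNMPv1 sends its community string in plain text and is very 'insecure'. SNMPv2c improves the protocol but still relies on plain-text community strings. SNMPv3 adds per-user authentication and encryption. It can be a bit of a chore getting it to work the first time though.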

Because there are so many counters and statistics that you can pull, it can also take a while working out which ones are right for you - but once this is done, it's very straightforward.

You pay for front end collation and trigger on events to make SNMP useful. You can do it with open source software, but I wanted a modicum of commercial support.

Data can be polled from the devices (normal) and on critical systems, you can get the individual system to send a trap event notifying the trap manager that something went wrong, and they need to know now, and not wait for the next poll period.

Polling remote devices can be done by using a collection agent - same sort of thing as the console, but without all the reporting wizardry - that then pushes the stats at the central console periodically.

Of all the monitoring systems I have used, SNMP kept supplying what I was asked for, and within the budget I was given.

There is a product for Microsoft Servers called MOM 'Microsoft Operations Manager' where the 5 server workgroup version is (or at least was) free... but extending it to keep an eye on enterprise systems such as Exchange and SQL could cost a lot in licenses and connectors.

Beyond that - My experience is limited to SNMP, MOM, and Spotlight (by Quest) which was awesome and a bit too far beyond our budgetary range for all but the most critical of Oracle Databases.

Iain
1

I am a big fan of nagios and have set it up to monitor all my servers and many of the services that they run. It is one of those programs that I constantly tinker with especially as things that we do change quite frequently. I can even get it to check for certain text on our public websites.

Originally I had notifications set just as emails but have experimented with SMS alerts and more recently IM alerts.

I have used it for well over a year now but am still nowhere near having it perfected. One downside I have found is that historical details aren't stored well, but that may have more to do with the fact that I haven't found the right plugin.

Simon Foster
1

I can't recommend Nagios highly enough. Entirely free (assuming you don't count the hours required to negotiate its considerable learning curve), and extendable in just about every way - all the monitoring and alerting stuff has a modular architecture, so you can write your own plugins, pager scripts, whatever. If you're not that much of a coder, there's also a large and helpful community producing plugins of all kinds, although the bundled ones are pretty good - there are remote agents you can install on your servers, or interfaces to let you perform WMI/SNMP checks, or talk to Tivoli/OpenView server agents. There are also add-ons to extend the basic Nagios engine in useful ways, like logging performance data to MySQL or RRDTool. The configuration is a little complicated, but with enough bullying you can get it to monitor pretty much anything that's got a network connection. Plus, once you've set up all the host parent-child relationships and extended info (IP address, circuit ID, whatever), you can use it as network documentation.
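To give an idea of how small a custom plugin can be: a Nagios plugin is just a program that prints a status line and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). A minimal sketch of a disk usage check follows. The thresholds, function names and message wording are my own assumptions, not one of the bundled plugins.

```python
import shutil

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to a Nagios status code and status line."""
    if used_percent >= crit:
        return CRITICAL, f"DISK CRITICAL - {used_percent:.0f}% used"
    if used_percent >= warn:
        return WARNING, f"DISK WARNING - {used_percent:.0f}% used"
    return OK, f"DISK OK - {used_percent:.0f}% used"


def main(path="/"):
    """Check one filesystem, print the status line, return the exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        print(f"DISK UNKNOWN - {exc}")
        return UNKNOWN
    code, text = classify(100.0 * usage.used / usage.total)
    print(text)
    return code
```

A wrapper script would finish with `sys.exit(main())` so that Nagios sees the return value as the process exit code.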

RainyRat
0

I'm not actually that deep into monitoring systems but am always interested, so thanks for bringing this up; I will be following closely.

I previously used one from ManageEngine (http://www.manageengine.com/products/opmanager/index.html), and quite like it.

The monitoring system I prefer is one that is web-based, agentless, and SNMP-based.

kentchen