
What "health parameters" do you guys monitor on a web (or sql) server (Windows 2008)?

RAM, CPU, disk space, event log, specific web pages, network... more?

Do you have alarms that go off on all of these when something reaches a critical level, e.g. RAM usage over X%?

I (or, more accurately, the sysadmins) have access to WhatsUp Gold as a monitoring tool, but right now I think hardly any alarms are set up.

  • I have written a [long answer about "how to monitor a production server" here](http://serverfault.com/questions/71441/what-is-the-best-way-to-monitor-a-production-server/72731#72731). – Dirk Paessler Oct 10 '09 at 22:19

4 Answers


I just spent the past few months researching this exact question. My research was focused on Nginx, but the principles are the same and can apply to any web server (Windows or otherwise).

First, some theory: You want to monitor metrics across your system stack -- not just the web server application itself, but the process it runs within, the server it runs on, and the hosting provider the server lives in. You want to monitor:

  • Potential Bad Things (i.e. things that could go wrong - disk filling up, network getting saturated, etc.)
  • Actual Bad Things (i.e. things that did go wrong)
  • Good Things (specifically, when they stop happening -- e.g. visits to /checkout)

Second, what to monitor. I boiled it down to these 14 items. YMMV depending on specific installation / server software, but I think the principles will apply regardless:

  1. Requests per Second (activity volume)
  2. Response Time (performance)
  3. Active Connections (activity volume)
  4. Response Codes (2xx, 3xx, 5xx and their relative distribution)
  5. Process File Handles (this is Nginx-specific and relates to the number of maximum workers and possible connections)
  6. Process State (is the server application alive?)
  7. Server State (is the server itself alive?)
  8. Server Load Average (is the server healthy?)
  9. Server Network Usage (is there enough bandwidth?)
  10. Disk Space (Room for logs / cache)
  11. Hosting Provider Status (AWS going down == Your Server Going Down)
  12. DNS Expiration (DNS expiring = Your Server Going Down)
  13. SSL Certificate Expiration (Certificate Expiring = Your Server Going Down)
  14. User Activity (key pages - are they being viewed and returning 200 OK ?)
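As a rough illustration of item 4, the relative distribution of response codes can be computed from a list of status codes pulled out of an access log. This is only a sketch: the function names and the 5% limit on 5xx responses are assumptions made for the example, not values from the guide.

```python
from collections import Counter


def status_class_share(status_codes):
    """Return the share of each status class (e.g. '2xx') among the codes."""
    counts = Counter(f"{code // 100}xx" for code in status_codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cls: n / total for cls, n in counts.items()}


def too_many_errors(status_codes, max_5xx_share=0.05):
    """True when 5xx responses exceed the (assumed) acceptable share."""
    return status_class_share(status_codes).get("5xx", 0.0) > max_5xx_share


if __name__ == "__main__":
    recent = [200, 200, 200, 301, 500]
    print(status_class_share(recent))
    print(too_many_errors(recent))
```

Watching the share rather than the raw count keeps the check meaningful when traffic volume changes.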

Full details are here if curious:

[Disclosure: I'm affiliated with Scalyr, the company that hosts the linked-to guide and for whom I wrote the guide]

nlh

It really depends on what the server is doing. For example, I know my Exchange 2007 servers will use a lot of memory. That's what Exchange does: it grabs as much as it can, so monitoring these servers for high RAM use would keep me awake all night. However, I do want to know if disk space is getting low on them, as Exchange is prone to stop working when disk space runs low. On the other hand, I'm not really that concerned about the disk usage on my print server.

Really, you need to look at your servers and determine what you need to know about them: what's important to them running correctly, what's nice to know for historical or tracking purposes, and what is superfluous. Once you've determined what's critical, you really should have alarms or triggers set up for those events. What's the point in monitoring something if you don't know when it goes wrong?
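One way to picture this is a per-role table of alarm limits, where a metric that isn't critical for a role simply has no entry. The sketch below is hypothetical throughout: the role names, metric names and percentages are made up to mirror the Exchange and print server examples above.

```python
# Hypothetical roles and limits, chosen only to mirror the examples in the text.
THRESHOLDS = {
    # No RAM alarm for Exchange: high memory use is normal there.
    "exchange": {"disk_used_pct": 85},
    "web": {"disk_used_pct": 90, "ram_used_pct": 95},
    # Nothing on the print server is treated as critical.
    "print": {},
}


def alarms(role, metrics):
    """Return the names of metrics that exceed the limits defined for this role."""
    limits = THRESHOLDS.get(role, {})
    return sorted(
        name for name, limit in limits.items() if metrics.get(name, 0) > limit
    )


if __name__ == "__main__":
    print(alarms("exchange", {"ram_used_pct": 99, "disk_used_pct": 92}))
    print(alarms("print", {"disk_used_pct": 99}))
```

A metric without a limit is still free to be recorded for history; it just never wakes anyone up.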

Sam Cogan

The idea of monitoring is to compare with a baseline. It's meaningless to know that your disk usage is 90% and your bandwidth is 10 GB/day if you don't know whether that's normal or not.

Basically, take everything you can get cheaply (all the raw data should be fairly cheap) and record a baseline, which will help you detect anomalies. Anomalies include things like programs going wrong and eating all the disk space, memory leaks increasing memory use, the number of processes doubling while the number of logged-in users stays the same, etc.
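A minimal sketch of comparing a reading against a baseline is shown here. It assumes the baseline is a list of earlier samples and that "anomalous" means more than a few standard deviations from their mean; that rule and the tolerance of 3 are assumptions for illustration, not something prescribed above.

```python
from statistics import mean, stdev


def is_anomalous(baseline, value, tolerance=3.0):
    """True when value lies more than `tolerance` standard deviations from the baseline mean."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is unusual.
        return value != mu
    return abs(value - mu) > tolerance * sigma


if __name__ == "__main__":
    disk_used_pct = [88, 90, 91, 89, 92]  # hypothetical history for one server
    print(is_anomalous(disk_used_pct, 90))
    print(is_anomalous(disk_used_pct, 99))
```

With this history, 90% disk usage is ordinary for the server, which is exactly why the number means little on its own.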

What matters is what you can glean from the raw data, and how often you record samples of it. If your disk space grows very slowly, then disk space sampling doesn't need to be done every five minutes.

Lee B

I monitor CPU, disk space, CPU queuing, ping (checking that the machine is up), and that the IIS service is running. I also call an ASPX page to ensure that .NET is happy and processing, and I log into the app, passing in a username and password as a user would, to ensure that the page loads and doesn't throw a 500 or time out.
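A page check like that can be sketched in a few lines of Python's standard library. This is an illustration rather than what I actually run: the function names, the 5 second limit and the example URL are assumptions, and it only fetches a page without logging in.

```python
import time
import urllib.error
import urllib.request


def is_healthy(status, elapsed_s, max_seconds=5.0):
    """A page passes if it answered, is not an error status, and was fast enough."""
    return status is not None and status < 400 and elapsed_s <= max_seconds


def check_page(url, timeout=10.0):
    """Fetch url once and return (status_or_None, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, but with 4xx/5xx
    except OSError:
        status = None  # connection failure or timeout
    return status, time.monotonic() - start


if __name__ == "__main__":
    # Hypothetical URL; replace with a page that exercises the application.
    status, elapsed = check_page("http://localhost/health.aspx")
    print(status, round(elapsed, 3), is_healthy(status, elapsed))
```

Keeping the pass/fail decision in its own function makes it easy to change the limits without touching the fetch.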

I usually don't monitor memory used, as that's usually at or near 100%. IIS does a decent job of keeping the memory ok, and IIS restarts the application pool every day or so by default which would clean up anything residual.

I tend not to monitor disk IO as it can be all over the place. On some systems (SQL, Exchange, etc.) I'll track the disk queue for each drive, but with a very high threshold. The systems will spike, so I just want to know if they go bat shit.
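One way to ignore ordinary spikes is to alert only when the value stays above the threshold for several samples in a row. That "sustained" rule is an assumption added for this sketch, as are the function name and the numbers; the only thing taken from the text above is the very high threshold.

```python
def sustained_breach(samples, threshold, min_consecutive=3):
    """True if at least `min_consecutive` samples in a row exceed `threshold`."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


if __name__ == "__main__":
    disk_queue = [1, 50, 2, 60, 1]  # hypothetical samples with isolated spikes
    print(sustained_breach(disk_queue, threshold=20))
    disk_queue = [1, 50, 60, 70, 2]  # the queue stays high for three samples
    print(sustained_breach(disk_queue, threshold=20))
```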

mrdenny