Making sense of Ubuntu Server Status Info

Question

I'm trying to create a simple shell script to monitor my server. I plan to set up a cron job to run it every 5 or 10 minutes.

Here's how it will work:

1. Run a number of Linux commands (e.g. iostat, mpstat, top) and write the results to a text file.
2. Send the text file via curl to a URL that will receive it, process the data, and post the important metrics to a database.
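The two steps above can be sketched roughly as follows. This is only a minimal sketch in Python rather than shell, and it makes several assumptions that are not part of the plan: it reads /proc/loadavg and /proc/meminfo directly instead of parsing iostat or mpstat output, it sends JSON instead of a text file, and the function names and the endpoint URL in the usage note are hypothetical.

```python
import json
import urllib.request


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return {"load1": float(one), "load5": float(five), "load15": float(fifteen)}


def parse_meminfo(text):
    """Return memory and swap usage percentages from /proc/meminfo text."""
    kb = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            kb[key.strip()] = int(fields[0])
    # MemAvailable only exists on kernel 3.14 and later; fall back to MemFree.
    available = kb.get("MemAvailable", kb["MemFree"])
    mem_used = 100.0 * (kb["MemTotal"] - available) / kb["MemTotal"]
    swap_total = kb.get("SwapTotal", 0)
    swap_free = kb.get("SwapFree", 0)
    swap_used = 100.0 * (swap_total - swap_free) / swap_total if swap_total else 0.0
    return {"mem_used_pct": round(mem_used, 1), "swap_used_pct": round(swap_used, 1)}


def collect():
    """Gather the metrics from the local machine (Linux only)."""
    metrics = {}
    with open("/proc/loadavg") as f:
        metrics.update(parse_loadavg(f.read()))
    with open("/proc/meminfo") as f:
        metrics.update(parse_meminfo(f.read()))
    return metrics


def send(metrics, url):
    """POST the metrics as JSON to the receiving URL and return the HTTP status."""
    body = json.dumps(metrics).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A cron job could then call something like `send(collect(), "https://monitor.example.com/ingest")`, where the URL is a placeholder for whatever endpoint receives the data.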

I plan to use this data to determine when I need to have my server upgraded.

However, I don't have much experience with server monitoring so I don't know what kind of thresholds I should be looking out for. For example, when I run something like mpstat -P ALL, what kind of figures should bother me? Or iostat?

I just want a point of reference to know when my servers are in a good state (i.e. under reasonable load) or in a bad state (i.e. overloaded and in need of an upgrade or load balancing).

Thanks in advance.

It kinda seems like you are asking for help to reinvent the wheel here. There are loads of monitoring systems. Search around the site. — Zoredache, Sep 20 '11 at 00:49
@Zoredache: I'm not trying to reinvent the wheel, I just need some perspective. I just want to be able to make sense of data provided by the simple tools already available on Ubuntu Linux like top, iostat, mpstat etc. Plus, I'm not looking for a ready-made graphical tool because I don't need all that visibility, I'm just looking at thresholds i.e. good or bad. — Obi Hill, Sep 20 '11 at 02:09
I agree with Zoredache, you are reinventing the wheel. With a combination of munin and nagios you can collect, visualise, and alert on any number of performance and health metrics. — ThatGraemeGuy, Sep 20 '11 at 08:20
@GraemeDonaldson: I don't dispute that there are tools out there for server monitoring. However, as I said in my question, I don't have a point of reference to know what the data is saying, i.e. is 80% memory usage bad or OK, is 20% swap usage bad or OK, etc.? That's my issue: I'm trying to find some good information and perspectives on thresholds so that I know just when to set up alerts. Are there any specific metrics you work with? — Obi Hill, Sep 21 '11 at 14:26

score 2 · Accepted Answer · answered Sep 20 '11 at 03:11

I would say that the free Monit would be a more appropriate tool for testing the thresholds you're looking for and giving a simple at-a-glance view of your system's health.

Out of the box, you can set up some basic checks. The syntax is very human-readable, so a barebones setup that checks system load, memory usage, swap utilization, CPU usage, and disk space for various mountpoints, and can send an email alert, would look like this:

# System-wide tests must sit under a "check system" entry;
# "localhost" is just a label for this host.
check system localhost
    if loadavg (1min) > 6 then alert
    if loadavg (5min) > 5 then alert
    if memory usage > 90% then alert
    if swap usage > 20% then alert
    if cpu usage (user) > 90% then alert
    if cpu usage (system) > 75% then alert
    if cpu usage (wait) > 75% then alert

# One entry per mountpoint to watch for disk space.
check device root with path /
    if SPACE usage > 80% then alert

check device var with path /var
    if SPACE usage > 80% then alert

check device usr with path /usr
    if SPACE usage > 80% then alert

check device tmp with path /tmp
    if SPACE usage > 80% then alert
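If you still want to do the good/bad check in your own script rather than in Monit, the same limits can be applied to collected numbers. The following is a minimal Python sketch, not anything Monit provides: the metric names and the `breaches` function are made up for illustration, and the limits are simply the ones from the rules above, so treat them as starting points.

```python
# Limits copied from the Monit rules above; starting points, not universal values.
# The metric names are hypothetical and must match whatever your collector emits.
THRESHOLDS = {
    "load1": 6.0,
    "load5": 5.0,
    "mem_used_pct": 90.0,
    "swap_used_pct": 20.0,
    "cpu_user_pct": 90.0,
    "cpu_system_pct": 75.0,
    "cpu_wait_pct": 75.0,
    "disk_used_pct": 80.0,
}


def breaches(metrics, thresholds=THRESHOLDS):
    """Return the sorted names of metrics that are strictly above their limit.

    Metrics missing from the input are skipped rather than treated as errors.
    """
    return sorted(
        name
        for name, limit in thresholds.items()
        if name in metrics and metrics[name] > limit
    )
```

An empty result would count as "good", and anything else names the metrics worth looking at. The comparison is strict, like the `>` in the Monit rules, so a value exactly at the limit does not alert.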

In addition, I know you're saying that you don't require any graphical tools, but it may make sense to have something that can track trends. Munin is a good tool for this. There are plenty of others, but it's worth considering.

Thanks a lot. However, I have no idea what thresholds are safe and unsafe i.e. should it be 80% of memory usage or 90%?! That's my main problem, I don't really know what thresholds to work with. — Obi Hill, Sep 21 '11 at 14:21
What I posted is a good start. Adjust accordingly if the thresholds are triggered too often. Thresholds are unique to environments and applications. There are some situations where it's perfectly healthy for a server to run at a load of 10... There are others where that's a sign of a major resource problem. This is going to be a very individual thing. — ewwhite, Sep 21 '11 at 14:37
OK, thanks. I know it's an individual thing. I'm just looking for some basic info. For example, with what you had above, `if swap usage > 20% then alert`, how did you arrive at that? I don't know how to read the data. All told, it looks like Monit can do what I need. I'm sure I can put together a shell script to send the data for each server using curl. I just need to figure out how to read meaning into all the data provided, as I'm not a very seasoned system administrator. — Obi Hill, Sep 21 '11 at 14:52
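
On the point that a load of 10 can be healthy on one server and a problem on another: one common way to read load average is relative to the number of CPU cores, since a load of 10 means something very different on 16 cores than on 2. A minimal Python sketch of that idea follows. The function names are made up for illustration, and what per-core value counts as "too high" is still something to tune for your own workload.

```python
import os


def load_per_core(load, cores):
    """Return the load average divided by the CPU core count."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load / cores


def current_load_per_core():
    """Read this machine's 1-minute load average and core count (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return load_per_core(one_minute, os.cpu_count() or 1)
```

For example, a load of 10 on 16 cores gives 0.625 per core, while the same load on 2 cores gives 5.0 per core.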

score 1 · Answer 2 · answered Sep 20 '11 at 06:45

Obi Hill: Well, you ARE reinventing the wheel. Gathering all that data, parsing it, and analysing it is a solved problem; you should not rewrite it Yet Another Time.

SNMP is one very handy way of gathering system information for further processing (for example, graphing trends with MRTG or passing the data to Nagios or similar monitoring program).

Also, programs like Cacti or Munin can do all this for you.

score 0 · Answer 3 · answered Sep 20 '11 at 03:00

How many servers do you have?

Perhaps you should take a look at Puppet, Rundeck or ControlTier.

A T
