How can I make Status Information for Nagios services easier to read?

Question

I'm running Nagios in an environment with several servers, each hosting several services. There are a few custom checks, but I prefer to use existing checks where possible. I'm running the check_disk plugin through NRPE to check the utilization of each mounted file system:

command[check_all_disks]=/usr/lib/nagios/plugins/check_disk -w 10% -c 5% -p / -p /var -C -u GB -w 200 -c 100 -r '^/mounts/[^/]+$'

It's handy to have these all checked as a single service ("Disks"), but when one of these goes to warning mode, it's hard to read the output in the Status Information line:

DISK WARNING - free space: / 6 GB (9% inode=92%): /var 125 GB (67% inode=99%): /mounts/vol0 1152 GB (16% inode=99%): /mounts/vol1 1096 GB (15% inode=99%): /mounts/vol2 126 GB (1% inode=99%): /mounts/vol3 228 GB (3% inode=99%): /mounts/vol4 3245 GB (44% inode=99%): /mounts/vol5 108 GB (1% inode=99%):

In the above case, the check is in warning state because /, /mounts/vol2, and /mounts/vol5 are below threshold. An operator has to wade through every value to find the ones that crossed their thresholds. Also, if one is critical and the others are warning, it would be nice to show them differently, either by marking them or by putting them on different lines.

Is there a straightforward way to do this, without creating a new command for every mount point? Or am I missing some other fundamental method of Nagios magic to make this friendly?

Unhelpful comment: this is really more an issue with the check_disk command, not nagios. You could try tweaking/re-writing the check_disk plugin? — , Oct 10 '11 at 04:24

Stefan Lasiewski · Accepted Answer · 2011-10-10T21:21:56.003

Try the --errors-only flag which should greatly reduce the amount of text spit out by this plugin.

 -e, --errors-only
 Display only devices/mountpoints with errors

This seems to do the trick for me. Note the drastic difference in the output:

# /usr/lib64/nagios/plugins/check_disk -w 20% -c 10% 
DISK WARNING - free space: / 37167 MB (96% inode=98%); /dev/shm 244 MB (100% inode=99%); /boot 84 MB (18% inode=99%); /home 21253 MB (99% inode=99%);

But with the --errors-only flag, it's now clear that my problem is with /boot:

# /usr/lib64/nagios/plugins/check_disk -w 20% -c 10% --errors-only
DISK WARNING - free space: /boot 94 MB (20% inode=99%);

If there are no problems on the system, the output is very short:

# /usr/lib64/nagios/plugins/check_disk -w 20% -c 10% --errors-only
DISK OK

(Note: I have removed everything after the first | for clarity. The Nagios web interface also trims this output before it is displayed on the screen.)
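The split described in the note can be sketched in shell. This is a minimal illustration, assuming the plugin output is already held in a variable; `status_text` and `perf_data` are hypothetical helper names, not part of Nagios or check_disk.

```shell
#!/bin/sh
# Hypothetical helpers: split plugin output at the first "|".

# Print the human-readable status text (everything before the first "|").
status_text() {
    printf '%s\n' "${1%%|*}"
}

# Print the performance data (everything after the first "|"), or an
# empty line when the output carries no performance data.
perf_data() {
    case "$1" in
        *'|'*) printf '%s\n' "${1#*|}" ;;
        *)     printf '\n' ;;
    esac
}
```

For example, `status_text "$(/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% --errors-only)"` should print only the part an operator reads, while `perf_data` keeps the rest available for graphing.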

Also see this discussion on the Debian bug tracker: nagios2: complains about disk space in an uncomprehensible way.

This works better than what I had in mind. The great thing is perfdata is still all logged, which is great, because one of my next tasks is to have some sort of RRD display of that data. — Paul, Oct 16 '11 at 20:37

score 3 · Answer 2 · edited Oct 10 '11 at 07:02

The standard way is to have everything on one line. You only have two options:

1. Define a check for each disk (I know this is not what you want, but I still find it the best solution).
2. Write your own plugin, or a wrapper around check_disk that parses the output. You can then, for example, put the disks below the threshold in the status line, or shorten the output to include only the relevant disks.
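If the per-disk route is chosen, the definitions do not have to be typed by hand. The sketch below prints one NRPE command line per mount point. The function name `nrpe_disk_commands`, the `check_disk_<name>` naming scheme, and the shared 10%/5% thresholds are all assumptions made for illustration; the plugin path is the one from the question.

```shell
#!/bin/sh
# Hypothetical helper: print one NRPE command definition per mount point.
# Command names and thresholds are illustrative assumptions.
nrpe_disk_commands() {
    plugin=/usr/lib/nagios/plugins/check_disk
    for mount in "$@"; do
        # "/" becomes "root"; other paths drop the leading "/" and turn
        # the remaining slashes into underscores
        if [ "$mount" = "/" ]; then
            name=root
        else
            name=$(printf '%s' "${mount#/}" | tr '/' '_')
        fi
        printf 'command[check_disk_%s]=%s -w 10%% -c 5%% -p %s\n' \
            "$name" "$plugin" "$mount"
    done
}
```

Calling `nrpe_disk_commands / /var /mounts/vol0` prints three `command[...]` lines that could be pasted into the NRPE configuration; each one still needs a matching service definition on the Nagios server.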

You can write the wrapper in any language but given the task I would suggest a scripting language (e.g., Perl). There are guidelines on how to develop plugins: http://nagiosplug.sourceforge.net/developer-guidelines.html
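A wrapper of this kind can also be sketched in plain shell and awk. The function below is a rough illustration, not a finished plugin: `reformat_disk_status` is a hypothetical name, it assumes the raw plugin output uses `;` between mount points, and it re-applies the thresholds as simple free-space percentages, which will not match a check that mixes percentage and size thresholds. It relies on the usual plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL).

```shell
#!/bin/sh
# Hypothetical wrapper sketch: reformat_disk_status WARN_PCT CRIT_PCT
# Reads one line of check_disk output on stdin. Prints a summary line,
# then one line per mount point whose free percentage is below a
# threshold. Assumption: thresholds are plain free-space percentages.
reformat_disk_status() {
    awk -v warn="$1" -v crit="$2" '
    {
        sub(/\|.*/, "")          # drop performance data
        sub(/^[^:]*: /, "")      # drop the "DISK ... - free space: " prefix
        n = split($0, entry, /; */)
        for (i = 1; i <= n; i++) {
            # the free percentage is the number right after "("
            if (!match(entry[i], /\([0-9]+%/)) continue
            pct = substr(entry[i], RSTART + 1, RLENGTH - 2) + 0
            if (pct < crit)      { ncrit++; lines = lines "CRITICAL: " entry[i] "\n" }
            else if (pct < warn) { nwarn++; lines = lines "WARNING: " entry[i] "\n" }
        }
    }
    END {
        state = ncrit ? "CRITICAL" : (nwarn ? "WARNING" : "OK")
        printf "DISK %s - %d critical, %d warning\n%s", state, ncrit, nwarn, lines
        exit (ncrit ? 2 : (nwarn ? 1 : 0))
    }'
}
```

It would be used as `check_disk -w 20% -c 10% | reformat_disk_status 20 10`. Nagios 3 and later show the first line as Status Information and keep the remaining lines as long output on the service detail page, so the summary stays short while the per-disk lines remain available.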

score 2 · Answer 3 · answered Oct 10 '11 at 11:13

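As @Matteo mentioned, I also think you should define a check for each partition. But here's an example of a wrapper that sorts the disks by free percentage, lowest first (that is, highest usage first):

# strip perfdata and the status prefix, put one mount point per line,
# then sort numerically on the percentage that follows "("
check_disk -w 20% -c 10% -p /dev/sda1 -p /dev/sdb2 -p /dev/sdb4 |
    awk -F"|" '{ print $1 }' | awk -F": " '{ print $2 }' |
        tr ";" "\n" | sed -e 's/^ //' -e '/^$/d' | sort -t'(' -k2,2n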

PS: My check_disk plugin returns a list separated by ; instead of : as you showed.

answered Oct 10 '11 at 11:13 · quanta

Mine returns `;` as well, but Nagios presents it as `:` on the web page, which is where I got my example text from above. – Paul Oct 16 '11 at 21:41

score 1 · Answer 4 · answered Oct 16 '11 at 16:20

You might consider check_multi. It combines a single status line with the ability to drill into the details, because each disk is actually checked independently. You can see from some of the screenshots how it would work for you. For disk checks, you would have one check_multi check that displays "1 warning, 2 OK". When you click on that service, you would see 3 separate checks: the disk in warning with its own details, and the other 2 shown clearly as well.
