Can someone explain the "use-cases" for the default munin graphs?

Question

When installing munin, it activates a default set of plugins (at least on ubuntu). Alternatively, you can simply run munin-node-configure to figure out which plugins are supported on your system. Most of these plugins plot straight-forward data. My question is not to explain the nature of the data (well... maybe for some) but what is it that you look for in these graphs?

It is easy to install munin and see fancy graphs. But having the graphs and not being able to "read" them renders them totally useless.

I am going to list the standard plugins that are enabled by default on my system, so it's going to be a long list. For completeness, I am also going to list the plugins I think I understand, with a short explanation of what I think each one is used for. Please correct me if I am wrong about any of them.

So let me split this question into three parts:

Plugins where I don't even understand the data
Plugins where I understand the data but don't know what I should look out for
Plugins which I think I understand

Plugins where I don't even understand the data

These may contain questions that are not necessarily aimed at munin alone. Not understanding the data usually means a gap in fundamental knowledge of operating systems/hardware.... ;) Feel free to respond with a "giyf" answer.

These are plugins where I can only guess what's going on... and I'd rather not read these graphs by guessing...

Disk IOs per device (IOs/second)
What's an IO? I know it stands for input/output, but that's as far as it goes.
Disk latency per device (Average IO wait)
Not a clue what an "IO wait" is...
IO Service Time
This one is a huge mess, and it's near impossible to see something in the graph at all.

Plugins where I understand the data but don't know what I should look out for

IOStat (blocks/second read/written)
I assume, the thing to look out for in here are spikes? Which would mean that the device is in heavy use?
Available entropy (bytes)
I assume that this is important for random number generation? Why would I graph this? So far the value has always been near constant.
VMStat (running/I/O sleep processes)
What's the difference between this one and the "processes" graph? Both show running/sleeping processes, whereas the "Processes" graph seems to have more details.
Disk throughput per device (bytes/second read/written)
What's the difference between this one and the "IOStat" graph?
inode table usage
What should I look for in this graph?

Plugins which I think I understand

I'll be guessing some things here... correct me if I am wrong.

Disk usage in percent (percent)
How much disk space is used/remaining. As this approaches 100%, you should consider cleaning up or extending the partition. This is extremely important for the root partition.
Firewall Throughput (packets/second)
The number of packets passing through the firewall. If this is spiking for a longer period of time, it could be a sign of a DOS attack (or we are simply receiving a large file). It can also give you an idea about your firewall performance. If it's levelling out and you need more "power", you should consider load balancing. If it's levelling out and you see a correlation with your CPU load, it could also mean that your hardware is not fast enough. Correlations with disk usage could point to excessive LOG targets in your FW config.
eth0 errors (packets in/out)
Network errors. If this value is increasing, it could be a sign of faulty hardware.
eth0 traffic (bits/second in/out)
Raw network traffic. This should correlate with Firewall throughput.
number of threads
An ever-increasing value might point to a process not properly closing threads. Investigate!
processes
Breakdown of active processes (including sleeping). A quick spike in here might point to a fork-bomb. A slowly but steadily increasing value might point to an application spawning sub-processes but not properly closing them. Investigate using `ps faux`.
process priority
This shows the distribution of process priorities. Having only high-priority processes is not of much use. Consider de-prioritizing some.
cpu usage
Fairly straight-forward. If this is spiking, you may have an attack going on, or a process is hogging the CPU. If it's slowly increasing and approaching max in normal operations, you should consider upgrading your hardware (or load-balancing).
file table usage
Number of actively open files. If this is reaching max, you may have a process opening, but not properly releasing files.
load average
Shows a summarized value for the system load. Should correlate with CPU usage. Increasing values can come from a number of sources. Look for correlations with other graphs.
memory usage
A graphical representation of your memory. As long as you have a lot of unused+cache+buffers you are fine.
swap in/out
Shows the activity on your swap partition. This should always be 0. If you see activity on this, you should add more memory to your machine!

Great question, easily applicable to Cacti and other graphing apps. The graphs often look great, but it is pretty hard to figure out what they mean, and even harder to tell what something that needs further attention looks like. — dunxd, Nov 30 '11 at 10:06
For the "Why would I graph this? So far the value has always been near constant." part, remember that most information is usually only valuable in case of issues. — Steve Schnepp, Nov 30 '11 at 16:01

score 11 · Accepted Answer · answered Nov 30 '11 at 09:42

Disk IOs per device (IOs/second)

With traditional hard drives this is a very important number. An I/O operation is a read or write operation to disk. With rotational spindles you can get anywhere from a few dozen to perhaps 200 IOPS, depending on the disk speed and its usage pattern.

That is not all there is to it: modern operating systems have I/O schedulers which try to merge several I/O requests into one and make things faster that way. RAID controllers and the like also perform some smart I/O request reordering.
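To see where a figure like "dozens to perhaps 200" comes from, here is a rough back-of-the-envelope sketch. It assumes that each random I/O on a rotational disk costs one average seek plus half a revolution of rotational latency; the seek times below are illustrative values I picked, not measurements, and the function name is hypothetical.

```python
def estimate_iops(rpm: float, avg_seek_ms: float) -> float:
    """Rough random-I/O ceiling for a single rotational disk."""
    # On average the platter has to turn half a revolution
    # before the requested sector passes under the head.
    rotational_latency_ms = 0.5 * 60_000.0 / rpm
    service_time_ms = avg_seek_ms + rotational_latency_ms
    return 1000.0 / service_time_ms


# Illustrative (assumed) seek times for a desktop and a fast server disk.
for rpm, seek_ms in [(7200, 8.5), (15000, 3.5)]:
    print(f"{rpm} rpm, {seek_ms} ms seek: ~{estimate_iops(rpm, seek_ms):.0f} IOPS")
```

Request merging and reordering by the I/O scheduler or a RAID controller can push real numbers above this simple estimate, which is why it should only be read as an order of magnitude.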

Disk latency per device (Average IO wait)

How long it took from issuing an I/O request to an individual disk until the data was actually received. If this hovers around a couple of milliseconds, you are OK. If it's dozens of ms, your disk subsystem is starting to sweat. If it's hundreds of ms or more, you are in big trouble, or at least have a very, very slow system.
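The rule of thumb above could be turned into a small helper along these lines. The exact cutoffs (10 ms and 100 ms) are my own assumptions standing in for "a couple", "dozens" and "hundreds"; adjust them to your hardware.

```python
def classify_latency(avg_wait_ms: float) -> str:
    """Map an average I/O wait in milliseconds to a rough health label."""
    # Thresholds are assumed cutoffs, not values defined by munin.
    if avg_wait_ms < 10:
        return "ok"
    if avg_wait_ms < 100:
        return "sweating"
    return "trouble"


for sample_ms in (2.5, 40, 350):
    print(f"{sample_ms} ms -> {classify_latency(sample_ms)}")
```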

IO Service Time

How your disk subsystem (possibly containing lots of disks) is performing overall.

IOStat (blocks/second read/written)

How many disk blocks were read/written per second. Look for spikes and also at the average. If the average starts to approach the maximum throughput of your disk subsystem, it's time to plan for a performance upgrade. Actually, plan well before that point.

Available entropy (bytes)

Some applications want "true" random data. The kernel gathers that "true" randomness from several sources, such as keyboard and mouse activity, a random number generator found in many motherboards, or even from video/music files (video-entropyd and audio-entropyd can do that).

If your system runs out of entropy, the applications wanting that data stall until they get it. Personally, I have seen this happen in the past with the Cyrus IMAP daemon and its POP3 service; it generated a long random string before each login, and on a busy server that consumed the entropy pool very quickly.

One way to get rid of that problem is to switch the applications to use only semi-random data (/dev/urandom), but that is beyond the scope of this topic.
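If you want to check the pool by hand rather than through the graph, a minimal sketch could look like this. It assumes a Linux system that exposes the counter in /proc/sys/kernel/random/entropy_avail; the warning threshold of 200 is an arbitrary value chosen for illustration.

```python
from pathlib import Path

# Linux-specific location of the kernel's entropy counter.
ENTROPY_PATH = Path("/proc/sys/kernel/random/entropy_avail")


def parse_entropy(text: str) -> int:
    """The file holds a single integer followed by a newline."""
    return int(text.strip())


def entropy_is_low(available: int, threshold: int = 200) -> bool:
    """threshold is an assumed cutoff, not a kernel-defined limit."""
    return available < threshold


if ENTROPY_PATH.exists():
    available = parse_entropy(ENTROPY_PATH.read_text())
    state = "low" if entropy_is_low(available) else "fine"
    print(f"entropy available: {available} ({state})")
```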

VMStat (running/I/O sleep processes)

I have not thought about this one before, but I would think that it tells you about per-process I/O state: mainly whether processes are doing some I/O or not, and whether that I/O is blocking or not.

Disk throughput per device (bytes/second read/written)

This is purely bytes read/written per second, which is usually a more human-readable form than blocks, since the block size may vary. It can differ because of the disks used, the file system (and its settings), and so on. Sometimes the block size is 512 bytes, other times 4096 bytes, sometimes something else.
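As a small illustration of why the two graphs can look different for the same workload, here is the conversion for the two block sizes mentioned above. The rate of 2000 blocks/second is a made-up sample value.

```python
def blocks_to_bytes_per_second(blocks_per_second: float, block_size: int) -> float:
    """Convert a block rate to a byte rate for a given block size in bytes."""
    return blocks_per_second * block_size


rate = 2000  # hypothetical reading taken from a blocks/second graph
for block_size in (512, 4096):
    mib = blocks_to_bytes_per_second(rate, block_size) / (1024 * 1024)
    print(f"{rate} blocks/s at {block_size} B per block = {mib:.2f} MiB/s")
```

The same block rate corresponds to an eight times higher byte rate with 4096-byte blocks, so the bytes/second graph is the safer one to compare across machines.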

inode table usage

With file systems having dynamic inodes (such as XFS), nothing. With file systems having static inode maps (such as ext3), everything. If you have a combination of static inodes, a huge file system, and a huge number of directories and small files, you might run into a situation where you cannot create more files on that partition, even though in theory there is lots of free space left. No free inodes == bad.
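A per-file-system check of inode usage can be sketched with Python's os.statvfs, which works on Unix-like systems. The function name is mine, and the result may differ slightly from what other tools report because of reserved inodes.

```python
import os
from typing import Optional


def inode_usage_percent(path: str = "/") -> Optional[float]:
    """Return the percentage of inodes in use on the file system holding path."""
    st = os.statvfs(path)
    if st.f_files == 0:
        # Some file systems do not report a fixed inode count.
        return None
    used = st.f_files - st.f_ffree
    return 100.0 * used / st.f_files


usage = inode_usage_percent("/")
if usage is None:
    print("file system does not report inode totals")
else:
    print(f"inode usage on /: {usage:.1f}%")
```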

Regarding the inode usage: I am currently using ext4, and the max-inodes and open-inodes values in that graph are extremely close (open: 31.11k, table size: 32.12k), which would leave me with around 1k inodes remaining. As the system is freshly installed, I don't believe this points to a problem. Is ext4 dynamically allocating inodes? I haven't found anything about that on google... — exhuma, Dec 01 '11 at 07:50
See `df -i`, it reports your current inode usage. ext4 has fixed inodes, for example my Fedora 16 reports for my root partition `rootfs 3276800 238083 3038717 8% /` — Janne Pikkarainen, Dec 01 '11 at 07:57
Hmmm... interesting. This suggests that the munin graph is not correct. I also just now realised that the munin graph shows only one value. Should it not show one value per file-system to be helpful? See also the `df -i` screenshot (http://i44.tinypic.com/oixkiq.png) vs the munin-graph (http://i39.tinypic.com/dxl64z.png) — exhuma, Dec 02 '11 at 08:11
... The value in the graph (25.57k) is actually not at all seen in the `df` output. — exhuma, Dec 02 '11 at 08:17
Upon further investigation, I see that the munin plugin `open_inodes`, takes the value from `/proc/sys/fs/inode-nr`. It's a kernel, and not a file-system value. A bit more googling pointed me to this: http://www.mjmwired.net/kernel/Documentation/sysctl/fs.txt#119 From that document I would assume that the limit could be found in `inode-max`. But this file does not exist on my system. Is it possible that this is no longer pertinent on newer kernels? This would allow me to remove this graph from my munin instance! — exhuma, Dec 02 '11 at 08:42
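For anyone wanting to look at the same kernel counters directly, here is a minimal sketch that parses `/proc/sys/fs/inode-nr`. It assumes the file holds two integers, which the kernel sysctl documentation describes as nr_inodes (allocated) and nr_free_inodes; the sample numbers in the comment are made up, and the function name is mine.

```python
from pathlib import Path
from typing import Tuple

INODE_NR_PATH = Path("/proc/sys/fs/inode-nr")  # Linux only


def parse_inode_nr(text: str) -> Tuple[int, int]:
    """Return (allocated, free) from the contents of inode-nr."""
    # Example contents (made up): "32120\t1010\n"
    allocated, free = (int(field) for field in text.split()[:2])
    return allocated, free


if INODE_NR_PATH.exists():
    allocated, free = parse_inode_nr(INODE_NR_PATH.read_text())
    print(f"allocated: {allocated}, free: {free}, in use: {allocated - free}")
```

These are in-memory kernel inode counters, not per-file-system totals, so they are not expected to match the `df -i` output.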
