The underlying counters are documented in https://www.kernel.org/doc/Documentation/block/stat.txt
Setting meaningful thresholds based on the absolute number of IOPS and sectors read from or written to a block device (the lowercase -w and -c options) requires a priori knowledge of the actual capabilities of that particular block device (for instance by benchmarking it).
Using the queue length (the uppercase -W and -C options) seems a bit more universal. A growing I/O queue is bad regardless of how fast the underlying storage is: it means you're pushing more reads/writes than the device can support, and your applications will slow down.
I have no idea, though, whether the documented 50 and 100 milliseconds are reasonable or completely arbitrary values.
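To see the raw counters behind these options, you can read the stat file for the device yourself. This is only a minimal sketch: the helper name show_stat is my own, the field order is assumed from the kernel's stat.txt (read I/Os, read merges, read sectors, read ticks, write I/Os, write merges, write sectors, write ticks, in_flight, io_ticks, time_in_queue), and it is not necessarily how check_diskstat.sh derives its values.

```shell
#!/bin/sh
# Hypothetical helper (not part of check_diskstat.sh): print selected fields
# from one line in the format of /sys/block/<dev>/stat.
# Field positions are assumed from the kernel's block/stat.txt.
show_stat() {
    # Split the line on whitespace into positional parameters $1..$11
    set -- $1
    echo "read_ios=$1 read_sectors=$3 write_ios=$5 write_sectors=$7 in_flight=$9 time_in_queue_ms=${11}"
}

# Usage on a real system (vda is just an example device name):
#   show_stat "$(cat /sys/block/vda/stat)"
```

The counters are cumulative since boot, so a rate or an average queue time needs two samples and the difference between them.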
For my virtual servers, using absolute numbers is relatively easy: they are provisioned in flavors with specific limits, so I would only need to set the warning/critical levels at, for instance, 80% and 95%, respectively, of those assigned limits.
For example, with a flavor of 600 IOPS and 10 MB/s:
Divide the assigned disk_read_bytes_sec and disk_write_bytes_sec by 512 (the sector size) to get the limits in sectors the virtual disk will support: 10 MB = 10000000 bytes, and 10000000 / 512 = 19531 (rounded down).
19531 * 80% = 15624 and 600 * 80% = 480
19531 * 95% = 18554 and 600 * 95% = 570
./check_diskstat.sh -d vda -w 480,15624,15624 -c 570,18554,18554
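The arithmetic above can be scripted so the thresholds follow the flavor limits. This is a rough sketch, not part of check_diskstat.sh: the threshold function and its arguments are names I made up, it assumes 512-byte sectors and integer (rounded down) division, and it prints the values in the same three-value order as the command above (one IOPS value followed by two sector values).

```shell
#!/bin/sh
# Hypothetical helper: derive a threshold triple from flavor limits.
# Arguments: IOPS limit, bytes-per-second limit, percentage (integer).
threshold() {
    iops_limit=$1
    bytes_limit=$2
    pct=$3
    # Convert the byte limit to 512-byte sectors (integer division rounds down)
    sectors_limit=$((bytes_limit / 512))
    iops=$((iops_limit * pct / 100))
    sectors=$((sectors_limit * pct / 100))
    # One IOPS value followed by the sector value for reads and for writes
    echo "$iops,$sectors,$sectors"
}

# Flavor from the example: 600 IOPS and 10 MB/s (10000000 bytes per second)
warn=$(threshold 600 10000000 80)
crit=$(threshold 600 10000000 95)
echo "./check_diskstat.sh -d vda -w $warn -c $crit"
```

If the read and write byte limits of a flavor differ, the function would need a separate argument for each; here both are assumed equal, as in the example.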