On Windows, whenever I want to validate/confirm that there might be IO-related issues on a volume that a database or other low-latency app lives on, I check disk latency.

If I see the Windows `Avg. Disk sec/Transfer` counter > 18-20 ms consistently, then my canary in a coal mine just died and I need to investigate further. Drop-dead simple.

I'm looking at Linux now, and don't see a similar latency-based metric. The quick research that I've done indicates that I might not even WANT to...I see lots of references to I/O Wait being the way most people track this.

Is there a ballpark rule of thumb that you use in regards to this? For example, is ANY I/O wait I see bad for a database's volume? Is there a simple `iostat` command that gives me a better look at overall disk health than just eyeballing `top`?

Thanks much!

  • You can look up `ioping` – ewwhite Apr 04 '17 at 00:28
  • Thanks, @ewwhite. I guess I'm just wondering if I need to change my focus entirely and instead monitor this in a different way, you know? – Russell Christopher Apr 04 '17 at 00:30
  • Enable sysstat collection on your systems. Then you can examine the iowait CPU percentage, which is very useful for diagnosing IO-related slowness. – EEAA Apr 04 '17 at 00:36
  • @RussellChristopher You can see example `sar` output [here](https://gist.githubusercontent.com/anderiv/ad9d511728fe1a869076e28c6c0564c8/raw/c60bf4e06e1a68d2633c44485170aa4b40e601b6/gistfile1.txt). Pay attention to the `%iowait` column. – EEAA Apr 04 '17 at 01:12
  • @Matt while it is VERY similar, the focus is slightly different. That Q&A is more focused on performing tests in a simulated environment, whereas this question seems to be more about monitoring current performance in a production environment. – BeowulfNode42 Apr 04 '17 at 07:54

1 Answer

Personally I use the command `iostat -xk 10` and look at the `await` column.

  • `-x` Display extended statistics.
  • `-k` Display statistics in kilobytes per second (use `-m` for megabytes per second).
  • `10` The reporting interval, in seconds.

This is virtually the same metric as the Windows `Avg. Disk sec/Transfer` counter, except that it is reported in milliseconds rather than seconds. So similar rules of thumb can be applied, though the right threshold depends on all sorts of things. I typically find that users start grumbling at 15 ms, and 20 ms is very bad.

Press Ctrl+C to quit, or specify the number of reports to view with the count parameter. Note that the first report is skewed because it shows averages since boot rather than over the sampling interval, so ignore it when watching live load.
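For example, to take six 10-second samples and then exit (ignoring that first, since-boot report):

    iostat -xk 10 6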

From the `man iostat` page:

`await` The average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.

Edit: `await` is the main metric I use to watch a disk under production loads to see if its throughput and IOPS are able to keep up with demand.

The %iowait stat is more about the balance between CPU and disk usage. %iowait will remain lower than expected if both the CPU and the disk are busy, while on the other side it can be relatively high, starting at fairly low disk usage levels, if the CPU is idle.

That said, await needs to be taken with a grain of salt as well. If there is a lot of sequential read/write happening it will skew the figure to a lower value, and the 18-20 ms rule of thumb will not be useful under those conditions: most requests will be for the sequential data and will be serviced by the disk very quickly, while the remaining random I/O sits waiting, because the disk's Native Command Queuing (NCQ) system optimises throughput by letting the disk choose the order in which requests are serviced.
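If you would rather be alerted than eyeball the report, a minimal sketch along these lines can watch `await` for you. Treat it as illustrative: the 20 ms threshold is just my rule of thumb from above, and the column layout varies between sysstat versions (newer releases split the column into `r_await`/`w_await`, so you would match those names instead):

    # Print any device whose await exceeds 20 ms, sampling every 10 seconds.
    # The awk script locates the await column from the header line, so it
    # copes with minor layout differences; prepend `stdbuf -oL` if the
    # output seems delayed when piped.
    iostat -xk 10 | awk '
        /await/ { for (i = 1; i <= NF; i++) if ($i == "await") c = i; next }
        c && $c + 0 > 20 { print $1, "await =", $c " ms" }
    '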

BeowulfNode42
  • Thanks @BeowulfNode42. Is this the primary metric you use in terms of eyeing a "bad disk"? New Relic seems to focus on I/O wait and disk utilization (read and write) percentage... This makes me wonder if I'm chasing the wrong metric, or if THEY are simply reporting less useful info... – Russell Christopher Apr 04 '17 at 13:07
  • @RussellChristopher the other stats provide the required context in which to interpret the await info, e.g. are there lots of IOPS (r/s and w/s), lots of MB/s, is the average request size (avgrq-sz) large or small, and what is the average queue size (avgqu-sz)? Yes, along with the CPU-related metrics %iowait, %user, %system, etc. to see if the disk is slowing the CPU or vice versa. – BeowulfNode42 Apr 07 '17 at 00:47