Can you post iostat -xdk 1 50 output from when the problem occurs? (See the iostat man page for the switch that lists partition names.) Pastebin it along with top output taken at the same time.
Okay, so your disk seems to become overloaded at certain points in your workload.
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 85.00 5.00 249.00 11.00 6040.00 64.00 46.95 10.73 44.23 3.85 100.00
sda 3.00 0.00 275.00 0.00 7764.00 0.00 56.47 7.63 23.27 3.64 100.00
sda 125.00 29.00 221.00 3.00 5508.00 128.00 50.32 7.49 41.08 4.46 100.00
sda 14.00 65.00 224.00 28.00 5940.00 372.00 50.10 1.97 8.05 3.52 88.80
Comparing the iterations, the read load sporadically becomes very large, and await increases with it. However, the average queue size (avgqu-sz) is still fairly low. That means most of the await time is spent while the storage itself is servicing the requests; it is not being spent on the Linux side, i.e. in the I/O scheduler.
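To make that comparison concrete, here is a small sketch (the field order is simply taken from the Device: header line above) that parses the pasted sample and prints read throughput next to await, so you can see the spike and the latency line up:

```python
# Parse the iostat sample pasted above; field order comes from its header line.
header = "rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util".split()
samples = [
    "sda 85.00 5.00 249.00 11.00 6040.00 64.00 46.95 10.73 44.23 3.85 100.00",
    "sda 3.00 0.00 275.00 0.00 7764.00 0.00 56.47 7.63 23.27 3.64 100.00",
    "sda 125.00 29.00 221.00 3.00 5508.00 128.00 50.32 7.49 41.08 4.46 100.00",
    "sda 14.00 65.00 224.00 28.00 5940.00 372.00 50.10 1.97 8.05 3.52 88.80",
]

rows = []
for line in samples:
    fields = line.split()
    # Skip the device name, map the numeric columns onto the header names.
    rows.append(dict(zip(header, map(float, fields[1:]))))

for r in rows:
    print(f"rkB/s={r['rkB/s']:8.1f}  await={r['await']:6.2f} ms  %util={r['%util']:.1f}")
```

The lines with the heaviest read throughput are also the ones where await climbs into the 40 ms range, which is what the paragraph above is pointing at.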
Roughly speaking, there are two queues: one in the I/O scheduler and one on the hardware side. await is measured per I/O, from the time it enters the I/O scheduler to the time it is completed by the storage, i.e. the disk. avgqu-sz is the average number of I/Os held in both the scheduler queue and the storage LUN queue combined. If avgqu-sz is much less than the storage's queue depth, little time is being spent in the scheduler queue: the scheduler passes the I/Os straight down to the storage, and until the storage completes them, await keeps growing.
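One way to check that these numbers hang together is Little's law: average queue size ≈ arrival rate × average latency. A quick sanity check on the first sample line above (values copied from that line, the 1000 divisor just converts await from milliseconds to seconds):

```python
# Little's law sanity check on the first iostat line above:
#   avgqu-sz ≈ (r/s + w/s) * await / 1000    (await is in milliseconds)
r_per_s, w_per_s = 249.00, 11.00
await_ms = 44.23
reported_avgqu_sz = 10.73

estimated = (r_per_s + w_per_s) * await_ms / 1000.0
print(f"estimated avgqu-sz = {estimated:.2f}, reported = {reported_avgqu_sz}")
```

The estimate comes out around 11.5 versus the reported 10.73, i.e. consistent to within averaging error. And since a typical LUN queue depth is 32 or more, a queue of ~11 outstanding I/Os can sit entirely on the device side, which matches the point about where the await time is accumulating.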
Long story short: in my opinion, the storage itself becomes slow at particular times, and that is what increases the latency.