Strange Recurrent Excessive I/O Wait

Question

I know quite well that I/O wait has been discussed multiple times on this site, but all the other topics seem to cover constant I/O latency, while the I/O problem we need to solve on our server occurs at irregular (short) intervals, but is ever-present with massive spikes of up to 20k ms a-wait and service times of 2 seconds. The disk affected is /dev/sdb (Seagate Barracuda, for details see below).

A typical iostat -x output would at times look like this, which is an extreme sample but by no means rare:

iostat (Oct 6, 2013)
  tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
 0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
 0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
16.00      0.00    156.00      9.75     21.89    288.12     36.00     57.60
 5.50      0.00     44.00      8.00     48.79   2194.18    181.82    100.00
 2.00      0.00     16.00      8.00     46.49   3397.00    500.00    100.00
 4.50      0.00     40.00      8.89     43.73   5581.78    222.22    100.00
14.50      0.00    148.00     10.21     13.76   5909.24     68.97    100.00
 1.50      0.00     12.00      8.00      8.57   7150.67    666.67    100.00
 0.50      0.00      4.00      8.00      6.31  10168.00   2000.00    100.00
 2.00      0.00     16.00      8.00      5.27  11001.00    500.00    100.00
 0.50      0.00      4.00      8.00      2.96  17080.00   2000.00    100.00
34.00      0.00   1324.00      9.88      1.32    137.84      4.45     59.60
 0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
22.00     44.00    204.00     11.27      0.01      0.27      0.27      0.60

Let me provide you with some more information regarding the hardware. It's a Dell 1950 III box with Debian as OS where uname -a reports the following:

Linux xx 2.6.32-5-amd64 #1 SMP Fri Feb 15 15:39:52 UTC 2013 x86_64 GNU/Linux

The machine is a dedicated server that hosts an online game without any databases or I/O heavy applications running. The core application consumes about 0.8 of the 8 GBytes RAM, and the average CPU load is relatively low. The game itself, however, reacts rather sensitive towards I/O latency and thus our players experience massive ingame lag, which we would like to address as soon as possible.

iostat:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.77    0.01    1.05    1.59    0.00   95.58

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sdb              13.16        25.42       135.12  504701011 2682640656
sda               1.52         0.74        20.63   14644533  409684488

Uptime is:

19:26:26 up 229 days, 17:26,  4 users,  load average: 0.36, 0.37, 0.32

Harddisk controller:

01:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 04)

Harddisks:

Array 1, RAID-1, 2x Seagate Cheetah 15K.5 73 GB SAS
Array 2, RAID-1, 2x Seagate ST3500620SS Barracuda ES.2 500GB 16MB 7200RPM SAS

Partition information from df:

Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sdb1            480191156  30715200 425083668   7% /home
/dev/sda2              7692908    437436   6864692   6% /
/dev/sda5             15377820   1398916  13197748  10% /usr
/dev/sda6             39159724  19158340  18012140  52% /var

Some more data samples generated with iostat -dx sdb 1 (Oct 11, 2013)

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00    15.00    0.00   70.00     0.00   656.00     9.37     4.50    1.83   4.80  33.60
sdb               0.00     0.00    0.00    2.00     0.00    16.00     8.00    12.00  836.00 500.00 100.00
sdb               0.00     0.00    0.00    3.00     0.00    32.00    10.67     9.96 1990.67 333.33 100.00
sdb               0.00     0.00    0.00    4.00     0.00    40.00    10.00     6.96 3075.00 250.00 100.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     4.00    0.00   0.00 100.00
sdb               0.00     0.00    0.00    2.00     0.00    16.00     8.00     2.62 4648.00 500.00 100.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     2.00    0.00   0.00 100.00
sdb               0.00     0.00    0.00    1.00     0.00    16.00    16.00     1.69 7024.00 1000.00 100.00
sdb               0.00    74.00    0.00  124.00     0.00  1584.00    12.77     1.09   67.94   6.94  86.00

Characteristic charts generated with rrdtool can be found here:

iostat plot 1, 24 min interval: http://imageshack.us/photo/my-images/600/yqm3.png/

iostat plot 2, 120 min interval: http://imageshack.us/photo/my-images/407/griw.png/

As we have a rather large cache of 5.5 GBytes, we thought it might be a good idea to test if the I/O wait spikes would perhaps be caused by cache miss events. Therefore, we did a sync and then this to flush the cache and buffers:

echo 3 > /proc/sys/vm/drop_caches

and directly afterwards the I/O wait and service times virtually went through the roof, and everything on the machine felt like slow motion. During the next few hours the latency recovered and everything was as before - small to medium lags in short, unpredictable intervals.

Now my question is: does anybody have any idea what might cause this annoying behaviour? Is it the first indication of the disk array or the raid controller dying, or something that can be easily mended by rebooting? (At the moment we're very reluctant to do this, however, because we're afraid that the disks might not come back up again.)

Any help is greatly appreciated.

Thanks in advance, Chris.

Edited to add: we do see one or two processes go to 'D' state in top, one of which seems to be kjournald rather frequently. If I'm not mistaken, however, this does not indicate the processes causing the latency, but rather those affected by it - correct me if I'm wrong. Does the information about uninterruptibly sleeping processes help us in any way to address the problem?

@Andy Shinn requested smartctl data, here it is:

smartctl -a -d megaraid,2 /dev/sdb yields:

smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Device: SEAGATE  ST3500620SS      Version: MS05
Serial number:
Device type: disk
Transport protocol: SAS
Local Time is: Mon Oct 14 20:37:13 2013 CEST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Current Drive Temperature:     20 C
Drive Trip Temperature:        68 C
Elements in grown defect list: 0
Vendor (Seagate) cache information
  Blocks sent to initiator = 1236631092
  Blocks received from initiator = 1097862364
  Blocks read from cache and sent to initiator = 1383620256
  Number of read and write commands whose size <= segment size = 531295338
  Number of read and write commands whose size > segment size = 51986460
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 36556.93
  number of minutes until next internal SMART test = 32

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   509271032       47         0  509271079   509271079      20981.423           0
write:         0        0         0         0          0       5022.039           0
verify: 1870931090      196         0  1870931286   1870931286     100558.708           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  16   36538                 - [-   -    -]
# 2  Background short  Completed                  16   36514                 - [-   -    -]
# 3  Background short  Completed                  16   36490                 - [-   -    -]
# 4  Background short  Completed                  16   36466                 - [-   -    -]
# 5  Background short  Completed                  16   36442                 - [-   -    -]
# 6  Background long   Completed                  16   36420                 - [-   -    -]
# 7  Background short  Completed                  16   36394                 - [-   -    -]
# 8  Background short  Completed                  16   36370                 - [-   -    -]
# 9  Background long   Completed                  16   36364                 - [-   -    -]
#10  Background short  Completed                  16   36361                 - [-   -    -]
#11  Background long   Completed                  16       2                 - [-   -    -]
#12  Background short  Completed                  16       0                 - [-   -    -]

Long (extended) Self Test duration: 6798 seconds [113.3 minutes]

smartctl -a -d megaraid,3 /dev/sdb yields:

smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Device: SEAGATE  ST3500620SS      Version: MS05
Serial number:
Device type: disk
Transport protocol: SAS
Local Time is: Mon Oct 14 20:37:26 2013 CEST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Current Drive Temperature:     19 C
Drive Trip Temperature:        68 C
Elements in grown defect list: 0
Vendor (Seagate) cache information
  Blocks sent to initiator = 288745640
  Blocks received from initiator = 1097848399
  Blocks read from cache and sent to initiator = 1304149705
  Number of read and write commands whose size <= segment size = 527414694
  Number of read and write commands whose size > segment size = 51986460
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 36596.83
  number of minutes until next internal SMART test = 28

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   610862490       44         0  610862534   610862534      20470.133           0
write:         0        0         0         0          0       5022.480           0
verify: 2861227413      203         0  2861227616   2861227616     100872.443           0

Non-medium error count:        1

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  16   36580                 - [-   -    -]
# 2  Background short  Completed                  16   36556                 - [-   -    -]
# 3  Background short  Completed                  16   36532                 - [-   -    -]
# 4  Background short  Completed                  16   36508                 - [-   -    -]
# 5  Background short  Completed                  16   36484                 - [-   -    -]
# 6  Background long   Completed                  16   36462                 - [-   -    -]
# 7  Background short  Completed                  16   36436                 - [-   -    -]
# 8  Background short  Completed                  16   36412                 - [-   -    -]
# 9  Background long   Completed                  16   36404                 - [-   -    -]
#10  Background short  Completed                  16   36401                 - [-   -    -]
#11  Background long   Completed                  16       2                 - [-   -    -]
#12  Background short  Completed                  16       0                 - [-   -    -]

Long (extended) Self Test duration: 6798 seconds [113.3 minutes]

Can you check cron ? maybe it's rotating very big logs. You can use iotop to see which process is doing the IO. — Cristi, Oct 12 '13 at 19:48
Run `pidstat -d 10` for a long time and use the results to correlate process IO commits against when iostat shows a high IOWait. — Matthew Ife, Oct 12 '13 at 20:26
Mlfe, pidstat shows that usually one or two processes write to disk in any given 2 s interval (the game server, diskstat-update and sometimes java, but the fun fact is that when I do iostat and pidstat both every 2 s, in the instance when iowait goes up and util% approaches 100%, all processes vanish from the pidstat list, i.e. it seems that *nobody* writes to disk anymore. — Chris, Oct 12 '13 at 20:43
@Cristi iotop doesn't yield any useful information, there seems to be no specific process that generates high I/O, i.e. just before the latency goes up. — Chris, Oct 12 '13 at 20:51
Have you tried another IO scheduler? http://stackoverflow.com/questions/1009577/selecting-a-linux-i-o-scheduler You'll need to see which one gives you best performance, but generally deadline is a good option. Also if you're concerned about your raid controller, have you any indication from `dmesg`, or if the disks supports S.M.A.R.T, does `smartctl` indicate any errors? — Petter H, Oct 12 '13 at 21:20
@PetterH Current scheduler seems to be cfq (`cat /sys/block/sdb/queue/scheduler` => noop anticipatory deadline [cfq]). `smartctl` gives absolutely no indication regarding drive failure, selftest completed without errors. Regarding `dmesg`, it contains nothing else but network messages due to the long uptime, but also in times of high iowait nothing unusual seems to accumulate there. [edit: I obviously misread the scheduler info; we're running with cfq at the moment.] — Chris, Oct 12 '13 at 21:54
@PetterH Just talked to another admin, changing the scheduler to deadline didn't improve the situation, he tried that already. — Chris, Oct 12 '13 at 22:08
Can you add the `smartctl -a` output for the particular drive? Also, if `smartctl` is not outputting anything, have you tried specifying the `megaraid` driver as the type, such as `smartctl -a -d megaraid,0 /dev/sdb`? I'm wondering if the pause is the disk re-writing a sector and this might be visible in smartmontools. — Andy Shinn, Oct 14 '13 at 03:24
@AndyShinn: I added the data you requested to the original post. Interestingly, the lag disappeared yesterday for a short period of about 1.5 hours, bringing the load down to virtually none as it should be on that system, as you can see in this graph: [link](http://img17.imageshack.us/img17/8854/fa25.png). We are pretty sure that nobody actively did anything on the system to change that. (The larger spikes that you see around 8 AM are a somewhat bigger backup transfer, but even that should come nowhere near to seriously bothering the system performance-wise. — Chris, Oct 14 '13 at 19:00
Does dmesg indicate any errors/timeouts with the disk? I should point out that some of those iostat values are *very* unusual. 2 4096 byte block writes took 1 second to process? Thats not right. — Matthew Ife, Oct 14 '13 at 19:01
@MIfe, that's precisely what we think, this _isn't_ right especially on the background of the actual disk load of the system, which is pretty close to, well, naught. But none of the measures we took did anything to mend that, and we're a bit at a loss regarding where to look further. Interestingly, the bytes_written/sec value did not change a tad during the time when the lag was gone, on the contrary. The value was around 70 kBytes/sec all the time. — Chris, Oct 14 '13 at 19:18
@Mlfe, nope. Nothing of the sort. `blktrace` is available, though, and we could switch on block logging for the device, but I must admit that we lack the experience in interpreting the data output for it to be of any meaningful help. If someone from around here can provide us with a hint where to look, or can point us to some web resource, we'd be really grateful. Even more so, if it leads to resolving the problem. :-) — Chris, Oct 14 '13 at 19:29
I have the exact same issue. Problem is repeating every hour approximately. IO is not saturated, but every hour IO just stops for a couple of seconds up to maybe a minute. Atop shows 0 for RDSK and WRDSK, and iowait just skyrockets. Also, svctm goes to 1000 just like in your case. I have IBM ServeRaid M5110 (rebranded LSI 9265-8i) with 4 Samsung SSDs (PM853T) attached, and create VD with RAID 0. After losing couple of days on kernel tunning, I just think that the issue can simply be incompatibility between controller and the drives.... Any further ideas about this? :-/ — Jakov Sosic, Nov 07 '14 at 23:02
what is your filesystem, and can check here http://wiki.hetzner.de/index.php/Partition_Alignment/en#Example_of_faulty_alignment and see if your partitions are aligned properly ? — Hrvoje Špoljar, Nov 07 '14 at 23:56

catpnosis · Answer 1 · 2013-10-20T21:00:09.250

2

(I assume that your disks are directly attached into the server, and not over NFS, for example.)

What is important is that your svctm (in iostat output) is extremely high, which suggests hardware problem with RAID or disks. Svctm for normal disks should be around 4 (ms). May be less, but not too higher.

Unfortunately, smartctl output is not informative in your case. It have errors corrected but this could be normal. Long test seems to be completed OK, but that's inconclusive again. ST3500620SS seems to be good old server/raid type disk, which should respond quickly on read errors (unlike desktop/non-raid disks), so this could be more complicated hardware problem than just bad sectors. Try to find something unusual (like high error rates) in RAID statistics: http://hwraid.le-vert.net/wiki/LSIMegaRAIDSAS

My suggestion is disks replacement should be next step.

Update:

Svctm is more important factor, as high util% is just consequence of svctm being abnormally high.

I saw similar problem when desktop disks was installed into Promise RAID. Desktop disks designed to try to repair read errors by many long retries, which contributes into latency (these read errors could be because of some other factor, such as vibration, which is much stronger in server room than in desktop). Unlike that, disks designed to be used in RAID just report quickly any errors to RAID controller, which can correct them with RAID reduncancy. Plus, server disks could be designed to be more mechanically resitant against constant strong vibration. There is common myth that server disks are same as desktop just being more expensive, which is wrong, they are actually different.

Q: Ah, what I wanted to ask: if it's a hardware problem, don't you think that the problem should be continually visible and not disappear for some time? Do you happen to have any explanation for that effect?

A:

Problem may always be there but it become noticeable only on high load.
Vibration levels could be different in different time of day (depending, for example, on what nearby servers do). If your problem is disks being affected by vibration it's definitely could disappear and reappear. I saw similar behavior when I had my 'desktop disks' problem. (Of course, your disks is server ones and recommended for RAIDs, so it's not exactly the same problem. But it could be similar.)

edited Oct 20 '13 at 21:00

answered Oct 20 '13 at 05:44

catpnosis

191
4

Thank you for your reply. We will certainly have a look at RAID statistics. What still irritates me, however, is that during one short period of 1.5 h the latency was completely normal again (see [link](http://img513.imageshack.us/img513/4453/7vaz.png) for example, where you find the util% dropping to virtually zero on last Sunday). To change the commit interval of kjournald from 5s to 30s somewhat helped with the problem - write processes of the game server don't get blocked as often as they used to be (which can be seen in the graph as well). – Chris Oct 20 '13 at 15:03
Ah, what I wanted to ask: if it's a hardware problem, don't you think that the problem should be continually visible and _not_ disappear for some time? Do you happen to have any explanation for that effect? – Chris Oct 20 '13 at 15:08
I updated my post with answers to your questions and more info. – catpnosis Oct 20 '13 at 21:01
Thanks for your update. As I don't know the exact situation in the rack where the machine is installed, I don't have a good opinion on that. I am, however, skeptical toward this "environmental" explanation because the overall load is very low and pretty much constant. If the problem is triggered by the combination of a constant vibrational outside influence in combination with high load, shouldn't it be visible mainly during the backup at 8 AM? At the moment, it's constantly there with only a short period where it wasn't, and that somehow seems to contradict your hypothesis. – Chris Oct 20 '13 at 23:10
Update: Someone with root access provided me with the output of `megacli -PDList -a0` just now and it shows no errors for any of the 4 disks installed. Temperatures of all drives are at 20 °C (68 °F) - cooling obviously isn't a factor either. – Chris Oct 20 '13 at 23:25
@Chris Did you resolve the problem? I wonder of outcomes. – catpnosis Dec 08 '13 at 16:30
Any further info on this problem? – Jakov Sosic Nov 07 '14 at 14:52
1

@catpnosis, as had happened before, the problem disappeared by itself without any intervention from the administrators. However, the server recently decided to perform an unscheduled reboot, which might have helped to resolve the problem, and the issue didn't reappear since. In short, we are still completely at a loss regarding the cause of the problem. Hopefully, it is now gone for good. – Chris Dec 09 '14 at 08:07

score 0 · Answer 2 · answered Nov 11 '14 at 12:58

I had a very similar issue. IBM ServeRaid M5110 (rebranded LSI 9265-8i) and CentOS 6.x

First VD was RAID0 of 4 IBM branded Hitachi drives.

Then we bought Samsung PM853T SSDs, and installed them in another 4 drives and created another RAID0. When we switched our workload from platters to SSDs, every 1 hour IO would just skyrocket, and all operations would halt. Load would go from regular ~2 to up over 80. After couple of dozen seconds, everything would calm down and applications would continue working.

Such situation didn't ever occur on platters.

So, my first impression was some kind of incompatibility between LSI and Samsung. After couple of days, and a lot of head scratching and debugging, I found out that MegaCli64 was the culprit. We run it via Zabbix for monitoring drives health, but when scanning controller, MegaCli would halt waiting at SSDs, dozen of seconds plus for each SSD times 4 almost two minutes. That would drop all I/Os to zero, and skyrocket iowait and load.

Solution was to find MegaCli version which didn't cause the issue. We downloaded version from IBM site.

Strange Recurrent Excessive I/O Wait

2 Answers2