I virtualized a datacenter a few months back, and we now run a pool of three HP DL360 G5 servers, each with 32 GB of memory and dual Intel Xeons. Recently we have been experiencing two issues. First, disk reads have become extremely slow: typing "ls" on a Linux VM that has only a few files takes several seconds to return a file list. Second, VMs on the cluster sometimes get their filesystems remounted read-only by themselves. dmesg on the hosts shows a large number of "DRDY ERR" errors. Our main storage repositories are on a Drobo B800i, shared over iSCSI. These are enterprise servers and they are going down intermittently, which is never good. I have posted iostat output and a grep of the DRDY errors from dmesg below.
Here is the iostat output from one of the servers:
[root@XenServer-1 tmp]# iostat
Linux 2.6.32.43-0.4.1.xs1.8.0.835.170778xen (XenServer-1.ethoplex.com) 07/31/2014
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.42    0.00    0.46    3.51    0.40   95.21
Device:            tps  Blk_read/s  Blk_wrtn/s    Blk_read    Blk_wrtn
cciss/c0d0       17.30       76.54      304.24   893755376  3552874247
cciss/c0d0p1      1.04        0.27       22.82     3169526   266433488
cciss/c0d0p2      0.00        0.01        0.00       73890           0
cciss/c0d0p3     16.25       76.24      281.43   890365720  3286440759
sda              76.84       59.78       87.32   698047689  1019733585
dm-0              0.68        0.95        0.28    11071656     3217737
sdb               3.44      177.64       37.74  2074378210   440737634
dm-2              0.00        0.01        0.00      135808        2216
dm-3             12.23      361.61      131.55  4222728781  1536204287
sdc               4.05       27.93      328.02   326147810  3830552980
sdd               6.23      101.72      113.03  1187808537  1319897350
tda               1.61        9.74       40.01   113749658   467248640
dm-28             0.84       36.78       23.11   429521222   269838659
dm-14             0.24       56.24        0.00   656723598           0
dm-21             0.08       18.17        0.00   212172507           0
tdb               0.08        0.12        1.44     1384368    16853616
dm-5              0.38        4.03       36.17    47063052   422416430
tdc               0.61        4.03       36.10    47062722   421602000
dm-7              1.26       17.74        5.51   207110960    64292628
tde               1.22       17.64        5.49   206019946    64129696
dm-30             0.03        0.01        0.60       61956     6979438
dm-4              0.02        0.00        8.85        1014   103326613
tdd               0.11        0.00        8.82        1264   103049216
dm-9              0.00        0.02        0.05      175978      591472
tdg               0.00        0.02        0.05      175950      590704
dm-10             0.01        0.09        0.21     1104226     2488947
tdf               0.01        0.09        0.21     1105562     2472346
dm-6              0.00        0.00        0.04        1568      419135
dm-16             0.00        0.01        0.00      132105           0
dm-17             0.03        0.05        0.76      625890     8867990
dm-8              0.00        0.06        0.10      752923     1226072
tdh               0.00        0.07        0.10      788356     1218922
tdi               0.00        0.00        0.00         884           0
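To make the device table easier to compare, here is a minimal Python sketch that parses iostat device rows and ranks them by a chosen column. The function names (parse_iostat, busiest) and the dictionary keys are my own, and it assumes the six-column layout shown above. Keep in mind that plain iostat with no interval argument reports averages since boot, so a ranking like this shows long-term load rather than what happens during a stall.

```python
def parse_iostat(text):
    """Return one dict per device row of plain `iostat` output."""
    rows = []
    in_table = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Device:":
            in_table = True  # device rows follow this header
            continue
        if not in_table or len(fields) != 6:
            continue
        try:
            rows.append({
                "device": fields[0],
                "tps": float(fields[1]),
                "blk_read_s": float(fields[2]),
                "blk_wrtn_s": float(fields[3]),
                "blk_read": int(fields[4]),
                "blk_wrtn": int(fields[5]),
            })
        except ValueError:
            continue  # skip anything that is not a numeric device row
    return rows


def busiest(rows, key="tps", n=3):
    """Return the n rows with the highest value in the given column."""
    return sorted(rows, key=lambda r: r[key], reverse=True)[:n]


# A few rows copied from the output above, used only as sample input.
sample = """\
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
cciss/c0d0 17.30 76.54 304.24 893755376 3552874247
sda 76.84 59.78 87.32 698047689 1019733585
sdb 3.44 177.64 37.74 2074378210 440737634
dm-3 12.23 361.61 131.55 4222728781 1536204287
"""

rows = parse_iostat(sample)
for row in busiest(rows, key="tps"):
    print(row["device"], row["tps"])
```

Ranking by "blk_read_s" or "blk_wrtn_s" instead of "tps" gives the read-heavy or write-heavy devices.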
Dmesg Grep DRDY:
[11645348.631020] ata1.00: status: { DRDY ERR }
[11646434.714902] ata1.00: status: { DRDY ERR }
[11648427.773389] ata1.00: status: { DRDY ERR }
[11648950.139954] ata1.00: status: { DRDY ERR }
[11649612.475350] ata1.00: status: { DRDY ERR }
[11650177.522603] ata1.00: status: { DRDY ERR }
[11650649.818020] ata1.00: status: { DRDY }
[11651837.989833] ata1.00: status: { DRDY ERR }
[11654729.414605] ata1.00: status: { DRDY ERR }
[11655685.782290] ata1.00: status: { DRDY ERR }
[11657120.774143] ata1.00: status: { DRDY ERR }
[11659704.724995] ata1.00: status: { DRDY }
[11661322.210812] ata1.00: status: { DRDY ERR }
[11662029.088563] ata1.00: status: { DRDY ERR }
[11663314.187972] ata1.00: status: { DRDY ERR }
[11667978.796829] ata1.00: status: { DRDY ERR }
[11670487.088008] ata1.00: status: { DRDY ERR }
[11671800.577054] ata1.00: status: { DRDY ERR }
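In the grep above, 16 of the 18 status lines carry the ERR flag. To see how often they occur, a small Python sketch like the following can extract the timestamps and flags and compute the mean gap between events. The regular expression and the function names (parse_status, summarize) are my own assumptions based on the line format shown here, not part of any tool.

```python
import re

# Matches lines such as: [11645348.631020] ata1.00: status: { DRDY ERR }
STATUS_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+(ata\d+\.\d+): status: \{ ([A-Z ]+) \}"
)


def parse_status(text):
    """Return (timestamp_seconds, device, [flags]) for each status line."""
    events = []
    for line in text.splitlines():
        m = STATUS_RE.search(line)
        if m:
            events.append((float(m.group(1)), m.group(2), m.group(3).split()))
    return events


def summarize(events):
    """Count events, count those flagged ERR, and average the gaps."""
    times = [t for t, _, _ in events]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "total": len(events),
        "with_err": sum(1 for _, _, flags in events if "ERR" in flags),
        "mean_gap_s": sum(gaps) / len(gaps) if gaps else None,
    }


# Three consecutive lines copied from the grep above, as sample input.
sample = """\
[11650177.522603] ata1.00: status: { DRDY ERR }
[11650649.818020] ata1.00: status: { DRDY }
[11651837.989833] ata1.00: status: { DRDY ERR }
"""

print(summarize(parse_status(sample)))
```

The timestamps are seconds since boot, so the gaps are in seconds as well.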
Dmesg:
[11464689.083861] sr 1:0:0:0: CDB: Get event status notification: 4a 01 00 00 10 00 00 00 08 00
[11464689.083875] ata1.00: cmd a0/00:00:00:08:00/00:00:00:00:00/a0 tag 0 pio 16392 in
[11464689.083876]          res 40/00:03:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
[11464689.083896] ata1.00: status: { DRDY }
[11464694.133755] ata1: link is slow to respond, please be patient (ready=0)
[11464699.123711] ata1: device not ready (errno=-16), forcing hardreset
[11464699.123727] ata1: soft resetting link
[11464699.344063] ata1.00: configured for PIO0
[11464699.348375] ata1: EH complete
[11464706.383733] ata1.00: qc timeout (cmd 0xa0)
[11464706.383766] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[11464706.383782] sr 1:0:0:0: CDB: Test Unit Ready: 00 00 00 00 00 00
[11464706.383794] ata1.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[11464706.383795]          res 51/20:03:00:00:00/00:00:00:00:00/a0 Emask 0x5 (timeout)
[11464706.383806] ata1.00: status: { DRDY ERR }
[11464711.433625] ata1: link is slow to respond, please be patient (ready=0)
[11464716.433591] ata1: device not ready (errno=-16), forcing hardreset
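One thing that may be worth checking is which device these messages belong to. The ata1.00 lines appear next to lines prefixed "sr 1:0:0:0", and sr is the name of the Linux SCSI CD-ROM driver, so the errors may relate to an optical drive and not to a data disk. That is an assumption to verify on the host, not something this excerpt proves. The Python sketch below groups log lines by their source prefix and lists the CDB names issued through sr; the regular expression and function names are hypothetical and based only on the line format above.

```python
import re
from collections import Counter

# Captures the source prefix (ata port, ata device, or sr device) and the message.
LINE_RE = re.compile(
    r"\[\s*\d+\.\d+\]\s*(ata\d+(?:\.\d+)?|sr \d+:\d+:\d+:\d+): (.*)"
)


def tally_sources(text):
    """Count log lines per source prefix, e.g. 'ata1', 'ata1.00', 'sr 1:0:0:0'."""
    counts = Counter()
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            counts[m.group(1)] += 1
    return counts


def cdb_names(text):
    """Return the command names from 'sr ...: CDB: <name>: <bytes>' lines."""
    names = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and m.group(1).startswith("sr ") and m.group(2).startswith("CDB: "):
            names.append(m.group(2)[len("CDB: "):].split(":")[0])
    return names


# Lines copied from the dmesg excerpt above, as sample input.
sample = """\
[11464689.083861] sr 1:0:0:0: CDB: Get event status notification: 4a 01 00 00 10 00 00 00 08 00
[11464689.083875] ata1.00: cmd a0/00:00:00:08:00/00:00:00:00:00/a0 tag 0 pio 16392 in
[11464689.083896] ata1.00: status: { DRDY }
[11464694.133755] ata1: link is slow to respond, please be patient (ready=0)
[11464706.383782] sr 1:0:0:0: CDB: Test Unit Ready: 00 00 00 00 00 00
"""

print(dict(tally_sources(sample)))
print(cdb_names(sample))
```

Running this over the full dmesg output would show whether any messages from the iSCSI-backed disks (sda through sdd in the iostat output) appear alongside the ata1 ones.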