
I virtualized a datacenter a few months back, and we have a pool of three HP DL360 G5 servers, each with 32 GB of memory and dual Intel Xeons. Recently we have been experiencing two issues. The first is that disk read speed has become extremely slow: typing "ls" on a Linux VM that has only a few files takes several seconds to return the file list. The second is that VMs on the cluster will sometimes have their filesystems remounted read-only on their own. dmesg on the hosts produces a plethora of "DRDY ERR" errors. The main storage repositories we use are on a Drobo B800i, shared over iSCSI. I have posted iostat output and a grep of the DRDY errors from dmesg below; these are enterprise servers and they are going down intermittently, which is never good.
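
Since the shared storage is reached over iSCSI, I can also pull the session state from the hosts if that would help; this is just a sketch of what I would run (the Drobo target details are omitted here):

[root@XenServer-1 tmp]# iscsiadm -m session          # list the active sessions to the Drobo
[root@XenServer-1 tmp]# iscsiadm -m session -P 3     # detailed connection state and attached SCSI disks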

Here is an iostat from one of the servers:

[root@XenServer-1 tmp]# iostat
Linux 2.6.32.43-0.4.1.xs1.8.0.835.170778xen (XenServer-1.ethoplex.com) 07/31/2014

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.42    0.00    0.46    3.51    0.40   95.21

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
cciss/c0d0       17.30        76.54       304.24  893755376 3552874247
cciss/c0d0p1      1.04         0.27        22.82    3169526  266433488
cciss/c0d0p2      0.00         0.01         0.00      73890          0
cciss/c0d0p3     16.25        76.24       281.43  890365720 3286440759
sda              76.84        59.78        87.32  698047689 1019733585
dm-0              0.68         0.95         0.28   11071656    3217737
sdb               3.44       177.64        37.74 2074378210  440737634
dm-2              0.00         0.01         0.00     135808       2216
dm-3             12.23       361.61       131.55 4222728781 1536204287
sdc               4.05        27.93       328.02  326147810 3830552980
sdd               6.23       101.72       113.03 1187808537 1319897350
tda               1.61         9.74        40.01  113749658  467248640
dm-28             0.84        36.78        23.11  429521222  269838659
dm-14             0.24        56.24         0.00  656723598          0
dm-21             0.08        18.17         0.00  212172507          0
tdb               0.08         0.12         1.44    1384368   16853616
dm-5              0.38         4.03        36.17   47063052  422416430
tdc               0.61         4.03        36.10   47062722  421602000
dm-7              1.26        17.74         5.51  207110960   64292628
tde               1.22        17.64         5.49  206019946   64129696
dm-30             0.03         0.01         0.60      61956    6979438
dm-4              0.02         0.00         8.85       1014  103326613
tdd               0.11         0.00         8.82       1264  103049216
dm-9              0.00         0.02         0.05     175978     591472
tdg               0.00         0.02         0.05     175950     590704
dm-10             0.01         0.09         0.21    1104226    2488947
tdf               0.01         0.09         0.21    1105562    2472346
dm-6              0.00         0.00         0.04       1568     419135
dm-16             0.00         0.01         0.00     132105          0
dm-17             0.03         0.05         0.76     625890    8867990
dm-8              0.00         0.06         0.10     752923    1226072
tdh               0.00         0.07         0.10     788356    1218922
tdi               0.00         0.00         0.00        884          0
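
If per-device latency and utilization figures would be more useful than the plain counters above, I can collect extended statistics as well; something along these lines (the interval and count are just example values):

[root@XenServer-1 tmp]# iostat -x 5 3     # extended stats (await, svctm, %util) over three 5-second samples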

dmesg | grep DRDY:

[11645348.631020] ata1.00: status: { DRDY ERR }
[11646434.714902] ata1.00: status: { DRDY ERR }
[11648427.773389] ata1.00: status: { DRDY ERR }
[11648950.139954] ata1.00: status: { DRDY ERR }
[11649612.475350] ata1.00: status: { DRDY ERR }
[11650177.522603] ata1.00: status: { DRDY ERR }
[11650649.818020] ata1.00: status: { DRDY }
[11651837.989833] ata1.00: status: { DRDY ERR }
[11654729.414605] ata1.00: status: { DRDY ERR }
[11655685.782290] ata1.00: status: { DRDY ERR }
[11657120.774143] ata1.00: status: { DRDY ERR }
[11659704.724995] ata1.00: status: { DRDY }
[11661322.210812] ata1.00: status: { DRDY ERR }
[11662029.088563] ata1.00: status: { DRDY ERR }
[11663314.187972] ata1.00: status: { DRDY ERR }
[11667978.796829] ata1.00: status: { DRDY ERR }
[11670487.088008] ata1.00: status: { DRDY ERR }
[11671800.577054] ata1.00: status: { DRDY ERR }

Dmesg:

[11464689.083861] sr 1:0:0:0: CDB: Get event status notification: 4a 01 00 00 10 00 00 00 08 00
[11464689.083875] ata1.00: cmd a0/00:00:00:08:00/00:00:00:00:00/a0 tag 0 pio 16392 in
[11464689.083876]          res 40/00:03:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
[11464689.083896] ata1.00: status: { DRDY }
[11464694.133755] ata1: link is slow to respond, please be patient (ready=0)
[11464699.123711] ata1: device not ready (errno=-16), forcing hardreset
[11464699.123727] ata1: soft resetting link
[11464699.344063] ata1.00: configured for PIO0
[11464699.348375] ata1: EH complete
[11464706.383733] ata1.00: qc timeout (cmd 0xa0)
[11464706.383766] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[11464706.383782] sr 1:0:0:0: CDB: Test Unit Ready: 00 00 00 00 00 00
[11464706.383794] ata1.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[11464706.383795]          res 51/20:03:00:00:00/00:00:00:00:00/a0 Emask 0x5 (timeout)
[11464706.383806] ata1.00: status: { DRDY ERR }
[11464711.433625] ata1: link is slow to respond, please be patient (ready=0)
[11464716.433591] ata1: device not ready (errno=-16), forcing hardreset
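
On the VMs flipping to read-only, this is roughly what I plan to check next inside an affected guest, since ext3/ext4 remounts read-only on I/O errors when the error behavior is remount-ro (the hostname and device name below are just examples; our guests see their disks as xvda):

[root@linux-vm ~]# dmesg | grep -iE 'remount|i/o error|ext[34]-fs'         # why the filesystem went read-only
[root@linux-vm ~]# tune2fs -l /dev/xvda1 | grep -i 'errors behavior'       # default error behavior in the superblock (continue / remount-ro / panic)
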
  • Unexplained intermittent failure usually means some hardware is going bad. Unless the storage unit is getting overloaded, something is on the blink, and you might need a second storage unit to isolate the problem. I'd start with the network cables and switch. If possible, maybe add some local storage for testing and comparison. – user16081-JoeT Jul 31 '14 at 22:36
  • Locally attached storage does not seem to have this issue. I will be doing a firmware upgrade on the switch tonight and then I will post the results. – Riley Jul 31 '14 at 22:47
  • Have you looked at the logs on the Drobos? You might find a ton of surprises in there. We actually have some Drobos within my company, and a colleague mentioned they were somewhat unreliable. Still, I'd start there, since the problems seem to be affecting multiple VMs across several servers by the sounds of it. – hookenz Aug 01 '14 at 04:18
  • I will look into this. I have sent the diagnostic file to their support department and it is being reviewed. – Riley Aug 01 '14 at 17:28
  • We have updated the firmware on our MikroTik switch and the disk I/O has significantly improved. – Riley Aug 04 '14 at 19:18
