We have a Debian server with a 3Ware 9650SE 8-drive RAID controller and a 5-disk RAID6 array, acting as a virtual machine host; host and guests all run Linux. Problems keep occurring and I suspect an undetected broken disk.
We have now had several crashes where both the host and all guests report that I/O was blocked for 120 seconds or more. We suspected a faulty RAID controller, but replacing it with an identical one with identical firmware didn't fix it. I didn't think it would, because a second RAID1 array on the same controller kept working properly.
Almost a week ago (Sunday), when this was acting up, the auto verify was at 66%. Last night (Friday morning) it was at 67%. Both before and after booting, and both while experiencing problems. When I turned off the verify with tw_cli /c0/u0 stop verify, things became responsive again.
I suspect it got stuck on a disk fault at around 66%. An auto verify starts on Saturday:
# tw_cli /c0 show verify
/c0 basic verify weekly preferred start: Saturday, 12:00AM
and would normally be long done by Friday. Seeing as how Sunday was 66% and Friday was 67%, it's unlikely to be coincidence.
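To put a number on how far the verify slowed down, here is a minimal Python sketch that reads the %V/I/M column from tw_cli /c0 show output and computes progress per day. The sample row, the "VERIFYING" status text and the exact timestamps are assumptions for illustration (I only noted the day, not the time), and the helper names are made up.

```python
import re
from datetime import datetime


def verify_percent(show_output, unit="u0"):
    """Return the %V/I/M percentage for `unit`, or None if the column shows '-'."""
    for line in show_output.splitlines():
        fields = line.split()
        # Unit rows are laid out as: Unit UnitType Status %RCmpl %V/I/M ...
        if len(fields) >= 5 and fields[0] == unit:
            match = re.match(r"(\d+)%", fields[4])
            return int(match.group(1)) if match else None
    return None


def percent_per_day(first, second):
    """Progress rate between two (datetime, percent) samples."""
    (t0, p0), (t1, p1) = first, second
    days = (t1 - t0).total_seconds() / 86400
    return (p1 - p0) / days


# Hypothetical row as it might look while a verify is running (assumed format).
sample = "u0 RAID-6 VERIFYING - 66% 256K 5587.9 RiW OFF"

start = (datetime(2013, 6, 15, 0, 0), 0)                        # Saturday 12:00AM
sunday = (datetime(2013, 6, 16, 12, 0), verify_percent(sample))  # assumed time of day
friday = (datetime(2013, 6, 21, 12, 0), 67)                      # assumed time of day

print("Saturday to Sunday: %.1f %%/day" % percent_per_day(start, sunday))
print("Sunday to Friday:   %.1f %%/day" % percent_per_day(sunday, friday))
```

With these assumed times the rate drops from tens of percent per day to a fraction of a percent per day, which is why I think it is stuck rather than merely slow.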
Running 'smartctl -a -d 3ware,0 /dev/twa0' and 'smartctl -t long' (long SMART self-test) on all the drives didn't reveal any errors. Neither does tw_cli /c0 show alarms.
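For reference, this is roughly how the SMART checks can be repeated for every port in one go. It is only a sketch: the function names are made up, the port range 0 to 7 is taken from the 8 ports of this controller, and the device node /dev/twa0 is the one used above. Restricting the loop to valid port numbers matters, because querying a port that doesn't exist fills the controller's diag log with errors.

```python
import subprocess


def smartctl_commands(ports=range(8), device="/dev/twa0", options=("-a",)):
    """Build one smartctl argument list per 3ware port."""
    return [["smartctl", *options, "-d", "3ware,%d" % port, device]
            for port in ports]


def run_commands(commands):
    """Run each command and collect (command, exit status, output) tuples."""
    results = []
    for command in commands:
        done = subprocess.run(command, capture_output=True, text=True)
        results.append((command, done.returncode, done.stdout))
    return results


if __name__ == "__main__":
    # Needs root and a real controller; prints one summary line per port.
    for command, status, output in run_commands(smartctl_commands()):
        print(" ".join(command), "-> exit status", status)
```

Passing options=("-t", "long") builds the commands for the long self-test instead.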
I suspected a disk was broken in a way that is hard to detect, but when I took each drive out of the array one by one, created a 'single' unit from it and filled it with zeros using dd, no disk showed errors.
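One limitation of the dd test is that it only reports errors, not how long each write took. The following Python sketch does the same zero fill but also records chunks that take unusually long. It is an untested idea rather than something I have run on these disks; the function name, chunk size and the one-second threshold are arbitrary assumptions, and like dd it destroys whatever is on the target.

```python
import os
import time


def timed_zero_fill(path, total_bytes, chunk_bytes=1 << 20, slow_seconds=1.0):
    """Overwrite the first total_bytes of path with zeros.

    Returns a list of (offset, seconds) for every chunk whose write plus
    fsync took at least slow_seconds. DESTRUCTIVE: existing data is lost.
    """
    zeros = bytes(chunk_bytes)
    slow_chunks = []
    offset = 0
    with open(path, "r+b", buffering=0) as target:
        while offset < total_bytes:
            size = min(chunk_bytes, total_bytes - offset)
            started = time.monotonic()
            written = target.write(zeros[:size])
            os.fsync(target.fileno())  # force the data out to the device
            elapsed = time.monotonic() - started
            if not written:
                raise OSError("no bytes written at offset %d" % offset)
            if elapsed >= slow_seconds:
                slow_chunks.append((offset, elapsed))
            offset += written
    return slow_chunks
```

Called as timed_zero_fill("/dev/sdX", size_in_bytes), where /dev/sdX is a placeholder for the single-drive unit, the returned offsets would show whether one drive stalls in a particular region without ever returning an error.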
How can I find out which disk, if any, is at fault? Or any other advice?
Edit:
this is the layout:
# tw_cli /c0 show
Unit  UnitType  Status  %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-6    OK      -       -       256K    5587.9    RiW    OFF
u1    SPARE     OK      -       -       -       1863.01   -      OFF
u2    RAID-1    OK      -       -       -       1862.63   RiW    ON

VPort  Status  Unit  Size     Type  Phy  Encl-Slot  Model
------------------------------------------------------------------------------
p0     OK      u0    1.82 TB  SATA  0    -          ST32000542AS
p1     OK      u0    1.82 TB  SATA  1    -          ST32000542AS
p2     OK      u0    1.82 TB  SATA  2    -          ST32000542AS
p3     OK      u0    1.82 TB  SATA  3    -          ST32000542AS
p4     OK      u0    1.82 TB  SATA  4    -          ST32000542AS
p5     OK      u1    1.82 TB  SATA  5    -          WDC WD2002FYPS-02W3
p6     OK      u2    1.82 TB  SATA  6    -          WDC WD2002FYPS-02W3
p7     OK      u2    1.82 TB  SATA  7    -          WDC WD2002FYPS-02W3

Name  OnlineState  BBUReady  Status  Volt  Temp  Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           Yes       OK      OK    OK    0      xx-xxx-xxxx
The unit in question is u0.
edit2:
tw_cli /c0 show diag shows something interesting (edit3: this is harmless, I found out it's caused by calling smartctl -a -d 3ware,X /dev/twa0 where X is an invalid port):
QueueAtaPassthrough() called with invalid TargetHandle: 0x17, portHandle: 0xFF
Legacy opcode=0xB1 error=0x10E
E=010E T=14:15:51 : Invalid operation for specified port
E=010E T=14:15:51 U=0 : Return error status to host
Error, Unit 23: Invalid operation for specified port
(EC:0x10e, SK=0x05, ASC=0x24, ASCQ=0x00, SEV=01, Type=0x70)
No additional sense data
Error, Unit 23: 0x10E OVERRIDDEN due to invalid sense buffer descriptor
sense buffer: len=0, address=0x414ca2c7c
Send AEN (code, time): 0031h, 06/21/2013 14:26:16
Synchronize host/controller time
(EC:0x31, SK=0x00, ASC=0x00, ASCQ=0x00, SEV=04, Type=0x71)
I get tons of these. I had no idea what they meant, and I couldn't even make out which unit or port they referred to. (edit3: I do know now, they're harmless).
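For anyone else trying to decode entries like these, a small Python sketch that pulls the numeric fields out of the diag lines shown above. The regular expressions are written against these exact lines only and the function names are made up. Decoded, TargetHandle 0x17 is decimal 23, the same number as in "Unit 23", so it is not one of the real ports p0 to p7. In standard SCSI terms, SK=0x05 with ASC=0x24/ASCQ=0x00 means "illegal request, invalid field in CDB"; I am assuming the 3ware firmware uses the standard meanings.

```python
import re

HANDLE_RE = re.compile(
    r"TargetHandle: 0x([0-9A-Fa-f]+), portHandle: 0x([0-9A-Fa-f]+)"
)
SENSE_RE = re.compile(
    r"EC:0x([0-9A-Fa-f]+), SK=0x([0-9A-Fa-f]+), ASC=0x([0-9A-Fa-f]+), "
    r"ASCQ=0x([0-9A-Fa-f]+), SEV=(\d+), Type=0x([0-9A-Fa-f]+)"
)


def parse_handles(line):
    """Return (target_handle, port_handle) as integers, or None if absent."""
    match = HANDLE_RE.search(line)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)


def parse_sense(line):
    """Return the error code and sense fields of a diag line as a dict, or None."""
    match = SENSE_RE.search(line)
    if match is None:
        return None
    ec, sk, asc, ascq, sev, kind = match.groups()
    return {
        "ec": int(ec, 16),
        "sk": int(sk, 16),
        "asc": int(asc, 16),
        "ascq": int(ascq, 16),
        "sev": int(sev),
        "type": int(kind, 16),
    }


print(parse_handles(
    "QueueAtaPassthrough() called with invalid TargetHandle: 0x17, portHandle: 0xFF"
))
print(parse_sense("(EC:0x10e, SK=0x05, ASC=0x24, ASCQ=0x00, SEV=01, Type=0x70)"))
```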
Given my edit3, I'm back to square one. Nothing indicates a disk is broken, except that the verify hangs at 66% and causes the array to hang, which also sometimes happens randomly. I wish the verify would find the fault...