I have 7 systems running the setup below. Every now and then a different disk falls offline, but on closer inspection the disk is not faulty and works flawlessly for at least another year. Since this happens on all 7 systems I find it unlikely that a single part (e.g. a cable) is acting up; instead it seems that some combination of the parts is slightly incompatible.

The problem is to locate exactly where the incompatibility lies.

(If you instead have a work-around where you can do a virtual re-seating of the hard disk from the command line, then you may be able to answer https://serverfault.com/questions/523315/re-activate-device-that-is-considered-dead).
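
A "virtual re-seat" here means dropping and re-discovering the device purely in software. A rough sketch of the idea, assuming the kernel's sysfs delete/rescan interface and the device names from the syslog below; the disk may well come back under a different name:

$ echo 1 > /sys/block/sdw/device/delete            # drop the stale device (5:0:22:0 in the syslog below)
$ echo "- - -" > /sys/class/scsi_host/host5/scan   # let the HBA rediscover its devices
$ mdadm /dev/md4 --re-add /dev/sdw                 # re-add the disk to the md array once it is back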

Server hardware: Dell 1950, Dell R815, Dell R715.

Operating system:

$ uname -a
Linux franklin 3.2.0-4-amd64 #1 SMP Debian 3.2.41-2+deb7u2 x86_64 GNU/Linux

Controller:

$ lspci | grep 22:
22:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)
$ sas2flash -listall
LSI Corporation SAS2 Flash Utility
Version 15.00.00.00 (2012.11.06) 
Copyright (c) 2008-2012 LSI Corporation. All rights reserved

    Adapter Selected is a LSI SAS: SAS2008(B2)   

Num   Ctlr            FW Ver        NVDATA        x86-BIOS         PCI Addr
----------------------------------------------------------------------------

1  SAS2008(B2)     15.00.00.00    0f.00.00.04    07.29.00.00     00:22:00:00

    Finished Processing Commands Successfully.
    Exiting SAS2Flash.

SAS/SATA expander (Supermicro 4U SAS/SATA expander backplane with a single LSI SAS2X36 expander chip):

$ cat /sys/devices/pci0000:20/0000:20:03.0/0000:22:00.0/host5/port-5:0/expander-5:0/port-5:0:21/end_device-5:0:21/target5:0:21/5:0:21:0/model
SAS2X36
$ cat /sys/devices/pci0000:20/0000:20:03.0/0000:22:00.0/host5/port-5:0/expander-5:0/port-5:0:21/end_device-5:0:21/target5:0:21/5:0:21:0/rev
0717

Disks:

$ cat /sys/devices/pci0000:20/0000:20:03.0/0000:22:00.0/host5/port-5:0/expander-5:0/port-5:0:1/end_device-5:0:1/target5:0:1/5:0:1:0/model
Hitachi HDS72404
$ cat /sys/devices/pci0000:20/0000:20:03.0/0000:22:00.0/host5/port-5:0/expander-5:0/port-5:0:1/end_device-5:0:1/target5:0:1/5:0:1:0/rev
A3B0

Disks in one system:

$ cat /sys/devices/pci0000:20/0000:20:0b.0/0000:23:00.0/host5/port-5:0/expander-5:0/port-5:0:8/end_device-5:0:8/target5:0:8/5:0:8:0/model
ST3000DM001-9YN1
$ cat /sys/devices/pci0000:20/0000:20:0b.0/0000:23:00.0/host5/port-5:0/expander-5:0/port-5:0:8/end_device-5:0:8/target5:0:8/5:0:8:0/rev
CC4C

Syslog:

sd 5:0:22:0: [sdw] Unhandled error code
mpt2sas1: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
sd 5:0:22:0: [sdw] Unhandled error code
mpt2sas1: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
sd 5:0:22:0: [sdw]
mpt2sas1: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
sd 5:0:22:0: [sdw] CDB: Write(10): 2a 00 3a 92 b9 00 00 01 00 00
end_request: I/O error, dev sdw, sector 982694144
sd 5:0:22:0: [sdw]  Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
sd 5:0:22:0: [sdw] CDB: Write(10): 2a 00 3a 92 b7 00 00 01 00 00
end_request: I/O error, dev sdw, sector 982693632
sd 5:0:22:0: [sdw] Unhandled error code
sd 5:0:22:0: [sdw]  Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
sd 5:0:22:0: [sdw] CDB: Read(16):
sd 5:0:22:0: [sdw] Unhandled error code
 88 00 00 00 00 01 43 e2 f2 d0 00 00 00 10 00 00
end_request: I/O error, dev sdw, sector 5433914064
sd 5:0:22:0: [sdw]  Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
sd 5:0:22:0: [sdw] CDB: Write(10): 2a 00 3a 92 bd 00 00 01 00 00
end_request: I/O error, dev sdw, sector 982695168
sd 5:0:22:0: [sdw]  Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
md/raid:md4: Disk failure on sdw, disabling device.
md/raid:md4: Operation continuing on 9 devices.
scsi 5:0:22:0: [sdw] Unhandled error code
scsi 5:0:22:0: [sdw]  Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
scsi 5:0:22:0: [sdw] CDB: Write(10): 2a 00 3a 92 b8 00 00 01 00 00
end_request: I/O error, dev sdw, sector 982693888
scsi 5:0:22:0: [sdw] Unhandled error code
scsi 5:0:22:0: [sdw]  Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
scsi 5:0:22:0: [sdw] CDB: Write(10): 2a 00 3a 92 bc 00 00 01 00 00
end_request: I/O error, dev sdw, sector 982694912
mpt2sas1: removing handle(0x0021), sas_addr(0x500304800182694c)
Ole Tange
  • +1 for starting to ask the important questions :) – Sven Jul 15 '13 at 11:01
  • A bad cable that lacks shielding can cause issues with checksums [thus causing read and write issues]. Have you tried replacing the cables? – monksy Jul 15 '13 at 11:12
  • Cables have been replaced with known goods. Also I would expect Linux to retry the command after resetting the scsi bus. – Ole Tange Jul 15 '13 at 11:13
  • From what I've managed to dig up so far, the messages indicate that there are connectivity issues - not SMART alerts. Maybe someone else with extensive BiY experience can help. All I know is that they stay away from S-ATA disks in large setups because of the lack of commands/queues compared to SAS. I'll ask a few to take a look at this. – pauska Jul 15 '13 at 11:56
  • @pauska Can you elaborate (with links?) to what you dug up? – Ole Tange Jul 15 '13 at 12:20
  • A long list of different problems with S-ATA on SAS expanders (including the 0x31120303 error code): http://hardforum.com/showthread.php?t=1548145; not quite the same messages, but similar: http://serverfault.com/questions/407703/deciphering-continuing-mpt2sas-syslog-messages, and so on. – pauska Jul 15 '13 at 12:25
  • And here's an exact match: http://comments.gmane.org/gmane.linux.file-systems.zfs.user/1620 - also lots of S-ATA disks on a SAS backplane under heavy load. – pauska Jul 15 '13 at 12:30
  • And another one who suspects a different drive model to be incompatible with an LSI SAS expander: http://forums.storagereview.com/index.php/topic/29493-seagate-barracuda-green-2tb-review/page__st__30 – pauska Jul 15 '13 at 12:32
  • The bottom line here: you should get a batch of SAS drives instead and see if this fixes the problem - at least test it on one of the servers. – pauska Jul 15 '13 at 12:33
  • @OleTange How many disks are in each server? – ewwhite Jul 15 '13 at 12:35
  • @ewwhite 24-45 disks per server. – Ole Tange Jul 15 '13 at 12:35

1 Answer

We're missing information here. You're suggesting that you have 24-45 disks per server in this storage setup.

  • Which specific controller(s) are you using?
  • Due to the number of disks, you may have some drives in an external enclosure. Please provide the make/model of the external drive enclosure in use.
  • What specific drive models are you using? Are all of the disks desktop-grade drives? (One way to pull these details is sketched after this list.)
  • What filesystem are you using?
  • Describe the disk and RAID layout.
  • Was this always a problem or did it develop over time?
  • Is Supermicro involved anywhere in this setup?
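
One way to collect most of this without downtime, sketched under the assumption that smartmontools and lsscsi are installed (md4 and sdw are taken from your syslog):

$ lsscsi -t                  # map SCSI addresses to device nodes and transports
$ smartctl -i /dev/sdw       # model, firmware and rotation rate of a single disk
$ cat /proc/mdstat           # overview of all md arrays
$ mdadm --detail /dev/md4    # member disks and RAID level of the affected array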

Depending on the enclosure setup, you may be running into SATA timeouts or bus errors. This can have an ill effect on all of the drives attached to the controller.
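
One thing worth checking on that theory: desktop-class SATA drives without SCT ERC can spend far longer on internal error recovery than the kernel's default 30-second command timeout. A rough sketch of how to inspect and tune both sides, assuming smartmontools and a drive that actually supports the feature:

$ smartctl -l scterc /dev/sdw                 # show SCT Error Recovery Control settings, if supported
$ smartctl -l scterc,70,70 /dev/sdw           # cap read/write recovery at 7 seconds, where supported
$ cat /sys/block/sdw/device/timeout           # kernel SCSI command timeout in seconds (default 30)
$ echo 180 > /sys/block/sdw/device/timeout    # or raise the kernel timeout instead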

Another issue could be poor SAS/SATA link negotiation. I've certainly experienced this on some SAS expanders when 1.5Gbps and 6.0Gbps drives are mixed on the same board.
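
The negotiated rate and error counters are exposed per PHY by the Linux SAS transport class, so a link that trained down or keeps flapping is easy to spot; a minimal sketch (paths assume the mpt2sas setup shown in the question):

$ grep . /sys/class/sas_phy/phy-*/negotiated_linkrate       # per-PHY link speed (1.5, 3.0 or 6.0 Gbit)
$ grep . /sys/class/sas_phy/phy-*/invalid_dword_count       # error counters: values that keep rising
$ grep . /sys/class/sas_phy/phy-*/loss_of_dword_sync_count  # point at a marginal link or backplane slot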

Please provide more information.

ewwhite
  • I am very interested in your question about Supermicro. Can you elaborate? – Halfgaar Jul 15 '13 at 12:57
  • @Halfgaar Could you provide the feedback on the other questions I asked? – ewwhite Jul 15 '13 at 13:35
  • The original post is not mine. I'm just curious about that statement. – Halfgaar Jul 15 '13 at 13:40
  • @Halfgaar Ooops... Well, I've found that Supermicro SAS expanders/backplanes and some of the JBOD enclosures don't behave predictably in many circumstances. The note in my answer about SAS/SATA speed downshifting and link negotiation is something I've only experienced on certain revisions of Supermicro gear. I also can't use their JBODs for ZFS anymore because of wonky behavior. – ewwhite Jul 15 '13 at 13:43