I have a 45-disk array of Seagate Barracuda 3 TB ST3000DM001 drives (yes, I'm aware these are desktop drives) in a Supermicro SC847 JBOD, connected via an LSI 9285. I found a workaround for the problem described below by reducing the link speed via

MegaCli -PhySetLinkSpeed -phy0 2 -a0;
for i in $(seq 48); do MegaCli -PhySetLinkSpeed -phy${i} 2 -a0; done

and rebooting.

The question remains: Is this typical for current 6 Gb/s equipment? Is this the sad state of SATA storage? Or is some of my equipment (the SFF-8088 cables come to mind) bad?

The problem was:

While synchronizing a hardware RAID-6, disks kept going offline. Fetching SMART values revealed that the disks that went offline no longer increased their powered-on hours. That is, their firmware (CC4C) seems to crash.
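
The stalled-counter check described above can be scripted. What follows is a minimal sketch, not the method used in the original post: `power_on_hours` and `stalled_disks` are hypothetical helper names, and the snapshot file format (one `id hours` pair per line) is an assumption. It also assumes `smartctl -A` prints attribute 9 as `Power_On_Hours` with the raw value in the tenth column, which is the usual smartmontools layout for ATA disks.

```shell
#!/bin/sh
# Sketch only: helper names and the snapshot format are assumptions.

# Read `smartctl -A` output on stdin and print the raw value of
# attribute 9 (Power_On_Hours), i.e. the tenth whitespace-separated field.
power_on_hours() {
    awk '$1 == 9 && $2 == "Power_On_Hours" { print $10 }'
}

# Compare two snapshot files with one "id hours" pair per line and print
# the ids whose hours did not increase from the first to the second.
stalled_disks() {
    awk 'NR == FNR { old[$1] = $2; next }
         ($1 in old) && ($2 + 0) <= (old[$1] + 0) { print $1 }' "$1" "$2"
}
```

A snapshot line for a pass-through disk might be produced with something like `echo "sdb $(smartctl -A /dev/sdb | power_on_hours)"`. Disks hidden behind the RAID controller may need smartctl's `-d megaraid,N` device type instead. Two snapshots taken a few hours apart can then be fed to `stalled_disks old.txt new.txt`.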

Digging into the matter by switching to software RAID-6 with the disks passed through, I got tons of kernel messages scattered across all disks at 6 Gb/s:

sd 0:0:9:0: [sdb]  Sense Key : No Sense [current]
Info fld=0x0
sd 0:0:9:0: [sdb]  Add. Sense: No additional sense information

And finally, when a disk offlines:

megasas: [ 5]waiting for 160 commands to complete
...
megasas: [35]waiting for 159 commands to complete
...
megasas: [155]waiting for 156 commands to complete
...
megaraid_sas: pending commands remain after waiting, will reset adapter.

Ugly controller reset here, then minutes later:

megaraid_sas: Reset successful.
sd 0:0:28:0: Device offlined - not ready after error recovery
...
sd 0:0:28:0: [sdu] Unhandled error code
sd 0:0:28:0: [sdu]  Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
sd 0:0:28:0: [sdu] CDB: Read(10): 28 00 23 21 2f 40 00 00 70 00
sd 0:0:28:0: [sdu] killing request
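
To see how widely such messages are spread, the kernel log can be tallied per disk. This is a rough sketch: the function names `offlined_devices` and `no_sense_counts` are hypothetical, and the patterns are written against the exact log lines quoted in this post, so they may need adjusting for other kernel versions.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical; the patterns match the
# log lines quoted in this post and may differ on other kernels.

# Read a kernel log on stdin and print each SCSI address
# (host:channel:target:lun) that was reported as offlined, once.
offlined_devices() {
    sed -n 's/.*sd \([0-9:]*\): Device offlined.*/\1/p' | sort -u
}

# Read a kernel log on stdin and print "disk count" for every disk
# that logged a "Sense Key : No Sense" message.
no_sense_counts() {
    sed -n 's/.*\[\(sd[a-z]*\)] *Sense Key : No Sense.*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

Typical use would be `dmesg | no_sense_counts` and `dmesg | offlined_devices`. If the counts are spread evenly over all disks, that points away from a single bad drive.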

After reducing the speed to 3 Gb/s as described above, all problems vanished.

korkman
  • Out of curiosity, how is your RAID setup arranged? You said RAID6, but do you have multiple groups of RAID6, or are all drives in one RAID group? – ewwhite Apr 14 '12 at 14:36
  • Are you running the latest available firmware versions for all of the drives and the LSI card? The last time I saw anything like this, it was an issue with the drive firmware, though that was with enterprise-grade drives. Those consumer-grade drives simply may not be able to keep up with the commands from the RAID card. Maybe you can disable NCQ? – Charles Apr 14 '12 at 18:17
  • @Charles: Firmware is latest on RAID card (JBOD was actually new feature thx to firmware) and disks. Yes I can disable NCQ. Will test that next week (machine down because of OOM ... different story). – korkman Apr 14 '12 at 19:26
  • @ewwhite: I currently plan for 9 RAID-6 volumes, in groups of 14 partitions. So each disk will hold 3 partitions, which are members of 3 different RAID groups. – korkman Apr 14 '12 at 19:28
  • Rough. No big suggestions here, other than I would be using 6G nearline SAS instead of SATA just because of the unpredictable failure modes of SATA disks. http://serverfault.com/questions/331499/how-can-a-single-disk-in-a-hardware-sata-raid-10-array-bring-the-entire-array-to/331504#331504 – ewwhite Apr 14 '12 at 19:34
  • If they bother releasing a firmware update for a desktop drive, you'd usually better apply it (HP24 comes to mind). I was going to suggest you apply any firmware updates, but the cables seem more likely. If the disks support 6gbps I'd expect them to be able to handle a rate of requests sufficient to need it. – Falcon Momot Apr 15 '12 at 05:21
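
For reference, the NCQ test suggested in the comments is usually done on Linux by lowering the SCSI queue depth to 1 through sysfs. A minimal sketch follows. `set_queue_depth` is a hypothetical helper, the optional sysfs root argument exists only so the function can be tried against a scratch directory, and whether a depth of 1 actually disables NCQ for disks behind a MegaRAID controller is an assumption that would need verifying.

```shell
#!/bin/sh
# Sketch only: set_queue_depth is a hypothetical helper name.

# Write a queue depth for one block device, e.g. set_queue_depth sdb 1.
# The third argument overrides the sysfs root (default /sys) for dry runs.
set_queue_depth() {
    dev=$1 depth=$2 root=${3:-/sys}
    file="$root/block/$dev/device/queue_depth"
    [ -w "$file" ] || { echo "cannot write $file" >&2; return 1; }
    echo "$depth" > "$file"
}
```

Run as root, a loop such as `for d in /sys/block/sd*; do set_queue_depth "${d##*/}" 1; done` would apply it to every disk. The setting does not persist across reboots.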

1 Answer

Much like 4-pair UTP (Ethernet) cable, not all SATA cables can be used at every speed. Make sure your cables are rated for 6.0 Gb/s (usually the cable has a "split" appearance and has text like "6.0gbps SATA" printed on it).

Falcon Momot
  • Does this apply to SFF-8088? Wikipedia says it's spec'd "with future 10 Gbit/s capability". So not really implementation tolerance for more or less shielding. I think I will replace the cables, though, because they were really cheap. 1m 22 EUR. Other brands were 60-120 EUR. Maybe the manufacturer didn't honor the specs. – korkman Apr 14 '12 at 19:41
  • It'd be the cables, not the connector, almost certainly. – Falcon Momot Apr 14 '12 at 19:45
  • Heh, true. SFF-8088 only specs the connector, not the cable. Webshop gives no details about cable ratings, of course. But I'll keep that in mind from now on. – korkman Apr 14 '12 at 19:56
  • Can't keep the cable theory up, since it also happens on a second array and with expensive brand cables (external). It seems to be focused on the rear backplanes, not all disks per se. So a disk / SAS switch compatibility issue? – korkman Oct 08 '12 at 17:33
  • Possibly, but extremely unlikely. It's more likely to be a bad controller. – Falcon Momot Oct 09 '12 at 00:49
  • Also, two different controllers (LSI) tested. I think I have now replaced everything except the disks :-) – korkman Oct 11 '12 at 12:27