Same hard drive problems in dmesg on several servers

Question

We have a couple of these SuperMicro MicroCloud units.

In total we've got 16 servers (2x8) which all randomly spew these messages in dmesg:

[4661350.802707] ata2.00: failed command: WRITE FPDMA QUEUED
[4661350.802734] ata2.00: cmd 61/00:28:00:d0:fc/04:00:0f:00:00/40 tag 5 ncq 524288 out
[4661350.802735]          res 40/00:0c:00:f8:fc/00:00:0f:00:00/40 Emask 0x10 (ATA bus error)
[4661350.802821] ata2.00: status: { DRDY }

Everything seems fine despite the errors, but it feels very wrong to ignore them. They occur mostly during periods of high disk activity.
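
For anyone trying to read the `cmd` line, here is a minimal Python sketch that decodes it. It assumes the field order that libata, as far as I can tell, uses in its error report (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and 512-byte logical sectors. The function name `decode_ncq_cmd` is made up for this example.

```python
# Sketch: decode the taskfile in a libata "cmd" error line for an NCQ command.
# Assumptions (not confirmed by the vendor): libata's usual field order and
# 512-byte logical sectors. For NCQ commands the sector count lives in the
# feature registers and the tag in bits 7:3 of the count register.

CMD_LINE = ("[4661350.802734] ata2.00: cmd 61/00:28:00:d0:fc/04:00:0f:00:00/40 "
            "tag 5 ncq 524288 out")


def decode_ncq_cmd(line):
    """Return command, tag, sector count, byte count and LBA from a cmd line."""
    fields = line.split("cmd ", 1)[1].split()[0]
    command, low, high, _device = fields.split("/")
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high.split(":")
    )
    sectors = (hob_feature << 8) | feature
    lba = ((hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
           | (lbah << 16) | (lbam << 8) | lbal)
    return {
        "command": int(command, 16),  # 0x61 is WRITE FPDMA QUEUED
        "tag": nsect >> 3,
        "sectors": sectors,
        "bytes": sectors * 512,       # assumes 512-byte logical sectors
        "lba": lba,
    }


decoded = decode_ncq_cmd(CMD_LINE)
print(decoded)
```

Decoded this way, the tag (5) and the transfer size (524288 bytes) agree with what the kernel prints at the end of the same line, which suggests the field order is right. So the failing command was an ordinary 512 KiB queued write.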

On one server, the messages stopped at random and did not come back, even during high disk activity.
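
To compare servers (including the one where the messages stopped), a small script like this could tally the failures per ATA port and per error mask. It is only a sketch: the regular expressions are fitted to the excerpt above, `summarize` is a made-up name, and `SAMPLE` would be replaced with the real `dmesg` output of each node.

```python
import re
from collections import Counter

# Sample copied from the dmesg excerpt in this question; in practice, read the
# full dmesg output of each server instead.
SAMPLE = """\
[4661350.802707] ata2.00: failed command: WRITE FPDMA QUEUED
[4661350.802734] ata2.00: cmd 61/00:28:00:d0:fc/04:00:0f:00:00/40 tag 5 ncq 524288 out
[4661350.802735]          res 40/00:0c:00:f8:fc/00:00:0f:00:00/40 Emask 0x10 (ATA bus error)
[4661350.802821] ata2.00: status: { DRDY }
"""

# Patterns fitted to the lines above (an assumption, other kernels may differ).
FAILED = re.compile(r"^\[\s*\d+\.\d+\]\s+(ata\d+(?:\.\d+)?): failed command: (.+)$")
EMASK = re.compile(r"Emask (0x[0-9a-fA-F]+) \(([^)]+)\)")


def summarize(text):
    """Count failed commands per (port, command) and error masks per (mask, reason)."""
    failures = Counter()
    emasks = Counter()
    for line in text.splitlines():
        failed = FAILED.match(line)
        if failed:
            failures[(failed.group(1), failed.group(2).strip())] += 1
        emask = EMASK.search(line)
        if emask:
            emasks[(emask.group(1), emask.group(2))] += 1
    return failures, emasks


failures, emasks = summarize(SAMPLE)
for (port, command), count in failures.most_common():
    print(port, command, count)
for (mask, reason), count in emasks.most_common():
    print(mask, reason, count)
```

If the errors cluster on the same port number across all nodes, that would point at the board or backplane rather than at individual drives.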

Googling suggests it can be caused by loose connectors or failing drives, but it happens on 16 different servers, and with different types of hard drives at that (eight use 7200 RPM WD Black SATA drives, and the other eight use 10000 RPM VelociRaptors).

We tried two different Linux kernels, 2.6.32 (Debian Squeeze) and 3.2.44 (Debian Wheezy).

The server vendor suggests upgrading to the newest BIOS, but we are already running it.

So now we're stuck :) Anybody got a suggestion?

Full dmesg: http://pastebin.com/Z9k1kXbc

Update: Jim Garrison pointed me to an Ask Ubuntu question where they mention defective Intel chipsets. I now worry that we are affected by this, although the defect was discovered back in 2011. (The servers were built in Q4 2012, but SuperMicro could have had an old batch from 2011, since they make their own motherboards.)

"lspci" gives me this:

00:00.0 Host bridge: Intel Corporation Sandy Bridge DMI2 (rev 07)
00:1f.2 SATA controller: Intel Corporation Patsburg 6-Port SATA AHCI Controller (rev 06)

A Softpedia news article mentions that "rev04" is affected. Should I gather from "rev 07" in lspci that we are not affected?

Intel errata from June 2013 mention some similar problems:

Due to a circuit design issue on Intel 6 Series Chipset and Intel C200 Series Chipset, electrical lifetime wear out may affect clock distribution for SATA ports 2-5. This may manifest itself as a functional issue on SATA ports 2-5 over time.

The chipsets are named "Intel® Q67 Chipset", "Intel® Q65 Chipset", etc. in the errata. How can I find out which chipset (by that kind of name) I have, from a Debian command prompt?
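
One thing that might help narrow it down: `lspci -nn` (from pciutils) prints numeric vendor:device IDs in brackets, and those can be looked up against Intel's documents or the PCI ID database. A rough Python sketch for pulling them out follows. The bracketed IDs in the sample line are illustrative and were not copied from these servers, and `parse_lspci_nn` is a made-up helper name. The ID identifies the PCI device, so mapping it to a marketing name such as "Q67" still takes a manual lookup.

```python
import re
import subprocess

# One line in the format that `lspci -nn` prints. The text is based on the
# lspci output in this question; the bracketed class and vendor:device IDs are
# illustrative assumptions, not values read from these servers.
SAMPLE = ("00:1f.2 SATA controller [0106]: Intel Corporation Patsburg 6-Port "
          "SATA AHCI Controller [8086:1d02] (rev 06)")

LINE = re.compile(
    r"^(?P<slot>\S+) (?P<cls>.+?) \[(?P<class_id>[0-9a-f]{4})\]: "
    r"(?P<name>.+) \[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]"
    r"(?: \(rev (?P<rev>[0-9a-f]+)\))?$"
)


def parse_lspci_nn(text):
    """Return one dict per device line that matches the `lspci -nn` format."""
    devices = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            devices.append(match.groupdict())
    return devices


def read_lspci():
    """Run `lspci -nn` on the local machine (requires pciutils)."""
    result = subprocess.run(["lspci", "-nn"], capture_output=True, text=True, check=True)
    return result.stdout


devices = parse_lspci_nn(SAMPLE)
for dev in devices:
    print(dev["slot"], dev["vendor"] + ":" + dev["device"], "rev", dev["rev"], dev["name"])
```

On a real node, `parse_lspci_nn(read_lspci())` would list every device, and the SATA controller and ISA/LPC bridge entries are the ones to look up.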

Update: I think I have now located the correct errata for the chipset (it is BD82C602J). Nothing too serious there, it seems.

They've been Super nice previously, we have lots of "SM gear". :) But this one case is rather annoying, I'd have to say. — sune, Jul 09 '13 at 23:26
A [thread](http://forum.zentyal.org/index.php?topic=11312.0) I found seems to mention bad cables. Try swapping the ata0/1 cables with others (like say, ata2/3). If possible you could probably just shove in replacements. — Nathan C, Jul 09 '13 at 23:56
The problem is that it's part of some backplane, which connects directly to the nodes. So there is no sata cable you can disconnect/connect. Also, it happens on 16 different servers so it would be suspicious if they were all bad. :/ — sune, Jul 10 '13 at 00:08
All the boards with that chipset were pulled from the distribution channel, and any that actually made it to customers were replaced. It's highly unlikely you received a board with one of these chips. — Michael Hampton, Jul 10 '13 at 00:16
Michael, I think you are right; I now found the correct [spec change document](http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/c600-series-chipset-spec-update.pdf) and it mentions no serious-looking SATA issues. — sune, Jul 10 '13 at 00:54

score 2 · Answer 1 · answered Jul 10 '13 at 00:16

It certainly looks like a controller issue to me. Hopefully you have some warranty left. It's a bus error, not an unresponsive drive, which is what you usually see with defective drive controller boards and marginal cables (or backplanes). That points to the system board.

You could also try flashing over the BIOS (even with the same version) to rule out BIOS corruption of some kind.

I believe the result of this error is just a reset and continue, so you may have nothing to worry about, though it will hurt performance. It may also get worse over time.

I fear that you may be right, but it would of course be nice to try as many options as possible before going the replacements route. Flashing the BIOS over could be a next step. — sune, Jul 10 '13 at 00:53
