We have a couple of these SuperMicro MicroCloud units.
In total we've got 16 servers (2x8) which all randomly spew these messages in dmesg:
[4661350.802707] ata2.00: failed command: WRITE FPDMA QUEUED
[4661350.802734] ata2.00: cmd 61/00:28:00:d0:fc/04:00:0f:00:00/40 tag 5 ncq 524288 out
[4661350.802735] res 40/00:0c:00:f8:fc/00:00:0f:00:00/40 Emask 0x10 (ATA bus error)
[4661350.802821] ata2.00: status: { DRDY }
Everything seems to work fine despite the errors, but it feels very wrong to ignore them. They occur mostly during periods of high disk activity.
On a single server, the errors randomly stopped appearing, even during high disk activity.
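To check whether the errors really cluster around busy periods, the dmesg output could be tallied per port and per hour of uptime. Below is a minimal Python sketch of that idea, not something we have run on these servers. The function names (`count_failed_commands`, `failures_per_uptime_hour`) are made up for illustration, and the regular expression assumes the bracketed-seconds timestamp format shown in the excerpt above.

```python
import re
from collections import Counter

# Matches lines such as:
# [4661350.802707] ata2.00: failed command: WRITE FPDMA QUEUED
# Assumption: dmesg timestamps are seconds since boot, in square brackets.
FAILED_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+"
    r"(?P<port>ata\d+(?:\.\d+)?): failed command: (?P<cmd>.+)$"
)


def count_failed_commands(lines):
    """Count 'failed command' messages per (port, command) pair."""
    counts = Counter()
    for line in lines:
        m = FAILED_RE.match(line.strip())
        if m:
            counts[(m.group("port"), m.group("cmd"))] += 1
    return counts


def failures_per_uptime_hour(lines):
    """Count 'failed command' messages per whole hour of uptime."""
    hours = Counter()
    for line in lines:
        m = FAILED_RE.match(line.strip())
        if m:
            hours[int(float(m.group("ts")) // 3600)] += 1
    return hours


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): dmesg | python3 ata_errors.py
    text = sys.stdin.read().splitlines()
    for key, n in sorted(count_failed_commands(text).items()):
        print(key, n)
    for hour, n in sorted(failures_per_uptime_hour(text).items()):
        print("uptime hour", hour, n)
```

Comparing the per-hour counts against disk load graphs for the same period would show whether the correlation with high activity holds, and whether all ports are affected equally.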
Googling suggests it can be due to loose connectors or failing drives, but it happens on 16 different servers with different types of hard drives (eight use 7200 RPM WD Black SATA drives, and the other eight use 10000 RPM VelociRaptors).
We tried two different Linux kernels, 2.6.32 (Debian Squeeze) and 3.2.44 (Debian Wheezy).
The server vendor suggests upgrading to the newest BIOS, but we are already running it.
So now we're stuck :) Anybody got a suggestion?
Full dmesg: http://pastebin.com/Z9k1kXbc
Update: Jim Garrison pointed me to an AskUbuntu question that mentions defective Intel chipsets. I now worry that we are affected by this, although the defect was discovered back in 2011. (The servers were built in Q4 2012, but SuperMicro could have used an old batch from 2011; they make their own motherboards.)
"lspci" gives me this:
00:00.0 Host bridge: Intel Corporation Sandy Bridge DMI2 (rev 07)
00:1f.2 SATA controller: Intel Corporation Patsburg 6-Port SATA AHCI Controller (rev 06)
A softpedia news article mentions that "rev04" is affected. Should I gather from "rev 07" in lspci that we are not affected?
Intel errata from June 2013 mention some similar problems:
Due to a circuit design issue on Intel 6 Series Chipset and Intel C200 Series Chipset, electrical lifetime wear out may affect clock distribution for SATA ports 2-5. This may manifest itself as a functional issue on SATA ports 2-5 over time.
The errata refer to the chipsets by names such as "Intel® Q67 Chipset", "Intel® Q65 Chipset", etc. How can I find out which chipset (by that kind of name) I have, from a Debian command prompt?
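One approach I can think of is to pull the PCI IDs and revision out of the lspci output and look those up against Intel's documents. `lspci -nn` prints the numeric vendor:device pair in square brackets next to the name. The Python sketch below parses either the plain or the `-nn` form of a line. The function name `parse_lspci_line` is made up for illustration, and the line layout the regular expression expects is an assumption based on the output quoted above.

```python
import re

# Handles both plain lspci lines and `lspci -nn` lines, for example:
# 00:1f.2 SATA controller: Intel Corporation Patsburg 6-Port SATA AHCI Controller (rev 06)
# Assumption: with -nn the IDs appear as [vvvv:dddd] just before the
# optional "(rev xx)" suffix.
LSPCI_RE = re.compile(
    r"^(?P<slot>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+"
    r"(?P<cls>[^:]+):\s+"
    r"(?P<name>.*?)"
    r"(?:\s+\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\])?"
    r"(?:\s+\(rev (?P<rev>[0-9a-f]{2})\))?$"
)


def parse_lspci_line(line):
    """Return a dict with slot, cls, name, vendor, device and rev.

    Fields missing from the line are None; a line that does not look
    like lspci output at all yields None.
    """
    m = LSPCI_RE.match(line.strip())
    return m.groupdict() if m else None


if __name__ == "__main__":
    import subprocess

    out = subprocess.check_output(["lspci", "-nn"]).decode()
    for raw in out.splitlines():
        info = parse_lspci_line(raw)
        # 00:1f.x is where the devices quoted above sit on these boards.
        if info and info["slot"].startswith("00:1f."):
            print(info["slot"], info["vendor"], info["device"], info["rev"], info["name"])
```

Whether the resulting IDs map cleanly onto a marketing name like "Q67" is something I am not sure of; they would still need to be matched by hand against the ID tables in Intel's datasheet or specification update.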
Update: I have now located the correct errata for the chipset, I think. (It is BD82C602J). Nothing too serious there, it seems.