My current storage setup consists of two traditional HDDs and two SSDs in my Linux box, each pair in its own RAID 1 array encrypted with LUKS. I have a story of sorts, rather than a concrete question.
For over a year now, I've randomly gotten "hard resetting link" errors in the kernel log from some of my drives. I would RMA the problem drive, and the replacement would make the problem stop. A few months later, I would start seeing the same error again at seemingly random times. The drive would be marked as failed in the RAID array and would no longer show up in fdisk -l. I would reboot the computer, the drive would show up again, and I could re-add it to the array and let it rebuild. Sooner or later the problem would happen again, usually a few hours later.
About six months ago, I replaced two of the traditional HDDs with SSDs in the hope that they wouldn't have nearly as high a failure rate as the traditional drives. However, over the past few days I've started having problems with both one of the new SSDs and one of the traditional drives.
I'm starting to see a pattern emerge. I get a new drive, and a few months later I start having problems with it. I always assumed it was due to HDDs having a high failure rate, but now it's happening with SSDs, so I'm thinking it isn't the drives' fault. What else could be the problem? I've had multiple OSes installed since I started having the problem, so I think I can rule out a software issue. That leaves either the SATA cables or the motherboard. Could the disk encryption be putting too much stress on the drives? Is there anything I can do to get more information? Thanks as always.
Below is the dmesg output of the problem, taken from a question I asked a few months ago when I was having the same problem.
[43161.734107] ata3: ATA_REG 0x41 ERR_REG 0x84
[43161.734110] ata3: tag : dhfis dmafis sdbfis sactive
[43161.734113] ata3: tag 0x0: 1 1 0 1
[43161.734123] ata3.00: exception Emask 0x1 SAct 0x1 SErr 0x180000 action 0x6 frozen
[43161.734127] ata3.00: Ata error. fis:0x21
[43161.734130] ata3: SError: { 10B8B Dispar }
[43161.734134] ata3.00: failed command: READ FPDMA QUEUED
[43161.734142] ata3.00: cmd 60/08:00:a8:03:00/00:00:00:00:00/40 tag 0 ncq 4096 in
[43161.734144] res 41/84:04:a8:03:00/84:00:00:00:00/40 Emask 0x10 (ATA bus error)
[43161.734148] ata3.00: status: { DRDY ERR }
[43161.734150] ata3.00: error: { ICRC ABRT }
[43161.734155] ata3: hard resetting link
[43161.734158] ata3: nv: skipping hardreset on occupied port
[43162.220095] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[43162.260202] ata3.00: model number mismatch 'WDC WD2002FAEX-007BA0' != 'C WD2002FAEX-007BA0 �'
[43162.260206] ata3.00: revalidation failed (errno=-19)
[43162.260211] ata3.00: limiting speed to UDMA/133:PIO2
[43167.220123] ata3: hard resetting link
[43167.220127] ata3: nv: skipping hardreset on occupied port
[43167.710060] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[43167.750228] ata3.00: model number mismatch 'WDC WD2002FAEX-007BA0' != 'C WD2002FAEX-007BA0 �'
[43167.750232] ata3.00: revalidation failed (errno=-19)
[43167.750236] ata3.00: disabled
[43172.710100] ata3: hard resetting link
[43173.620110] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[43173.640455] ata3.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
[43178.620116] ata3: hard resetting link
[43179.530113] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[43179.550748] ata3.00: ATA-8: WDC WD2002FAEX-007BA0, 05.01D05, max UDMA/133
[43179.550753] ata3.00: 3907029168 sectors, multi 16: LBA48 NCQ (depth 31/32)
[43179.570208] ata3.00: model number mismatch 'WDC WD2002FAEX-007BA0' != 'C WD2002FAEX-007BA0 �'
[43179.570213] ata3.00: revalidation failed (errno=-19)
[43179.570220] ata3: limiting SATA link speed to 1.5 Gbps
[43179.570224] ata3.00: limiting speed to UDMA/133:PIO3
[43184.530066] ata3: hard resetting link
[43184.530070] ata3: nv: skipping hardreset on occupied port
[43185.020091] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[43185.060949] ata3.00: configured for UDMA/133
[43185.060969] sd 2:0:0:0: [sdd] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[43185.060974] sd 2:0:0:0: [sdd] Sense Key : Aborted Command [current] [descriptor]
[43185.060980] Descriptor sense data with sense descriptors (in hex):
[43185.060983] 72 0b 47 00 00 00 00 0c 00 0a 80 00 00 00 00 00
[43185.060995] 00 00 03 a8
[43185.061000] sd 2:0:0:0: [sdd] Add. Sense: Scsi parity error
[43185.061006] sd 2:0:0:0: [sdd] CDB: Read(10): 28 00 00 00 03 a8 00 00 08 00
[43185.061017] end_request: I/O error, dev sdd, sector 936
[43185.061023] Buffer I/O error on device sdd, logical block 117
[43185.061044] sd 2:0:0:0: rejecting I/O to offline device
[43185.061048] sd 2:0:0:0: killing request
[43185.061062] ata3: EH complete
[43185.061075] sd 2:0:0:0: rejecting I/O to offline device
[43185.061123] sd 2:0:0:0: rejecting I/O to offline device
[43185.061134] sd 2:0:0:0: rejecting I/O to offline device
[43185.061140] sd 2:0:0:0: rejecting I/O to offline device
[43185.061145] sd 2:0:0:0: [sdd] READ CAPACITY(16) failed
[43185.061147] sd 2:0:0:0: [sdd] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[43185.061152] sd 2:0:0:0: [sdd] Sense not available.
[43185.061155] sd 2:0:0:0: rejecting I/O to offline device
[43185.061166] sd 2:0:0:0: rejecting I/O to offline device
[43185.061175] sd 2:0:0:0: rejecting I/O to offline device
[43185.061185] sd 2:0:0:0: rejecting I/O to offline device
[43185.061193] sd 2:0:0:0: rejecting I/O to offline device
[43185.061198] sd 2:0:0:0: [sdd] READ CAPACITY failed
[43185.061202] sd 2:0:0:0: rejecting I/O to offline device
[43185.061209] sd 2:0:0:0: [sdd] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[43185.061215] sd 2:0:0:0: [sdd] Sense not available.
[43185.061226] sd 2:0:0:0: rejecting I/O to offline device
[43185.061235] sd 2:0:0:0: rejecting I/O to offline device
[43185.061245] sd 2:0:0:0: rejecting I/O to offline device
[43185.061254] sd 2:0:0:0: rejecting I/O to offline device
[43185.061263] sd 2:0:0:0: rejecting I/O to offline device
[43185.061274] sd 2:0:0:0: rejecting I/O to offline device
[43185.061280] sd 2:0:0:0: [sdd] Asking for cache data failed
[43185.061283] sd 2:0:0:0: [sdd] Assuming drive cache: write through
[43185.061289] sdd: detected capacity change from 2000398934016 to 0
[43185.061610] ata3.00: detaching (SCSI 2:0:0:0)
[43185.062444] sd 2:0:0:0: [sdd] Stopping disk
[43249.120042] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[43249.120046] ata4.00: failed command: FLUSH CACHE EXT
[43249.120051] ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[43249.120052] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[43249.120054] ata4.00: status: { DRDY }
[43249.120059] ata4: hard resetting link
[43249.120060] ata4: nv: skipping hardreset on occupied port
[43249.610042] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[43249.650323] ata4.00: configured for UDMA/133
[43249.650326] ata4.00: retrying FLUSH 0xea Emask 0x4
[43249.650452] ata4.00: device reported invalid CHS sector 0
[43249.650458] ata4: EH complete
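One way to get more information out of a log like the one above is to tally link resets and error flags per port, so you can see whether the trouble follows a drive or stays with a port or cable. The following is only a minimal sketch: the function name `summarize_ata_errors` and the regular expressions are my own assumptions, written against the message shapes that appear in this log, and other kernel versions may word things differently. For what it's worth, the log itself labels the failure an "ATA bus error", and ICRC is generally described as an interface CRC error, which tends to point at the link (cable, port, power) rather than at the drive's media.

```python
import re
from collections import defaultdict

# Matches the ata port at the start of a kernel message, e.g. "ata3" or "ata3.00".
PORT_RE = re.compile(r"\]\s*(ata\d+)(?:\.\d+)?:\s*(.*)")
# Matches flag lists such as "SError: { 10B8B Dispar }" or "error: { ICRC ABRT }".
FLAGS_RE = re.compile(r"^(SError|error): \{ ([^}]*)\}")


def summarize_ata_errors(lines):
    """Return {port: {"resets": int, "flags": set of str}} for the given dmesg lines."""
    summary = defaultdict(lambda: {"resets": 0, "flags": set()})
    for line in lines:
        m = PORT_RE.search(line)
        if not m:
            continue  # not an ataN message (e.g. an "sd 2:0:0:0" line)
        port, message = m.groups()
        if message.startswith("hard resetting link"):
            summary[port]["resets"] += 1
        flags = FLAGS_RE.match(message)
        if flags:
            # Collect the individual flag names inside the braces.
            summary[port]["flags"].update(flags.group(2).split())
    return dict(summary)
```

You could call it on a saved copy of the log, for example `summarize_ata_errors(open("dmesg.txt"))`, where `dmesg.txt` is a hypothetical file name. If the same port keeps showing resets after you move the drive elsewhere, that would suggest the port or cable rather than the drive.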
Thanks for the reply. Yes, my question is how can I figure out why all these drives keep failing on me. My mobo has been with me since 2008 when I built this system. I wonder if it's feeling the effects of old age. Four years isn't that old though. I do have three cold cathode lights in my case. I've never heard of those causing problems with cables though. More info on this? I have a few spare SATA cables lying around. I'll swap them out and change the SATA ports on my mobo. I can turn off the cold cathodes as well. – shanet – 2012-08-22T03:04:45.853
I'd really like to avoid decrypting the drives, especially if I have to RMA them in the future. Although, I could take a spare drive, put an unencrypted filesystem on it, and have a cron job write random data to it for a while each day and see what happens. – shanet – 2012-08-22T03:06:58.737
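The spare-drive test described in the comment above could be sketched roughly like this. It is only an illustration under stated assumptions: the function name `write_and_verify` is hypothetical, and the path you pass it would be a file on the unencrypted test filesystem. Note that the read-back may be served from the page cache rather than the disk, so the kernel log remains the main thing to watch while it runs.

```python
import hashlib
import os


def write_and_verify(path, total_bytes, block_size=1024 * 1024):
    """Write random data to path, read it back, and return True if the hashes match."""
    written = hashlib.sha256()
    remaining = total_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            block = os.urandom(min(block_size, remaining))
            f.write(block)
            written.update(block)
            remaining -= len(block)
        # Push the data out of userspace buffers and ask the OS to commit it.
        f.flush()
        os.fsync(f.fileno())

    read_back = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            read_back.update(block)
    return written.hexdigest() == read_back.hexdigest()
```

A cron job could run a small wrapper around this once a day and log the result. If "hard resetting link" errors still appear on the unencrypted drive, that would make the encryption layer an unlikely cause.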
I worked in a shop where we installed cold cathode lights in machines (all the kewl kids did it ;). One day we set one of the egg timers we used next to one of the lights that was on. To say the timer went bat shiat insane crazy would be an understatement. We discovered the lights were throwing off huge amounts of RF. This was causing some of the problems we were seeing. Could be a faulty connection, or a poorly made product, or maybe it's just old... I can say that since then I've never put one in a computer... – Everett – 2012-08-22T03:11:38.137
Interesting. I'm a sucker for case lighting, so I've had these guys in here since 2008 as well, when I built the system. I'll leave them off for a while and see what happens. Thanks for your help, I never would have thought of possible interference from the cold cathodes. – shanet – 2012-08-22T03:20:31.833
Note, I'm not guaranteeing it's them, just saying it's possible, might as well eliminate it. Glad to be of service. – Everett – 2012-08-22T03:21:54.703