My story starts out quite simply. I have a light-duty server, running Arch Linux, which stores most of its data on a RAID-1 composed of two SATA drives. It worked without any problems for about 4 months. Then, suddenly, I started getting read errors on one of the drives. The messages always looked a lot like these:
Apr 18 00:20:15 hope kernel: [307085.582035] ata5.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Apr 18 00:20:15 hope kernel: [307085.582040] ata5.01: failed command: READ DMA EXT
Apr 18 00:20:15 hope kernel: [307085.582048] ata5.01: cmd 25/00:08:08:6a:34/00:00:27:00:00/f0 tag 0 dma 4096 in
Apr 18 00:20:15 hope kernel: [307085.582050] res 51/40:00:0c:6a:34/40:00:27:00:00/f0 Emask 0x9 (media error)
Apr 18 00:20:15 hope kernel: [307085.582053] ata5.01: status: { DRDY ERR }
Apr 18 00:20:15 hope kernel: [307085.582056] ata5.01: error: { UNC }
Apr 18 00:20:15 hope kernel: [307085.621301] ata5.00: configured for UDMA/133
Apr 18 00:20:15 hope kernel: [307085.640972] ata5.01: configured for UDMA/133
Apr 18 00:20:15 hope kernel: [307085.640986] sd 4:0:1:0: [sdd] Unhandled sense code
Apr 18 00:20:15 hope kernel: [307085.640989] sd 4:0:1:0: [sdd] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 18 00:20:15 hope kernel: [307085.640993] sd 4:0:1:0: [sdd] Sense Key : Medium Error [current] [descriptor]
Apr 18 00:20:15 hope kernel: [307085.640998] Descriptor sense data with sense descriptors (in hex):
Apr 18 00:20:15 hope kernel: [307085.641001] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Apr 18 00:20:15 hope kernel: [307085.641010] 27 34 6a 0c
Apr 18 00:20:15 hope kernel: [307085.641020] sd 4:0:1:0: [sdd] Add. Sense: Unrecovered read error - auto reallocate failed
Apr 18 00:20:15 hope kernel: [307085.641023] sd 4:0:1:0: [sdd] CDB: Read(10): 28 00 27 34 6a 08 00 00 08 00
Apr 18 00:20:15 hope kernel: [307085.641027] end_request: I/O error, dev sdd, sector 657746444
Apr 18 00:20:15 hope kernel: [307085.641035] ata5: EH complete
Apr 18 00:20:15 hope kernel: [307085.641672] md/raid1:md16: read error corrected (8 sectors at 657744392 on sdd1)
Apr 18 00:20:17 hope kernel: [307087.505082] md/raid1:md16: redirecting sector 657742336 to other mirror: sdd1
Each error complained of a different sector number, and was accompanied by a several-second delay for the user (me) accessing the disk.
I checked the smartctl output and saw the following (irrelevant parts clipped):
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 193 193 051 Pre-fail Always - 1606
5 Reallocated_Sector_Ct 0x0033 194 194 140 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 162 162 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 51
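(For reference, the table above is just the attributes section of the smartctl report; something along these lines produces it, assuming the suspect drive is /dev/sdd as in the kernel messages, so adjust the device name for your own setup.)

# Attribute table only
smartctl -A /dev/sdd
# Full report: identity, health, attributes, error log, self-test log
smartctl -a /dev/sdd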
Looking back through the logs, I found that the errors had actually been happening for a few days, mostly during backups, but also frequently during very light use (meaning about every 5th time I tried to save a text file). I concluded that my disk was dying, that the RAID-1 was handling it appropriately, and that it was time to order a replacement. So I ordered a new disk.
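(For context, a grep along these lines over the kernel log turns those earlier errors up; the log path is only an example and depends on how your syslog daemon is configured.)

# Find earlier I/O errors for the suspect drive; adjust the log path
# to wherever your syslog daemon writes kernel messages
grep -E 'ata5\.01|dev sdd' /var/log/messages.log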
Much to my surprise, a day later, the errors... stopped. I had done nothing to fix them. I hadn't rebooted, hadn't taken the drive offline, nothing. But the errors just stopped.
At that point, curious to see whether the bad sectors were just sitting in idle portions of the disk, I took the disk out of the RAID, put it back in, and let it complete the ensuing full resync. The resync completed without any errors, 9 hours later (2TB disks take a little while).
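(In case the exact steps matter: the remove/re-add cycle went roughly like this, using the device names from the kernel messages above, md16 and sdd1; this is a sketch rather than a transcript of my shell history.)

# Mark the member faulty and pull it out of the mirror
mdadm /dev/md16 --fail /dev/sdd1
mdadm /dev/md16 --remove /dev/sdd1
# Add it back, which kicks off a full resync of the mirror
mdadm /dev/md16 --add /dev/sdd1
# Watch the resync progress
watch cat /proc/mdstat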
The smartctl output has also changed a bit, as follows:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 193 193 051 Pre-fail Always - 1606
5 Reallocated_Sector_Ct 0x0033 194 194 140 Pre-fail Always - 43
196 Reallocated_Event_Count 0x0032 162 162 000 Old_age Always - 38
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
So, the part of this that's weirding me out is, of course, "Since when do bad disks fix themselves?"
I suppose it's possible that a very small area of the drive spontaneously went bad, and that the drive simply took 3 days (!) before its sector reallocation code kicked in and it mapped some spare sectors over a bad area of the disk... But I can't say that I've ever seen that happen.
Has anyone else seen this kind of behavior? If so, what was your experience with the drive afterward? Did it happen again? Did the disk eventually fail completely? Or was it just an unexplained glitch that remained unexplained?
In my case, I already have the replacement drive (obtained under warranty), so I'll probably just replace the drive anyway. But I'd love to know if I misdiagnosed this somehow. If it helps, I have the complete 'smartctl -a' output from when the problem was happening. It's just a bit long, so I didn't post it here.