Recover a RAID disk which apparently has no errors

For several years now we have been running a software RAID1 on an old Gentoo box with a custom Linux 2.6.31 kernel. The RAID consists of 2 hard disks with 4 partitions each. Over the years it has happened about 3-4 times that a disk was thrown out of the array. But each time badblocks didn't report an error and I was able to reactivate the disk like this:

mdadm /dev/md3 -r /dev/sda3
mdadm /dev/md3 -a /dev/sda3
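
For reference, the resync progress after the re-add can then be watched with:

cat /proc/mdstat
mdadm --detail /dev/md3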

This time the situation is different: mdadm reported 2 faulty partitions over the past 24 hours, both on the same disk, sda. Again I ran badblocks without success: it reported 0 bad blocks. If I try to add the faulty disk back to the array, it breaks with the same error each time:

Mar 25 23:09:10 xen0 kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Mar 25 23:09:10 xen0 kernel: ata1.00: irq_stat 0x40000001
Mar 25 23:09:10 xen0 kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Mar 25 23:09:10 xen0 kernel: res 51/04:00:38:df:f7/00:00:00:00:00/a7 Emask 0x1 (device error)
Mar 25 23:09:10 xen0 kernel: ata1.00: status: { DRDY ERR }
Mar 25 23:09:10 xen0 kernel: ata1.00: error: { ABRT }
Mar 25 23:09:10 xen0 kernel: ata1.00: configured for UDMA/133
Mar 25 23:09:10 xen0 kernel: ata1: EH complete
Mar 25 23:09:10 xen0 kernel: end_request: I/O error, dev sda, sector 18297870
Mar 25 23:09:10 xen0 kernel: md: super_written gets error=-5, uptodate=0
Mar 25 23:09:10 xen0 kernel: md: md3: recovery done.
Mar 25 23:09:10 xen0 kernel: RAID1 conf printout:
Mar 25 23:09:10 xen0 kernel: --- wd:1 rd:2
Mar 25 23:09:10 xen0 kernel: disk 0, wo:1, o:0, dev:sda3
Mar 25 23:09:10 xen0 kernel: disk 1, wo:0, o:1, dev:sdb3
Mar 25 23:09:10 xen0 kernel: RAID1 conf printout:
Mar 25 23:09:10 xen0 kernel: --- wd:1 rd:2
Mar 25 23:09:10 xen0 kernel: disk 1, wo:0, o:1, dev:sdb3

The sector is always the same: 18297870. If I check the output of smartctl -a /dev/sda, it doesn't show any reallocated sectors, so I thought I could force reallocation with a method I found here:

hdparm --read-sector 18297870 /dev/sda
hdparm --write-sector 18297870 --yes-i-know-what-i-am-doing /dev/sda

But with the above commands the disk does not report any error at all, and thus does not reallocate the sector:

smartctl -a /dev/sda | grep -i reall
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
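
For completeness, the pending-sector counters can be checked the same way; a sector the drive has flagged as bad but not yet remapped would show up under attributes 197 (Current_Pending_Sector) and 198 (Offline_Uncorrectable):

smartctl -A /dev/sda | grep -i -E 'pending|uncorrect'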

I googled for the status: { DRDY ERR } message and some say this could be a kernel bug. But then I wonder why this only started to happen now. There was no change to the system over the past years.

There's one thing which still bugs me, though: smartctl -a /dev/sda also reports some errors in its logs. It's always the same error:

Error 15 occurred at disk power-on lifetime: 39971 hours (1665 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 00 38 df f7 a7

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ea 00 00 00 00 00 00 08  35d+12:17:45.571  FLUSH CACHE EXT
  61 80 f0 0e 96 2d 00 08  35d+12:17:44.033  WRITE FPDMA QUEUED
  61 80 e8 8e 95 2d 00 08  35d+12:17:44.033  WRITE FPDMA QUEUED
  61 80 e0 0e 95 2d 00 08  35d+12:17:44.033  WRITE FPDMA QUEUED
  61 80 d8 8e 94 2d 00 08  35d+12:17:44.033  WRITE FPDMA QUEUED

None of this makes sense to me: if the disk is really faulty, then why can I successfully read and write to the sector at which the RAID resync fails each time? And why doesn't the disk try to reallocate the broken sector first if it really does detect an error?
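
One more check that might narrow this down is a SMART long self-test, which scans the whole surface from the drive's own firmware rather than through the kernel's I/O path:

smartctl -t long /dev/sda
smartctl -l selftest /dev/sda

The second command shows the result once the test has finished (smartctl prints the expected duration when the test starts).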

Michael Härtl

Posted 2013-03-25T22:32:58.277

Reputation: 191

Answers

This question asks how to recover a drive that a RAID has dropped, and how a drive can periodically drop out of a RAID when no sector problems exist. To begin to understand what might be going on, it helps to know that it is entirely plausible for a drive to enter a long-running error-recovery or calibration routine and become unresponsive long enough that the RAID fails the drive even though it has no media errors.

If the drive was not designed for use in a RAID, even a periodic self-test operation can be enough to get it dropped. This section of the Wikipedia RAID article covers the issue:

https://en.wikipedia.org/wiki/RAID#Integrity

This 49-day bug link describes a notorious instance of TLER-induced RAID faults.

Drives specifically designed for RAID limit the length of time the drive can go offline to perform maintenance/recovery operations.

To avoid issues of this nature when buying a drive for use in a RAID, it is helpful to look for time-limited error recovery (TLER) in the drive specification.
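
On drives that support it, the vendor-neutral version of TLER, called SCT Error Recovery Control, can be queried and set with smartctl. This is a sketch, assuming a smartmontools build new enough to support the scterc option and a drive that implements SCT ERC:

smartctl -l scterc /dev/sda
smartctl -l scterc,70,70 /dev/sda

The two numbers are the read and write recovery limits in units of 100 ms, so 70,70 asks the drive to give up after 7 seconds instead of retrying indefinitely. Note that many drives reset this setting on a power cycle.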

The TLER issue does not explain the difficulty in rebuilding the array. However, consider a situation where a drive sustains a sector failure and the firmware remaps the failed sector to a spare, so that the drive continues to look "perfect". One might wonder whether sector 18297870 was remapped to a spare and accesses to the spare sector take too long for some reason. This does seem a bit far-fetched, but knowing that disk access-time delays can wreak havoc with RAIDs could be key to figuring out what is going on.
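
The complementary approach is to make the kernel more patient instead. As a sketch, assuming a libata/SCSI device node like the sda above, the per-device command timeout (30 seconds by default) can be inspected and raised, so a drive that stalls in internal recovery is not failed outright:

cat /sys/block/sda/device/timeout
echo 180 > /sys/block/sda/device/timeout

The value is in seconds and does not persist across reboots.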

Check for manufacturer-issued firmware updates for the drives. Firmware updates are rarely available for consumer-class drives, but server-class drives often get firmware updates that fix bugs causing operational issues even when no hardware fault has occurred. A web search for the terms "drive firmware bug RAID dropout" produces a lot of pertinent hits; some results identify particular brands and models of drives.
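
To get the exact model and firmware revision to search for, the identity section of the SMART output is sufficient:

smartctl -i /dev/sda

The Device Model and Firmware Version lines are the strings to feed into the manufacturer's support search.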

This link documents one instance of a firmware deficiency that caused RAID failure. Though it is not identical to the problem documented above, the article shows the relevance of firmware as a causative agent in RAID problems. See also the Seagate Barracuda Wikipedia page, which contains many references to drive firmware bugs that caused performance problems.

kbulgrien

Posted 2013-03-25T22:32:58.277

Reputation: 445