
Improbably, I had two drives fail in the same RAID5 array within two weeks of one another, which means the array is dead. Yes, yes: hot spares, not being lazy about replacing the first failed drive, I know. But let's move past that.

The data is somewhat backed up and not of critical importance, so I am not particularly panicked by this. I would still like to try to salvage what I can anyway.

It is a 4-device Software RAID5 set up with mdadm. The drives are as follows:

/dev/sde - device 0, healthy 
/dev/sdf - device 1, first failure, hard failure, totally dead
/dev/sdg - device 2, second failure, badblocks reports a few bad sectors
/dev/sdc - device 3, healthy

I think you can see where I'm going with this. Given that sdg has only a few bad sectors, I'd like to believe that most of the data is salvageable. When I reassemble the array with

mdadm --create /dev/md0 --assume-clean --level=5 --raid-devices=4 /dev/sde missing /dev/sdg /dev/sdc

I get no complaints and the device assembles and starts just fine in degraded mode. The problem occurs when I try to mount it. As soon as I run

mount -t ext4 /dev/md0 /mnt/raid

the bad blocks are detected, /dev/sdg is failed out of the array, and with only /dev/sde and /dev/sdc still operational the RAID goes inactive and the mount fails.
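For what it's worth, the inactive state is easy to confirm with the usual status commands:

cat /proc/mdstat
mdadm --detail /dev/md0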

Is there some way to prevent mdadm from failing the drive as soon as it detects a bad block? Some debug flag I can set? Something? I realize that some of the data will be corrupt, and some of the reads will fail.

I'm guessing that what I am asking is impossible, although I don't see any theoretical reason it needs to be; the RAID device could just return an I/O error the way the drive itself does. But if the only way around dd failing on a normal hard drive's bad blocks is to use a different program, dd_rescue, I suspect the same will end up being true with mdadm, and I doubt there is any such thing as "mdadm_rescue".

Still, I will ask anyway, and please enlighten me if I am wrong or if you can think of a way to pull some of the data out without the drive instantly crashing out of the array.

cecilkorik
  • This is not as improbable as you'd have thought. Alas, RAID5 is not a reliable solution... If the data are backed up (somewhat??) the best bet is to fix your RAID (getting away from RAID5) and restore. Just my $.02... – Deer Hunter Feb 05 '13 at 08:06
  • If I were you I would get disk images of all the devices in the array, as no doubt they are consumer 7200 rpm drives and the probability nears one that you'll get a further error during the rebuild. – user9517 Feb 05 '13 at 08:28
  • I have nowhere to put the disk images at the moment unfortunately, and you're right they are Seagate Barracudas. As soon as I get the new Seagate Constellations that I ordered (which should be a bit more reliable) I will take some images of the old array. – cecilkorik Feb 05 '13 at 18:41

1 Answer


Offhand, try doing a disk dump of the dying drive to a healthy drive, and then add the healthy drive to the array.
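A minimal sketch of that, assuming the replacement disk shows up as /dev/sdd (a placeholder name, check lsblk first) and using GNU ddrescue so that unreadable sectors are skipped instead of aborting the copy:

# copy everything readable from the dying drive; the mapfile lets you resume and retry later
ddrescue -f /dev/sdg /dev/sdd /root/sdg.map

# then recreate the degraded array with the copy in place of the failing disk, and mount it
mdadm --create /dev/md0 --assume-clean --level=5 --raid-devices=4 /dev/sde missing /dev/sdd /dev/sdc
mount -t ext4 /dev/md0 /mnt/raid

Files whose blocks fell on the unreadable sectors will still come back corrupt, but the copy itself has no failing sectors, so the array should no longer kick the member out mid-mount.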

Stephan
  • This should work, though then you have the mystery of _which files are corrupt?_ Solving that problem will probably take much longer than just restoring from backup. – Michael Hampton Feb 05 '13 at 08:23
  • Heh, I completely agree with you. I have the impression he'd rather not spend the time/effort to do the restore from backup; if the data really isn't that valuable to him, then I suppose it doesn't matter. – Stephan Feb 05 '13 at 18:22
  • I don't have a spare healthy drive of adequate size at the moment (just ordered a bunch) so whenever I get them I might give this a try. Michael Hampton makes a very good point, but at this point I'm more or less resigned to the data being a half-broken mess. I guess it's just my digital-hoarder nature that I would like to try to recover whatever bits and pieces I can from it. I do have a backup, it's just not as recent as I would like. – cecilkorik Feb 05 '13 at 18:39
  • I'm a bit of a digital hoarder too, to be fair, so I'll offer a few thoughts on how I manage it. First: checksumming. Have a script regularly crawl all of your files, creating an md5 hash of each file. If it's a new file, the checksum gets written into a new file in a /checksums directory. If the file name matches a file in /checksums, the two are compared, and if they don't match, it emails you an alert. Do this once every couple of weeks; that way you have a record of what your 'good' data looks like. Second, if the data is valuable, Deer Hunter is right; RAID5 isn't good. Automate backups. :) – Stephan Feb 05 '13 at 18:58
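A rough sketch of that checksum crawl, assuming the data lives under /data and the checksum store under /checksums (both paths and the mail address are placeholders, not anything from the setup above):

# walk the data, keeping one checksum file per path under /checksums
find /data -type f | while read -r f; do
    key=$(printf '%s' "$f" | md5sum | cut -d' ' -f1)   # stable file name derived from the path
    new=$(md5sum "$f" | cut -d' ' -f1)
    if [ -f "/checksums/$key" ]; then
        # known file: compare against the recorded checksum and alert on mismatch
        old=$(cat "/checksums/$key")
        [ "$old" = "$new" ] || echo "checksum mismatch: $f" | mail -s "checksum alert" admin@example.com
    else
        # new file: record its checksum
        echo "$new" > "/checksums/$key"
    fi
done

Run from cron every couple of weeks, that gives you the record of 'good' data the comment describes.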