RAID 1 mdadm (Linux) disk failure recovery: DRDY err (UNC) keeps repeating, can't reach login

Over the weekend, I got several emails from our network storage server (just a custom box running CentOS 5 with two 2 TB drives in software RAID 1) indicating that SMART had detected issues with one of the drives.
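
To double-check what SMART itself reports for each drive, something like the following works (just a sketch, and it assumes the smartmontools package is installed):

    smartctl -H /dev/sda    # overall PASSED/FAILED health verdict
    smartctl -a /dev/sda    # full attributes, serial number and SMART error log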

I checked the status and two of the RAID partitions were marked as failed:

    [root@aapsan01 ~]# cat /proc/mdstat
    Personalities : [raid1]
    md1 : active raid1 sdb1[1] sda1[0]
          104320 blocks [2/2] [UU]

    md0 : active raid1 sdb3[1] sda3[2](F)
          4064320 blocks [2/1] [_U]

    md3 : active raid1 sdb5[1] sda5[0]
          1928860160 blocks [2/2] [UU]

    md2 : active raid1 sdb2[1] sda2[2](F)
          20482752 blocks [2/1] [_U]

So, I marked all of sda's partitions as "failed," removed all of the sda mirrors successfully, shut down, put in a brand-new identical 2 TB drive, and booted. Now I cannot reach the login prompt because error messages keep repeating once the boot process reaches "md: autodetect raid array". At first the errors were something like:

    DRDY err (UNC) -- exception emask media error

Now I get I/O errors. I tried with the corrupt drive removed and then with it back in again; same show. The write-ups I've found make this look like a simple recovery process, so what gives? Has anyone encountered anything similar? It appears as though the boot process is still continuing, though it's taking eons to get through each step. Has anyone ever had to wait this long to reach the prompt? Hopefully, if I can't get to the prompt, I can get somewhere with the rescue CD.
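
For reference, the fail/remove sequence I ran was along these lines (a sketch, not an exact transcript; device names as in the mdstat output above):

    # Mark the sda members of each array as failed, then remove them
    mdadm --manage /dev/md0 --fail /dev/sda3
    mdadm --manage /dev/md0 --remove /dev/sda3
    mdadm --manage /dev/md2 --fail /dev/sda2
    mdadm --manage /dev/md2 --remove /dev/sda2
    # ...and the same for md1 (/dev/sda1) and md3 (/dev/sda5)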

Flotsam N. Jetsam

Posted 2010-12-06T17:52:34.527

Reputation: 1 291

Isn't it some sdb partitions that have failed? – Linker3000 – 2010-12-06T19:37:26.473

How can you tell from the mdstat output? The email I got from the mdadm daemon said "It could be related to component device /dev/sda3." – Flotsam N. Jetsam – 2010-12-06T21:01:49.377

Look at md2 - it has two partitions in the array, listed in order [sdb2] [sda2], and the status of the pair is listed as [_U], which means that the first partition ([sdb2]) has dropped out of the pairing. Have a read here: http://www.howtoforge.com/replacing_hard_disks_in_a_raid1_array – Linker3000 – 2010-12-06T23:03:08.900

Answers

Look at md2 - it has two partitions in the array, listed in order [sdb2] [sda2], and the status of the pair is listed as [_U], which means that the first partition ([sdb2]) has dropped out of the pairing. Have a read here: http://www.howtoforge.com/replacing_hard_disks_in_a_raid1_array. Hope you get it sorted.
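
For reference, the usual mdadm RAID 1 replacement procedure (along the lines of that write-up) is roughly the following - a sketch only; it assumes the surviving disk is /dev/sdb and the new blank disk is /dev/sda, so check with fdisk -l before running anything:

    # Copy the partition layout from the surviving disk to the new disk
    sfdisk -d /dev/sdb | sfdisk /dev/sda
    # Add the new partitions back into each degraded array; md rebuilds automatically
    mdadm --manage /dev/md0 --add /dev/sda3
    mdadm --manage /dev/md2 --add /dev/sda2
    # Watch the resync progress
    watch cat /proc/mdstat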

Linker3000

Posted 2010-12-06T17:52:34.527

Reputation: 25 670

That is very helpful. I've seen write-ups all over the web on this, but I can't remember anyone stating for certain that the underscore side indicates the bad one. It probably should be intuitive, but I guess I've been in sort of a panic mode and it didn't sink in. Thanks. – Flotsam N. Jetsam – 2010-12-07T15:59:33.120

Following on from Linker3000's answer, the contents of the disk you removed first should still be OK. Remove the disk that you now know is actually the broken one and try starting with the good disk alone. There is a small chance that md marked your healthy disk as being behind when you re-added it with the broken disk present; in that case, you need to boot from a live CD/USB and re-activate your RAID. Once you have your system running OK, you can start again with the normal steps to add a new disk to your RAID 1 arrays.
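
If it does come to re-activating the array from a live environment, the rough idea is something like this (a sketch; the live system may name the good disk differently, so verify with mdadm --examine first):

    # Inspect the RAID superblock on the good disk's partitions
    mdadm --examine /dev/sda2
    # Force-assemble the degraded array from the good member alone
    mdadm --assemble --run --force /dev/md2 /dev/sda2
    # Confirm it came up (degraded, one member missing)
    cat /proc/mdstat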

Joachim Wagner

Posted 2010-12-06T17:52:34.527

Reputation: 144

I'm a dummy. I had misidentified the failing disk and was trying to use the bad one in my recovery effort. For anyone interested, you can use lshal to get the serial number of the bad drive. Redirect lshal's output to a log file and then search it for sda, sdb, or whatever device mdadm or SMART identified as being bad.
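
Something along these lines (a sketch; the log file name is arbitrary and the exact HAL property names can vary between releases):

    # Dump the HAL device tree to a file
    lshal > /tmp/lshal.txt
    # Find the record for the whole disk and pull out its serial number
    grep -A 40 "block.device = '/dev/sda'" /tmp/lshal.txt | grep -i serial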

Flotsam N. Jetsam

Posted 2010-12-06T17:52:34.527

Reputation: 1 291