I have a Linux system set up with 3 software RAID1 devices, each comprising two identical partitions on two identical disks. Recently, one of the non-root partitions on one disk began experiencing DMA errors, so I marked it as failed. When I rebooted the machine, it launched the kernel successfully but began printing DMA errors (presumably associated with the failed partition) almost immediately. Shouldn't marking the problematic partition as failed allow the machine to boot without errors? If not, how can I get the system to boot? I tried modifying the mdadm.conf file in the machine's boot image so that the device list of the affected RAID device no longer included the problematic partition, but that didn't seem to have any effect. I should also note that I can access the degraded RAID device if I boot from a rescue CD and manually assemble it from the remaining good partition.
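For reference, the failing and the manual assembly from the rescue CD looked roughly like this (device and array names here are only examples; my actual layout differs):

```
# Mark the bad member as failed and pull it out of the array
# (md1 and sda3 are illustrative names)
mdadm /dev/md1 --fail /dev/sda3
mdadm /dev/md1 --remove /dev/sda3

# From the rescue CD: assemble and start the degraded array
# from the surviving member only
mdadm --assemble --run /dev/md1 /dev/sdb3
```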
1 Answer
It sounds like you are using Linux Software RAID, and you've got the RAID devices set up using partitions instead of whole disks.
In this case simply failing the partition won't help you: the failing drive (the hardware component) is what's throwing the errors. Any time the operating system tries to access that hardware you'll have problems, and since drives don't typically go bad in just one spot, the problems will spread across all partitions until the drive finally gives up and dies.
My suggestion to you is to back up your data NOW, using the rescue CD (which apparently works per your question), and then replace the failing hardware component, rebuilding your RAID array(s) as appropriate.
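Once the replacement disk is in, the rebuild is usually along these lines (a sketch only; it assumes an MBR partition table and that the new disk shows up as /dev/sda, so adjust to your layout):

```
# Copy the partition layout from the surviving disk to the new one
sfdisk -d /dev/sdb | sfdisk /dev/sda

# Re-add each member partition; the kernel resyncs the mirrors in the background
mdadm /dev/md0 --add /dev/sda1
mdadm /dev/md1 --add /dev/sda2

# Watch the resync progress
cat /proc/mdstat
```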
Long-term, you will want to structure your software RAID the same way you would a hardware RAID (using whole drives, not partitions): if necessary, create the RAID across the physical drives you have, then partition the virtual (RAID) device. This lets you fail a dying drive (the hardware component), or if necessary remove it and let the system boot without it, with a known and well-defined set of side effects, rather than being surprised as a drive's ever-escalating level of failure causes more and more partition-RAIDs to go wonky...
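A rough sketch of that layout on Linux (device names are illustrative, and partitioning the md device itself needs a reasonably recent kernel/mdadm):

```
# Mirror the whole disks, not individual partitions
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb

# Then partition and format the array itself; on recent kernels the
# partitions appear as /dev/md0p1, /dev/md0p2, ...
fdisk /dev/md0
mkfs.ext4 /dev/md0p1
```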
If you are trusting software RAID in production, you should also be running smartd from the smartmontools suite, configured to alert you when drives start to look flaky...
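For example, a minimal /etc/smartd.conf along these lines (the devices, test schedule, and mail address are placeholders) runs regular self-tests and mails you when attributes degrade:

```
# /etc/smartd.conf -- example only
# -a: monitor everything; -o/-S: enable offline testing and attribute autosave
# -s: short self-test daily at 02:00, long test Saturdays at 03:00
# -m: where to send alert mail
/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m admin@example.com
/dev/sdb -a -o on -S on -s (S/../.././02|L/../../6/03) -m admin@example.com
```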
- Right - I am using software RAID. Out of curiosity, is there some way to reconfigure the OS such that it could boot off of the degraded array if I physically remove the faulty drive? – lebedov Oct 04 '11 at 20:28
- @lebedov, isn't it set up that way already? If you had RAID1, then everything should be identical, aside from maybe the bootloader. – Zoredache Oct 04 '11 at 20:36
- A lot of that depends on your server's BIOS or EFI configuration, not the OS. The question @mailq pointed you at (http://serverfault.com/questions/196445/boot-debian-while-raid-array-is-degraded) may have some pointers on the Linux-specific stuff you need to do. – voretaq7 Oct 04 '11 at 20:36
- @Zoredache good point - if the bootloader is only on the failed drive, you're SOL. Which is again why you should mirror the whole disk, including the boot sector. I know that's possible on FreeBSD (GEOM_MIRROR metadata goes at the end of the drive), and I think it's possible on Linux too... – voretaq7 Oct 04 '11 at 20:37
- The problem is that if I remove the bad disk (sda), the undamaged mirror partitions on the good disk (sdb) are seen as sda* by the OS; the system refuses to boot when it can't find the sdb* partitions. – lebedov Oct 04 '11 at 20:39
- @voretaq7: I did run lilo after failing the bad partitions. Shouldn't that ensure that the bootloader is on the good partition? – lebedov Oct 04 '11 at 20:41
- It sure sounds like you have done something weird/wrong when setting up your system. Software RAID normally doesn't care about the symbolic name of the device. Your fstab should be referring to the md* devices. The only thing that should need to be fixed is the bootloader (grub/lilo). Fixing that should be trivial from a livecd. – Zoredache Oct 04 '11 at 20:43
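A minimal illustration of such an fstab (filesystems, mount points, and array numbers are placeholders only):

```
# /etc/fstab should point at the md devices (or their UUIDs),
# never at the underlying sda*/sdb* partitions -- example only
/dev/md0   /       ext4   defaults   0 1
/dev/md1   /home   ext4   defaults   0 2
/dev/md2   none    swap   sw         0 0
```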
- @lebedov A *partition* is not a *drive*: boot loaders live on raw naked disk devices (the BIOS knows not of "partition"; it knows "Stuff in first block of drive. Me grab. Me execute." Yes, it's really that caveman-primitive). In your case you're getting past the "boot" part and being bitten by the fact that your disk IDs are changing because the hardware configuration is different. Best solution: see the answer (Back up. Replace disk. Repair system.) – voretaq7 Oct 04 '11 at 20:44
- @Zoredache: if I remove the bad drive and reboot with a rescue CD, running mdadm --examine /dev/sda1 lists /dev/sdb1 in the partition information. – lebedov Oct 04 '11 at 21:37
- @Zoredache: never mind my last comment - all I needed to do to boot was to update mdadm.conf. – lebedov Oct 04 '11 at 21:55
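For anyone landing here later, regenerating mdadm.conf from the arrays as currently assembled and getting that into the boot image typically looks something like this (paths and initramfs tooling vary by distribution, so treat this as a sketch):

```
# Append ARRAY lines for the currently assembled arrays
# (the file may live at /etc/mdadm/mdadm.conf on some distributions)
mdadm --detail --scan >> /etc/mdadm.conf

# Rebuild the initramfs so the boot image picks up the new mdadm.conf
# (mkinitrd or update-initramfs, depending on distribution)
update-initramfs -u

# Re-run the bootloader if you use lilo
lilo
```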