I have a Linux server (CentOS 5.5) with two identical IDE hard drives. I've used software RAID (mdadm) to mirror each filesystem across the two drives, so that either drive can fail without any data being lost.
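For reference, each mirror was created with something roughly like this (the device and partition names here are just illustrative):

# create a two-disk RAID1 mirror from matching partitions on each drive
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/hda1 /dev/hdb1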
Today one of the drives failed. The whole point of RAID is to let the system keep running when this happens, but what actually happened was that the console began spewing the same four lines over and over:
hdb: task_out_intr: status=0x61 { DriveReady DeviceFault Error }
hdb: task_out_intr: error=0x04 { DriveStatusError }
ide: failed opcode was: unknown
ide0: reset: success
The errors came so fast that the console was unusable. I was able to SSH in, but the first command I tried just hung. I SSHed in a second time and tried to reboot, but that hung as well. Ultimately I had to physically reset the machine.
I know how to remove the failed drive from the MD and replace it, etc. But having the machine lock up and become unusable in this situation seems to defeat the whole point of having RAID mirrors in the first place.
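To be clear, the replacement itself isn't the problem; the procedure I'd follow is roughly this (device names again just illustrative):

# mark the dying partition as failed and pull it out of the array
mdadm /dev/md0 --fail /dev/hdb1 --remove /dev/hdb1
# after swapping in the new drive, copy the partition layout from the good disk
sfdisk -d /dev/hda | sfdisk /dev/hdb
# add the new partition back and let the mirror resync
mdadm /dev/md0 --add /dev/hdb1

My complaint is about what happens before I ever get the chance to do any of that.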
Is this just the way the Linux kernel always behaves in this situation? Or is there some way to configure the kernel so that when a hard drive fails, it rate-limits the errors being produced, and doesn't prevent the machine from being used and cleanly rebooted?
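The only partial workaround I've found so far is turning down the console log level, which (if I understand it right) would keep the kernel from flooding the console but presumably wouldn't stop commands from hanging:

# only let emergency-level kernel messages reach the console
dmesg -n 1
# or the equivalent via sysctl (the first field is the console loglevel)
sysctl -w kernel.printk="1 4 1 7"

Is there something better than that, or some way to have the failing drive kicked out of the array cleanly instead of taking the whole machine down with it?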