
I have a Linux server (CentOS 5.5) that has two identical IDE hard drives. I've used software RAID (mdadm) to create mirrors for each filesystem, so that either hard drive could fail and no data would be lost.

Today one of my hard drives failed. The whole point of RAID should be to allow the system to keep running when this happens; but what happened instead was that the console began spewing the same 4 lines over and over:

hdb: task_out_intr: status=0x61 { DriveReady DeviceFault Error }
hdb: task_out_intr: error=0x04 { DriveStatusError }
ide: failed opcode was: unknown
ide0: reset: success

Due to the high rate of errors being produced, the console was unusable. I was able to SSH in, but the first command I tried just hung. I SSH'ed in again and tried to reboot, but that got hung up as well. Ultimately I had to physically reset the machine.

I know how to remove the failed drive from the MD and replace it, etc. But having the machine lock up and become unusable in this situation seems to defeat the whole point of having RAID mirrors in the first place.

Is this just the way the Linux kernel always behaves in this situation? Or is there some way to configure the kernel so that when a hard drive fails, it rate-limits the errors being produced, and doesn't prevent the machine from being used and cleanly rebooted?
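For what it's worth, the only kernel knobs I've found so far are the console log level and printk rate limiting; I don't know whether either would actually have helped here, so treat these as untested examples:

dmesg -n 3                           # only print KERN_CRIT and worse to the console
kernel.printk = 3 4 1 3              # the same thing, persistently, via /etc/sysctl.conf
kernel.printk_ratelimit = 5          # throttle repeated kernel messages (only applies
kernel.printk_ratelimit_burst = 10   # where the driver uses the rate-limit helpers)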

bjnord

2 Answers


I haven't run into this myself, but since you're using software RAID, it's possible that the hard disk failure is interfering with I/O on the disk controller, which would explain the secondary failures you saw, like commands locking up.

The data should be intact (unless it's corrupted, in which case you have duplicated corruption). If the drive itself failed, you should be able to power down, remove the bad drive, and power back up; hopefully things will come back online with a broken mirror set.

Sounds to me like the nature of the failure isn't sitting well with the controller. Take out the bad drive; it does you no good to keep it in there, and it could be causing more harm.
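If you want to detach it from the arrays cleanly before pulling it, something like this should work (assuming, purely for illustration, a mirror /dev/md0 with the failed member /dev/hdb1; substitute your real device names):

mdadm /dev/md0 --fail /dev/hdb1      # mark the dying member as faulty
mdadm /dev/md0 --remove /dev/hdb1    # remove it from the array
# ...power down, swap the drive, partition it identically, then:
mdadm /dev/md0 --add /dev/hdb1       # re-add; the mirror resyncs in the background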

Bart Silverstrim
  • As stated in my question: I know the drive is bad, and that I need to remove it, and I know how to do that (how to break and re-form the mirror, etc.). My question is about why the kernel spews IDE errors non-stop in this situation, making the system unusable -- and what I might be able to do to make the system configuration more robust if this should happen again someday. – bjnord Sep 22 '10 at 13:38
  • Which is what I was trying to answer: if the drive failed in a way that drives the controller nuts instead of "just dying", then you need to remove it before something happens to the controller. And short of diving into the IDE support source code, I don't know of a quick way to eliminate errors like that. You might have luck looking in the syslog (or klogd) configuration file in /etc to see if emergency-level errors are broadcast to the console. – Bart Silverstrim Sep 22 '10 at 14:39
  • @bjnord, you probably have consumer drives and/or a consumer HD controller, which do NOT handle failure well. It's not Linux's fault, it's the consumer hardware. – Chris S Sep 22 '10 at 15:29
  • @Chris S: You're right, definitely consumer hardware. I hadn't thought about this distinction; past machines I've set up with RAID didn't have this problem when a hard drive died, but those were real servers. – bjnord Sep 23 '10 at 13:55
  • Thanks for the tip to look for emergency-level errors. You were right; there was a line in /etc/syslog.conf that sent *.emerg to * (which presumably is what was sending these to the console; see the example below these comments). I've taken that out, which should at least make the console usable in an emergency. – bjnord Sep 23 '10 at 13:58
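For reference, the relevant rule in the stock CentOS 5 /etc/syslog.conf looks something like this (exact spacing varies); commenting it out keeps emergency-level messages off everyone's terminal:

# Everybody gets emergency messages
#*.emerg                                                *

A gentler alternative is to keep the messages but send them to a file instead of broadcasting them, e.g. *.emerg /var/log/emergencies (a hypothetical log file name).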

When an IDE disk runs into a read error, most of the time it will simply refuse to answer the read command.

Your error message (showing hdb) suggests that both hard drives are on the same cable. That might be the cause of your problem: the failed disk blocked the whole IDE bus, so the Linux kernel has to wait for a timeout and has no chance to access the working disk.
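You can check which drives share a cable; with the legacy IDE driver on a 2.6.18-era kernel, each channel appears under /proc/ide (ide0 is the primary cable, ide1 the secondary), so something like this should show the layout:

ls /proc/ide/ide0 /proc/ide/ide1
# drives listed under the same ideN share a cable; by convention
# hda/hdb are master/slave on ide0, and hdc/hdd on ide1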

Turbo J
  • Just to clarify the details of my setup: I actually have two pairs of identical drives. One pair, hda/hdc, is a mirror for the / filesystem; the other pair, hdb/hdd, is a mirror for a much larger data filesystem. So hdb, the failing drive, is the slave on the primary IDE controller, and hdd, the intact drive, is the slave on the secondary IDE controller -- they are on two different controllers and two different cables. – bjnord Sep 22 '10 at 13:36