
I am managing a server with two solid state drives configured in mdadm RAID1. The server is running RHEL6 with an ext4 filesystem.

This evening the server went offline shortly after nightly backups began, and the console reported disk errors:

[console screenshot showing the disk errors]

Upon logging into the console, I found that one of the disks had been marked as failed by mdadm and that the filesystem had been remounted read-only.

Is there a way to configure mdadm to fail the drive before the filesystem is remounted read-only? I would much rather run on a single disk for a short time (until a replacement disk can be installed) than have the filesystem immediately forced into read-only mode, which guarantees an outage.

Elliot B.

1 Answer


It does do that by default, but granted, I've had similar issues with this. MD is not particularly eager to fail disks (or, for that matter, to repair sectors by rewriting them, the way hardware RAID controllers do). That's why I set up my log monitoring to scan for 'ata exception' messages and e-mail me when one appears. At least with traditional HDDs, this lets you spot disk failures much sooner.
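As a rough idea, here is a minimal sketch of that kind of log watch, meant to be run from cron every few minutes. Everything in it is an assumption about your setup: the RHEL6 syslog path `/var/log/messages`, a working local `mail` command, and `root` as a placeholder recipient. Adjust the pattern and address to taste.

#!/bin/bash
# Sketch: mail any new 'ata exception' lines that appeared since the last run.
LOG=/var/log/messages                # default syslog location on RHEL6 (assumed)
STATE=/var/tmp/ata_exception.offset  # remembers how far we read last time
MAILTO=root                          # placeholder recipient

OFFSET=$(cat "$STATE" 2>/dev/null || echo 0)
TOTAL=$(wc -l < "$LOG")
[ "$TOTAL" -lt "$OFFSET" ] && OFFSET=0   # log was rotated, start over

NEW=$(tail -n +"$((OFFSET + 1))" "$LOG" | grep -i 'ata.*exception')
echo "$TOTAL" > "$STATE"

[ -n "$NEW" ] && echo "$NEW" | mail -s "ATA exception on $(hostname)" "$MAILTO"

A cron entry like */5 * * * * would run it every five minutes; a dedicated tool (logwatch, a monitoring agent) does the same job more robustly.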

If the file system is marked read-only, the error went higher up the chain, and the MD device also saw errors. Are you sure there were no errors on sdb?
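For reference, these are the standard places to look when checking what MD and the kernel recorded (a quick sketch; `/dev/md0` is an assumed array name, substitute your own):

cat /proc/mdstat                  # '[U_]' instead of '[UU]' means one member dropped out
mdadm --detail /dev/md0           # per-member state: active, faulty, removed, spare
dmesg | grep -iE 'ata|md/raid'    # kernel messages for both disks and the MD layer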

Or, are you sure the drives failed at all? It can happen (it did to me just recently) that the entire PCI bus fails. All devices connected to it started spewing errors (all ATA and Ethernet devices), and indeed the file systems were marked read-only and the MD arrays as failed. But obviously the disks and MD weren't the actual problem.

To check whether the drives actually had errors: I don't have much experience with SMART on SSDs, but at least with HDDs the SMART error log may show something, and you can look at the SMART attributes, perhaps comparing them with the other disk.

If smartmontools is installed, you can do:

smartctl -a /dev/sda
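To compare the two members directly, the SMART error log and the attribute table are the interesting parts (device names are assumptions here; use whatever your array actually contains):

smartctl -l error /dev/sda        # SMART error log only
smartctl -l error /dev/sdb
smartctl -A /dev/sda              # attribute table (reallocated sectors, wear indicators, etc.)
smartctl -A /dev/sdb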

You may also be interested in How do I troubleshoot my RAID array.

Edit: as for the PCI bus theory: since your network access stayed up, it does look like your issue was localized to one disk or its controller.

Halfgaar
  • It's possible there were errors on `/dev/sdb` too, but they weren't on the console screen when I logged in. The system went read-only, so nothing was logged to `/var/log/messages`, and since this is a critically important production server I didn't spend the extra time reading through `dmesg` -- maybe I'll have to do that next time. I think the drives are actually fine, since a server reboot fixed the problem immediately and no errors were recorded in the SMART data. Because this is such an important production system, I'm leaning towards just replacing the entire motherboard right away. – Elliot B. Mar 19 '18 at 17:00
  • Network access did remain online during this outage (some web services were running but returning Internal Server Errors). Do you think that would rule out the entire PCI bus as the cause of failure? The NICs are on-board the mobo. – Elliot B. Mar 19 '18 at 17:27
  • @ElliotB. I edited my answer to clarify your questions. – Halfgaar Mar 19 '18 at 18:48