MegaRAID storage manager update and now tons of media errors

Question

I was just poking around a 5-year-old server and noticed that MegaRAID Storage Manager (14.08.01) appeared to be not responding. The server has been running for something like 400 days without rebooting.

I didn't want to reboot it so I installed the new version (17.05.00) and it seemed to go in fine. Immediately upon launching MSM it started to find "Unexpected sense unrecovered read error" on disk 0.

I ordered an express RMA drive from WD and then launched a consistency check. Now I am seeing the same error (but far less frequently) on another drive as well. I have four drives in RAID 10 plus one hot spare. One of the drives has 156 media errors and the other has 10. Am I screwed?

Should I Fail the drive that has the most media errors and try to rebuild?

Is your backup good? Don't overlook one problem: the adapter never flagged the disk, so I would no longer trust the controller. It happened to me in the past on a MegaRAID, and both disks were corrupted in a RAID 1. — yagmoth555, Mar 24 '18 at 23:59
Backup is offsite so it would be somewhat of a pain to recover from it. Do you think I should replace the controller as well? — Kevin Morse, Mar 25 '18 at 00:23

Spooler · Answer 1 · 2018-03-25T00:24:14.910

Check your filesystems after repairing your array, in case there was silent data corruption.

You can lose up to two entire drives in a four-drive RAID 10, as long as they are not both in the same mirrored pair. Depending on which of those drives are failing, you may not be screwed one bit. Check that the two drives belong to different RAID 1 pairs. If they do, you're almost certainly fine. You also have a hot spare, and that should act as a "spillover" space for most controllers - though I'm not certain if your controller will do this because I don't know what it is.
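To make the point about mirror pairs concrete, here is a minimal sketch that models whole-drive failures only. The slot numbers and the pair layout are hypothetical; check your controller's configuration for the real span membership. Media errors on two drives in the same pair are less clear-cut than this model, since data is only lost where both copies of the same stripe are unreadable.

```python
from itertools import combinations

# Hypothetical layout: a four-drive RAID 10 built from two mirrored pairs.
# Slot numbers are illustrative, not read from any real controller.
MIRROR_PAIRS = [(0, 1), (2, 3)]


def array_survives(failed, mirror_pairs=MIRROR_PAIRS):
    """Return True if every mirrored pair keeps at least one healthy member."""
    failed = set(failed)
    return all(not set(pair) <= failed for pair in mirror_pairs)


def survivable_two_drive_failures(mirror_pairs=MIRROR_PAIRS):
    """List every two-drive failure combination the array would survive."""
    drives = sorted(d for pair in mirror_pairs for d in pair)
    return [c for c in combinations(drives, 2) if array_survives(c, mirror_pairs)]


if __name__ == "__main__":
    print(array_survives([0, 2]))           # drives in different pairs
    print(array_survives([0, 1]))           # both members of one pair
    print(survivable_two_drive_failures())
```

With this layout, four of the six possible two-drive failures leave the array intact; the two that do not are the ones that take out both members of a single pair.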

Even if your controller does not use a hot spare as scratch space or emergency space it should still have been doing patrol reads regularly, which may have detected these issues and relocated data areas. Your controller log would be a good place to see if that's happened during at least the last few patrol reads. I've no idea how old these media errors are, though.

Regarding your adapter, if you're not running manufacturer "certified" drives in your controller, your controller won't necessarily be so intelligent about ejecting members when they begin to fail - typically only being able to eject them when they drop out or report a serious SMART failure. However, a drive can have been going bad for quite some time before triggering its overall SMART health report.
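Because the overall SMART health flag can lag behind real trouble, it can help to look at individual counters instead. The sketch below is only an illustration: the attribute names are the ones commonly shown for ATA drives, vendors differ, and the values in the example are made up rather than taken from this server.

```python
# Attribute names commonly reported for ATA drives; vendors vary,
# so treat this list as an assumption rather than a standard.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def early_warnings(raw_values):
    """Return the watched attributes whose raw count is non-zero.

    raw_values maps attribute name -> raw integer value, however you
    collected it (for example, copied by hand from a SMART report).
    """
    return {name: raw_values[name] for name in WATCHED if raw_values.get(name, 0) > 0}


if __name__ == "__main__":
    # Hypothetical values for a drive that still reports overall health as OK.
    sample = {"Reallocated_Sector_Ct": 0, "Current_Pending_Sector": 8}
    print(early_warnings(sample))
```

Any non-zero result here is a reason to watch the drive closely, even while its overall health status still passes.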

Even if it's not fine, perform the rebuild and do a consistency check + filesystem check. You'll also see filesystem I/O errors in dmesg if you've actually been running into filesystem level corruption. Worst case, you'll need to restore some files or the whole array from backup. Do the rebuild one disk at a time, not both. Start with replacing the most ragged disk.

The controller is an LSI SAS3008 running RAID 10. As mentioned, when I logged on to the server, MegaRAID said it was not responding. I reinstalled with a new version and the logs appear to have been completely flushed. The controller was (and still is) set to email on any warnings and I hadn't received any, but I have no idea how long the service was down for. — Kevin Morse, Mar 25 '18 at 00:30
Could have been years. Oh, well. You can also use smartmontools (smartctl) to grab SMART data from your two good drives to make sure they're actually good. Aside from that, it seems like you could be in a relatively good place with only that many media failures. — Spooler, Mar 25 '18 at 00:33
So the consistency check aborted because the bad sector table filled up. The controller failed the drive and then brought the hot spare online, but the rebuild also failed. There is another RAID controller in the server that had free ports, so I threw in some spare disks and started copying the virtual hard drives off. All but one copied successfully; of course it had to be the OS drive of the domain controller... The file system check came back clean on the other five servers, but this one won't even copy off... — Kevin Morse, Mar 26 '18 at 07:19
At least it's a domain controller. You have another replica copy to spin up a new one from, right? — Spooler, Mar 26 '18 at 16:01
