Why URE fails raid rebuild and "renders RAID 5 unusable"

I'm sorry but I just can't comprehend from a theoretical point of view.

Why is it that, on running into a single URE, the RAID controller decides everything else is ruined and just dies? Stupid. A 40 TB array is useless because 1 MB is lost?

Rebuild the whole damn thing, then just do a checksum check on all the files if the filesystem supports it. Even if not, it's just a case of being prompted with "file corrupted" when trying to open those files.

This whole thing just screams stagnant hardware technology to me.

Edit- It seems people just jump straight on the bandwagon of "you shouldn't rely on RAID for backup". Well, I'm not interested in that. Yes, RAID is for availability, not durability. The fact remains, you can still salvage ~99% of the RAID if the rebuild just skips over the URE.

raid

Sleeper Smith

Posted 2014-01-11T13:00:03.957

Reputation: 145

5Is there a question somewhere here or do you just want to rant? – terdon – 2014-01-11T13:01:15.727

A confirmation of "yes the hardware raid controller manufacturers are just lazy and dumb" would be good? Or a good explanation of why URE fails raid rebuild? I find it really. really. hard to believe. I'm hoping I missed something. – Sleeper Smith – 2014-01-11T13:03:07.080

3Why URE fails to rebuild is easy. It runs into a situation where it cannot rebuild. RAID 5 has one single spare copy. After a disk dies it has no redundancy at all. Any error encountered in that state will kill files. You are then left with two options: 1) FAIL in a known broken state. 2) Partially restore and leave a mess. You might get a bit more from that mess, especially in combination with backups, but in most cases 2 would be the wrong choice. – Hennes – 2014-01-11T17:00:40.290

1@Hennes By a mess you mean recovering 99% of the rest of the array with 1 or 2 corrupted files? – Sleeper Smith – 2014-01-12T00:33:27.513

2Unknown. And that is part of the problem. Most RAID setups do not know anything about files. They present a block device to the OS (just like a regular disk). And just like a regular disk, that block device is partitioned and a filesystem is used on top of that partition. The filesystem knows about files. The block device does not. It cannot tell if a block belongs to empty space, a single file, linked files (thus destroying multiple files with one error) or even if it is part of the directory entries (which could render the whole FS unusable). – Hennes – 2014-01-12T13:47:30.120

A single 40TB array is asking for trouble. Lesson learned I guess. – pauska – 2014-01-13T16:03:34.143

@pauska God forbid analogy. Any size of storage of any configuration with no back up = asking for trouble. Nothing to do with size/setup. – Sleeper Smith – 2014-01-13T22:07:31.013

Answers

The problem is not lazy manufacturers or ancient technology. It is a misunderstanding of the goal of RAID. ^*1. The goal of RAID is to keep the filesystem usable after a disk dies. Not to replace a backup or guarantee a successful rebuild.

Let me expand on that with a practical example:
You are the IT guy for an office with 100 people. You need to build a fileserver for them.

Now if you used a single disk for that and the disk died, then 100 people would be picking their noses until you replaced the disk and restored the backups. And you would need to back up quite often (e.g. every day).

Now you use RAID. The single disk dies but the array remains available in a degraded state. All files are still accessible and everybody can continue working. At 8 PM ^*2 you run a new set of backups, shut down the server, replace the broken disk and restore the data. Either with a rebuild or from backup. Everybody can continue to work and no data is lost.

Now there are a few assumptions here:

You do have backups. Really, you should have them since RAID will not protect against some things like server theft, lightning, fire, ...
RANT OVER.
A disk rebuild can take a long time when you have large disks. This was fine with old 80MB drives with server qualifications. If you use huge (multi TB) consumer drives it will take a long time. Restoring from backup might be faster. For this reason alone you need to consider making and testing backups when you work with a 40TB array.
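To put rough numbers on "a long time", here is a back-of-the-envelope sketch. It assumes the rebuild is limited only by how fast the replacement disk can be written, and the 100 MB/s sustained rate is a made-up illustrative figure, not a measurement. Real rebuilds on a busy array are usually slower than this lower bound.

```python
TB = 10**12  # decimal terabyte, as drive vendors count
MB = 10**6


def rebuild_hours(disk_bytes: float, bytes_per_second: float) -> float:
    """Lower bound on rebuild time: the whole replacement disk is written once."""
    return disk_bytes / bytes_per_second / 3600


# Hypothetical sustained write rate; pick your own drive's figure.
ASSUMED_RATE = 100 * MB

for size_tb in (1, 4, 8):
    hours = rebuild_hours(size_tb * TB, ASSUMED_RATE)
    print(f"{size_tb} TB disk at 100 MB/s: at least {hours:.1f} h")
```

The time scales linearly with disk size, so the per-disk capacity matters more here than the total size of the array.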

Note that occasionally a sector on a disk will fail. This is a fact of life. It happens rarely and drives have a way to work around this (reallocating sectors, also see TLER). If you have huge disks and you try to rebuild them then you are reading a huge number of sectors. The chances of running into a URE are small but non-zero. If this happens, fall back to backups.
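How small the chance is depends on how much data the rebuild has to read and on the drive's error rate. A minimal sketch, assuming bit errors are independent and using the "1 unrecoverable error per 10^14 (or 10^15) bits read" style of figure found on drive spec sheets. Both the independence and the rates are assumptions here; spec-sheet figures are worst-case bounds, so treat the output as a rough estimate.

```python
import math


def p_at_least_one_ure(bytes_read: float, ure_per_bit: float) -> float:
    """Probability of one or more UREs while reading bytes_read bytes.

    Computes 1 - (1 - p)^bits via log1p/expm1 so tiny p stays accurate.
    """
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-ure_per_bit))


TB = 10**12

# Assumed spec-sheet style rates, not measured values.
for rate in (1e-14, 1e-15):
    for tb_read in (1, 10, 36):
        p = p_at_least_one_ure(tb_read * TB, rate)
        print(f"rate {rate:g}/bit, {tb_read} TB read: P(URE) = {p:.3f}")
```

In a degraded RAID 5 the amount read is the combined size of all surviving disks, which is why large arrays of large disks are the worrying case.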

^*1: RAID as in RAID 1 (mirror), RAID 5, RAID 6, or a combination like RAID 10.

^*2: Or whenever everyone has gone home. An email with "emergency maintenance at 5PM!" would help here.

Hennes

Posted 2014-01-11T13:00:03.957

Reputation: 60 739

4See edit. I don't care about durability. But if I were a sysadmin, my pick is restore with corrupted files and just recover those from backups and have the files recorded in new sectors. Not spend hours rebuilding an array only to be told "oh one sector's missing, go download your 40tb backup". – Sleeper Smith – 2014-01-12T00:24:25.500

2And no, I don't have backups. Just like AWS offers low redundancy storage, there are files that don't matter. But I find it hard to believe that there's a technical limitation in rebuilding the rest of a 40TB array when ONE sector fails. – Sleeper Smith – 2014-01-12T00:27:27.500

No, the RAID manufacturers are not dumb or lazy.

To put it as simply as possible: If you're trying to rebuild data (especially from parity, as in RAID 5 for example), and there's an Unrecoverable Read Error while reading the source you're building from, then it's impossible to properly rebuild the array from that corrupted source.
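For anyone who wants to see the mechanics, here is a toy sketch of a parity rebuild. It is not how any particular controller is implemented: disks are plain Python lists, a `None` entry stands in for a sector that returned a URE, and parity lives on a dedicated disk (RAID 4 style) to keep it short. RAID 5 rotates parity across the disks, but the per-stripe XOR is the same. All names are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_disk(surviving, stripe_count):
    """Rebuild the missing disk stripe by stripe from the surviving disks.

    A None block marks a URE. That stripe has no redundancy left, so its
    missing block cannot be computed and is recorded as None.
    Returns (rebuilt_blocks, unrecoverable_stripe_numbers).
    """
    rebuilt, bad = [], []
    for s in range(stripe_count):
        blocks = [disk[s] for disk in surviving]
        if any(b is None for b in blocks):
            rebuilt.append(None)
            bad.append(s)
        else:
            rebuilt.append(xor_blocks(blocks))
    return rebuilt, bad


# Three data disks with four stripes each, plus a parity disk.
d0 = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
d1 = [b"eeee", b"ffff", b"gggg", b"hhhh"]
d2 = [b"1111", b"2222", b"3333", b"4444"]
parity = [xor_blocks(stripe) for stripe in zip(d0, d1, d2)]

# d1 dies; during the rebuild d2 throws a URE on stripe 2.
d2_with_ure = list(d2)
d2_with_ure[2] = None

rebuilt, bad = rebuild_disk([d0, d2_with_ure, parity], stripe_count=4)
print("unrecoverable stripes:", bad)
print("rebuilt disk:", rebuilt)
```

The sketch only shows the arithmetic: the block in the affected stripe is gone for good, and the block layer has no idea what that block held. Whether to abort at that point or mark the stripe bad and carry on is a policy decision that this code does not model.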

Ƭᴇcʜιᴇ007

Posted 2014-01-11T13:00:03.957

Reputation: 103 763

3Do you understand how RAID 5 works? You just repeated my question. One URE is ONE failed read of ONE sector. At worst it invalidates the corresponding stripe across the rest of the disks. How does that invalidate the entire 100TB array? – Sleeper Smith – 2014-01-12T00:31:55.490

2@SleeperSmith you're asking the wrong question. The right question is "How do you know making the remaining data available won't cause {planes to crash into buildings, the stock market to plummet, some patient in the OR to have the wrong kidney removed, <insert litany of other terrible things here>}?" -- The answer is "You don't know." When presented with data in a known inconsistent state that has unknown consequences if used you declare failure. You don't fake it and hope for the best. – voretaq7 – 2014-01-13T16:21:06.460

1@SleeperSmith Your question has been *VERY THOROUGHLY* answered by two folks who clearly understand the How and Why behind RAID. I'm sorry if you don't understand the answer, or if the answer is not what you want to hear, but that doesn't change the facts. If you dislike the standard behavior of RAID controllers you are of course free to write your own controller firmware (or software RAID implementation) that behaves how you want, and you can re-learn the industry's lessons for yourself empirically... – voretaq7 – 2014-01-13T22:09:16.843

5@voretaq7 Lol, thoroughly answering a question I didn't ask. My question is WHY does URE FAIL the raid TECHNICALLY. TECHNICALLY. I don't need a bloody lesson on durability or availability. – Sleeper Smith – 2014-01-13T22:12:42.850

@SleeperSmith Because you have had more failures than your RAID can tolerate. E.g. RAID-5 can only tolerate a single (1) failure and remain in a consistent, usable, recoverable state. If you lose one drive, and another drive starts throwing UREs, you have had a double (2) failure. 2 > 1. RAID broken. Data corruption at the level RAID cares about is *100% guaranteed* (the block device is toast - sectors are lost). RAID knows not this "file" of which we speak (remember that a RAID array may not even contain a filesystem: I could be writing to the raw block device. Many databases do this.) – voretaq7 – 2014-01-13T22:25:45.690

@voretaq7 "another drive starts throwing UREs" stopped reading right there. Look up what it means first. – Sleeper Smith – 2014-01-13T22:31:00.597

As I said in my edit. I give up. You guys can go back to following whatever white paper the vendors spew out. – Sleeper Smith – 2014-01-13T22:32:51.287

2@SleeperSmith I'm done. As usual trying to explain sound and reasoned data integrity principles to someone who doesn't want to understand them is fruitless, and I've wasted entirely too much of my time trying to educate you. It's your data - do what you want. – voretaq7 – 2014-01-13T22:39:19.410