I'm sorry, but I just can't comprehend this from a theoretical point of view.
Why is it that, on running into a single URE (unrecoverable read error), the RAID controller decides everything else is ruined and just dies? Stupid. A 40 TB array is useless because 1 MB is lost?
Rebuild the whole damn thing, then just do a checksum check on all the files if the filesystem supports it. Even if not, it's just a case of being prompted with "file corrupted" when trying to open those files.
This whole thing just screams stagnant hardware technology to me.
Edit: It seems people just jump straight on the bandwagon of "you shouldn't rely on RAID for backup". Well, I'm not interested in that. Yes, RAID is for availability, not durability. The fact remains that you can still salvage ~99% of the array if the rebuild just skips over the URE.
Is there a question somewhere here, or do you just want to rant? – terdon – 2014-01-11T13:01:15.727
A confirmation of "yes the hardware raid controller manufacturers are just lazy and dumb" would be good? Or a good explanation of why a URE fails a RAID rebuild? I find it really, really hard to believe. I'm hoping I missed something. – Sleeper Smith – 2014-01-11T13:03:07.080
Why a URE fails the rebuild is easy. The controller runs into a situation where it cannot rebuild. RAID 5 has one single spare copy (the parity). After a disk dies it has no redundancy at all. Any error encountered in that state will kill files. You are then left with two options: 1) FAIL in a known broken state. 2) Partially restore and leave a mess. You might get a bit more from that mess, especially in combination with backups, but in most cases 2 would be the wrong choice. – Hennes – 2014-01-11T17:00:40.290
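To make the "no redundancy at all" point concrete, here is a minimal Python sketch of a single XOR-parity stripe. It is only an illustration under simplifying assumptions: one stripe, three data blocks plus one parity block, and `None` standing in for a URE. The function names (`xor_blocks`, `make_stripe`, `rebuild_block`) are made up for this example and do not correspond to any controller's firmware or API.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks plus one parity block (a RAID 5 style stripe)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild_block(stripe, failed_index):
    """Rebuild the block at failed_index from the surviving blocks.

    A surviving block of None models a URE. With one disk already gone,
    a second missing block leaves nothing to XOR against, so the stripe
    cannot be reconstructed and None is returned.
    """
    survivors = [b for i, b in enumerate(stripe) if i != failed_index]
    if any(b is None for b in survivors):
        return None
    return xor_blocks(survivors)


stripe = make_stripe([b"AAAA", b"BBBB", b"CCCC"])

# Disk 1 has died: its block is rebuilt from the other three blocks.
recovered = rebuild_block(stripe, failed_index=1)

# Same disk failure, but disk 2 now returns a URE for this stripe.
damaged = list(stripe)
damaged[2] = None
lost = rebuild_block(damaged, failed_index=1)

print(recovered, lost)
```

In this toy model, option 1 amounts to aborting as soon as `rebuild_block` returns `None`, and option 2 amounts to writing a placeholder for that stripe and carrying on with the rest.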
@Hennes By a mess you mean recovering 99% of the rest of the array with 1 or 2 corrupted files? – Sleeper Smith – 2014-01-12T00:33:27.513
Unknown. And that is part of the problem. Most RAID setups do not know anything about files. They present a block device to the OS (just like a regular disk). And just like a regular disk, that block device is partitioned and a filesystem is used on top of that partition. The filesystem knows about files. The block device does not. It cannot tell whether a block belongs to empty space, a single file, linked files (thus destroying multiple files with one error), or even whether it is part of the directory entries (which could render the whole FS unusable). – Hennes – 2014-01-12T13:47:30.120
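A rough Python sketch of that layering follows. Everything in it is hypothetical: the allocation table, the file names and the block ranges are invented for illustration, and a real filesystem keeps this information in its own on-disk structures. The point is only that the mapping lives in the filesystem, while the RAID layer sees nothing but a block number.

```python
# Hypothetical allocation table. Only the filesystem has this knowledge;
# the RAID layer underneath just sees numbered blocks.
ALLOCATION = {
    "directory entries": range(0, 8),
    "report.doc": range(8, 40),
    "backup.tar (also linked as old.tar)": range(40, 900),
}


def owner_of(block):
    """Return what a block belongs to, according to the filesystem's view."""
    for owner, blocks in ALLOCATION.items():
        if block in blocks:
            return owner
    return "free space"


# The same kind of error at the block level has very different consequences
# depending on where it lands, and the controller cannot tell which it is.
for bad_block in (3, 20, 500, 950):
    print(bad_block, "->", owner_of(bad_block))
```

A bad block in free space costs nothing, one inside a file corrupts that file (and every name linked to it), and one in the directory entries can affect far more than the damaged block itself.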
A single 40TB array is asking for trouble. Lesson learned I guess. – pauska – 2014-01-13T16:03:34.143
@pauska God forbid analogy. Any size of storage in any configuration with no backup = asking for trouble. Nothing to do with size/setup. – Sleeper Smith – 2014-01-13T22:07:31.013