Why does a URE cause loss of array during RAID 5 rebuild?

Question

From what I have read, RAID 5 is problematic with large disks because if a single disk fails, you are likely to hit an unrecoverable read error (URE) while the array is being rebuilt. From what I can gather, this URE prevents the entire array from being rebuilt. Why does an error in a single bit/block/sector cause the whole rebuild to fail?

In terms of worst-case scenarios, I can imagine that if the URE occurred in a "bad" place (e.g., a filesystem superblock) you could lose everything, but do you always lose everything, and if so, why?
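
For a sense of scale on "likely", here is a rough back-of-the-envelope sketch. It assumes a URE rate of 1 per 10^14 bits read (a figure often quoted on consumer drive datasheets, not a measurement) and that errors are independent; the function name and the 4 x 4 TB array are hypothetical, chosen only for illustration.

```python
import math

def p_ure_during_rebuild(bytes_to_read, ure_per_bit=1e-14):
    """Probability of at least one URE while reading bytes_to_read bytes.

    Assumes each bit fails independently with probability ure_per_bit,
    which is a simplification of how real drives behave.
    """
    bits = bytes_to_read * 8
    # 1 - (1 - p)^bits, computed via log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits * math.log1p(-ure_per_bit))

# Hypothetical example: a 4-disk RAID 5 of 4 TB drives loses one disk,
# so the rebuild must read the three surviving drives in full (12 TB).
surviving_bytes = 3 * 4 * 10**12
print(f"{p_ure_during_rebuild(surviving_bytes):.2f}")
```

Under these assumptions the result is roughly 60%, which is where the "likely" comes from. Real drives may do considerably better than the datasheet figure, so treat this as an upper-bound style estimate rather than a prediction.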

David Schwartz · Answer 1 · 2018-10-28T16:57:42.107

When faced with known inconsistent data, you don't run with it and hope for the best. If you reach the conditions your RAID level, by design, cannot tolerate, you stop. This is what your backups are for and precisely the conditions under which you already understand you're going to use them.

RAID is not backup. It's a way to continue running through a certain class of failures.

edited Oct 28 '18 at 16:57

answered Feb 02 '16 at 00:53

But if I am not using RAID and I have a URE, I just restore the affected files from backup. Why does RAID make things worse? – StrongBad Feb 02 '16 at 01:14
@StrongBad It doesn't make things worse. You're describing a process that requires figuring out which files are affected, restoring those specifically, determining whether the medium is still reliable, and so on -- all before you can resume operation. The whole point of RAID is to minimize this time and risk. Doing a full restore/rebuild is almost always faster and safer. – David Schwartz Feb 02 '16 at 14:59
So then is the answer that you do not lose the whole array, but rather it is generally easier to just do a full restore/rebuild? That I can buy, but people seem to suggest that a URE prevents rebuilding the array. – StrongBad Feb 02 '16 at 15:06
@StrongBad If there's a URE, the rebuild fails. I think you're still missing the entire point of RAID, which is to permit you to continue to operate reliably even if a disk fails, that is, to minimize downtime. As soon as you can't operate reliably, that's it, the RAID has failed. You will always have downtime to get back to reliable operation. – David Schwartz Feb 02 '16 at 15:08
Yes, I am missing something, which is why I asked the question. Restoring my entire array from the offsite backup would take a long time (much longer than rebuilding the array). If someone spills water on the array, I am fine with that downtime, but I would prefer to avoid that downtime if a disk goes bad. I guess my SOHO setup is such that doing a full restore is not faster. – StrongBad Feb 02 '16 at 15:20
@StrongBad The reason is that most RAID controllers follow David's point: they stop the rebuild and flag the array as bad, expecting that you will restore from backup. RAID does not make things worse. When a single URE occurs in an otherwise good stripe, the array fixes it automatically and you can keep using it without interruption. The problem arises only when the number of failures exceeds what the RAID level is designed to handle. Personally, I find that a regularly scrubbed array is less likely to hit a URE during a rebuild. – BeowulfNode42 Oct 28 '18 at 13:37
@StrongBad Then you need to choose a RAID level that makes the probability of needing to restore from backup appropriately low. But if you exceed the recovery capability of that choice, you're restoring from backup. RAID 6 with regular scrubbing might be right for you. – David Schwartz Oct 28 '18 at 16:59
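
To make the "exceeds the designed ability" point in the comments above concrete, here is a minimal sketch of single-parity (XOR) reconstruction for one stripe. It is a toy model, not how any particular controller is implemented; the function names and block contents are hypothetical. A block marked `None` stands for either a failed disk or a URE on a surviving disk.

```python
from functools import reduce
from operator import xor

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(xor, column) for column in zip(*blocks))

def reconstruct_stripe(stripe):
    """Return the stripe with its one unreadable block (None) recovered.

    Single parity can solve for exactly one unknown per stripe;
    two or more unreadable blocks leave the stripe unrecoverable.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if not missing:
        return list(stripe)
    if len(missing) > 1:
        raise ValueError("more than one unreadable block in the stripe")
    recovered = list(stripe)
    recovered[missing[0]] = xor_blocks([b for b in stripe if b is not None])
    return recovered

# Hypothetical stripe: three data blocks plus their parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
stripe = data + [xor_blocks(data)]

# Healthy array, one URE: the bad block is rebuilt from the others.
print(reconstruct_stripe([stripe[0], None, stripe[2], stripe[3]]))

# Degraded array (disk 0 gone) plus a URE on disk 1: nothing to solve with.
try:
    reconstruct_stripe([None, None, stripe[2], stripe[3]])
except ValueError as err:
    print("rebuild cannot complete this stripe:", err)
```

In this model only the affected stripe is mathematically unrecoverable. Whether the controller then abandons the whole rebuild or continues past that stripe is a policy decision, which is the behaviour the comments above describe.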
