I recently set up a 3-drive 4TB MDRAID 5 array for mirroring and as an online backup of our server.

I am preparing for a future hardware (drive) failure and wanted to mitigate a recovery failure from a URE.

Typically I think of the process for rebuilding an array to be:

  1. Remove and replace failed drive.
  2. Rebuild array

From my understanding, in a degraded RAID 5 array you can still access data; but once the failed drive has been replaced and the array is rebuilding, if a URE is detected, the recovery will fail and the data on the array will immediately be rendered unreadable and unrecoverable.

If my understanding is correct, then it does not seem prudent to recover the array until all the (readable) data has been duplicated.

This leaves me with a process of:

  1. Duplicate data from array.
  2. Remove and replace failed drive.
  3. Rebuild array

Is there another process that would mitigate rebuild failures (aside from a second drive failure during rebuild)? Is it safe to rebuild the array without duplicating the data first? Are my assumptions wrong, for example that the rebuild fails on a URE but the data is still available in the degraded state?

Thomas
Damon
  • If the data is precious, don't you already have a backup?! – David Schwartz Aug 07 '17 at 09:31
  • @DavidSchwartz I am unsure of what you mean by "already have a backup". The RAID 5 provides a location for the most recent full backup with its incrementals and for current mirrors of data from the primary server. The primary data is on multiple non-parity RAID arrays. Our most current backups and mirrors used to be on various drives and smaller RAID arrays, and I have just consolidated and migrated that nonsense to a simple RAID 5. – Damon Aug 07 '17 at 13:25
  • So then why duplicate the data when the RAID 5 fails? It's already duplicated. Just start the rebuild so you can get the array back to a sane state as quickly as possible. – David Schwartz Aug 07 '17 at 16:56
  • If the RAID array has data that you care about losing and that data isn't replicated anywhere else, you're doing it very wrong. – David Schwartz Aug 07 '17 at 18:26
  • @DavidSchwartz While a large portion of the data will already have been duplicated to other media or be available from the primary server, there is some data, such as the most recent incrementals and a few other ancillary items, that will only exist in the backups and that it would be preferable not to lose. Admittedly, if the data was lost due to a failed rebuild, it would be data that had previously been deleted by users or versions of current data that are outdated. Sometimes people need to go back, though, so I am trying to mitigate rebuild failure with an array that size on RAID 5. – Damon Aug 07 '17 at 18:26
  • That is not the recommended approach based on cumulative experience of everyone in the industry. The recommendation is to make a copy of data you care about, not try to make the one place you've stored it super reliable. If this is your only copy of data you may want to access, it's *not* a backup. You have no backup. – David Schwartz Aug 07 '17 at 18:28
  • @DavidSchwartz I'm not sure, however, how any of this pertains to the actual question. Simply put, our first layer of backups and active mirrors resides on a RAID 5, and I would like to find a way to mitigate a rebuild failure from a URE or other rebuild failure point. I can appreciate that there are always better ways of doing things overall and that in this case you would choose another way. But this is not an XY problem; keeping backups on a RAID 5 and not wanting to lose data is, to my understanding, still industry-accepted practice if the array size is kept in a reasonable range. – Damon Aug 07 '17 at 18:50

2 Answers

6

You could prepare yourself for a drive failure, and for nearly all other troubles, by implementing the 3-2-1 backup plan. In my personal opinion, 3-2-1 should be in place in every business-critical environment.

Following the 3-2-1 rule will make life easier. It obviously costs money, but the outcome should be worth it.

You could learn more here: https://knowledgebase.starwindsoftware.com/explanation/the-3-2-1-backup-rule/

https://www.veeam.com/blog/the-3-2-1-0-rule-to-high-availability.html

Net Runner
  • 1
    Almost there; safe deposit box with tapes is on the agenda along with a possible VPS as a second offsite for daily use files and private high bandwidth cloud sync. Data loss on this RAID 5 would not cause an issue other than downtime and extended recovery time; otherwise a RAID 5 would not be used. But in an effort to maximize its potential, I would like to make sure I am using "best practices" with the rebuild and recovery of a degraded parity RAID. – Damon Aug 16 '17 at 00:51
0

I've realized that UREs, as they relate to array failures, are more complex and less well understood than most people assume.

The conclusion is that UREs can cause arrays to fail, but not as often as the math in the articles says. Even so, RAID 5 is still a very failure-prone level compared with the other redundant RAID levels.
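
For reference, the math in those articles is usually a naive estimate along the following lines. This is a sketch under stated assumptions, not a prediction for any real array: it assumes a spec-sheet rate of one URE per 10^14 bits read, independent errors, and two surviving 4 TB drives read end to end during the rebuild. None of these figures come from measurements on the array discussed here, and, as noted above, real-world failures appear to be less frequent than this suggests.

```python
import math


def p_ure_during_rebuild(surviving_drives, drive_bytes, bits_per_ure=1e14):
    """Naive chance of at least one URE while reading every surviving drive once.

    Assumes each bit fails independently with probability 1 / bits_per_ure,
    which is the simplification the commonly quoted estimates rely on.
    """
    bits_read = surviving_drives * drive_bytes * 8
    # 1 - (1 - p)^n, written with log1p/expm1 so the tiny p is not rounded away.
    return -math.expm1(bits_read * math.log1p(-1.0 / bits_per_ure))


# Assumed example: 3-drive RAID 5 with 4 TB members, so 2 drives are read in full.
p = p_ure_during_rebuild(surviving_drives=2, drive_bytes=4e12)
print(f"{p:.2f}")  # prints 0.47 under these assumptions
```

Changing `bits_per_ure` to `1e15`, a figure often quoted for enterprise drives, drops the same estimate to roughly 6%, which shows how sensitive the headline number is to a single spec-sheet assumption.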

So, back to basics: what are we mitigating during a RAID 5 rebuild? We are trying to get parity back before a second drive fails. That's it! This is a by-any-means-necessary endeavor.

This leads me to solidify my list:

  1. Temporarily duplicate the data from the array; tape is cheapest if the array is large and HDD space is not available.
  2. Remove and replace failed drive.
  3. Build new array with new drive from scratch.
  4. Reload files to new array from step 1.
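
Because step 3 destroys the old array, the copy made in step 1 is worth verifying before going further. A minimal sketch, assuming the copy is reachable as an ordinary directory tree (a tape would need its own verify pass) and using hypothetical helper names `manifest` and `copy_mismatches`:

```python
import hashlib
from pathlib import Path


def manifest(root):
    """Return {relative path: SHA-256 hex digest} for every file under root."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            h = hashlib.sha256()
            with path.open("rb") as f:
                # Read in 1 MiB chunks so large backup files do not fill memory.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[str(path.relative_to(root))] = h.hexdigest()
    return digests


def copy_mismatches(source_root, copy_root):
    """List files from source_root that are missing or different under copy_root."""
    src = manifest(source_root)
    dst = manifest(copy_root)
    return sorted(name for name, digest in src.items() if dst.get(name) != digest)
```

An empty list from `copy_mismatches("/mnt/array", "/mnt/temp_copy")` (both paths are placeholders) would suggest the copy is complete. Note that this reads the degraded array a second time, so it trades extra read load for confidence in the copy.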

This assumes the array can be taken offline, which is not always the case. Others have likewise found that building a new array from scratch and transferring the data back in one fell swoop is easier and faster than attempting a full rebuild on a large multi-TB array.

Further, I suspect that reading the data off the degraded array sequentially, effectively only once, greatly lowers the chance of a second drive failure before the data is duplicated, compared with a full thrashing rebuild, although the chance is still there.

In the end, it's all about risk management, which varies with the specific circumstances. In my particular case, I can usually find time within a 24-hour window to restore my array, so making a fresh backup, building the array anew, and restoring from that fresh backup works best for me.

Damon