
Here is my situation.

I have a Dell server with a Dell PERC 7i controller (an LSI controller).

I had a drive give me a Failure Predicted warning, so I called Dell support; they came out and replaced the drive and the array rebuilt itself. Pretty standard.

Two weeks later, another drive gave me the Failure Predicted warning. I figured maybe it was a bad batch of drives, a coincidence, etc., so I contacted support and looked into it more in-depth. It turns out there were bad blocks on one of the other drives that hadn't failed, and those bad blocks were copied over during the rebuild. So now I have bad blocks all over the place and they are slowly killing my array. I have come to find that this is called a punctured array.
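For reference, this is roughly how the per-drive error counters can be read off an LSI/PERC controller with MegaCli. It's only a rough sketch; the binary name and install path vary by system (MegaCli, MegaCli64, etc.), so adjust accordingly:

```python
# Rough sketch: dump per-drive error counters from an LSI/PERC controller
# via MegaCli. The binary name/path is an assumption -- adjust to your install.
# Run as root.
import re
import subprocess

MEGACLI = "/opt/MegaRAID/MegaCli/MegaCli64"  # assumed install path

def drive_error_counts():
    """Return a list of (slot, media_errors, predictive_failures) tuples."""
    out = subprocess.run([MEGACLI, "-PDList", "-aALL"],
                         capture_output=True, text=True, check=True).stdout
    drives, slot, media = [], None, None
    for line in out.splitlines():
        if m := re.match(r"Slot Number:\s*(\d+)", line):
            slot = int(m.group(1))
        elif m := re.match(r"Media Error Count:\s*(\d+)", line):
            media = int(m.group(1))
        elif m := re.match(r"Predictive Failure Count:\s*(\d+)", line):
            drives.append((slot, media, int(m.group(1))))
            slot = media = None
    return drives

for slot, media, pred in drive_error_counts():
    flag = "  <-- suspect" if media or pred else ""
    print(f"slot {slot}: media errors={media}, predictive failures={pred}{flag}")
```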

So their advice was to replace all the drives, rebuild the array, and restore from backup. Except I've been having this issue for a few weeks, which means my backups are bad as well... and if I restore from an older backup (a month ago), I will be missing about four weeks' worth of data from my database, which is totally unacceptable for our office.

My question is: has anyone ever recovered from something like this without losing data, and without the whole "throw it all out the window and start over" approach?

I did find one link that covers my scenario; not sure if it sheds any light on the situation: http://www.theprojectbot.com/raid/what-is-a-punctured-raid-array/

Any help or direction would be appreciated! What do you guys think?

user72593

3 Answers


I assume your system is still up, so the best thing to do is to make an immediate backup, scrap the disks/array, rebuild, and restore from that backup.

Bad blocks don't always mean your backups are also bad. If you haven't experienced any performance problems or damaged files, then your backups should still be complete enough to finish a restore.

To test, take your most recent backup and examine your most important data. If it's still intact, you likely have a good backup.
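One way to make that spot check a bit more systematic: restore the backup to a scratch location and diff checksums of the critical files against the live copies. This is only a rough sketch and every path below is a placeholder; keep in mind that identical hashes only prove the backup matches what is on the (possibly already damaged) array today, so still open the key files in their applications.

```python
# Minimal sketch: read every critical file in a restored backup end-to-end
# and compare its SHA-256 against the live copy. All paths are placeholders.
import hashlib
from pathlib import Path

RESTORE_ROOT = Path("/mnt/restore")   # where the test restore was unpacked
LIVE_ROOT = Path("/srv/data")         # the live copies to compare against
CRITICAL = ["db/app.mdf", "db/app.ldf", "shares/finance.xlsx"]  # placeholders

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:            # a full read also surfaces I/O errors
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

for rel in CRITICAL:
    restored, live = RESTORE_ROOT / rel, LIVE_ROOT / rel
    try:
        match = sha256(restored) == sha256(live)
        print(f"{rel}: {'OK' if match else 'DIFFERS - inspect manually'}")
    except OSError as e:
        print(f"{rel}: read failed ({e})")
```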

At this point, there is a risk involved as you cannot be 100% certain that your backups are good or that backing up now won't cause file loss. However, your array will eventually fail and force a restore anyway, so this is your only real option.

Nathan C
  • I see; right now everything appears to be working fine. So if I'm able to make a complete backup of my system right now, and I replace the drives, rebuild the array, and restore that complete backup, am I risking this failure coming back? Or am I better off reinstalling the OS and software and only restoring the databases to minimize risk? – user72593 May 22 '14 at 16:32
  • Bad blocks typically don't occur at a file level. I'd only do this if you found corrupted files. – Nathan C May 22 '14 at 16:40
  • @NathanC You don't get "bad blocks", you get corrupt data. – JamesRyan May 22 '14 at 16:54
  • @user72593 Just because you are able to backup the files today does not mean that they won't be missing parts. The only way to see what is good or not is to compare it to the backups. – JamesRyan May 22 '14 at 16:55
  • 1
    @JamesRyan The "bad blocks" can be anywhere in the disk, including swap, temp files, or previously used but now unused space. When a drive has bad blocks, it doesn't *always* mean data was lost. – Nathan C May 22 '14 at 17:03
  • @NathanC "The "bad blocks" can be anywhere in the disk," so how can you trust your data without checking it? Also even if they are in unused space they don't just sit there letting you ignore them, they break the array. It has to be wiped and recreated else they come back to cause problems time and again. – JamesRyan May 23 '14 at 09:20
  • @JamesRyan Yes, hence why he should re-create the array. The backups are not block-level (at least, OP hasn't mentioned this) so the restore will simply write to new blocks on a new array. In this case, you *have to* trust it. – Nathan C May 23 '14 at 11:58

Right this instant, do the following:

  • Stop rotating backups or deleting old ones for this system. You want to keep all of the backups you currently have.
  • Take a full backup of the server.

Hopefully the disks are still good enough that your data is intact, and you won't encounter any problems running the new full backup.

Then scrap those disks, and build a new RAID array. Once that's ready, try to restore from the backup you took just now. With any luck, that'll be all you need to do.

If that fails, try the next oldest, and the next oldest, etc. Be sure to test the functionality of the system - just because it boots doesn't mean it's fully operational. In particular, test the databases for corruption.

If you had to restore the entire system from an older backup, that's OK. Take the newest backups, and restore just the database files and other important files. Test them to make sure they work properly. Again, if that fails, try the next oldest.

Using this process minimizes the data loss.
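Here's a rough sketch of that newest-to-oldest fallback loop, assuming the database is something like MySQL; the restore command, check command, and backup IDs below are all placeholders for whatever your backup software and database actually provide:

```python
# Rough sketch of the newest-to-oldest fallback: restore each backup's
# database files, run an integrity check, stop at the first one that passes.
# restore_cmd and check_cmd are placeholders -- substitute whatever your
# backup software and database actually use (e.g. mysqlcheck, DBCC CHECKDB).
import subprocess

BACKUPS = ["2014-05-22", "2014-05-21", "2014-05-20"]  # newest first, placeholders

def restore(backup_id: str) -> bool:
    # Placeholder: invoke your backup tool to restore only the DB files.
    cmd = ["restore-tool", "--backup", backup_id, "--only", "databases"]
    return subprocess.run(cmd).returncode == 0

def db_is_consistent() -> bool:
    # Placeholder: run your database's own consistency check.
    cmd = ["mysqlcheck", "--all-databases", "--check"]
    return subprocess.run(cmd).returncode == 0

for backup_id in BACKUPS:
    print(f"Trying backup {backup_id} ...")
    if restore(backup_id) and db_is_consistent():
        print(f"Backup {backup_id} restored and passed the consistency check.")
        break
else:
    print("No backup passed - time for manual repair/salvage.")
```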

Grant
  • I see, that answers my question. So as long as my backup is intact I'm good; if not, then... I have to deal with it. Thanks. – user72593 May 22 '14 at 16:37

The answers provided by Grant and Nathan C are great with regard to how you should proceed with backups/restores and addressing data integrity.

Here's some clearer detail on how to handle the RAID set when it comes time to recreate the virtual disk and restore from backup:

  • Verify that you have a good backup of the data
  • Delete the existing virtual disk; all disks should show in a "ready" state afterward
  • Recreate a new Virtual Disk; recommended settings: adaptive read-ahead, write-back, and disk caching disabled
  • You should have an online Virtual Disk with a background initialization in progress.
  • Proceed with restoring from backup. Background initialization typically runs at around 600 GB/hr for 7.2K spindles, so if your backup restore can run faster than that, give the init a head start; otherwise your backup software might have issues with write latency when no newly initialized space is immediately available during the restore (rough numbers below).
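To put rough numbers on that head start, here is a quick back-of-the-envelope sketch using the ~600 GB/hr figure above; the array size, restore size, and restore rate are made-up examples:

```python
# Back-of-the-envelope: how long background init needs to stay ahead of the
# restore. Figures other than the ~600 GB/hr init rate are made-up examples.
ARRAY_SIZE_GB = 8000       # usable virtual disk size (example)
INIT_RATE_GB_HR = 600      # ~600 GB/hr for 7.2K spindles, per the answer
RESTORE_RATE_GB_HR = 900   # what your backup software can sustain (example)
RESTORE_SIZE_GB = 5000     # amount of data being restored (example)

init_hours = ARRAY_SIZE_GB / INIT_RATE_GB_HR
restore_hours = RESTORE_SIZE_GB / RESTORE_RATE_GB_HR
print(f"Background init:  ~{init_hours:.1f} h total")
print(f"Backup restore:   ~{restore_hours:.1f} h total")

if RESTORE_RATE_GB_HR > INIT_RATE_GB_HR:
    # Head start so the restore never catches up to the init frontier:
    head_start = (RESTORE_SIZE_GB / RESTORE_RATE_GB_HR
                  * (RESTORE_RATE_GB_HR - INIT_RATE_GB_HR) / INIT_RATE_GB_HR)
    print(f"Give the init roughly {head_start:.1f} h before starting the restore.")
else:
    print("Restore is slower than init - no head start needed.")
```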

Note: If you've been using RAID5, you should SERIOUSLY consider using RAID6 this time. According to current industry-standard best practices, RAID5 is not reliable for business-critical data on an array of this size. Large-capacity SATA/NL-SAS disks also have a higher risk of encountering a URE during rebuilds, which results in a puncture like the one you're dealing with. RAID6 vastly reduces this risk and is generally acceptable for critical data at currently available drive capacities.
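To put a rough number on that URE risk, here is a back-of-the-envelope sketch; the 1-in-10^14-bits URE rate is the commonly quoted datasheet figure (check your drives' actual spec), and the disk count and capacity are made-up examples:

```python
# Back-of-the-envelope: probability of hitting at least one unrecoverable read
# error (URE) while reading the surviving disks during a RAID5 rebuild.
# Assumes the commonly quoted 1-per-1e14-bits URE rate; check your datasheet.
URE_RATE = 1e-14            # errors per bit read (assumed datasheet value)
DISK_TB = 2                 # capacity per disk (example)
SURVIVING_DISKS = 5         # disks that must be read in full to rebuild (example)

bits_read = SURVIVING_DISKS * DISK_TB * 1e12 * 8
p_no_ure = (1 - URE_RATE) ** bits_read
print(f"Bits read during rebuild: {bits_read:.2e}")
print(f"P(at least one URE):      {1 - p_no_ure:.1%}")
# With RAID6, a URE hit during a single-disk rebuild can still be corrected
# from the second parity, which is why it tolerates this scenario far better.
```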

JimNim