
Today we hit some kind of worst-case scenario and are open to any good ideas.

Here is our problem:

We are using several dedicated storage servers to host our virtual machines. Before I continue, here are the specs:

  • Dedicated Server Machine
  • Areca 1280ml RAID controller, Firmware 1.49
  • 12x Samsung 1TB HDDs

We configured one RAID6 set with 10 disks that contains one logical volume. We have two hot spares in the system.

Today one HDD failed. This happens from time to time, so we replaced it. During the rebuild a second disk failed. Normally this is no fun. We stopped heavy I/O operations to ensure a stable RAID rebuild.

Sadly the hot-spare disk failed while rebuilding and the whole thing stopped.

Now we have the following situation:

  • The controller says that the RAID set is rebuilding
  • The controller says that the volume failed

It is a RAID 6 system and two disks failed, so the data has to be intact, but we cannot bring the volume online again to access it.

While searching we found the following leads. I don't know whether they are good or bad:

  1. Mirroring all the disks to a second set of drives, so we would have the possibility to try different things without losing more than we already have.

  2. Trying to rebuild the array in R-Studio. But we have no real experience with the software.

  3. Pulling all drives, rebooting the system, going into the Areca controller BIOS, and reinserting the HDDs one by one. Some people say they brought the system back online this way. Some say the effect was zero. Some say they blew the whole thing.

  4. Using undocumented Areca commands like "rescue" or "LeVel2ReScUe".

  5. Contacting a computer forensics service. But whoa... initial estimates over the phone exceeded 20,000 €. That's why we would kindly ask for help. Maybe we are missing the obvious?

And yes, of course we have backups. But some systems lost one week of data, that's why we'd like to get the system up and running again.

Any help, suggestions and questions are more than welcome.

Richard
  • I would argue that whatever you do, your first step should be a `dd` mirror of all disks, just to prevent more damage and to have a fallback plan while working on a real solution. – Sven Mar 14 '12 at 22:36
  • We will do this... – Richard Mar 14 '12 at 23:02
  • What about the hotspares? –  May 18 '12 at 05:43
  • Can you contact the vendor for support? Assuming you cannot (and you have used dd to mirror everything, per @SvenW's excellent suggestion), why not replace the failed drives, reboot, and see what happens? I would not necessarily pull all drives, only the failed ones. But really, your first bet is the vendor; they understand their software. – Jeremy May 23 '12 at 13:48
  • Did you figure out a solution? If so let us know what it was for future reference please! – Grant Jun 05 '12 at 01:56
  • You said you have two hot spares, but mention that you replaced the failed drive initially. So which drive failed, the hot spare that the controller was already trying to rebuild with at the time, or the one you popped in? You then mention that the hot spare failed... so... not sure what the sequence of events is to really help. Sadly I haven't had the best luck in the past with Areca controllers. Also, if you have vendor support, reach out to the vendor ASAP. – Matthew Jul 13 '12 at 01:59
  • I would check the health of all hot spares. I have used an ARC-1231ML for a very long time without any issues. Can you post smartctl -a /dev/sdX listings? One year ago I recovered a RAID5 array with 2 failing drives and no hot spares just by writing one clean sector over the bad sector that was making the drive fail, so there was no data loss at all and finally the whole array was up and running. – Spacedust Sep 05 '12 at 20:15
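
A minimal sketch of how the SMART listings requested in the last comment could be collected. The device names and the Areca slot numbers are assumptions; behind a hardware controller the member disks are usually not visible as /dev/sdX, so the smartmontools Areca passthrough may be needed:

```
# Hedged sketch: dump SMART data for every member disk.
# If the disks are visible to the OS directly:
for d in /dev/sd{a..l}; do
    smartctl -a "$d" > "smart_$(basename "$d").txt"
done

# If they sit behind the Areca controller (slots 1..12 on /dev/sg0 assumed):
for n in $(seq 1 12); do
    smartctl -a -d areca,"$n" /dev/sg0 > "smart_areca_$n.txt"
done
```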

2 Answers


I think Option 1 is your best bet.

Take 12x new HDDs and 1x new RAID controller. Try to mirror (dd if= of=) the old disks 1:1 onto the new ones using any Linux box. Build a new server using the 1x new RAID controller plus the 12x new HDDs.
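
A minimal sketch of the mirroring step, assuming the old and new disks show up under the /dev/sdX names below on the Linux box used for copying (the names are placeholders, verify them with lsblk and smartctl before writing anything, and extend the list to all twelve members):

```
# Hedged sketch: clone every old member disk 1:1 onto a fresh disk.
# Device names are assumptions - a swapped if=/of= destroys the only copy.
while read -r src dst; do
    # conv=noerror,sync keeps going over read errors and pads unreadable
    # sectors with zeros, so block offsets on the copy stay aligned
    dd if="$src" of="$dst" bs=1M conv=noerror,sync
done <<'EOF'
/dev/sda /dev/sdm
/dev/sdb /dev/sdn
EOF
```

For drives that already throw read errors, GNU ddrescue tends to be a better fit than plain dd, since it retries bad areas and keeps a log of what could not be read.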

Try to rebuild the array in the new server. Success? Great. Stop.
Rebuild failed? Mirror the old disks to the new ones again and try Option i+1.

cipy

Unfortunately this is a very common scenario. There was a good Google study on this years ago, and it turns out that losing data with RAID can happen while the array is rebuilding. This affects different RAID levels with different severity. Here is the RAID6 scenario:

  • assume an example array with 3 data disks and 2 parity disks.
  • if you lose one or even two disks, all the data is still recoverable, because the two parities are independent.
  • if a third disk fails, or unreadable sectors turn up, while the array is degraded or rebuilding, you lose data.

Why is that?

Think about the following: for the first stripe of a file you have the three data blocks A1, A2 and A3 plus the two parity blocks Ap and Aq, sitting on hdd1...hdd5.

If you lose any two of those five disks, the surviving blocks plus the remaining parity information are still enough to reconstruct the stripe. Only when a third block becomes unreadable, through a further failure or a bad sector hit during the rebuild, are there more unknowns than parity equations, and then the data is gone.
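
Roughly, and hedging on the exact coefficients (they depend on the controller's Reed-Solomon implementation), the two parities give two independent equations per stripe:

```
% P is the plain XOR parity, Q a Reed-Solomon parity over GF(2^8);
% g_1, g_2, g_3 are distinct non-zero coefficients (implementation-specific).
P = A_1 \oplus A_2 \oplus A_3
Q = g_1 A_1 \oplus g_2 A_2 \oplus g_3 A_3
% If A_1 and A_2 are lost, P and Q form two independent equations in two
% unknowns over GF(2^8), so both blocks can be reconstructed.
% A third missing block means three unknowns and only two equations.
```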

Now the same scenario with 10 disks might look different, but I guess it is handled the same way: the data of each stripe is split into 8 blocks, the parity is written to 2 other drives, and the 2 hot spares sit outside the set. Do you know the details of your RAID controller configuration?

I would start by restoring from the offsite backup (I guess you have one) to get the service back, and in parallel try to recover as much data as possible, for example by using dd on a Unix box to copy the drives into image files and mounting them as loop devices.

http://wiki.edseek.com/guide:mount_loopback
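
A minimal sketch of that approach, with made-up paths and device names (adjust to your environment):

```
# Hedged sketch: image each member disk to a file, then expose the images
# as read-only loop devices so recovery tools cannot modify them.
mkdir -p /recovery
dd if=/dev/sda of=/recovery/disk01.img bs=1M conv=noerror,sync
dd if=/dev/sdb of=/recovery/disk02.img bs=1M conv=noerror,sync
# ...repeat for every member disk

losetup -r -f /recovery/disk01.img   # -r = read-only, -f = first free loop device
losetup -r -f /recovery/disk02.img
losetup -a                           # show which /dev/loopN each image received
```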

You need to know what sort of metadata the RAID controller writes to the disks, and if you are lucky it is supported by a tool like dmraid.
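
Whether dmraid recognises the Areca on-disk format at all is an open question (that is an assumption to verify, not a given for a hardware controller), but checking is cheap and read-only; a sketch:

```
# Hedged sketch: ask dmraid whether it recognises any RAID metadata,
# without activating or writing anything yet.
dmraid -r      # list block devices carrying recognised RAID metadata
dmraid -s      # show the discovered RAID sets and their state
# Only if a complete, consistent set is reported:
dmraid -ay     # activate the sets via device-mapper
```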

But even then there is no guarantee you can recover the data at all: since files are usually spread across many, many blocks, the recovery is still likely to fail to bring back any of your data.

More about RAID:

https://raid.wiki.kernel.org/index.php/RAID_setup

Istvan