12

I have a RAID 6 array with 16 drives. A few days ago three disks failed and the array was marked as degraded. I cannot access the data and I cannot boot into the operating system. I need access to the data but I cannot do anything. Any advice? How can I recover or access the data? Could I use a live CD to boot an OS? I'm using SAS disks. Thanks in advance

Arturo Castro
  • 129
  • 1
  • 4
  • 22
    RAID 6 has dual parity and can survive two concurrent disk failures. When a third disk fails, the array is gone, and you'd need to restore the data from backups. – Esa Jokinen Oct 26 '19 at 05:23
  • 4
    As nearly everyone has already pointed out, a RAID 6 can withstand only two drive failures. Are the drives actually dead, though? If the controller or cables had temporary issues, for example, there's a good chance the data is a little corrupted but still 99% recoverable. If the drives are actually dead, it's an expensive trip to data recovery specialists. If the drives are okay and the RAID is technically just corrupted, it's a slightly less expensive trip. If the data is important, then you will have backups. – zaTricky Oct 28 '19 at 14:42
  • 6
    Are you sure all three drives failed at the *same time*? This is a critical piece of information. Often what happens in cases like this is that nobody was paying attention when the first two failed because the system kept working. Only when the fatal blow is struck to the third drive does anyone stop to have a look at what happened - then you've got a puncture and it's game over. – J... Oct 28 '19 at 15:05

7 Answers

42

As said before, if more than two disks in a RAID-6 array die, the array is unrecoverable.

However, three simultaneous disk failures are quite unlikely: it might very well be a case of a faulty enclosure, backplane and/or controller.

You should try removing and re-inserting the disks, replacing the controller and/or the enclosure, and even putting the disks in a different server with the same controller (if you have one available).
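
If you can attach the disks to a plain HBA or to another machine, a quick SMART check helps tell genuinely dead drives from an enclosure/controller problem. A minimal sketch, assuming the SAS disks show up as /dev/sdb, /dev/sdc and so on (the device names are placeholders):

    # Ask each suspect disk whether it responds at all and what SMART thinks of it.
    # Device names are placeholders; list your actual SAS devices here.
    for d in /dev/sdb /dev/sdc /dev/sdd; do
        echo "=== $d ==="
        smartctl -i -H "$d"      # identity and overall health verdict
        smartctl -l error "$d"   # drive error log, if the device reports one
    done

Disks that answer normally here point towards the enclosure, cabling or controller rather than the drives themselves.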

Massimo
  • 68,714
  • 56
  • 196
  • 319
  • I did read a study saying that concurrent drive failures are more common than the mathematics predicts under an independent-failure assumption. (Can't seem to find a link right now.) Three concurrent failures (of 16?) does sound like an unlikely event, though. – 0fnt Oct 27 '19 at 05:44
  • 12
    Or what I did with a two disk "failure" on a RAID5--imaged everything and used software to recover the data from it. The drives were actually all fine, the controller wasn't. Or the day I lost 5 of 5 on a RAID5--in dying the controller wrote a block of zeroes to the start of each drive. All the data was intact but the replacement controller wouldn't recognize that the disks were part of an array. Same fix. – Loren Pechtel Oct 27 '19 at 06:36
  • 5
    @0fnt it wouldn't be the maths at fault but the assumptions made when creating a mathematical model to approximate reality. For instance, most of the time all disks are bought at the same time. As you said, if you model faults as purely random, it does not matter, but if you model faults as more probable as disk ages then it makes a big difference. Without even removing the independent failure assumption. – spectras Oct 27 '19 at 08:25
  • 2
    I've seen it recommended to buy RAID disks from different manufacturers to lessen the likelihood of multiple drive failures. Sounds sensible to me, but I am not sure if it's really true. – Almo Oct 27 '19 at 16:12
  • 5
    0fnt - might not be that unlikely if there was a problem in the case - e.g., a failed fan or two causing heat buildup affecting multiple disks – davidbak Oct 27 '19 at 21:52
  • @Almo I've seen that advice before too. I've also seen countervailing advice claiming that heterogeneous disks can cause performance degradation by breaking the controller's assumptions about all drives having identical timing/layout characteristics. I assume such problems would be most likely if the drives differed to the extent of having different numbers of platters. – Dan Is Fiddling By Firelight Oct 27 '19 at 23:55
  • @DanNeely very interesting, thanks for the info! – Almo Oct 28 '19 at 03:05
  • 6
    It is more likely that the array went unchecked for years and 3 hard drives failed over time. My guess is the controller is not bad and it is in fact 3 failed drives. – Joe Oct 28 '19 at 13:34
  • @davidbak I would think so. I didn't critically analyse the study, but my guess is that if done correctly, it would have considered only a span of 1-2 days as concurrent failures. As such, it's very unlikely that two disks that typically last years choose the same day to fail. A problem with the system sounds like the only reasonable answer here. But that makes me wonder: if this were common (and since multi-disk failure is pretty catastrophic), isn't there hardware which houses multiple disks in their own separate cabinets? Like 2 + 2 + 2, or something? – 0fnt Oct 29 '19 at 06:38
  • @Joe: This is one reason I have a small RAID0 partition on one part of my disks in a home server, before the main RAID5 partition. Even if I ignore an email or kernel message about degraded RAID5, I'll get actual I/O errors on `/var/tmp` and notice, and swap the drive out before I lose data! – Peter Cordes Oct 29 '19 at 07:37
  • 1
    In reality, RAID6 can safely tolerate *one* disk failure, not two. The second parity slice is critical during the rebuild since it saves you from the single URE of death. This is also why RAID5 is so fragile these days - one URE during rebuild and you're toast; and the odds of a single URE over a large array are not trivial. – J... Oct 30 '19 at 13:28
19

You don't give any details on the server type, RAID controller type or anything specific.

Try turning everything off for 10 minutes... Remove power from the server. Let the drives spin down.

Power the server back on and see if the RAID controller re-recognizes the drives and is able to boot.

ewwhite
  • 194,921
  • 91
  • 434
  • 799
14

As stated in the comments, RAID 6 can sustain up to two disk failures; if a third disk fails, your array is toast.

The most obvious thing is to restore from backup. If this is not possible and at least one of the failed disks is still readable (albeit with read errors), you can try to make a block-level copy of each failed disk onto another, healthy disk (e.g., via `ddrescue <failed_disk> <new_disk>`) and restart the array using these copies (plus the other good disks).
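
A minimal sketch of that copy step, assuming the failing disk is /dev/sdc, the new disk is /dev/sdd, and /root/sdc.map is used as the ddrescue mapfile (all three names are placeholders):

    # First pass: copy everything that reads cleanly, skipping the slow "scraping" phase.
    # -f is required because the destination is a block device.
    ddrescue -f -n /dev/sdc /dev/sdd /root/sdc.map
    # Second pass: go back and retry the bad areas a few times.
    ddrescue -f -r3 /dev/sdc /dev/sdd /root/sdc.map

The mapfile lets you stop and resume the copy, and records which sectors remained unreadable.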

You will end up with a punctured array where some of the original data may be lost or corrupted; however, with any luck, most of the data should be accessible.

If you have no backup and none of the failed disks are readable, you need to contact a data recovery service.

shodanshok
  • 44,038
  • 6
  • 98
  • 162
7
  1. You probably don't have a software RAID, no matter what the tag says. You cannot boot an OS from a software RAID 6.

  2. 3 disks out of 16 failing together is quite a rare occurrence, except when you drop the server on the floor. It is either 3 disks failing one by one over a long timespan with no one noticing, or a failed controller, failed cable, failed power supply, failed backplane, or a firmware bug kicking in. It is important to determine which case you have, because the recovery strategy is different. There may be BIOS or RAID controller logs accessible.

  3. In either case, you start by backing up every single disk onto other media, using a different, known-good controller. In the process, you will see how many of the disks are actually broken, and how badly.

  4. Most (probably all) hardware RAID controllers are crap. I learned this the hard way. A "disk failed" condition may actually be a single bad sector, and most (or even all) of the data could be recoverable.

  5. A "degraded" array is an array that still has all the data accessible. What you describe is a "failed" or "offline" array, rather than "degraded". If you are not experienced in these matters, call someone who IS.

  6. Starting from a recovery/live CD may or may not be a part of the process. If you don't know how to mount a filesystem in read-only mode, call someone who does (a read-only example is sketched after this list). It is possible to destroy perfectly recoverable data with such a mistake.
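
For point 6, a minimal read-only sketch, assuming a Linux software RAID assembled as /dev/md0 from surviving members /dev/sd[b-q]1 and mounted under /mnt/recovery (all of these names are placeholders; a hardware RAID volume exposed by the controller would be mounted with the same read-only options):

    # Assemble without writing to the member disks, then mount read-only.
    mdadm --assemble --readonly /dev/md0 /dev/sd[b-q]1
    mount -o ro,noload /dev/md0 /mnt/recovery   # 'noload' skips ext3/ext4 journal replay; drop it for other filesystems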


After a lot of sleepless nights I design my servers in such a way that everything stops working when the FIRST disk fails. THIS is the only error message that no one ignores.
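
For Linux software RAID, one way to get an alert the moment the first disk fails is mdadm's monitor mode; a minimal sketch, where the config path and e-mail address are placeholders:

    # /etc/mdadm/mdadm.conf (Debian/Ubuntu path; other distros may use /etc/mdadm.conf)
    MAILADDR admin@example.com

    # Either rely on the distro's mdmonitor service, or run the monitor yourself:
    mdadm --monitor --scan --daemonise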

fraxinus
  • 524
  • 2
  • 5
  • 6
    Sure you can boot from software RAID. Linux machines do it all the time - not sure how they do it, I'd expect them to cheat and place the boot loader with the RAID drivers in a mirrored partition. Linuxers generally prefer software RAID because with hardware RAID, you can't simply plug the disks into other hardware (which may have a different controller, or the same controller but the problem is a firmware bug). – toolforger Oct 28 '19 at 06:50
  • 2
    Linux machines boot pretty well from a software RAID 1 (mirrored) boot partition (Windows machines are no different). Not much complexity there. I can imagine a bootloader that is aware of software RAID 0/4/5/6, but I have yet to see one. And I am not sure it is at all possible to handle a degraded array correctly in the bootloader. – fraxinus Oct 28 '19 at 11:29
  • I still disagree with your point 1. It's quite possible that the system is on a mirrored partition and just data disks are on actual RAID6. I'd apply the "ask somebody who knows the diagnostic procedures to find out whether it's SW or HW RAID", i.e. apply points 5/6 as well. --- Point 2 is debatable. The story that disks from the same manufacturer tend to fail simultaneously under the same environmental and load conditions is plausible enough and pops up regularly enough that I tend to believe it. I agree that proper diagnostics is king. --- Agreeing with your other points BTW. – toolforger Oct 28 '19 at 14:54
  • 5
    It is not only possible, it is a reasonable configuration to have RAID1 for OS and RAID 5/6/10/whatever for data. But OP said RAID6 and nothing more. – fraxinus Oct 28 '19 at 15:35
5

Recover from backup. You won’t see your data on this RAID LUN again.

RiGiD5
  • 837
  • 1
  • 6
  • 10
  • 3
    Also, a data recovery service may be able to get the data back, but at high cost. This answer just isn't correct. – Almo Oct 27 '19 at 19:44
  • 6
    Unless you're looking for something very particular and small - like CC records or super-small files comparable to the typical hardware RAID or mdadm chunk size of 64-256 KB - your chances of recovery are extremely low. TL;DR: @RiGiD5 gave a slightly strict but still 100% correct answer. – BaronSamedi1958 Oct 28 '19 at 09:41
2

RAID 6 can only survive two failed hard drives. If you do not have any backups and need the data, I would recommend hiring a hard drive recovery company. I would not try to recover the data on your own, because the more you work the hard drives, the higher the chance that the data will not be recoverable.

Joe
  • 1,175
  • 1
  • 8
  • 11
0

As a last resort (after trying everything others have already posted as answers here), you could attempt to force one drive back online/not-degraded.

I just had a case where 3 of 6 very old drives in a hardware RAID 6 failed. I was lucky and able to recover some of the data:

  1. removed the 2 failed drives
  2. in the options of my hardware RAID controller, forced the third failed drive back online (not degraded)
  3. put in 2 new drives
  4. rebuilt the array
  5. then removed the last failed drive

I was lucky and lost no relevant data. Of course there is a risk of data corruption/loss with this approach, but the data on the RAID is lost otherwise anyway, so it might be worth a shot if the RAID controller gives you that option.
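
The steps above use a hardware RAID controller's own management options. If the array were Linux software RAID instead (which the question's tag suggests, though fraxinus's answer doubts it), the rough equivalent of forcing a member back online is a forced assembly; a sketch under that assumption, with placeholder device names, ideally run against ddrescue'd copies of the disks rather than the originals:

    # See what state and event count each member records for the array.
    mdadm --examine /dev/sd[b-q]1
    # --force lets mdadm use members whose metadata is slightly out of date;
    # --readonly avoids writing to them while you check the result.
    mdadm --assemble --force --readonly /dev/md0 /dev/sd[b-q]1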

Zauberfisch
  • 120
  • 3