
I've got a situation on a RAID I recently took over responsibility for, and I could really use some advice. I hope I haven't mucked it up too much.

A couple of servers on the ad-hoc cluster I'm administering started reporting disk problems. I ran fsck on one and xfs_repair on the other. The first seemed to be fixed; the second didn't report problems. Both could be mounted read-write, but gave read errors on certain files.
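In hindsight I probably should have started with non-destructive, read-only checks before letting anything repair the disks; something like this (the device names here are just placeholders):

fsck.ext3 -n /dev/sdX     # ext3: check only, answer "no" to every repair prompt
xfs_repair -n /dev/sdY    # XFS: no-modify mode, only reports what it would have fixed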

I traced the disks back to a single RAID:

  • JetStor 416iS
  • 16 750GB drives
  • Single Volume Group with many data volumes, RAID 6

Looking at the JetStor admin interface:

  • two drives were listed as failed
  • six drives were listed as defect
  • three user volumes were listed as failed (two of which are more important to me than the third)

Here's what I've done:

  1. Remounted all partitions read-only or unmounted them, even though JetStor support said this is not necessary. (The unit is out of warranty, but they answered this one question for me.) A rough sketch of the remount is below, after this list.
  2. Replaced (hot-swap) the two failed drives and let them rebuild.
  3. Replaced (hot-swap) two of the drives labeled 'defect' and let them rebuild. These two drives were associated with the two more important failed user volumes in the JetStor admin panel.
  4. Created a couple of new user volumes to act as larger replacements and as intermediary storage.
  5. Tried remounting the two failed volumes. Now they won't mount at all.
  6. Ran xfs_repair on one of them. It generated errors about bad superblocks, made some repair attempts, and dumped a lot of files into the lost+found directory, but didn't produce the repaired volume I'd been hoping for. I'm going to recover what I can for this volume from backup and reconstruct the rest (it holds the catalog for my backup system, so yikes!).
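For step 1, the remount was just the usual read-only remount, roughly like this (the mount points are placeholders):

mount -o remount,ro /mnt/vol01    # flip an already-mounted filesystem to read-only
umount /mnt/vol02                 # or unmount it entirely where nothing needs it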

So my question is about the second failed user volume (ext3). I haven't tried repairing it yet because of what happened to the XFS volume (i.e. the dump into lost+found). I have a partial backup of this volume covering the most critical files, but it would be great to get all the others back too (the ones that weren't already corrupted). If recovered files do end up dumped into lost+found, that would of course still be a lot better than nothing.

I tried to dd it, but that failed just a few gigs in (it's a 500GB volume):

dd if=/dev/sdf of=/homLocDump/sdfDump.img conv=noerror,sync 

dd: reading `/dev/sdf': Input/output error
15002344+0 records in
15002344+0 records out
7681200128 bytes (7.7 GB) copied, 493.416 seconds, 15.6 MB/s
dd: writing to `/homLocDump/sdfDump.img': Read-only file system
15002344+1 records in
15002344+0 records out
7681200128 bytes (7.7 GB) copied, 493.417 seconds, 15.6 MB/s
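Note the second error: the destination I picked is on a filesystem that is (or went) read-only, so even the blocks that could be read weren't being saved. If I image it again, I'm considering GNU ddrescue pointed at a writable scratch location, something like this (the paths are placeholders):

ddrescue -r3 /dev/sdf /scratch/sdf.img /scratch/sdf.map    # -r3: retry bad sectors three times; the map file lets the copy resume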

fsck shows this:

[root@shank ~]# fsck.ext3 -nv /dev/sdf
e2fsck 1.39 (29-May-2006)
Couldn't find ext2 superblock, trying backup blocks...
fsck.ext3: Bad magic number in super-block while trying to open /dev/sdf

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>

I tried the '-b' option with blocks 8193, 16384 and 32768, and then with further backup-superblock locations for a 4k-block filesystem (I'm assuming it uses a 4k block size like the other devices in this system), but got the same result.
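For reference, a dry run of mke2fs will list where the backup superblocks should sit for a given block size, and e2fsck can be told the block size explicitly; for example:

mke2fs -n -b 4096 /dev/sdf             # -n: don't actually create anything, just print the layout (including backup superblock locations)
e2fsck -n -b 32768 -B 4096 /dev/sdf    # -b: alternate superblock, -B: block size, -n: open read-only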

dumpe2fs:

[root@shank ~]# dumpe2fs /dev/sdf
dumpe2fs 1.39 (29-May-2006)
dumpe2fs: Bad magic number in super-block while trying to open /dev/sdf
Couldn't find valid filesystem superblock.

Can I even still usefully run fsck on this volume? Beyond the superblock issue, I'm also no longer sure it's appropriate to run fsck on RAID volumes at all.

Is it possible to temporarily put the old 'defect' drive back into the RAID to get back to a state where the volume can be mounted, so I can recover some files?

Also, I'm curious how a volume can go bad like this in a RAID; isn't the RAID supposed to protect its integrity? If two drives fail in RAID 6, isn't it supposed to tolerate that?

Michael S
  • You didn't have 2 failed drives. You had 8. Drives listed as defect have returned errors. – longneck Feb 18 '13 at 15:45
  • Yes, thanks. I've held off on replacing the other drives listed 'defect' since, after what I'd already done, the user volumes in question were in even worse shape. Should I go ahead and replace the other defect drives at this point? – Michael S Feb 18 '13 at 15:52
  • No, don't touch anything. Unless your next step is to wipe the entire array and start from scratch. – longneck Feb 18 '13 at 16:02
  • 16 drives in the Jetstor. 8 of them dead or failed. **That is half of the drives!** Just what happened to this unit? And more importantly, will it happen again once/if you replace the drives? (Assuming it is not just from old age, in which case then monitoring will help. If it happened slowly over time fix the procedures for handover of tasks.). – Hennes Feb 18 '13 at 16:05
  • I just got told that 'defect' does not always mean 'totally dead' but that it could mean 'working but with defects on drives'. That makes quite a difference. Still going to stress monitoring things after you replaced the failed drives and after you restored from backup. – Hennes Feb 18 '13 at 16:22
  • Yes, that was my understanding, that 'defect' doesn't mean totally dead. If 8 drives were dead I'd have lost a lot more data from what I understand. – Michael S Feb 18 '13 at 19:02
  • Regarding half the drives being either defect or failed: the system is old, at least 6 years, and I wouldn't be surprised if they're all the original disks. The logs show there have been some errors starting at least a couple of years ago. I'll certainly be monitoring more closely in the future; I'm really just getting up to speed on how this whole sprawling system is set up (3 large RAIDs, ~6 admin servers, and ~30 desktop PCs connected as a kind of cluster). Do you have any particular tips on 'stress monitoring', or do you mean just paying close attention to RAID system logs and S.M.A.R.T. status? – Michael S Feb 18 '13 at 19:06
  • @MichaelS Walk before you run - set up basic status monitoring before you start thinking about performance monitoring. It's more important to prevent this kind of problem from happening again than it is to performance tune your systems. – HopelessN00b Feb 18 '13 at 19:19
  • "Stress monitoring" isn't a thing. Hennes was stressing the importance of proper monitoring. – longneck Feb 18 '13 at 19:19
  • Yes, proper monitoring. Definitely next on my todo list! Thanks everyone. – Michael S Feb 19 '13 at 19:13

2 Answers


I think it's pretty clear your array is essentially failed, and unless you have backups, a good chunk of your data is lost. If you do have backups, replace all the failed drives and restore from backup. If you don't, and your employers think it's worth the money, have a professional data recovery firm try to recover what they can (and for the love of God, stop doing anything with these drives, as you're only making it worse), but this is a rather expensive option.

At this point, the best thing you can do, aside from going to backups and/or having professionals try to recover your data, is to set up monitoring systems and processes so you never end up with a failed array again: replace drives as they fail, rather than after so many have failed that your data can't be recovered.
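Even something minimal is better than nothing. As a rough sketch (the device list is an assumption, and the JetStor's member drives are likely only visible through its own management interface, not to the host):

#!/bin/sh
# Hypothetical nightly cron job: basic SMART health check of locally attached drives.
# (ATA drives report "PASSED"; SCSI drives report "OK".)
for d in /dev/sd[a-z]; do
    smartctl -H "$d" | grep -Eq 'PASSED|OK' || echo "SMART health check not passing on $d"
done

Pair that with whatever alerting the JetStor's own management interface provides, so a failing member drive actually gets noticed.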

I'd also seriously consider a job elsewhere. An environment that's been allowed to decay to that kind of state is a special hell.

HopelessN00b
  • Yes, I'll be setting up monitoring processes. I believe the system was set up pretty well years ago, but in the interim a couple of short-term sysadmins (much like myself, maybe) let some things slack. I can see some scripts for regular monitoring, but they aren't working; I haven't figured that out yet as I find my way around here. Also, before I took over part-time a few months ago, there was no sysadmin for 6-12 months. I'll try to stick with it for a couple of years as I'm finishing up a degree program, and this job has great tuition benefits, as well as otherwise being actually pretty chill. – Michael S Feb 18 '13 at 19:12

At this point, it's pretty clear that your volumes are lost. You now have a decision to make: How badly do you need this data?

  • If you have time and don't mind further data loss, feel free to continue experimenting.

  • If you need it badly, power down the entire array. Mark the drives in their current positions. Also mark the drives you removed during the rebuilds with where they came from. Call a data recovery specialist like OnTrack and arrange to ship the array to them for recovery.

  • If you don't need the data, I suggest it's time to start over from backups. But make sure you replace ALL of the drives that have returned errors. While you're at it, look at the SMART logs for all the drives and replace any that have more errors than the others. You will probably need to delete the existing volumes.
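For the SMART comparison, if you can get at the drives directly (the JetStor may only expose this through its own interface), something along these lines gives a quick per-drive picture; the device names are placeholders:

smartctl -l error /dev/sdX                                   # the drive's own logged errors
smartctl -A /dev/sdX | grep -i -e reallocated -e pending     # remapped and pending-remap sector counts

Anything whose counts stand out from the rest of the set is a candidate for replacement.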

In the long run, I recommend reconfiguring your array. Sixteen drives in a RAID 5 or RAID 6 configuration is too many. I recommend splitting the drives into two groups of eight running RAID 6, with RAID 0 striped across the two groups. The JetStor may do this for you automatically and might call it RAID 60.
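Rough numbers, assuming all sixteen 750GB drives stay in service: a single 16-drive RAID 6 gives 14 × 750GB ≈ 10.5TB usable, while two 8-drive RAID 6 groups striped together (RAID 60) give 2 × 6 × 750GB ≈ 9TB. You give up about 1.5TB, but each group can lose two drives, and a rebuild only has to read the seven other drives in its group instead of fifteen.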

longneck
  • Thanks. I won't be needing all the data – I'm getting the backup catalog recreated and will get whatever I can from there. Thanks for the RAID configuration tip, I'll look into it. I'll soon be setting up another 16-disk RAID for a new cluster we're putting in. – Michael S Feb 18 '13 at 19:07