I've recently taken over responsibility for a RAID and could really use some advice. I hope I haven't mucked it up too much.
A couple of servers on the ad-hoc cluster I'm administering started reporting disk problems. I ran fsck on one and xfs_repair on the other. The first seemed to be fixed, and the second reported no problems. Both could be mounted read-write, but gave read errors on certain files.
I traced the disks back to a single RAID:
- JetStor 416iS
- 16 750GB drives
- Single Volume Group with many data volumes, RAID 6
Looking at the JetStor admin interface:
- two drives were listed as failed
- six drives were listed as defect
- three user volumes were listed as failed (two of them are more important to me than the third)
Here's what I've done:
- Remounted all partitions read-only or unmounted them. (JetStor support said this wasn't necessary; the unit is out of warranty, but they answered this one question for me.)
- Replaced (hot-swap) the two failed drives and let them rebuild.
- Replaced (hot-swap) two of the drives labeled 'defect' and let them rebuild. These two drives were associated with the two more important failed user volumes in the JetStor admin panel.
- Created a couple new user volumes to act as larger replacement volumes and act as intermediary storage.
- Tried remounting the two failed volumes. Now they won't mount at all.
- Running xfs_repair on one of them generated errors about bad superblocks, made some repair attempts, and dumped a lot of files into lost+found, but it didn't fix the corrupted files I'd been hoping to recover. I'll restore what I can for this volume from backup and reconstruct the rest (it holds the catalog for my backup system, so yikes!)
So my question is regarding the second user volume (type ext3). I haven't tried repairing it yet because of what happened to the XFS volume (i.e. the dump into lost+found). I have a partial backup of this volume covering the most critical files, but it'd be great to get all the others back (the ones that weren't already corrupted). If recovered files do get dumped to lost+found, that would still be a lot better than nothing, of course.
I tried to dd it, but that failed just a few gigs in (it's a 500GB volume):
dd if=/dev/sdf of=/homLocDump/sdfDump.img conv=noerror,sync
dd: reading `/dev/sdf': Input/output error
15002344+0 records in
15002344+0 records out
7681200128 bytes (7.7 GB) copied, 493.416 seconds, 15.6 MB/s
dd: writing to `/homLocDump/sdfDump.img': Read-only file system
15002344+1 records in
15002344+0 records out
7681200128 bytes (7.7 GB) copied, 493.417 seconds, 15.6 MB/s
fsck shows this:
[root@shank ~]# fsck.ext3 -nv /dev/sdf
e2fsck 1.39 (29-May-2006)
Couldn't find ext2 superblock, trying backup blocks...
fsck.ext3: Bad magic number in super-block while trying to open /dev/sdf
The superblock could not be read or does not describe a correct ext2
filesystem. If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
e2fsck -b 8193 <device>
I tried the '-b' option with blocks 8193, 16384, and 32768, and then with the backup superblock locations for a 4k-block filesystem (I'm assuming a 4k block size, like the other devices in this system), but got the same error.
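Rather than keep guessing superblock locations, my next idea is to ask mke2fs where it *would* put the backups for a filesystem of this size and block size, using -n so nothing is actually written (the probe file path here is just an example, and a sparse file stands in for the 500GB volume):

```shell
# -n makes mke2fs print the layout it would create without writing anything.
# The sparse file mimics the 500GB volume; adjust -b if the block size differs.
truncate -s 500G /tmp/sb-probe.img
mke2fs -n -F -b 4096 /tmp/sb-probe.img | grep -A1 "Superblock backups"
rm -f /tmp/sb-probe.img
```

Then I'd retry e2fsck with one of the printed block numbers, passing -B as well so it doesn't have to guess the block size, e.g. `e2fsck -b 32768 -B 4096 /dev/sdf`.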
dumpe2fs:
[root@shank ~]# dumpe2fs /dev/sdf
dumpe2fs 1.39 (29-May-2006)
dumpe2fs: Bad magic number in super-block while trying to open /dev/sdf
Couldn't find valid filesystem superblock.
Can I even really try fsck on this volume any more? Beyond the superblock issue, I'm not sure about the appropriateness of running fsck on RAID volumes.
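In case it's relevant for answers: my inclination is to run any further repair attempts only against an image copy, never the raw device again. As far as I can tell, e2fsck can operate directly on a plain image file, which I verified on a scratch filesystem (paths are just examples):

```shell
# Safe demo: e2fsck accepts a filesystem image file directly, so repair
# attempts could go against a dd/ddrescue image instead of raw /dev/sdf.
truncate -s 8M /tmp/fsck-demo.img
mke2fs -q -F -b 4096 /tmp/fsck-demo.img
e2fsck -f -n /tmp/fsck-demo.img
```

With -n it's read-only, so even a check on the image is non-destructive; dropping -n would only ever modify the copy.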
Is it possible to put the old 'defect' drive back into the RAID temporarily, to get to a state where the volume can be mounted so I can recover some files?
Also, I'm curious how a volume can go bad like this in a RAID — shouldn't the RAID protect data integrity? If two drives fail in RAID 6, isn't it supposed to tolerate that?