38

Folks, please help - I am a newb with a major headache at hand (a perfect storm situation).

I have three 1TB HDDs in my Ubuntu 11.04 box configured as software RAID 5. The data had been copied weekly onto a separate external hard drive until that drive failed completely and was thrown away. A few days back we had a power outage, and after rebooting my box wouldn't mount the RAID. In my infinite wisdom I entered

mdadm --create -f...

command instead of

mdadm --assemble

and didn't notice the travesty I had committed until afterwards. It started the array degraded and proceeded to build and sync it, which took ~10 hours. After I was back I saw that the array was successfully up and running, but the RAID itself was not.

I mean the individual drives are partitioned (partition type fd), but the md0 device is not. Realizing in horror what I have done, I am trying to find some solutions. I just pray that --create didn't overwrite the entire contents of the hard drives.

Could someone PLEASE help me out with this - the data on the drives is very important and unique: ~10 years of photos, docs, etc.

Is it possible that specifying the participating hard drives in the wrong order made mdadm overwrite them? When I do

mdadm --examine --scan 

I get something like ARRAY /dev/md/0 metadata=1.2 UUID=f1b4084a:720b5712:6d03b9e9:43afe51b name=<hostname>:0

Interestingly enough, the name used to be 'raid' and not the hostname with :0 appended.

Here are the 'sanitized' config entries:

DEVICE /dev/sdf1 /dev/sde1 /dev/sdd1

CREATE owner=root group=disk mode=0660 auto=yes

HOMEHOST <system>

MAILADDR root


ARRAY /dev/md0 metadata=1.2 name=tanserv:0 UUID=f1b4084a:720b5712:6d03b9e9:43afe51b


Here is the output from mdstat

cat /proc/mdstat 
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid5 sdd1[0] sdf1[3] sde1[1]
1953517568 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

unused devices: <none>


fdisk shows the following:

fdisk -l

Disk /dev/sda: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000bf62e

Device Boot Start End Blocks Id System
/dev/sda1 * 1 9443 75846656 83 Linux
/dev/sda2 9443 9730 2301953 5 Extended
/dev/sda5 9443 9730 2301952 82 Linux swap / Solaris

Disk /dev/sdb: 750.2 GB, 750156374016 bytes
255 heads, 63 sectors/track, 91201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000de8dd

Device Boot Start End Blocks Id System
/dev/sdb1 1 91201 732572001 8e Linux LVM

Disk /dev/sdc: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00056a17

Device Boot Start End Blocks Id System
/dev/sdc1 1 60801 488384001 8e Linux LVM

Disk /dev/sdd: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000ca948

Device Boot Start End Blocks Id System
/dev/sdd1 1 121601 976760001 fd Linux raid autodetect

Disk /dev/dm-0: 1250.3 GB, 1250254913536 bytes
255 heads, 63 sectors/track, 152001 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/dm-0 doesn't contain a valid partition table

Disk /dev/sde: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x93a66687

Device Boot Start End Blocks Id System
/dev/sde1 1 121601 976760001 fd Linux raid autodetect

Disk /dev/sdf: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xe6edc059

Device Boot Start End Blocks Id System
/dev/sdf1 1 121601 976760001 fd Linux raid autodetect

Disk /dev/md0: 2000.4 GB, 2000401989632 bytes
2 heads, 4 sectors/track, 488379392 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 524288 bytes / 1048576 bytes
Disk identifier: 0x00000000

Disk /dev/md0 doesn't contain a valid partition table

Per suggestions, I cleaned up the superblocks and re-created the array with the --assume-clean option, but with no luck at all.

Is there any tool that will help me revive at least some of the data? Can someone tell me what mdadm --create does during the sync that destroys the data, so I can write a tool to undo whatever was done?

After re-creating the RAID I ran fsck.ext4 /dev/md0 and here is the output:

root@tanserv:/etc/mdadm# fsck.ext4 /dev/md0
e2fsck 1.41.14 (22-Dec-2010)
fsck.ext4: Superblock invalid, trying backup blocks...
fsck.ext4: Bad magic number in super-block while trying to open /dev/md0

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>


Per Shane's suggestion I tried

root@tanserv:/home/mushegh# mkfs.ext4 -n /dev/md0
mke2fs 1.41.14 (22-Dec-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=128 blocks, Stripe width=256 blocks
122101760 inodes, 488379392 blocks
24418969 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=0
14905 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks: 
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
    4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
    102400000, 214990848

and ran fsck.ext4 with every backup block, but all returned the following:

root@tanserv:/home/mushegh# fsck.ext4 -b 214990848 /dev/md0
e2fsck 1.41.14 (22-Dec-2010)
fsck.ext4: Invalid argument while trying to open /dev/md0

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>

Any suggestions?

Regards!

Brigadieren
  • 1
    Perhaps one day people may realise why RAID5 is a terrible idea. Until then, 1) people will lose data. 2) We'll get questions like these. – Tom O'Connor Jan 07 '12 at 11:13
  • 12
    @Tom O'Connor ... because clearly, RAID5 is to blame for user error. :rolleyes: – Reality Extractor Jan 07 '12 at 12:04
  • 2
    Hopefully, Shane's answer can save the data, but, again, proof why RAID alone is not best for storage. Need backups too. (but +1 for the question and epic answer that resulted) – tombull89 Jan 08 '12 at 12:23
  • 4
    I know it gets repeated often, but *raid is not a backup solution*. The message really needs driving home. – Sirex Jan 18 '12 at 07:53

5 Answers

96

Ok - something was bugging me about your issue, so I fired up a VM to dive into the behavior that should be expected. I'll get to what was bugging me in a minute; first let me say this:

Back up these drives before attempting anything!!

You may have already done damage beyond what the resync did; can you clarify what you meant when you said:

Per suggestions I did clean up the superblocks and re-created the array with --assume-clean option but with no luck at all.

If you ran a mdadm --misc --zero-superblock, then you should be fine.

Anyway, scavenge up some new disks and grab exact current images of them before doing anything at all that might do any more writing to these disks.

dd if=/dev/sdd of=/path/to/store/sdd.img
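
If you'd rather run all the later experiments against copies instead of the live disks, one approach (just a sketch - the backup paths are placeholders) is to image each member partition and attach the images as loop devices, so anything mdadm does only touches the copies:

for d in sdd1 sde1 sdf1; do
    dd if=/dev/$d of=/path/to/store/$d.img bs=4M conv=noerror,sync
done
losetup --find --show /path/to/store/sdd1.img    # prints e.g. /dev/loop0
losetup --find --show /path/to/store/sde1.img
losetup --find --show /path/to/store/sdf1.img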

That being said.. it looks like data stored on these things is shockingly resilient to wayward resyncs. Read on, there is hope, and this may be the day that I hit the answer length limit.


The Best Case Scenario

I threw together a VM to recreate your scenario. The drives are just 100 MB so I wouldn't be waiting forever on each resync, but this should be a pretty accurate representation otherwise.

Built the array as generically and default as possible - 512k chunks, left-symmetric layout, disks in letter order.. nothing special.

root@test:~# mdadm --create /dev/md0 --chunk=512 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
root@test:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdd1[3] sdc1[1] sdb1[0]
      203776 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

unused devices: <none>

So far, so good; let's make a filesystem, and put some data on it.

root@test:~# mkfs.ext4 /dev/md0
mke2fs 1.41.14 (22-Dec-2010)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
Stride=512 blocks, Stripe width=1024 blocks
51000 inodes, 203776 blocks
10188 blocks (5.00%) reserved for the super user
First data block=1
Maximum filesystem blocks=67371008
25 block groups
8192 blocks per group, 8192 fragments per group
2040 inodes per group
Superblock backups stored on blocks:
        8193, 24577, 40961, 57345, 73729

Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 30 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
root@test:~# mkdir /mnt/raid5
root@test:~# mount /dev/md0 /mnt/raid5
root@test:~# echo "data" > /mnt/raid5/datafile
root@test:~# dd if=/dev/urandom of=/mnt/raid5/randomdata count=10000
10000+0 records in
10000+0 records out
5120000 bytes (5.1 MB) copied, 0.706526 s, 7.2 MB/s
root@test:~# sha1sum /mnt/raid5/randomdata
847685a5d42524e5b1d5484452a649e854b59064  /mnt/raid5/randomdata

Ok. We've got a filesystem and some data ("data" in datafile, and 5MB worth of random data with that SHA1 hash in randomdata) on it; let's see what happens when we do a re-create.

root@test:~# umount /mnt/raid5
root@test:~# mdadm --stop /dev/md0
mdadm: stopped /dev/md0
root@test:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
unused devices: <none>
root@test:~# mdadm --create /dev/md1 --chunk=512 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: /dev/sdb1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 21:07:06 2012
mdadm: /dev/sdc1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 21:07:06 2012
mdadm: /dev/sdd1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 21:07:06 2012
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
root@test:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid5 sdd1[2] sdc1[1] sdb1[0]
      203776 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

unused devices: <none>

The resync finished very quickly with these tiny disks, but it did occur. So here's what was bugging me from earlier: your fdisk -l output. Having no partition table on the md device is not a problem at all; it's expected. Your filesystem resides directly on the fake block device with no partition table.

root@test:~# fdisk -l
...
Disk /dev/md1: 208 MB, 208666624 bytes
2 heads, 4 sectors/track, 50944 cylinders, total 407552 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 524288 bytes / 1048576 bytes
Disk identifier: 0x00000000

Disk /dev/md1 doesn't contain a valid partition table

Yeah, no partition table. But...

root@test:~# fsck.ext4 /dev/md1
e2fsck 1.41.14 (22-Dec-2010)
/dev/md1: clean, 12/51000 files, 12085/203776 blocks

Perfectly valid filesystem, after a resync. So that's good; let's check on our data files:

root@test:~# mount /dev/md1 /mnt/raid5/
root@test:~# cat /mnt/raid5/datafile
data
root@test:~# sha1sum /mnt/raid5/randomdata
847685a5d42524e5b1d5484452a649e854b59064  /mnt/raid5/randomdata

Solid - no data corruption at all! But this is with the exact same settings, so nothing was mapped differently between the two RAID groups. Let's drop this thing down before we try to break it.

root@test:~# umount /mnt/raid5
root@test:~# mdadm --stop /dev/md1

Taking a Step Back

Before we try to break this, let's talk about why it's hard to break. RAID 5 works by using a parity block that protects an area the same size as the block on every other disk in the array. The parity isn't just on one specific disk, it's rotated around the disks evenly to better spread read load out across the disks in normal operation.

The XOR operation to calculate the parity looks like this:

DISK1  DISK2  DISK3  DISK4  PARITY
1      0      1      1    = 1
0      0      1      1    = 0
1      1      1      1    = 0
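
A quick way to convince yourself of that math in the shell - the byte values here are made up, standing in for one stripe's worth of data on three disks:

d1=$(( 0xA5 )); d2=$(( 0x3C )); d3=$(( 0x0F ))      # made-up data bytes
parity=$(( d1 ^ d2 ^ d3 ))                          # what gets written to the parity chunk
printf 'parity  = %02X\n' "$parity"
# Lose disk 2? XOR the survivors with the parity and the missing byte comes back:
printf 'rebuilt = %02X\n' $(( d1 ^ d3 ^ parity ))   # prints 3C again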

So, the parity is spread out among the disks.

DISK1  DISK2  DISK3  DISK4  DISK5
DATA   DATA   DATA   DATA   PARITY
PARITY DATA   DATA   DATA   DATA
DATA   PARITY DATA   DATA   DATA

A resync is typically done when replacing a dead or missing disk; it's also done on mdadm create to assure that the data on the disks aligns with what the RAID's geometry is supposed to look like. In that case, the last disk in the array spec is the one that is 'synced to' - all of the existing data on the other disks is used for the sync.

So, all of the data on the 'new' disk is wiped out and rebuilt; either building fresh data blocks out of parity blocks for what should have been there, or else building fresh parity blocks.

What's cool is that the procedure for both of those things is the exact same: an XOR operation across the data from the rest of the disks. The resync process in this case may have in its layout that a certain block should be a parity block, and think it's building a new parity block, when in fact it's re-creating an old data block. So even if it thinks it's building this:

DISK1  DISK2  DISK3  DISK4  DISK5
PARITY DATA   DATA   DATA   DATA
DATA   PARITY DATA   DATA   DATA
DATA   DATA   PARITY DATA   DATA

...it may just be rebuilding DISK5 from the layout above.

So, it's possible for data to stay consistent even if the array's built wrong.


Throwing a Monkey in the Works

(not a wrench; the whole monkey)

Test 1:

Let's make the array in the wrong order! sdc, then sdd, then sdb..

root@test:~# mdadm --create /dev/md1 --chunk=512 --level=5 --raid-devices=3 /dev/sdc1 /dev/sdd1 /dev/sdb1
mdadm: /dev/sdc1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:06:34 2012
mdadm: /dev/sdd1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:06:34 2012
mdadm: /dev/sdb1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:06:34 2012
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
root@test:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid5 sdb1[3] sdd1[1] sdc1[0]
      203776 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

unused devices: <none>

Ok, that's all well and good. Do we have a filesystem?

root@test:~# fsck.ext4 /dev/md1
e2fsck 1.41.14 (22-Dec-2010)
fsck.ext4: Superblock invalid, trying backup blocks...
fsck.ext4: Bad magic number in super-block while trying to open /dev/md1

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>

Nope! Why is that? Because while the data's all there, it's in the wrong order; what was once 512KB of A, then 512KB of B, A, B, and so forth, has now been shuffled to B, A, B, A. The disk now looks like gibberish to the filesystem checker, so it won't run. The output of mdadm --misc -D /dev/md1 gives us more detail; it looks like this:

Number   Major   Minor   RaidDevice State
   0       8       33        0      active sync   /dev/sdc1
   1       8       49        1      active sync   /dev/sdd1
   3       8       17        2      active sync   /dev/sdb1

When it should look like this:

Number   Major   Minor   RaidDevice State
   0       8       17        0      active sync   /dev/sdb1
   1       8       33        1      active sync   /dev/sdc1
   3       8       49        2      active sync   /dev/sdd1

So, that's all well and good. We overwrote a whole bunch of data blocks with new parity blocks this time out. Re-create, with the right order now:

root@test:~# mdadm --stop /dev/md1
mdadm: stopped /dev/md1
root@test:~# mdadm --create /dev/md1 --chunk=512 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: /dev/sdb1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:11:08 2012
mdadm: /dev/sdc1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:11:08 2012
mdadm: /dev/sdd1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:11:08 2012
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
root@test:~# fsck.ext4 /dev/md1
e2fsck 1.41.14 (22-Dec-2010)
/dev/md1: clean, 12/51000 files, 12085/203776 blocks

Neat, there's still a filesystem there! Still got data?

root@test:~# mount /dev/md1 /mnt/raid5/
root@test:~# cat /mnt/raid5/datafile
data
root@test:~# sha1sum /mnt/raid5/randomdata
847685a5d42524e5b1d5484452a649e854b59064  /mnt/raid5/randomdata

Success!

Test 2

Ok, let's change the chunk size and see if that gets us some brokenness.

root@test:~# umount /mnt/raid5
root@test:~# mdadm --stop /dev/md1
mdadm: stopped /dev/md1
root@test:~# mdadm --create /dev/md1 --chunk=64 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: /dev/sdb1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:21:19 2012
mdadm: /dev/sdc1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:21:19 2012
mdadm: /dev/sdd1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:21:19 2012
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
root@test:~# fsck.ext4 /dev/md1
e2fsck 1.41.14 (22-Dec-2010)
fsck.ext4: Superblock invalid, trying backup blocks...
fsck.ext4: Bad magic number in super-block while trying to open /dev/md1

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>

Yeah, yeah, it's hosed when set up like this. But, can we recover?

root@test:~# mdadm --stop /dev/md1
mdadm: stopped /dev/md1
root@test:~# mdadm --create /dev/md1 --chunk=512 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: /dev/sdb1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:21:51 2012
mdadm: /dev/sdc1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:21:51 2012
mdadm: /dev/sdd1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:21:51 2012
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
root@test:~# fsck.ext4 /dev/md1
e2fsck 1.41.14 (22-Dec-2010)
/dev/md1: clean, 12/51000 files, 12085/203776 blocks
root@test:~# mount /dev/md1 /mnt/raid5/
root@test:~# cat /mnt/raid5/datafile
data
root@test:~# sha1sum /mnt/raid5/randomdata
847685a5d42524e5b1d5484452a649e854b59064  /mnt/raid5/randomdata

Success, again!

Test 3

This is the one that I thought would kill data for sure - let's do a different layout algorithm!

root@test:~# umount /mnt/raid5
root@test:~# mdadm --stop /dev/md1
mdadm: stopped /dev/md1
root@test:~# mdadm --create /dev/md1 --chunk=512 --level=5 --layout=right-asymmetric --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: /dev/sdb1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:32:34 2012
mdadm: /dev/sdc1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:32:34 2012
mdadm: /dev/sdd1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:32:34 2012
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
root@test:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid5 sdd1[3] sdc1[1] sdb1[0]
      203776 blocks super 1.2 level 5, 512k chunk, algorithm 1 [3/3] [UUU]

unused devices: <none>
root@test:~# fsck.ext4 /dev/md1
e2fsck 1.41.14 (22-Dec-2010)
fsck.ext4: Superblock invalid, trying backup blocks...
Superblock has an invalid journal (inode 8).

Scary and bad - it thinks it found something and wants to do some fixing! Ctrl+C!

Clear<y>? cancelled!

fsck.ext4: Illegal inode number while checking ext3 journal for /dev/md1

Ok, crisis averted. Let's see if the data's still intact after resyncing with the wrong layout:

root@test:~# mdadm --stop /dev/md1
mdadm: stopped /dev/md1
root@test:~# mdadm --create /dev/md1 --chunk=512 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: /dev/sdb1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:33:02 2012
mdadm: /dev/sdc1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:33:02 2012
mdadm: /dev/sdd1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:33:02 2012
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
root@test:~# fsck.ext4 /dev/md1
e2fsck 1.41.14 (22-Dec-2010)
/dev/md1: clean, 12/51000 files, 12085/203776 blocks
root@test:~# mount /dev/md1 /mnt/raid5/
root@test:~# cat /mnt/raid5/datafile
data
root@test:~# sha1sum /mnt/raid5/randomdata
847685a5d42524e5b1d5484452a649e854b59064  /mnt/raid5/randomdata

Success!

Test 4

Let's also just prove that the superblock zeroing isn't harmful real quick:

root@test:~# umount /mnt/raid5
root@test:~# mdadm --stop /dev/md1
mdadm: stopped /dev/md1
root@test:~# mdadm --misc --zero-superblock /dev/sdb1 /dev/sdc1 /dev/sdd1
root@test:~# mdadm --create /dev/md1 --chunk=512 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
root@test:~# fsck.ext4 /dev/md1
e2fsck 1.41.14 (22-Dec-2010)
/dev/md1: clean, 12/51000 files, 12085/203776 blocks
root@test:~# mount /dev/md1 /mnt/raid5/
root@test:~# cat /mnt/raid5/datafile
data
root@test:~# sha1sum /mnt/raid5/randomdata
847685a5d42524e5b1d5484452a649e854b59064  /mnt/raid5/randomdata

Yeah, no big deal.

Test 5

Let's just throw everything we've got at it. All 4 previous tests, combined.

  • Wrong device order
  • Wrong chunk size
  • Wrong layout algorithm
  • Zeroed superblocks (we'll do this between both creations)

Onward!

root@test:~# umount /mnt/raid5
root@test:~# mdadm --stop /dev/md1
mdadm: stopped /dev/md1
root@test:~# mdadm --misc --zero-superblock /dev/sdb1 /dev/sdc1 /dev/sdd1
root@test:~# mdadm --create /dev/md1 --chunk=64 --level=5 --raid-devices=3 --layout=right-symmetric /dev/sdc1 /dev/sdd1 /dev/sdb1
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
root@test:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid5 sdb1[3] sdd1[1] sdc1[0]
      204672 blocks super 1.2 level 5, 64k chunk, algorithm 3 [3/3] [UUU]

unused devices: <none>
root@test:~# fsck.ext4 /dev/md1
e2fsck 1.41.14 (22-Dec-2010)
fsck.ext4: Superblock invalid, trying backup blocks...
fsck.ext4: Bad magic number in super-block while trying to open /dev/md1

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>
root@test:~# mdadm --stop /dev/md1
mdadm: stopped /dev/md1

The verdict?

root@test:~# mdadm --misc --zero-superblock /dev/sdb1 /dev/sdc1 /dev/sdd1
root@test:~# mdadm --create /dev/md1 --chunk=512 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
root@test:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid5 sdd1[3] sdc1[1] sdb1[0]
      203776 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

unused devices: <none>

root@test:~# fsck.ext4 /dev/md1
e2fsck 1.41.14 (22-Dec-2010)
/dev/md1: clean, 13/51000 files, 17085/203776 blocks
root@test:~# mount /dev/md1 /mnt/raid5/
root@test:~# cat /mnt/raid5/datafile
data
root@test:~# sha1sum /mnt/raid5/randomdata
847685a5d42524e5b1d5484452a649e854b59064  /mnt/raid5/randomdata

Wow.

So, it looks like none of these actions corrupted data in any way. I was quite surprised by this result, frankly; I expected moderate odds of data loss on the chunk size change, and some definite loss on the layout change. I learned something today.


So .. How do I get my data??

Any information you have about the old system will be extremely helpful: the filesystem type, and any old copies of your /proc/mdstat with information on drive order, algorithm, chunk size, and metadata version. Do you have mdadm's email alerts set up? If so, find an old one; if not, check /var/spool/mail/root. Check your ~/.bash_history to see if your original build is in there.
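
A couple of quick greps cover those spots (the paths are the usual defaults; adjust for your setup):

grep -a mdadm ~/.bash_history /root/.bash_history 2>/dev/null
grep -a -i 'md0\|DegradedArray' /var/spool/mail/root 2>/dev/null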

So, the list of things that you should do:

  1. Back up the disks with dd before doing anything!!
  2. Try to fsck the current, active md - you may have just happened to build in the same order as before. If you know the filesystem type, that's helpful; use that specific fsck tool. If any of the tools offer to fix anything, don't let them unless you're sure that they've actually found the valid filesystem! If an fsck offers to fix something for you, don't hesitate to leave a comment to ask whether it's actually helping or just about to nuke data.
  3. Try building the array with different parameters. If you have an old /proc/mdstat, then you can just mimic what it shows; if not, then you're kinda in the dark - trying all of the different drive orders is reasonable, but checking every possible chunk size with every possible order is futile. For each, fsck it to see if you get anything promising.
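
Here's a rough sketch of the drive-order loop from step 3. The device names are this question's members, and ideally you'd point it at loop devices built from your dd images rather than the real disks; --assume-clean skips the resync between attempts, and fsck's -n flag keeps the check strictly read-only:

#!/bin/bash
members="/dev/sdd1 /dev/sde1 /dev/sdf1"
for a in $members; do
  for b in $members; do
    for c in $members; do
      [ "$a" = "$b" ] && continue
      [ "$a" = "$c" ] && continue
      [ "$b" = "$c" ] && continue
      mdadm --stop /dev/md0 2>/dev/null
      # --run suppresses the "appears to be part of a raid array" prompt
      mdadm --create /dev/md0 --assume-clean --run --chunk=512 --level=5 --raid-devices=3 "$a" "$b" "$c"
      echo "=== order: $a $b $c ==="
      fsck.ext4 -n /dev/md0
    done
  done
done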

So, that's that. Sorry for the novel, feel free to leave a comment if you have any questions, and good luck!

footnote: under 22 thousand characters; 8k+ shy of the length limit

Shane Madden
  • 9
    That is one amazing answer. – Antoine Benkemoun Jan 08 '12 at 08:55
  • 4
    I don't even know what to say... BRAVO!!! Kudos to Shane Madden. I am going to backup the disks and get started with your suggestions. Thanks, no really a big thanks!!! – Brigadieren Jan 08 '12 at 09:55
  • Brilliant write-up. I have been there and done that myself, but I wouldn't be able to write it down so clearly. – Tonny Jan 08 '12 at 10:30
  • 3
    I just...wow. Brilliant answer. I think the only answer to break the 30,000 character limit is Evan Andersons "How Does Subnetting Work" essay. – tombull89 Jan 08 '12 at 12:21
  • 4
    Best answer on SF ever as far as I'm concerned. – Chopper3 Jan 08 '12 at 13:24
  • 14
    You, sir, win the internet. – Mark Henderson Jan 08 '12 at 20:23
  • 1
    You deserve a cape to wear in the server room on sysadmin day. – Bart Silverstrim Jan 09 '12 at 13:43
  • 2
    FYI: for a while, its been possible to have partitioned md arrays. Originally, they had their own mdp devices, but as of Linux 2.6.28, all md arrays are partitionable. The mdadm manpage has some details. So the OP may need to reconstruct a partition table as well... – derobert Jan 10 '12 at 16:45
  • Epic answer. Take a bow sir – Stewart Robinson Jan 10 '12 at 19:58
  • What can I say that the others haven't said? Kudos, Shane, reading answers and articles in general like this is poetry, no it's pure magic! – Zlatko Jan 11 '12 at 10:19
  • Hi Shane/folks, I just finished backing up the disks and tried to re-create the raid with what I think were the original params = the defaults with the disk order which I think is the right one. The process finished ok but again no superblock in the raid. As far as I can tel the only diff in the params is the name of the raid. Can it mess things up? I am in the process of re-creating the raid with the same params and the original name in hopes it will fix the problem.:( – Brigadieren Jan 18 '12 at 03:32
  • 1
    @Brigadieren Name shouldn't matter. Did you find an old copy of `/proc/mdstat` (check the `root` account's `/var/spool/mail`?), or an old command in the bash history to verify the layout? And what do you mean by "no superblock in the RAID"? – Shane Madden Jan 18 '12 at 03:40
  • I don't have the mdstat and mail. What I have is the bash history. The cmd line I used when created the array is quite straightforward - mdadm --create --verbose /dev/md0 --level=5 --raid-devices=3 /dev/sdd1 /dev/sde1 /dev/sdf1 --force. I run this last night after zeroing the superblocks on the raid disks and it finished fine but mdadm --examine /dev/md0 shows no superblock. Also, fsck also tells me that there is no superblock. – Brigadieren Jan 18 '12 at 05:01
  • @Brigadieren `--examine` is for physical disks, not the `md` volumes - it definitely should not have an `md` superblock (use `--detail` for the `/dev/md0` device). Did you run the `fsck` appropriate for the filesystem that was on the disk? (your `mkfs` command should be in bash history too) – Shane Madden Jan 18 '12 at 05:34
  • Hi Shane, sorry for dumb qs. The array is still rebuilding but I remember running fsck.ext4 /dev/md0 and it again gave me no superblock error. DiskUtility (the UI tool for ubuntu) showed the raid was running fine but the partition was empty. I know I am swamping the forum with qs so if you don't mind could we possibly take it offline and I will post the answer later? My email is mushegh at hotmail dot com. Also it seems that I have formatted the partition through the UI too. – Brigadieren Jan 18 '12 at 05:43
  • @Brigadieren `fsck.ext4` may not be appropriate, depending on what filesystem it was built as. Do you mean that you formatted the partition originally via the UI, or that you've formatted it in the UI during this rebuild process? – Shane Madden Jan 18 '12 at 05:59
  • I did the original formatting in the UI. My fstab shows the following entry: /dev/md0 /media/Raid ext4 defaults 1 2 I used this to remount the md0 to /media/Raid after reboot so I assume the UI has formatted it to ext4 – Brigadieren Jan 18 '12 at 06:02
  • Gotcha. Can you provide the exact output of the `fsck.ext4 /dev/md0` command? Maybe edit it onto the bottom of your original question so that it can be formatted correctly. – Shane Madden Jan 18 '12 at 06:08
  • Thanks Shane, I just appended the output to the bottom of the original post. While the array is rebuilding the output is still the same as it was before I kicked off re-creating – Brigadieren Jan 18 '12 at 06:14
  • Well, that's not good. Have you allowed that partition manager tool to make any changes (like, say, creating a partition table on the `md` block device) or anything like that? Try `mkfs.ext4 -n /dev/md0` (make sure you have the `-n`!!), note the backup superblock locations that it spits out, and try `fsck.ext4 -b /dev/md0` with a couple of the number that were listed for backup superblocks. – Shane Madden Jan 18 '12 at 18:15
  • Attached the info at the bottom of the original post - I just run fsck.ext4 with every returned block but the result is the same. Regarding the disk utility.. I just used it to create and mount the raid partition from UI. I a not sure what exactly did that do to the assembled Raid though. – Brigadieren Jan 18 '12 at 23:26
  • You'll need to clarify "create and mount raid partition" - did you create a partition instead of simply building the `md` device? – Shane Madden Jan 18 '12 at 23:30
  • As I remember, after I created the raid 5 using mdadm, it just showed empty partition in the disk utility. I formatted it and mounted it to /media/Raid using the disk utility UI – Brigadieren Jan 19 '12 at 00:21
  • That `invalid argument` error is strange, seems like the `-b` isn't taking. Try `e2fsck -b 214990848 /dev/md0` instead? – Shane Madden Jan 19 '12 at 00:43
  • Sorry it's been a while I have been out of town. I tried the e2fsck -b xxx /dev/md0 but the result is the same - invalid superblock. I have tried adding the drives to windows machine and running ZAR, ReclaiMe, R-Studio = each of these works for a couple of days and then comes up with weird results. I am back to trying to re-create the array using linux in hopes that it will be possible to recover something. I have another question... Is it possible to get the mdadm to re-create the array from 2 drives (assuming 1 is faulty) and use a disk image on a large drive? – Brigadieren Feb 22 '12 at 07:39
  • 1
    Also gets my vote for the best answer I have ever read on the site. – liamf Apr 26 '12 at 10:04
  • If this was on StackOverflow I would've just given you 500 rep with a bounty... this answer is absolutely amazing. – user541686 Jul 19 '12 at 06:23
6

I had a similar problem:
after a failure of a software RAID5 array I fired mdadm --create without giving it --assume-clean, and could not mount the array anymore. After two weeks of digging I finally restored all data. I hope the procedure below will save someone's time.

Long Story Short

The problem was caused by the fact that mdadm --create made a new array that was different from the original in two aspects:

  • different order of partitions
  • different RAID data offset

As has been shown in the brilliant answer by Shane Madden, mdadm --create does not destroy the data in most cases! After finding the partition order and data offset I could restore the array and extract all data from it.

Prerequisites

I had no backups of RAID superblocks, so all I knew was that it was a RAID5 array on 8 partitions created during installation of Xubuntu 12.04.0. It had an ext4 filesystem. Another important piece of knowledge was a copy of a file that was also stored on the RAID array.

Tools

Xubuntu 12.04.1 live CD was used to do all the work. Depending on your situation, you might need some of the following tools:

a version of mdadm that allows you to specify the data offset

sudo apt-get install binutils-dev git
git clone -b data_offset git://neil.brown.name/mdadm
cd mdadm
make

bgrep - searching for binary data

curl -L 'https://github.com/tmbinc/bgrep/raw/master/bgrep.c' | gcc -O2 -x c -o bgrep -

hexdump, e2fsck, mount and a hexadecimal calculator - standard tools from repos

Start with Full Backup

Naming of device files, e.g. /dev/sda2 /dev/sdb2 etc., is not persistent, so it's better to write down your drives' serial numbers given by

sudo hdparm -I /dev/sda

Then hook up an external HDD and back up every partition of your RAID array like this:

sudo dd if=/dev/sda2 bs=4M | gzip > serial-number.gz
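
Restoring one of these images later (onto the same partition or a replacement of at least the same size) is just the reverse pipeline; /dev/sdX2 below is a placeholder:

gunzip -c serial-number.gz | sudo dd of=/dev/sdX2 bs=4M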

Determine Original RAID5 Layout

Various layouts are described here: http://www.accs.com/p_and_p/RAID/LinuxRAID.html
To find how strips of data were organized on the original array, you need a copy of a random-looking file that you know was stored on the array. The default chunk size currently used by mdadm is 512KB. For an array of N partitions, you need a file of size at least (N+1)*512KB. A jpeg or video is good as it provides relatively unique substrings of binary data. Suppose our file is called picture.jpg. We read 32 bytes of data at N+1 positions starting from 100k and incrementing by 512k:

hexdump -n32 -s100k -v -e '/1 "%02X"' picture.jpg ; echo
DA1DC4D616B1C71079624CDC36E3D40E7B1CFF00857C663687B6C4464D6C77D2
hexdump -n32 -s612k -v -e '/1 "%02X"' picture.jpg ; echo
AB9DDDBBB05CA915EE2289E59A116B02A26C82C8A8033DD8FA6D06A84B6501B7
hexdump -n32 -s1124k -v -e '/1 "%02X"' picture.jpg ; echo
BC31A8DC791ACDA4FA3E9D3406D5639619576AEE2E08C03C9EF5E23F0A7C5CBA
...
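
The same probes can be generated in one small loop instead of typing each command (file name and offsets are the ones used above; with N=8 that's nine probes):

for i in $(seq 0 8); do
    hexdump -n32 -s$((100 + i*512))k -v -e '/1 "%02X"' picture.jpg; echo
done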

We then search for occurrences of all of these bytestrings on all of our raw partitions, so in total (N+1)*N commands, like this:

sudo ./bgrep DA1DC4D616B1C71079624CDC36E3D40E7B1CFF00857C663687B6C4464D6C77D2 /dev/sda2
sudo ./bgrep DA1DC4D616B1C71079624CDC36E3D40E7B1CFF00857C663687B6C4464D6C77D2 /dev/sdb2
...
sudo ./bgrep DA1DC4D616B1C71079624CDC36E3D40E7B1CFF00857C663687B6C4464D6C77D2 /dev/sdh2
/dev/sdh2: 52a7ff000
sudo ./bgrep AB9DDDBBB05CA915EE2289E59A116B02A26C82C8A8033DD8FA6D06A84B6501B7 /dev/sda2
/dev/sdb2: 52a87f000
...
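
With eight partitions that is 72 invocations, so a nested loop helps; probes.txt here is a hypothetical file holding the nine hex strings, one per line:

while read -r probe; do
    for part in /dev/sd{a..h}2; do
        sudo ./bgrep "$probe" "$part"
    done
done < probes.txt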

These commands can be run in parallel for different disks. Scan of a 38GB partition took around 12 minutes. In my case, every 32-byte string was found only once among all eight drives. By comparing offsets returned by bgrep you obtain a picture like this:

| offset \ partition | b | d | c | e | f | g | a | h |
|--------------------+---+---+---+---+---+---+---+---|
| 52a7ff000          | P |   |   |   |   |   |   | 1 |
| 52a87f000          | 2 | 3 | 4 | 5 | 6 | 7 | 8 | P |
| 52a8ff000          |   |   |   |   |   |   | P | 9 |

We see a normal left-symmetric layout, which is the default for mdadm. More importantly, now we know the order of partitions. However, we don't know which partition is the first in the array, as they can be cyclically shifted.

Note also the distance between found offsets. In my case it was 512KB. The chunk size can actually be smaller than this distance, in which case the actual layout will be different.

Find Original Chunk Size

We use the same file picture.jpg to read 32 bytes of data at different intervals. We know from above that the data at offset 100k lies on /dev/sdh2, the data at 612k on /dev/sdb2, and the data at 1124k on /dev/sdd2. This shows that the chunk size is not bigger than 512KB. We verify that it is not smaller than 512KB. For this we dump the bytestring at offset 356k and look at which partition it sits on:

hexdump -n32 -s356k -v -e '/1 "%02X"' picture.jpg ; echo
7EC528AD0A8D3E485AE450F88E56D6AEB948FED7E679B04091B031705B6AFA7A
sudo ./bgrep 7EC528AD0A8D3E485AE450F88E56D6AEB948FED7E679B04091B031705B6AFA7A /dev/sdb2
/dev/sdb2: 52a83f000

It is on the same partition as offset 612k, which indicates that the chunk size is not 256KB. We eliminate smaller chunk sizes in a similar fashion. I ended up with 512KB chunks being the only possibility.

Find First Partition in Layout

Now we know the order of partitions, but we don't know which partition should be the first, and which RAID data offset was used. To find these two unknowns, we will create a RAID5 array with correct chunk layout and a small data offset, and search for the start of our file system in this new array.

To begin with, we create an array with the correct order of partitions, which we found earlier:

sudo mdadm --stop /dev/md126
sudo mdadm --create /dev/md126 --assume-clean --raid-devices=8 --level=5  /dev/sdb2 /dev/sdd2 /dev/sdc2 /dev/sde2 /dev/sdf2 /dev/sdg2 /dev/sda2 /dev/sdh2

We verify that the order is obeyed by issuing

sudo mdadm --misc -D /dev/md126
...
Number   Major   Minor   RaidDevice State
   0       8       18        0      active sync   /dev/sdb2
   1       8       50        1      active sync   /dev/sdd2
   2       8       34        2      active sync   /dev/sdc2
   3       8       66        3      active sync   /dev/sde2
   4       8       82        4      active sync   /dev/sdf2
   5       8       98        5      active sync   /dev/sdg2
   6       8        2        6      active sync   /dev/sda2
   7       8      114        7      active sync   /dev/sdh2

Now we determine the offsets of the N+1 known bytestrings in the RAID array. I ran a script overnight (the Live CD doesn't ask for a password on sudo :):

#!/bin/bash
echo "1st:"
sudo ./bgrep DA1DC4D616B1C71079624CDC36E3D40E7B1CFF00857C663687B6C4464D6C77D2 /dev/md126
echo "2nd:"
sudo ./bgrep AB9DDDBBB05CA915EE2289E59A116B02A26C82C8A8033DD8FA6D06A84B6501B7 /dev/md126
echo "3rd:"
sudo ./bgrep BC31A8DC791ACDA4FA3E9D3406D5639619576AEE2E08C03C9EF5E23F0A7C5CBA /dev/md126
...
echo "9th:"
sudo ./bgrep 99B5A96F21BB74D4A630C519B463954EC096E062B0F5E325FE8D731C6D1B4D37 /dev/md126

Output with comments:

1st:
/dev/md126: 2428fff000 # 1st
2nd:
/dev/md126: 242947f000 # 480000 after 1st
3rd:                   # 3rd not found
4th:
/dev/md126: 242917f000 # 180000 after 1st
5th:
/dev/md126: 24291ff000 # 200000 after 1st
6th:
/dev/md126: 242927f000 # 280000 after 1st
7th:
/dev/md126: 24292ff000 # 300000 after 1st
8th:
/dev/md126: 242937f000 # 380000 after 1st
9th:
/dev/md126: 24297ff000 # 800000 after 1st

Based on this data we see that the 3rd string was not found. This means that the chunk at /dev/sdd2 is used for parity. Here is an illustration of the parity positions in the new array:

| offset \ partition | b | d | c | e | f | g | a | h |
|--------------------+---+---+---+---+---+---+---+---|
| 52a7ff000          |   |   | P |   |   |   |   | 1 |
| 52a87f000          | 2 | P | 4 | 5 | 6 | 7 | 8 |   |
| 52a8ff000          | P |   |   |   |   |   |   | 9 |

Our aim is to deduce which partition to start the array from, in order to shift the parity chunks into the right place. Since parity should be shifted two chunks to the left, the partition sequence should be shifted two steps to the right. Thus the correct layout for this data offset is ahbdcefg:

sudo mdadm --stop /dev/md126
sudo mdadm --create /dev/md126 --assume-clean --raid-devices=8 --level=5  /dev/sda2 /dev/sdh2 /dev/sdb2 /dev/sdd2 /dev/sdc2 /dev/sde2 /dev/sdf2 /dev/sdg2 

At this point our RAID array contains data in the right form. You might be lucky enough that the RAID data offset is the same as it was in the original array, in which case you will most likely be able to mount the partition. Unfortunately this was not my case.

Verify Data Consistency

We verify that the data is consistent over a strip of chunks by extracting a copy of picture.jpg from the array. For this we locate the offset for the 32-byte string at 100k:

sudo ./bgrep DA1DC4D616B1C71079624CDC36E3D40E7B1CFF00857C663687B6C4464D6C77D2 /dev/md126

We then subtract 100*1024 from the result and use the obtained decimal value as the skip= parameter for dd. The count= is the size of picture.jpg in bytes:

sudo dd if=/dev/md126 of=./extract.jpg bs=1 skip=155311300608 count=4536208

Check that extract.jpg is the same as picture.jpg.
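
The subtraction is plain shell arithmetic, and cmp is enough for the comparison; the offset below is only an example value, so substitute whatever bgrep printed for your array:

offset=0x2428fff000                      # example - use the offset bgrep just reported
echo $(( offset - 100*1024 ))            # decimal value to pass as skip=
cmp picture.jpg extract.jpg && echo "identical"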

Find RAID Data Offset

A sidenote: the default data offset for mdadm version 3.2.3 is 2048 sectors, but this value has changed over time. If the original array used a smaller data offset than your current mdadm, then mdadm --create without --assume-clean can overwrite the beginning of the file system.

In the previous section we created a RAID array. Check which RAID data offset it uses by running the following against one of the individual partitions:

sudo mdadm --examine /dev/sdb2
...
    Data Offset : 2048 sectors
...

2048 512-byte sectors is 1MB. Since chunk size is 512KB, the current data offset is two chunks.

If at this point you have a two-chunk offset, it is probably small enough, and you can skip this paragraph.
We create a RAID5 array with the data offset of one 512KB-chunk. Starting one chunk earlier shifts the parity one step to the left, thus we compensate by shifting the partition sequence one step to the left. Hence for 512KB data offset, the correct layout is hbdcefga. We use a version of mdadm that supports data offset (see Tools section). It takes offset in kilobytes:

sudo mdadm --stop /dev/md126
sudo ./mdadm --create /dev/md126 --assume-clean --raid-devices=8 --level=5  /dev/sdh2:512 /dev/sdb2:512 /dev/sdd2:512 /dev/sdc2:512 /dev/sde2:512 /dev/sdf2:512 /dev/sdg2:512 /dev/sda2:512

Now we search for a valid ext4 superblock. The superblock structure can be found here: https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#The_Super_Block
We scan the beginning of the array for occurrences of the magic number s_magic followed by s_state and s_errors. The bytestrings to look for are:

53EF01000100
53EF00000100
53EF02000100
53EF01000200
53EF02000200

Example command:

sudo ./bgrep 53EF01000100 /dev/md126
/dev/md126: 0dc80438

The magic number starts 0x38 bytes into the superblock, so we subtract 0x38 to calculate the offset and examine the entire superblock:

sudo hexdump -n84 -s0xDC80400 -v /dev/md126
dc80400 2000 00fe 1480 03f8 cdd3 0032 d2b2 0119
dc80410 ab16 00f7 0000 0000 0002 0000 0002 0000
dc80420 8000 0000 8000 0000 2000 0000 b363 51bd
dc80430 e406 5170 010d ffff ef53 0001 0001 0000
dc80440 3d3a 50af 0000 0000 0000 0000 0001 0000
dc80450 0000 0000                              

This seems to be a valid superblock. s_log_block_size field at 0x18 is 0002, meaning that the block size is 2^(10+2)=4096 bytes. s_blocks_count_lo at 0x4 is 03f81480 blocks which is 254GB. Looks good.
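
These numbers are easy to double-check with shell arithmetic, using the values quoted above:

printf '0x%X\n' $(( 0x0dc80438 - 0x38 ))     # 0xDC80400 - superblock start from the bgrep hit
echo $(( 2 ** (10 + 2) ))                    # 4096-byte blocks (s_log_block_size = 2)
echo $(( 0x03f81480 ))                       # 66589824 blocks
echo $(( 0x03f81480 * 4096 / 1024**3 ))      # 254 (GiB, rounded down)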

We now scan for the occurrences of the first bytes of the superblock to find its copies. Note the byte flipping as compared to hexdump output:

sudo ./bgrep 0020fe008014f803d3cd3200 /dev/md126
/dev/md126: 0dc80400    # offset by 1024 bytes from the start of the FS        
/dev/md126: 15c80000    # 32768 blocks from FS start
/dev/md126: 25c80000    # 98304
/dev/md126: 35c80000    # 163840
/dev/md126: 45c80000    # 229376
/dev/md126: 55c80000    # 294912
/dev/md126: d5c80000    # 819200
/dev/md126: e5c80000    # 884736
/dev/md126: 195c80000
/dev/md126: 295c80000

This aligns perfectly with the expected positions of backup superblocks:

sudo mke2fs -n /dev/md126
...
Block size=4096 (log=2)
...
Superblock backups stored on blocks: 
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
    4096000, 7962624, 11239424, 20480000, 23887872

Hence the file system starts at the offset 0xdc80000, i.e. 225792KB from the partition start. Since we have 8 partitions of which one is for parity, we divide the offset by 7. This gives 33030144 bytes offset on every partition, which is exactly 63 RAID chunks. And since the current RAID data offset is one chunk, we conclude that the original data offset was 64 chunks, or 32768KB. Shifting hbdcefga 63 times to the right gives the layout bdcefgah.
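
The arithmetic, spelled out in the shell (just re-deriving the numbers above):

echo $(( 0xdc80000 / 1024 ))             # 225792 KB - where the filesystem starts
echo $(( 0xdc80000 / 7 ))                # 33030144 bytes of extra offset per data-bearing partition
echo $(( 0xdc80000 / 7 / (512*1024) ))   # 63 chunks; plus the current 1-chunk offset = 64 chunks (32768KB)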

We finally build the correct RAID array:

sudo mdadm --stop /dev/md126
sudo ./mdadm --create /dev/md126 --assume-clean --raid-devices=8 --level=5  /dev/sdb2:32768 /dev/sdd2:32768 /dev/sdc2:32768 /dev/sde2:32768 /dev/sdf2:32768 /dev/sdg2:32768 /dev/sda2:32768 /dev/sdh2:32768
sudo fsck.ext4 -n /dev/md126
e2fsck 1.42 (29-Nov-2011)
Warning: skipping journal recovery because doing a read-only filesystem check.
/dev/md126: clean, 423146/16654336 files, 48120270/66589824 blocks
sudo mount -t ext4 -r /dev/md126 /home/xubuntu/mp

Voilà!

Anton Stolbunov
  • 1
    Excellent walkthrough. One note - 53EF00000100 doesn't seem to be a valid anchor for EXT4 header. According to https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#The_Super_Block the two bytes after 53EF could be only 0100, 0200 or 0400. – matt Jul 23 '16 at 11:43
  • I found this answer years ago, and from time to time, I go back to it as one might go back to read a good book once again. It's my all-time favorite StackExchange answer. – matt Aug 23 '20 at 20:05
5

If you are lucky you might have some success with getting your files back with recovery software that can read a broken RAID-5 array. Zero Assumption Recovery is one I have had success with before.

However, I'm not sure if the process of creating a new array has gone and destroyed all the data, so this might be a last chance effort.

Mark Henderson
  • Thanks a lot Mark. I will give it a try. Do you know if it modifies the drives? If so I will make a disk copy and also try with other tools. – Brigadieren Jan 07 '12 at 08:05
  • @Brigadieren - no, sorry, I'm not familiar enough with the intricacies of how RAID5 works. – Mark Henderson Jan 07 '12 at 09:58
  • @Brigadieren According to the [mdadm documentation](http://linux.die.net/man/8/mdadm), the create process won't destroy data, just resync - but if it's chosen a geometry that didn't match with your original, then it may have destroyed data with the writing of new parity. If I have some free time later on I might see about re-creating your situation in a VM, to see if I can add any insight. I'd recommend grabbing full copies of the drives before attempting any recovery steps that write to the disks at all - you may want to look into getting extra drives to make copies. – Shane Madden Jan 07 '12 at 18:35
  • I am just not sure what caused the sync - the fact that the array was degraded in the first place (due to power outage) or something else? I wonder if mdadm --create makes any distinction whether I specify the drive order differently than was originally given? – Brigadieren Jan 07 '12 at 21:35
  • @Brigadieren Sync always occurs on create. – Shane Madden Jan 08 '12 at 05:51
  • Thanks Shane for the clarification. So I assume then the order of the disks makes the difference right? If so, does anyone know what and how sync does so I can try to write a tool to un-do what's been done? – Brigadieren Jan 08 '12 at 08:11
0

I had a similar issue. I formatted and reinstalled my OS/boot drive with a clean install of Ubuntu 12.04, then ran the mdadm --create... command and couldn't mount it.

It said it didn't have a valid superblock or partition.

Moreover, when I stopped the mdadm raid, I could no longer mount the regular device.

I was able to repair the superblock with mke2fs and e2fsck:

root@blackbox:~# mke2fs -n /dev/sdc1
mke2fs 1.42 (29-Nov-2011)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
91578368 inodes, 366284000 blocks
18314200 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
11179 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks: 
  32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
  4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
  102400000, 214990848

Then ran:

e2fsck -b 32768 -y /dev/sdc1

That restored the superblock so I could mount and read the drive.

To get the array working without destroying the superblock or partitions, I used --build:

mdadm --build /dev/md0 --level=mirror --assume-clean --raid-devices=2  /dev/sdc1 missing 

After verifying the data, I will add the other drive:

mdadm --add /dev/md0 /dev/sdd1
0

I'm just updating some of the information given earlier. I had a 3-disk RAID5 array working OK when my motherboard died. The array held /dev/md2 as the 1.2TB /home partition and /dev/md3 as the 300GB /var partition.

I had two backups of "important" stuff and a bunch of random things I had grabbed from various parts of the internet that I really should have gone through and selectively dumped. Most of the backups were broken into .tar.gz files of 25GB or less, and a separate copy of /etc was backed up also.

The rest of the filesystem was held on two small raid0 disks of 38GB.

My new machine was similar to the old hardware, and I got the machine up and running simply by plugging all five disks in and selecting an old generic kernel. So I had five disks with clean filesystems, though I could not be certain that the disks were in the right order, and needed to install a new version of Debian Jessie to be sure that I could upgrade the machine when needed, and sort out other problems.

With the new generic system installed on two RAID0 disks, I began to put the arrays back together. I wanted to be sure that I had the disks in the right order. What I should have done was to issue:

mdadm --assemble /dev/md3 -o --no-degraded --uuid=82164ae7:9af3c5f1:f75f70a5:ba2a159a

But I didn't. It seems that mdadm is pretty smart and, given a UUID, can figure out which drives go where. Even if the BIOS designates /dev/sdc as /dev/sda, mdadm will put it together correctly (YMMV though).

Instead I issued mdadm --create /dev/md2 without --assume-clean, and allowed the resync on /dev/sde1 to complete. The next mistake I made was to work on /dev/sdc1 instead of the last drive in /dev/md2, /dev/sde1. Anytime mdadm thinks there is a problem, it is the last drive that gets kicked out or re-synced.

After that, mdadm could not find any superblock, and e2fsck -n couldn't either.

After I found this page, I went through the procedure of finding the sequence of the drives (done), checking for valid data (verified 6MB of a 9MB file), getting the disks in the right sequence (cde), grabbing the UUIDs of /dev/md2 and /dev/md3 from the old /etc/mdadm.conf, and trying to assemble.

Well, /dev/md3 started, and mdadm --misc -D /dev/md3 showed three healthy partitions, and the disks in the right order. /dev/md2 also looked good, until I tried to mount the filesystem.

# mdadm --create /dev/md2 --raid-devices=3 --level=5 --uuid=c0a644c7:e5bcf758:ecfbc8f3:ee0392b7 /dev/sdc1 /dev/sdd1 /dev/sde1
mdadm: /dev/sdc1 appears to be part of a raid array:
       level=raid5 devices=3 ctime=Wed Feb  3 14:05:36 2016
mdadm: /dev/sdd1 appears to contain an ext2fs file system
       size=585936896K  mtime=Thu Jan  1 01:00:00 1970
mdadm: /dev/sdd1 appears to be part of a raid array:
       level=raid5 devices=3 ctime=Wed Feb  3 14:05:36 2016
mdadm: /dev/sde1 appears to contain an ext2fs file system
       size=585936896K  mtime=Thu Jan  1 01:00:00 1970
mdadm: /dev/sde1 appears to be part of a raid array:
       level=raid5 devices=3 ctime=Wed Feb  3 14:05:36 2016
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md2 started.

The filesystem refused to mount, and e2fsck couldn't find any superblocks. Further, when checking for superblocks as described above, the total block count found (a880 0076 or 5500 1176) did not match the disk capacity of 1199.79 reported by mdadm. Also, none of the locations of the "superblocks" aligned with the data in the posts above.

I backed up all of /var and prepared to wipe the disks. To see if it was possible to wipe just /dev/md2 (I had nothing else to lose at this point), I did the following:

root@ced2:/home/richard# mdadm --create /dev/md2 --raid-devices=3 --level=5 --uuid=c0a644c7:e5bcf758:ecfbc8f3:ee0392b7 /dev/sdc1 /dev/sdd1 /dev/sde1
mdadm: /dev/sdc1 appears to be part of a raid array:
       level=raid5 devices=3 ctime=Wed Feb  3 14:05:36 2016
mdadm: /dev/sdd1 appears to contain an ext2fs file system
       size=585936896K  mtime=Thu Jan  1 01:00:00 1970
mdadm: /dev/sdd1 appears to be part of a raid array:
       level=raid5 devices=3 ctime=Wed Feb  3 14:05:36 2016
mdadm: /dev/sde1 appears to contain an ext2fs file system
       size=585936896K  mtime=Thu Jan  1 01:00:00 1970
mdadm: /dev/sde1 appears to be part of a raid array:
       level=raid5 devices=3 ctime=Wed Feb  3 14:05:36 2016
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md2 started.
# mkfs.ext3 /dev/md2
mke2fs 1.42.12 (29-Aug-2014)
Creating filesystem with 292902912 4k blocks and 73228288 inodes
Filesystem UUID: a54e252f-78db-4ebb-b7ca-7dcd2edf57a4
Superblock backups stored on blocks: 
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
    4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
    102400000, 214990848

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done 


# hexdump -n84 -s0x00000400 -v /dev/md2
0000400 6000 045d 5800 1175 7799 00df 6ff0 112e
0000410 5ff5 045d 0000 0000 0002 0000 0002 0000
0000420 8000 0000 8000 0000 2000 0000 10d3 56b2
0000430 10d3 56b2 0002 ffff ef53 0001 0001 0000
0000440 0c42 56b2 0000 0000 0000 0000 0001 0000
0000450 0000 0000                              
0000454

#  ./bgrep 00605D0400587511 /dev/md2
/dev/md2: 00000400
/dev/md2: 08000000
/dev/md2: 18000000
/dev/md2: 28000000
/dev/md2: 38000000
/dev/md2: 48000000
/dev/md2: c8000000
/dev/md2: d8000000
/dev/md2: 188000000
/dev/md2: 288000000
/dev/md2: 3e8000000
/dev/md2: 798000000
/dev/md2: ab8000000
etc

All seemed OK, except for the change to the UUID. So after a couple more checks, I wrote 600GB of backed-up data onto /dev/md2. Then I unmounted it and tried to re-mount the drive:

# mdadm --assemble /dev/md2 uuid=c0a644c7:e5bcf758:ecfbc8f3:ee0392b7
mdadm: cannot open device uuid=c0a644c7:e5bcf758:ecfbc8f3:ee0392b7: No such file or directory
mdadm: uuid=c0a644c7:e5bcf758:ecfbc8f3:ee0392b7 has no superblock - assembly aborted

Are you ********* kidding me? What about my 600GB of files?

# mdadm --assemble /dev/md2 
mdadm: /dev/md2 not identified in config file.

Ah - easily fixed. I uncommented one line in /etc/mdadm.conf.
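
For reference, the line in question is an ARRAY entry roughly of this form (the UUID being the one passed to --create above):

ARRAY /dev/md2 metadata=1.2 UUID=c0a644c7:e5bcf758:ecfbc8f3:ee0392b7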

# mdadm --assemble /dev/md2 
mdadm: /dev/md2 has been started with 3 drives.

# e2fsck -n /dev/md2
e2fsck 1.42.12 (29-Aug-2014)
/dev/md2: clean, 731552/73228288 files, 182979586/292902912 blocks

Yippie!

Jakuje