
After a power supply replacement we had a cabling issue that caused some disks to go missing. After debugging and fixing this in the BIOS, on the first boot the preexisting RAID5 volume was split into two foreign configurations - one contained 5 disks, the other contained the 2 remaining disks of the 7-disk RAID5 (see below).

We are unable to import the foreign configurations with perccli /c0/fall import:

Status = Failure
Description = Incomplete foreign configuration

So all disks are there, but somehow the controller thinks they form two different drive groups. Is there a way to recover from this situation and merge the configs into one, or something like that?

----------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type  State BT      Size PDC  PI SED DS3  FSpace TR 
----------------------------------------------------------------------------
 0 -   -   -        -   RAID5 Frgn  N  54.571 TB dsbl N  N   dflt N      N  
 0 0   -   -        -   RAID5 Frgn  N  54.571 TB dsbl N  N   dflt N      N  
 0 0   0   67:0     0   DRIVE Frgn  N   9.094 TB dsbl N  N   dflt -      N  
 0 0   1   67:0     1   DRIVE Frgn  N   9.094 TB dsbl N  N   dflt -      N  
 0 0   2   67:0     2   DRIVE Frgn  N   9.094 TB dsbl N  N   dflt -      N  
 0 0   3   67:0     3   DRIVE Frgn  N   9.094 TB dsbl N  N   dflt -      N  
 0 0   4   -        -   DRIVE Msng  -   9.094 TB -    -  -   -    -      N  
 0 0   5   -        -   DRIVE Msng  -   9.094 TB -    -  -   -    -      N  
 0 0   6   67:0     5   DRIVE Frgn  N   9.094 TB dsbl N  N   dflt -      N  
 1 -   -   -        -   RAID5 Frgn  N  54.571 TB dsbl N  N   dflt N      N  
 1 0   -   -        -   RAID5 Frgn  N  54.571 TB dsbl N  N   dflt N      N  
 1 0   0   -        -   DRIVE Msng  -   9.094 TB -    -  -   -    -      N  
 1 0   1   -        -   DRIVE Msng  -   9.094 TB -    -  -   -    -      N  
 1 0   2   -        -   DRIVE Msng  -   9.094 TB -    -  -   -    -      N  
 1 0   3   -        -   DRIVE Msng  -   9.094 TB -    -  -   -    -      N  
 1 0   4   67:0     6   DRIVE Frgn  N   9.094 TB dsbl N  N   dflt -      N  
 1 0   5   67:0     4   DRIVE Frgn  N   9.094 TB dsbl N  N   dflt -      N  
 1 0   6   -        -   DRIVE Msng  -   9.094 TB -    -  -   -    -      N  
----------------------------------------------------------------------------


Foreign VD List :
===============

---------------------------------
DG  VD      Size Type  Name      
---------------------------------
 0 255 54.571 TB RAID5 RV5 
 1 255 54.571 TB RAID5 RV5 
---------------------------------

Update:

I disconnected the whole expander and booted. This showed all disks in the foreign config (there are a number of single raid1 volumes, too):

-----------------------------------------
DG EID:Slot Type  State       Size NoVDs 
-----------------------------------------
 0 -        RAID0 Frgn    9.094 TB     1 
 1 -        RAID0 Frgn   10.913 TB     1 
 2 -        RAID0 Frgn   10.913 TB     1 
 3 -        RAID0 Frgn   10.913 TB     1 
 4 -        RAID0 Frgn    9.094 TB     1 
 5 -        RAID0 Frgn  278.875 GB     1 
 6 -        RAID0 Frgn   14.551 TB     1 
 7 -        RAID0 Frgn   16.370 TB     1 
 8 -        RAID0 Frgn    9.094 TB     1 
 9 -        RAID5 Frgn   54.571 TB     1 
10 -        RAID5 Frgn   54.571 TB     1 
-----------------------------------------

I was able to successfully /c0/fall import all. Unfortunately, this ended up in the same situation as before, with the other volumes being there and the RAID5 being split into two foreign configurations (i.e. importing all foreign configs created two new foreign configs).

Update 2:

Attaching the disks to a GNU/Linux system shows this, which to me basically says the same thing as the PERC controller: there are two raid volumes with 5 and 2 disks. So this seems to be the result of a firmware bug where the raid controller actually split the drive group into two dysfunctional ones, and therefore merging seems impossible.

Personalities : [raid0] [linear] [multipath] [raid1] [raid6] [raid5] [raid4] [raid10] 
md125 : inactive sdi[0]
      9765912576 blocks super external:/md127/2
       
md126 : inactive sdg[1](S) sdf[0](S)
      1048576 blocks super external:ddf
       
md127 : inactive sdm[4](S) sdi[3](S) sdh[2](S) sdk[1](S) sdl[0](S)
      2621440 blocks super external:ddf
       

unused devices: <none>
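
For the record, the DDF metadata behind these containers can also be dumped per member disk - a hedged sketch, with the md/device names taken from the output above:

# inspect the DDF container metadata that md picked up
mdadm --examine /dev/sdf          # per-disk DDF view: container GUID, physical disks, virtual disks
mdadm --detail /dev/md127         # the (partially) assembled DDF container
mdadm --detail /dev/md125         # the inactive RAID5 member volume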

I'm trying to recover from here, but the question now is: can I recreate the array in either the raid controller or GNU/Linux so that the raid controller recognises it again? Restoring from backup would take a rather long time.

Update 3:

Since it was asked for - I don't have the examine/detail info anymore, but here is the dump of what my own tool printed, which gives a bit more structure and clearly shows how corrupted the info was. The DDF data includes more disks than just the ones in the array, but my tool only dumped info related to the array config I wanted to recover. Note that I have since solved my problem by recreating the array after a minor odyssey, so this is just informational.

/dev/sdf
    refno 66fee9c8
    guid 'ATA 999901019c64177c25b6'
    pd   1 6d67850c 'ATA 9999010198734b845e34'
    pd   2 2c442eef 'ATA 99990101a3ff6b169fb3'
    pd   3 859c2a72 'ATA 9999010140f57d7b1911'
    pd   4 2a25447d 'ATA 9999010181a40ea27a38'
    pd   5 6db9e402 'SmrtStor        P^A^W1^@tfM-8'
    pd   6 0176ebaa 'ATA 99990101bd73575777e4'
    pd   7 a63ba301 'ATA 999901017d605c6aadf6'
    pd   8 5254f474 'ATA 999901014ecf2257f8f4'
    pd   9 80e8a86d 'ATA 999901014c775ca92a87'
    pd  10 49416c50 'ATA 99990101d79cd13a1e1e'
    pd  11 fa44428b 'ATA 9999010198bd2187a552'
    pd  12 66fee9c8 'ATA 999901019c64177c25b6'
    pd  13 4a94daa9 'ATA 99990101679d1776307e'
    part 0
        guid 'Dell    ^P'
        size 117190950912
        blocks 19531825152
        disk 0 start 0 ref a63ba301
        disk 1 start 0 ref 5254f474
        disk 2 start 0 ref 80e8a86d
        disk 3 start 0 ref 49416c50
        disk 4 start 0 ref fa44428b
        disk 5 start 0 ref 66fee9c8
        disk 6 start 0 ref 4a94daa9

/dev/sdg
    refno fa44428b
    guid 'ATA 9999010198bd2187a552'
    pd   1 6d67850c 'ATA 9999010198734b845e34'
    pd   2 2c442eef 'ATA 99990101a3ff6b169fb3'
    pd   3 859c2a72 'ATA 9999010140f57d7b1911'
    pd   4 2a25447d 'ATA 9999010181a40ea27a38'
    pd   5 6db9e402 'SmrtStor        P^A^W1^@tfM-8'
    pd   6 0176ebaa 'ATA 99990101bd73575777e4'
    pd   7 a63ba301 'ATA 999901017d605c6aadf6'
    pd   8 5254f474 'ATA 999901014ecf2257f8f4'
    pd   9 80e8a86d 'ATA 999901014c775ca92a87'
    pd  10 49416c50 'ATA 99990101d79cd13a1e1e'
    pd  11 fa44428b 'ATA 9999010198bd2187a552'
    pd  12 66fee9c8 'ATA 999901019c64177c25b6'
    pd  13 4a94daa9 'ATA 99990101679d1776307e'
    part 0
        guid 'Dell    ^P'
        size 117190950912
        blocks 19531825152
        disk 0 start 0 ref a63ba301
        disk 1 start 0 ref 5254f474
        disk 2 start 0 ref 80e8a86d
        disk 3 start 0 ref 49416c50
        disk 4 start 0 ref fa44428b
        disk 5 start 0 ref 66fee9c8
        disk 6 start 0 ref 4a94daa9

/dev/sdh
    refno 4a94daa9
    guid 'ATA 99990101974a122c9311'
    pd   1 6d67850c 'ATA 99990101be1d53ed8c7d'
    pd   2 2c442eef 'ATA 99990101ff58714b7f1b'
    pd   3 859c2a72 'ATA 99990101fa3ac0b94ef7'
    pd   4 2a25447d 'ATA 999901017e74d11eb6e6'
    pd   5 0176ebaa 'ATA 99990101f19b3355ec56'
    pd   6 a63ba301 'ATA 99990101f391d36e91f9'
    pd   7 5254f474 'ATA 99990101fa6d3d5b6c49'
    pd   8 80e8a86d 'ATA 99990101b7ad5947d5c0'
    pd   9 49416c50 'ATA 99990101d2e6918871bb'
    pd  10 4a94daa9 'ATA 99990101974a122c9311'
    pd  11 6db9e402 'SmrtStor        P^A^W1^@tfM-8'
    part 0
        guid 'Dell    ^P'
        size 117190950912
        blocks 19531825152
        disk 0 start 0 ref a63ba301
        disk 1 start 0 ref 5254f474
        disk 2 start 0 ref 80e8a86d
        disk 3 start 0 ref 49416c50
        disk 6 start 0 ref 4a94daa9

/dev/sdi
    refno 49416c50
    guid 'ATA 99990101d2e6918871bb'
    pd   1 2a25447d 'ATA 999901017e74d11eb6e6'
    pd   2 0176ebaa 'ATA 99990101f19b3355ec56'
    pd   3 49416c50 'ATA 99990101d2e6918871bb'
    pd   4 6db9e402 'SmrtStor        P^A^W1^@tfM-8'
    part 0
        guid 'Dell    ^P'
        size 117190950912
        blocks 19531825152
        disk 3 start 0 ref 49416c50

/dev/sdk
    refno 80e8a86d
    guid 'ATA 99990101b7ad5947d5c0'
    pd   1 2a25447d 'ATA 999901017e74d11eb6e6'
    pd   2 0176ebaa 'ATA 99990101f19b3355ec56'
    pd   3 a63ba301 'ATA 99990101f391d36e91f9'
    pd   4 5254f474 'ATA 99990101fa6d3d5b6c49'
    pd   5 80e8a86d 'ATA 99990101b7ad5947d5c0'
    pd   6 49416c50 'ATA 99990101d2e6918871bb'
    pd   7 6db9e402 'SmrtStor        P^A^W1^@tfM-8'
    part 0
        guid 'Dell    ^P'
        size 117190950912
        blocks 19531825152
        disk 0 start 0 ref a63ba301
        disk 1 start 0 ref 5254f474
        disk 2 start 0 ref 80e8a86d
        disk 3 start 0 ref 49416c50

/dev/sdl
    refno 5254f474
    guid 'ATA 99990101fa6d3d5b6c49'
    pd   1 2a25447d 'ATA 999901017e74d11eb6e6'
    pd   2 0176ebaa 'ATA 99990101f19b3355ec56'
    pd   3 a63ba301 'ATA 99990101f391d36e91f9'
    pd   4 5254f474 'ATA 99990101fa6d3d5b6c49'
    pd   5 80e8a86d 'ATA 99990101b7ad5947d5c0'
    pd   6 49416c50 'ATA 99990101d2e6918871bb'
    pd   7 6db9e402 'SmrtStor        P^A^W1^@tfM-8'
    part 0
        guid 'Dell    ^P'
        size 117190950912
        blocks 19531825152
        disk 0 start 0 ref a63ba301
        disk 1 start 0 ref 5254f474
        disk 2 start 0 ref 80e8a86d
        disk 3 start 0 ref 49416c50

/dev/sdm
    refno a63ba301
    guid 'ATA 99990101f391d36e91f9'
    pd   1 2a25447d 'ATA 999901017e74d11eb6e6'
    pd   2 0176ebaa 'ATA 99990101f19b3355ec56'
    pd   3 a63ba301 'ATA 99990101f391d36e91f9'
    pd   4 5254f474 'ATA 99990101fa6d3d5b6c49'
    pd   5 80e8a86d 'ATA 99990101b7ad5947d5c0'
    pd   6 49416c50 'ATA 99990101d2e6918871bb'
    pd   7 6db9e402 'SmrtStor        P^A^W1^@tfM-8'
    part 0
        guid 'Dell    ^P'
        size 117190950912
        blocks 19531825152
        disk 0 start 0 ref a63ba301
        disk 1 start 0 ref 5254f474
        disk 2 start 0 ref 80e8a86d
        disk 3 start 0 ref 49416c50

seq  0 refno a63ba301 dev /dev/sdm
seq  1 refno 5254f474 dev /dev/sdl
seq  2 refno 80e8a86d dev /dev/sdk
seq  3 refno 49416c50 dev /dev/sdi
seq  4 refno fa44428b dev /dev/sdg
seq  5 refno 66fee9c8 dev /dev/sdf
seq  6 refno 4a94daa9 dev /dev/sdh
  • What do `mdadm --detail /dev/RAID` and `mdadm --examine /dev/COMPONENT` say about each item? – Nikita Kipriyanov May 27 '22 at 03:18
  • I've added what info I still had - also see my answer for what I eventually did to resolve this. Basically, the config was corrupted and I had to recreate the array, but didn't have to restore from backup. – Remember Monica May 27 '22 at 23:32

2 Answers


Ok, here is what I did. May it help the next person.

Fact Finding

First, I attached all disks to an HBA. GNU/Linux tried to assemble the raid, but indeed found (at least) two raid volumes (and a bit extra). I then made a backup of the first 32 MB and last 32 MB of each disk, indexed by their WWID/WWN.
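
A minimal sketch of that backup step, assuming the member disks show up as /dev/sdf../dev/sdm on the HBA:

# save the first and last 32 MiB of every member disk,
# with the image files named after each disk's WWN
for dev in /dev/sd{f,g,h,i,k,l,m}; do           # adjust to your member disks
    wwn=$(lsblk -dno WWN "$dev")
    sectors=$(blockdev --getsz "$dev")          # size in 512-byte sectors
    dd if="$dev" of="head-$wwn.img" bs=1M count=32
    dd if="$dev" of="tail-$wwn.img" bs=512 skip=$((sectors - 65536)) count=65536
done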

I then downloaded the SNIA DDF specification (https://www.snia.org/tech_activities/standards/curr_standards/ddf), because I knew that megaraid/Dell (partially) implements it (the DDF anchor block magic is not de11de11 by chance :), and then wrote a very ugly script to decode the data and make sense of it.
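
The anchor is easy to spot, by the way - the DDF anchor header is defined to live in the last block of each member disk. A hedged sketch:

# dump the DDF anchor block from the last LBA of a member disk
dev=/dev/sdf
last=$(( $(blockdev --getsz "$dev") - 1 ))
dd if="$dev" bs=512 skip=$last count=1 2>/dev/null | xxd | head -n 4
# the block should start with the DDF signature de11de11
# (byte order may differ between implementations)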

This showed me that the array was, in fact, split into three different configurations: one that included a single disk, another that included that disk plus 4 more, and a third that contained the remaining 2 disks.

The script itself is not very useful without understanding what you are doing, so I didn't include it here.

Eventually, this allowed me to tease out the correct original order of the disks. Hint: after creating an array, write down the order of WWNs (perccli /c0/s0 show all | grep WWN) and the strip size, at least.
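
A hedged sketch of that bookkeeping (enclosure and slot numbers are assumptions - adjust them to your layout):

# record slot -> WWN for every member, plus the strip size
for s in 0 1 2 3 4 5 6; do
    printf 'slot %s: ' "$s"
    perccli /c0/e67/s$s show all | grep -i wwn
done
perccli /c0/v0 show all | grep -i strip         # note the strip size, too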

This process also gave me the start offset (always 0) and size of the partitions (19531825152 sectors).

The raid5 variant used by the H740P (and probably all megaraid controllers) is called left-symmetric or "RAID-5 Rotating Parity N with Data Continuation (PRL=05, RLQ=03)".

Re-assembling the disks for testing

I then tried to test-reassemble the raid using mdadm --build. Unfortunately, mdadm refuses to --build raid5 arrays - you have to write to the array and risk destroying data :(

As a workaround, to test out whether the order is correct, I started a kvm in snapshot mode with some random GNU/Linux boot image as /dev/sda and the disks as virtio disks:

exec kvm -snapshot -m 16384 \
         -drive file=linux.img,snapshot=off \
         -drive file=/dev/sdm,if=virtio,snapshot=on \
         -drive file=/dev/sdl,if=virtio,snapshot=on \
         -drive file=/dev/sdk,if=virtio,snapshot=on \
         -drive file=/dev/sdi,if=virtio,snapshot=on \
         -drive file=/dev/sdg,if=virtio,snapshot=on \
         -drive file=/dev/sdf,if=virtio,snapshot=on \
         -drive file=/dev/sdh,if=virtio,snapshot=on

This made the disks appear in the specified order as /dev/vda, /dev/vdb and so on, and allowed me to test out various options easily. The first try inside the VM succeeded:

mdadm --create /dev/md0 -f \
   --metadata 1.0 \
   --raid-devices 7 \
   -z $((19531825152/2))K -c 256K \
   -l raid5 -p ddf-N-continue \
   --assume-clean -k resync \
   /dev/vd?

For raid5, the exact size is uncritical - if it is too large, your GPT partition table will look corrupt and you have some extra data at the end, but the rest of the disk should still be readable.

I verified the correctness of the data by mounting the partition (which should not throw errors, but might succeed even if the order is wrong), and by using btrfs scrub, which verifies the checksums of all metadata and data blocks - the ultimate test, and a major plus of btrfs.
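
Inside the VM, that check looked roughly like this (a sketch; the -snapshot drives make any writes harmless):

# mount the assembled volume and let btrfs verify every checksum
mkdir -p /mnt/test
mount /dev/md0 /mnt/test            # adjust if the filesystem sits in a partition, e.g. /dev/md0p1
btrfs scrub start -B -d /mnt/test   # -B waits for completion, -d prints per-device stats
btrfs scrub status /mnt/test        # any csum errors mean the disk order is still wrong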

I then ran the backup again.

I then wrote down the WWNs of all the disks in order, so I could recreate the array with perccli. I also made a backup of the first and last 1 GB of data of the volume itself, in case the raid controller would overwrite those.
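
A hedged sketch of that volume-level backup (device and file names are assumptions):

# save the head and tail of the assembled RAID5 volume itself
sectors=$(blockdev --getsz /dev/md0)            # size in 512-byte sectors
dd if=/dev/md0 of=md0-head.img bs=1M count=1024
dd if=/dev/md0 of=md0-tail.img bs=512 skip=$((sectors - 2097152)) count=2097152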

Moving the volume back into the raid controller

Since about 14 TB of the data was not backed up (the data can be retrieved from elsewhere with some effort, and I was too impatient to wait for a copy), a full restore was not an option I looked forward to, so I tried to move the array back into the controller.

My first attempt was to format the array as a DDF container with the raid5 volume inside, using the same parameters as the controller, but unfortunately the megaraid controller - while using DDF itself - does not support importing "foreign" DDF and simply showed the disks as "unconfigured good".
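
For reference, that attempt looked roughly like this - a sketch assuming mdadm's DDF container support (the controller rejected the result anyway):

# create a DDF container and a single RAID5 volume inside it
mdadm --create /dev/md/ddf0 -e ddf -n 7 /dev/sd[fghiklm]
mdadm --create /dev/md/rv5 -l 5 -n 7 -c 256K --assume-clean /dev/md/ddf0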

I then tried to recreate the array simply by adding it again, e.g.:

perccli /c0 add vd r5 name=XXX drives=3,6,9,1,2,3,0 pdcache=off wb ra strip=256

Doing this on a booted system with perccli ensures that the raid controller will do a background initialise, which is not destructive and, with RAID5, will not even destroy data when the disk order or strip size is wrong, as long as you use exactly the disks from the original array (in any order), without leaving one out or adding extra ones.
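
A hedged way to double-check that the controller indeed chose a background initialise (the VD number is an assumption):

perccli /c0/v1 show bgi             # background-initialisation progress
perccli /c0/v1 show init            # should show no (destructive) foreground init running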

This is where I failed - somehow I bungled the order of the disks completely, and also managed to corrupt the first 1.5 MB of the volume. I have absolutely no idea what went wrong, but I tried many permutations and never saw the correct data, to the point where I thought the raid controller was somehow reordering my disks (it doesn't - it takes exactly the order specified).

Long story short, I attached the disks to the HBA again and tried, and failed, to make sense of it. This is where my original backup came in handy: although I had lost the order of the disks, I took a hard look at the backup and narrowed the candidate orderings down to two permutations simply by staring at hexdumps. Creating the array with mdadm and testing the data gave me the correct ordering.

I then again wrote down the order of WWNs, attached the disks to the controller, booted and did perccli /c0 add.... I then restored the first 1.5 MB of the volume (which included the GPT partition table and LVM labels, and some old leftover garbage data that was very useful when guessing what the order could be). A certain level of confidence in being able to undo mistakes is helpful in this situation.
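
The restore of the volume head was a plain dd from the earlier backup - a sketch, where /dev/sdX stands for whatever block device the re-created VD shows up as:

# write the saved first 1.5 MB back onto the re-created volume
dd if=md0-head.img of=/dev/sdX bs=512 count=3072 conv=notrunc,fsync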

Result: the array is back, btrfs is consistent, and the controller is now background-initialising, which makes the whole system slow for a few days, but that is a small price to pay.

Things Learned

I learned a great deal!

  1. The perc controllers (and likely all megaraid controllers) don't cope well with frequent, quick, intermittent disk problems - I suspect the disks going away and coming back quickly triggered a race condition in which the controller was trying to write the new configuration to the disks and only partially succeeded on some of them, eventually splitting the raid into two. This is clearly a firmware bug. But then, who would expect power cables to be faulty...

  2. mdadm is not very helpful in understanding or displaying DDF headers - I simply couldn't make sense of the displayed data, and as I found out when decoding the headers myself, this is because a lot of information is missing from --detail and --examine output. It is also not very helpful in experimenting, as it refuses to do a non-destructive read-only assemble.

  3. perc/megaraid controllers use the SNIA DDF format internally, and since this is a publicly accessible specification it was extremely useful to have - although in the end I figured out what I needed without it.

  4. Being able to guess the correct order of raid strips from data alone is very useful. Leftover garbage and other data that can help with this is also very useful. I will consider writing "disk 1", "disk 2" and so on into "empty" areas of my RAID volume headers from now on (there are long stretches of 0 bytes in the first 2MB).

  5. It is very easy to fuck up - device names, raid member numbers, WWNs, slot numbers and so on are all different, which means a lot of data to keep track of, and WWNs are long and my old eyes are not that good anymore. Plus, I am not well-organised and overly self-confident :/

  6. Creating and deleting an array using disks with data on them will not erase the data, at least with RAID5 and background initialisation. Foreground initialisation will almost certainly zero out the disks. That means you can create and delete the array as many times as you wish without risking data loss, with one possible exception: deleting an array sometimes requires the force option because the RAID controller thinks it is "in use" due to a valid partition label, and this might zero out the GPT label - YMMV, so make sure you have a backup of the first few megabytes just in case.

  7. Perc/megaraid controllers don't understand non-Dell/megaraid DDF containers. At least I didn't find a way to make my controller accept mdadm-created DDF containers. Being able to format the disks in GNU/Linux and move them back into the controller would have helped a lot and would have avoided many hours of grief on my side.

Summary

I got everything back without restoring from backup, at the expense of a few days of slow background initialisation. I wrote down my solution above in case some of it is useful to other people in similar situations.

  • Fantastic! And very useful. Incidentally, I had an experience of re-creating a MegaRAID array where the data was not erased, but that was RAID1 or RAID10, which I had accidentally removed from the OS using megacli (the system immediately died because it lived there); then I impressed my colleagues with how fast I brought it back to life. I just recreated the array :). And, as a commentary, *mdadm ... refuses to do a non-destructive read-only assemble* — this is where overlays come into play and make everything non-destructive, so they are mandatory in my opinion; so no problem with mdadm either. – Nikita Kipriyanov May 28 '22 at 03:26
  • I think tools should not refuse arbitrarily because they want to "protect" users. It's ok to have a level of protection, but there should be a force option of some kind - lvm2 is a good example: by default it protects you very well against mistakes, but if you know what you are doing you can force your way. Certainly, formatting a raid with data on it counts as "you have to know what you are doing". Just my opinion, of course. – Remember Monica May 30 '22 at 17:19

You can try detaching all drives from the powered-off server, then removing both groups, then reattaching the disks. That should reset all disks to the "foreign" state. Then try to import them all in a single operation.

In principle, this controller should use the SNIA DDF on-disk format. An HBA (unlike a RAID controller) won't interpret the metadata, allowing software to access it. So if you were able to connect the disks to a Linux machine through an HBA, it could detect and assemble this array using its MD RAID (Linux understands DDF and IMSM metadata in addition to its own), so you would at least be able to access the data on it. For example, if these drives are SATA, you can just connect them to the motherboard.
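
In practice that usually boils down to letting mdadm find the DDF container on the HBA-attached disks - a sketch:

mdadm --examine --scan              # should list a DDF container plus its member array(s)
mdadm --assemble --scan             # assemble the container and the member volumes
cat /proc/mdstat                    # the RAID5 volume should appear as an md device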

As a precaution, I'd dump all disks through the HBA to some backup storage, just in case something goes wrong.

Update: having seen your progress, I can suggest further steps.

You can try to tweak the metadata with a hex editor. Probably something like manually setting them to the same UUID is needed.

Another idea could be to recreate the array with mdadm --assume-clean, which only writes metadata and assembles the array, but skips zeroing the components.

  • First, guess the correct order and layout; this could be inferred from current metadata.
  • Build the overlays as described in the wiki; this will grant you unlimited attempts (see the sketch after this list).
  • When you succeed, repeat the successful assembly over the real drives, not the overlay devices.
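
A sketch of the overlay setup from the Linux RAID wiki, with device names and overlay sizes as assumptions:

# copy-on-write overlays so experiments never touch the real disks
for d in sdf sdg sdh sdi sdk sdl sdm; do
    sz=$(blockdev --getsz /dev/$d)                      # size in 512-byte sectors
    truncate -s 4G overlay-$d.img                       # sparse file that catches all writes
    lo=$(losetup -f --show overlay-$d.img)
    dmsetup create ovl-$d --table "0 $sz snapshot /dev/$d $lo P 8"
done
# experiment on /dev/mapper/ovl-*; tear down with dmsetup remove and losetup -d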

Also, before reassembly, I'd take another set of drives (free of valuable data), simulate (try to reproduce) the problem with them, and then try to repair those first, following these instructions.

Nikita Kipriyanov
  • Good advice - I know I can try to recover using mdadm, but of course I would like to avoid this. I have tried your suggestion and have updated my question - basically, when I do as you suggest, I get lots of foreign disks that I can import, after which I again have two foreign configs with the raid5 disks. – Remember Monica May 26 '22 at 19:02
  • Hi! I just wrote my own answer and only now saw your update. Hex editing is out, as the DDF data is CRC-protected and I was not able to replicate the CRCs easily. Also, AFAIK, --assume-clean only avoids a resync; mdadm does not zero data, ever. In essence, though, I independently came up with the method you suggested as well, with somewhat different tooling (kvm vs. overlays etc.). Most importantly, I finally solved some mysteries, e.g. whether perccli add/del erases data or not, and my understanding of the megaraid on-disk format is now quite good :) Anyway, thanks for your solid suggestions! – Remember Monica May 27 '22 at 23:22