
I had a RAID 5 array on /dev/md0, created with mdadm, that had been working fine for about a year. It consists of three 1 TB HDDs. A few days ago there was a power failure and the UPS failed as well; unfortunately it was not the first time.

The OS is on a separate SSD (/dev/sda) which is not part of the RAID array, so the system boots, but it can no longer mount the array. Sometimes /dev/md0 does not exist at all. I also did something that may have made things worse: I ran fsck /dev/sdb -y, which wrote to the disk many, many times.

I am afraid I won't be able to recover my files. Can you help me solve this problem?

Thanks.

mount /dev/md0 /mnt/raid5

mount: /dev/md0: can't read superblock

syslog:

Feb 25 15:59:53 pve kernel: [  365.559378] EXT4-fs (md0): unable to read superblock
Feb 25 15:59:53 pve kernel: [  365.560118] EXT4-fs (md0): unable to read superblock
Feb 25 15:59:53 pve kernel: [  365.560216] EXT4-fs (md0): unable to read superblock
Feb 25 15:59:53 pve kernel: [  365.560310] FAT-fs (md0): unable to read boot sector

cat /proc/mdstat

Personalities : [raid6] [raid5] [raid4] 
unused devices: <none>

fdisk -l

Disk /dev/sdb: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x75633c0d

Device     Boot Start        End    Sectors  Size Id Type
/dev/sdb1        2048 1950353407 1950351360  930G fd Linux raid autodetect

Disk /dev/sdc: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: F397C12B-1549-45EA-97EA-6A41B713B58F

Device     Start        End    Sectors  Size Type
/dev/sdc1   2048 1950353407 1950351360  930G Linux RAID

Disk /dev/sdd: 931.5 GiB, 1000203804160 bytes, 1953523055 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xcced27e3

Device     Boot Start        End    Sectors  Size Id Type
/dev/sdd1        2048 1950353407 1950351360  930G fd Linux raid autodetect

Sometimes fdisk -l fails with:

-bash: /sbin/fdisk: Input/output error

syslog:

Feb 25 16:03:25 pve kernel: [  577.221547] ata1.00: SRST failed (errno=-16)
Feb 25 16:03:25 pve kernel: [  577.232569] ata1.00: reset failed, giving up
Feb 25 16:03:25 pve kernel: [  577.232640] ata1.00: disabled
Feb 25 16:03:25 pve kernel: [  577.232643] ata1.01: disabled
Feb 25 16:03:25 pve kernel: [  577.232658] ata1: EH complete
Feb 25 16:03:25 pve kernel: [  577.232683] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb 25 16:03:25 pve kernel: [  577.232697] sd 0:0:0:0: [sda] tag#0 CDB: Write(10) 2a 00 05 13 a0 38 00 00 08 00
Feb 25 16:03:25 pve kernel: [  577.232702] blk_update_request: I/O error, dev sda, sector 85172280
Feb 25 16:03:25 pve kernel: [  577.232784] Buffer I/O error on dev dm-6, logical block 9255, lost sync page write
Feb 25 16:03:25 pve kernel: [  577.232928] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb 25 16:03:25 pve kernel: [  577.232936] sd 0:0:0:0: [sda] tag#0 CDB: Write(10) 2a 00 02 88 6a 10 00 00 68 00
Feb 25 16:03:25 pve kernel: [  577.232941] blk_update_request: I/O error, dev sda, sector 42494480
Feb 25 16:03:25 pve kernel: [  577.232948] EXT4-fs error (device dm-6): kmmpd:176: comm kmmpd-dm-6: Error writing to MMP block

EDIT 1:


sudo mdadm --examine /dev/sdb1

mdadm: No md superblock detected on /dev/sdb1.

sudo mdadm --examine /dev/sdc1

/dev/sdc1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 34c11bda:11bbb8c9:c4cf5f56:7c38e1c3
           Name : pve:0
  Creation Time : Sun Jun  5 21:06:33 2016
     Raid Level : raid5
   Raid Devices : 3

 Avail Dev Size : 1950089216 (929.88 GiB 998.45 GB)
     Array Size : 1950089216 (1859.75 GiB 1996.89 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=0 sectors
          State : active
    Device UUID : be76ecf7:b0f28a7d:718c3d58:3afae9f7

Internal Bitmap : 8 sectors from superblock
    Update Time : Mon Feb 20 14:48:51 2017
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : ffbc1988 - correct
         Events : 2901112

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 1
   Array State : .AA ('A' == active, '.' == missing, 'R' == replacing)

sudo mdadm --examine /dev/sdd1

/dev/sdd1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x9
     Array UUID : 34c11bda:11bbb8c9:c4cf5f56:7c38e1c3
           Name : pve:0
  Creation Time : Sun Jun  5 21:06:33 2016
     Raid Level : raid5
   Raid Devices : 3

 Avail Dev Size : 1950089216 (929.88 GiB 998.45 GB)
     Array Size : 1950089216 (1859.75 GiB 1996.89 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=0 sectors
          State : active
    Device UUID : 7b9ed6e0:ffad7603:b226e752:355765a8

Internal Bitmap : 8 sectors from superblock
    Update Time : Mon Feb 20 14:48:51 2017
  Bad Block Log : 512 entries available at offset 72 sectors - bad blocks present.
       Checksum : 19b6f3da - correct
         Events : 2901112

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 2
   Array State : .AA ('A' == active, '.' == missing, 'R' == replacing)
– kapcom01
4 Answers


Thanks to all I recovered the data.

I ran sudo mdadm --verbose --assemble --force /dev/md0 /dev/sdc1 /dev/sdd1 to assemble the array from the two remaining good HDDs and it worked!

Then I formatted sdb and re-added it to the array with sudo mdadm --manage /dev/md0 --add /dev/sdb1; I am going to buy a new drive to replace it soon. I am also looking into backup solutions.
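For reference, the resync progress after re-adding the disk can be followed with:

cat /proc/mdstat
mdadm --detail /dev/md0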

– kapcom01

If you get input/output errors, I think you have one or more bad disks. You need to check the SMART attributes of all disks with smartctl -a /dev/sdX. Check the status and Update Time of each disk with mdadm --examine /dev/sdX1. Pick the worst disk, the one with the most bad SMART attributes and the oldest Update Time, and remove it from the array.
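A quick sketch of that check, assuming the three array members are /dev/sdb, /dev/sdc and /dev/sdd as in the question:

for d in /dev/sdb /dev/sdc /dev/sdd; do
  echo "== $d =="
  # key SMART health indicators
  smartctl -a "$d" | grep -iE 'overall-health|reallocated|pending|uncorrectable'
  # md metadata freshness on the member partition
  mdadm --examine "${d}1" | grep -E 'State|Update Time|Events'
done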

If you have two bad disks, choose the less damaged one and copy it to a new disk with GNU ddrescue. Then remove that bad disk and insert the newly recovered disk in its place.
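A minimal GNU ddrescue run for that copy could look like this (device names are only an example: here the less damaged member is /dev/sdc and the new, equal-size disk is /dev/sde; the map file lets you stop and resume the copy):

# first pass: grab everything that reads cleanly, skip the slow scraping phase
ddrescue -f -n /dev/sdc /dev/sde rescue.map
# second pass: retry the remaining bad areas a few times
ddrescue -f -r3 /dev/sdc /dev/sde rescue.map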

Then you can re-create the RAID 5 array with one disk missing (for example sdc) with a command like:

mdadm --verbose --create /dev/md0 --chunk=512 --level=5 --raid-devices=3 /dev/sdb1 missing /dev/sdd1

Make sure the chunk parameter is the same as on the good disks.
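The chunk size (and layout) can be read from a surviving member, for example:

mdadm --examine /dev/sdd1 | grep -E 'Raid Level|Layout|Chunk Size'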

You also have a bad sda disk, which isn't a member of the RAID 5 array.

Be careful with each command; it is the only way to restore your RAID array.

Read this for an example.

– Mikhail Khirgiy
  • Hello, I have edited my question to include the mdadm --examine results. Is it safe to use --create? Wouldn't --assemble be better? Thanks. – kapcom01 Feb 27 '17 at 13:42
  • Also, SMART shows sdb has many errors; sdc and sdd seem OK. – kapcom01 Feb 27 '17 at 13:59
  • I ran `sudo mdadm --verbose --assemble --force /dev/md0 /dev/sdc1 /dev/sdd1` and it worked! I then formatted /dev/sdb and re-added it to the array with `sudo mdadm --manage /dev/md0 --add /dev/sdb1` because I don't have a new drive to replace it yet; I will buy one soon. I can mark your answer if you can add --assemble as an option. Thanks. – kapcom01 Feb 27 '17 at 14:24
  • You must make sure the RAID disks are good according to their SMART attributes. – Mikhail Khirgiy Feb 27 '17 at 14:57

Running fsck was the right idea, but I think you ran it on the wrong device. Try running fsck on /dev/md0 using a backup superblock. This link will give you some tips on how to find a backup superblock and repair with it. In particular, running dumpe2fs is your best bet for finding the filesystem block size. Even if the first backup superblock is corrupted, ext4 will have created others.
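A sketch of that, assuming the filesystem on /dev/md0 is ext4 (the 32768 location and 4096 block size below are only the common defaults; use the values dumpe2fs actually reports):

# find the block size and the backup superblock locations
dumpe2fs /dev/md0 | grep -iE 'block size|superblock'
# check the filesystem using a backup superblock
fsck.ext4 -b 32768 -B 4096 /dev/md0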

– Jeremy Dover
  • Oops...meant to make this a comment, since there's no certainty it will work given the fsck run on the underlying disk. But with the error mounting the RAID, this is a good thing to try. – Jeremy Dover Feb 25 '17 at 15:23

You have several problems.

First, you say that /dev/sda is your system disk, not part of a RAID array, with the OS on it. Well, look at the exact syslog snippet you showed us:

Feb 25 16:03:25 pve kernel: [  577.232702] blk_update_request: I/O error, dev sda, sector 85172280
Feb 25 16:03:25 pve kernel: [  577.232941] blk_update_request: I/O error, dev sda, sector 42494480

Two I/O errors during writes, reported within a millisecond of each other, to two different locations, on the system disk. Your system disk is having serious problems; get it replaced immediately. It might well be worth replacing the cabling to it too, while you are at it. In my experience, I/O errors are usually indicative of either cabling or disk problems (though the HBA can be at fault). Expect data on the system disk to be corrupted to at least some degree as a result of this problem.

Second, fsck /dev/sdb -y very likely scribbled all over your RAID data in attempting to make sense of partial filesystem data and automatically writing out whatever it thought looked right. I would suggest physically disconnecting that disk, removing it from the system, and putting it somewhere safe for now. Treat it as dead.

Thankfully, you are lucky; the system is still talking to all three disks, and the metadata looks sane on the two disks out of the three that still hold md metadata.

Grab three new disks, and use ddrescue to copy everything that you can from the two remaining disks onto two new ones. Unplug the old disks and set them aside together with what used to be /dev/sdb (make sure you keep track of which disk is which), and plug in the two new disks along with the third new, blank one.

Feed the resulting array to mdadm and pray to your deity of choice that md will be able to make sense of the resulting situation. If you are lucky, it will be able to, and will restore most of the data to readable condition now that there are no read errors (since you brought in new disks). Again, there may be some corruption in places.
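As a sketch, assuming the rescued copies of the two surviving members come up as /dev/sdc1 and /dev/sdd1 again and the new blank disk is /dev/sde (adjust to your actual device names), that would be something like:

# assemble the array degraded from the two rescued members
mdadm --verbose --assemble --force /dev/md0 /dev/sdc1 /dev/sdd1
# once the data mounts and checks out, partition the blank disk the same way,
# then add it and let md resync
mdadm --manage /dev/md0 --add /dev/sde1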

Third, figure out what caused the UPS failure and correct that, and set up regular backups so that if the worst happens, at least you will have a backup that you can restore onto new media. Consider this incident a learning experience illustrating why RAID is not a backup.

– user
  • Thank you! I saw that sda also gives errors. I will replace everything eventually. The RAID is working now! – kapcom01 Feb 28 '17 at 18:55