
Running Fedora 32 with four drives connected to a 4-port eSATA enclosure. One of the drives is clearly failing, with this message in the logs:

smartd[1169]: Device: /dev/sdd [SAT], FAILED SMART self-check. BACK UP DATA NOW!
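For completeness, the standard smartmontools checks on the drive itself (nothing array-specific):

smartctl -H /dev/sdd    # overall health verdict only
smartctl -a /dev/sdd    # full attributes and self-test log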

Here's the mdadm detail:

mdadm -D /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Fri Mar 13 16:46:35 2020
        Raid Level : raid10
        Array Size : 2930010928 (2794.28 GiB 3000.33 GB)
     Used Dev Size : 1465005464 (1397.14 GiB 1500.17 GB)
      Raid Devices : 4
     Total Devices : 2
       Persistence : Superblock is persistent

       Update Time : Mon Jun  8 17:33:23 2020
             State : clean, degraded
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

            Layout : near=2
        Chunk Size : 8K

Consistency Policy : resync

              Name : ourserver:0  (local to host ourserver)
              UUID : 88b9fcb6:52d0f235:849bd9d6:c079cfc8
            Events : 898705

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync set-A   /dev/sda1
       -       0        0        1      removed
       3       8       49        2      active sync set-A   /dev/sdd1
       -       0        0        3      removed

What I don't understand is what happened to the other 2 drives that were part of the RAID10.

lsblk
NAME                     MAJ:MIN RM   SIZE RO TYPE   MOUNTPOINT
sda                        8:0    0   1.4T  0 disk
└─sda1                     8:1    0   1.4T  0 part
  └─md0                    9:0    0   2.7T  0 raid10
sdb                        8:16   0   1.4T  0 disk
└─sdb1                     8:17   0   1.4T  0 part
sdc                        8:32   0   1.8T  0 disk
└─sdc1                     8:33   0   1.8T  0 part
sdd                        8:48   0   1.4T  0 disk
└─sdd1                     8:49   0   1.4T  0 part
  └─md0                    9:0    0   2.7T  0 raid10

and:

blkid
/dev/sda1: UUID="88b9fcb6-52d0-f235-849b-d9d6c079cfc8" UUID_SUB="7df3d233-060a-aac3-04eb-9f3a65a9119e" LABEL="ourserver:0" TYPE="linux_raid_member" PARTUUID="0001b5c0-01"
/dev/sdb1: UUID="88b9fcb6-52d0-f235-849b-d9d6c079cfc8" UUID_SUB="64e3cedc-90db-e299-d786-7d096896f28f" LABEL="ourserver:0" TYPE="linux_raid_member" PARTUUID="00ff416d-01"
/dev/sdc1: UUID="88b9fcb6-52d0-f235-849b-d9d6c079cfc8" UUID_SUB="6d0134e3-1358-acfd-9c86-2967aec370c2" LABEL="ourserver:0" TYPE="linux_raid_member" PARTUUID="7da9b00e-01"
/dev/sdd1: UUID="88b9fcb6-52d0-f235-849b-d9d6c079cfc8" UUID_SUB="b1dd6f8b-a8e4-efa7-72b7-f987e71edeb2" LABEL="ourserver:0" TYPE="linux_raid_member" PARTUUID="b3de33a7-b2ea-f24e-903f-bae80136d543"


 cat /proc/mdstat
Personalities : [raid10]
md0 : active raid10 sda1[0] sdd1[3]
      2930010928 blocks super 1.2 8K chunks 2 near-copies [4/2] [U_U_]

unused devices: <none>

Originally I used these 2 commands to examine the members and grow the array into the RAID10:

mdadm -E /dev/sda1 /dev/sdb1 /dev/sdd1 /dev/sdg1

mdadm --grow /dev/md0 --level=10 --backup-file=/home/backup-md0 --raid-devices=4 --add /dev/sdb1 /dev/sdd1 /dev/sdg1

After a few reboots, the /dev/sdX naming (where X is a drive letter) changed. For the moment I don't have an mdadm.conf file, and I ran mdadm --assemble --force /dev/md0 /dev/sd[abcd]1 to at least get the data back; that's why /dev/sdb and /dev/sdc no longer show the raid10 TYPE and have no md0 under /dev/sdb1 and /dev/sdc1 in the lsblk output above. How can I get the 2 other drives, /dev/sdb and /dev/sdc, back into the RAID10 and then just fail /dev/sdd until I get a replacement? Or is there a better approach?
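As a stopgap against the names shifting again, I believe pinning the array by UUID would look roughly like this (assuming Fedora's standard /etc/mdadm.conf location):

# Record the array by UUID so assembly stops depending on /dev/sdX names
mdadm --detail --scan >> /etc/mdadm.conf
# Regenerate the initramfs so early boot sees the config (Fedora uses dracut)
dracut -f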

You can see from fdisk -l that the 2 drives are partitioned to be part of the RAID10:

Disk /dev/sda: 1.37 TiB, 1500301910016 bytes, 2930277168 sectors
Disk model: ST31500341AS
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x0001b5c0

Device     Boot Start        End    Sectors  Size Id Type
/dev/sda1        2048 2930277167 2930275120  1.4T fd Linux raid autodetect

Disk /dev/sdb: 1.37 TiB, 1500301910016 bytes, 2930277168 sectors
Disk model: ST31500341AS
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x00ff416d

Device     Boot Start        End    Sectors  Size Id Type
/dev/sdb1        2048 2930277167 2930275120  1.4T fd Linux raid autodetect

Disk /dev/sdc: 1.84 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: ST2000DM001-1ER1
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x7da9b00e

Device     Boot Start        End    Sectors  Size Id Type
/dev/sdc1        2048 3907029167 3907027120  1.8T fd Linux raid autodetect

Disk /dev/sdd: 1.37 TiB, 1500301910016 bytes, 2930277168 sectors
Disk model: ST31500341AS
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: DC9A2601-CFE8-4ADD-85CD-FCBEBFCD8FAF

Device     Start        End    Sectors  Size Type
/dev/sdd1     34 2930277134 2930277101  1.4T Linux RAID

And examining all 4 of the drives shows they are active:

mdadm --examine /dev/sda1
/dev/sda1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 88b9fcb6:52d0f235:849bd9d6:c079cfc8
           Name : ourserver:0  (local to host ourserver)
  Creation Time : Fri Mar 13 16:46:35 2020
     Raid Level : raid10
   Raid Devices : 4

 Avail Dev Size : 2930010944 (1397.14 GiB 1500.17 GB)
     Array Size : 2930010928 (2794.28 GiB 3000.33 GB)
  Used Dev Size : 2930010928 (1397.14 GiB 1500.17 GB)
    Data Offset : 264176 sectors
   Super Offset : 8 sectors
   Unused Space : before=264096 sectors, after=16 sectors
          State : clean
    Device UUID : 7df3d233:060aaac3:04eb9f3a:65a9119e

    Update Time : Mon Jun  8 17:33:23 2020
  Bad Block Log : 512 entries available at offset 16 sectors
       Checksum : 6ad0f3f7 - correct
         Events : 898705

         Layout : near=2
     Chunk Size : 8K

   Device Role : Active device 0
   Array State : A.A. ('A' == active, '.' == missing, 'R' == replacing)

mdadm --examine /dev/sdb1
/dev/sdb1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 88b9fcb6:52d0f235:849bd9d6:c079cfc8
           Name : ourserver:0  (local to host ourserver)
  Creation Time : Fri Mar 13 16:46:35 2020
     Raid Level : raid10
   Raid Devices : 4

 Avail Dev Size : 2930010944 (1397.14 GiB 1500.17 GB)
     Array Size : 2930010928 (2794.28 GiB 3000.33 GB)
  Used Dev Size : 2930010928 (1397.14 GiB 1500.17 GB)
    Data Offset : 264176 sectors
   Super Offset : 8 sectors
   Unused Space : before=263896 sectors, after=16 sectors
          State : clean
    Device UUID : 64e3cedc:90dbe299:d7867d09:6896f28f

    Update Time : Wed Mar 18 11:50:09 2020
  Bad Block Log : 512 entries available at offset 264 sectors
       Checksum : aa48b164 - correct
         Events : 37929

         Layout : near=2
     Chunk Size : 8K

   Device Role : Active device 3
   Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)

mdadm --examine /dev/sdc1
/dev/sdc1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 88b9fcb6:52d0f235:849bd9d6:c079cfc8
           Name : ourserver:0  (local to host ourserver)
  Creation Time : Fri Mar 13 16:46:35 2020
     Raid Level : raid10
   Raid Devices : 4

 Avail Dev Size : 3906762944 (1862.89 GiB 2000.26 GB)
     Array Size : 2930010928 (2794.28 GiB 3000.33 GB)
  Used Dev Size : 2930010928 (1397.14 GiB 1500.17 GB)
    Data Offset : 264176 sectors
   Super Offset : 8 sectors
   Unused Space : before=263896 sectors, after=976752016 sectors
          State : active
    Device UUID : 6d0134e3:1358acfd:9c862967:aec370c2

    Update Time : Sun May 10 16:22:39 2020
  Bad Block Log : 512 entries available at offset 264 sectors
       Checksum : df218e12 - correct
         Events : 97380

         Layout : near=2
     Chunk Size : 8K

   Device Role : Active device 1
   Array State : AAA. ('A' == active, '.' == missing, 'R' == replacing)

mdadm --examine /dev/sdd1
/dev/sdd1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 88b9fcb6:52d0f235:849bd9d6:c079cfc8
           Name : ourserver:0  (local to host ourserver)
  Creation Time : Fri Mar 13 16:46:35 2020
     Raid Level : raid10
   Raid Devices : 4

 Avail Dev Size : 2930012925 (1397.14 GiB 1500.17 GB)
     Array Size : 2930010928 (2794.28 GiB 3000.33 GB)
  Used Dev Size : 2930010928 (1397.14 GiB 1500.17 GB)
    Data Offset : 264176 sectors
   Super Offset : 8 sectors
   Unused Space : before=263896 sectors, after=1997 sectors
          State : clean
    Device UUID : b1dd6f8b:a8e4efa7:72b7f987:e71edeb2

    Update Time : Mon Jun  8 17:33:23 2020
  Bad Block Log : 512 entries available at offset 264 sectors
       Checksum : 8da0376 - correct
         Events : 898705

         Layout : near=2
     Chunk Size : 8K

   Device Role : Active device 2
   Array State : A.A. ('A' == active, '.' == missing, 'R' == replacing)

Can I try the --force and --assemble options as mentioned by this user, or should I try the --replace option mentioned here?
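To spell out the two options I mean, roughly (with /dev/sde1 standing in for a hypothetical replacement disk):

# Option A: forced re-assembly, as I already ran once
mdadm --stop /dev/md0
mdadm --assemble --force /dev/md0 /dev/sd[abcd]1

# Option B: hot-replace the failing member once a new disk is attached
mdadm /dev/md0 --add /dev/sde1
mdadm /dev/md0 --replace /dev/sdd1 --with /dev/sde1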

Edit: Now I'm seeing this after the resync:

mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Fri Mar 13 16:46:35 2020
        Raid Level : raid10
        Array Size : 2930010928 (2794.28 GiB 3000.33 GB)
     Used Dev Size : 1465005464 (1397.14 GiB 1500.17 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

       Update Time : Tue Jun  9 15:51:31 2020
             State : clean, degraded
    Active Devices : 3
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 1

            Layout : near=2
        Chunk Size : 8K

Consistency Policy : resync

              Name : ourserver:0  (local to host ourserver)
              UUID : 88b9fcb6:52d0f235:849bd9d6:c079cfc8
            Events : 1083817

    Number   Major   Minor   RaidDevice State
       0       8       81        0      active sync set-A   /dev/sdf1
       4       8       33        1      active sync set-B   /dev/sdc1
       3       8       17        2      active sync set-A   /dev/sdb1
       -       0        0        3      removed

       5       8        1        -      spare   /dev/sda1

cat /proc/mdstat
Personalities : [raid10]
md0 : active raid10 sda1[5](S)(R) sdf1[0] sdb1[3] sdc1[4]
      2930010928 blocks super 1.2 8K chunks 2 near-copies [4/3] [UUU_]
unused devices: <none>

Now I'm seeing this in the logs:

Jun  9 15:51:31 ourserver kernel: md: recovery of RAID array md0
Jun  9 15:51:31 ourserver kernel: md/raid10:md0: insufficient working devices for recovery.
Jun  9 15:51:31 ourserver kernel: md: md0: recovery interrupted.
Jun  9 15:51:31 ourserver kernel: md: super_written gets error=10
Jun  9 15:53:23 ourserver kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO

And trying to fail /dev/sdb results in:

mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm: set device faulty failed for /dev/sdb1:  Device or resource busy

How do I promote the spare drive and fail /dev/sdb?
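In case something outside of md is holding the partition open, a few checks I can think of:

ls /sys/class/block/sdb1/holders/   # kernel-level holders (md, device-mapper)
grep sdb1 /proc/mounts              # a directly mounted filesystem?
pvs | grep sdb1                     # a stray LVM physical volume?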

RobbieTheK
  • The drives that are in the array have an events count of `898705`. The ones that aren't have a count of `97380` and `37929`. Whatever kicked them out may have done it quite some time ago. Check back in the logs, possibly going back to when you created it. – Mike Andrews Jun 09 '20 at 19:39
  • Thanks @mike-andrews, I just updated the original post. Any idea how to fail /dev/sdb and promote /dev/sda? – RobbieTheK Jun 09 '20 at 20:49
  • Check around to make sure that `/dev/sdb1` isn't being used directly by anything else, not through md. Like, make sure there isn't a filesystem on it that's mounted somewhere, or that LVM hasn't somehow detected a volume on there. – Mike Andrews Jun 10 '20 at 01:59
  • And, as @shodanshok says below, make sure you have backups before doing anything else. – Mike Andrews Jun 10 '20 at 02:00

1 Answer


You are effectively running with no redundancy and with a soon-to-fail disk.

Before doing anything, take backups! If you have many files to back up, I recommend first taking a block-level copy of the failing disk via ddrescue /dev/sdd </dev/anotherdisk>, where /dev/anotherdisk is an additional disk (even a USB one).
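Something along these lines; the target device and mapfile path are placeholders, and the mapfile is what lets ddrescue resume and retry bad areas later:

# -f is needed because the output is a block device
ddrescue -f /dev/sdd /dev/anotherdisk /root/sdd-rescue.map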

After you have both file- and block-level backups, you can try to salvage the array with the following command:

mdadm /dev/md0 --add /dev/sdb1 /dev/sdc1

However, please strongly consider recreating the array from scratch, as you are using a very small chunk size (8K), which will severely impair performance (a good default chunk size is 512K).
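As a sketch only, to be run strictly after verified backups since it destroys the current contents (device names are illustrative):

mdadm --create /dev/md0 --level=10 --layout=n2 --chunk=512 \
      --raid-devices=4 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sde1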

UPDATE: I just noticed you further damaged the array with a forced assembly that left sda as a spare. Moreover, an extraneous sdf appeared. By forcing the array assembly with such out-of-date disks, you have probably lost any chance of recovering the array. I strongly advise you to contact a proper data recovery specialist.

shodanshok
  • The server was rebooted, so `/dev/sdf` is just a renamed device, since the UUID was not used for assembly. I'm hoping to replace the bad `/dev/sdb` and then assemble and resync. I could not recover the files once the input/output errors started, so that's why I did `mdadm --stop /dev/md0`. Would it be possible to start the RAID 10 with 2 disks [as mentioned here](https://serverfault.com/a/43712/359447)? – RobbieTheK Jun 10 '20 at 03:20
  • @RobbieTheK yes, a 2-disk RAID 10 array *can* be started if you lose the "right" two disks. However, in this case, I restate the suggestion to contact a proper technician. – shodanshok Jun 10 '20 at 07:46
  • I am the technician. I was able to create a new `/dev/md1` with 2 drives and 2 marked missing, as noted in [this thread](https://serverfault.com/a/43712/359447). What I'm stuck on is this comment: "2) Format and mount. The `/dev/md1` should be immediately usable, but need to be formatted and then mounted." Wouldn't formatting kill the data? – RobbieTheK Jun 11 '20 at 01:26
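For reference, a degraded assemble (as opposed to re-creating and formatting) is the non-destructive way to start an array with missing members; a sketch using the two surviving partitions:

mdadm --assemble --run /dev/md1 /dev/sda1 /dev/sdd1   # --run starts it despite missing devices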