14

Synology has a customized version of the md driver and mdadm toolset that adds a 'DriveError' flag to the rdev->flags structure in the kernel.

Net effect: if you are unfortunate enough to get an array failure (first drive) combined with an error on a second drive, the array gets into a state where it won't let you repair or reconstruct the array, even though reads from the drive are working fine.
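
For what it's worth, the flag shows up as a trailing (E) on the affected member in /proc/mdstat (see the output below), so a quick check is a simple grep. The (E) marker itself is Synology-specific and won't appear on stock md:

grep '(E)' /proc/mdstat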

At this point I'm not really worried about this question from the point of view of THIS array, since I've already pulled the content off and intend to reconstruct it. I'm asking because I want a resolution path for the future: this is the second time I've been bitten by it, and I know I've seen others asking similar questions in forums.

Synology support has been less than helpful (and mostly non-responsive), and won't share any information AT ALL on dealing with the raidsets on the box.

Contents of /proc/mdstat:

ds1512-ent> cat /proc/mdstat 
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
md2 : active raid5 sdb5[1] sda5[5](S) sde5[4](E) sdd5[3] sdc5[2]
      11702126592 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/4] [_UUUE]

md1 : active raid1 sdb2[1] sdd2[3] sdc2[2] sde2[4] sda2[0]
      2097088 blocks [5/5] [UUUUU]

md0 : active raid1 sdb1[1] sdd1[3] sdc1[2] sde1[4] sda1[0]
      2490176 blocks [5/5] [UUUUU]

unused devices: <none>

Status from an mdadm --detail /dev/md2:

/dev/md2:
        Version : 1.2
  Creation Time : Tue Aug  7 18:51:30 2012
     Raid Level : raid5
     Array Size : 11702126592 (11160.02 GiB 11982.98 GB)
  Used Dev Size : 2925531648 (2790.00 GiB 2995.74 GB)
   Raid Devices : 5
  Total Devices : 5
    Persistence : Superblock is persistent

    Update Time : Fri Jan 17 20:48:12 2014
          State : clean, degraded
 Active Devices : 4
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 64K

           Name : MyStorage:2
           UUID : cbfdc4d8:3b78a6dd:49991e1a:2c2dc81f
         Events : 427234

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       21        1      active sync   /dev/sdb5
       2       8       37        2      active sync   /dev/sdc5
       3       8       53        3      active sync   /dev/sdd5
       4       8       69        4      active sync   /dev/sde5

       5       8        5        -      spare   /dev/sda5

As you can see, /dev/sda5 has been re-added to the array (it was the drive that outright failed), but even though md sees the drive as a spare, it won't rebuild onto it. /dev/sde5 in this case is the problem drive, with the (E) DriveError state (the trailing E in [_UUUE] above).

I have tried stopping the md device, running forced reassembles, removing/re-adding sda5 from the device, etc. No change in behavior.
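
For reference, the attempts were along these lines (device names as in the output above; none of them changed anything):

mdadm --stop /dev/md2
mdadm --assemble --force /dev/md2 /dev/sda5 /dev/sdb5 /dev/sdc5 /dev/sdd5 /dev/sde5
mdadm --manage /dev/md2 --remove /dev/sda5
mdadm --manage /dev/md2 --add /dev/sda5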

I was able to completely recreate the array with the following command:

mdadm --stop /dev/md2
mdadm --verbose \
   --create /dev/md2 --chunk=64 --level=5 \
   --raid-devices=5 missing /dev/sdb5 /dev/sdc5 /dev/sdd5 /dev/sde5

which brought the array back to this state:

md2 : active raid5 sde5[4] sdd5[3] sdc5[2] sdb5[1]
      11702126592 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/4] [_UUUU]
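
One note in hindsight: before running a destructive --create like this, it's worth capturing the existing superblock parameters from a surviving member, so the chunk size, layout, and device order of the new array match the old one exactly. Any surviving member will do:

mdadm --examine /dev/sdb5

The Chunk Size, Layout, and Device Role fields in that output are the ones the --create flags above need to reproduce.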

I then re-added /dev/sda5:

mdadm --manage /dev/md2 --add /dev/sda5

after which it started a rebuild:

md2 : active raid5 sda5[5] sde5[4] sdd5[3] sdc5[2] sdb5[1]
      11702126592 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/4] [_UUUU]
      [>....................]  recovery =  0.1% (4569508/2925531648) finish=908.3min speed=53595K/sec
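
To keep an eye on the rebuild without re-running cat by hand, something like this works (assuming watch is available in the DSM shell, which may depend on the busybox build):

watch -n 60 cat /proc/mdstat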

Note that the position of "missing" in the --create command matches the exact position of the missing slot (RaidDevice 0): the device order given to --create determines slot assignment, so it has to match the original layout or the data and parity won't line up.

Once this finishes, I think I'll probably pull the questionable drive and have it rebuild again.

I am looking for any suggestions on whether there is a "less scary" way to do this repair, or from anyone who has gone through this experience with a Synology array and knows how to force it to rebuild other than taking the md device offline and recreating the array from scratch.

Nathan Neulinger
  • I find myself in a similar situation. Did you resolve this successfully? – dvorak Jan 30 '14 at 21:10
  • Yes, I was able to get the array rebuilt following the above steps. I did follow it up with clearing it and changing from R5 to R6, though, because at this point I'm seriously unhappy enough with the "tank the whole array" behavior of Synology that I wanted to make sure it could tolerate more than one drive "failing". In our case, the second drive that had the "glitch" error passed extended SMART tests without a single issue. – Nathan Neulinger Jan 31 '14 at 21:36
  • Thanks for the helpful guide. I'm not too confident fiddling with all this, I'm no raid specialist. I now face the same issue but in my case, I have a single disk RAID 1 array (/dev/md3) with /dev/sde3 being marked with the dreaded [E]. I assume that it should be possible for me to follow the same steps as you did, but since that's the single disk of the array I don't know what it'll do ;-). Anyhow the mdadm --stop /dev/md3 command fails (Device or resource busy). I guess I'll Google a bit longer.. =) – dSebastien May 17 '15 at 20:23
  • If you can't stop the array, sounds like something is using it - i.e. it's mounted, or there is some other task running against that device. – Nathan Neulinger May 18 '15 at 21:28
  • 2
    Fortunately for me Synology helped me fix the issue. They were kind enough to provide me with the commands they ran. I've put the information on my blog in case someone else runs into this issue: http://www.dsebastien.net/2015/05/19/recovering-a-raid-array-in-e-state-on-a-synology-nas/ – dSebastien May 19 '15 at 09:43
  • @dSebastien I have a similar issue. The blog no longer opens up, can you please help out here https://serverfault.com/questions/1073904/synology-nas-volume-crashed-shr-raid – Gaurav Shah Aug 09 '21 at 03:54
  • Blog link via Wayback Machine: https://web.archive.org/web/20210226133602/http://www.dsebastien.net/2015/05/19/recovering-a-raid-array-in-e-state-on-a-synology-nas/ – StanTastic Nov 25 '21 at 10:21
  • By the way, this helped me twice already :-) – StanTastic Dec 28 '21 at 11:45

2 Answers

3

Just an addition to the solution above, which I found after I experienced the same issue. I followed dSebastien's blog post on how to re-create the array:

I found that his method of recreating the array worked better than the method above. However, after re-creating the array, the volume was still not showing on the web interface. None of my LUNs were showing; it was basically showing a new array with nothing configured.

I contacted Synology support, and they remoted in to fix the issue. Unfortunately, they remoted in whilst I was away from the console, but I did manage to capture the session and looked through what they did. Whilst trying to recover some of my data, the drive crashed again and I was back in the same situation. I recreated the array as in dSebastien's blog and then walked through the captured Synology session to perform their update. After running the commands below, my array and LUNs appeared on the web interface, and I was able to work with them.

I have practically zero experience in Linux, but these were the commands I performed in my situation. Hope this can help someone else, but please use this at your own risk. It would be best to contact Synology support and have them fix this for you, as your situation might be different from this one:

DiskStation> synocheckiscsitrg
synocheckiscsitrg: Pass 

DiskStation> synocheckshare
synocheckshare: Pass SYNOICheckShare()
synocheckshare: Pass SYNOICheckShareExt()
synocheckshare: Pass SYNOICheckServiceLink()
synocheckshare: Pass SYNOICheckAutoDecrypt()
synocheckshare: Pass SYNOIServiceShareEnableDefaultDS()

DiskStation> spacetool --synoblock-enum
****** Syno-Block of /dev/sda ******
//I've removed the output. This should display info about each disk in your array

DiskStation> vgchange -ay
  # logical volume(s) in volume group "vg1" now active

DiskStation> dd if=/dev/vg1/syno_vg_reserved_area of=/root/reserved_area.img
24576+0 records in
24576+0 records out

DiskStation> synospace --map_file -d
Success to dump space info into '/etc/space,/tmp/space'

DiskStation> synocheckshare
synocheckshare: Pass SYNOICheckShare()
synocheckshare: Pass SYNOICheckShareExt()
synocheckshare: Pass SYNOICheckServiceLink()
synocheckshare: Pass SYNOICheckAutoDecrypt()
synocheckshare: Pass SYNOIServiceShareEnableDefaultDS()

DiskStation> synocheckiscsitrg
synocheckiscsitrg: Not Pass, # conflict 

DiskStation> synocheckiscsitrg
synocheckiscsitrg: Pass 
Nirvaan
2

Another addition: I've hit a very similar issue with my one-disk / RAID level 0 device.

Synology support was very helpful and restored my device. Here's what happened, hope this helps others:

My disk had read errors on one particular block, and the messages in the system log (dmesg) were:

[4421039.097278] ata1.00: read unc at 105370360
[4421039.101579] lba 105370360 start 9437184 end 5860528064
[4421039.106917] sda3 auto_remap 0
[4421039.110097] ata1.00: exception Emask 0x0 SAct 0x2 SErr 0x0 action 0x6
[4421039.116744] ata1.00: edma_err_cause=00000084 pp_flags=00000003, dev error, EDMA self-disable
[4421039.125410] ata1.00: failed command: READ FPDMA QUEUED
[4421039.130767] ata1.00: cmd 60/00:08:b8:d2:47/02:00:06:00:00/40 tag 1 ncq 262144 in
[4421039.130772]          res 41/40:00:f8:d2:47/00:00:06:00:00/40 Emask 0x409 (media error) <F>
[4421039.146855] ata1.00: status: { DRDY ERR }
[4421039.151064] ata1.00: error: { UNC }
[4421039.154758] ata1: hard resetting link
[4421039.667234] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl F300)
[4421039.887286] ata1.00: configured for UDMA/133
[4421039.891777] ata1: UNC RTF LBA Restored
[4421039.895745] ata1: EH complete

A few seconds later I received the dreadful "Volume 1 has crashed" mail from my device.

-- Disclaimer: Be sure to replace the device names with yours, and do not simply copy & paste these commands, as this might make things worse! --

After stopping smb I was able to re-mount the partition read-only and run e2fsck with a badblocks check (-c):

umount /dev/md2
e2fsck -C 0 -v -f -c /dev/md2

(One could also use e2fsck -C 0 -p -v -f -c /dev/md2 to run as unattended as possible, although this didn't work out in my case, because the errors had to be fixed manually, so I had to restart e2fsck. Conclusion: -p doesn't make much sense in the case of disk errors.)
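
For reference, the -c option makes e2fsck run badblocks(8) to do a read-only scan of the device and add anything it finds to the bad block inode. If you want to see the damage before fsck changes anything, the scan can also be run on its own, e.g.:

badblocks -sv /dev/md2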

Although e2fsck was able to fix the errors, and smartctl also showed no further increase in Raw_Read_Error_Rate, the device still would not mount the volume in read-write mode. DSM still showed "volume crashed".
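
For reference, that SMART attribute can be checked with stock smartmontools (assuming smartctl is available on the box; adjust the device name to yours):

smartctl -A /dev/sda | grep Raw_Read_Error_Rate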

So I opened a ticket with support. It took quite a while to get things going at first, but in the end they fixed it by rebuilding the RAID array with:

synospace --stop-all-spaces
syno_poweroff_task -d 
mdadm -Sf /dev/md2
mdadm -AfR /dev/md2 /dev/sda3

Be sure to check your device names (/dev/mdX and /dev/sdaX) before doing anything. cat /proc/mdstat will show the relevant information.
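
Once the array is reassembled, mdadm --detail gives a fuller picture than /proc/mdstat if you want to confirm the state before mounting anything:

mdadm --detail /dev/md2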

GWu
  • I can't thank you enough for this guide. It's the last 3 commands that did the trick for me, plus a btrfsck :-) – StanTastic Nov 25 '21 at 10:51