I recently installed three new disks in my QNAP TS-412 NAS.
The new disks were to be combined with the existing disk into a 4-disk RAID5 array, so I started the migration process.
After multiple tries (each taking about 24 hours) the migration seemed to work but resulted in a non-responsive NAS.
At that point I reset the NAS. Everything went downhill from there:
- The NAS boots but marks the first disk as failed and removes it from all arrays, leaving them degraded.
- I ran checks on the disk and can't find any issues with it (which would be weird anyway, as it's almost new).
- The admin interface didn't offer any recovery options, so I figured I'd just do it manually.
I've successfully rebuilt all QNAP internal RAID1 arrays using mdadm (being /dev/md4, /dev/md13 and /dev/md9), leaving only the RAID5 array, /dev/md0.
I've tried this multiple times now, using these commands:
mdadm -w /dev/md0
(Required as the array was mounted read-only by the NAS after removing /dev/sda3 from it. Can't modify the array in RO mode.)
mdadm /dev/md0 --re-add /dev/sda3
After this the array starts rebuilding. It stalls at 99.9% though, while the system becomes extremely slow and/or unresponsive. (Logging in over SSH fails most of the time.)
Current state of things:
[admin@nas01 ~]# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md4 : active raid1 sdd2[2](S) sdc2[1] sdb2[0]
530048 blocks [2/2] [UU]
md0 : active raid5 sda3[4] sdd3[3] sdc3[2] sdb3[1]
8786092608 blocks super 1.0 level 5, 64k chunk, algorithm 2 [4/3] [_UUU]
[===================>.] recovery = 99.9% (2928697160/2928697536) finish=0.0min speed=110K/sec
md13 : active raid1 sda4[0] sdb4[1] sdd4[3] sdc4[2]
458880 blocks [4/4] [UUUU]
bitmap: 0/57 pages [0KB], 4KB chunk
md9 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[1]
530048 blocks [4/4] [UUUU]
bitmap: 2/65 pages [8KB], 4KB chunk
unused devices: <none>
(It has been stalled at 2928697160/2928697536 for hours now.)
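To put that in perspective, here is a quick shell calculation of how little is left. It assumes the counters in /proc/mdstat are 1 KiB blocks and that the reported speed is in KiB per second; the variable names are just mine.

```shell
# Numbers copied from the recovery line in /proc/mdstat
# (assumed to be 1 KiB blocks).
done_kib=2928697160
total_kib=2928697536

remaining_kib=$((total_kib - done_kib))
echo "remaining: ${remaining_kib} KiB"

# Speed as reported by mdstat (110K/sec), assumed to be KiB per second.
speed_kib_per_sec=110
echo "expected time left: about $((remaining_kib / speed_kib_per_sec)) s"
```

So only a few hundred KiB remain, which should take seconds at the reported speed, not hours.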
[admin@nas01 ~]# mdadm -D /dev/md0
/dev/md0:
Version : 01.00.03
Creation Time : Thu Jan 10 23:35:00 2013
Raid Level : raid5
Array Size : 8786092608 (8379.07 GiB 8996.96 GB)
Used Dev Size : 2928697536 (2793.02 GiB 2998.99 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Mon Jan 14 09:54:51 2013
State : clean, degraded, recovering
Active Devices : 3
Working Devices : 4
Failed Devices : 0
Spare Devices : 1
Layout : left-symmetric
Chunk Size : 64K
Rebuild Status : 99% complete
Name : 3
UUID : 0c43bf7b:282339e8:6c730d6b:98bc3b95
Events : 34111
Number Major Minor RaidDevice State
4 8 3 0 spare rebuilding /dev/sda3
1 8 19 1 active sync /dev/sdb3
2 8 35 2 active sync /dev/sdc3
3 8 51 3 active sync /dev/sdd3
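As a sanity check on the mdadm output: a RAID5 array with n members should offer (n - 1) times the per-device size, since one device's worth of space holds parity. A small shell sketch using the values reported above (the variable names are mine):

```shell
# Values copied from the mdadm -D output, in KiB.
raid_devices=4
used_dev_size_kib=2928697536
array_size_kib=8786092608

# RAID5 usable capacity: (n - 1) * per-device size.
expected_kib=$(( (raid_devices - 1) * used_dev_size_kib ))

if [ "$expected_kib" -eq "$array_size_kib" ]; then
    echo "array size is consistent with a 4-device RAID5"
else
    echo "array size mismatch: expected ${expected_kib} KiB"
fi
```

The numbers agree, so the array geometry itself looks as it should.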
After inspecting /mnt/HDA_ROOT/.logs/kmsg, it turns out that the actual issue appears to be with /dev/sdb3 instead:
<6>[71052.730000] sd 3:0:0:0: [sdb] Unhandled sense code
<6>[71052.730000] sd 3:0:0:0: [sdb] Result: hostbyte=0x00 driverbyte=0x08
<6>[71052.730000] sd 3:0:0:0: [sdb] Sense Key : 0x3 [current] [descriptor]
<4>[71052.730000] Descriptor sense data with sense descriptors (in hex):
<6>[71052.730000] 72 03 00 00 00 00 00 0c 00 0a 80 00 00 00 00 01
<6>[71052.730000] 5d 3e d9 c8
<6>[71052.730000] sd 3:0:0:0: [sdb] ASC=0x0 ASCQ=0x0
<6>[71052.730000] sd 3:0:0:0: [sdb] CDB: cdb[0]=0x88: 88 00 00 00 00 01 5d 3e d9 c8 00 00 00 c0 00 00
<3>[71052.730000] end_request: I/O error, dev sdb, sector 5859367368
<4>[71052.730000] raid5_end_read_request: 27 callbacks suppressed
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246784 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246792 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246800 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246808 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246816 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246824 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246832 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246840 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246848 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5:md0: read error not correctable (sector 5857246856 on sdb3).
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
<4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
The above sequence is repeated at a steady rate for various (random?) sectors in the 585724XXXX range.
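For what it's worth, the CDB in the log can be decoded by hand. Opcode 0x88 is READ(16), where bytes 2-9 hold the starting LBA and bytes 10-13 the transfer length, and sense key 0x3 means MEDIUM ERROR. A small shell sketch (the variable names are mine) to check that the CDB lines up with the sector in the end_request line:

```shell
# Hex values copied from the CDB line in the kernel log:
# 88 00 | 00 00 00 01 5d 3e d9 c8 | 00 00 00 c0 | 00 00
lba=$((0x000000015d3ed9c8))   # bytes 2-9: starting LBA of the read
count=$((0x000000c0))         # bytes 10-13: number of blocks requested

echo "read starts at sector ${lba}"
echo "read covers ${count} sectors, up to sector $((lba + count - 1))"
```

The starting LBA comes out as 5859367368, the same sector that end_request reports for sdb, so the failing request was a 192-sector read beginning exactly there.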
My questions are:
- Why is it stalled so close to the end, while still using so many resources that the system stalls (the md0_raid5 and md0_resync processes are still running)?
- Is there any way to see what is causing it to fail/stall? <-- Likely due to the sdb3 errors.
- How can I get the operation to complete without losing my 3TB of data? (Like skipping the troublesome sectors on sdb3, but keeping the intact data?)