I created a raidz1-0 pool with three devices. Two were added by their /dev/disk/by-id ID, and for some reason I used /dev/sdg1 for the third one.
After a reboot years later, I can't get all three devices online again. Here's the current status:
# zpool status safe00
pool: safe00
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scan: scrub repaired 0 in 2h54m with 0 errors on Sun Jan 12 03:18:13 2020
config:
NAME                                          STATE     READ WRITE CKSUM
safe00                                        DEGRADED     0     0     0
  raidz1-0                                    DEGRADED     0     0     0
    ata-ST3500418AS_9VM89VGD                  ONLINE       0     0     0
    13759036004139463181                      OFFLINE      0     0     0  was /dev/sdg1
    ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E1NYTHJF  ONLINE       0     0     0
errors: No known data errors
The drives in this machine are:
# lsblk -f
NAME   FSTYPE     LABEL      UUID                                 MOUNTPOINT
sda
├─sda1 ext4       Ubuntu LTS 8a2a3c19-580a-474d-b248-bf0822cacab6 /
├─sda2 vfat                  B55A-693E                            /boot/efi
└─sda3 swap       swap       7d1cf001-07a6-4534-9624-054d70a562d5 [SWAP]
sdb    zfs_member dump       11482263899067190471
├─sdb1 zfs_member dump       866164895581740988
└─sdb9 zfs_member dump       11482263899067190471
sdc
sdd
├─sdd1 zfs_member dump       866164895581740988
└─sdd9
sde    zfs_member dump       866164895581740988
├─sde1 zfs_member safe00     6143939454380723991
└─sde2 zfs_member dump       866164895581740988
sdf
├─sdf1 zfs_member dump       866164895581740988
└─sdf9
sdg
├─sdg1 zfs_member safe00     6143939454380723991
└─sdg9
sdh
├─sdh1 zfs_member safe00     6143939454380723991
└─sdh9
which is to say that the safe00 pool should contain three devices: sde1, sdg1 and sdh1.
And just to get the mapping between the kernel device names and the by-id paths:
# cd /dev/disk/by-id
# ls -la ata* | cut -b 40- | awk '{split($0, a, " "); print a[3],a[2],a[1]}' | sort -h
../../sda1 -> ata-INTEL_SSDSC2KW120H6_BTLT712507HK120GGN-part1
../../sda2 -> ata-INTEL_SSDSC2KW120H6_BTLT712507HK120GGN-part2
../../sda3 -> ata-INTEL_SSDSC2KW120H6_BTLT712507HK120GGN-part3
../../sda -> ata-INTEL_SSDSC2KW120H6_BTLT712507HK120GGN
../../sdb1 -> ata-WDC_WD20EARX-00PASB0_WD-WCAZAE573068-part1
../../sdb9 -> ata-WDC_WD20EARX-00PASB0_WD-WCAZAE573068-part9
../../sdb -> ata-WDC_WD20EARX-00PASB0_WD-WCAZAE573068
../../sdc -> ata-SAMSUNG_HD204UI_S2H7JD1ZA21911
../../sdd1 -> ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0416553-part1
../../sdd9 -> ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0416553-part9
../../sdd -> ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0416553
../../sde1 -> ata-ST6000VN0033-2EE110_ZAD5S9M9-part1
../../sde2 -> ata-ST6000VN0033-2EE110_ZAD5S9M9-part2
../../sde -> ata-ST6000VN0033-2EE110_ZAD5S9M9
../../sdf1 -> ata-WDC_WD10EADS-00L5B1_WD-WCAU4C151323-part1
../../sdf9 -> ata-WDC_WD10EADS-00L5B1_WD-WCAU4C151323-part9
../../sdf -> ata-WDC_WD10EADS-00L5B1_WD-WCAU4C151323
../../sdg1 -> ata-ST3500418AS_9VM89VGD-part1
../../sdg9 -> ata-ST3500418AS_9VM89VGD-part9
../../sdg -> ata-ST3500418AS_9VM89VGD
../../sdh1 -> ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E1NYTHJF-part1
../../sdh9 -> ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E1NYTHJF-part9
../../sdh -> ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E1NYTHJF
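The same mapping can probably be produced without the `cut`/`awk` column arithmetic, which depends on the exact width of the `ls -la` output. The following is only a sketch: `map_by_id` is a hypothetical helper name, and it assumes a `readlink` that supports `-f` (as GNU coreutils does).

```shell
# Print "<resolved device> -> <by-id name>" for every ata-* symlink in a directory.
# map_by_id is a hypothetical helper; the directory defaults to /dev/disk/by-id.
map_by_id() {
    dir="${1:-/dev/disk/by-id}"
    for link in "$dir"/ata-*; do
        # Skip anything that is not a symlink (including an unmatched glob).
        [ -L "$link" ] || continue
        printf '%s -> %s\n' "$(readlink -f "$link")" "$(basename "$link")"
    done | sort
}

# Usage: map_by_id            (scans /dev/disk/by-id)
#        map_by_id /some/dir  (scans another directory of symlinks)
```

Unlike the listing above, this prints the fully resolved device path (for example /dev/sdg1) rather than the relative symlink target.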
And the zdb output (with minor ANNOTATIONS by me, marked with ###):
# zdb -C safe00
MOS Configuration:
version: 5000
name: 'safe00'
state: 0
txg: 22826770
pool_guid: 6143939454380723991
errata: 0
hostname: 'filserver'
vdev_children: 1
vdev_tree:
type: 'root'
id: 0
guid: 6143939454380723991
children[0]:
type: 'raidz'
id: 0
guid: 9801294574244764778
nparity: 1
metaslab_array: 33
metaslab_shift: 33
ashift: 12
asize: 1500281044992
is_log: 0
create_txg: 4
children[0]:
type: 'disk'
id: 0
guid: 135921832921042063
path: '/dev/disk/by-id/ata-ST3500418AS_9VM89VGD-part1'
whole_disk: 1
DTL: 58
create_txg: 4
children[1]: ### THIS CHILD USED TO BE sdg1
type: 'disk'
id: 1
guid: 13759036004139463181
path: '/dev/sdg1'
whole_disk: 0
not_present: 1 ### THIS IS sde1 NOW
DTL: 52
create_txg: 4
offline: 1
children[2]: ### THIS CHILD IS NOW sdh1
type: 'disk'
id: 2
guid: 2522190573401341943
path: '/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E1NYTHJF-part1'
whole_disk: 1
DTL: 57
create_txg: 4
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
space map refcount mismatch: expected 178 != actual 177
Summary for the pool safe00:
offline: sde1 --> ata-ST6000VN0033-2EE110_ZAD5S9M9-part1 <-- this likely was sdg1 before reboot
online: sdg1 --> ata-ST3500418AS_9VM89VGD
online: sdh1 --> ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E1NYTHJF
Trying to online the device that's offline:
# zpool online safe00 ata-ST6000VN0033-2EE110_ZAD5S9M9-part1
cannot online ata-ST6000VN0033-2EE110_ZAD5S9M9-part1: no such device in pool
# zpool online safe00 /dev/sde1
cannot online /dev/sde1: no such device in pool
I also tried to replace the offline device with the real one:
# zpool replace safe00 13759036004139463181 ata-ST6000VN0033-2EE110_ZAD5S9M9-part1
invalid vdev specification
use '-f' to override the following errors:
/dev/disk/by-id/ata-ST6000VN0033-2EE110_ZAD5S9M9-part1 is part of active pool 'safe00'
# zpool replace safe00 /dev/sdg1 ata-ST6000VN0033-2EE110_ZAD5S9M9-part1
invalid vdev specification
use '-f' to override the following errors:
/dev/disk/by-id/ata-ST6000VN0033-2EE110_ZAD5S9M9-part1 is part of active pool 'safe00'
So, finally, I tried to online the missing device using its ID:
# zpool online safe00 13759036004139463181
warning: device '13759036004139463181' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present
This put the device into the FAULTED state and started a scrub:
# zpool status safe00
pool: safe00
state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid. Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-4J
scan: scrub in progress since Sun Feb 23 11:19:00 2020
14.3G scanned out of 1.09T at 104M/s, 3h0m to go
0 repaired, 1.29% done
config:
NAME STATE READ WRITE CKSUM
safe00 DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
ata-ST3500418AS_9VM89VGD ONLINE 0 0 0
13759036004139463181 FAULTED 0 0 0 was /dev/sdg1
ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E1NYTHJF ONLINE 0 0 0
errors: No known data errors
What should I do to prevent this from happening again? How do I change the device's "path" property shown by zdb so that it doesn't rely on Linux's enumeration of disks at boot?
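For context, a commonly suggested approach to this kind of problem is to export the pool and import it again while pointing `zpool import` at /dev/disk/by-id with `-d`, so that the vdev paths are recorded from that directory. The sketch below is an assumption, not something verified against this pool in its current FAULTED state. `reimport_by_id` and the `DRY_RUN` variable are hypothetical names; with `DRY_RUN=1` the commands are only printed.

```shell
# Sketch: re-import a pool so vdev paths are looked up in /dev/disk/by-id.
# reimport_by_id and DRY_RUN are hypothetical; DRY_RUN=1 only prints the commands.
reimport_by_id() {
    pool="$1"
    # Either echo the command (dry run) or execute it.
    run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi; }
    run zpool export "$pool" &&
    run zpool import -d /dev/disk/by-id "$pool"
}

# Usage: DRY_RUN=1 reimport_by_id safe00   (print only)
#        reimport_by_id safe00             (actually export and import)
```

The export step takes the pool offline, so anything using its datasets would have to be stopped first.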