
I have a recurring problem with a ZFS pool: ZFS stops recognizing its own, properly labeled (or so it appears) physical devices.

    Ubuntu 20.04.2 LTS
    5.11.0-44-generic #48~20.04.2-Ubuntu SMP Tue Dec 14 15:36:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
    libzfs2linux/now 0.8.3-1ubuntu12.11 amd64 [installed,upgradable to: 0.8.3-1ubuntu12.13]
    zfs-zed/now 0.8.3-1ubuntu12.11 amd64 [installed,upgradable to: 0.8.3-1ubuntu12.13]
    zfsutils-linux/now 0.8.3-1ubuntu12.11 amd64 [installed,upgradable to: 0.8.3-1ubuntu12.13]

Example scenarios:

  1. I can create a pool, hook up a completely unrelated disk (e.g. an external USB drive), and upon rebooting (with the USB disk attached) ZFS reports one of the disks from its pool as missing.
  2. The same seems to happen when the controller changes for one (or perhaps more) of the drives. All the physical disks are there, and all the labels/UUIDs seem to be there; what changes is the device letter assignment.

It's hard to believe that ZFS assembles the pool based on the system's device assignment order while ignoring its own labels/UUIDs, but that is simply how it looks.
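
To see which device paths the pool configuration actually recorded for each member (as opposed to what blkid reports), the cached pool config can be dumped; this is only a sketch, and the exact output format varies between ZFS versions:

    sudo zdb -C mmdata | grep -E 'path|guid'   # path strings and GUIDs stored per vdev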

    agatek@mmstorage:~$ zpool status
          pool: mmdata
         state: DEGRADED
        status: One or more devices could not be used because the label is missing or
            invalid.  Sufficient replicas exist for the pool to continue
            functioning in a degraded state.
        action: Replace the device using 'zpool replace'.
           see: http://zfsonlinux.org/msg/ZFS-8000-4J
          scan: scrub in progress since Sun Jan  9 13:03:23 2022
            650G scanned at 1.58G/s, 188G issued at 468M/s, 22.7T total
            0B repaired, 0.81% done, 0 days 14:00:27 to go
        config:

        NAME                                          STATE     READ WRITE CKSUM
        mmdata                                        DEGRADED     0     0     0
          raidz1-0                                    DEGRADED     0     0     0
            ata-HGST_HDN726040ALE614_K7HJG8HL         ONLINE       0     0     0
            6348126275544519230                       FAULTED      0     0     0  was /dev/sdb1
            ata-HGST_HDN726040ALE614_K3H14ZAL         ONLINE       0     0     0
            ata-HGST_HDN726040ALE614_K4K721RB         ONLINE       0     0     0
            ata-WDC_WD40EZAZ-00SF3B0_WD-WX12D514858P  ONLINE       0     0     0
            ata-ST4000DM004-2CV104_ZTT24X5R           ONLINE       0     0     0
            ata-WDC_WD40EZAZ-00SF3B0_WD-WX62D711SHF4  ONLINE       0     0     0
            sdi                                       ONLINE       0     0     0
    
    errors: No known data errors

    agatek@mmstorage:~$ blkid
    /dev/sda1: UUID="E0FD-8D4F" TYPE="vfat" PARTUUID="7600a192-967b-417f-b726-7f5524be71a5"
    /dev/sda2: UUID="9d8774ec-051f-4c60-aaa7-82f37dbaa4a4" TYPE="ext4" PARTUUID="425f31b2-f289-496a-911b-a2f8a9bb5c25"
    /dev/sda3: UUID="e0b8852d-f781-4891-8e77-d8651f39a55b" TYPE="ext4" PARTUUID="a750bae3-c6ea-40a0-bdfa-0523e358018b"
    /dev/sdb1: LABEL="mmdata" UUID="16683979255455566941" UUID_SUB="13253481390530831214" TYPE="zfs_member" PARTLABEL="zfs-5360ecc220877e69" PARTUUID="57fe2215-aa69-2f46-b626-0f2057a2e4a7"
    /dev/sdd1: LABEL="mmdata" UUID="16683979255455566941" UUID_SUB="17929921080902463088" TYPE="zfs_member" PARTLABEL="zfs-f6ef14df86c7a6e1" PARTUUID="31a074a3-300d-db45-b9e2-3495f49c4bee"
    /dev/sde1: LABEL="mmdata" UUID="16683979255455566941" UUID_SUB="505855664557329830" TYPE="zfs_member" PARTLABEL="zfs-6326993c142e4a03" PARTUUID="37f4954d-67fd-8945-82e6-d0db1f2af12e"
    /dev/sdg1: LABEL="mmdata" UUID="16683979255455566941" UUID_SUB="1905592300789522892" TYPE="zfs_member" PARTLABEL="zfs-9d379d5bfd432a2b" PARTUUID="185eff00-196a-a642-9360-0d4532d54ec0"
    /dev/sdi1: LABEL="mmdata" UUID="16683979255455566941" UUID_SUB="15862525770363300383" TYPE="zfs_member" PARTLABEL="zfs-3c99aa22a45c59bf" PARTUUID="89f1600a-b58e-c74c-8d5e-6fdd186a6db0"
    /dev/sdh1: LABEL="mmdata" UUID="16683979255455566941" UUID_SUB="15292769945216849639" TYPE="zfs_member" PARTLABEL="zfs-ee9e1c9a5bde878c" PARTUUID="2e70d63b-00ba-f842-b82d-4dba33314dd5"
    /dev/sdf1: LABEL="mmdata" UUID="16683979255455566941" UUID_SUB="5773484836304595337" TYPE="zfs_member" PARTLABEL="zfs-ee40cf2140012e24" PARTUUID="e5cc3e2a-f7c9-d54e-96de-e62a723a9c3f"
    /dev/sdc1: LABEL="mmdata" UUID="16683979255455566941" UUID_SUB="6348126275544519230" TYPE="zfs_member" PARTLABEL="zfs-0d28f0d2715eaff8" PARTUUID="a328981a-7569-294a-bbf6-9d26660e2aad"

What happened with the above pool: one of the devices had failed earlier. I hooked up a replacement disk to the second controller and performed the replacement. It was successful and the pool was fine. Next, the failed device was removed from the pool and physically replaced by the replacement disk (which meant the replacement changed controllers). After rebooting, the pool came up in the degraded state with one of the devices reported missing. The scrub shown above was triggered by `zpool clear`.
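
Roughly, that replacement followed the usual zpool replace workflow; the sketch below only illustrates the sequence, with <old> and <new> as placeholders rather than the exact devices involved:

    sudo zpool replace mmdata <old> <new>   # resilvers the pool data onto the replacement disk
    # ...replacement disk later moved onto the failed disk's controller, then a reboot...
    sudo zpool clear mmdata                 # clears the error state; this started the scrub shown above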

So, as blkid shows, there are 8 disks, all partitioned and (I think) properly labeled, yet one of the devices is not recognized as part of the pool. What should I do in such situations? It is extremely annoying, and resilvering the pool takes days.

agatek

1 Answer


If you add any device to the pool using a /dev/sdX path, that name is subject to change, because the Linux kernel does not guarantee any ordering for those device entries.

In your output, you have /dev/sdi as a member of the pool. This can change at any point.
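
For reference, those persistent names live under /dev/disk/by-id; a quick, illustrative way to see how they map onto the current sdX letters:

    ls -l /dev/disk/by-id/ | grep -v part   # each ata-<model>_<serial> symlink points at its current ../../sdX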

You should try `zpool export mmdata` to take the pool offline, and then `zpool import mmdata -d /dev/disk/by-id` to import it again using the persistent IDs for the drives.
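
A minimal sketch of that sequence, assuming the pool is not in use while it is exported (pool name taken from the question):

    sudo zpool export mmdata                      # unmounts the datasets and marks the pool exported
    sudo zpool import -d /dev/disk/by-id mmdata   # scan only /dev/disk/by-id and re-import the pool
    zpool status mmdata                           # members should now be listed by their persistent IDs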

Tero Kilkanen
  • Thanks. I will try that. So why does it label/UUID all the disks/partitions and still rely on /dev/disk/by-id? I am trying to understand the logic. – agatek Jan 09 '22 at 10:42
  • ZFS does label the disks / partitions. However, it uses the device identifiers that were provided to it during `zpool create`. – Tero Kilkanen Jan 09 '22 at 10:45
  • That's exactly my point: why does it label them if it relies on the /dev/disk/by-id mapping? Why does it not scan the UUIDs and labels and assemble the pool based on those? Besides, what it says in the error message is misleading at best: One or more devices could not be used because **the label is missing or invalid**. Well, the label is neither missing nor invalid. Perhaps this is a bug of some sort and should be reported? – agatek Jan 09 '22 at 11:00
  • Tero, exporting/importing the pool as you suggested worked well. Thank you again. Before doing this, I exported the pool and just issued zpool import (no -d). This listed all the devices correctly, so the one marked above as faulted showed the correct physical id and was no longer missing. Interestingly, this device was never introduced to the pool by id (it was /dev/sdX). One might have assumed some proper scanning, but the last device (sdi) still remained sdi. Only your method, pointing to the directory, listed all the devices in the pool by their physical ids. Weird. – agatek Jan 09 '22 at 23:57