
I have a ZFS pool in the current state:

[root@SERVER-abc ~]# zpool status -v DATAPOOL
  pool: DATAPOOL
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 18.5M in 00:00:01 with 0 errors on Wed Jan  5 19:10:50 2022
config:

        NAME                                              STATE     READ WRITE CKSUM
        DATAPOOL                                          DEGRADED     0     0     0
          raidz2-0                                        DEGRADED     0     0     0
            gptid/14c707c6-f16c-11e8-b117-0cc47a2ba44e    DEGRADED     0     0    17  too many errors
            spare-1                                       ONLINE       0     0    17
              gptid/168342c5-f16c-11e8-b117-0cc47a2ba44e  ONLINE       0     0     0
              gptid/1bfaa607-f16c-11e8-b117-0cc47a2ba44e  ONLINE       0     0     0
            gptid/1875501a-f16c-11e8-b117-0cc47a2ba44e    ONLINE       0     0    30
            gptid/1a16d37c-f16c-11e8-b117-0cc47a2ba44e    ONLINE       0     0    29
        spares
          gptid/1bfaa607-f16c-11e8-b117-0cc47a2ba44e      INUSE     currently in use

errors: Permanent errors have been detected in the following files:

        DATAPOOL/VMS/ubuntu_1804_LTS_ustrich-m6i87@auto-2022-01-04_11-41:<0x1>
        <0x1080a>:<0x1>
        <0x182a>:<0x1>
        DATAPOOL/VMS/ubuntu_1804_LTS_ustrich-m6i87:<0x1>
        <0x16fa>:<0x1>

This is a zpool with 4 drives + 1 spare. Something happened and suddenly the spare automatically paired with one of the other drives as spare-1.

This is unexpected to me, and it raises a few questions:

  1. Why did the spare not replace the degraded drive?
  2. How can I find out why the spare jumped in as spare-1?
  3. Is it possible (or even recommended) to get the spare back out and then replace the degraded drive? (See the command sketch below.)
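
For question 3, this is roughly the sequence I have in mind, untested, with the gptids taken from the status output above and gptid/<new-disk> being just a placeholder for a fresh replacement disk:

[root@SERVER-abc ~]# zpool detach DATAPOOL gptid/1bfaa607-f16c-11e8-b117-0cc47a2ba44e   # release the spare back into the spares list
[root@SERVER-abc ~]# zpool replace DATAPOOL gptid/14c707c6-f16c-11e8-b117-0cc47a2ba44e gptid/<new-disk>   # swap out the degraded drive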

The goal is to rescue the pool without having to pull tons of data back from backup, but at the core I want to understand what happened and why, and how to handle such situations in terms of best practices.

Thanks a bunch! :)

System is: SuperMicro, TrueNAS-12.0-U4.1, zfs-2.0.4-3

Edit: Changed output from zpool status -x to zpool status -v DATAPOOL

Edit2: As of now I understand that 168342c5 seems to have had an error first and the spare (1bfaa607) jumped in. After that, 14c707c6 degraded as well.
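
One way to confirm that order of events, I think, is the internal pool history, which should have logged when the spare was attached (untested on this box):

[root@SERVER-abc ~]# zpool history -i DATAPOOL | tail -n 50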

Edit3, additional question: all drives (except the one in spare-1) seem to have CKSUM errors - what does that indicate? Cabling? The HBA? Are all drives dying simultaneously?
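
To narrow that down I plan to look at the ZFS event log and the kernel messages, since errors on nearly all drives at once would point at a shared component; the grep pattern below is just a guess for a FreeBSD/CAM setup:

[root@SERVER-abc ~]# zpool events -v DATAPOOL | less            # per-vdev checksum/IO error events with timestamps
[root@SERVER-abc ~]# dmesg | grep -iE 'cam|da[0-9]' | less      # bus resets, timeouts, retries on the disk path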

Latest update: after zpool clear and zpool scrub DATAPOOL it seems clear that a lot has happened and there is no way to rescue the pool:

  pool: DATAPOOL
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Jan  6 16:18:05 2022
        1.82T scanned at 1.55G/s, 204G issued at 174M/s, 7.82T total
        40.8G resilvered, 2.55% done, 12:44:33 to go
config:

        NAME                                              STATE     READ WRITE CKSUM
        DATAPOOL                                          DEGRADED     0     0     0
          raidz2-0                                        DEGRADED     0     0     0
            gptid/14c707c6-f16c-11e8-b117-0cc47a2ba44e    DEGRADED     0     0   156  too many errors
            spare-1                                       DEGRADED     0     0     0
              gptid/168342c5-f16c-11e8-b117-0cc47a2ba44e  DEGRADED     0     0   236  too many errors
              gptid/1bfaa607-f16c-11e8-b117-0cc47a2ba44e  ONLINE       0     0     0  (resilvering)
            gptid/1875501a-f16c-11e8-b117-0cc47a2ba44e    DEGRADED     0     0   182  too many errors
            gptid/1a16d37c-f16c-11e8-b117-0cc47a2ba44e    DEGRADED     0     0   179  too many errors
        spares
          gptid/1bfaa607-f16c-11e8-b117-0cc47a2ba44e      INUSE     currently in use

I'll check all SMART stats now.
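
A minimal sketch of that check, assuming the disks show up as da0 through da4 (the device names are guesses, adjust to what the HBA actually presents):

[root@SERVER-abc ~]# smartctl -a /dev/da0   # repeat per disk; reallocated/pending sectors and UDMA CRC errors are the interesting counters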

– phaidros

1 Answer


Is this a 4-disk RAIDZ2?

Did you choose that layout over ZFS mirrors?

Can you show the output of zpool status -v?

Please also run a zpool clear and follow the results/progress.
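
For example, something along these lines (pool name taken from your output), and then keep an eye on the status while it works:

    zpool clear DATAPOOL
    zpool status -v DATAPOOL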

– ewwhite

    `Is this a 4-disk RAIDZ2? Did you choose that layout over ZFS mirrors?` Yes, I was young and didn't know better m| – phaidros Jan 06 '22 at 11:24
    `Please also run a zpool clear and follow the results/progress.` I just cleared and started a new scrub. As of now everything looks okay in zpool status, but the first SMART problems are coming in via email from the system. – phaidros Jan 06 '22 at 11:32