
A disk replacement in ZFS went awry, and now the replacing disk, even though no longer physically present, is "stuck" in the pool, blocking further replacement attempts. How can I remove it?

In a raidz3 pool with 11 disks on OmniOS r151010, one of the disks went bad. I took the problem disk offline, replaced it with a new disk, and got the new disk reconfigured (the commands are sketched after the status output below). It started to resilver, and then the replacement disk had errors; dmesg showed "SYNCHRONIZE CACHE command failed." I wondered if it might be a loose cable, so I shut down the machine, reseated the disk and cables, and started it up again. It began resilvering, and after a while hit the same problem. At this point zpool status for the problem disk shows:

replacing-0                UNAVAIL      0     0     0  insufficient replicas
    c4t5000C5004DC8693Fd0  OFFLINE      0     0     0
    c4t50014EE658315C1Dd0  FAULTED      0     0     0  too many errors
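
For reference, the offline-and-replace sequence was roughly the following (the pool is named raid):

# zpool offline raid c4t5000C5004DC8693Fd0
(physically swap the drive and let it reconfigure)
# zpool replace raid c4t5000C5004DC8693Fd0 c4t50014EE658315C1Dd0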

I decided to try another disk to see if that made any difference. I suspected it wouldn't, but it was easy to try. I hot-swapped the disk, and cfgadm -al then showed:

c8                             scsi-sas     connected    configured   unknown
c8::w50014ee6ad8f0df2,0        disk-path    connected    configured   unknown
c8::w50014ee658315c1d,0        disk-path    connected    unconfigured unknown
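
(In principle stale device state can sometimes be cleaned up without a reboot, e.g. devfsadm -Cv to remove dangling /dev links, though I'm not sure that would clear this attachment point; I restarted instead.)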

The new disk is there, but the old one hasn't gone away. I restarted the machine to clear out old state, after which cfgadm -al showed just:

c8                             scsi-sas     connected    configured   unknown
c8::w50014ee6ad8f0df2,0        disk-path    connected    configured   unknown

However, zpool status still showed the old disk. I tried clearing the fault, and now the original disk and the first replacement are both offline:

replacing-0                UNAVAIL      0     0     0  insufficient replicas
    c4t5000C5004DC8693Fd0  OFFLINE      0     0     0
    c4t50014EE658315C1Dd0  OFFLINE      0     0     0
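
For reference, the clear itself was along the lines of:

# zpool clear raid c4t50014EE658315C1Dd0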

At this point, what should I do to get the new replacement disk resilvering? Running zpool replace on either the original disk or the first replacement just yields the error (slightly shortened here) "cannot open 'c4t500....': no such device in /dev/dsk."
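
Concretely, the failing attempts looked something like:

# zpool replace raid c4t5000C5004DC8693Fd0 c4t50014EE6AD8F0DF2d0
# zpool replace raid c4t50014EE658315C1Dd0 c4t50014EE6AD8F0DF2d0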

Doing a zpool remove on c4t50014EE658315C1Dd0 yields the error message "cannot remove c4t50014EE658315C1Dd0: only inactive hot spares, cache, top-level, or log devices can be removed"

  • RAIDZ3 is complicated. ZFS almost always works better with mirrors. It sounds like these are cheap disks and that they're not connected to a backplane, but via individual SATA cables. Are you sure the hardware is healthy otherwise? – ewwhite Jul 31 '15 at 08:37
  • I suspect raidz3 and the SATA connectivity are not germane to the immediate problem, which is that a resilver never completed and the disk involved in it, even though no longer physically present, is now "stuck" in the pool as the replacing disk, blocking further attempts to recover from the problem. – Willard Jul 31 '15 at 12:32

1 Answer


I figured it out. Use zdb on the pool to get the GUID of the original disk, then use format to find the name of the replacement disk, and then run:

# zpool replace <pool> <GUID of original disk> <name of replacement disk>
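
Concretely, it went something like this (the GUID shown is invented for illustration; the real one comes from the zdb output):

# zdb raid                 (look for the guid: line next to the old disk's path: entry)
# format </dev/null        (lists the disks; the new one showed up as c4t50014EE6AD8F0DF2d0)
# zpool replace raid 1234567890123456789 c4t50014EE6AD8F0DF2d0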

It looks like this while resilvering:

    NAME                         STATE     READ WRITE CKSUM
    raid                         DEGRADED     0     0     0
      raidz3-0                   DEGRADED     0     0     0
        replacing-0              UNAVAIL      0     0     0  insufficient replicas
          c4t5000C5004DC8693Fd0  OFFLINE      0     0     0
          c4t50014EE658315C1Dd0  OFFLINE      0     0     0
          c4t50014EE6AD8F0DF2d0  ONLINE       0     0     0  (resilvering)

and then everything goes back to normal once the resilver completes.
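
Once the resilver finishes, the replacing-0 vdev collapses away and only the new disk remains. A quick health check (prints "all pools are healthy" when everything is fine):

# zpool status -x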
