
I am in the process of trying to recover a pool that had been degraded and neglected, and then had a second mirror member fail, resulting in a faulted pool. For whatever reason, the hot spare never kicked in, even though autoreplace was set for this pool, but that's beside the point.

This is on an OmniOS server. Pool info is as follows:

  pool: dev-sata1
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-JQ
  scan: resilvered 1.53T in 21h6m with 0 errors on Sat Jun 17 13:18:04 2017
config:

        NAME                       STATE     READ WRITE CKSUM
        dev-sata1                  UNAVAIL    227   623     0  insufficient replicas
          mirror-0                 ONLINE       0     0     0
            c1t5000C5003ECEEC42d0  ONLINE       0     0     0
            c1t5000C5003ED6D008d0  ONLINE       0     0     0
          mirror-1                 ONLINE       0     0     0
            c1t5000C500930358EAd0  ONLINE       0     0     0
            c1t5000C500930318E1d0  ONLINE       0     0     0
          mirror-3                 ONLINE       0     0     0
            c1t5000C5003F362DA7d0  ONLINE       0     0     0
            c1t5000C5003F365D94d0  ONLINE       0     0     0
          mirror-4                 ONLINE       0     0     0
            c1t5000C50064D11652d0  ONLINE       0     0     0
            c1t5000C500668EC894d0  ONLINE       0     0     0
          mirror-5                 ONLINE       0     0     0
            c1t5000C5007A2DBE23d0  ONLINE       0     0     0
            c1t5000C5007A2DF29Cd0  ONLINE       0     0     0
          mirror-6                 UNAVAIL    457 1.22K     5  insufficient replicas
            15606980839703210365   UNAVAIL      0     0     0  was /dev/dsk/c1t5000C5007A2E1359d0s0
            c1t5000C5007A2E1BAEd0  FAULTED     37 1.25K     5  too many errors
          mirror-7                 ONLINE       0     0     0
            c1t5000C5007A34981Bd0  ONLINE       0     0     0
            c1t5000C5007A3929B6d0  ONLINE       0     0     0
        logs
          mirror-2                 ONLINE       0     0     0
            c1t55CD2E404B740DD3d0  ONLINE       0     0     0
            c1t55CD2E404B7591BEd0  ONLINE       0     0     0
        cache
          c1t50025388A0952EB0d0    ONLINE       0     0     0
        spares
          c1t5000C5002CD7AFB6d0    AVAIL

The disk "c1t5000C5007A2E1BAEd0" is currently at a data recovery facility, but they have exhausted the supply of replacement heads, including those from donor disks we have supplied. The disk marked as missing was eventually found, and could potentially be recovered, but it's a last result because I have no idea how out of date it is compared to the rest, and what that would mean for consistency. To be considered a donor, the first 3 letters of the serial needs to match, as well as the site code. I have 4 other disks in the pool that match that criteria and were healthy at the time the pool went down.

So, on to my question: Is it possible for me to substitute the 4 other possibly-donor-compatible disks (based on serial number) with 4 new disks, after using dd to copy each entire donor disk to its new disk?

I am not clear on whether the pool requires the WWN or serial to match what it has stored (if it stores anything besides the cache file) when importing a disk, or whether it scans each disk for metadata to determine whether it can import a pool. If the latter is true, is my strategy of obtaining 4 more donor disks feasible?
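For what it's worth, each pool member carries on-disk ZFS labels holding the pool GUID and per-vdev GUIDs, and as far as I know it is these labels, not the WWN or serial, that ZFS matches at import time; zpool.cache only caches device paths as hints. A minimal sketch of checking the labels, and of the raw copy being proposed (the zdb line uses one of the healthy members listed above; the dd device names are placeholders):

    # Dump the ZFS labels from a healthy pool member; the pool_guid and
    # per-vdev guid fields are what import matching is based on.
    zdb -l /dev/dsk/c1t5000C5003ECEEC42d0s0

    # A byte-for-byte clone of a donor disk onto an equal-or-larger new disk
    # would carry those labels across unchanged. Placeholder device names;
    # p0 is the whole-disk raw device on x86 illumos.
    dd if=/dev/rdsk/c1tDONORd0p0 of=/dev/rdsk/c1tNEWd0p0 bs=1024k conv=noerror,sync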

Dirk

1 Answer


Definitely don't use dd! ZFS has a built-in command for this, which is described reasonably well in Oracle's docs. You should be able to use zpool replace tank <old device> <new device> to do the main part of the operation, but there are a couple of other ancillary commands as well (a sketch of the full sequence follows the list below):

The following are the basic steps for replacing a disk:

  • Offline the disk, if necessary, with the zpool offline command.
  • Remove the disk to be replaced.
  • Insert the replacement disk.
  • Run the zpool replace command. For example: zpool replace tank c1t1d0
  • Bring the disk online with the zpool online command.
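Purely as an illustration of that sequence (the pool and device names here are generic placeholders, not the ones from the question):

    # Take the failed disk offline, if the pool still has it imported.
    zpool offline tank c1t1d0

    # Physically swap the disk, then tell ZFS to rebuild onto the new one.
    # With no new_device argument, the same device path is reused.
    zpool replace tank c1t1d0

    # Bring it online and watch the resilver finish.
    zpool online tank c1t1d0
    zpool status -v tank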

The man page also has some additional information:

zpool replace [-f]  pool device [new_device]

 Replaces old_device with new_device.  This is equivalent to attaching
 new_device, waiting for it to resilver, and then detaching
 old_device.

 The size of new_device must be greater than or equal to the minimum
 size of all the devices in a mirror or raidz configuration.

 new_device is required if the pool is not redundant. If new_device is
 not specified, it defaults to old_device.  This form of replacement
 is useful after an existing disk has failed and has been physically
 replaced. In this case, the new disk may have the same /dev path as
 the old device, even though it is actually a different disk.  ZFS
 recognizes this.

 -f  Forces use of new_device, even if it appears to be in use.
     Not all devices can be overridden in this manner.

Of course, it's probably best to try this first on a VM that has virtual disks in a similarly-configured zpool, rather than trying it for the first time on the pool with data you care about recovering.
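If a spare VM or test box is handy, one low-stakes way to rehearse the clone-and-import idea is a throwaway pool built on file vdevs. This is only a sketch: the paths and sizes are arbitrary, and backing files obviously don't carry WWNs or EFI labels the way real disks do:

    # Build a small scratch pool out of file vdevs.
    mkfile 128m /var/tmp/vd1 /var/tmp/vd2 /var/tmp/vd3 /var/tmp/vd4
    zpool create testpool mirror /var/tmp/vd1 /var/tmp/vd2 \
                          mirror /var/tmp/vd3 /var/tmp/vd4

    # Rehearse the dd-clone idea: export, copy one member byte-for-byte,
    # move the original out of the scanned directory, and see whether ZFS
    # still finds the pool by scanning the labels on the clone.
    zpool export testpool
    dd if=/var/tmp/vd2 of=/var/tmp/vd2clone bs=1024k
    mv /var/tmp/vd2 /root/vd2.original
    zpool import -d /var/tmp testpool
    zpool status testpool

    # Tear it down when finished.
    zpool destroy testpool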

By the way, this other part of the docs explains a bit more about hot spares and perhaps includes pointers to explain why yours didn't get used. It might be valuable to poke around a bit to make sure it doesn't crap out again next time :(.
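For instance, two quick things worth checking on the pool in question: the autoreplace property the question mentions, and, since hot-spare activation on illumos is (as I understand it) handled by FMA's zfs-retire agent, whatever faults FMA has recorded:

    # Confirm how the pool-level property is actually set.
    zpool get autoreplace dev-sata1

    # The FMA fault list may show why no spare was ever substituted.
    fmadm faulty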

Dan
  • I should mention that the pool no longer appears to be imported. The system was rebooted after zpool and other management commands hung indefinitely. After doing so, I had to import both pools manually; the healthy pool came back, but I was unable to import the faulted pool, even with -F. I can still see pool information with zdb, but I don't know what else I can do from a zpool perspective until I can get the one dead mirror at least back to a degraded state. I have no problem using dd if it will allow me to recover that dead disk. Both mirror-6 members are completely dead. – Dirk Oct 05 '18 at 03:18
  • I guess if it’s your only option, what choice do you have? dd will still copy the disk labels, so it’s possible it could work, although I’m not confident it won’t confuse ZFS. You could always try it before you actually hand off the disk, so you can plug the original disk back in if the copied one doesn’t work. – Dan Oct 05 '18 at 03:23
  • I will experiment with a VM as you suggested or another piece of hardware. That should let me know whether this will work at all. I didn't realize before I asked this question that the entire strategy hinges on this being possible, since the data recovery firm is attempting to image the failed disk. – Dirk Oct 05 '18 at 15:20
  • I'm having this same question, as a backup plan. Currently I have a raidz2 setup, and today I had 3 degraded disks (of 6). I can't imagine how the pool didn't fault. So I grabbed a new disk, and it's resilvering. I'm trying to verify my data is still there; I'm hoping that since the pool isn't faulted, it is. I'm really not sure how it's resilvering, so I'm doing research on this redundancy question. I may have been extremely lucky with where the drives that failed were placed, etc.; not sure. I was thinking of ddrescue'ing the failed drive if this resilver fails. – Brian Thomas Jan 22 '20 at 02:16
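If it does come down to imaging a dying member, GNU ddrescue (not part of base OmniOS; assume it is installed from a package repository) is a gentler choice than plain dd because it retries around bad sectors and keeps a map file so an interrupted copy can resume. A sketch with placeholder device names:

    # First pass: grab everything readable quickly, skipping the slow
    # scraping phase, and record progress in a map file.
    ddrescue -f -n /dev/rdsk/c1tFAILINGd0p0 /dev/rdsk/c1tNEWd0p0 /root/rescue.map

    # Second pass: go back and retry the bad areas a few times.
    ddrescue -f -r3 /dev/rdsk/c1tFAILINGd0p0 /dev/rdsk/c1tNEWd0p0 /root/rescue.map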