
I have a pool that is in UNAVAIL status ("One or more devices could not be opened. There are insufficient replicas for the pool to continue functioning.") due to a recent disk failure.

I'm planning to have the failed disk repaired (i.e., sent to a data recovery service) in order to get the pool back online long enough to migrate it. However, there's one snag that I'm not sure how to work around.

The device names in my pool use the disks' serial numbers (/dev/disk/by-id/ style). I did this because I have a lot of disks and the /dev/sd* names would move around at each boot, which of course wreaked havoc on the pool. In this case, though, I'll be bringing the "same" disk (in terms of data, but not hardware) back online under a different device name, so I don't think it will be recognized correctly on its own, and I'm not sure exactly how the "replace" command will treat the new disk. Perhaps it will just work, but based on the documentation it might treat the new disk as "blank" instead of using it to repair the pool (or maybe ZFS looks at the content of the disk and acts accordingly; I'm just not sure).
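For reference, the general form of the command I'm worried about is something like the following (the pool name "tank" and the by-id names are placeholders, not my actual layout):

    # Tell ZFS to replace the old by-id device with the new one (names are hypothetical).
    zpool replace tank /dev/disk/by-id/ata-OLD_SERIAL /dev/disk/by-id/ata-NEW_SERIAL

    # Then check how the vdev is reported.
    zpool status tank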

In a nutshell, I want to take an offline disk that is registered in the pool by its hardware device name, copy it to another physical disk, then bring the new disk online in place of the original.
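For the copy step I'm picturing a plain block-level clone of the recovered disk onto the replacement, something like this (device names are placeholders; ddrescue is just one option):

    # Block-for-block clone of the recovered disk onto the new disk.
    # ddrescue tolerates read errors and records its progress in a map file.
    ddrescue -f /dev/disk/by-id/ata-OLD_SERIAL /dev/disk/by-id/ata-NEW_SERIAL /root/clone.map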

I'm doing some experiments with non-production devices to suss this out, but any thoughts from those of you who know more about what ZFS does "under the hood", or who have experience with this sort of recovery, are greatly appreciated! Additionally, if there are papers, docs, etc. that get into this level of tweaking, I'd be happy to study them as well.

To be crystal clear, this isn't intended to be a long-term configuration, just enough to evacuate the contents of the array, so I'm not opposed to solutions that are not suitable for long-term/production environments.

jasongullickson
  • One question... Why did the failure of a single drive result in your pool becoming unavailable? Weren't you using RAID with *some* parity or mirroring features? – ewwhite Oct 04 '13 at 14:16
  • Yes RAID was used, but this failure occurred before a previous failure could be recovered from. – jasongullickson Oct 04 '13 at 15:06
  • First test result: a straight-up "replace" won't work because it can't open the unavailable pool. – jasongullickson Oct 04 '13 at 15:12
  • Second test: creating a symbolic link to the new device using the old device name does get the pool back to ONLINE; still need to validate with some data transfers. – jasongullickson Oct 04 '13 at 15:19

1 Answer


Looks like a simple symbolic link is the answer:

  1. Stop ZFS
  2. ln -s /dev/disk/by-id/new-disk-id /dev/disk/by-id/old-disk-id
  3. Start ZFS
  4. Pool should be back to ONLINE
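
Spelled out with commands, the sequence could look roughly like this (the pool name "tank" and the by-id names are placeholders; an export/import cycle is one way to handle the "Stop/Start ZFS" steps):

    # Take the pool out of service so the device scan runs fresh on import.
    zpool export tank

    # Point the old by-id name at the new disk (names are placeholders).
    ln -s /dev/disk/by-id/new-disk-id /dev/disk/by-id/old-disk-id

    # Re-import, telling ZFS to scan the by-id directory, then verify.
    zpool import -d /dev/disk/by-id tank
    zpool status tank

Keep in mind that /dev/disk/by-id is managed by udev, so a hand-made symlink there probably won't survive a reboot or a udev re-trigger; for a one-shot evacuation like this, that shouldn't matter.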

Thanks Unix!

jasongullickson
  • That scares the hell out of me, yet I can't think of any reason why it's a bad solution (other than remembering you did it so it doesn't bite you later). Good job? :) – Nex7 Oct 31 '13 at 07:04