
This is a somewhat theoretical question about ZFS and RAID-Z. I'll use a three disk single-parity array as an example for clarity, but the problem can be extended to any number of disks and any parity.

Suppose we have disks A, B, and C in the pool, and that it is clean.

Suppose now that we physically add disk D with the intention of replacing disk C, and that disk C is still functioning correctly and is being replaced only as preventive maintenance. Some admins would simply yank C and install D in its place, which is a little tidier since device IDs need not change; however, that leaves the array temporarily degraded, so for this example suppose we install D without offlining or removing C. The Solaris docs indicate that we can replace a disk without first offlining it, using a command such as:

    zpool replace pool C D

This should cause a resilvering onto D. Let us say that resilvering proceeds "downwards" along a "cursor." (I don't know the actual terminology used in the internal implementation.)
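
For concreteness, the sequence I have in mind looks something like this (pool and disk names are just the placeholders from the example above):

    # confirm the pool is healthy before starting
    zpool status pool

    # replace C with D in place, without offlining C first
    zpool replace pool C D

    # watch the resilver progress onto D
    zpool status pool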

Suppose now that midway through the resilvering, disk A fails. In theory this should be recoverable: above the cursor, B and D contain sufficient parity, and below the cursor, B and C contain sufficient parity. However, whether it is actually recoverable depends on internal design decisions in ZFS which I am not aware of (and which the manual doesn't state in certain terms).

If ZFS continues to send writes to C below the cursor, then we are fine. If, however, ZFS internally treats C as though it were gone, resilvering D only from the parity between A and B and writing only to A and B below the cursor, then we're toast.

Some experimenting could answer this question, but I was hoping someone here already knows how ZFS handles this situation. Thanks in advance for any insight!

Kevin

3 Answers


Testing with a file-based pool (v28 on FreeBSD 8.3 using file-backed md devices) suggests that it should work. I was able to offline one of the remaining disks while the resilver was in progress. Ideally this would need testing with real disks, actually pulling one, to be 100% sure, but ZFS was perfectly happy to let me offline the disk.

Before offlining md0, the pool was still fully ONLINE, so it appears to me that ZFS is simply mirroring the replaced disk onto the new disk while still treating the whole lot as available during the process.

    NAME                     STATE     READ WRITE CKSUM
    test                     DEGRADED     0     0     0
      raidz1-0               DEGRADED     0     0     0
        8480467682579886773  OFFLINE      0     0     0  was /dev/md0
        md1                  ONLINE       0     0     0
        replacing-2          ONLINE       0     0     0
          md2                ONLINE       0     0     0
          md3                ONLINE       0     0     0  (resilvering)
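
For anyone wanting to reproduce this, a rough sketch of the file-backed setup follows (the paths, sizes and md device numbers are assumptions on my part; mdconfig prints the name it actually assigns):

    # create four small backing files and attach them as md devices
    truncate -s 1g /tmp/zdisk0 /tmp/zdisk1 /tmp/zdisk2 /tmp/zdisk3
    mdconfig -a -t vnode -f /tmp/zdisk0   # -> md0
    mdconfig -a -t vnode -f /tmp/zdisk1   # -> md1
    mdconfig -a -t vnode -f /tmp/zdisk2   # -> md2
    mdconfig -a -t vnode -f /tmp/zdisk3   # -> md3

    # build a three-disk raidz1 pool and put some data in it so the
    # resilver takes long enough to observe
    zpool create test raidz1 md0 md1 md2
    dd if=/dev/urandom of=/test/fill bs=1m count=512

    # start replacing md2 with md3, then offline another member
    # while the resilver is still running
    zpool replace test md2 md3
    zpool offline test md0
    zpool status test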
USD Matt
  • Thank you for the test! Clearly ZFS is doing the right thing here, which shouldn't be too surprising considering how well-engineered it seems in general. – Kevin Sep 27 '12 at 09:05

Disk C is still used in the RAIDZ exactly as it had been, until it is removed from the vdev. As Matt points out, ZFS replaces a disk by making the replacement disk a mirror of the disk being replaced and then resilvering the replacement. The RAIDZ vdev is never degraded and never resilvered (until A fails, which is entirely separate from the replacement operation).
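
A quick way to see this on a live pool (names here are just placeholders): during the replace, the old and new disks sit together under a temporary "replacing" vdev, as in Matt's listing, and if need be the operation can be cancelled by detaching the new disk from that mirror.

    # during the replace, C and D appear together under "replacing-N"
    zpool status -v pool

    # to cancel the replacement, detach the new disk from the
    # temporary mirror; the raidz continues with C as before
    zpool detach pool D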

Chris S

I'm not sure that this matters.

In most cases, you shouldn't be using RAIDZ rather than mirrors... If you do, you should be doing so with a spare.

Resilvering will fail if one of the disks it's reading from fails or is unavailable. Same as an Unrecoverable Read Error. Disk C would be gone by that point...
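
On the spare point, a minimal sketch of what that could look like (pool and disk names are placeholders):

    # add disk E to the pool as a hot spare
    zpool add pool spare E

    # if a member (say A) later fails and the spare hasn't been
    # pulled in automatically, it can be substituted by hand
    zpool replace pool A E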

ewwhite
  • Well raidz1 tolerates a single disk failure, and it's clear from Matt's demo that it continues to tolerate a single disk failure even during preventive maintenance. A second failure can destroy the pool, but whether this is tolerable depends on the use case (and can be made less likely by not working drives to failure). A second failure also destroys a two-way mirror. Any real-world use should never be without regular off-line backup, as even with double parity or triple mirrors, a single errant operation or natural disaster can destroy data. Thank you for the informative link, btw. :) – Kevin Sep 27 '12 at 08:15