
I have a large (> 100TB) ZFS (FUSE) pool on Debian that lost two drives. As the drives failed, I replaced them with spares until I could schedule an outage and physically replace the bad disks.

When I took the system down and replaced the drives, the pool began resilvering as expected, but when it reaches about 80% complete (which usually takes about 100 hours), the resilver restarts from the beginning.

I'm not sure whether replacing two drives at once created a race condition, or whether the pool is so large that the resilver takes long enough for other system processes to interrupt it and cause the restart. Either way, there's no obvious indication of a problem in the output of `zpool status` or in the system logs.

I have since changed how I lay out these pools to improve resilvering performance, but any leads or advice on getting this system back into production would be appreciated.

`zpool status` output (the errors are new since the last time I checked):

  pool: pod
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver in progress for 85h47m, 62.41% done, 51h40m to go
config:

    NAME                                                 STATE     READ WRITE CKSUM
    pod                                                  ONLINE       0     0 2.79K
      raidz1-0                                           ONLINE       0     0 5.59K
        disk/by-id/wwn-0x5000c5003f216f9a                ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0CWPK    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQAM    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BPVD    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQ2Y    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0CVA3    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQHC    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BPWW    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F09X3Z    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQ87    ONLINE       0     0     0
        spare-10                                         ONLINE       0     0     0
          disk/by-id/scsi-SATA_ST3000DM001-1CH_W1F20T1K  ONLINE       0     0     0  1.45T resilvered
          disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F09BJN  ONLINE       0     0     0  1.45T resilvered
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQG7    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQKM    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQEH    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F09C7Y    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0CWRF    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQ7Y    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0C7LN    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQAD    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0CBRC    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BPZM    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BPT9    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQ0M    ONLINE       0     0     0
        spare-23                                         ONLINE       0     0     0
          disk/by-id/scsi-SATA_ST3000DM001-1CH_W1F226B4  ONLINE       0     0     0  1.45T resilvered
          disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0CCMV  ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0D6NL    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0CWA1    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0CVL6    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0D6TT    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BPVX    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F09BGJ    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0C9YA    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F09B50    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0AZ20    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BKJW    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F095Y2    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F08YLD    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQGQ    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0B2YJ    ONLINE       0     0    39  512 resilvered
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQBY    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0C9WZ    ONLINE       0     0     0  67.3M resilvered
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQGE    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQ5C    ONLINE       0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0CWWH    ONLINE       0     0     0
    spares
      disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0CCMV      INUSE     currently in use
      disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F09BJN      INUSE     currently in use

errors: 572 data errors, use '-v' for a list
1 Answer


Congratulations and uh-oh. You've stumbled across one of the better things about ZFS, but also committed a configuration sin.

First, since you are using raidz1, you have only one disk's worth of parity data. However, you had two drives fail contemporaneously. The only possible result here is data loss; no amount of resilvering is going to fix that.

Your spares helped you out a little bit here and saved you from a completely catastrophic failure. I'm going to go out on a limb here and say that the two drives that failed did not fail at the same time and that the first spare only partially resilvered before the second drive failed.

That seems hard to follow, so here's a picture:

[diagram: sequence of events]

This is actually a good thing: if this were a traditional RAID array, the entire array would simply have gone offline as soon as the second drive failed, and you would have had NO chance of an in-place recovery. Because this is ZFS, it can keep running with the pieces it has and simply returns block- or file-level errors for the pieces it doesn't.

Here is how you fix it, short-term: get a list of damaged files from `zpool status -v` and either copy those files from backup to their original locations or delete them. That will allow the resilver to resume and complete.
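A minimal sketch of that short-term fix, assuming your pool name `pod` from the status output above and a hypothetical backup mounted at `/backup` (the file paths are placeholders):

# list the damaged files; their paths appear under "errors:"
zpool status -v pod

# restore each damaged file from backup over the corrupt copy...
cp -a /backup/pod/data/file.dat /pod/data/file.dat

# ...or delete it if it's expendable
rm /pod/data/file.dat

# then watch the resilver resume
zpool status pod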

Here is your configuration sin: you have way too many drives in a raidz group.

Long term: you need to reconfigure your drives. A more appropriate configuration would be to arrange the drives into small raidz1 groups of 5 drives or so. ZFS will automatically stripe across those small groups. This significantly reduces the resilver time when a drive fails, because only 5 drives need to participate instead of all of them. The command to do this would be something like:

zpool create tank raidz da0 da1 da2 da3 da4 \
                  raidz da5 da6 da7 da8 da9 \
                  raidz da10 da11 da12 da13 da14 \
                  spare da15 da16
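
With that layout, a failed drive involves only its own five-disk group: the replacement resilvers against its four surviving peers rather than every disk in the pool. A sketch of handling a failure, reusing the hypothetical device names above:

# after physically swapping the failed da2 for a new disk in the same slot
zpool replace tank da2

# only the raidz1 group containing da2 participates in the resilver
zpool status tank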
– longneck
  • Thank you very much @longneck for the detailed and informative answer! You're spot-on about the sequence of events, and I've already taken your advice on device configuration (the second device I built is configured almost exactly as you described, with some additional considerations to keep each raid spread across the hardware to reduce the chances of losing an entire raid due to backplane failure, etc.). – jasongullickson Jul 15 '13 at 15:37
  • Having deleted the corrupted files, `zpool status -v` now returns hex values instead of filenames; I assume this will go away when the scrub finally finishes? – jasongullickson Jul 15 '13 at 15:38
  • @jasongullickson only if the metadata of the filesystem is also intact. ZFS is pretty aggressive when it comes to protecting the metadata so you will probably be good. only time will tell. – longneck Jul 15 '13 at 15:40
  • i personally haven't run in to a metadata corruption event before so i don't know what that will look like in terms of error events. – longneck Jul 15 '13 at 15:41
  • I'll try and remember to update the thread when I find out @longneck :) Also, am I correct in saying that there's no way to reconfigure the drives "online"; that it's a "backup, reconfigure, restore"-type operation? – jasongullickson Jul 15 '13 at 16:02
  • how full is your filesystem? – longneck Jul 15 '13 at 16:04
  • this one is pretty full (85%). I have a second system with a similar capacity that is much less full, so if there is some "magic number" of allocation that I can get it down to that will allow me to reconfigure the drives I might be able to do that. – jasongullickson Jul 15 '13 at 16:13
  • What I would do is use `zfs send` to copy the data to the other system (see the sketch after these comments). Reconfigure this system. Then send the data back. That's assuming the other system has enough space. – longneck Jul 15 '13 at 16:51
  • Great post and info - I'm surprised that it handles the situation that well! One note - smaller RAID groups won't necessarily help with shortening re-silvering time, since you're just doing a full read of each disk alive in the group (no matter how many) and a full write to the new member; the capacity and speed of the individual disks is more likely to be determining that. A smaller RAID group reduces the raw volume of data needing to be read to rebuild onto the new disk (so bottlenecks might arise there), but doesn't change what each individual disk needs to do. – Shane Madden Jul 16 '13 at 05:30
  • @ShaneMadden um, no. If you have 25 disks in a raidz1 group, ZFS reads 24 of them to resilver 1 disk. If you have 30 disks divided into 6 raidz1 groups, then you only need to read 4 disks to resilver 1 disk. If you have enough bandwidth available, then those two operations could be equivalent. But resilvering using 24 disks vs. 4 will take approximately 6x more processor time. In my benchmarking on a couple of HP Proliant G5-G8 servers I have available, the exact drive scenario I describe yields an 80% faster resilver with no oversubscription, and 30% when oversubscribed. – longneck Jul 16 '13 at 16:12
  • @longneck Then your bottleneck is somewhere other than the I/O on the disks themselves, which is a possible scenario as I mentioned: some other component is bottlenecking the resilver, not allowing the disks to read at full speed, which is certainly a factor that needs to be considered in systems where the disks are capable of saturating other components (though I'm not sure why you mention CPU; surely it's capable of keeping up with the XOR work?). Can you clarify what you mean by "oversubscribed"? – Shane Madden Jul 16 '13 at 17:12
  • By oversubscribed I'm referring to having not enough bandwidth available for the drives to all run simultaneously at full speed. – longneck Jul 16 '13 at 17:24
  • @longneck Gotcha, then we're in agreement - it's definitely a bad idea to have a RAID-Z group large enough that you're bottlenecking a resilver and slowing it down. And the other big risk of larger groups is the increased odds of a second device failing during the resilver - an increased number of parity disks (with RAID-Z2 or 3) would help with the reliability problems, but not with the speed of the resilver. – Shane Madden Jul 16 '13 at 17:35
  • Hey guys, thanks again for all this awesome info. The good news is that I have a second device now to use while I reconfigure this one. The bad news is that recently another disk has failed, so the pool is UNAVAIL :( I'm planning to have the recently-failed disk "recovered" to a new hard drive, but when the new drive is ready, I'm not sure of the best way to bring it back into the pool. I have a new thread posted here: http://serverfault.com/questions/543823/replacing-a-recovered-hard-disk-in-a-zfs-pool if you have any recommendations :) – jasongullickson Oct 04 '13 at 13:59
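
For reference, a sketch of the `zfs send` migration longneck describes above, assuming a hypothetical second system `host2` with a pool `tank` large enough to hold the data (host, pool, and snapshot names are placeholders):

# take a recursive snapshot of the source pool
zfs snapshot -r pod@migrate

# stream the entire pool to the other system over SSH
zfs send -R pod@migrate | ssh host2 zfs receive -F tank/pod

# after rebuilding this system's pool with smaller raidz1 groups,
# snapshot tank/pod on host2 and send the data back the same way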