Currently, I have a 1TB, 2TB and 3TB drive, with probably around 5.5TB used, and I'm thinking that I will buy two more 4TB drives and set up a 4TB x 4TB x 3TB RAIDZ array.
With 4 TB drives, you shouldn't be looking at single-redundancy RAIDZ. I would recommend RAIDZ2 because of the additional protection it affords if a second drive develops problems while the array is already degraded or resilvering.
Remember that consumer drives are usually spec'd to a URE rate of one failed sector per 10^14 bits read. 1 TB (hard disk drive manufacturer terabyte, that is) is 10^12 bytes or close to 10^13 bits, give or take a small amount of change. A full read pass of the array you have in mind is statistically likely to encounter a problem, and in practice, read problems tend to develop in batches.
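As a rough sanity check, here is a back-of-the-envelope sketch of that claim. It assumes the spec-sheet rate of one URE per 10^14 bits and a full read pass over the 11 TB raw capacity of the proposed 4+4+3 TB array; real drives may do better or worse than spec.

```
# Poisson estimate of hitting at least one URE during a full read pass.
awk 'BEGIN {
    bits = 11e12 * 8      # 11 TB (manufacturer TB) of raw capacity, in bits
    rate = 1e-14          # spec sheet URE rate: 1 per 1e14 bits read
    # For these numbers: ~0.9 expected UREs, i.e. roughly a 60% chance of at least one.
    printf "Expected UREs: %.2f  P(at least one): %.0f%%\n", bits*rate, 100*(1 - exp(-bits*rate))
}'
```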
I'm not sure why you are suggesting RAIDZ2. Is it more likely that I will develop two simultaneous drive failures if I use RAIDZ1 than if I use no RAID? I want some improvement to the fault tolerance of my system. Nothing unrecoverable will exist in only one place, so the RAID array is just a matter of convenience.
RAIDZ1 uses a single disk's worth of capacity to provide redundancy within a vdev, whereas RAIDZ2 uses two (plus somewhat more complex parity calculations, but you are unlikely to be throughput-limited by those anyway). The benefit of the second redundant disk is cover for when the first one fails or otherwise becomes unavailable: once one disk is gone, a single-redundancy vdev has nothing left to absorb any further error.

With 4+4+3 TB you have 11 TB of raw storage, of which initially about 6 TB may need to be read to reconstruct a lost disk (8 TB once you upgrade that 3 TB drive to a 4 TB one and expand the pool to match). For order-of-magnitude estimates, that is somewhere between 10^13 and 10^14 bits. Statistically, at the spec'd URE rate, that works out to something like a coin-flip chance of hitting an unrecoverable read error during a single-redundancy resilver of an array that size. Sure, you may very well luck out, but for the entire duration of the resilver you have next to no protection against any further problem.
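To make the difference concrete, this is roughly what the two layouts look like at creation time. This is a sketch only: the pool name and device paths are placeholders, and RAIDZ2 really wants at least four devices to be worthwhile.

```
# Single parity: any one device may fail or return bad sectors.
zpool create tank raidz  /dev/sda /dev/sdb /dev/sdc

# Double parity: survives one failed device *plus* UREs on the survivors
# during the resilver (one extra device keeps the same usable space).
zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd
```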
My understanding is that for heterogeneous disk arrays like this, the pool size will be limited to the size of the smallest disk, which would mean I'd be looking at a 3 x 3 x 3 TB RAIDZ with 6TB usable space and 2TB of non-fault-tolerant space - is this true?
Almost. ZFS will restrict the vdev to the size of its smallest constituent device, so you get the effective capacity of a three-device RAIDZ vdev made up of 3 TB devices: 6 TB of user-accessible storage (give or take metadata). The remaining 2 TB of raw storage is wasted; it is not available for use even without redundancy. (It will show up in the EXPANDSZ column of zpool list, but it isn't being used.)
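You can see that headroom for yourself once the pool exists (the pool name tank is a placeholder):

```
# Per-device view; the EXPANDSZ column shows space that is present on the
# device but not yet usable by the vdev (here, the extra 1 TB on each 4 TB disk).
zpool list -v tank
```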
Once you replace the 3 TB drive with a 4 TB drive and expand the vdev (both of which are online operations in ZFS), the pool can use the additional storage space.
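As a sketch of that upgrade path (pool and device names are placeholders; check zpool status before pulling any disk):

```
# Let the vdev grow automatically once every member is large enough.
zpool set autoexpand=on tank

# Swap the 3 TB disk for the new 4 TB one and wait for the resilver to finish.
zpool replace tank /dev/sdc /dev/sde
zpool status tank

# If autoexpand was off at replace time, expand the new device explicitly.
zpool online -e tank /dev/sde
```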
There are ways around the wasted 2 TB -- for example, you could partition the drives to present three 3 TB devices plus two 1 TB devices (the remainder of the two 4 TB drives) to ZFS -- but it's going to seriously complicate your setup, and it's unlikely to work the way you plan. I strongly recommend against that.
The 2 TB of non-fault-tolerant space would not be backed up by ZFS to the offline disks, sorry if that was not clear. I was suggesting that I would back it up by normal disk syncing operations like rsync.
That implies that ZFS has no knowledge of those two 1 TB remainders, and that you are creating some other file system in that space. Yes, you can do that, but again, it's going to seriously complicate your setup for, quite frankly, what appears to be very little gain.
Assuming #1 is true, when I eventually need more space, if I add a single 4TB or 6TB drive to the array, will it be a simple matter to extend the pool to become a 4 x 4 x 4 TB array, or will I need to find somewhere to stash the 6TB of data while upgrading the array?
As I said above, ZFS vdevs and pools can be grown as an online operation if you do it by gradually replacing devices. (It is not, however, possible to shrink a ZFS pool or vdev.) What you cannot do is add additional devices to an existing vdev (such as the three-device RAIDZ vdev you are envisioning creating); an entirely new vdev must be added to the pool, and data written from then on is striped across the vdevs in the pool. Each vdev has its own redundancy requirements, but they can share hot spares.

You also cannot remove devices from a vdev, except in the case of mirrors (where removing a device only reduces the redundancy level of that particular mirror vdev and does not affect the amount of user-accessible storage space), and you cannot remove vdevs from a pool. The only way to do the latter (and, by consequence, the only way to fix some pool configuration mishaps) is to recreate the pool and transfer the data from the old pool, possibly by way of backups, to the new one.
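In practice, growing by adding rather than replacing looks something like this (device names are placeholders; note that this is a one-way door, because the new vdev cannot be removed again):

```
# Adds a *second* RAIDZ vdev alongside the existing one; new writes are
# striped across both vdevs. The existing vdev is not widened.
zpool add tank raidz /dev/sdf /dev/sdg /dev/sdh

# Optional: a hot spare added like this is shared by all vdevs in the pool.
zpool add tank spare /dev/sdi
```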
The 2TB of non-fault-tolerant space is not that big a deal, because I was planning on setting aside around 2TB for "stuff that needs proper backup" (personal photos, computer snapshots, etc), which I would mirror to the remaining 2TB disk and a 2nd 2TB external drive that I will keep somewhere else.
ZFS redundancy isn't really designed for the mostly-offline offsite-backup-drive use case. I discuss this in some depth in Would a ZFS mirror with one drive mostly offline work?, but the gist of it is that it's better to use zfs send/zfs receive to copy the contents of a ZFS file system (including snapshots and other paraphernalia), or plain rsync if you don't care about snapshots, than to use mirrors in a mostly-offline setup.
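A minimal sketch of what that looks like, assuming the live pool is called tank, the dataset is tank/photos, and the external drive holds a pool called backup (all names are placeholders):

```
# Snapshot-based copy that preserves snapshots and dataset properties;
# with -d the data lands in backup/photos.
zfs snapshot -r tank/photos@2016-11-29
zfs send -R tank/photos@2016-11-29 | zfs receive -duF backup

# Or, if you don't care about snapshots, a plain file-level sync works too.
rsync -aHAX --delete /tank/photos/ /mnt/backup/photos/
```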
If I'm using half my disks for fault tolerance, I might as well just use traditional offline backups.
This admittedly depends a lot on your situation. What are your time-to-recovery requirements in different failure scenarios? RAID is about uptime and time to recovery, not about safeguarding data; you need backups either way.
@Paul: While rebuilding a failed drive, all data on the surviving drives must be read. This increases the stress on the disks (especially if they are mostly idle normally) and therefore the chance of another drive failing. Additionally, while reading all that data you may get a URE from one of the disks, with no second copy to compensate, which means files can be damaged or lost. Third, the bigger your disks are, the longer your window of vulnerability becomes, not just for those problems but for anything else that may go wrong with the disks or the system (power outage, etc.). – user121391
@Paul Sorry, I misread your question (my first answer was about the reliability of RAIDZ1 in case of one dead drive). To answer it: in this case you get more reliability with Z1 over basic vdevs while everything works normally, but after one disk has died, your chance of further corruption actually increases. – user121391
@MichaelKjörling Well, like I said, everything unrecoverable will be backed up. This is for a home media server if I didn't mention it, so uptime is not a major issue. I will ask a separate question to resolve the RAIDZ1 vs RAIDZ2 issue. – Paul
@Paul Please do link to it from here in a comment as well; that sounds useful. You may want to review What are the different widely used RAID levels and when should I consider them?, Hot spare or extra parity drive in RAID array?, and to a lesser extent Is bit rot on hard drives a real problem? What can be done about it?, all three on [sf], because this is typically an enterprise, not home, consideration. – a CVn
Here is the new question about RAIDZ1 vs. nothing. – Paul