Arranging a raidz3 ZFS vdev to tolerate entire JBOD failures?

Question

Let's say I'm going to build a very large 1PB zpool. I'll have a head unit with the HBAs inside of it (maybe 4-port LSI SAS cards) and perhaps seven 45-drive JBODs attached to the head unit.

The basic way to do this with raidz3 would be to create 21 different 15-drive raidz3 vdevs (three 15-drive vdevs for each of the 7 JBODs) and just make a pool out of all 21 of these raidz3 vdevs.

This would work just fine.
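As a quick sanity check on the arithmetic of that basic layout, here is a minimal sketch. The figures come straight from the paragraphs above; the variable names are only illustrative, and nothing here touches real hardware.

```python
# Figures from the question: 7 JBODs, 45 drives each, 15-wide raidz3 vdevs.
JBODS = 7
DRIVES_PER_JBOD = 45
VDEV_WIDTH = 15
PARITY = 3  # raidz3 spends three disks' worth of space per vdev on parity

total_drives = JBODS * DRIVES_PER_JBOD               # all disks in the pool
vdevs_per_jbod = DRIVES_PER_JBOD // VDEV_WIDTH       # vdevs that fit in one JBOD
total_vdevs = JBODS * vdevs_per_jbod                 # vdevs in the pool
data_drives = total_vdevs * (VDEV_WIDTH - PARITY)    # disks' worth of data
parity_drives = total_vdevs * PARITY                 # disks' worth of parity

print(total_drives, total_vdevs, data_drives, parity_drives)
```

So roughly 252 of the 315 drives' worth of raw space would hold data, before any filesystem overhead.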

The problem here is that if you lose a single vdev for any reason, you lose the entire pool. Which means that you absolutely can never lose an entire JBOD, since that's 3 vdevs lost. BUT, in a mailing list thread, someone cryptically alluded to a way of organizing the disks so that you could indeed lose an entire JBOD. They said:

"Using a Dell R720 head unit, plus a bunch of Dell MD1200 JBODs dual pathed to a couple of LSI SAS switches ... We did triple parity and our vdev membership is set up such that we can lose up to three JBODs and still be functional (one vdev member disk per JBOD)."

... and I am not quite sure what they're saying here. I think what they are saying is that instead of building a vdev from 15 (or 12, or whatever) contiguous disks on one HBA, you actually have the parity drives for the vdev split into other JBODs, such that you could lose any JBOD and still have N-3 drives elsewhere to cover that vdev...

Or something...

Two questions:

1. Does anyone know what the recipe for this looks like?
2. Is it complex enough that you really do need a SAS switch, or could I just set it up with complex HBA<-->JBOD cabling?

Thanks.

Also... personally, I wouldn't do this... at least not without the blessing of a VAR or ZFS storage expert. Can you elaborate on the hardware you're planning to use, the usable disk space you require, the purpose/application of the pool and your expected performance profile? — ewwhite, Jul 14 '14 at 23:36

score 5 · Answer 1 · edited Apr 13 '17 at 12:14

The explanation for the JBOD resiliency you read about on the mailing list is probably something like a set of RAIDZ3 vdevs and enclosures... Say 8 disks per RAIDZ3 (5+3), and 5 (or 8?) enclosures, such that the vdevs were comprised of a single disk from each enclosure.
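To make the "one vdev member per enclosure" idea concrete, here is a minimal sketch that models disks as hypothetical `(enclosure, slot)` pairs and checks how many whole enclosures each layout can lose. The function names and the 7 x 45 figures (taken from the question) are assumptions for illustration only; note that one member per enclosure implies the vdev width equals the number of enclosures, so with 7 JBODs each raidz3 vdev would be 7 wide (4+3).

```python
from itertools import combinations

PARITY = 3  # raidz3 tolerates three missing members per vdev


def contiguous_layout(enclosures, slots, width):
    """Each vdev is `width` adjacent slots inside a single enclosure."""
    return [
        [(e, s) for s in range(start, start + width)]
        for e in range(enclosures)
        for start in range(0, slots, width)
    ]


def striped_layout(enclosures, slots):
    """Each vdev takes the same slot number from every enclosure."""
    return [[(e, s) for e in range(enclosures)] for s in range(slots)]


def survives(vdevs, failed):
    """True if no vdev loses more than PARITY members."""
    return all(sum(e in failed for e, _ in vdev) <= PARITY for vdev in vdevs)


def max_enclosure_losses(vdevs, enclosures):
    """Largest k such that every combination of k failed enclosures survives."""
    k = 0
    while k < enclosures and all(
        survives(vdevs, set(c)) for c in combinations(range(enclosures), k + 1)
    ):
        k += 1
    return k


basic = contiguous_layout(enclosures=7, slots=45, width=15)
spread = striped_layout(enclosures=7, slots=45)
print(len(basic), max_enclosure_losses(basic, 7))    # vdev count, enclosures it can lose
print(len(spread), max_enclosure_losses(spread, 7))
```

Under these assumptions the contiguous layout cannot survive the loss of even one enclosure, while the striped layout survives any three. The price is capacity: 45 vdevs of 4+3 leave 180 drives' worth of data space instead of 252. Each inner list would correspond to one `raidz3` group when creating the pool; mapping the pairs to real device names is left out here.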

But for realz, I would not do 1PB of storage without some degree of high availability...

Here are a couple of reference designs for a proper HA cluster with dual HBAs per head-node and redundant, cascaded SAS cabling. If I were designing this, I would plan on ZFS mirror deployment instead of RAIDZ(1/2/3).

I find the limitations of RAIDZ arrays to be a deal breaker in most production situations: lack of expandability, poor performance, complicated planning and more difficult fault recovery.

I'd use ZFS mirrors, SAS disks and the largest enclosures possible (e.g. 60-disk or 70-disk units), and I'd avoid Supermicro equipment ;)

Beyond that, quality JBOD units are very resilient in that they have internal redundancies, dual-path backplanes and midplane assemblies that typically don't fail. Most components are hot-swappable. I'd be less concerned about the enclosures and more about cabling, controller and pool design.

If you must use RAIDZ(1/2/3), configure as needed and keep spare disks in each JBOD. Configure them as global spares as well.

Dual node: [reference design diagram]

Single node: [reference design diagram]
