9

I have a SAN rack with 36 × 4 TB HDDs. The RAID controller does not support RAID60 and allows no more than 16 HDDs in one RAID group, so I decided to make either 2 RAID6 groups of 16 HDDs or 4 groups of 8 HDDs. I want to get all of the storage as one partition.

So, what could possibly go wrong if I use a ZFS pool on top of hardware RAID6? Yes, I know it is strongly recommended to use raw HDDs or pass-through mode, but I don't have that option.

Or should I stay away from ZFS and software RAID in this situation? (I'm mostly interested in compression and snapshots.)

Severgun
  • If you're going to use ZFS then why not just expose all disks individually (sometimes called HBA mode) and let ZFS handle it - it's what it does best. We have a number of true experts at this (ewwhite for a start) who will help you with this - what exact disk controller are you using? – Chopper3 Nov 21 '16 at 09:57
  • "Yes, I know it is strongly recommended to use raw HDDs or pass-through mode, **but I don't have that option.**" It's a DotHill 3530. – Severgun Nov 21 '16 at 10:13
  • Is it the 3530c, presumably with another two shelves 'daisy-chained' through? What interface are you using (SAS, FC, etc.)? Is any other server attached to this storage? – Chopper3 Nov 21 '16 at 10:34
  • You'll be subverting many ZFS features using this method, but overall it's not going to hurt anything to do it this way. Checksumming is a lot less useful in this configuration, as the RAID controller will be abstracting away all of the disk details. I'm more interested in why you say you can't use JBOD. AssuredSAN 3530 units are JBOD-capable. – Spooler Nov 21 '16 at 10:35
  • Yes, 3 shelves chained with 2 controllers, connected to the file server by SAS. It seems there is a JBOD mode, but only 16 HDDs per controller. What if one controller fails? The help says that with RAID, vdisks will be reallocated to the surviving controller, but it says nothing about non-RAID. Also, 4 HDDs would sit unused because there's no need for hot spares. – Severgun Nov 21 '16 at 11:12
  • I'd wait for ewwhite - he's in central US so is sleeping but he knows ZFS better than anyone I know – Chopper3 Nov 21 '16 at 11:50
  • Yes, a core maintainer of ZFS should know it even better :) I think this wiki is what I need: https://github.com/ewwhite/zfs-ha/wiki – Severgun Nov 21 '16 at 13:11
  • *no more than 16 HDDs in one RAID group* **WHY** would you want to stuff more than 16 HDDs into one RAID group?!?!?! – Andrew Henle Nov 22 '16 at 11:08
  • The DotHill 3530 is a NEBS Level 3-compliant piece of hardware with significant built-in reliability and availability features. Using such a device as a JBOD subverts many of those features. Somebody paid a lot of money for those features. Throwing them away because "ZFS!!!" makes no sense. – Andrew Henle Nov 22 '16 at 11:24
  • @Severgun *Also, 4 HDDs would sit unused because there's no need for hot spares* Do you really think it's better for a RAID array with a failed drive to limp along in degraded mode than it is to automatically pick up a hot spare, rebuild, and return to fully-functional status? – Andrew Henle Nov 22 '16 at 11:32
  • @Andrew Henle Of course not. That is why I really don't want to use a 32-disk JBOD setup. – Severgun Nov 22 '16 at 15:12
  • @Severgun I just wanted to make my point clear. You're going to keep getting a lot of **BUT ZFS SHINY!!!!** posts that want you to throw away all the reliability, availability and performance capabilities that your SAN hardware provides. Capabilities that someone paid a lot of money for. I'm guessing most of those postings will come from people without experience on high-end enterprise-class storage systems. – Andrew Henle Nov 23 '16 at 11:07
  • @Chopper3 I'll answer... reluctantly. – ewwhite Nov 23 '16 at 13:17

4 Answers

5

So I decided to make either 2 RAID6 groups of 16 HDDs or 4 groups of 8 HDDs.

That's not the best way to do things. It may work well enough, but depending on your performance requirements, it may not.

The ideal geometry for a RAID5/6 array is one where the block size of the file system built on top of it is an exact multiple of the amount of data that "spans" the array.

RAID5/6 arrays work as block devices - a single block of data spans the disks in the array, and that block also contains parity data. Most RAID controllers will write a power-of-two sized chunk of data to each disk in the array - the exact value of which is configurable in better RAID systems - and your Dot Hill unit is one of those "better RAID systems". That's important.

So it takes N x (amount of data stored per disk chunk) to span the array, where N is the number of data disks. A 5-disk RAID5 array has 4 "data" disks, and a 10-drive RAID6 array has 8 data disks.

When data is written to a RAID5/6 array and the block of data is big enough to span the entire array, the parity is computed for that data - usually in the controller's memory - and then the entire stripe is written to disk. Simple, and fast.

But if the chunk of data being written isn't big enough to span the entire array, what does the RAID controller have to do in order to compute the new parity data? Think about it - it needs all the data in the entire stripe to recompute the new parity data.

So if you make a 16-drive RAID6 array with the default per-disk chunk of 512kB, that means it takes 7 MB (14 data disks x 512kB) to "span" the array.
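
A rough sanity check on that arithmetic, as a minimal Python sketch (the drive counts and the 512kB default chunk are just the figures used in this answer; adjust them for your actual configuration):

```
# Full-stripe ("span") size = number of data disks x per-disk chunk size.

def full_stripe_kb(total_disks, parity_disks, chunk_kb):
    """kB of data needed to span one full stripe of the array."""
    data_disks = total_disks - parity_disks
    return data_disks * chunk_kb

print(full_stripe_kb(5, 1, 512))    # 5-disk RAID5:  4 data disks  -> 2048 kB (2 MB)
print(full_stripe_kb(10, 2, 512))   # 10-disk RAID6: 8 data disks  -> 4096 kB (4 MB)
print(full_stripe_kb(16, 2, 512))   # 16-disk RAID6: 14 data disks -> 7168 kB (the ~7 MB above)
```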

ZFS works in 128kB blocks, generally.

So ZFS writes a 128kB block to a 16-drive RAID6 array. In the configuration you're proposing, that means the RAID controller needs to read almost 7 MB from the array and recompute the parity across those 7 MB. Then rewrite that entire 7 MB back to disk.

If you're lucky, it's all in cache and you don't take a huge performance hit. (This is one major reason why the "don't use RAID5/6" position has such a following - RAID1[0] doesn't suffer from this.)

If you're unlucky and you didn't properly align your filesystem partitions, that 128kB block spans two RAID stripes that aren't in cache, and the controller needs to read 14 MB, recompute parity, then write 14 MB. All to write one 128kB block.
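
To put rough numbers on that read-modify-write penalty, here's a minimal sketch using the same simplified model as above (16-drive RAID6, 512kB per-disk chunks, 128kB ZFS blocks, nothing useful in cache):

```
# Simplified read-modify-write model: the controller reads the affected
# stripe(s), recomputes parity, then writes the stripe(s) back.

ZFS_BLOCK_KB = 128
STRIPE_KB = 14 * 512    # 16-drive RAID6 with 512 kB per-disk chunks -> 7168 kB

for stripes_touched in (1, 2):    # 2 = the misaligned case spanning two stripes
    moved_kb = stripes_touched * STRIPE_KB * 2    # read it all, write it all back
    print(f"{stripes_touched} stripe(s): ~{moved_kb // 1024} MB of IO for one "
          f"{ZFS_BLOCK_KB} kB write (~{moved_kb // ZFS_BLOCK_KB}x amplification)")
```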

Now, that's what needs to happen logically. There are a lot of optimizations that good RAID controllers can take to reduce the IO and computational load of such IO patterns, so it might not be that bad.

But under heavy load of writing 128kB blocks to random locations, there's a really good chance that the performance of a 16-drive RAID6 array with a 7 MB stripe size will be absolutely terrible.

For ZFS, the "ideal" underlying RAID5/6 LUNs for a general purpose file system where most accesses are effectively random would have a stripe size that's an even divisor of 128kB, such as 32kB, 64kB, or 128kB. In this case, that limits the number of data disks in a RAID5/6 array to 1 (which is nonsensical - even if possible to configure, it's better to just use RAID1[0]), 2, 4, or 8. Best performance in the best-case scenario would be to use a 128kB stripe size for the RAID5/6 arrays, but best-case doesn't happen often in general-purpose file systems - often because file systems don't store metadata the same as they store file data.

I'd recommend setting up either 5-disk RAID5 arrays or 10-disk RAID6 arrays, with the per-disk chunk size set small enough that the amount of data to span an entire array stripe is 64kB (yeah, I've done this before for ZFS - many times). That means for a RAID array with 4 data disks, the per-disk chunk size should be 16kB, while for an 8-data-disk RAID array, the per-disk chunk size should be 8kB.
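
A quick sketch of the arithmetic behind those chunk sizes (target: a 64kB full stripe, with power-of-two per-disk chunks):

```
# Target: full stripe = 64 kB, i.e. (data disks) x (per-disk chunk) = 64 kB.

TARGET_STRIPE_KB = 64

for name, total_disks, parity_disks in [("5-disk RAID5", 5, 1), ("10-disk RAID6", 10, 2)]:
    data_disks = total_disks - parity_disks
    chunk_kb = TARGET_STRIPE_KB // data_disks
    print(f"{name}: {data_disks} data disks -> {chunk_kb} kB per-disk chunk")

# 5-disk RAID5:  4 data disks -> 16 kB per-disk chunk
# 10-disk RAID6: 8 data disks ->  8 kB per-disk chunk
```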

Then allow ZFS to use the entire array - do not partition it. ZFS will align itself properly to an entire drive, whether the drive is a simple single disk or a RAID array presented by a RAID controller.

In this case, and without knowing your exact space and performance requirements, I'd recommend setting up three 10-drive RAID6 arrays or six 5-drive RAID5 arrays with a 64kB stripe size, configuring a couple of hot spares, and saving four of your disks for whatever comes up in the future. Because something will.
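
For completeness, a rough disk-budget and raw-capacity check for that layout (36 x 4 TB drives; the figure is raw data-disk capacity, before any filesystem, compression or metadata overhead):

```
TOTAL_DISKS, DISK_TB = 36, 4

arrays, disks_per_array, parity_per_array = 3, 10, 2    # three 10-drive RAID6 arrays
hot_spares, held_back = 2, 4                            # spares + disks saved for later

disks_used = arrays * disks_per_array + hot_spares
data_disks = arrays * (disks_per_array - parity_per_array)

print(f"disks allocated: {disks_used}, held back: {held_back}, total: {disks_used + held_back}")
print(f"raw data capacity: ~{data_disks * DISK_TB} TB")
# 32 allocated + 4 held back = 36; 24 data disks x 4 TB = ~96 TB
# (six 5-drive RAID5 arrays work out the same: 24 data disks, ~96 TB)
```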

I would most certainly not use that disk system in JBOD mode - it's a fully NEBS Level 3-compliant device that provides significant reliability and availability protections built right into the hardware. Don't throw that away just because "ZFS!!!!". If it's a cheap piece of commodity hardware you put together from parts? Yeah, JBOD mode with ZFS handling the RAID is best - but that's NOT the hardware you have. USE the features that hardware provides.

Andrew Henle
  • _That means for a RAID array with 4 data disks, the per-disk chunk size should be 16kB, while for an 8-data-disk RAID array, the per-disk chunk size should be 32kB._ I'm a little bit confused with this math. Why 8 disks - 32kB chunk? Correct me if I'm wrong: 128kB(ZFS block) / 3(RAID arrays) = 43 kB per-RAID array. RAID6 of 10 disks 43kB / 8 = 5kB (not available chunksize) closest 8kB chunksize also not available by hardware. So, best performance not accessible? – Severgun Nov 23 '16 at 08:51
  • @Severgun I put the chunk sizes backwards. The problem with aiming for the absolute best performance on RAID5/6 is that it will only happen when almost all IO operations perfectly match the RAID array stripe size. Significant numbers of IO operations smaller than the stripe size can seriously degrade performance. Going with a smaller block size helps limit the impact of random small-block writes. In my experience, it's better to give up 1-2% of *possible* maximum performance in exchange for limiting worst-case drop off. General-purpose file systems tend to have a good number of small writes. – Andrew Henle Nov 23 '16 at 11:00
  • (cont) 8 data disks in a RAID5/6 array with a 16kB chunk size per disk makes for a 128kB stripe size across the array. Likewise 32kB chunks for a 4-data-disk array. ZFS writes a 128kB file data block to a single device - it's not split across all zdevs. Again, though, for a general-purpose file system, there's going to be a lot of sub-128kB writes, so a smaller stripe size (64kB) will avoid performance degradation better under heavy write load, but at a small cost in best-case performance. – Andrew Henle Nov 23 '16 at 11:28
4

Okay, I'll bite...

This is the wrong hardware for the application. The DotHill setup has the same limitations as an HP StorageWorks MSA2000/P2000 in that only 16 drives can be used in a single array grouping.

ZFS atop hardware RAID or an exported SAN LUN is not necessarily a problem.

However, striping ZFS LUNs over unknown interconnects, across expansion chassis can introduce some risk.

  • For instance, are you running multipath SAS in a ring topology with dual controllers?
  • Do you have redundant cabling back to the server?
  • Have you distributed drives vertically across enclosures in a manner that would mitigate failure of a single chassis/cable/controller and prevent it from destroying a part of your RAID0 stripe?

Seriously, it may be worth evaluating whether you need all of this storage in a single namespace...

If you DO require that type of capacity in a single mount, you should be using a dedicated HBA-attached JBOD enclosure and possibly multiple head units with resilient cabling and a smarter layout.

ewwhite
1

You should DIRECTLY attach all drives to a box running ZFS. Get a SAS HBA and connect the drives to the ZFS-capable box (e.g. running OmniOS or SmartOS). You can then share the space via NFS, SMB, iSCSI ...

Tobi Oetiker
  • *You should DIRECTLY attach all drives to a box running ZFS.* Not necessarily - replacing failed drives in a *hardware* array on some controllers is *easy*: pull out the hard drive with the failure light lit then pop a new one in. No system administrator needed to run ZFS commands to replace the drive. In an enterprise setup with hundreds or thousands of servers and maybe tens of thousands of hard drives spread over multiple data centers, that's a concern. Drives fail a whole lot more than bit rot happens. – Andrew Henle Nov 22 '16 at 11:14
  • @Tobi Oetiker Tell me how to fit 36 3.5" HDDs into a 2U case. – Severgun Nov 22 '16 at 15:41
  • We just put them in an extra box and use a SAS expander. As for large deployments, maybe ask how Joyent is handling it. – Tobi Oetiker Nov 22 '16 at 17:54
  • @AndrewHenle To be fair, it is possible to achieve the same easy replacement procedure and status LEDs with ZFS and the right HBAs (may involve some minor scripting if not using a prepackaged solution). – user121391 Nov 23 '16 at 12:05
0

The reason ZFS on top of HW RAID logical volumes is a VERY BAD idea is that ZFS requires block-level access to function properly. Yes, it will be usable, but functionality will not be complete until you attach drives directly to the OS via an HBA or direct SATA connections. One example is that in the configuration you're proposing, ZFS cannot reasonably protect your data against changes to the data below (on the other side of the HW RAID controller), and as such cannot guarantee the safety of your data. This is one of the PRIMARY reasons ZFS is used, in addition to it being super duper fast.

ZFS is awesome tech, and I highly recommend it. But you're going to need to revisit your structure here in order to be able to correctly use it. Namely having ZFS create the logical volumes (vdevs) from the disks directly.

It sounds like there's a lot more reading you need to do on how ZFS operates before you can accurately understand what you've proposed, in contrast to what really should be done instead.

BloodyIron
  • Yes, yes, and yes. I understand how ZFS works as well as I can. But there are some complications: 1) I already have the SAN enclosure and **need to** use it; I'm not building storage from scratch. 2) This is not my home NAS where I can buy and throw things away. 3) The budget for a storage config rebuild is **zero**. From the storage I need the maximum available write speed with around 100 TB of space. I'm looking at ZFS mostly for compression and snapshots. I could try btrfs, but it is experimental. Hmm, maybe ZoL is unstable too? I don't know. – Severgun Nov 23 '16 at 05:34
  • @Severgun As long as you know what the downsides are, you will be fine in my opinion. ZFS has many nice features (like snapshots) that work independently from others. Most advice on the internet stresses the importance of best practices in all areas, but they are recommendations, not strict requirements. This point will become less important in the future, as more and more Linux distributions change to ZFS and most Linux systems run virtualized, so they will have your exact situation. – user121391 Nov 23 '16 at 08:13
  • *The reason ZFS on top of HW RAID logical volumes is a VERY BAD idea is that ZFS requires block-level access to function properly.* That's so bad it's not even good enough to be called wrong. You apparently have no idea what a NEBS 3-compliant piece of hardware means, do you? *in addition to it being super duper fast.* ZFS is lots of good things. "super duper fast" is **NOT** one of them. [This is a *fast*](https://en.wikipedia.org/wiki/QFS) file system. [So is this](https://en.wikipedia.org/wiki/IBM_General_Parallel_File_System). As file systems go, ZFS is *not* fast. – Andrew Henle Nov 23 '16 at 10:52