
Background

I'm installing Proxmox Virtual Environment on a Dell PowerEdge R730 with a Dell PowerEdge RAID Controller (PERC) H730 Mini hardware RAID controller and eight 3TB 7.2k 3.5" SAS HDDs. I was contemplating using the PERC H730 to configure six of the physical disks as a RAID10 virtual disk, with the remaining two physical disks reserved as hot spares. However, there seems to be quite a bit of confusion about how ZFS and HW RAID relate, and my research has brought me more confusion than clarity.

Questions

  • What are the advantages and disadvantages of HW RAID versus ZFS?
  • What are the differences between HW RAID and ZFS?
  • Are HW RAID and ZFS complementary technologies or incompatible with each other?
  • Since Proxmox VE is a Debian-based Linux distribution, does it make more sense to use the H730 for RAID10 with LVM versus setting the H730 in HBA mode and using ZFS?

If these should be separate ServerFault questions, please let me know.

Similar ServerFault Questions

I found the following similar ServerFault questions, but they don't seem to directly address the questions above. That said, I fully admit that I'm not a full-time sysadmin, so maybe they do address my questions and I'm simply out of my depth.

Additional Research

Matthew Rankin

5 Answers


Hardware RAID vs ZFS doesn't make a lot of difference from a raw throughput perspective -- either system needs to distribute data across multiple disks, and that requires running a few bit shifting operations on cached data, and scheduling writes to underlying disks. Which processor you use for that hardly matters, and synthetic workloads like running dd can't tell you much here.

The differences are in features:

Hardware RAID is usually just a block layer, perhaps with some volume management on top, while ZFS also includes a file system layer (i.e. there is no separation of concerns in ZFS). This allows ZFS to offer compression and deduplication, which would be hard to get right at a pure block layer; but for use cases where you just want a set of simple 1:1 mappings, that additional complexity is still there.
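As a quick illustration of those file-system-level features, here is a minimal sketch; the pool and dataset names (`tank/data`, `tank/containers`) are placeholders:

```
# transparent compression on a dataset -- cheap and usually worth enabling
zfs set compression=lz4 tank/data

# deduplication -- memory-hungry, only pays off for highly redundant data
zfs set dedup=on tank/containers

# see how much space compression actually saves
zfs get compressratio tank/data
```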

On the other hand, hardware RAID can offer battery backed write caches that are (almost) transparent to the operating system, so it can easily compensate for the overhead of a journaling file system, and data needs to be transferred out of the CPU only once, before adding redundancy information.

Both have their use cases, and in some places, it even makes sense to combine them, e.g. with a hardware RAID controller that offers a battery backed cache, but the controller is set to JBOD mode and only re-exports the constituent disks to the operating system, which then puts ZFS on top.
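A rough sketch of that combined layout, assuming the controller re-exports six member disks individually and ZFS provides the redundancy (pool name and device paths are placeholders):

```
# controller in JBOD/HBA mode with its battery-backed cache still in play;
# the disks appear individually and ZFS builds the redundancy on top
zpool create -o ashift=12 tank \
    raidz2 /dev/disk/by-id/scsi-DISK1 /dev/disk/by-id/scsi-DISK2 \
           /dev/disk/by-id/scsi-DISK3 /dev/disk/by-id/scsi-DISK4 \
           /dev/disk/by-id/scsi-DISK5 /dev/disk/by-id/scsi-DISK6
```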

In general, ZFS alone is good for "prosumer" setups, where you don't want to spend money on hardware, but still want to achieve sensible fault tolerance and some compression, and where random-access performance isn't your primary concern.

ZFS on top of JBOD is great for container and VPS hosting -- the deduplication keeps the footprint of each container small, even if they upgrade installed programs, as two containers that have installed the same upgrade get merged back into one copy of the data (which is then again kept in a redundant way).

Hardware RAID alone is good for setups where you want to add fault tolerance and a bit of caching on the outside of an existing stack -- one of the advantages of battery backed write caches is that they are maintained outside of OS control, so the controller can acknowledge a transfer as completed as soon as the data has reached the cache. If a write is superseded later, it can be skipped, and head movements can be scheduled system-wide, ignoring dependencies.

The way journaling file systems work, they will first submit a journal entry; then, as soon as that is acknowledged, the data; and after that is acknowledged, another journal entry marking the first as complete. That is a lot of head movement, especially when the disks are shared between multiple VMs that each have their own independent journaling file system. In a busy system, the caches allow you to skip about half of those writes, while from the point of view of the inner system the journal still behaves normally and dependent writes are performed in order.

The aspect of safely reordering dependent writes for more optimal head movements is why you want a hardware RAID at the bottom. ZFS generates dependent writes itself, so it can profit from hardware RAID too, but these are the performance bottleneck only in a limited set of use cases, mostly multi-tenant setups with little coordination between applications.

With SSDs, reordering is a lot less important, obviously, so the motivation to use hardware RAID there is mostly bulk performance -- if you've hit the point where memory and I/O interface speed on the mainboard are relevant factors, then offloading the checksum generation and transferring only a single copy one way vs multiple transfers from and to RAM (that need to be synchronized with all the other controllers in the same coherency domain) is definitely worth it. Hitting that point is a big "if" -- I haven't managed so far.

Simon Richter
  • Ten years or more ago it was necessary to have RAID offloaded to a controller because the CPUs of the time just couldn't handle it very well (or at all) in addition to the actual business workload. These days CPU usage is often not an issue. – Michael Hampton Feb 12 '21 at 16:00
  • CPU workload was never a big issue here; workloads with both CPU and I/O load at the same time are seldom even today, and you wouldn't run these on checksummed RAID anyway (more likely: database on RAID1). Before SATA, hotplug was one of the reasons people bought hardware RAID -- one of the common failure modes for software RAID was a disk failing during the power cycle and reboot of a degraded RAID that was needed to swap an IDE drive, and that's where the reputation of software RAID as unreliable comes from. It's still [annoying to set up](https://serverfault.com/a/1053425/66021) though. – Simon Richter Feb 12 '21 at 16:13

Short answer... You can use hardware RAID where it makes sense.

It really depends on where you want your RAID protection to come from and where you want your volume management to come from.

For example, I use HPE ProLiant servers...

  • I'm building a 100TB storage array today.
  • This is going into an environment where there won't be regular IT staff or knowledgeable support.
  • I'm using HPE SmartArray RAID to build this as a RAID 60 setup across 24 disks.
  • I'll set the Smart Array to carve out a 100GB RAID 60 volume for the OS and leave the remainder for the data volume.
  • ZFS will be installed on top of the RAID block device presented to the OS (e.g. a single-vdev ZFS zpool; a sketch of this follows below).
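A minimal sketch of that last step, assuming the SmartArray data volume shows up as a single block device such as /dev/sdb (device, pool, and dataset names are placeholders, not the actual build):

```
# the RAID 60 data volume presented by the controller is one block device;
# ZFS sits on top of it as a single-vdev pool
zpool create -o ashift=12 vol1 /dev/sdb

# volume management, compression, snapshots etc. still come from ZFS
zfs create -o compression=lz4 vol1/data
```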

The reasoning for this design is that HPE SmartArray hardware RAID is reliable and consistent in operation. It's easy to direct someone to replace a disk or to build automatic spares into the setup. Considering the location is not staffed with IT resources, this makes sense for manageability reasons.

I still get the benefit of ZFS volume management and caching, compression, performance, etc.

In a more controlled environment, I may instead set the controller in HBA mode and use raw disks for ZFS.

ewwhite
  • Placing ZFS on HW RAID negates bit rot protection. I'm having a very hard time understanding how this benefits any use case, or how knowledgeable support factors into the picture. You can still direct staff to replace individual disks and perform the final drive replacement yourself over SSH. The instructions you give to whoever is there are unchanged by using ZFS. – MrDrMcCoy Feb 12 '21 at 05:59
  • @MrDrMcCoy *Placing ZFS on HW RAID negates bit rot protection* **BOLLOCKS** Do you really think if bit-rot protection is important that it hasn't been built right into hard drives by now? How much bit-rot do you see on corporate Windows networks that don't use ZFS? *You can still direct staff to replace individual disks and perform the final drive replacement yourself over SSH.* That utterly ignores the fact that enterprise disk replacement is done by vendor techs without coordination with you, and the fact that hardware RAID disk replacement doesn't need you to SSH - new disk in, done. – Andrew Henle Feb 12 '21 at 12:54
  • (cont) Disk drives are no longer simple magnetic storage devices. They're effectively full systems on their own. I've been using ZFS since it came out on Solaris 10 almost 20 years ago. ZFS is great, but this "cult of ZFS" ignores reality. *or how knowledgeable support factors in to the picture* Seriously? You're attempting to characterize this answer as not "knowledgeable"? When you're remotely administering hundreds or thousands of servers with two or three admins no one has the time to "perform the final drive replacement yourself over SSH". Nevermind you're not paid to do that. – Andrew Henle Feb 12 '21 at 13:02
  • @AndrewHenle your understanding of bit-rot is partial at best. While I agree that "pure" bit-rot (ie: an undetected changed bit) is very rare, a failing cable/connector/DRAM cache *will* lead to data corruption. I myself tracked down an HDD corruption to a faulty power supply, which caused spurious HDD DRAM cache corruption *on a single drive of an MDRAID mirror*. XFS was trashed by the corrupted data. A mirrored ZFS array will be much harder to trash in this manner (side note: it happened again with ZFS and it survived without any drama). – shodanshok Feb 12 '21 at 16:19
  • @ewwhite Just for your information, `zed` natively supports SAS enclosure LEDs, and `ledctl` can be scripted for SATA disks. That said, HP equipment is really good, so I understand your use case. – shodanshok Feb 12 '21 at 16:21
  • @shodanshok And not using ECC RAM is playing with fire, and an even larger potential source of data corruption. Nothing is going to protect against writing data that gets corrupted in RAM. I sure hope all the ZFS zealots who rail against ZFS on hardware RAID because "ZFS protects from bit rot, but only if you don't use hardware RAID" are using ECC RAM... – Andrew Henle Feb 12 '21 at 16:34
  • @AndrewHenle you did not understand what I wrote. In my specific issue, the main RAM was 100% fine; no application/kernel error was ever recorded. The problem was related to the *HDD's own private DRAM cache*, which was corrupted due to a non-stable power supply rail. Data corruption is not related to magnetic bit-rot only. While enterprise equipment is more resilient to this kind of error, checksums remain an important safety net. But please do not trust me; rather, read the SAS T10 specs or the `dm-integrity` docs to understand why a data CRC/checksum can be so important. – shodanshok Feb 12 '21 at 16:59
  • @shodanshok I never claimed bit rot is the only source for data corruption. My point in fact was that there ***are*** other sources of data corruption and that one of the largest is RAM errors. My main point was that today's disk drives already provide protection against bit rot (as a perusal of [values disks provide for SMART monitoring](https://en.wikipedia.org/wiki/S.M.A.R.T.#Known_ATA_S.M.A.R.T._attributes) shows...) so the "you lose bit rot protection if you run ZFS on hardware RAID" is, umm, a misguided claim. At best. – Andrew Henle Feb 12 '21 at 17:19
  • @AndrewHenle but you *lose* protection against some classes of data corruption when using hardware RAID without T10-ready disks. Data corruption can hide in many different places - deciding which holes to close and which to ignore can be difficult, and I am not questioning @ewwhite's choice. But you should realize that the embedded HDD ECC protecting you from "pure" bit-rot does not make checksumming filesystems (or T10 or `dm-integrity`) useless. – shodanshok Feb 12 '21 at 17:25
  • @shodanshok I never said otherwise. I'm not really sure why you muddied things up with a hardware failure that ZFS (on a JBOD?) survived and XFS did not, especially without demonstrating that hardware RAID controllers would fail to detect such an error. The fact is that ZFS zealots who make claims like "Placing ZFS on HW RAID negates bit rot protection" are flat-out ***wrong***. Today's hard drives do provide bit rot protection. – Andrew Henle Feb 12 '21 at 17:34
  • @shodanshok Unfortunately, not every [production storage array](https://macdailynews.com/2018/12/13/apple-now-sells-lumaforge-jellyfish-workflow-servers-with-up-to-200tb-storage-costing-up-to-50000/) I support has SES-capable backplanes/enclosures (grrrr). – ewwhite Feb 12 '21 at 17:43
  • @AndrewHenle bit-rot is often used as a generic term for "unexpected data corruption". While you can argue about the terms, I invite you to see the greater picture: non-T10 DIF/DIX HW RAID *can* experience unexpected data corruption due to data being altered in a manner that is not discoverable from the HDD ECC's point of view (ie: corruption in the HDD DRAM cache). [*The very reason T10 DIF/DIX exists is to protect from these kinds of errors*](https://wiki.lustre.org/images/d/d1/LUG2018-T10PI_EndToEnd_Data_Integrity-Ihara.pdf). ZFS, btrfs or `dm-integrity` provide the same/similar protection via software. – shodanshok Feb 12 '21 at 19:12
  • @AndrewHenle and please note that I am not against the use of ZFS over HW RAID. I also use it in this manner in some setups, because at least ZFS will *detect* the issue/corruption. However, when it has no redundancy at the vdev level it will be unable to transparently fix the corrupted data. – shodanshok Feb 12 '21 at 19:19

If you use hardware RAID instead of ZFS RAID you will lose a couple of features. Let's imagine a simple two-disk mirror for the rest of this post.

  1. ZFS will not be aware that there are two disks and therefore two copies of every block. So if and when it detects a checksum error, all it can do is notify you that a file has a corrupt block; you would need to restore that file from backup to fix it. **Also**: you're hindering its ability to even detect corruption. In the scenario above, when ZFS requests a block there are actually two disks that each hold an identical copy, and the HW controller will return data from one of the two. ZFS has no control over which disk that is, nor any awareness that there is more than one, so it has no way of checking each disk individually. It could take multiple successive reads to finally hit the bad copy and detect the corruption at all, and at that point you still don't know which disk it is. (See the scrub example after this list.)

  2. Because ZFS is also the filesystem, it is aware of what the FS is actually using. So if I have a mirror that is 95% free space and I replace a drive, ZFS knows to copy only the 5% of actual data. HW RAID controllers are blind to the FS and have no way of distinguishing free space (or previously used but since freed space) from data, so HW RAID will blindly block-copy the entire contents of disk A to disk B.
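To make point 1 concrete, these are the standard commands involved (the pool name is a placeholder). On a ZFS mirror a scrub repairs bad blocks from the good copy; on top of HW RAID it can only report them:

```
# read every block in the pool and verify checksums
zpool scrub tank

# show per-device error counters and, with -v, the affected files
zpool status -v tank
```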


ZFS over HW RAID is fine, but only if you understand what this means. Specifically:

  • not having redundancy at the vdev level, it can detect but not fix data corruption (a partial mitigation is sketched after this list)
  • a faulty controller can totally trash your pool (a point only partially invalidated by the fact that a dying CPU/RAM/MB can have a similar effect)
  • you depend on the controller powerloss-protected writeback cache to be healthy (PERC H730 should use a flash-based powerloss-protected cache, or FBWB, so you should be safe here)
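As a hedged aside on that first point: the `copies` dataset property can restore some self-healing ability even on a single vdev, at the cost of writing each block more than once. The dataset name is a placeholder, and this does not protect against losing the whole RAID volume:

```
# store two copies of every block on this dataset so ZFS can repair
# localized corruption even without vdev-level redundancy
zfs set copies=2 tank/important
```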

Replying to your questions:

  • What are the advantages and disadvantages of HW RAID versus ZFS? Read here for the answer

  • What are the differences between HW RAID and ZFS? HW RAID is a block-level affair only, where the RAID controller follows some "simple" replication/distribution scheme to remap LBAs to physical addresses. ZFS is a complex CoW filesystem which, by its very nature, really likes as much cache as you can throw at it (to avoid read/modify/write cycles). In ZFS, the "RAID layer" is handled by the SPA (storage pool allocator).

  • Are HW RAID and ZFS complementary technologies or incompatible with each other? You can use ZFS on top of HW RAID, if your setup requires it. ZFS will work just fine, with the notes I already wrote above.

  • Since Proxmox VE is a Debian-based Linux distribution, does it make more sense to use the H730 for RAID10 with LVM versus setting the H730 in HBA mode and using ZFS? Having a performant H730 card, I would not use MDRAID. I have a similar server (DELL R720 with PERC H710p) with HW RAID + ZFS and it works very well. (For comparison, an HBA-mode ZFS layout is sketched below.)
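For comparison only, a pure-ZFS layout matching the plan in the question (six disks as striped mirrors, i.e. the RAID10 equivalent, plus two hot spares) would look roughly like this with the H730 in HBA mode; the pool name and device paths are placeholders:

```
# three mirrored pairs striped together (RAID10 equivalent) plus two hot spares
zpool create -o ashift=12 tank \
    mirror /dev/disk/by-id/scsi-DISK1 /dev/disk/by-id/scsi-DISK2 \
    mirror /dev/disk/by-id/scsi-DISK3 /dev/disk/by-id/scsi-DISK4 \
    mirror /dev/disk/by-id/scsi-DISK5 /dev/disk/by-id/scsi-DISK6 \
    spare  /dev/disk/by-id/scsi-DISK7 /dev/disk/by-id/scsi-DISK8

# a replacement disk inserted into the same physical slot is used automatically
zpool set autoreplace=on tank
```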

shodanshok
  1. What are the advantages and disadvantages of HW RAID versus ZFS?

Hardware RAID can sometimes yield better performance from a base config, but ZFS is far more powerful, scales better, and when properly tuned, it can yield better performance.

  2. What are the differences between HW RAID and ZFS?

ZFS offers many features that are completely unavailable in other kinds of RAID, such as snapshots, copy-on-write, send/receive, compression, deduplication, caching, bit rot protection, nested volumes and filesystems, and independence from a particular manufacturer's hardware RAID implementation. ZFS is also far more flexible and can meet multiple use cases at the same time.
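A couple of those features in command form, purely as an illustrative sketch (dataset, snapshot, and host names are placeholders):

```
# instantaneous snapshot of a dataset
zfs snapshot tank/vm-100@before-upgrade

# replicate it to another machine over SSH
zfs send tank/vm-100@before-upgrade | ssh backuphost zfs receive backup/vm-100

# later, send only the blocks changed since that snapshot
zfs send -i @before-upgrade tank/vm-100@nightly | ssh backuphost zfs receive backup/vm-100
```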

  3. Are HW RAID and ZFS complementary technologies or incompatible with each other?

They are completely different technologies. You can run ZFS on top of other RAIDs, but then you would lose bit rot protection.

  4. Since Proxmox VE is a Debian-based Linux distribution, does it make more sense to use the H730 for RAID10 with LVM versus setting the H730 in HBA mode and using ZFS?

LVM offers some of the features ZFS does, minus bit rot protection and the filesystem layer, and doesn't perform quite as well. ZFS is built into Proxmox for a reason.
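For reference, the LVM side of that choice would look roughly like this on top of the H730 RAID10 virtual disk (the device, volume group, and thin pool names are placeholders; the Proxmox installer sets up something similar by default):

```
# the H730 RAID10 virtual disk appears as a single block device, e.g. /dev/sdb
pvcreate /dev/sdb
vgcreate vmdata /dev/sdb

# thin pool for VM disks (thin provisioning and snapshots via LVM)
lvcreate -L 8T -T vmdata/vmstore
```

The ZFS alternative with the controller in HBA mode is sketched in the answers above.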

I would highly encourage you to read Ars Technica's introduction to ZFS, as it will explain this in far better detail.

MrDrMcCoy
  • *it can yield better performance* Horsehockey. ZFS is great at a lot of things. Management, replication, backup. A whole lot of things. But there's no free lunch, and those features come at the cost of performance. Performance-wise, ZFS is a slow pig. You can throw a lot of hardware at ZFS so that performance is acceptable, but ZFS is never going to be confused with fast. – Andrew Henle Feb 12 '21 at 13:06
  • The caching makes a difference... and certainly compression and more intelligent use of resources. So on average, my ZFS systems perform better than XFS because the combination of features reduces the reliance on pure disk I/O or throughput. – ewwhite Feb 12 '21 at 13:47
  • @ewwhite *The caching makes a difference...* Well, that's true for any cache system. But write enough TB to blow through that cache, and on similar hardware you'd get better throughput on that XFS system. And if your system has a sudden need for the memory used by the ZFS ARC, you can time freeing that memory from the ARC and your application getting access to it with a sundial. – Andrew Henle Feb 12 '21 at 14:27
  • XFS is designed with hardware RAID and especially battery backed up write caches in mind, but it needs to be configured appropriately, or it will be horribly slow -- especially if it isn't aware of the battery, it will refuse to do things that are unsafe without it (the original machines XFS was designed for came with a 2.5s buffer in the PSU to write disk caches on power failure, but trying that on a standard PC is a bad idea). – Simon Richter Feb 12 '21 at 14:28
  • Also, a pure software cache is a safety vs performance trade-off. I am fairly sure that I can get more throughput on ext4 than on ZFS if I [enable caches](https://manpages.debian.org/testing/eatmydata/eatmydata.1.en.html). – Simon Richter Feb 12 '21 at 15:46
  • @SimonRichter And even more if you disable write barriers... – Andrew Henle Feb 12 '21 at 16:38