36

If one happens to have some server-grade hardware at one's disposal, is it ever advisable to run ZFS on top of a hardware-based RAID1 or some such? Should one turn off the hardware-based RAID, and run ZFS on a mirror or a raidz zpool instead?

With the hardware RAID functionality turned off, are hardware-RAID-based SATA2 and SAS controllers more or less likely to hide read and write errors than non-hardware-RAID controllers would?

In terms of non-customisable servers, if one has a situation where a hardware RAID controller is effectively cost-neutral (or even lowers the cost of the pre-built server offering, since its presence improves the likelihood of the hosting company providing complimentary IPMI access), should it be avoided at all? Or should it be sought after?

cnst
  • 12,948
  • 7
  • 51
  • 75
  • 2
    possible duplicate of [ZFS on top of Hardware Mirroring, or just mirror in ZFS?](http://serverfault.com/questions/189414/zfs-on-top-of-hardware-mirroring-or-just-mirror-in-zfs) – Shane Madden Oct 10 '13 at 21:13
  • 2
    @ShaneMadden, the questions are similar; however, my question already comes from the perspective of hardware RAID being bad in terms of ZFS, and I'm asking just how bad it is. Also, consider that the accepted answer to your linked question doesn't address my question at all; my question is more like a follow-up to the question you've linked. – cnst Oct 10 '13 at 21:23
  • "ZFS on top of Hardware Mirroring, or just mirror in ZFS?" and this question are two different topics. That other topic is more narrow in scope then this topic. – Stefan Lasiewski Dec 11 '13 at 23:40
  • @ewwhite, didn't you ask this already? – cnst Aug 21 '15 at 22:44
  • @cnst Well, there's no marked answer, and people keep **downvoting** my answer. So it would be nice for there to be some closure to the question posed. (_it's the responsible thing to do_) – ewwhite Aug 21 '15 at 22:51
  • @ewwhite, well, I'm sorry to hear that; I think your answer provides a good perspective, and I for one surely didn't downvote it! Even though I can see where those people that do are coming from... – cnst Aug 21 '15 at 23:04

7 Answers

21

The idea with ZFS is to let it know as much as possible about how the disks are behaving. Then, from worst to best:

  • Hardware RAID (ZFS has absolutely no clue about the real hardware),
  • JBOD mode (the issue being more about any potential expander: less bandwidth),
  • HBA mode, which is the ideal (ZFS knows everything about the disks).

As ZFS is quite paranoid about hardware, the less hiding there is, the better it can cope with any hardware issues. And as pointed out by Sammitch, a ZFS pool sitting on top of a RAID controller configuration may be very difficult to restore or reconfigure when that controller fails.

Regarding the issue of standardized hardware that ships with a hardware RAID controller, just be careful that the controller has a real pass-through or JBOD mode.
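
As a quick sanity check (just a sketch, assuming a Linux host with smartmontools installed; the device name is a placeholder), you can verify whether the OS really sees raw disks rather than controller-made volumes:

# RAID logical volumes usually report the controller's product string
# (e.g. "LOGICAL VOLUME") instead of the real drive model and serial.
lsblk -o NAME,MODEL,SERIAL,SIZE

# If SMART data is readable without a vendor pass-through option such as
# "-d cciss,N" or "-d megaraid,N", the disk is most likely truly passed through.
smartctl -a /dev/sda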

Tmanok
  • 247
  • 1
  • 11
Ouki
  • 1,367
  • 1
  • 11
  • 16
  • 12
    It's also worth noting that if you're using HW RAID and your controller dies (happens more than you'd think), and you can't get a replacement that's either identical or fully compatible, you're hooped. On the other hand, if you gave the raw disks to ZFS, you can plug those disks back into any controller on any machine and ZFS can reconstruct the array and carry on like nothing happened. – Sammitch Oct 10 '13 at 22:45
  • 2
    High-end servers typically have onboard RAID controllers. E.g. I've never had to replace a controller on an HP or Dell system. – ewwhite Oct 10 '13 at 23:31
  • 4
    This answer does not answer anything. It just expresses the biased opinion that the supplier of the server hardware and the ZFS programmers have done a better job than the supplier of the RAID controller and the programmer of the RAID firmware. The FreeNAS community is full of guys who killed their zpools with malfunctioning server memory or inappropriate power supplies. The chance that something big fails is higher than something small. – ceving Oct 21 '15 at 10:29
  • @Sammitch Thank you for this great advice. I am sticking to ZFS now. – Akito Feb 06 '20 at 19:20
16

Q. If one happens to have some server-grade hardware at one's disposal, is it ever advisable to run ZFS on top of a hardware-based RAID1 or some such?

A. It is strongly preferable to run ZFS straight to disk, and not make use of any form of RAID in between. Whether or not a system that effectively requires you to make use of the RAID card precludes the use of ZFS has more to do with the OTHER benefits of ZFS than it does with data resiliency. Flat out, if there's an underlying RAID card responsible for providing a single LUN to ZFS, ZFS is not going to improve data resiliency. If your only reason for going with ZFS in the first place was data resiliency improvement, then you just lost all reason for using it. However, ZFS also provides ARC/L2ARC, compression, snapshots, clones, and various other improvements that you might also want, and in that case, perhaps it is still your filesystem of choice.
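
For instance, those other features work just as well on a pool backed by a single RAID LUN; a minimal sketch (the dataset and snapshot names are placeholders):

# Inline compression on a dataset.
zfs set compression=lz4 tank/data

# Point-in-time snapshot, and a writable clone of it.
zfs snapshot tank/data@before-upgrade
zfs clone tank/data@before-upgrade tank/data-test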

Q. Should one turn off the hardware-based RAID, and run ZFS on a mirror or a raidz zpool instead?

A. Yes, if at all possible. Some RAID cards allow a pass-through mode. If yours does, that is the preferable thing to do.
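
Where pass-through is available, building the pool straight from the raw disks might look roughly like this (a sketch; the /dev/disk/by-id names are placeholders, and by-id paths keep the pool portable across controllers):

# A two-way mirror from two raw disks...
zpool create tank mirror \
    /dev/disk/by-id/ata-DISK_SERIAL_1 \
    /dev/disk/by-id/ata-DISK_SERIAL_2

# ...or, alternatively, a raidz1 vdev across three raw disks.
zpool create tank raidz1 \
    /dev/disk/by-id/ata-DISK_SERIAL_1 \
    /dev/disk/by-id/ata-DISK_SERIAL_2 \
    /dev/disk/by-id/ata-DISK_SERIAL_3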

Q. With the hardware RAID functionality turned off, are hardware-RAID-based SATA2 and SAS controllers more or less likely to hide read and write errors than non-hardware-RAID controllers would?

A. This is entirely dependent on the RAID card in question. You'll have to pore over the manual or contact the manufacturer/vendor of the RAID card to find out. Some very much do, yes, especially if 'turning off' the RAID functionality doesn't actually completely turn it off.
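
Whatever the controller reports or hides, ZFS's own checksums will still flag corruption in the data it can see; a periodic scrub makes that visible (a sketch, assuming a pool named tank):

# Read every allocated block and verify it against its checksum.
zpool scrub tank

# Any READ/WRITE/CKSUM errors and affected files are listed here.
zpool status -v tank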

Q. In terms of non-customisable servers, if one has a situation where a hardware RAID controller is effectively cost-neutral (or even lowers the cost of the pre-built server offering, since its presence improves the likelihood of the hosting company providing complimentary IPMI access), should it be avoided at all? Or should it be sought after?

A. This is much the same question as your first one. Again, if your only desire to use ZFS is an improvement in data resiliency, and your chosen hardware platform requires a RAID card to provide a single LUN to ZFS (or multiple LUNs that you have ZFS stripe across), then you're doing nothing to improve data resiliency and thus your choice of ZFS may not be appropriate. If, however, you find any of the other ZFS features useful, it may still be.

I do want to add an additional concern: the above answers rely on the idea that the use of a hardware RAID card underneath ZFS does nothing to harm ZFS beyond removing its ability to improve data resiliency. The truth is that's more of a gray area. There are various tunables and assumptions within ZFS that don't necessarily operate as well when handed multi-disk LUNs instead of raw disks. Most of this can be negated with proper tuning, but out of the box, ZFS won't be as efficient on top of large RAID LUNs as it would have been on top of individual spindles.
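
As one concrete example of such a tunable, the pool's sector-size assumption is worth pinning down explicitly when the "disk" is really a RAID LUN (a sketch; ashift=12 assumes 4K-sector drives behind the LUN, and the device name is a placeholder):

# Force 4 KiB alignment at creation time; ashift cannot be changed afterwards.
zpool create -o ashift=12 tank /dev/sdc

# Confirm what the pool actually recorded.
zdb -C tank | grep ashift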

Further, there's some evidence to suggest that the very different manner in which ZFS talks to LUNs, as opposed to more traditional filesystems, often invokes code paths in the RAID controller and workloads that it is not as used to, which can lead to oddities. Most notably, you'll probably be doing yourself a favor by disabling the ZIL functionality entirely on any pool you place on top of a single LUN if you're not also providing a separate log device, though of course I'd highly recommend you DO provide the pool a separate raw log device (that isn't a LUN from the RAID card, if at all possible).
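
A sketch of both options (the device name is a placeholder; sync=disabled sacrifices synchronous-write guarantees, so treat it as the fallback described above):

# Preferred: give the pool a dedicated raw log (SLOG) device.
zpool add tank log /dev/disk/by-id/nvme-FAST_SSD_SERIAL

# Fallback: stop honoring synchronous writes on this pool, which effectively
# bypasses the ZIL (at the cost of a data-loss window on power failure).
zfs set sync=disabled tank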

Nex7
  • 1,925
  • 11
  • 14
10

I run ZFS on top of HP ProLiant Smart Array RAID configurations fairly often.

Why?

  • Because I like ZFS for data partitions, not boot partitions.
  • Because booting Linux from ZFS probably isn't foolproof enough for me right now.
  • Because HP RAID controllers don't allow raw device passthrough. Configuring multiple RAID 0 volumes is not the same as having raw disks.
  • Because server backplanes aren't typically flexible enough to dedicate drive bays to a specific controller or split duties between two controllers. These days you see 8 and 16-bay setups most often. Not always enough to segment the way things should be.
  • But I still like the volume management capabilities of ZFS. The zpool allows me to carve things up dynamically and make the most use of the available disk space.
  • Compression, ARC and L2ARC are killer features!
  • A properly-engineered ZFS setup atop hardware RAID still gives good warning and failure alerting, and outperforms the hardware-only solution.

An example:

RAID controller configuration.

[root@Hapco ~]# hpacucli ctrl all show config

Smart Array P410i in Slot 0 (Embedded)    (sn: 50014380233859A0)

   array B (Solid State SATA, Unused Space: 250016  MB)
      logicaldrive 3 (325.0 GB, RAID 1+0, OK)

      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, Solid State SATA, 240.0 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, Solid State SATA, 240.0 GB, OK)
      physicaldrive 2I:1:7 (port 2I:box 1:bay 7, Solid State SATA, 240.0 GB, OK)
      physicaldrive 2I:1:8 (port 2I:box 1:bay 8, Solid State SATA, 240.0 GB, OK)

block device listing

[root@Hapco ~]# fdisk  -l /dev/sdc

Disk /dev/sdc: 349.0 GB, 348967140864 bytes
256 heads, 63 sectors/track, 42260 cylinders
Units = cylinders of 16128 * 512 = 8257536 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1       42261   340788223   ee  GPT

zpool configuration

[root@Hapco ~]# zpool  list
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
vol1   324G  84.8G   239G    26%  1.00x  ONLINE  -

zpool detail

  pool: vol1
 state: ONLINE
  scan: scrub repaired 0 in 0h4m with 0 errors on Sun May 19 08:47:46 2013
config:

        NAME                                      STATE     READ WRITE CKSUM
        vol1                                      ONLINE       0     0     0
          wwn-0x600508b1001cc25fb5d48e3e7c918950  ONLINE       0     0     0

zfs filesystem listing

[root@Hapco ~]# zfs list
NAME             USED  AVAIL  REFER  MOUNTPOINT
vol1            84.8G   234G    30K  /vol1
vol1/pprovol    84.5G   234G  84.5G  -
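
For context, a setup like the one above could have been built with something along these lines (a hypothetical reconstruction, not the exact commands used; the zvol size and compression property are assumptions):

# Pool on the single Smart Array logical drive, with compression enabled.
zpool create vol1 /dev/disk/by-id/wwn-0x600508b1001cc25fb5d48e3e7c918950
zfs set compression=lz4 vol1

# pprovol looks like a zvol (block device), e.g. for iSCSI or a VM disk.
zfs create -V 100G vol1/pprovol
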
ewwhite
  • 194,921
  • 91
  • 434
  • 799
  • So, in regards to the closed question that you've linked to, is it to say that if I want to use ZFS, I'd better avoid, for example, Dell PERC H200 and HP P410? Do they still not have a way to disable the hardware raid mode, be that RAID0 or RAID1? – cnst Oct 10 '13 at 22:06
  • So, it seems like http://www.dell.com/learn/us/en/04/campaigns/dell-raid-controllers does claim that H200 "Supports non-RAID", although http://h18004.www1.hp.com/products/servers/proliantstorage/arraycontrollers/smartarrayp410/ is not entirely clear on whether the raid functionality of P410 can or cannot be turned off. – cnst Oct 10 '13 at 22:12
  • @cnst You cannot disable the RAID functionality of an HP Smart Array P410. – ewwhite Oct 11 '13 at 12:05
  • Is this still correct? Are you saying there is no danger in running ZFS on hardware RAID? – sparse Nov 24 '19 at 10:12
  • 1
    Correct. It’s not dangerous. – ewwhite Nov 24 '19 at 14:21
  • Anyone doing this with an H730? I understand RAID mode can be configured as RAID and Non-RAID, where in the second option the drives are passed through directly to the OS for management. I understand the only drawback is that SMART data relies on firmware drivers being available from Dell for your drive. Will the H730 do anything weird like hidden error correction that might break ZFS when running in RAID mode but configured as Non-RAID devices? – TJ Zimmerman Apr 20 '20 at 01:03
  • I am seeing some crashes in ZoL when using a PERC H740P Mini (embedded) with each disk as a single-disk RAID 0, and I am highly suspicious that they are related to using ZoL with HW RAID. – sed_and_done Mar 21 '21 at 15:14
  • @sed_and_done Please avoid using multiple RAID0 arrays from a RAID controller to serve ZFS. Either go with a single LUN hardware RAID or an HBA. – ewwhite Mar 21 '21 at 23:57
6

Typically you should never run ZFS on top of disks configured in a RAID array. Note that ZFS does not have to run in RAID mode; you can just use individual disks. However, the vast majority of people run ZFS for the RAID portion of it. You could just run your disks in striped mode, but that is a poor use of ZFS. Like other posters have said, ZFS wants to know a lot about the hardware, so it should only be connected to a RAID card that can be set to JBOD mode, or preferably connected to an HBA. Jump onto the IRC Freenode channel #openindiana; any of the ZFS experts in the channel will tell you the same thing. Ask your hosting provider for JBOD mode if they will not give you an HBA.
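
On controllers that do support it, enabling JBOD is usually a one-time setting; on the Broadcom/LSI MegaRAID family, for example, it might look roughly like this (a sketch only; the exact utility and syntax vary by vendor and firmware, so check your controller's documentation):

# Check whether the firmware supports/enables JBOD at all.
storcli /c0 show all | grep -i jbod

# Enable JBOD mode so unconfigured disks are passed straight to the OS.
storcli /c0 set jbod=on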

chris
  • 61
  • 1
  • 1
  • 1
    Yeah, I agree. But it's also a matter of what's available in stock with the configuration that fits the bill and the spec. If a server has a great CPU, lots of ECC RAM, great bandwidth, and plenty of it, but has to come with a hardware-based RAID, it may not be cost-effective to seek alternatives, which may be several times more expensive, due to being in a different category or so, or missing some of the enterprise features like the ECC RAM etc. – cnst Dec 12 '13 at 03:14
3

Everybody says that ZFS on top of RAID is a bad idea, without even providing a link. But the developer of ZFS, Sun Microsystems, even recommends running ZFS on top of HW RAID, as well as on ZFS mirrored pools, for Oracle databases.

The main argument against HW RAID is that it can't detect bit rot like a ZFS mirror can. But that's wrong: there is T10 PI for that. You can use T10 PI capable controllers (at least all the LSI controllers that I have used are), and the majority of enterprise disks are T10 PI capable. So if it is appropriate for you, you can build a T10 PI capable array, create a ZFS pool without redundancy on top of it, and just make sure you follow the guidelines for your use case in the article. Though it is written for Solaris, IMHO it is also suitable for other OSes.

The benefit for me is that replacing a disk in the HW controller is really easy (especially in my case, because I don't use the whole disk for the zpool, for performance reasons). It requires NO intervention at all and can be done by the client's staff.

The downside is that you have to make sure that the disks you buy are actually formatted to support T10 PI, because some of them, though capable of T10 PI, are sold formatted as regular disks. You can format them yourself, but it's not very straightforward and it is potentially dangerous if you interrupt the process.
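
A sketch of how one might check, and if needed reformat, the protection information with sg3_utils (the device name is a placeholder, flag names differ between sg3_utils versions, and reformatting destroys all data and, as noted above, must not be interrupted):

# "prot_en=1" plus a non-zero "p_type" means protection information is active.
sg_readcap --long /dev/sdX

# Low-level format with Type 2 protection information (DESTROYS ALL DATA).
sg_format --format --fmtpinfo=2 /dev/sdX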

Alek_A
  • 298
  • 2
  • 8
  • 4
    While I upvoted you for the reference to T10-ready disks and controllers, please note that the *very same* Oracle page you linked tells the following: **"... Consider using JBOD-mode for storage arrays rather than hardware RAID so that ZFS can manage the storage and the redundancy..."** So no, Oracle docs do not suggest or recommend ZFS over HW RAID; rather, they explain how to use HW RAID with ZFS without too much pain. – shodanshok Feb 13 '21 at 20:44
  • 1
    I disagree. They actually do. The phrase says that you should **consider** using it for **storage arrays** in general. Why? Because if you are not confident in your HW RAID redundancy, i.e. if it can't handle silent data corruption (bit rot), it's best to use JBOD. But the next phrase states **"If you are confident in the redundancy of your hardware RAID solution, then consider using ZFS without ZFS redundancy with your hardware RAID array"**. Please read carefully. The Databases section states that it is recommended, as well as a mirrored pool, **for Oracle databases**; IMHO for databases in general. – Alek_A Feb 14 '21 at 03:04
  • 1
    Just below: "**Using ZFS redundancy has many benefits – For production environments, configure ZFS so that it can repair data inconsistencies**...If you are confident in the redundancy of your hardware RAID solution, then **consider** using ZFS without ZFS redundancy with your hardware RAID array". I can't see any recommendation to use HW RAID; rather, the docs explain how to use it without causing too much trouble. Don't get me wrong: sometimes I also use ZFS on HW RAID, but I would not describe it as a "recommended setup". The *mirrored pool* note is to use ZFS mirror rather than ZRAID. – shodanshok Feb 14 '21 at 09:00
  • 1
    People tend to see what they want to see) Sorry, I suppose further conversation is useless. OK, just let people make their own conclusions about the article. – Alek_A Feb 14 '21 at 10:42
2

In short: using RAID below ZFS simply kills the idea of using ZFS. Why? Because ZFS is designed to work on raw disks, not RAIDs.

poige
  • 9,171
  • 2
  • 24
  • 50
  • 4
    Not necessarily. What if I care more about the volume management flexibility than the optimization around having raw access to physical devices? ZFS works quite well for my use case. – ewwhite Dec 12 '13 at 01:53
  • 4
    @ewwhite, well, someone can walk a bicycle alongside, saying that he likes to walk and loves bicycles in general, but the truth is bicycles are made to be ridden. ) – poige Dec 12 '13 at 01:58
1

For all of you... ZFS over any RAID is a total PAIN and is done only by MAD people!... like using ZFS with non-ECC memory.

Some examples will make it clearer:

  • ZFS over RAID1: a bit changes on one disk while it is powered off... for all you know, ZFS will or will not see the damage depending on which disk is read (the RAID controller did not see that bit change and thinks both disks are OK)... and if the failure hits the VDEV metadata, the whole zpool loses all its data forever.
  • ZFS over RAID0: a bit changes on one disk while it is powered off (again, the RAID controller does not see that bit change and thinks the disks are OK)... ZFS will see that damage, but if the failure hits the VDEV metadata, the whole zpool loses all its data forever.

Where ZFS is good is in detecting bits that changed while the disk was without power (RAID controllers cannot do that), and also when something changes without having been asked to, etc.

It is the same problem as when a bit in a RAM module spontaneously changes without being asked to... if the memory is ECC, the memory corrects itself; if not, that data has changed, so that modified data will be sent to the disks; pray that the change does not land in the VDEV metadata, because if the failure is in the VDEV part... the whole zpool loses all its data forever.

That is a weakness of ZFS... a VDEV failure implies that all data is lost forever.

Hardware RAID and software RAID cannot detect spontaneous bit changes: they do not have checksums. It is worst on RAID1 levels (mirrors): they do not read all copies and compare them, they suppose all copies will ALWAYS have the same data, ALWAYS (I say it loudly). RAID supposes the data has not been changed by anything else... but disks (like memory) are prone to spontaneous bit changes.

Never ever use ZFS on non-ECC RAM, and never ever use ZFS on RAIDed disks; let ZFS see all the disks, do not add a layer that can ruin your VDEV and pool.

How to simulate such a failure: power off the PC, take out one disk of that RAID1 and alter only one bit... reconnect and see how the RAID controller cannot know that anything has changed... ZFS can, because all reads are tested against the checksum, and if it does not match, ZFS reads from another copy... RAID never reads again, because if it can read at all it thinks the data is OK (except on hard read failures)... RAID only tries to read from another disk if where it reads says "hey, I cannot read from there, hardware fail"... ZFS reads from another disk if the checksum does not match, as well as when where it reads says "hey, I cannot read from there, hardware fail".
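
A safer way to reproduce the idea without pulling real disks is a throwaway pool built on files (a sketch for a scratch/test machine only; the paths, sizes and offsets are arbitrary):

# Build a small mirror out of two file-backed vdevs and put some data in it.
truncate -s 256M /tmp/d1 /tmp/d2
zpool create testpool mirror /tmp/d1 /tmp/d2
dd if=/dev/urandom of=/testpool/data bs=1M count=64

# Flip bits in the middle of ONE mirror half, behind ZFS's back.
zpool export testpool
dd if=/dev/urandom of=/tmp/d1 bs=1M count=32 seek=16 conv=notrunc
zpool import -d /tmp testpool

# The scrub repairs the damage from the intact half, and zpool status shows
# non-zero CKSUM counters on /tmp/d1; a plain RAID1 mirror has no way to
# tell which half was the good one.
zpool scrub testpool
zpool status -v testpool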

I hope I have made it very clear... ZFS over any level of RAID is a total pain and a total risk to your data! As is ZFS on non-ECC memory.

But what no one says (except me) is:

  • Do not use disks with an internal cache (not only SSHDs, but also some that have 8 MiB to 32 MiB of cache, etc.)... some of them use non-ECC memory for that cache.
  • Do not use SATA NCQ (a way to queue writes), because it can ruin ZFS on power loss.

So what disks to use?

  • Any disk with an internal battery that ensures the whole queue will be written to the disk on power failure, and that uses ECC memory inside it (sorry, there are very few with all of that and they are expensive).

But, hey, most people do not know all of this and have never had a problem... I say to them: wow, how lucky you are, buy some lottery tickets before the luck goes away.

The risks are there... such coincidences of failures may occur... so the better answer is:

  • Try not to put any layer between ZFS and where the data is really stored (RAM, RAID, NCQ, internal disk cache, etc.)... as far as you can afford to.

What do I personally do?

  • I actually put in some more layers... I use each 2.5" SATA III 7200 rpm disk in a USB 3.1 Gen 2 Type-C enclosure; I connect some enclosures to a USB 3.1 Gen 2 Type-A hub that I connect to the PC, and others to another hub that I connect to another root port on the PC, etc.
  • For the system I use the internal SATA connectors with ZFS at a RAID0-like level, because I use an immutable (LiveCD-like) Linux system, with identical content on the internal disks on every boot... and I have a clone image of the system that I can restore (the system is less than 1 GiB). I also use the trick of keeping the system contained in a file and using a RAM-mapped drive that I clone it onto at boot, so after boot the whole system runs in RAM; by putting that file on a DVD I can also boot the same way, so in case the internal disks fail, I just boot with the DVD and the system is online again... a similar trick to SystemRescueCD, but a little more complex, because the ISO file can be on the internal ZFS or just be the real DVD, and I do not want two different versions.

I hope I could shed a little light on ZFS versus RAID; it is really a pain when things go wrong!

Claudio
  • 19
  • 1
  • 4
    So you're saying that ZFS is so unreliable that if a single bit changes you can lose the whole filesystem? How does SATA NCQ cause data loss when the drive still notifies the host only when the sectors have been written successfully (albeit in perhaps a different order)? – Malvineous Sep 24 '18 at 22:26
  • 2
    If this answer was correct, I would've lost already several PetaByte of data... And others that I know, too... – Akito Feb 06 '20 at 19:29
  • @Akito why is it not correct? – TJ Zimmerman Apr 20 '20 at 01:04
  • 1
    You probably do not know about T10 PI capable disks (which most enterprise disks are). Just create a T10 PI capable HW RAID array and you are protected from bit rot. – Alek_A Feb 13 '21 at 13:11
  • 1
    This answer looks as if it came directly from Google Translate. Really awful. – Lethargos Jun 11 '22 at 11:25