We're considering building a ~16TB storage server. At the moment, we're considering both ZFS and XFS as the filesystem. What are the advantages and disadvantages? What do we have to look out for? Is there a third, better option?
-
Don't even compare them. ZFS is a modern enterprise-level file system, like JFS2 or WAFL. XFS was good 10 years ago, but today it's just a stone-age FS. – disserman Oct 23 '10 at 21:51
-
In some ways, you can't compare them: XFS is a filesystem; ZFS is a filesystem and so much more: it replaces the filesystem, the volume manager (like LVM), and RAID besides. However, JFS is no longer maintained, if memory serves, whereas XFS is active, maintained and robust. Either way - ZFS or XFS - you can't go wrong in my opinion. – Mei Dec 22 '11 at 02:42
-
I still think this question is relevant, so I'll write up our experience here: XFS is simple; you install it, you run it, it's quick, it works (HW RAID below). ZFS is safe and has compression, but it is a lot of work to tune it to run as fast as XFS. So it also depends on what you expect the server to do (cluster backend, user storage, archive, ...). – SvennD Oct 15 '16 at 09:36
-
There is also Hammer2 https://www.dragonflybsd.org/hammer/ – skan May 24 '17 at 18:56
12 Answers
ZFS will give you advantages beyond software RAID. The command structure is very thoughtfully laid out, and intuitive. It's also got compression, snapshots, cloning, filesystem send/receive, and cache devices (those fancy new SSD drives) to speed up indexing meta-data.
Compression:
# zfs set compression=on filesystem/home
It supports simple-to-create copy-on-write snapshots that can be live-mounted:
# zfs snapshot filesystem/home/user@tuesday
# cd filesystem/home/user/.zfs/snapshot/tuesday
Filesystem cloning:
# zfs clone filesystem/home/user@tuesday filesystem/home/user2
Filesystem send/receive:
# zfs send filesystem/home/user@tuesday | ssh otherserver "zfs receive -v filesystem/home/user"
Incremental send/receive (using an earlier snapshot, here @monday, as the base):
# zfs send -i filesystem/home/user@monday filesystem/home/user@tuesday | ssh otherserver "zfs receive -v filesystem/home/user"
Caching devices:
# zpool add filesystem cache ssddev
This is all just the tip of the iceberg; I would highly recommend getting your hands on an install of OpenSolaris and trying this out.
http://www.opensolaris.org/os/TryOpenSolaris/
Edit: This is very old. OpenSolaris has since been discontinued; the best way to use ZFS today is probably on Linux or FreeBSD.
Full disclosure: I used to be a Sun storage architect, but I haven't worked for them in over a year; I'm just excited about this product.
-
That link didn't work for me with www. Use `http://opensolaris.org/os/TryOpenSolaris/` – aggregate1166877 May 27 '16 at 02:58
-
I'd actually say the best bet for ZFS is still FreeBSD. It's been part of the system for quite a few years, so my guess is there's the least chance of nasty surprises. Though it's just my $0.02. – Fox May 29 '16 at 07:45
I've found XFS better suited to extremely large filesystems with possibly many large files. I've had a functioning 3.6TB XFS filesystem for over 2 years now with no problems. It definitely works better than ext3, etc. at that size (especially when dealing with many large files and lots of I/O).
What you get with ZFS is device pooling, striping and other advanced features built into the filesystem itself. I can't speak to specifics (I'll let others comment), but from what I can tell, you'd want to use Solaris to get the most benefit here. It's also unclear to me how much ZFS helps if you're already using hardware RAID (as I am).
-
The key feature of ZFS that you (usually) don't get elsewhere is block-level CRC, which is supposed to detect (and hopefully prevent) silent data corruption. Most filesystems assume that if a write completed successfully, then the data was indeed written to disk. That isn't always the case, especially if a sector is starting to go "marginal". ZFS detects this by checking the CRC against the resulting write. – Avery Payne May 05 '09 at 14:22
-
And yes, I do like XFS a lot. :) The only gotcha that you have to keep in mind is the propensity to zero out sectors that were "bad" during a journal recovery. In some (rare) cases, you can end up with some data loss... Found this paper with the Google search term "xfs zeros out sectors upon recovery" http://pages.cs.wisc.edu/~vshree/xfs.pdf – Avery Payne May 05 '09 at 14:25
-
One of the things I like about XFS is the `xfs_fsr` "defragmentation" program. – Cristian Ciupitu Jun 21 '09 at 19:28
-
The utility of ZFS block-level CRCs is questionable. Hard drives and SSDs use Hamming code ECC to correct single-bit errors and report two-bit errors. If the ECC can't transparently correct the physical read error, the data is lost anyway and a read failure will be reported to the OS. CRCs don't correct errors. This feature is pushed as a major benefit of ZFS but the truth is it's redundant and has no value. As for the XFS zero-after-power-fail bug, that was corrected a long time ago and isn't relevant today. – Jody Bruchon Dec 26 '16 at 16:10
-
@JodyLeeBruchon what you wrote is incorrect: while it is true that storage devices already have parity code attached to data, it does not mean they are capable of end-to-end data protection. To achieve this goal without a checksumming filesystem, you need a) a [SAS T10/DIF/DIX](https://access.redhat.com/solutions/41548) storage stack or b) use devicemapper [dm-integrity](https://www.kernel.org/doc/Documentation/device-mapper/dm-integrity.txt). – shodanshok Aug 08 '19 at 07:30
-
@shodanshok No, what I wrote is not incorrect. What you are saying is different from what I am saying. If you are going to "correct" me, at least read what I wrote and understand what it says first. – Jody Bruchon Aug 09 '19 at 12:56
-
@JodyLeeBruchon you are free to think what you want, but a CRC/ECC which lives near the original data is *not* the same as an end-to-end data checksum. If it were, both the DIF/DIX specs and the dm-integrity target would be wasted work. I recommend reading the [original CERN research paper](https://storagemojo.com/2007/09/19/cerns-data-corruption-research/) about data corruption, and how end-to-end data checksums can be used to avoid these problems. – shodanshok Aug 09 '19 at 14:12
-
@shodanshok Again, you have failed to read and comprehend what I said. You are reading what you want to read, not what I actually said. – Jody Bruchon Aug 10 '19 at 17:27
Using LVM snapshots and XFS on live filesystems is a recipe for disaster, especially when using very large filesystems.
I've been running exclusively on LVM2 and XFS for the last 6 years on my servers (even at home, since zfs-fuse is just plain too slow)...
However, I can no longer count the different failure modes I encountered when using snapshots. I've stopped using them altogether - it's just too dangerous.
The only exception I'll make now is my own personal mailserver/webserver backup, where I do overnight backups using an ephemeral snapshot that is always equal in size to the source fs and gets deleted right afterwards.
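Roughly, that nightly routine looks like this (the volume group, LV and path names here are made up, and the snapshot is sized to match the origin LV; note the -o nouuid, which XFS needs when mounting a snapshot while the origin is still mounted):
# lvcreate --snapshot --size 100G --name mail-snap /dev/vg0/mail
# mount -o ro,nouuid /dev/vg0/mail-snap /mnt/mail-snap
# rsync -a /mnt/mail-snap/ /backup/mail/
# umount /mnt/mail-snap
# lvremove -f /dev/vg0/mail-snap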
Most important aspects to keep in mind:
- if you have a big(ish) filesystem that has a snapshot, write performance is horribly degraded
- if you have a big(ish) filesystem that has a snapshot, boot time will be delayed by literally tens of minutes while the disk churns and churns during import of the volume group. No messages are displayed. This effect is especially horrid if root is on LVM2 (because waiting for the root device times out and the system doesn't boot)
- if you have a snapshot it is very easy to run out of space. Once you run out of space, the snapshot is corrupt and cannot be repaired (a quick way to keep an eye on this is sketched after this list)
- snapshots cannot be rolled back/merged at the moment (see http://kerneltrap.org/Linux/LVM_Snapshot_Merging). This means the only way to restore data from a snapshot is to actually copy (rsync?) it over. DANGER DANGER: you do not want to do this if the snapshot capacity is not at least the size of the source fs; if you don't, you'll soon hit the brick wall and end up with both the source fs and the snapshot corrupted. (I've been there!)
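For completeness, keeping an eye on a snapshot is straightforward (names made up again): snap_percent shows how full the copy-on-write area is, and lvextend can grow the snapshot before it hits 100% and gets invalidated:
# lvs -o lv_name,origin,snap_percent vg0
# lvextend -L +10G /dev/vg0/mail-snap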
-
As it happens, just today someone confirmed that the vg-with-snapshot unable-to-boot-Linux problem is still current: https://bugs.launchpad.net/lvm2/+bug/360237 – sehe Oct 09 '09 at 17:56
-
Revisiting this bug, they still think that the abysmal boot problems with snapshots are "normal behaviour for lvm": https://bugs.launchpad.net/lvm2/+bug/360237/comments/7 (on 2012-01-07) – sehe Mar 17 '12 at 00:39
-
It would be interesting to compare using ZFS in the same scenarios (snapshotting a live system running the same software). – saulius2 May 25 '22 at 11:17
-
@saulius2 I think there's no comparison, certainly not since ZFS on Linux matured and became the default, or gained root-filesystem support, in some Linux distros. By which I mean snapshots in LVM2 have simply been superseded by other volume management, as in btrfs/ZFS – sehe May 25 '22 at 11:47
-
I am not sure LVM2 has been superseded yet, but yes, I would like that and can see it coming (albeit slowly). What I am not sure about is whether ZFS snapshots give fewer failures than LVM snapshots. My guess is that they don't: live snapshots of any FS should be quite an unreliable thing. – saulius2 May 25 '22 at 16:41
-
@saulius2 Have you ever tried it? I've been using ZFS for 10 years, and I have automatic live snapshotting running in the background without even noticing. The point is that ZFS/btrfs do snapshotting at the dataset level, not just the block level. – sehe May 26 '22 at 01:49
A couple of additional things to think about.
If a drive dies in a hardware RAID array, all the blocks on the device have to be rebuilt, regardless of the filesystem that's on top of it, and even the ones that didn't hold any data. ZFS, on the other hand, is the volume manager and the filesystem, and it manages data redundancy and striping itself, so it can intelligently rebuild only the blocks that contained data. This results in faster rebuild times, except when the volume is close to 100% full.
ZFS has background scrubbing, which makes sure that your data stays consistent on disk and repairs any issues it finds before they result in data loss.
ZFS filesystems are always in a consistent state, so there is no need for fsck.
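For the curious, both the rebuild and the scrub mentioned above are single commands (pool and device names here are only examples).
Replace a failed disk; the resilver walks only the allocated blocks:
# zpool replace tank c1t2d0 c1t3d0
Start a scrub and check its progress (and any checksum errors it found):
# zpool scrub tank
# zpool status -v tank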
ZFS also offers more flexibility and features with its snapshots and clones compared to the snapshots offered by LVM.
Having run large storage pools for large-format video production on a Linux/LVM/XFS stack, my experience has been that it's easy to fall into micro-managing your storage. This can result in large amounts of unused allocated space and a lot of time and trouble spent managing your logical volumes. That may not be a big deal if you have a full-time storage administrator whose job is to micro-manage the storage, but I've found that ZFS's pooled storage approach removes these management issues.
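To give a flavour of the pooled approach, a whole pool plus a dataset takes only a handful of commands (disk, pool and dataset names below are invented). Datasets share the pool's free space, so there is nothing to pre-size or resize later; quotas and reservations are optional, per dataset:
# zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0
# zfs create tank/video
# zfs set quota=4T tank/video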
ZFS is absolutely amazing. I am using it on my home file server with 5 x 1 TB drives, and am also using it in production with almost 32 TB of hard drive space. It is fast, easy to use, and provides some of the best protection against data corruption.
We are using OpenSolaris on this server in particular because we wanted to have access to newer features and because it provided the new package management system and way of upgrading.
I don't think you should focus on performance. Is your data safe with XFS, ext4, etc.? No. Read this PhD thesis and these research papers:
XFS is not safe against data corruption: pages.cs.wisc.edu/~vshree/xfs.pdf
And neither are ext3, JFS, ReiserFS, etc.: zdnet.com/blog/storage/how-microsoft-puts-your-data-at-risk/169?p=169&tag=mantle_skin%3bcontent "I came across the fascinating PhD thesis of Vijayan Prabhakaran, IRON File Systems, which analyzes how five commodity journaling file systems - NTFS, ext3, ReiserFS, JFS and XFS - handle storage problems.
In a nutshell he found that all the file systems have
. . . failure policies that are often inconsistent, sometimes buggy, and generally inadequate in their ability to recover from partial disk failures. "
But ZFS successfully protects your data. Here is a research paper on this: zdnet.com/blog/storage/zfs-data-integrity-tested/811
Which OS are you planning on running? Or is that another part of the consideration? If you're running Solaris, XFS isn't even an option as far as I know. If you're not running Solaris, how are you planning on using ZFS? Support is limited on other platforms.
If you're talking about a Linux server, I'd personally stick with ext3, if only because it receives the most testing. zfs-fuse is still very young. Also, I once had trouble with XFS when a bug caused data corruption after a kernel update. The advantages of XFS over ext3 definitely didn't outweigh the costs involved in restoring the machine, which was located in a remote datacenter.
-
http://wiki.freebsd.org/ZFSKnownProblems I think your definition of mature might be different from mine :-) Maybe I'd consider it after 8.0 is released. – Kjetil Limkjær Apr 30 '09 at 12:27
-
ext3 with 16TB? No no no. Do NOT do it. You will cry. ZFS or XFS are the best filesystems out there in my opinion. Use ZFS if you can (don't run it on Linux). I say this with lots of experience on large volumes on Linux and Solaris over 5 years. – Thomas Jun 17 '09 at 06:13
-
There is also the option of using Nexenta: an Ubuntu-based distribution which uses the OpenSolaris kernel. It was created for (file) servers. – knweiss Jul 15 '09 at 11:02
-
FreeBSD 7.2 builds after 2009-06-01 have rendered most of the ZFSKnownProblems moot. If you are running the AMD64 version of the OS, it is now stable. In 8.0, FreeBSD has marked ZFS as stable enough for production. – Walter Sep 30 '09 at 15:43
Not a FS-oriented answer, sorry, but be aware that a number of disk controllers won't deal with >2TB LUNs/logical disks; this can limit the way you organise your storage quite a bit. I just wanted you to be aware of it so you can check your system end-to-end and ensure it'll deal with 16TB throughout.
Well guys, let's not forget the latest addition to ZFS: deduplication. And let's not forget on-the-fly iSCSI, NFS or SMB sharing. As others have already said: exports of ZFS file systems, snapshots, raidz (= RAID 5), block checksums, dynamic stripe width, cache management and many others. I vote for ZFS.
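All of these are just a property or a dataset away (dataset names here are made up; keep in mind that dedup wants plenty of RAM for its tables):
# zfs set dedup=on tank/backup
# zfs set sharenfs=on tank/export
# zfs set sharesmb=on tank/export
# zfs create -V 2T tank/lun0
The last command creates a zvol, which can then be exported as an iSCSI LUN (via the old shareiscsi property or COMSTAR, depending on the release).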
It depends what features you want. The two reasonable choices are XFS and ZFS, as you have said. The XFS code is pretty well tested; I first used it 8 years ago under IRIX.
It is possible to get snapshots from XFS (using LVM and xfs_freeze).
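The traditional sequence is roughly this (names invented; newer LVM versions freeze the filesystem for you when taking the snapshot, but being explicit doesn't hurt):
# xfs_freeze -f /srv/data
# lvcreate -s -L 20G -n data-snap /dev/vg0/data
# xfs_freeze -u /srv/data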
It is possible to have a separate log device, e.g. an SSD:
mkfs.xfs -l logdev=/dev/sdb1,size=10000b /dev/sda1
Large XFS filesystems traditionally need lots of memory to check (xfs_repair).
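If memory is tight, a reasonably recent xfs_repair can be told to cap its memory use, and -n does a report-only dry run (the device name is just an example):
# xfs_repair -n /dev/sda1
# xfs_repair -m 2048 /dev/sda1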
The issue with zeros turning up was a "security" feature, which I think disappeared a while ago.
Apart from what has already been mentioned, from a performance point of view XFS on MD-based RAID performs better than ZFS for streaming media. I've used the exact same hardware for half a decade with XFS and about the same amount of time with ZFS on my media server. On an Intel Atom 330 with XFS I never experience stutter; with ZFS, the same hardware cannot keep up on complex scenes and starts dropping frames.
Rather than building your own, an alternative is the Sun 7410, aka Toro. It comes bundled with some very useful software.