
I'm considering migrating from ext3 to ZFS for data storage on my Debian Linux host, using ZFS on Linux. One killer feature of ZFS that I really want is its data integrity guarantees. The ability to trivially grow storage as my storage needs increase is also something I'd look forward to.

However, I also run a few VMs on the same host. (Though normally, in my case only one VM is running on the host at any one time.)

Considering ZFS's data checksumming and copy-on-write behavior, together with the fact that the VM disk images are comparatively huge files (my main VM's disk image file currently sits at 31 GB), what are the performance implications inside the VM guest of such a migration? What steps can I take to reduce the possible negative performance impact?

I can live with weaker data integrity guarantees on the VM disk images if necessary (I don't do anything really critical inside any of the VMs) and can easily separate them from the rest of the filesystem, but it would be nice if I didn't have to (even selectively) turn off the very feature that most makes me want to migrate to a different file system.

The hardware is pretty beefy for a workstation-class system, but won't hold much of a candle to a high-end server (32 GB RAM with rarely >10 GB in use, 6-core 3.3 GHz CPU, currently 2.6 TB usable disk space according to df and a total of about 1.1 TB free; migrating to ZFS will likely add some more free space) and I'm not planning on running data deduplication (as turning on dedup just wouldn't add much in my situation). The plan is to start with a JBOD configuration (obviously with good backups) but I may move to a two-way mirror setup eventually if conditions warrant.

user
  • Also keep in mind that [ZFS performs better than traditional RAID5 in terms of IOPS](http://serverfault.com/questions/531319/is-calculating-iops-for-zfs-raidz-different-then-calculating-iops-for-raid5-ra). RAIDZ writes perform at the speed of a single disk because RAIDZ doesn't suffer from the I/O performance penalties that plague traditional RAID5/6. – Stefan Lasiewski Aug 20 '13 at 14:34
  • **Thanks to all who have answered** for your insights! I'll definitely be coming back to this question later. – user Aug 20 '13 at 20:03
  • Stefan's comment is .. well, it's just false. ZFS RAIDZ performance is _significantly_ worse from an IOPS perspective (which is what you usually have issues with in VMs) than traditional RAID5 arrays. Please do not assume an improvement in write performance by moving to ZFS; it's rarely the case. Read performance gains will depend on the RAM available to the ARC and on your working set size and delta. Usually with VMs, the ZFS ARC ends up helping overall read performance compared to alternatives. Writes usually suffer, even on mirrors, and ALWAYS with raidz. – Nex7 Aug 22 '13 at 02:29
  • @Nex7 How do writes perform with no RAID from ZFS at all, just a single storage device (which is, e.g., provided by some mdraid)? Does ZFS perform comparably to other file systems when no fancy RAID stuff is used? – Thorsten Schöning May 16 '17 at 12:55

4 Answers


Since ZFS works at the block level, the size of the files makes no difference. ZFS requires more memory and CPU, but it is not inherently significantly slower as a filesystem. You do need to be aware, though, that RAIDZ is not equivalent in speed to RAID5. RAID10 is fine where speed is a priority.

JamesRyan

ZFS on decent (i.e. beefy) hardware will likely be faster than other file systems. You will likely want to put the ZIL on a fast (e.g. SSD) device. The ZIL is essentially a place to stage writes (well, more like the journal in ext3/4). This lets the box acknowledge writes as written to disk before the actual spindles have the data.

You can also create an L2ARC on SSD as a read cache. This is fantastic in a VM environment, where booting several VMs at the same time can otherwise bring physical disks to their knees.
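For illustration, adding a log device (for the ZIL) and a cache device (for L2ARC) might look something like this; the pool name `tank` and the SSD partition paths are placeholders for whatever your setup actually uses:

```
# Dedicated log device (SLOG) for the ZIL: speeds up synchronous writes.
zpool add tank log /dev/disk/by-id/ata-SomeSSD-part1

# Cache device (L2ARC): extends the read cache beyond what fits in RAM.
zpool add tank cache /dev/disk/by-id/ata-SomeSSD-part2
```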

Drives go into VDEVs, and VDEVs go into zpools (please use entire disks rather than partitions). If this is a smaller system you may want a single zpool and (if you are not too concerned about data loss) a single VDEV. The VDEV is where you select the RAID level (although you can also mirror VDEVs if you've got enough disks). The slowest disk in a VDEV determines how fast the entire VDEV is.
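As a sketch, a hypothetical four-disk pool built from two mirror VDEVs (roughly RAID10) could be created like this; `tank` and the disk IDs are placeholders:

```
zpool create tank \
    mirror /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 \
    mirror /dev/disk/by-id/ata-DISK3 /dev/disk/by-id/ata-DISK4

# Show the resulting vdev layout and pool health.
zpool status tank
```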

ZFS is all about data integrity - the reason a lot of the traditional file system maintenance tools (like fsck) don't exist is that the problems they solve can't occur on a ZFS file system.
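The closest thing to a maintenance tool is a scrub, which walks the whole pool and verifies every block against its checksum (pool name `tank` assumed):

```
# Verify all data against its checksums; repairs are made from redundancy
# where the pool has any (mirror/RAIDZ/copies>1).
zpool scrub tank

# Check scrub progress and list any files with unrecoverable errors.
zpool status -v tank
```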

IMO the biggest drawback of ZFS is that if your pool approaches full (say 75%+) it gets VERY slow. Just don't go there.
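Keeping an eye on pool capacity is easy enough; `zpool list` shows allocated and free space plus a CAP percentage (pool name assumed):

```
zpool list tank
```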

TheFiddlerWins

31GB really isn't big at all...

Anyway, depending on the file system you are currently using, you may find ZFS is slightly slower but given your hardware specs it may be negligible.

Obviously ZFS will use a good chunk of RAM for caching, which may make your VMs seem 'snappier' in general use (when not doing heavy reading or writing). I'm not sure how ZFS is tuned on Linux, but you may need to limit its ARC, if possible, to stop it running away with all your RAM (seeing as you'll want a decent chunk left over for your host system and VMs).
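On ZFS on Linux, one common way to cap the ARC is a module option in `/etc/modprobe.d/zfs.conf`; the 8 GiB value below is just an example, pick whatever leaves enough RAM for the host and the VMs:

```
# /etc/modprobe.d/zfs.conf -- value is in bytes (8 GiB here); takes effect
# after the zfs module is reloaded or the host reboots.
options zfs zfs_arc_max=8589934592
```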

I would enable compression (advice these days is to turn it on unless you have a good reason not to). Remember this has to be done before putting data on the file system. Most people are surprised to find it's actually quicker with it on, as the compression algorithms will generally run faster than disk IO. I doubt it will cause much of a performance issue with your 6 core processor. I wasn't expecting VMs to compress much, but I managed to turn ~470GB of VM data into 304GB just with the default compression setting.
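Enabling it is a one-liner; the dataset name below is hypothetical, and `lz4` is the usual choice where your ZFS version supports it (plain `compression=on` falls back to the default algorithm):

```
zfs set compression=lz4 tank/vms

# After data has been written, see how well it compressed.
zfs get compressratio tank/vms
```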

Don't bother with dedup; it will just come back to haunt you later on, and you'll spend weeks shuffling data around trying to get rid of it.

If you do encounter performance problems then the obvious answer is to add an SSD as ZIL/L2ARC or even both. It's not ideal to use one device for both but it'll most likely still improve performance on a pool containing a small number of disks/vdevs.

To add: I would really try to start with a redundant configuration if possible (ideally mirrors), or convert to mirrors from a stripe as soon as possible. While ZFS will checksum all data and detect errors on the fly (or during a scrub), it won't be able to do anything about them (without using copies=2, which will double disk usage). You'll just be left with it telling you there are errors in files (probably your VM disk images) which you won't be able to do a lot about without deleting and re-creating those files.
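For reference, going from a single-disk vdev to a two-way mirror is a single `zpool attach` (ZFS then resilvers onto the new disk), and `copies=2` is the per-dataset fallback; the pool, dataset, and device names below are placeholders:

```
# Attach a second disk to an existing single-disk vdev to form a mirror.
zpool attach tank /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2

# Or store two copies of every block on one disk: doubles space usage for the
# dataset and does not protect against a whole-disk failure.
zfs set copies=2 tank/vms
```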

USD Matt
  • *"You'll just be left with it telling you there are errors in files ... which you won't be able to do a lot about"* That's a good opinion, and I appreciate it. That said, that's where my nightly backups come in. As it stands *nothing* stands in between me and silent data corruption, so even if ZFS simply refuses to let me read the file or a part of it until I restore it from the (known good) backup, that's a *huge* improvement in data integrity assurances. – user Aug 20 '13 at 19:53
  • As for file size, no, 31 GB isn't exactly objectively huge (though it's still ~1.2% of my total system storage capacity), but my worry was more along the lines of COW having the system copy *all* of that data back and forth continuously, [a misconception that JamesRyan quickly corrected](http://serverfault.com/a/532290/58408). – user Aug 20 '13 at 20:01

Depending on your use cases and VMs, I would consider the following: let the host operating system take care of the files you are storing on the ZFS volumes.

If possible, create just a LUN for every VM, containing only the operating system and the necessary binaries, and present the storage for individual data as shares via NFS, Samba, or iSCSI (or as zvols, as mentioned in the comments). That way ZFS is able to keep track of every file, with checksumming, access times, etc. Of course, if speed is not so important, you could also enable compression on some datastores. The benefit is removing the layer of another filesystem: if you created a LUN for a second virtual hard drive and put an NTFS filesystem on top of it, ZFS would have to handle a big binary blob without knowing anything about its contents or files, and therefore couldn't take advantage of the ZIL or ARC cache in the same way it can for plain files.
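A rough sketch of that layout, using hypothetical dataset names: a zvol as the VM's system disk, plus a plain dataset (shared out via NFS/Samba/iSCSI) for the data:

```
# Parent dataset to group per-VM storage (names are hypothetical).
zfs create tank/vm

# 32 GB zvol presented to the VM as a raw disk; on Linux it shows up under
# /dev/zvol/tank/vm/debian-root.
zfs create -V 32G tank/vm/debian-root

# Plain dataset for the VM's data, shared out so ZFS sees individual files.
zfs create -o compression=lz4 tank/vm/debian-data
```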

Regarding ACLs, ZFS is able to use ACLs via NFSv4 or Samba (if enabled). I have to admit that I use ZFS on FreeBSD, so I can't say exactly how to enable Samba ACLs mapping onto ZFS volumes, but I am sure this should not be a big deal.

Deduplication in combination with a read cache is a big advantage when it comes to saving some space and improving massive reads (a boot storm), as all VMs begin to read the same blocks.

The same goes for ZFS snapshots of the VMs and the datastores. You can create a simple shell script to freeze the VM, take a snapshot of the VM and the datastore, and continue working; or snapshot just the datastore alone, clone the VM, present it with the snapshot of the original one, and test some things.
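Snapshotting and cloning are both near-instant thanks to COW; a minimal sketch with placeholder dataset names:

```
# Snapshot the dataset (or zvol) backing the VM...
zfs snapshot tank/vm/debian-root@before-upgrade

# ...and clone it into a writable copy to test against.
zfs clone tank/vm/debian-root@before-upgrade tank/vm/debian-test
```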

The possibilities are endless with ZFS ;)

EDIT: Hopefully I have explained it a bit better now.

EDIT2: Personal opinion: consider using RAIDZ2 (RAID6), as it can withstand a double disk failure! Keeping a single spare disk around never hurts, but being able to survive two disk failures should give you enough time to react quickly. I just posted my script for monitoring the disk status here.
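As an illustration, a four-disk RAIDZ2 vdev (pool name and disk paths are placeholders) survives any two disks failing at the same time:

```
zpool create tank raidz2 \
    /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 \
    /dev/disk/by-id/ata-DISK3 /dev/disk/by-id/ata-DISK4
```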

Daywalker
  • I'm not sure I get it. Are you saying I should store the files that are used by the VMs as separate files on the ZFS file system, rather than as a disk image? What about such things as partitions, boot sectors, attributes which ZFS doesn't know about, Windows ACLs in a Linux context, ...? I'm either misunderstanding you, or you are answering something other than what I am asking. Can you please re-read the question and edit your answer to clarify how it addresses my storage performance concern? – user Aug 20 '13 at 14:20
  • Regarding snapshots: It may not be necessary to actually freeze the VM. ZFS uses copy-on-write (COW), which means that snapshots are instantaneous and will provide you with a complete disk image. Some admins use this for MySQL & PostgreSQL databases without freezing them (i.e. no downtime), although others do flush the tables first. If you do need to freeze the VM, taking the ZFS snapshot should only take a few seconds. – Stefan Lasiewski Aug 20 '13 at 14:30
  • Michael, I think Daywalker is referring to zvols, where you can create a file that acts like a block device. I'd use NFS, not individual zvols, for VMs (well, in this case it looks like it's all local, so just files in the file systems). Yeah, zvols can be cool, but they are an extra layer of complication. And ZFS snapshots are by definition consistent. That does not mean the VM's OS knows it needs to flush its data to disk, but you'll get file system consistency at the same level as if you lost power to the VM. – TheFiddlerWins Aug 20 '13 at 14:43
  • Dedup is very resource intensive. Using compression is not and (for VMs) will likely get you back a lot of space due to whitespace in the VM file systems. – TheFiddlerWins Aug 20 '13 at 14:51
  • @MichaelKjörling Just edited my post, hoping it is easier to understand now (also taking into account the comments from TheFiddlerWins and Stefan Lasiewski). – Daywalker Aug 20 '13 at 14:59
  • @StefanLasiewski I was thinking of something like hibernation or a memory snapshot so that the VM will be consistent. But of course a small script for the application on the VM would be enough, to stop the service or flush the database. BUT(!) ZFS will have NO idea of the files being written inside the LUN with any file system. It will complete ITS OWN write action and take a snapshot. If the VM hasn't flushed its FS cache, some operations will remain unwritten to the physical disk and may result in an inconsistent file inside the VM. For this reason I mentioned dedicated datastores for app files. – Daywalker Aug 20 '13 at 18:19
  • @Daywalker is right. The contents of the application should be committed to disk before the snapshot is taken. This is why MySQL admins issue something like "`FLUSH TABLES WITH READ LOCK`". The VMs should be committed to disk before the snapshot takes place. However, ZFS snapshots are usually fast and efficient (thanks to the magic of COW), unlike UFS & LVM snapshots. – Stefan Lasiewski Aug 20 '13 at 18:33