I have been doing lots of I/O testing on a ZFS system I will eventually use to serve virtual machines. I thought I would try adding SSDs as cache devices to see how much faster I can get the read speed. I also have 24GB of RAM in the machine that acts as ARC. vol0 is 6.4TB and the cache disks are 60GB SSDs. The zvol is as follows:

pool: vol0
 state: ONLINE
 scrub: none requested

        NAME                     STATE     READ WRITE CKSUM
        vol0                     ONLINE       0     0     0
          c1t8d0                 ONLINE       0     0     0
          c3t5001517958D80533d0  ONLINE       0     0     0
          c3t5001517959092566d0  ONLINE       0     0     0

The issue is I'm not seeing any difference with the SSDs installed. I've tried bonnie++ benchmarks and some simple dd commands to write a file and then read it back. I have run the benchmarks both before and after adding the SSDs.

I've ensured the file sizes are at least double my RAM so there is no way it can all get cached locally.

Am I missing something here? When will I see the benefit of having all that cache? Is it simply that I won't under these circumstances? Are the benchmark programs poor at testing the effect of cache because of the way they write and read (and what they write)?

  • Assuming that you're testing your production configuration here, I have a few things to point out. With ZFS you don't really want a single-device zpool. It's not that it won't work, but you are losing out on some of the data protection that ZFS offers. In this configuration it will only be able to detect CRC errors, not correct them. It also limits the scrubbing feature to just identifying problems rather than fixing them. ZFS mirrors and RAIDZ1/2 configurations also have advantages over hardware RAID solutions, such as resilvering only the used space and, with RAIDZ1/2, no write hole. – 3dinfluence Nov 19 '09 at 20:31
  • Is this for serving via NFS or iSCSI? What are the bonnie++ results like so far? – ewwhite Nov 19 '09 at 20:34
  • I should add that you can get some protection from CRC errors by using this command: "zfs set copies=2 vol0". This will cut your usable space in half and double the amount of IO involved in writes, so it isn't always an ideal solution. For more info check out http://blogs.sun.com/relling/entry/zfs_copies_and_data_protection – 3dinfluence Nov 19 '09 at 20:42
  • Seeing one zvol in my output is a bit deceiving (although technically true). This is really coming from a Promise VTrak array, 16 1TB disks in a RAID 10 configuration (2 spares). The VTrak is attached to a Nexenta head machine which created the zvol. – jemmille Nov 19 '09 at 21:18
  • Current results: WRITE 381MB/s (CPU 22%), RE-WRITE 202MB/s (CPU 14%), READ 469MB/s (CPU 11%), RND-SEEKS 791/sec – jemmille Nov 19 '09 at 21:20
  • oh, and this is iSCSI – jemmille Nov 19 '09 at 21:22
  • Ok, just so we are on the same page: the VTrak is set up as a JBOD, the Nexenta filer then creates a zpool with 7 pairs of mirrors + 2 spares, and this is presented as an iSCSI target to your server? Because if the VTrak is doing the RAID10 then what I said before still holds true. – 3dinfluence Nov 19 '09 at 21:28
  • The VTrak is not set up as JBOD, so I see what you are saying. I'm pretty new to ZFS and would happily change the setup to something better, I'd just have to convince my boss ;-) Regardless of that, any ideas with the caching? – jemmille Nov 19 '09 at 21:46
  • I haven't had the chance to use L2ARC so I don't have any personal experience with it. But to see the performance from L2ARC the cache has to be warm. I'm not sure a benchmark is going to do a good job of warming the cache enough to see its effect. In general you're much better off doing real-world tests. Is your server connected to the Nexenta with 10Gbit? – 3dinfluence Nov 19 '09 at 22:16
  • The Nexenta box has 2 quad NICs (plus 4 onboard). The quad NICs run into a switch set up with an LACP 8x bond for an 8Gbit link. The VTrak is connected to the Nexenta machine with a 3Gbit HBA card. That aggregated link has a private IP the servers use to connect. Each server has a 2Gbit bonded link to the storage network. I can explain all the logic behind this but this thread is becoming quite cumbersome. If you want to know more we should try to connect outside of this. I'm interested if you are, if for no reason other than to exchange ideas. – jemmille Nov 19 '09 at 22:42

It seems your tests are very sequential, like writing a large file with dd and then reading it. The ZFS L2ARC is designed to boost performance on random-read workloads, not on streaming-like patterns. Also, to get optimal performance, you might want to wait longer, until the cache is warm. Another point would be to make sure your working set fits on the SSDs. Observing I/O statistics during the tests would help figure out which devices are used and how they perform.
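To make the sequential-versus-random point concrete, here is a minimal Python sketch (not from the original thread) that issues block-aligned reads at random offsets in an existing test file and reports reads per second. The function name `random_read_iops`, the 8 KiB default block size and the read count are illustrative assumptions. The reads also go through the client's own page cache, so the file would need to be much larger than the client's RAM for the numbers to mean anything.

```python
import os
import random
import time


def random_read_iops(path, block_size=8192, n_reads=10000, seed=0):
    """Issue n_reads block-aligned reads at random offsets; return reads per second.

    Hypothetical helper for illustration only; expects a regular file.
    """
    rng = random.Random(seed)
    size = os.path.getsize(path)
    n_blocks = size // block_size
    if n_blocks == 0:
        raise ValueError("file is smaller than one block")
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter()
        for _ in range(n_reads):
            # Pick a random block and read it in full.
            offset = rng.randrange(n_blocks) * block_size
            os.pread(fd, block_size, offset)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return n_reads / elapsed if elapsed > 0 else float("inf")
```

Running it several times over the same file, with a working set larger than the ARC but smaller than the SSDs, would be one rough way to watch whether the rate improves as the cache warms.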

Given the state of the answers here, I will provide one.

Instead of answering with a question, or with an answer irrelevant to the question, I will try to give an answer that is relevant.

Sadly I do not know the factual answer as to what should be going on, but I can answer with my own experience.

From my own experience, a zvol bigger than the ARC (or L2ARC) will not be cached, other than to avoid read amplification.

You can run arc_summary on Linux to get access to the ARC statistics.

I tested by accessing the same file over and over inside a virtual machine whose drive is hosted on a zvol, which meant the same parts of the zvol should have been accessed over and over, but none of the I/O registered in the ARC at all, as if it were being bypassed.

On the other hand, I have another virtual machine hosted on a raw file on a ZFS dataset, and that is caching just fine.

To confirm whether ARC is enabled for the zvol (or dataset), check the primarycache property; for the L2ARC, check the secondarycache property.
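As a rough illustration of reading the ARC statistics mentioned above, the sketch below parses counter rows in the "name type data" layout that ZFS on Linux exposes in /proc/spl/kstat/zfs/arcstats and derives ARC and L2ARC hit ratios. The helper names and the sample numbers are made up for the example; only the counter names (hits, misses, l2_hits, l2_misses) are taken from that file.

```python
def parse_arcstats(text):
    """Parse 'name type data' rows into a dict of integer counters."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows have exactly three fields with a numeric last field;
        # header lines do not match and are skipped.
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def hit_ratio(hits, misses):
    """Fraction of lookups that were hits; 0.0 if there were no lookups."""
    total = hits + misses
    return hits / total if total else 0.0


# Made-up sample values, for illustration only.
SAMPLE = """\
name                            type data
hits                            4    900
misses                          4    100
l2_hits                         4    25
l2_misses                       4    75
"""

stats = parse_arcstats(SAMPLE)
arc = hit_ratio(stats["hits"], stats["misses"])
l2 = hit_ratio(stats["l2_hits"], stats["l2_misses"])
print(f"ARC hit ratio: {arc:.1%}, L2ARC hit ratio: {l2:.1%}")
```

On a live Linux system the same parser could be fed the contents of /proc/spl/kstat/zfs/arcstats. If the counters barely move while the VM is busy, that would be consistent with the I/O bypassing the ARC. Solaris-derived systems such as Nexenta expose these counters through kstat in a different format, so the parser would need adapting there.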

Chris C
Did you consider the ARC space compared to your test? When testing the I/O benefit of SSDs used as L2ARC (pool read cache) and/or ZIL (pool synchronous write cache), you need to consider the size of your ARC relative to your test's working set. If a read can be served from ARC, it will be, without pulling from L2ARC. Likewise, if write caching is enabled, writes will be coalesced regardless of ZIL unless flushes and explicit synchronous behavior are enabled (i.e. the initiator's write cache is disabled too, etc.).

If you want to see the value of SSD for smaller working sets, consider that 16 disk RAID10 will deliver about 1200+ IOPS (SAS/SATA?) for writes and about twice that for reads. Reducing the disk set to two (for testing) and reducing the ARC to minimum (about 1/8th main memory) will then allow you to contrast spindle vs. SSD. You'd otherwise need to get more threads banging on your pool (multiple LUNs) to see the benefit. Oh yes, and get more interfaces working too, so you're not BW bound by a single 1Gbps interface...
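The IOPS figures above can be reproduced with some back-of-the-envelope arithmetic. The sketch below assumes roughly 150 random IOPS per spindle, a number that is not in the answer and is chosen only because 8 mirror pairs at 150 IOPS gives the quoted 1200. It also assumes a write must land on both halves of a mirror (so each pair counts once) while a read can be served by either disk.

```python
def raid10_iops(n_disks, per_disk_iops):
    """Rough RAID10 estimate: writes scale with mirror pairs, reads with disks."""
    if n_disks % 2:
        raise ValueError("RAID10 needs an even number of disks")
    pairs = n_disks // 2
    return {
        "write": pairs * per_disk_iops,   # each write goes to both disks of a pair
        "read": n_disks * per_disk_iops,  # either disk of a pair can serve a read
    }


# Assumed per-spindle figure (not from the thread): about 150 IOPS.
est = raid10_iops(16, 150)
print(est)

# The answer's "about 1/8th of main memory" ARC minimum on the 24GB machine:
arc_min_gb = 24 / 8
print(arc_min_gb, "GB")
```

By the same estimate, the two-disk test set suggested above would manage only around 150 write and 300 read IOPS, which is why any SSD contribution should be much easier to spot there.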


Anyone attempting to benchmark the L2ARC will want to see how "warm" the L2ARC is, and also to assess whether their requests are hitting the L2ARC. There is a nice tool and article for doing just that: arcstat.pl updated for L2ARC statistics

David Baird
  • I use the newer [arcstat.py](http://github.com/mharsch/arcstat) and find that `arcstat.py -f read,miss%,l2read,l2miss%,dm%,pm%,mm%,arcsz,c,l2size,l2asize 2` gives you a great insight into how your arc and l2arc are performing over time. – Mark Booth Nov 07 '15 at 12:04

Are you running your tests with compression=on? ZFS is very efficient when working with compression, and many benchmarking tools test drive performance by writing lots of zeros.

Make sure to turn off compression while benchmarking, or use random data* while benchmarking.

*To write about 1 GB (1000 MiB) of random data: openssl rand -out myfile $((1024*1024*1000))
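If openssl is not at hand, a small Python sketch can produce the same kind of incompressible test file. The function name `write_random_file` and the 1 MiB chunk size are illustrative assumptions; the data comes from os.urandom, so compression should gain essentially nothing on it.

```python
import os


def write_random_file(path, total_bytes, chunk_bytes=1 << 20):
    """Write total_bytes of random data to path in chunks; return bytes written."""
    remaining = total_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            # Write at most one chunk at a time to keep memory use small.
            n = min(chunk_bytes, remaining)
            f.write(os.urandom(n))
            remaining -= n
    return total_bytes
```

Calling `write_random_file("myfile", 1024 * 1024 * 1000)` would match the size used in the openssl command above.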

Back in the day I used iozone to do some benchmarking. I'm not a benchmarking guru, but my command for a local benchmark was this:

iozone -a -o -r 128K -n 128K -g 48G

Here is a description of the command:

-a: Auto-Mode
-o: This forces all writes to the file to go completely to disk before returning to the benchmark
-r: Set Record Size. Default 128K on ZFS filesystem
-n: Minimum filesize = Record Size
-g: Maximum filesize = 2 * RAM

Maybe give it a try and post your experience with this tool. I had good results, but benchmarking is like statistics: don't trust a benchmark you didn't fake yourself. ;-)


One thing left: I would do a benchmark with and without the caching devices. I think there you can see the impact on the results. Would be nice to see some results. Just curious.

  • I was using iozone in the beginning, but since I was getting the same results from iozone and bonnie++ I just picked bonnie++. I'll give this command a go though and see what happens. – jemmille Nov 19 '09 at 19:11
  • I've done benchmarks with and without the caching. The results are almost the same which is the whole point of my question. – jemmille Nov 19 '09 at 21:12
  • Have you played a little bit with the record size? Maybe you will see an impact with a "large amount of small files". Or you could ask a consulting company to do a professional benchmark if the result is important for a company decision or something like that. Sorry the benchmark using iozone didn't work out for you, but like I said: I'm not a guru on filesystem benchmarking, and this kind of setup worked for me to see some impact from using ZFS in general, though with "only" 8GB RAM, no caching devices and letting ZFS take care of disk management (RAID-Z1). – chrw Nov 20 '09 at 06:14

I'm a little confused why you would use a raid array as your backing store for the zvol.

You are then limiting the way that ZFS can manage the disks/io with a layer that probably isn't as robust as ZFS. (as 3dinfluence mentions) Is setting up your RAID directly with ZFS an option?

Is it possible that you have enough spindles to match the SSD's IOs/s? (SSDs should boost your IOs/second, not necessarily sequential bandwidth)

  • After weeks of testing various configurations, a 16 disk, RAID-10 backed 6.4 TB zvol ended up being by far the fastest option, and in my particular situation speed was more important than robustness (we have a second 16 disk mirrored array for robustness, as this project is fully redundant on all hardware). I agree, after much learning about ZFS, that I am losing some of the ZFS features by doing it this way, but if any of you who have answered work for a company where you have to "work with what you are given", I think some of this confusion should go away ;-) – jemmille Jan 05 '10 at 12:17