7

TL;DR: I'm having some performance issues with my hypervisor storage. here are a bunch of test results from fio. Skip to the Results section to read about them and see my questions.


Summary

I recently purchased an R730xd so before I migrate over to it I wanted to be sure that storage was performing opmtimally. I've been running some benchmark tests with fio and have found some alarming results. Using a combination of those results and fio-plot, I've amassed a pretty large collection of graphs and charts that demonstrate the issues across my various storage backends.

However, I'm having a hard time turning it into usable information because I don't have anything to compare it to. And I think I'm having some very strange performance issues.


Disk Configuration

Here are the four types of storage exposed to my hypervisor (Proxmox):

╔═══════════╦════════════════════════════════╦═════════════╦════════════════════════════╗
║  Storage  ║            Hardware            ║ Filesystem  ║        Description         ║
╠═══════════╬════════════════════════════════╬═════════════╬════════════════════════════╣
║ SATADOM   ║ 1x Dell K9R5M SATADOM          ║ LVM/XFS     ║ Hypervisor filesystem      ║
║ FlashPool ║ 2x Samsung 970 EVO M.2 SSD     ║ ZFS RAID 1  ║ Hypervisor Compute Storage ║
║ DataPool  ║ 6x HGST 7200RPM HDD            ║ ZFS RAID 10 ║ Redundant Data Storage     ║
║ RAIDPool  ║ 6x Seagate/Hitachi 7200RPM HDD ║ HW RAID 10  ║ General Purpose Storage    ║
╚═══════════╩════════════════════════════════╩═════════════╩════════════════════════════╝

Storage Details

Here is a more detailed breakdown for each storage backend:

  1. SATADOM: The SATADOM is managed directly by Proxmox via LVM. Here is the output from lvdisplay pve. The SATADOM is connected to the server via the internal DVD-ROM SATA port as it is unused in the R730xd model.

  2. FlashPool: The FlashPool is a simple ZFS RAID 1 comprised of dual NVMe SSDs. The goal is to use this as backing storage for my VMs. Here are the outputs for:

     zpool list  
     zpool status  
     zfs get all
    

    Each of the SSDs in the FlashPool are connected to the server via PCI-E -> M.2 adapters installed in x16 PCIe slots. I recognize that these are x4 PCIe adapters. However, I'm pretty sure NVMe only operates at that speed so faster adapters are not manufactured.

  3. DataPool: The DataPool is the only pre-existing dataset. It is a couple years old and was previously used for both Data and VM storage to the detriment of performance. It is also managed by Proxmox as a ZFS RAID 10.

    It was originally comprised of 6x 4TB HGST Ultrastar 7K4000 7200RPM disks. However, as they started to fail I decided to replace them with higher density disks. As a result, the array now consists of:

     2x 6TB HGST Ultrastar He6 7200RPM  
     4x 4TB HGST Ultrastar 7K4000 7200RPM 
    

    I obviously intend to eventually move entirely to 6TB disks as the older ones continue to fail. Here are the outputs for the same commands posted above for the FlashPool.

    These 6 disks are connected to the server via the first 6 bays on the backplane. This backplane is connected to a Dell H730 Mini PERC RAID Controller.

  4. RAIDPool: The RAIDPool is an experimental storage backend. I've never worked with hardware RAID before so I was excited for the opportunity now that I have a proper RAID Controller. Similar to the DataPool, these disks are installed in the last 6 bays on the backplane. However, instead of being passed through to Proxmox, they are managed by the PERC. They are presented to Proxmox as a single disk which is then managed by LVM and presented to the OS via logical volumes as XFS filesystems. Here is the output from lvdisplay RAIDPool.


RAID Controller Configuration

So, you may have just noticed that both the DataPool and RAIDPool are installed and managed by the H730 RAID Controller. However, the DataPool is managed by Proxmox via ZFS and the RAIDPool is managed by the actual controller.

Here is a screenshot of the topology of the physical disks. The H730 is capable of passing disks directly through to the OS and simultaneously managing other disks. As you can see, The first 6 disks are configured in Non-RAID mode and the last 6 disks are configured in Online mode.

  • Here are the configured properties for the controller from within the iDRAC UI.
  • Disk Cache is enabled for both Write Back and Read Ahead on the Virtual Disk (RAIDPool). Since this is configured specifically for the VD, it should not impact the ZFS drives.
  • Dick Cache for Non-RAID disks (ZFS DataPool) is set to Disable.
  • The Link Speed for all of the drives is set to auto.

Also, after stepping through all of the settings once more, I enabled Write Cache for the Embedded SATA Controller. So this might improve performance on the SATADOM from what is seen in the benchmarks below.


Benchmarking:

I have benchmarked all of these storage backends in two ways. For both tests, I ran a series of fio-plot commands in a small shell script that dumped the results in a few folders.

If you're crazy and want to parse through the raw results on your own, here they are. You will need to massage my scripts a bit to rerun then since I moved around the directory structure to organize it prior to uploading it.

In a nutshell, they ran a series of tests against each storage backend that evaluated its RANDOM bandwidth, IOPS, and latency. It then plotted these results on graphs. Some graphs compare multiple backends. Other graphs simply show results from individual backends. I did not perform any SEQUENTIAL tests. In all cases, the default block size was used for the test.

Test 1) From within Proxmox, I mounted all of the storage backends into the /mnt directory. The ZFS Pool were simply imported into the OS and both the RAIDPool and the SATADOM were presented to the OS via LVM. Each had a Logical Volume formatted as an XFS partition that was used for benchmarking. NOTE: I ran these benchmarks from the live OS so the performance of the SATADOM will be affected accordingly.

The log files were generated using these commands:

./bench_fio --target /mnt/SATADOM_Data/bm --type directory --size 450M --mode randread randwrite --output SATADOM
./bench_fio --target /mnt/RAIDPool_Data/bm --type directory --size 1G --mode randread randwrite --output RAIDPOOL
./bench_fio --target /mnt/DataPool/bm/ --type directory --size 1G --mode randread randwrite --output DATAPOOL
./bench_fio --target /mnt/FlashPool/bm/ --type directory --size 1G --mode randread randwrite --output FLASHPOOL

Test 2) I created three VMs in Proxmox. Each of which used a different backing storage from the FlashPool, DataPool, and RAIDPool. The FlashPool and DataPool VMs ran in their own ZFS dataset. The RAIDPool VM ran on its own thick-provisioned Logical Volume. All three VMs were given 4 vCPUs and 40GB of memory.

The log files were generated using these commands:

./bench_fio     --target /fio     --type file     --size 1G     --mode randread randwrite     --duration 600     --output DATAPOOL_VM
./bench_fio     --target /fio     --type file     --size 1G     --mode randread randwrite     --duration 600     --output RAIDPOOL_VM
./bench_fio     --target /fio     --type file     --size 1G     --mode randread randwrite     --duration 600     --output FLASHPOOL_VM

Results:

The graphs in the above Imgur links should all be in the same order. The results from the two benchmarks are quite a bit different. But that is to be expected when you factor in the overhead from virtualization. What is NOT expected to me, is that they all seem to behave about the same.

  • For example, this chart shows that when fio was run from within a VM, the average write bandwidth was somewhere around 125 MB/s. Shouldn't the two NVMe SSDs in RAID 1 (FlashPool) MASSIVELY outperform the SATADOM? Instead, you can see that the FlashPool VM took the LONGEST amount of time to complete the test and had the slowest average write bandwidth. The same situation can be seen for the Write IOPS comparison -- The average IOPS were around 3,000 and the FlashPool VM took the longest to execute the test!

  • Stepping away from the benchmarks taken from WITHIN a VM, and instead looking at those taken by directly interacting with the storage from the hypervisor, we can see some different behavior. For example, in this test the write bandwidth for the FlashPool and DataPool was as high as 400MB/s. However, the performance for the RAIDPool averaged around 10MB/s. Which coincidentally, was about the same as the SATADOM? Surely, the RAIDPool should have performed compatible to, if not better than, the DataPool? Given that they are comprised of similar disks present in the same RAID Controller? Similar to above, the Write IOPS show the same bizarre story.

  • The Write Latency from the Hypervisor tests also appears to be unusual. The RAIDPool appears to be experiencing up to ten times worse latency than the ZFS Pools? However, if you flip over to the VM tests, the latency for the three storage backends seems to congregate around 300us. Which is pretty similar to what we were seeing in the WORST cast for the RAIDPool. Why does this smooth effect happen to write latency when the tests are run from VMs instead of hypervisor? Why does the latency for the ZFS Pools suddenly become so much worse, and comparable to, the RAIDPool?

  • Looking at the read Bandwidth, IOPS, and latency shows a similar story. All metrics are equally slow, despite having massively different hardware configurations, when benchmarked from within a VM. However, once benchmarked from the hypervisor, the ZFS Pools suddenly greatly outperform everything else?


Questions:

  1. These results are abnormal... right? This benchmark from this website shows a 970 EVO achieving upwards of 900MB/s random write speeds. Why are mine only coming in at 150MB/s on the hypervisor and 10MB/s in a VM? Why are these speeds so different when benchmarked from the hypervisor and from a VM?

  2. Why does the RAIDPool suddenly become abnormally slow when benchmarked from the Hypervisor? Here we see that the Read Bandwidth in a VM averages 20MB/s. However, from the hypervisor, it instead reports 4MB/s. Just like the benchmark tests I showed in question 1, shouldn't these read speeds be closer to 900MB/s?

  3. Why do the ZFS Pools suddenly perform significantly worse when benchmarked from within a VM instead of the hypervisor? For example, here we can see that the read IOPS averaged around 200,000 and latency under 650us. However, when benchmarked from within a VM, we can suddenly see that the read IOPS average around 2,500 and the latency more than quadrupled? Shouldn't the performance in both situations be about the same?

TJ Zimmerman
  • 241
  • 5
  • 17

1 Answers1

10

When benchmarking ZFS pools, you need to understand how caching and recordsize interact with your workloads:

  • your fio commands do not skip the linux pagecache (no --direct=1 option), nor ZFS ARC. However, due the different mode of operation between the two, you can end favoring a plain filesystem (XFS) vs ZFS or vice-versa. To mitigate the caching effect, I suggest you benchmarking with a file 2x greater than your RAM value (ie: if having 24 GB of RAM, use a 48 GB file). Do not benchmark ZFS with caching disabled (ie: primarycache=none), as a CoW filesystem needs an high cache hit rate to give good performance (especially when writing less-than-recordsize blocks, as you can read below);

  • your random read/write IOPs and thoughtput are going to be severely affected by ZFS recordsize property, as ZFS generally transfers full recordsized blocks (with the exception of small files, where "small" means < recordsize). In other words, while fio is reading/writing 4K blocks, ZFS actually read/write 32K blocks for each 4K block requested by fio. Caching can (and will) alter this generic rule, but the point remains: with large recordsize, throughput saturation can be a thing. Please note that I am not stating that 32K recordsize is unreasonable (albeit I would probably use 16K to limit wear on the SSDs); however, you need to account for it when evaluating the benchmark results;

  • I would re-enable the physical disk cache for the pass-through disks as ZFS knows how to flush their volatile cache. However, you need to check that your H730P honours ATA FLUSHes / FUAs for pass-though disks (it should pass syncs, but its manual is unclear on this point and I have no actual hardware to try);

  • your RAIDPool array is composed of mechanical HDDs, so its random read performance are going to be low (the controller cache will not help you for random reads).

All considerend, I don't find your results to be abnormal; rather, they do not represent a valid workload and are partially misinterpreted. If you really want to compare ZFS and HWRAID+XFS, I suggest you to test with an actual expected workload (ie: a database + application VMs doing some useful works) while at the same time being sure to use ThinLVM (rather than classical LVM) to have at least a fast-snapshot capability somewhat comparable to ZFS own snapshot/clone features.

But, in a sense, you can avoid doing these tests, simply because the outcome will be quite predictable:

  • a plain HWRAID+LVM+XFS setup will be faster for sequential IO and random read/writes on datasets which fits into the linux pagecache: not being affected by CoW, it pays a much smaller overhead than ZFS;

  • the ZFS setup will be faster in real-world scenarios where the scan-resistant nature of ARC will ensure the most frequently used data remain always cached. Moreover, compression and checksumming are two killer features (to have similar features from HWRAID you need to use a stacked dm-integrity + vdo + thinlvm setup, which itself commands a great performance penalty).

As a reference point, I recently replaced a Dell R720xd featuring a H710P + 12 10K RPM SAS disks with a much cheaper SuperMicro 5029WTR featuring 2x SSD (for boot and L2ARC) + 1x NVMe Optane (for SLOG) and 6x 7.2K RPM SATA disks. The SuperMicro system, while having only 1/3 the nominal random read performance than the Dell one, performs much better thanks to ARC/L2ARC and compression.

In the end, while I fully understand the motivations to use a classical HWRAID+LVM+XFS system, I would not go back to using it rather than ZFS for a bare metal machine as an hypervisor (unless targeting specific workloads which really badly perform with a CoW layer in between or when extreme speed and DirectIO are required - see XFS dax option).

shodanshok
  • 44,038
  • 6
  • 98
  • 162
  • This is such a fantastic answer. You have taught me a lot. Thank you so much for taking the time to think through and write this. – TJ Zimmerman Jun 13 '20 at 20:09