Slow sequential speeds on 9x7-drive raidz2 (ZFS ZoL 0.8.1)

Question

I'm running a large ZFS pool built for 256K+ request size sequential reads and writes via iSCSI (for backups) on Ubuntu 18.04. Given the need for high throughput and space efficiency, and less need for random small-block performance, I went with striped raidz2 over striped mirrors.

However, the 256K sequential read performance is far lower than I would have expected (100 - 200MBps, peaks up to 600MBps). When the zvols are hitting ~99% iowait in iostat, the backing devices typically run between 10 and 40% iowait, which suggests to me the bottleneck is something I'm missing in configuration, given it shouldn't be the backplane or CPUs in this system, and sequential workloads shouldn't work the ARC too hard.

I've played quite a bit with module parameters (current config below) and read hundreds of articles, issues on the OpenZFS GitHub, etc. Tuning prefetch and aggregation got me to this performance level; by default, I was getting ~50MBps on sequential reads because ZFS was sending TINY requests to the disks (~16K). With aggregation and prefetch working OK (I think), disk reads are much larger, ~64K on average in iostat.
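For context on the small per-disk requests: raidz spreads the data sectors of each block across the data disks of a vdev, so a single 64K block already turns into small per-disk chunks before any aggregation happens. The sketch below is only an estimate under stated assumptions: ashift=12, raidz2, no compression (lz4 would shrink blocks further), and a helper name (`raidz_column_bytes`) that is made up for illustration. It prints both vdev widths mentioned in this post.

```python
import math

def raidz_column_bytes(block_bytes, width, nparity, ashift=12):
    """Largest data chunk one disk holds for a single block (hypothetical helper)."""
    sector = 1 << ashift                      # 4K sectors for ashift=12
    data_sectors = math.ceil(block_bytes / sector)
    data_disks = width - nparity              # columns that carry data
    # Data sectors are dealt out across the data columns; the fullest
    # column ends up with ceil(data_sectors / data_disks) sectors.
    return math.ceil(data_sectors / data_disks) * sector

for width in (7, 9):
    for block in (64 * 1024, 128 * 1024):
        kib = raidz_column_bytes(block, width, nparity=2) // 1024
        print(f"{width}-wide raidz2, {block // 1024}K block: up to {kib}K per disk")
```

Under these assumptions a 64K block yields at most 16K per disk on a 7-wide raidz2 and 12K on a 9-wide one, which is in the same range as the un-aggregated request sizes described above.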

The LIO iSCSI target with cxgbit offload and the Windows Chelsio iSCSI initiator work well outside the ZFS zvols: with an Optane mapped directly, I get nearly full line rate on the NICs (~3.5GBps read and write).

Am I expecting too much? I know ZFS prioritizes safety over performance, but I'd expect a 7x9 raidz2 to provide better sequential reads than a single 9-drive mdadm raid6.

System specs and logs / config files:

Chassis: Supermicro 6047R-E1R72L
HBAs: 3x 2308 IT mode (24x 6Gbps SAS channels to backplanes)
CPU: 2x E5-2667v2 (8 cores @ 3.3GHz base each)
RAM: 128GB, 104GB dedicated to ARC
HDDs: 65x HGST 10TB HC510 SAS (9x 7-wide raidz2 + 2 spares)
SSDs: 2x Intel Optane 900P (partitioned for mirrored special and log vdevs)
NIC: Chelsio 40Gbps (same as on initiator, both using hw offloaded iSCSI)
OS: Ubuntu 18.04 LTS (using latest non-HWE kernel that allows ZFS SIMD)
ZFS: 0.8.1 via PPA
Initiator: Chelsio iSCSI initiator on Windows Server 2019

Pool configuration:

ashift=12
recordsize=128K (blocks on zvols are 64K, below)
compression=lz4
xattr=sa
redundant_metadata=most
atime=off
primarycache=all

ZVol configuration:

sparse
volblocksize=64K (matches OS allocation unit on top of iSCSI)

Pool layout:

7x 9-wide raidz2
mirrored 200GB optane special vdev (SPA metadata allocation classes)
mirrored 50GB optane log vdev
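The interaction between volblocksize and raidz2 width can be estimated with the usual raidz allocation arithmetic: data sectors, plus parity per row, rounded up to a multiple of (parity + 1) sectors. This is a minimal sketch assuming ashift=12 and uncompressed blocks; `raidz_asize` is an illustrative name, not a ZFS command, and both widths that appear in this post are shown.

```python
def raidz_asize(psize, width, nparity, ashift=12):
    """Bytes allocated on a raidz vdev for one block of psize bytes (sketch)."""
    sectors = ((psize - 1) >> ashift) + 1           # data sectors
    data_disks = width - nparity
    rows = (sectors + data_disks - 1) // data_disks  # ceil division
    sectors += nparity * rows                        # parity sectors per row
    multiple = nparity + 1                           # allocations are padded to this
    sectors = ((sectors + multiple - 1) // multiple) * multiple
    return sectors << ashift

for width in (7, 9):
    for block in (64 * 1024, 128 * 1024):
        alloc = raidz_asize(block, width, nparity=2)
        print(f"{width}-wide raidz2, {block // 1024}K block: "
              f"{alloc // 1024}K allocated, {block / alloc:.1%} efficient")
```

With these assumptions, 64K blocks come out at about 67% efficiency on either width, while 128K blocks reach about 76% on a 9-wide raidz2 and stay at about 67% on a 7-wide one. Compression changes the real numbers, so treat this as a rough guide only.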

/etc/modprobe.d/zfs.conf:

# 52 - 104GB ARC, this system does nothing else
options zfs zfs_arc_min=55834574848
options zfs zfs_arc_max=111669149696

# allow for more dirty async data
options zfs zfs_dirty_data_max_percent=25
options zfs zfs_dirty_data_max=34359738368

# txg timeout given we have plenty of Optane ZIL
options zfs zfs_txg_timeout=5

# tune prefetch (have played with this 1000x different ways, no major improvement except max_streams to 2048, which helped, I think)
options zfs zfs_prefetch_disable=0
options zfs zfetch_max_distance=134217728
options zfs zfetch_max_streams=2048
options zfs zfetch_min_sec_reap=3
options zfs zfs_arc_min_prefetch_ms=250
options zfs zfs_arc_min_prescient_prefetch_ms=250
options zfs zfetch_array_rd_sz=16777216

# tune coalescing (same-ish, increasing the read gap limit helped throughput in conjunction with low async read max_active, as it caused much bigger reads to be sent to the backing devices)
options zfs zfs_vdev_aggregation_limit=16777216
options zfs zfs_vdev_read_gap_limit=1048576
options zfs zfs_vdev_write_gap_limit=262144

# ZIO scheduler in priority order 
options zfs zfs_vdev_sync_read_min_active=1
options zfs zfs_vdev_sync_read_max_active=10
options zfs zfs_vdev_sync_write_min_active=1
options zfs zfs_vdev_sync_write_max_active=10
options zfs zfs_vdev_async_read_min_active=1
options zfs zfs_vdev_async_read_max_active=2
options zfs zfs_vdev_async_write_min_active=1
options zfs zfs_vdev_async_write_max_active=4

# zvol threads
options zfs zvol_threads=32
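Since most of these values are raw byte counts, a quick way to sanity-check them is to parse the file and print human-readable sizes next to whatever the loaded module reports. The sketch below uses a few lines copied from the config above; the helper names are made up for illustration, and the live lookup assumes the standard Linux location /sys/module/zfs/parameters (it simply returns None when the module is not loaded).

```python
import os

SAMPLE_CONF = """\
# 52 - 104GB ARC, this system does nothing else
options zfs zfs_arc_min=55834574848
options zfs zfs_arc_max=111669149696
options zfs zfs_dirty_data_max=34359738368
options zfs zfetch_max_distance=134217728
options zfs zfs_vdev_aggregation_limit=16777216
options zfs zfs_vdev_read_gap_limit=1048576
"""

def parse_zfs_options(conf_text):
    """Return {parameter: int} for 'options zfs name=value' lines."""
    params = {}
    for line in conf_text.splitlines():
        parts = line.split("#", 1)[0].split()   # drop comments, tokenize
        if len(parts) >= 3 and parts[:2] == ["options", "zfs"]:
            for assignment in parts[2:]:
                name, _, value = assignment.partition("=")
                params[name] = int(value)
    return params

def read_live_value(name, base="/sys/module/zfs/parameters"):
    """Current value from the loaded module, or None if unavailable."""
    try:
        with open(os.path.join(base, name)) as fh:
            return fh.read().strip()
    except OSError:
        return None

def human(nbytes):
    """Format a byte count using binary units."""
    for unit, size in (("GiB", 2**30), ("MiB", 2**20), ("KiB", 2**10)):
        if nbytes >= size:
            return f"{nbytes / size:g} {unit}"
    return f"{nbytes} B"

for name, value in parse_zfs_options(SAMPLE_CONF).items():
    print(f"{name}: {human(value)} configured, live={read_live_value(name)}")
```

A mismatch between the configured and live columns would suggest the file was not picked up when the module was loaded.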

I'm tearing my hair out on this. Pressure's on from users to go all-Windows with Storage Spaces, but I've used parity storage spaces (even with Storage Spaces Direct with mirrors on top), and it's not pretty either. I'm tempted to go straight mdadm raid60 under iSCSI, but would love it if someone could point out something boneheaded I'm missing that will unlock performance with the bitrot protection of ZFS :)

Nice information about Veeam and ReFS + synthetic clones. I built a home system with Debian 10 and OpenZFS and get slow reads over NFS. My impression is that OpenZFS is not for production environments. Maybe you can give a status update on your present situation. I would advise you to use CentOS with the same setup you already built. I understand your wish to use ReFS. Good luck! – jew, May 21 '20 at 22:48

ewwhite · Answer 1 · 2019-08-07T13:52:34.103

7

Good question.

I think your sparse zvol block size should be 128k.
Your ZIO scheduler settings should all be higher, like minimum 10 and max 64.
zfs_txg_timeout should be much longer. I do 15 or 30s on my systems.
I think the multiple RAIDZ3's (or was that a typo) are overkill and play a big part in the performance. Can you benchmark with RAIDZ2?

Edit: Install Netdata on the system and monitor utilization and ZFS stats.

Edit2: This is for a Veeam repository. Veeam supports Linux as a target and works wonderfully with ZFS. Would you consider benchmarking that with your data? zvols aren't an ideal fit for what you're doing, unless the NIC's offload is a critical part of the solution.

edited Aug 07 '19 at 13:52

answered Aug 07 '19 at 05:20

ewwhite


Thanks! Point by point in follow-up comments, except Z3 which was indeed a typo :). On volblocksize, I've tested with both 128k and 64k, and the performance didn't change much for sequential reads. 128k would likely be a bit more space-efficient, but 64k matches the initiator client OS allocation unit size, and seems to do significantly better in random i/o scenarios (which are rare), while not mattering much in sequential i/o scenarios. – obrienmd Aug 07 '19 at 11:57
I'll test with txg_timeout higher - would that matter in the least for sequential reads? Given the low iowait on the backing disks, it didn't seem like write flushes were contending with / impacting average read speeds much. – obrienmd Aug 07 '19 at 12:02
MOST interesting feedback I have for you (I think) is for ZIO scheduler. When I move the needle on async mins and maxes, it _destroys_ io aggregation and the net result is quite bad. For reads, which is what I really care about here as writes are great, going to 10/64 makes average IOs to the disks ~16KB in iostat, and cuts the average read speed by 75% (~30 - 60MBps) given those disks' IOPS. I've also tweaked sync read #s and didn't see much effect, but I'll give that another shot regardless :) – obrienmd Aug 07 '19 at 12:04
zfs zfs_dirty_data_max_percent=25 - I'm usually 40% or greater there. – ewwhite Aug 07 '19 at 12:07
Oh, reads are a problem? What type of data is this? How full is the pool? – ewwhite Aug 07 '19 at 12:09
Yep, reads are the problem. Data was all written sequentially on the zvols, and this is a super fresh pool - about 35% full with very low fragmentation. It's a backing store for Veeam repository volumes, _huge_ files written sequentially on Windows. – obrienmd Aug 07 '19 at 12:10
I upped txg_timeout and the sync ZIO min and max read, didn't do much one way or the other. – obrienmd Aug 07 '19 at 12:11
So, I do run a large number of Veeam repositories based on ZFS. It's much cleaner to use native Linux versus iSCSI+zvols for this. Linux makes a great Veeam repository. – ewwhite Aug 07 '19 at 12:19
I'd typically go that route (and have a few mdadm+xfs older systems doing just that) - but for GFS copies, ReFS with synthetic fulls via fast clone is _phenomenal_ for space efficiency. From the space efficiency perspective, it's almost like inline dedupe (per chain) without the pain of inline dedupe :) Hence my want to use ZFS as a backing store (because Storage Spaces parity suuuuucks) and get decent sequential performance out of it :) – obrienmd Aug 07 '19 at 14:34
I'm suggesting to use ZFS filesystems for your setup, just not with zvols. – ewwhite Aug 07 '19 at 15:31
Sorry, I wasn't clear! I understand what you're suggesting. However, Veeam as of 9.x at some point added support for ReFS block cloning, which allows a repo to store a boatload of synthetic fulls and have them "share blocks". So, it removes the broken chain risk and most of the performance implications of using forever incremental or forever reverse incremental. ReFS with Veeam-usable block clone is only available on raw devices or storage spaces on Windows 2016+, meaning I'd have to use a zvol or a raw file on ZFS (both of which perform similarly) exported via iSCSI to make use of it. – obrienmd Aug 07 '19 at 15:50
@obrienmd Okay. – ewwhite Aug 07 '19 at 16:07
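The 75% drop reported in the comments is roughly what a simple fixed-IOPS model predicts when the average disk request shrinks from ~64K to ~16K. A minimal sketch, assuming per-disk IOPS stay constant and using the 100 - 200MBps range from the question as the starting point (the function name is illustrative):

```python
def scaled_throughput(mbps, old_request_kib, new_request_kib):
    """Throughput after a request-size change, assuming IOPS stay constant."""
    return mbps * new_request_kib / old_request_kib

for observed in (100, 200):
    estimate = scaled_throughput(observed, old_request_kib=64, new_request_kib=16)
    print(f"{observed} MBps at 64K -> about {estimate:g} MBps at 16K")
```

That gives about 25 - 50MBps, close to the ~30 - 60MBps seen with the 10/64 scheduler settings.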
