
This is a follow-up to: High speed network writes with large capacity storage. The setup has changed notably.

I have a pool with a single raidz2 vdev of 6 drives, all Exos X18 CMR drives. Using fio and manual tests I know the array can sustain around 800 MB/s of sequential writes on average, which is fine and in line with the expected performance of this array. The machine is a Ryzen 5 PRO 2400GE (4C/8T, 3.8 GHz boost) with 32G of ECC RAM, an NVMe boot/system drive and 2x 10 Gbps Ethernet ports (Intel X550-T2). I'm running an up-to-date Arch system with zfs 2.1.2-1.
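
For reference, the 800 MB/s figure comes from sequential-write runs along these lines (a representative sketch, not the exact job file; the /tank/scratch path, size and iodepth are placeholders):

    # Representative fio sequential-write job (paths/sizes are placeholders,
    # not the exact job used for the 800 MB/s number):
    fio --name=seqwrite --directory=/tank/scratch \
        --rw=write --bs=1M --size=30G \
        --ioengine=libaio --iodepth=16 --numjobs=1 \
        --group_reporting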

My use case is a video archive of mostly large (~30G), write-once, read-once, compressed video. I've disabled atime, set recordsize=1M, and set compression=off and dedup=off: the data is effectively incompressible (testing showed worse performance with compression=lz4 than with it off, despite what the internet said) and there is no duplicate data by design. This pool is shared over the network via Samba. I've tuned my network and Samba to the point where transferring from NVMe NTFS on a Windows machine to NVMe ext4 reaches 1 GB/s, i.e. reasonably close to saturating the 10 Gbps link with 9K jumbo frames.
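
Concretely, the dataset properties were set along these lines ("tank/video" is a placeholder for the actual dataset name):

    # Placeholder dataset name; one property per invocation for compatibility
    zfs set atime=off tank/video
    zfs set recordsize=1M tank/video
    zfs set compression=off tank/video
    zfs set dedup=off tank/video
    zfs get atime,recordsize,compression,dedup tank/video   # verify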

Here's where I run into problems. I want to be able to transfer one whole 30G video archive at 1 GB/s to the raidz2 array, which can only sustain 800 MB/s of sequential writes. My plan is to use the RAM-based dirty data to absorb the spillover and let it flush to disk after the transfer is "completed" on the client side. I figured all I would need is (1024-800)*30 ≈ 7G of dirty data in RAM, which can then be flushed out to disk over ~10 seconds after the transfer completes. I understand the data integrity implications and the risk is acceptable, as I can always transfer the file again for up to a month if a power loss causes the file to be lost or incomplete.
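
Spelled out, the back-of-envelope numbers behind that estimate (same figures as above):

    # ~30 GiB at ~1 GB/s takes ~30 s; the pool drains ~800 MB/s during that time
    echo $(( (1024 - 800) * 30 ))   # 6720 MB of spillover to hold in RAM (~7G)
    echo $(( 6720 / 800 ))          # ~8 s (call it ~10 s) to flush after the copy ends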

However I cannot get ZFS to behave in the way I expect... I've edited my /etc/modprobe.d/zfs.conf file like so:

options zfs zfs_dirty_data_max_max=25769803776
options zfs zfs_dirty_data_max_max_percent=50
options zfs zfs_dirty_data_max=25769803776
options zfs zfs_dirty_data_max_percent=50
options zfs zfs_delay_min_dirty_percent=80

I have run mkinitcpio -P to refresh my initramfs and confirmed that the settings were applied after a reboot:

# arc_summary | grep dirty_data
        zfs_dirty_data_max                                   25769803776
        zfs_dirty_data_max_max                               25769803776
        zfs_dirty_data_max_max_percent                                50
        zfs_dirty_data_max_percent                                    50
        zfs_dirty_data_sync_percent                                   20

I.e. I set the max dirty data to 24G, which is way more than the 7G I need, and hold off on delaying writes until 80% of that is used. As far as I understand, the pool should be able to absorb ~19G into RAM before it starts to push back on writes from the client (Samba) with added latency.
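
For what it's worth, the live values can also be inspected (and tweaked without a reboot) through /sys/module/zfs/parameters, e.g.:

    # Read the live module parameters (same values arc_summary reports):
    grep . /sys/module/zfs/parameters/zfs_dirty_data_max \
           /sys/module/zfs/parameters/zfs_dirty_data_max_percent \
           /sys/module/zfs/parameters/zfs_delay_min_dirty_percent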

However, what I observe writing from the Windows client is that after around 16 seconds at ~1 GB/s the write performance falls off a cliff (iostat still shows the disks working hard to flush data), which I can only assume is ZFS's write-throttle pushback. But this makes no sense: even if nothing at all were flushed during those 16 seconds, the throttle should only have kicked in about 3 seconds later (19G at 1 GB/s). In addition it falls off once again at the end, see picture: [write speed over time](https://i.stack.imgur.com/Yd9WH.png)
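
To watch what the throttle is actually doing while the copy runs, the per-txg dirty bytes and per-vdev throughput can be monitored like this (the pool name "tank" is a placeholder; the txgs kstat has an ndirty column per transaction group):

    # Per-txg dirty bytes and sync/quiesce times (last few txgs):
    watch -n1 'tail -n 5 /proc/spl/kstat/zfs/tank/txgs'

    # Per-vdev bandwidth while the transfer runs:
    zpool iostat -v tank 1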

I've tried lowering zfs_dirty_data_sync_percent to start syncing earlier, since the dirty data buffer is so much larger than the default, and I've also tried adjusting the active I/O scaling with zfs_vdev_async_write_active_{min,max}_dirty_percent to kick in earlier, to bring the writes up to speed faster with the large dirty buffer. Both of these just moved the position of the cliff slightly, nowhere near what I expected.
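
Those knobs were changed at runtime along these lines (the values are only examples of what was experimented with, not a recommendation):

    # Start txg syncs earlier relative to the (now huge) dirty limit:
    echo 5  > /sys/module/zfs/parameters/zfs_dirty_data_sync_percent

    # Ramp async write I/Os to their maximum at a lower dirty percentage:
    echo 5  > /sys/module/zfs/parameters/zfs_vdev_async_write_active_min_dirty_percent
    echo 30 > /sys/module/zfs/parameters/zfs_vdev_async_write_active_max_dirty_percent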

Questions:

  1. Have I misunderstood how the write throttling delay works?
  2. Is what I'm trying to do possible?
  3. If so, what am I doing wrong?

Yes, I know, I'm literally chasing a couple of seconds and will never recoup the effort spent achieving this. That's ok, it's personal between me and ZFS at this point, and a matter of principle ;)

Emma

1 Answer


You don't currently have enough RAM or storage resources for what you're seeking.

Design around your desired I/O throughput levels and their worst-case performance.

If you need 1GB/s throughput under all conditions for the working set of data being described, then ensure the disk spindle count or interface throughput is capable of supporting this.
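
A back-of-envelope reading of that advice (the per-drive streaming figures below are assumptions, not from the answer): a 6-wide raidz2 has 4 data drives, so even at optimistic outer-track rates the vdev tops out near the 800-1000 MB/s already measured, and sustaining 1 GB/s would take more spindles or another vdev.

    # Rough estimate only; ~200-270 MB/s per drive is an assumed streaming rate
    echo $(( 4 * 200 ))   # ~800 MB/s conservative
    echo $(( 4 * 270 ))   # ~1080 MB/s optimistic (outer tracks, no overhead)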

ewwhite
  • I don't need it under "all conditions"; I need it in one very specific condition, a single 30GB burst. – Emma Apr 08 '22 at 06:55
  • How is 32G RAM not enough to buffer 7G? The system RAM pressure is very low, less than 6G used most of the time, so there's around 26G free. My NIC and Samba can do 1 GB/s as stated in the OP. Can you explain why the dirty data buffer cannot be used in this way with this amount of memory? Because to me, it seems like it should be... – Emma Apr 08 '22 at 07:09