High speed network writes with large capacity storage

Question

I have a NAS running Samba with a 20T ZFS pool consisting of one raid1 (mirror) vdev with two spinning rust drives. I have 16G RAM in the machine right now. The storage is used for a continuously growing, permanent backup archive of video footage. It's write once, read once for processing, and then possibly read again for a backup restore.

I regularly fling 40GiB files to this NAS. I'm going to upgrade my gigabit network to 10GbE in order to make this process less painful. However, I suspect I will then be limited by the write speed of the underlying drives.

My understanding is that the ZIL and a SLOG only accelerate synchronous writes, so adding an NVMe SSD as a SLOG wouldn't affect my use case, as I believe Samba uses asynchronous writes by default.

I'm not sure whether configuring Samba for synchronous writes and adding a SLOG on an NVMe SSD would do what I need. I understand this comes with the risk of data loss if the drive fails or the power cuts out. This is acceptable, as I retain the files on the source machine long enough to retransfer them in case of near-term data loss. Wear and tear on the SSD is a concern, but typical drives are rated for 300 TBW or thereabouts, which is enough to fill my never-delete NAS 15 times over, or to last 75 years at my current data generation rate. I'm OK with that, and with buying a new SSD if/when it breaks. These are acceptable caveats. Normally I would just try it and benchmark, but in the current everything-shortage I'd like to know ahead of time what I need to purchase.
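For reference, the endurance arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the figures quoted in the question (300 TBW, a 20 TB pool, 75 years); the function name is made up for illustration, and it assumes every byte is written to the SSD exactly once.

```python
def endurance_summary(tbw_rating_tb, pool_size_tb, years_to_exhaust):
    """Return (pool fills before the TBW rating is used up, implied TB written per year).

    Assumes each byte sent to the NAS is written to the SSD exactly once.
    """
    pool_fills = tbw_rating_tb / pool_size_tb
    implied_tb_per_year = tbw_rating_tb / years_to_exhaust
    return pool_fills, implied_tb_per_year


# Figures from the question: 300 TBW rating, 20 TB pool, 75 years.
pool_fills, tb_per_year = endurance_summary(300, 20, 75)
print(f"pool fills: {pool_fills:.0f}, implied data rate: {tb_per_year:.0f} TB/year")
```

In other words, the 75-year figure corresponds to roughly 4 TB of new footage per year.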

I know I can add more raid 1 vdevs to the pool to get a raid 10 pool, but this is too expensive: the midtower chassis cannot support that many drives, it would grossly over-provision the pool together with the existing drives, and it would use more energy over time to keep all that rust spinning.

What are my options for achieving write speeds in excess of 10Gbps to this zfs pool for at least 40GiB worth of data, aside from adding more spinning rust to the pool in a raid 10 fashion?

score 3 · Accepted Answer · answered Sep 30 '21 at 05:59

Synchronous write mode ensures that writes reach a persistent location before the write call returns. With asynchronous writes, data is cached in RAM and the write call returns right away; the filesystem then schedules the actual writes to the final location (the hard disk).

In ZFS's case, the point of the ZIL / SLOG is to act as fast interim persistent storage that makes synchronous mode practical, that is, it lets ZFS assure the writing client that its writes are final. Otherwise the filesystem would need to write the blocks to the hard disk directly, which makes synchronous mode slow.

In your case, if you want to guarantee full-speed writing of 40 GB of data, you should increase your RAM size to cover the size of the file.

However, since the filesystem starts writing to the hard disks immediately, you don't need 40 GB of memory to get full speed for your writes. For example, when the client has written 20 GB of data, 10 GB could be in the RAM cache and the remaining 10 GB already on the hard drive.
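As a rough way to reason about this before benchmarking, here is a minimal Python sketch of the fill/drain model described above. It assumes a constant network ingest rate, a constant disk drain rate, and no cap on how much data the filesystem will buffer in RAM (the comment below questions exactly that last assumption). The 200 MB/s drain rate is a hypothetical figure for a single HDD mirror, not a measurement.

```python
GIB = 1024 ** 3


def peak_buffer_bytes(file_bytes, ingest_bytes_per_s, drain_bytes_per_s):
    """Peak amount of data waiting in RAM while one file streams in.

    The buffer grows at (ingest - drain) for the whole transfer, so the
    peak is reached at the moment the client finishes sending.
    """
    if drain_bytes_per_s >= ingest_bytes_per_s:
        return 0.0  # the disks keep up, nothing accumulates
    transfer_seconds = file_bytes / ingest_bytes_per_s
    return (ingest_bytes_per_s - drain_bytes_per_s) * transfer_seconds


ingest = 10e9 / 8   # 10 Gbit/s line rate expressed in bytes/s
drain = 200e6       # ASSUMPTION: sustained write speed of the HDD mirror
peak = peak_buffer_bytes(40 * GIB, ingest, drain)
print(f"peak buffered data: {peak / GIB:.1f} GiB")
```

With these assumed rates, most of the 40 GiB file would still be in memory when the transfer ends, so the disk drain rate matters far more than the exact file size. The example in the answer (20 GB written, 10 GB still in RAM) corresponds to a drain rate of half the ingest rate.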

So you need to do some benchmarking to see how much RAM you need in order to get full-speed writes.

I thought the RAM based write buffer was limited to some size. You're saying ZFS will happily gobble up all the system memory as write cache? That sounds surprising to me as that could cause all kinds of weird performance issues if you don't leave enough RAM for launching new processes... — Emma, Oct 04 '21 at 18:27

shodanshok · Answer 2 · 2021-10-05T06:12:34.860

I understand this comes with the risk of data loss if the drive fails or power cuts out. This is acceptable as I retain the files on the source machine long enough to retransfer in case of near term data loss

If you can tolerate the loss of up to 5 seconds of writes, you can simply configure ZFS to ignore sync requests with the command `zfs set sync=disabled tank` (where tank is the name of your pool or dataset).

Forcing all writes to go through a SLOG, even a very fast one, is never faster than bypassing sync requests. A SLOG is not a classical writeback cache, which absorbs writes in order to de-stage them to the slower tier. Rather, it is a means of providing low-latency persistence by temporarily storing sync writes (and only those) on intermediate fast storage. After some seconds, the very same writes are transferred from main memory to the main pool. A SLOG is never read unless a crash (and subsequent recovery) happens.

That said, with a single HDD-based mirror vdev you will never be able to saturate a 10 Gb/s link. To write consistently at ~1 GB/s, you need at least 10 HDDs in raidz2 or 12+ HDDs in mirror+striping. Or, even better, you need an all-SSD pool. And this is before considering things such as recordsize, compression, etc.
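To see roughly where disk counts of that order come from, here is a back-of-the-envelope Python sketch. It assumes idealized scaling: a 2-way mirror vdev writes at about the speed of one disk, and a raidz2 vdev streams at about the speed of its data disks (total minus two parity). The 125 MB/s sustained per-disk figure is an assumption chosen for illustration; real drives and workloads will differ.

```python
import math


def disks_needed(target_mb_s, per_disk_mb_s, layout):
    """Estimate the total disk count for a target sequential write speed.

    layout: "mirror" (striped 2-way mirrors) or "raidz2" (one wide vdev).
    Idealized model: throughput scales linearly with data-bearing disks.
    """
    data_units = math.ceil(target_mb_s / per_disk_mb_s)
    if layout == "mirror":
        return 2 * data_units   # each mirror vdev holds two copies
    if layout == "raidz2":
        return data_units + 2   # two extra disks' worth of parity
    raise ValueError(f"unknown layout: {layout}")


PER_DISK_MB_S = 125  # ASSUMPTION: sustained sequential write speed per HDD
for layout in ("raidz2", "mirror"):
    print(layout, disks_needed(1000, PER_DISK_MB_S, layout))
```

Under these assumptions, the raidz2 estimate lands on 10 disks and the mirror estimate above 12, in line with the figures given above.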

EDIT, to clarify SLOG jobs:

To minimize latency for sync writes, ZFS uses the so-called ZFS Intent Log (ZIL). In short: each time sync writes arrive, ZFS immediately writes them to a temporary pool area called the ZIL. This enables the writes to return immediately, letting the calling application continue. After some seconds, at transaction commit, any records written to the ZIL are committed to the main pool. This does not mean that the ZIL is read at each commit; rather, the to-be-written data comes from the main DRAM ARC cache. In other words, the ZIL is a sort of "log-ahead journal" that ensures fast persistence for to-be-written sync data.

This actually means that sync writes are duplicated: they are written both to the ZIL and to the main pool. Enter the SLOG (separate log device): a device dedicated to sync writes only, i.e. it frees the main pool from ZIL traffic. A fast SSD SLOG is important because HDDs are very slow for sync writes. The SLOG is not your classical writeback cache because:

it only absorbs sync writes, completely ignoring normal writes;
it replicates only data that is already cached in ARC.

The two points combined mean that a big SLOG is basically wasteful, because it only needs 3x the maximum size of a ZFS transaction. In other words, a 2-4 GB SLOG is sufficient for most cases, with a bigger SLOG only useful in specific setups.
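The "3x the maximum transaction size" rule can be turned into a small estimate. The Python sketch below assumes a transaction can hold at most what arrives during one 5-second commit interval, optionally limited by a cap on dirty data in RAM. The function and parameter names are hypothetical, and the cap is left as an input rather than a claim about any particular ZFS default.

```python
def slog_size_bytes(ingest_bytes_per_s, txg_seconds=5,
                    dirty_data_cap_bytes=None, txgs_in_flight=3):
    """Estimate a sufficient SLOG size as 3x the largest transaction.

    dirty_data_cap_bytes: optional upper bound on dirty data held in RAM
    (ASSUMPTION: supplied by the reader for their own system).
    """
    max_txg = ingest_bytes_per_s * txg_seconds
    if dirty_data_cap_bytes is not None:
        max_txg = min(max_txg, dirty_data_cap_bytes)
    return txgs_in_flight * max_txg


gigabit = 1e9 / 8       # 1 Gbit/s in bytes/s
ten_gigabit = 10e9 / 8  # 10 Gbit/s in bytes/s
print(slog_size_bytes(gigabit) / 1e9)       # GB needed at gigabit line rate
print(slog_size_bytes(ten_gigabit) / 1e9)   # GB needed at 10GbE, uncapped
print(slog_size_bytes(ten_gigabit, dirty_data_cap_bytes=1.6e9) / 1e9)
```

At gigabit speeds the estimate falls just under 2 GB, consistent with the 2-4 GB guideline; only at sustained 10GbE rates without a dirty-data cap does it grow beyond that.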

Such a SLOG is key to providing lower latency for random sync writes but, while it can absorb very small spikes of sequential sync writes, this is not its main function. In other words, you can see the ZIL/SLOG as a persistent slice of ARC. The corollary is that you cannot expect to write dozens of GBs and hide the slow main pool speed behind the SLOG, because this would mean that you already have dozens of GBs of dirty data inside your RAM-based ARC.

Setting sync=disabled instructs ZFS to treat all writes, even sync ones, as normal async writes. This bypasses the ZIL/SLOG entirely for data, and if you can accept a 5s data loss window, it is the fastest setting you can ever achieve, even when compared to a very fast SLOG such as Optane or a RAM drive. The nice thing about sync=disabled is that it does not disable sync writes for ZFS's own metadata, and so it does not put your filesystem at risk. This does not mean you can use it lightly: as stated multiple times, you should be sure to understand its implications (you can lose the last seconds of unsynced data in case of a crash or power loss).

On the other hand, a classical SSD-based writeback cache such as lvmcache or bcache can (more or less) efficiently use hundreds of GBs of SSD cache to mask the main pool's latency / throughput, specifically because they are full-fledged writeback caches which do not need to have their data inside main memory (on the contrary, main memory flushes itself via these SSD caches).

The reasoning behind ZFS was that the (big) main system memory is your real read/write cache, with the SLOG being a means of achieving lower latency for random sync writes.

or just add a SATA SSD as cache for around 500 MB/s, or an NVMe one for 1 GB/s and up — djdomi, Oct 03 '21 at 12:05
@djdomi no, as explained above, a SLOG drive is **not** a writeback cache drive. For what you are suggesting, one should use `lvmcache` or `bcache`, with no ZFS involvement. — shodanshok, Oct 03 '21 at 13:21
@djdomi again, no: ZIL is the ZFS intent log, and a SLOG is simply a device dedicated to ZIL duties (rather than using the main pool). — shodanshok, Oct 03 '21 at 17:07
whatever it's called, a write cache exists natively in ZFS, that's still a fact, and an SSD can be used for it — djdomi, Oct 03 '21 at 17:10
@djdomi well, I suggest you read some documentation. For example: https://www.servethehome.com/what-is-the-zfs-zil-slog-and-what-makes-a-good-one/ — shodanshok, Oct 03 '21 at 17:45
This comment chain is exactly why I'm asking the question: I keep finding conflicting information, and the manual I read wasn't exactly clear on the expected benefits in my case. I would love it if the confusion could be cleared up. — Emma, Oct 03 '21 at 23:56
I never said I have a raidz pool, my pool is one raid1 vdev. I don't need consistent 10Gbps writing, I need 10Gbps for 40GiB. My pool is 20T with two drives, getting that kind of capacity with SSDs is out of budget. — Emma, Oct 04 '21 at 18:25
Do you have (well over) 40 GB of RAM? If so, you can (de)tune ZFS, forcing it to keep many GB of data cached in ARC for a long time, but this would be terrible both for data safety and for performance (you are going to suffer *minutes-long* stalls when ZFS finally decides to flush dirty pages to stable storage). For such a workload, ZFS is not the right choice, and raidz vs mirror does not change anything. I would rather try `lvmcache` in writeback mode, but in this case be aware that a faulty cache device will cause data loss (in other words: you need to mirror your cache devices). — shodanshok, Oct 04 '21 at 19:28
