While user121391's answer is mostly correct, the 1/4 limit for metadata no longer applies, and has not applied for a long time:
There's a limit to how much of the ZFS ARC cache can be allocated for metadata (and the dedup table falls under this category), and it is capped at 1/4 the size of the ARC
First of all, zfs_arc_meta_limit (the amount of cache memory that may be used for metadata, including the dedup table) has always been tunable, as far as I recall. So even in very old ZFS versions where 25% may have been the default, you could use that setting to tune the amount of cache available for metadata.
In the case of a backup system, where most of the user data is rarely accessed, >=75% for metadata + <=25% for user data might be more appropriate. Please keep in mind that this tunable is an absolute amount of memory in bytes, not a percentage.
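As a minimal sketch of what that looks like on ZFS on Linux (the 12 GiB figure is only an illustration; size it to your own ARC), the parameter can be changed at runtime via sysfs or persistently via a module option:

    # runtime: allow up to 12 GiB of the ARC to hold metadata
    echo 12884901888 > /sys/module/zfs/parameters/zfs_arc_meta_limit

    # persistent: /etc/modprobe.d/zfs.conf
    options zfs zfs_arc_meta_limit=12884901888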
Depending on your ZFS implementation, please also consider the following:
For ZFS in Oracle Solaris 11, the limit was removed entirely by default quite some time ago:
Prior to this change being implemented, the ARC limited metadata to one quarter of memory. Whatever the rationale for this might once have been it carries now a serious adverse effect on dedup performance. Because the DDT is considered to be metadata, it is subject to the 1/4 limit. At this point, this limit is an anachronism; it can be eliminated (or rather, set to arc_c).
So while you CAN still set the limit, it is no longer recommended.
For ZFS on Linux up to 0.6.x (e.g. in Ubuntu 16.04), the default seems to be 75%:
zfs_arc_meta_limit (ulong):
The maximum allowed size in bytes that meta data buffers are allowed to consume in the ARC. When this limit is reached meta data buffers will be reclaimed even if the overall arc_c_max has not been reached. This value defaults to 0 which indicates that 3/4 of the ARC may be used for meta data.
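If you want to verify what your system is actually doing, the current parameter value and the ARC's view of metadata usage can be read directly (paths as on ZFS on Linux; a parameter value of 0 means the 3/4 default is in effect):

    # 0 means the default (3/4 of the ARC) is in effect
    cat /sys/module/zfs/parameters/zfs_arc_meta_limit

    # current metadata limit and usage as seen by the ARC
    grep arc_meta /proc/spl/kstat/zfs/arcstats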
There's also a tunable if you would like to make sure a minimum amount of memory is always reserved for meta data:
zfs_arc_meta_min (ulong):
The minimum allowed size in bytes that meta data buffers may consume in the ARC. This value defaults to 0 which disables a floor on the amount of the ARC devoted to meta data.
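Again purely as an illustration on ZFS on Linux (the 4 GiB floor is an arbitrary example value):

    # /etc/modprobe.d/zfs.conf - reserve at least 4 GiB of the ARC for metadata
    options zfs zfs_arc_meta_min=4294967296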
In ZFS on Linux 0.7.0, it seems like there will be a way to tune the amount of memory with a percentage limit:
zfs_arc_meta_limit_percent (ulong):
Percentage of ARC buffers that can be used for meta data. See also zfs_arc_meta_limit which serves a similar purpose but has a higher priority if set to nonzero value.
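On such a version, a module option along these lines should work (75 is just the documented default, shown here as an example):

    # /etc/modprobe.d/zfs.conf - express the metadata limit as a percentage of the ARC
    options zfs zfs_arc_meta_limit_percent=75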
If you're planning to use a Linux-based ZFS implementation, consider simulating your use case in a virtual machine before spending lots of $$$ on hardware. I would recommend testing the worst case for dedup (= 100% random data). If you do not have the necessary virtualization resources at hand, be advised that you can always spin up insanely huge instances on most cloud providers for a couple of hours for very little money.
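A sketch of such a test, assuming a throwaway file-backed pool (all names and sizes are placeholders): fill it with incompressible random data, then let zdb estimate the dedup table and ratio without having to enable dedup for real:

    # throwaway file-backed pool for experiments
    truncate -s 100G /var/tmp/testpool.img
    zpool create testpool /var/tmp/testpool.img

    # worst case for dedup: unique, incompressible data
    dd if=/dev/urandom of=/testpool/random.bin bs=1M count=50000

    # simulate dedup: prints a DDT histogram and the expected dedup ratio
    zdb -S testpool

    # once dedup is actually enabled, DDT statistics show up here as well
    zpool status -D testpool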
One last thing to consider: you can always tune the ZFS recordsize. Generally speaking, small record sizes yield better dedup ratios but obviously require more RAM for the dedup table, while larger record sizes yield worse dedup ratios but require less RAM for the dedup table. E.g.: while we're currently not using dedup on our ZFS backup storage, I have set the ZFS recordsize to 1M to match the block size our backup application is working with.
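Setting it is a one-liner (the dataset name is just a placeholder; note that recordsize only affects newly written blocks, and 1M records require the large_blocks pool feature):

    # smaller recordsize -> better dedup ratio, larger DDT; larger recordsize -> the opposite
    zfs set recordsize=1M tank/backup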
Not sure why I just wrote a PhD thesis on the caching of ZFS metadata, but I hope it helps. :)
As you're saying it's a rule of thumb, it's likely that it would run with less available RAM as well; it would just take longer. In addition, it would depend on how much you're actually going to recover by using dedup. Maybe this could help you? – Seth – 2017-01-19T10:08:14.413

I tried running it in a VM for testing at 16GiB RAM. Imported about a month of backups, and everything came to a crawling halt :) Dedup ratio was impressive, though, and is for the full dataset estimated to be 2.3. – Daniel – 2017-01-19T12:33:13.900