ZFS dedupe (again): Is memory usage dependent on physical (deduped, compressed) data stored, or on logical data used?

5

1

I've been googling this a lot, but I cannot find sufficient info on this one. The rule of thumb seems to be 5 GB of RAM per 1 TB of storage. But what does storage actually mean here: physical or logical data used?

Let's say I have a 6 TB hard drive, no dedupe, no compression. I have 6 TB of actual data. Let's assume it would dedupe 2:1, down to 3 TB of data. Would we (approximately) require 3 * 5 GB of memory, or 6 * 5 GB?

As I understand it, the memory usage depends on the number of records. Since I cannot store more than 6 TB of actual records on the disk, about 30 GB ought to be enough regardless of the compression/deduplication ratio, depending of course on the actual record sizes?

The thing is that we'd like to calculate which is cheaper: replacing the 6 × 6 TB disks (3× onsite storage/mirror/hot spare, 3× offsite; we don't have more slots available in those boxes) with larger ones for backups, or buying some RAM for both boxes.

(Disclaimer: I'm not a sysadmin, but someone needed to put that hat on, so we can continue to have backups.)

Daniel

Posted 2017-01-19T09:23:36.897

Reputation: 53

As you're saying, it's a rule of thumb, so it's likely that it would run with less RAM available as well; it would just take longer. In addition, it would depend on how much space you're actually going to recover by using dedup. Maybe this could help you?

– Seth – 2017-01-19T10:08:14.413

I tried running it in a VM for testing with 16 GiB of RAM. I imported about a month of backups, and everything came to a crawling halt :) The dedup ratio was impressive, though, and is estimated at 2.3 for the full dataset. – Daniel – 2017-01-19T12:33:13.900

Answers

4

While user121391's answer is mostly correct, the 1/4 limit for metadata is no longer the case (and has not been for a long time):

There's a limit to how much of the ZFS ARC cache can be allocated for metadata (and the dedup table falls under this category), and it is capped at 1/4 the size of the ARC

First of all, zfs_arc_meta_limit (the amount of caching memory that may be used for metadata, including the dedup table) has always been tunable (IIRC). So even in very old ZFS versions where 25% might have been the default, you could use that setting to tune the amount of cache available for metadata. In the case of a backup system where most of the user data is rarely accessed, >=75% for metadata + <=25% for user data might be more appropriate. Keep in mind that this tunable is specified as an amount of memory in bytes, not as a percentage.
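As a rough sketch of how you might arrive at a byte value for that tunable (assuming ZFS on Linux, where the module parameters are exposed under /sys/module/zfs/parameters/; the 48 GiB ARC size below is just an example figure, not a recommendation):

```python
# Minimal sketch: compute a byte value for zfs_arc_meta_limit.
# Assumes ZFS on Linux; the ARC size is a made-up example value.

arc_size_gib = 48                 # hypothetical ARC size (zfs_arc_max) in GiB
meta_fraction = 0.75              # reserve 75% of the ARC for metadata

limit_bytes = int(arc_size_gib * 2**30 * meta_fraction)
print(f"zfs_arc_meta_limit = {limit_bytes} bytes "
      f"({limit_bytes / 2**30:.1f} GiB)")

# On ZFS on Linux, the value could then be written (as root) to
#   /sys/module/zfs/parameters/zfs_arc_meta_limit
# or persisted via an "options zfs zfs_arc_meta_limit=<bytes>" line
# in /etc/modprobe.d/zfs.conf
```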

Depending on your ZFS implementation, please also consider the following:


For ZFS in Oracle Solaris 11, the limit has long been completely removed by default:

Prior to this change being implemented, the ARC limited metadata to one quarter of memory. Whatever the rationale for this might once have been it carries now a serious adverse effect on dedup performance. Because the DDT is considered to be metadata, it is subject to the 1/4 limit. At this point, this limit is an anachronism; it can be eliminated (or rather, set to arc_c).

So while you CAN still set the limit, it is no longer recommended.


For ZFS on Linux up to 0.6.x, e.g. in Ubuntu 16.04, the default seems to be 75%:

zfs_arc_meta_limit (ulong): The maximum allowed size in bytes that meta data buffers are allowed to consume in the ARC. When this limit is reached meta data buffers will be reclaimed even if the overall arc_c_max has not been reached. This value defaults to 0 which indicates that 3/4 of the ARC may be used for meta data.

There's also a tunable if you would like to make sure a minimum amount of memory is always reserved for metadata:

zfs_arc_meta_min (ulong): The minimum allowed size in bytes that meta data buffers may consume in the ARC. This value defaults to 0 which disables a floor on the amount of the ARC devoted meta data.

In ZFS on Linux 0.7.0, it seems like there will be a way to tune the amount of memory with a percentage limit:

zfs_arc_meta_limit_percent (ulong): Percentage of ARC buffers that can be used for meta data. See also zfs_arc_meta_limit which serves a similar purpose but has a higher priority if set to nonzero value.


If you're planning to use a Linux-based ZFS implementation, consider simulating your use case in a virtual machine before spending lots of $$$ on hardware. I would recommend testing the worst case for dedup (100% random data). If you do not have the necessary virtualization resources at hand, be advised that you can always spin up insanely huge instances on most cloud providers for a couple of hours for very little money.
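As a minimal sketch of what such a worst-case test could look like (the target path, file count and file size below are made-up example values, not taken from any real setup), you could simply fill a test dataset with random data, which neither compresses nor deduplicates:

```python
# Minimal sketch: fill a test dataset with the dedup worst case
# (100% random data). Path, file count and sizes are hypothetical.

import os

target = "/testpool/random"       # hypothetical mountpoint of a test dataset
file_count = 100
file_size = 1 * 2**30             # 1 GiB per file
chunk = 4 * 2**20                 # write in 4 MiB chunks

os.makedirs(target, exist_ok=True)
for i in range(file_count):
    with open(os.path.join(target, f"random_{i:04d}.bin"), "wb") as f:
        written = 0
        while written < file_size:
            f.write(os.urandom(chunk))
            written += chunk
```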

One last thing to consider: you can always tune the ZFS recordsize. Generally speaking, small record sizes yield better dedup ratios (but obviously require more RAM for the dedup table), while larger record sizes yield worse dedup ratios but require less RAM for the dedup table. For example, while we're currently not using dedup on our ZFS backup storage, I have set the ZFS recordsize to 1M to match the block size our backup application works with.
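As a rough illustration of that trade-off, here is a small sketch estimating the DDT size for a few recordsizes, using the ~320 bytes per DDT entry figure mentioned in the other answer and the 6 TB pool size from the question:

```python
# Sketch: estimated DDT size vs. recordsize, assuming ~320 bytes per
# DDT entry (figure from the other answer) and 6 TB of stored data.

DDT_ENTRY_BYTES = 320
pool_bytes = 6 * 10**12           # 6 TB of stored data

for recordsize in (8 * 2**10, 64 * 2**10, 128 * 2**10, 2**20):
    blocks = pool_bytes / recordsize
    ddt_gib = blocks * DDT_ENTRY_BYTES / 2**30
    print(f"recordsize {recordsize // 2**10:>4} KiB -> "
          f"{blocks / 1e6:7.1f}M blocks, ~{ddt_gib:6.1f} GiB DDT")
```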

Not sure why I just wrote a PhD thesis on the caching of ZFS metadata, but I hope it helps. :)

nlx-ck

Posted 2017-01-19T09:23:36.897

Reputation: 156

This actually helped quite a lot! Thanks! The 1/4th thing was the major buzz kill. That'd definitely make it cheaper than more hard drives for our use case. – Daniel – 2017-02-08T12:31:45.707

3

The calculation is based on the actual pool size before dedup, or more accurately, on the number of blocks stored in the pool (each block needs about 320 bytes of space in the DDT; the number of blocks needed varies depending on the actual data stored). Therefore you would assume 6 * 5 = 30 GB as a rule of thumb.
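As a small worked example of where those numbers come from (assuming the ~320 bytes per DDT entry above and an average block size of 64 KiB, which is the assumption behind the 5 GB/TB rule of thumb quoted below):

```python
# Worked example: where the "5 GB per TB" rule of thumb comes from,
# assuming ~320 bytes per DDT entry and an average block size of 64 KiB.

DDT_ENTRY_BYTES = 320
AVG_BLOCK_SIZE = 64 * 2**10       # 64 KiB

def ddt_size_gb(pool_tb):
    blocks = pool_tb * 10**12 / AVG_BLOCK_SIZE
    return blocks * DDT_ENTRY_BYTES / 10**9

print(f"per TB : ~{ddt_size_gb(1):.1f} GB of DDT")   # ~4.9 GB -> "5 GB" rule
print(f"for 6 TB: ~{ddt_size_gb(6):.1f} GB of DDT")  # ~29.3 GB -> "6 * 5 = 30"
```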

But that is not all that is needed, as stated in this excellent guide on dedup:

The Total RAM Cost of Deduplication

But knowing the size of your deduplication table is not enough: ZFS needs to store more than just the dedup table in memory, such as other metadata and of course cached block data. There's a limit to how much of the ZFS ARC cache can be allocated for metadata (and the dedup table falls under this category), and it is capped at 1/4 the size of the ARC.

In other words: Whatever your estimated dedup table size is, you'll need at least four times that many in RAM, if you want to keep all of your dedup table in RAM. Plus any extra RAM you want to devote to other metadata, such as block pointers and other data structures so ZFS doesn't have to figure out the path through the on-pool data structure for every block it wants to access.

Therefore, the rules of thumb are extended:

  • For every TB of pool data, you should expect 5 GB of dedup table data, assuming an average block size of 64K.
  • This means you should plan for at least 20GB of system RAM per TB of pool data, if you want to keep the dedup table in RAM, plus any extra memory for other metadata, plus an extra GB for the OS.

In your case, this comes to roughly 120+ GB of RAM, so it is not out of the question for current Xeon E5 server boards (128-512 GB is the usual RAM size per CPU). The article also contains a real-world example with dollar figures that should serve you well.
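As a quick sketch of that estimate for the 6 TB case (simply combining the two rules of thumb above; note that the other answer argues the factor-of-four metadata cap no longer applies, or is tunable, on newer implementations):

```python
# Rough total-RAM estimate for the 6 TB pool, following the rules of
# thumb above: 5 GB of DDT per TB, times 4 because of the 1/4 ARC
# metadata cap, plus an extra GB for the OS.

pool_tb = 6
ddt_gb_per_tb = 5
arc_meta_factor = 4               # DDT must fit into the 1/4 metadata share
os_overhead_gb = 1

ram_gb = pool_tb * ddt_gb_per_tb * arc_meta_factor + os_overhead_gb
print(f"~{ram_gb} GB of RAM")     # ~121 GB -> "roughly 120+ GB"
```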

user121391

Posted 2017-01-19T09:23:36.897

Reputation: 1 228

Ahh, thanks! Finally understood it. We ran a DDT estimation, and we'd actually be closer to 5.5 GB/TB. Assuming we stay below 80% utilisation (dedup would be around 2.3, compression 1.5 => enough data), 128 GB would be fine. Though we might skip that and just run RAID-Z1 in both locations for the time being: less redundancy and actually less space, but money is sadly an issue. One last thing: we could run an L2ARC, which could hold the dedup table. Since we don't need to be hyper-performant, it would maybe actually be OK to do that. But how much memory is enough then? 16 GiB is not :) – Daniel – 2017-01-19T12:30:51.080

@Daniel If you try it, it would be nice if you could report your experiences here; it seems that not many people have tried this yet. Of course, have a backup first ;) – user121391 – 2017-01-20T09:10:07.893

I finally have values :) We bought an additional system with 64 GB of ECC memory, 4x 10 TB HDDs, no L2ARC, running in mirror mode, a Debian Stretch system with its included ZFS version (0.6.something) on top of LUKS. Dedup and compression are turned on. It is running on 3 years of partially thinned-out rsnapshot data of mostly Debian VMs, including user-generated data, like a ton of images that most probably have been renamed, copied, or moved from time to time and thus were not caught by rsnapshot. – Daniel – 2017-07-22T09:25:40.823

We got a total of 25.4M allocated blocks, a dedup ratio of 2.45x, and a compression ratio of 1.6x (compared to 1.8x on non-deduped data). Logical data is 7.28T; physical data on the disks is 2.24T. If I did the calculation correctly, we are only sitting at 7.6 GiB used for the DDT. I set zfs_arc_max to 58 GiB and have done no further tuning at all. If you'd like to know anything else, I'm happy to help. – Daniel – 2017-07-22T09:25:45.273
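As a quick sanity check of those figures (pure arithmetic, using the ~320 bytes per DDT entry from the answer above), the reported block count does indeed land near the 7.6 GiB mentioned in the comment:

```python
# Sanity check: 25.4M allocated blocks at ~320 bytes per DDT entry
# should land close to the 7.6 GiB figure reported above.

blocks = 25.4e6
ddt_gib = blocks * 320 / 2**30
print(f"~{ddt_gib:.1f} GiB of DDT")   # ~7.6 GiB
```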