ZFS: Memory issues with dedup even though zdb -DD looks fine

Question

I've been experimenting with ZFS on a machine with Ubuntu 12.10, 32GB RAM (non-ECC, production system will have ECC) and a 2x2TB Linux-managed RAID1 (will be moved to RAIDZ1 for production). I just created the tank on the 2TB soft-RAID1 device, enabled compression and dedup and stored a few 100GB of data.

I got a dedup ratio of about 3.5x (it really makes sense for my data, that's why I would like to use it), but there was no free memory left at all, the system became unusable. Restarting the system, everything seemed fine, then I wrote a few GB of data, same thing.

I then set zfs_arc_max to 12GB (since apparently I am not the only one who has had runaway memory consumption), which saved the system from becoming unresponsive, but writing a few GB maxed out the memory limit and writing to the tank became really slow, basically unusable.

I know that dedup takes RAM, but as far as I know, this

DDT-sha256-zap-duplicate: 615271 entries, size 463 on disk, 149 in core
DDT-sha256-zap-unique: 846070 entries, size 494 on disk, 159 in core

DDT histogram (aggregated over all DDTs):

bucket              allocated                       referenced          
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1     826K   83.5G   51.7G   52.9G     826K   83.5G   51.7G   52.9G
     2     363K   34.6G   17.8G   18.5G     869K   81.9G   41.3G   43.0G
     4     138K   14.1G   8.89G   9.11G     654K   66.4G   41.0G   42.1G
     8    49.0K   3.94G   2.25G   2.34G     580K   44.3G   25.3G   26.4G
    16    37.2K   3.96G   3.06G   3.10G     865K   90.1G   69.9G   70.8G
    32    9.81K    854M    471M    488M     464K   40.5G   21.9G   22.7G
    64    1.84K    160M   80.8M   85.1M     148K   11.8G   5.99G   6.33G
   128    1.13K   60.4M   24.7M   27.7M     218K   11.2G   4.70G   5.26G
   256      545   52.9M   30.9M   32.1M     169K   15.5G   9.00G   9.36G
   512      120   7.17M   4.19M   4.51M    84.5K   5.09G   2.96G   3.18G
    1K      368   40.0M   19.0M   19.7M     480K   52.2G   24.8G   25.7G
    2K       16    401K     23K     76K    46.4K   1.31G   73.5M    226M
    4K        8      5K      4K     32K    39.9K   24.6M   20.0M    160M
 Total    1.39M    141G   84.3G   86.6G    5.32M    504G    299G    308G

means that the table should only take about 90MB of memory, so I don't get what's happening. I have the same setup on an identical server without dedup and that one seems to work fine...

I would really appreciate any help (beyond "disable dedup", as it really, really makes sense for my data ;))! More data:

tank  type                  filesystem             -
tank  creation              Thu Jan 16 13:17 2014  -
tank  used                  342G                   -
tank  available             1.67T                  -
tank  referenced            341G                   -
tank  compressratio         1.64x                  -
tank  mounted               yes                    -
tank  quota                 none                   default
tank  reservation           none                   default
tank  recordsize            128K                   default
tank  mountpoint            /    tank                  default
tank  sharenfs              off                    default
tank  checksum              on                     default
tank  compression           lzjb                   local
tank  atime                 off                    local
tank  devices               on                     default
tank  exec                  on                     default
tank  setuid                on                     default
tank  readonly              off                    default
tank  zoned                 off                    default
tank  snapdir               hidden                 default
tank  aclinherit            restricted             default
tank  canmount              on                     default
tank  xattr                 sa                     local
tank  copies                1                      default
tank  version               5                      -
tank  utf8only              off                    -
tank  normalization         none                   -
tank  casesensitivity       sensitive              -
tank  vscan                 off                    default
tank  nbmand                off                    default
tank  sharesmb              off                    default
tank  refquota              none                   default
tank  refreservation        none                   default
tank  primarycache          all                    default
tank  secondarycache        all                    default
tank  usedbysnapshots       36.9M                  -
tank  usedbydataset         341G                   -
tank  usedbychildren        702M                   -
tank  usedbyrefreservation  0                      -
tank  logbias               latency                default
tank  dedup                 on                     local
tank  mlslabel              none                   default
tank  sync                  standard               default
tank  refcompressratio      1.64x                  -
tank  written               308K                   -
tank  snapdev               hidden                 default

UPDATED with more data.

OK, so I ran another test - rebooted the server, mounted the volume etc.:

             total       used       free     shared    buffers     cached
Mem:         32138        457      31680          0         19         66
-/+ buffers/cache:        372      31766
Swap:         7812          0       7812

and arcstats

4 1 0x01 84 4032 7898070146 560489175172
name                            type data
hits                            4    1059
misses                          4    185
demand_data_hits                4    0
demand_data_misses              4    0
demand_metadata_hits            4    971
demand_metadata_misses          4    49
prefetch_data_hits              4    0
prefetch_data_misses            4    7
prefetch_metadata_hits          4    88
prefetch_metadata_misses        4    129
mru_hits                        4    476
mru_ghost_hits                  4    0
mfu_hits                        4    495
mfu_ghost_hits                  4    0
deleted                         4    9
recycle_miss                    4    0
mutex_miss                      4    0
evict_skip                      4    0
evict_l2_cached                 4    0
evict_l2_eligible               4    0
evict_l2_ineligible             4    2048
hash_elements                   4    176
hash_elements_max               4    176
hash_collisions                 4    0
hash_chains                     4    0
hash_chain_max                  4    0
p                               4    6442450944
c                               4    12884901888
c_min                           4    1610612736
c_max                           4    12884901888
size                            4    1704536
hdr_size                        4    101424
data_size                       4    1448960
other_size                      4    154152
anon_size                       4    16384
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    1231872
mru_evict_data                  4    206336
mru_evict_metadata              4    849408
mru_ghost_size                  4    0
mru_ghost_evict_data            4    0
mru_ghost_evict_metadata        4    0
mfu_size                        4    200704
mfu_evict_data                  4    0
mfu_evict_metadata              4    4096
mfu_ghost_size                  4    16384
mfu_ghost_evict_data            4    0
mfu_ghost_evict_metadata        4    16384
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_hdr_miss              4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_compress_successes           4    0
l2_compress_zeros               4    0
l2_compress_failures            4    0
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    1498200
arc_meta_limit                  4    3221225472
arc_meta_max                    4    1449144

Played around a bit until ARC hit vfs_arc_max (12GB):

4 1 0x01 84 4032 7898070146 1406380500230
name                            type data
hits                            4    7338384
misses                          4    117090
demand_data_hits                4    4841648
demand_data_misses              4    10072
demand_metadata_hits            4    2423640
demand_metadata_misses          4    35334
prefetch_data_hits              4    37879
prefetch_data_misses            4    65420
prefetch_metadata_hits          4    35217
prefetch_metadata_misses        4    6264
mru_hits                        4    2672085
mru_ghost_hits                  4    301
mfu_hits                        4    4615778
mfu_ghost_hits                  4    1183
deleted                         4    9
recycle_miss                    4    1022
mutex_miss                      4    17
evict_skip                      4    2
evict_l2_cached                 4    0
evict_l2_eligible               4    1977338368
evict_l2_ineligible             4    751589376
hash_elements                   4    166822
hash_elements_max               4    166828
hash_collisions                 4    59458
hash_chains                     4    21504
hash_chain_max                  4    4
p                               4    55022931
c                               4    12652319216
c_min                           4    1610612736
c_max                           4    12884901888
size                            4    12327222416
hdr_size                        4    55933440
data_size                       4    12149027328
other_size                      4    122261648
anon_size                       4    1056256
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    6481734656
mru_evict_data                  4    6220393984
mru_evict_metadata              4    188646912
mru_ghost_size                  4    1902724096
mru_ghost_evict_data            4    1871710720
mru_ghost_evict_metadata        4    31013376
mfu_size                        4    5666236416
mfu_evict_data                  4    5643978240
mfu_evict_metadata              4    16081408
mfu_ghost_size                  4    708022272
mfu_ghost_evict_data            4    680676352
mfu_ghost_evict_metadata        4    27345920
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_hdr_miss              4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_compress_successes           4    0
l2_compress_zeros               4    0
l2_compress_failures            4    0
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    0
memory_indirect_count           4    1947
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    462466704
arc_meta_limit                  4    3221225472
arc_meta_max                    4    465357280

and free -m showed what was to be expected, buffers/cache and first line agreeing about used/free. But, playing around some more led to the system becoming unreasonably slow (minutes to copy 1GB) and

             total       used       free     shared    buffers     cached
Mem:         32138      31923        215          0          6      15442
-/+ buffers/cache:      16473      15665
Swap:         7812          0       7812

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  1    308 3774708  27204 9464052    0    0   386   271   72  348  1  2 83 15

Unmounting the ZFS volume and unloading the kernel module frees up all the memory...So to me, it really looks like some sort of memory leak: zfs_arc_max is set, and arcstats says this limit is observed (see below), but ZFS somehow continues to eat up memory. Phew...

    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
14:08:08     0     0      0     0    0     0    0     0    0   9.8G   10G

I'll try to answer with something better than disable dedupe... but you know it's going to come down to disabling dedupe :) — ewwhite, Jan 22 '14 at 15:12
That would be too bad ;) But at this point, I am not just looking to get deduplication at any cost, but interested/concerned as to why this is happening. I am planning a non-small deployment of ZFS, so I would like to understand what's going on here ;) — , Jan 22 '14 at 15:36
@admr, sometimes it's better go and ask the devs: https://github.com/zfsonlinux/zfs/issues — poige, Jan 22 '14 at 17:22
Considering the still-not-insignificant issues regarding 'memory' in the ZFS On Linux project, I would really not recommend you try dedupe at any scale on ZoL. Stick to an illumos-derivative preferably, or a FreeBSD box if that's not an option. — Nex7, Jan 22 '14 at 20:15
@ewwhite I am basically storing large amounts of numerical results in text form, plus "metadata" for those results. There are many users in the system who can start their work on "template results, many of them not making significant modifications to them - this is where deduplication comes in for me... — , Jan 22 '14 at 20:32
@poige I wanted to ask on the ZoL mailing list, but got an error, and did not want to open an issue before making sure I did not screw up ;) But thanks! — , Jan 22 '14 at 20:35
@Nex7, yes, this is what I was worried about in the first place, and this issue I encountered might confirm it... — , Jan 22 '14 at 20:35
@admr If it's just text, I'd really keep it to compression-only. That does not solve your dedup anomaly, but I have plenty of environments with a similar data type, and `lz4` compression alone is worth the effort. — ewwhite, Jan 22 '14 at 20:37
@ewwhite Thanks for the advice, I will definitely use lz4! Dedup really would be great in my case though, as the dataset I used for testing realistically reflects the (non-small amount of) data I will need to store, and I see a 4x dedup ratio...and of course really high performance on large copy operations, which will happen often... — , Jan 23 '14 at 14:13
I thought I posted this but I guess I didn't. Don't assume your data is actually dedupable, btw. Also, in my experience, dedupe ratios usually end up falling over time, not increasing. So if you only start at 3.5x, I wouldn't place bets on it increasing. ZFS dedupe is block level not file level, and not 'intelligent' about bit shift. If you have two otherwise identical blocks of data, but there's a single bit offset on the data, it will not dedupe at all. Most the 'dedupe is what we do' style products tend to have algorithms to look for such things, so they tend to return better ratios. — Nex7, Jan 23 '14 at 21:42
Not to mention, your "start from template results" sounds like a good case for using snapshots and clones. This will probably save you a lot more disk than dedup. — Michael Hampton, Jan 30 '14 at 13:55
Thanks for the advice! I don't think its fit my application though. What I have is an interactive system for several users. Users can import the templates, which are basically folders, into their own workspace, and either play around with them directly, or modify them. Creating a clone every time this happened would lead to thousands of file systems/volumes. — , Jan 31 '14 at 08:50

score 3 · Answer 1 · edited Apr 13 '17 at 12:14

Dedupe on ZFS isn't always worth it. Okay, it's rarely worth it... I know it's appealing, sexy-sounding and seems to be a great selling point... but at what cost?

Predictability.
Stability.
RAM usage.
Planning and design.
Performance.

Also see: ZFS - destroying deduplicated zvol or data set stalls the server. How to recover?

So let's examine your DDT table...
If you're not sure how to compute, see: How large is my ZFS dedupe table at the moment?

DDT-sha256-zap-duplicate: 615271 entries, size 463 on disk, 149 in core

615271*149=91675379 -> 91675379/1024/1024 == 87.42 Megabytes.

So hmm... not much RAM required for the dataset.

Other items to note. You should probably be using lz4 compression, but that's about all I can see from here. Can you see if this is an interaction between the Linux virtual memory subsystems and ZFS? I'd keep ARC where it is... but check the Linux VM stats at the time of the slow speeds. This may depend a bit on what type of data you're storing. What types of files are these?

I was using lz4. I am not sure how I could determine where exactly the issue arises. Any pointers? Data are mostly numerical results in text form, no file larger than a few GB. — , Jan 22 '14 at 20:37

b13n1u · Answer 2 · 2014-01-22T16:11:11.613

2

A good rule of thumb is to plan around 5 GB of RAM for every 1 TB of disk. So if you have 2TB of data this would 10GB only for deduplication + ARC + ZFS metadata. It's not the answer you want, but it's not worth the effort. You still will have some savings with compression enabled. Take a look at this article

5GB is a general rule but it does not have to be true. We assume that you will need 5GB of RAM per 1TB assuming that you use 64K blocks. But the blocksize can be different beween 512b and 128K. The solution could be L2ARC and SSD drives but it will be expensive.

edited Jan 22 '14 at 16:11

answered Jan 22 '14 at 15:23

b13n1u

980
9
14

Thanks, good article, I had already read through all of that :) 10GB sounds definitely reasonable - I set the limit to 12GB though, and it still happened. And, as far as I understand the zdb output, at least the dedup table should only have been around 90MB. Of course, there is other data that ZFS stores in memory, but for a 2TB poool with <500GB in it, maxing out 12GB RAM seems unreasonable, especially if the FS becomes unusable even then. – Jan 22 '14 at 15:28
1

Check out the [arc_summary](https://code.google.com/p/jhell/downloads/list) script to check how much is used by ARC only. You need to add to this the dedup table. The 2nd thing is that 5GB is a general rule but it does not have to be true. We assume that you will need 5GB of RAM per 1TB assuming that you use 64K blocks. But the blocksize can be different beween 512b and 128K. The solution could be L2ARC and SSD drives but it will be expensive. – b13n1u Jan 22 '14 at 16:01
Ah thanks for the script, I will play around with that tomorrow! – Jan 22 '14 at 20:40
I used the script and ran one more test - see my original post below the break. arcstats reports the ARC observing the max size I set, but memory usage grows beyond that... – Jan 23 '14 at 14:10
Ah, actually, I used /proc to get at the data since arc_summary seems to be for FreeBSD only. – Jan 23 '14 at 14:18
Yes, I when I wrote the answer I've assumed that you are using Solaris or BSD. Actually FreeBSD would be much better option, especially as there is a new FreeBSD 10 :) – b13n1u Jan 23 '14 at 20:30

score 2 · Answer 3 · 2014-01-30T11:20:53.230

2

Answering this myself for now - apparently, 0.6.2.1 still has lots of memory fragmentation overhead, the deduplication part of which will be improved in 0.6.3. I guess I'm going to try the current dev version or the patches suggested in the issue I opened: https://github.com/zfsonlinux/zfs/issues/2083. Let's see how that goes.

Update: see below - I decided to go with 0.6.2 and no deduplication for now. I will keep testing new releases until I fell "safe" with deduplication, as I believe it can make sense for my application.

Thanks everyone!

edited Jan 30 '14 at 11:20

answered Jan 27 '14 at 16:31

Again, dedupe just isn't worth it. Let us know the results of the patches. – ewwhite Jan 27 '14 at 16:42
I decided to try 0.6.2 first. Might just be my impression, but I think the situation has improved. After pushing around a lot of data (several times the amount of my previous tests), I'm still at 21GB used/10GB free (buffers/cache). Although it's still more than zfs_arc_max and throughput is not great, the system seems to remain responsive. Will try the current git code next. – Jan 28 '14 at 12:59
1

I've been experimenting with `zram` devices added to the pool as L2ARC and making ARC smaller. For your dedupe case, this may help a lot, since the Linux memory management is so off here. Look up some threads on the mailing list to try this. – ewwhite Jan 28 '14 at 13:40
Hm, interesting, I might try that. I have not tried L2ARC at all actually, I wanted to "feel safe" with ZFS at first before going there. Thanks! – Jan 28 '14 at 14:10
The current git code seems to further improve the situation, but requires a newer kernel version, which I can't currently use for the production system. I've decided to go with 0.6.2 and no deduplication for now. Still plan on deploying deduplication though at some point ;) I will simply keep testing new releases for improvements and probably start using L2ARC; my tests and knowledge of the application make me believe that it can make sense for my application both in terms of storage savings and performance (frequently "copying" large amounts of data). Again, thanks for all the advice! – Jan 30 '14 at 11:19

the-wabbit · Answer 4 · 2014-01-22T16:52:08.643

You might be running into an implementation-specific issue. For Linux, there is the ZFS on Linux project as well as the zfs-fuse implementation. The latter is considerably slower, but you should try your scenario with both of them to rule out version-specific code issues. Also, it might be worth testing with a Nexenta / OpenIndiana release or even a Solaris 11.1 ODN install.

Keep in mind that ZFS' online deduplication has some architectural issues, huge memory consumption and rather high CPU utilization when writing to the pool being the main ones. It might be worth checking if offline deduplication like the one offered by Windows Server 2012 for NTFS or BTRFS with bedup patches would be a better fit your usage pattern.

I was using ZoL, because from what I can tell zfs-fuse is much suitable for production-use. Thanks for the pointers! I heard about BTRFS with dedup, but as far as I know BTRFS itself is not really considered ready for production use. I was considering Solaris or BSD as well, but lacking experience there I was trying to avoid that option. — , Jan 22 '14 at 20:39

ZFS: Memory issues with dedup even though zdb -DD looks fine

4 Answers4