I've got a Solaris 11 ZFS-based NAS device with 12x1TB 7.2k rpm SATA drives in a mirrored configuration.
It provides two services from the same pool - an NFS server for a small VM farm and a CIFS server hosting shared files for a small team. Dedup is on for the CIFS filesystem and off for the NFS filesystem; compression is off everywhere. I snapshot each filesystem daily and keep the last 14 snapshots.
I've run into a performance issue in cases where I'm either moving, copying or deleting a large amount of data while directly SSH'd into the NAS. Basically, the process seems to block all other IO operations, even to the point of VMs stalling because they receive disk timeouts.
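In case it's useful, this is roughly how I've been watching the pool while one of these big operations runs (the pool name `tank` is a stand-in for my actual pool):

```shell
# In one SSH session, kick off the large move/copy/delete.
# In another, watch per-vdev throughput and queueing every 5 seconds:
zpool iostat -v tank 5

# Per-device service times; sustained high %b and asvc_t suggest
# the disks themselves are saturated:
iostat -xn 5

# Server-side NFS call rates, to correlate with when the VMs stall:
nfsstat -s 5
```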
I have a few theories as to why this is happening, but would appreciate some insight into what I might do next:
1) the hardware isn't good enough. I'm not so convinced of this - the system is an HP X1600 (single Xeon CPU) with 30GB RAM. Although the drives are only 7.2k SATA, each should manage around 80 IOPS, so the six mirror pairs should deliver roughly 960 read IOPS and 480 write IOPS in aggregate (writes hit both sides of each mirror), which should be more than enough. Happy to be proven wrong though.
2) I've configured it wrong - more than likely. Is it worth turning dedup off everywhere? I'm working under the assumption that RAM = good for dedup, hence giving it a reasonable splodge of RAM.
3) Solaris being stupid about scheduling IO. Is it possible that a local rm command completely blocks IO to the nfsd? If so, how do I change this?
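On theory 2, one thing I could sanity-check is whether the dedup table (DDT) actually fits in my 30GB of RAM. `zdb -DD <pool>` reports the number of DDT entries, and the usual rule of thumb is around 320 bytes of core memory per entry. A quick back-of-the-envelope (the entry size and block count below are illustrative assumptions, not measurements from my pool):

```python
# Rough estimate of RAM needed to hold the ZFS dedup table (DDT) in ARC.
# Rule of thumb: ~320 bytes of RAM per DDT entry (one entry per unique block).
# The real entry count comes from `zdb -DD <pool>`; numbers here are made up.

DDT_ENTRY_BYTES = 320  # commonly cited in-core size per DDT entry

def ddt_ram_gib(unique_blocks: int, entry_bytes: int = DDT_ENTRY_BYTES) -> float:
    """RAM (GiB) needed to keep the whole DDT in memory."""
    return unique_blocks * entry_bytes / 2**30

# e.g. 2 TiB of deduped data at the default 128 KiB recordsize:
blocks = 2 * 2**40 // (128 * 2**10)   # 16,777,216 unique blocks
print(f"{ddt_ram_gib(blocks):.1f} GiB")  # prints "5.0 GiB"
```

If the DDT doesn't fit in ARC alongside the normal working set, each block freed by a big rm can force DDT reads from the 7.2k disks, which might explain deletes starving everything else.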