
I'm using NexentaStor on a secondary storage server running on an HP ProLiant DL180 G6 with 12 Midline (7,200 RPM) SAS drives. The system has an E5620 CPU and 8GB of RAM. There is no ZIL or L2ARC device.

Last week, I created a 750GB sparse zvol with dedup and compression enabled to share via iSCSI to a VMware ESX host. I then created a Windows 2008 file server image and copied ~300GB of user data to the VM. Once happy with the system, I moved the virtual machine to an NFS datastore on the same pool.
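For reference, a minimal sketch of how a zvol like this can be created and exported, reusing the pool/zvol names from the destroy command below; the COMSTAR sbdadm step is just one way the LUN might be exposed, not necessarily what NexentaStor's wizard does under the hood:

    # -s makes the zvol sparse; -V sets its virtual size.
    zfs create -s -V 750G -o dedup=on -o compression=on vol1/filesystem

    # Expose it as an iSCSI LUN via COMSTAR (one possible path;
    # the NexentaStor GUI wraps something like this):
    sbdadm create-lu /dev/zvol/rdsk/vol1/filesystem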

Once up and running with my VMs on the NFS datastore, I decided to remove the original 750GB zvol. Doing so stalled the system. Access to the Nexenta web interface and NMC halted. I was eventually able to get to a raw shell. Most OS operations were fine, but the system was hanging on the zfs destroy -r vol1/filesystem command. Ugly. I found the following two OpenSolaris bug database entries and now understand that the machine will be bricked for an unknown period of time. It's been 14 hours, so I need a plan to regain access to the server.

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6924390

and

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6924824
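If you land in this state, a couple of checks from the raw shell can at least confirm what the box is grinding on (these are hedged suggestions; zpool status -D only returns once the pool itself responds):

    # Confirm the destroy is the wedged process:
    ps -ef | grep 'zfs destroy'

    # Dedup table (DDT) statistics for the pool; if this hangs too,
    # the pool is still busy walking the DDT:
    zpool status -D vol1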

In the future, I'll probably take the advice given in one of the bug reports' workarounds:

Workaround
    Do not use dedupe, and do not attempt to destroy zvols that had dedupe enabled.

Update: I had to force the system to power off. Upon reboot, the system stalls at "Importing zfs filesystems". It's been that way for 2 hours now.
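If a stalled boot-time import is where you're stuck, one hedged option, assuming you can reach a shell before the import service runs (e.g. single-user mode), is to import by hand so the wait is at least observable:

    # This blocks just like the boot-time import, but in a shell
    # you control rather than behind the boot splash:
    zpool import -f vol1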

ewwhite
2 Answers


This has been solved. The key is that deduplicated volumes need to have the dedup flag turned off before deletion. This should be done at the pool level as well as at the zvol or filesystem level. Otherwise, the deletion is essentially being deduplicated itself: every block freed requires a lookup in the ZFS deduplication table, which is why the process takes so long. In this case, RAM helps. I temporarily added 16GB of additional RAM to the system and brought the server back online. The zpool imported completely within 4 hours.
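A sketch of that order of operations, reusing the vol1/filesystem names from the question:

    # Turn the dedup flag off at both levels first:
    zfs set dedup=off vol1/filesystem
    zfs set dedup=off vol1

    # Then destroy. Existing blocks still require DDT updates as
    # they're freed, but new frees are no longer deduplicated:
    zfs destroy -r vol1/filesystem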

The moral is probably that dedupe isn't super polished and that RAM is essential to its performance. I'm suggesting 24GB or more, depending on the environment. Otherwise, leave ZFS dedupe off. It's definitely not reasonable for home users or smaller systems.
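If you'd rather size that RAM figure against your own data than guess, zdb can report or simulate the dedup table; the ~320 bytes per entry below is a commonly cited rule of thumb rather than an exact number, and vol1 is the pool name from the question:

    # DDT histogram for a pool already using dedup:
    zdb -DD vol1

    # Simulate what dedup would cost on a pool that isn't using it:
    zdb -S vol1

    # Rough RAM estimate: total DDT entries x ~320 bytes each.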

ewwhite

As a long-time user of Sun/Oracle ZFS 7000-series appliances, I can tell you without question that dedupe isn't polished. Never confuse sales with delivery! The salespeople will tell you "Oh, it's been fixed." In real life - my real life - I can tell you 24GB isn't enough to handle the DDT, the back-end index that stores the dedupe table. That table has to reside in system memory so that every I/O can be intercepted in-flight to figure out whether it needs to be written to disk or not. The larger your storage pool and the more your data changes, the larger this table grows - and the greater the demand on system memory. That memory comes at the expense of the ARC (cache) and, at times, the OS itself - which is why you experience the hangs: certain commands happen in the foreground and some in the background, and it seems the pool delete happens in the foreground unless you tell it otherwise at the CLI. GUI wizards won't do this.
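One way to keep headroom for the OS on a Solaris-derived system like NexentaStor is to cap the ARC. A sketch of the standard /etc/system tunable, with the 4GB value purely illustrative:

    * /etc/system entry; takes effect after a reboot.
    * 0x100000000 bytes = 4GB. Pick a ceiling that leaves room
    * for the DDT and the OS on your hardware.
    set zfs:zfs_arc_max = 0x100000000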

Even a mass-delete of NFS data within a share defined on a deduped volume will bring your system to a halt if you don't have enough memory to process the "writes" to ZFS telling it to delete the data.

In all, unless you max out your memory - and even then find a way to reserve memory for the OS by restricting the ARC and the DDT (and I don't think you can restrict the DDT, by the nature of what it is; it's just an index tied exactly to your I/O) - you're hosed during large deletes or destroys of zvols and pools.