38

I have written a buggy program that accidentally created about 30M files under /tmp. (The bug was introduced some weeks ago, and it was creating a couple of subdirectories per second.) I was able to rename /tmp to /tmp2, and now I need to delete the files. The system is FreeBSD 10, and the root filesystem is ZFS.

Meanwhile, one of the drives in the mirror failed, and I have replaced it. (The mirror consists of two 120 GB SSDs.)

Here is the question: replacing the hard drive and resilvering the whole array took less than an hour. Deleting the files in /tmp2 is another story. I have written another program to remove the files, and it can only delete 30-70 subdirectories per second. It will take 2-4 days to delete all the files.

How is it possible that resilvering the whole array takes an hour, but deleting from the disk takes 4 days? Why is the performance so bad? 70 deletions per second seems very, very poor.

I could delete the inode for /tmp2 manually, but that will not free up the space, right?

Could this be a problem with ZFS, with the hard drives, or something else?

ewwhite
nagylzs
  • 1
    I'm not a zfs expert, so I can't speak to your performance tuning or what you might do to improve it (that would also take a lot of information and would probably best be done directly by an expert). However, I can say that resilvering happens at the block level, while your deletions happen at the filesystem level. The filesystem will have mostly overhead when deleting a bagillion inode buffers like that. – Spooler Sep 05 '16 at 06:36
  • Please post your `df -h` and `zpool list` and `zfs list`. – ewwhite Sep 05 '16 at 11:49
  • 5
    Written another program: `rm -rf /tmp2` will not do the job? – Thorbjørn Ravn Andersen Sep 05 '16 at 14:32
  • 2
    Could you not just reboot? `/tmp` should be a `tmpfs` filesystem and is stored in memory. – Blender Sep 05 '16 at 20:44
  • Check out this related question and the interesting answers: [rm on a directory with millions of files](https://serverfault.com/questions/183821/rm-on-a-directory-with-millions-of-files/328305) Maybe some of the answers, help with ZFS as well. – nh2 Oct 30 '21 at 11:01

8 Answers

35

Deletes in ZFS are expensive. Even more so if you have deduplication enabled on the filesystem (since dereferencing deduped files is expensive). Snapshots could complicate matters too.

You may be better off deleting the /tmp directory instead of the data contained within.

If /tmp is a separate ZFS filesystem (dataset), destroy it and create it again.
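A minimal sketch of that approach, run from single-user mode so nothing is holding files open in /tmp; the pool name `zroot` and the property values are taken from the comments below and may differ on your system:

    zfs destroy -r zroot/tmp          # drops the dataset (and its snapshots) without walking the 30M files
    zfs create -o compression=on -o exec=on -o setuid=off zroot/tmp
    zfs set mountpoint=/tmp zroot/tmp
    chmod 1777 /tmp                   # restore the sticky, world-writable permissions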

Giacomo1968
ewwhite
  • Unfortunately, it is not a separate filesystem. It is part of the zfs root fs. – nagylzs Sep 05 '16 at 08:18
  • 1
    @nagylzs In that case I would suggest making it a separate ZFS file system. Then you can move the current /tmp out of the way, move a new /tmp into place, and delete the files at the system's leisure. Result: minimal downtime plus a slight performance degradation (mitigatable with `ionice`, assuming FreeBSD has it) while the delete is running. – user Sep 05 '16 at 11:28
  • 11
    I was wrong. It was a separate filesystem. Here is what worked: reboot to single user mode, then do "zfs destroy zroot/tmp ; zfs create zroot/tmp; chmod 41777 /tmp " – nagylzs Sep 05 '16 at 16:28
  • 6
    It was 5 minutes total downtime. Fantastic! :-) – nagylzs Sep 05 '16 at 16:30
  • 1
    Well, that also speaks to the concern I had, that deleting files never frees up space because of snapshots. But tmp will be set up to not make automatic periodic snapshots, *right*? – JDługosz Sep 05 '16 at 20:32
  • 1
    Actually this was: zfs create -o compression=on -o exec=on -o setuid=off zroot/tmp ; chmod 1777 /zroot/tmp ; zfs set mountpoint=/tmp zroot/tmp ; I'm not sure how to turn off auto snapshots though. There is "zfs set com.sun:auto-snapshot=false" but that works on solaris only, I think. – nagylzs Sep 06 '16 at 10:11
  • In my experience, deleting in ANY file system is expensive compared to formatting/re-creating the whole file system. Every file system has to do the bookkeeping of identifying which information it needs to keep and which it can throw away. – Earth Engine Sep 06 '16 at 11:27
  • @ewwhite What do you mean by _"You may be better off deleting the /tmp directory instead of the data contained within"_? How can you delete a directory without first deleting the files within, if it's not a standalone ZFS filesystem? `rm -r` first deletes all files within, serially. Does ZFS offer the ability to delete directories directly without doing that? Or are you referring to a snapshot trick like in [Bulk remove directory on ZFS without traversing it recursively](https://unix.stackexchange.com/questions/219786/bulk-remove-a-large-directory-on-a-zfs-without-traversing-it-recursively)? – nh2 Oct 30 '21 at 11:07
29

How is it possible that resilvering the whole array takes an hour, but deleting from the disk takes 4 days?

Consider an office building.

Removing all of the computers and furniture and fixings from all the offices on all the floors takes a long time, but leaves the offices immediately usable by another client.

Demolishing the whole building with RDX is a whole lot quicker, but the next client is quite likely to complain about how drafty the place is.

Phill W.
7

There's a number of things going on here.

First, all modern disk technologies are optimised for bulk transfers. If you need to move 100 MB of data, they will do it much faster if it is in one contiguous block instead of scattered all over the place. SSDs help a lot here, but even they prefer data in contiguous blocks.

Second, resilvering is pretty optimal as far as disk operations go. You read a massive contiguous chunk of data from one disk, do some fast CPU work on it, then rewrite it as another big contiguous chunk to another disk. If power fails partway through, it is no big deal: you just ignore any data with bad checksums and carry on as normal.

Third, deleting a file is really slow. ZFS is particularly bad, but practically all filesystems are slow to delete. They must modify a large number of different chunks of data on the disk and time it correctly (i.e. wait) so the filesystem is not damaged if power fails.

How is it possible that resilvering the whole array takes an hour, but deleting from the disk takes 4 days?

Resilvering is something that disks are really fast at, and deletion is something that disks are slow at. Per megabyte of disk, you only have to do a little bit of resilvering. You might have a thousand files in that space which need to be deleted.

70 deletions/second seems very very bad performance

It depends. I would not be surprised by this. You haven't mentioned what type of SSD you're using. Modern Intel and Samsung SSDs are pretty good at this sort of operation (read-modify-write) and will perform better. Cheaper/older SSDs (e.g. Corsair) will be slow. The number of I/O operations per second (IOPS) is the determining factor here.

ZFS is particularly slow to delete things. Normally, it will perform deletions in the background so you don't see the delay. If you're doing a huge number of them it can't hide it and must delay you.
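You can get a feel for that background work with standard pool tooling; a rough sketch, assuming the pool is called `zroot` (as in the comments above) and is new enough to support the async-destroy feature:

    zpool iostat -v zroot 5      # watch the small, scattered I/O the deletions generate
    zpool get freeing zroot      # space still being reclaimed asynchronously after a dataset destroy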


Appendix: why are deletions slow?

  • Deleting a file requires several steps. The file metadata must be marked as 'deleted', and eventually it must be reclaimed so the space can be reused. ZFS is a 'log structured filesystem' which performs best if you only ever create things, never delete them. The log structure means that if you delete something, there's a gap in the log and so other data must be rearranged (defragmented) to fill the gap. This is invisible to the user but generally slow.
  • The changes must be made in such a way that if power were to fail partway through, the filesystem remains consistent. Often, this means waiting until the disk confirms that data really is on the media; for an SSD, that can take a long time (hundreds of milliseconds). The net effect of this is that there is a lot more bookkeeping (i.e. disk I/O operations).
  • All of the changes are small. Instead of reading, writing and erasing whole flash blocks (or cylinders for a magnetic disk) you need to modify a little bit of one. To do this, the hardware must read in a whole block or cylinder, modify it in memory, then write it out to the media again. This takes a long time.
Ian Howson
  • 2
    I don't know about ZFS, but some file-systems allow you to unlink a directory with contents, but have those contents just removed later during a garbage collection/defrag/cleanup phase. Does ZFS have any utilities to do such a lazy deletion perhaps? It will not actually speed up the OP's delete but would likely make it less problematic if it happens implicitly during housekeeping. – Vality Sep 06 '16 at 19:33
7

Ian Howson gives a good answer on why it is slow.

If you delete files in parallel you may see an increase in speed, because the deletions may touch the same metadata blocks and can thus save rewriting the same block many times.

So try:

find /tmp2 -print0 | parallel -j100 -0 -n100 rm

and see if that performs better than your 70 deletes per second.
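If GNU parallel is not installed on the FreeBSD box, a similar effect can be approximated with the stock xargs; this is only a sketch and assumes the offending tree is /tmp2:

    # delete files with 8 concurrent rm processes, 100 files per invocation
    find /tmp2 -type f -print0 | xargs -0 -P 8 -n 100 rm -f
    # then remove the emptied directories, deepest first
    find /tmp2 -depth -type d -print0 | xargs -0 -n 100 rmdir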

Ole Tange
2

How is it possible that resilvering the whole array takes an hour, but deleting from the disk takes 4 days?

It is possible because the two operations work on different layers of the file system stack. Resilvering can run low-level and does not actually need to look at individual files, copying large chunks of data at a time.

Why do I have so bad performance? 70 deletions/second seems very very bad performance.

It does have to do a lot of bookkeeping...

I could delete the inode for /tmp2 manually, but that will not free up the space, right?

I don't know for ZFS, but if it could automatically recover from that, it would likely, in the end, do the same operations you are already doing, in the background.

Could this be a problem with zfs, or the hard drives or what?

Does `zpool scrub` say anything?
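That is, kick off a scrub and see whether the pool reports any device or checksum errors; a minimal sketch, assuming the pool is named `zroot`:

    zpool scrub zroot
    zpool status -v zroot    # scrub progress plus any read/write/checksum error counters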

AnoE
2

Deleting lots of files is never really a fast operation.

In order to delete a file on any filesystem, you need to read the file index, remove (or mark as deleted) the file entry in the index, remove any other metadata associated with the file, and mark the space allocated for the file as unused. This has to be done individually for each file to be deleted, which means deleting lots of files requires lots of small I/Os. To do this in a manner which ensures data integrity in the event of power failure adds even more overhead.

Even without the peculiarities ZFS introduces, deleting 30 million files typically means over a hundred million separate I/O operations. This will take a long time even with a fast SSD. As others have mentioned, the design of ZFS further compounds this issue.
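As a rough back-of-the-envelope check (the per-file figure is only an assumption): at about 4 metadata I/Os per file, 30 million files come to roughly 120 million I/Os; if the pool sustains on the order of 500 synchronous metadata updates per second, that is about 240,000 seconds, or close to three days, which lines up with the 2-4 day estimate in the question.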

bwDraco
0

Very simple if you invert your thinking.

  1. Get a second drive (you seem to have this already)

  2. Copy everything from drive A to drive B with rsync, excluding the /tmp2 directory (a sketch is given below). Rsync will be slower than a block copy.

  3. Reboot, using drive B as the new boot volume

  4. Reformat drive A.

This will also defragment your drive and give you a fresh directory (fine, defrag is not so important with an SSD but linearizing your files never hurt anything)
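A hedged sketch of step 2; the destination mountpoint and the exclude list are placeholders and would need to match the real system:

    # drive B's new filesystem assumed to be mounted at /mnt/newroot (placeholder)
    rsync -aHS --numeric-ids \
        --exclude=/tmp2 --exclude=/dev --exclude=/proc --exclude=/mnt \
        / /mnt/newroot/

As the comments point out, `zfs send | zfs receive` would preserve snapshots and dataset properties that rsync cannot.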

peter
  • 2
    First of all, copy everything except /tmp? So including /dev and /proc? Secondly, it sounds a bit kludgy to me, especially on a production server. – Hennes Sep 05 '16 at 11:48
  • I'm assuming he's smart enough to exclude non-files, mounted volumes, and the virtual-memory folder, most of which cannot be guessed here. Or do it from a maintenance boot where none of those things matter. – peter Sep 05 '16 at 13:23
  • I think you could also `zfs send/recv` (block-level copy) all other file systems except the root file system (where /tmp is located in this case) and copy the remaining data on the root file system manually (excluding /tmp of course). – user121391 Sep 05 '16 at 14:49
  • 3
    That will lose the snapshots and bypass some of the reliability features. Misses the point of using zfs. – JDługosz Sep 05 '16 at 20:34
  • 2
    @JDługosz valid points, but only relevant if the user cares. Sort of like "my backups are corrupted, how to repair?" -> "Do you need any backup files?" -> "No." -> "Reformat". – peter Sep 06 '16 at 14:48
  • 1
    The choice of ZFS implies that reliability is wanted. An operation that would allow silent data corruption is contrary to the decision to use ZFS as opposed to something faster/simpler/easier. At the very least, use the zfs replication features rather than rsync! But, tmp was a separate file system anyway so this is a pointless exercise. – JDługosz Sep 06 '16 at 16:43
-2

You have 30 million entries in an unsorted list. You scan the list for the entry you want to remove, and you remove it. Now you have only 29,999,999 entries in your unsorted list. If they are all in /tmp, why not just reboot?


Edited to reflect the information in the comments: Statement of problem: Removing most, but not all, of the 30M+ incorrectly created files in /tmp is taking a long time.
Problem 1) Best way to remove large numbers of unwanted files from /tmp.
Problem 2) Understanding why it is so slow to delete files.

Solution 1) /tmp is reset to empty at boot by most *nix distributions. FreeBSD, however, is not one of them by default.
Step 1 - copy interesting files somewhere else.
Step 2 - As root, enable the startup cleanup in /etc/rc.conf (a sysrc sketch follows after the steps):

 $ grep -i tmp /etc/rc.conf  
 clear_tmp_enable="YES" # Clear /tmp at startup.  

Step 3 - reboot.
Step 4 - change clear_tmp_enable back to "No".
The unwanted files are now gone, because ZFS on FreeBSD has the feature that "Destroying a dataset is much quicker than deleting all of the files that reside on the dataset, as it does not involve scanning all of the files and updating all of the corresponding metadata", so all the system has to do at boot time is reset the metadata for the /tmp dataset. This is very quick.
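If you would rather not edit /etc/rc.conf by hand, steps 2-4 can also be done with sysrc (available on FreeBSD 9.2 and later); only a sketch:

    sysrc clear_tmp_enable="YES"   # enable the one-time cleanup
    shutdown -r now                # reboot; /tmp is emptied during startup
    sysrc clear_tmp_enable="NO"    # turn it back off afterwards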

Solution 2) Why is it so slow? ZFS is a wonderful file system which includes features such as constant-time directory access. This works well if you know what you are doing, but the evidence suggests that the OP is not a ZFS expert. The OP has not indicated how they were attempting to remove the files, but at a guess, I would say they used a variation on "find regex -exec rm {} \;". This works well with small numbers, but it does not scale, because there are three serial operations going on: 1) get the list of available files (returns 30 million files in hash order), 2) use the regex to pick the next file to be deleted, and 3) tell the OS to find and remove that file from a list of 30 million. Even if ZFS returns the list from memory and 'find' caches it, the regex still has to identify the next file to be processed, then tell the OS to update its metadata to reflect that change, and then update the list so it isn't processed again.
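As a side note, and this is generic find behaviour rather than anything ZFS-specific, letting find issue the unlink calls itself (or at least batching many files per rm invocation) removes the per-file process overhead, although the metadata updates described above still dominate:

    find /tmp2 -delete               # find unlinks each entry itself, depth-first, with no rm processes
    # or, batching many files per rm:
    find /tmp2 -type f -exec rm -f {} +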

Paul Smith
  • 2
    I think you misunderstood the question. I needed to remove most of the files. That is, 30M+ files. – nagylzs Sep 06 '16 at 13:31
  • @nagylzs /tmp is cleared on reboot. If you want to delete *most*, then you only want to keep *some*, i.e. less than half, so copy out the ones you want to keep and then reboot to get rid of the rest. The reason your deletions are so slow is that having large numbers of files in a directory results in a large unsorted list that needs to be processed to find the file to be operated on, which takes time. The only problem here is PEBCAK. – Paul Smith Sep 06 '16 at 14:27
  • Zfs directories are *unsorted*? I thought zfs specifically handled large directories well. – JDługosz Sep 06 '16 at 16:41
  • 2
    Well, /tmp is not cleared, only X-related files are. At least on FreeBSD. It cannot be cleared on boot anyway, because it would take days for the rc script to delete the files normally. – nagylzs Sep 06 '16 at 17:26
  • @JDlugosz - ZFS is much better than most, but inode lists (which is all that directories are) are unsorted. – Paul Smith Sep 07 '16 at 14:36
  • @nagylzs You can change the FreeBSD behaviour by setting `clear_tmp_enable="YES"` in /etc/rc.conf (# Clear /tmp at startup). You don't need to worry about rc; /tmp mounts are kept separately and can be cleared in one go. – Paul Smith Sep 07 '16 at 14:38
  • @PaulSmith [ZFS uses constant time operations](http://www.solarisinternals.com/wiki/index.php/ZFS_Performance#Concurrent.2C_constant_time_directory_operations) to *create, lookup, delete* etc. in large directories. Readdir returns results in **hash order**. Directories in ZFS are stored in a hash table, not unsorted. – JDługosz Sep 07 '16 at 19:28
  • @JDługosz that is on Solaris. On [FreeBSD](https://www.freebsd.org/doc/handbook/zfs-zfs.html) "Destroying a dataset is much quicker than deleting all of the files that reside on the dataset, as it does not involve scanning all of the files and updating all of the corresponding metadata." – Paul Smith Sep 08 '16 at 11:44
  • That's true even if they are hashed/sorted. A list is obtained, in hash order (appears unsorted), then the code goes through that list and asks for each named file to be deleted, which means looking it up and removing that entry. So it's O(n) on the number of files. Just blowing away the dataset is constant time (well, returning scattered resources to the pool scales with the size of the set, not the number of files). But I think the slowness is caused by transactions and making sure blocks are flushed, and by copying the updated blocks after each small change; nothing to do with lookup speed. – JDługosz Sep 08 '16 at 14:27