Deleting millions of files

38

9

I had a dir fill up with millions of GIF images. Too many for the rm command.

I have been trying the find command like this:

find . -name "*.gif" -print0 | xargs -0 rm

Problem is, it bogs down my machine really badly and causes timeouts for customers, since it's a server.

Is there any quicker way to delete all these files... without locking up the machine?

Corepuncher

Posted 2013-11-23T16:28:51.810

Reputation: 481

I"m at about 6 gb/hr deletion rate using the "nice find" command below. Probably will take 48 hrs straight to get rid of all the files.

The reason this happened was b/c a scour script failed. I had surpassed the "event horizon" with the rm command, and then it ran away. – None – 2013-11-23T18:12:27.160

Would removing the whole dir not be substantially quicker? Just take out the "good" files before nuking the remaining ones... – tucuxi – 2013-11-23T18:42:41.683

Well, every file is bad right now, because it was moved to /dir_old, and I remade the /dir. But won't rmdir run into the same limitation as rm *? – None – 2013-11-23T19:43:08.847

@Corepuncher: I would expect that removing the entire directory (as with rm -rf) would be faster. It's worth a try. – Jason R – 2013-11-23T19:44:56.710

I'm currently running "rm -rf" on the dir. It's been running for over 20 min now... no change in disk size yet. But it also hasn't automatically returned "argument list too long" yet, either. Only problem is, it's really hammering my machine and making other things slow/fail. Not sure how long to let it go. – None – 2013-11-23T20:01:15.577

@JasonR: “worth a try” – I suppose, but I wouldn’t expect rm -rf to be any better than find … -delete, and they would be only marginally better than find … -exec rm {} + or find … | xargs rm. The difference is that the first two commands do everything in one process, while the latter two fork and exec rm tens of thousands of times. But fork/exec’ing rm isn’t the bottleneck; the resource hog is the removal of the files, and even rm -rf has to remove each file individually. – Scott – 2013-11-26T23:08:34.503

Answers

44

Quicker is not necessarily what you want. You may want to actually run slower, so the deletion chews up fewer resources while it's running.

Use nice(1) to lower the priority of a command.

nice find . -name "*.gif" -delete

For I/O-bound processes nice(1) might not be sufficient. The Linux scheduler does take I/O into account, not just CPU, but you may want finer control over I/O priority.

ionice -c 2 -n 7 find . -name "*.gif" -delete

If that doesn't do it, you could also add a sleep to really slow it down.

find . -name "*.gif" -exec sleep 0.01 \; -delete

John Kugelman

Posted 2013-11-23T16:28:51.810

Reputation: 1 620

@Ola I completely agree. It should be ionice -c 3. Everything else chokes up a production server. Anyone: see the answer by @user2719058 and the comments for the optimal solution in a high-load environment. – Christopher Lörken – 2015-06-26T07:41:26.363

22

Since you're running Linux and this task is probably I/O-bound, I advise giving your command idle I/O scheduler priority using ionice(1):

ionice -c3 find . -name '*.gif' -delete

Compared to your original command, I guess this may even spare some CPU cycles by avoiding the pipe to xargs.

user2719058

Posted 2013-11-23T16:28:51.810

Reputation:

14

No.

There is no quicker way, apart from a soft format of the disk. The files are given to rm all at once (up to the limit of the command line; the same limit can also be set for xargs), which is much better than calling rm on each file. So no, there is definitely no faster way.
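
As an illustration of the limits mentioned above (the batch size of 500 is an arbitrary example value):

getconf ARG_MAX                                    # kernel limit on the combined size of argv + environment
find . -name "*.gif" -print0 | xargs -0 -n 500 rm  # cap each rm invocation at 500 file names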

Using nice (or renice on a running process) helps only partially, because it schedules the CPU resource, not the disk! And the CPU usage will be very low. This is a Linux weakness: if one process "eats up" the disk (i.e. works a lot with it), the whole machine gets stuck. A kernel modified for real-time usage could be a solution.
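
For reference, adjusting an already-running deletion looks roughly like this (the PID is hypothetical); as noted above, renice only affects CPU scheduling, while ionice targets the disk side:

renice -n 19 -p 12345    # drop the process to the lowest CPU priority
ionice -c 3 -p 12345     # move its disk I/O into the idle class (Linux)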

What I would do on the server is manually let other processes do their job, with pauses so the server can "breathe":

find . -name "*.gif" > files
split -l 100 files files.      # break the file list into chunks of 100 names each
for F in files.*; do
    cat "$F" | xargs rm
    sleep 5                    # pause so other processes get some disk time
done

This will wait 5 seconds after every 100 files. It will take much longer but your customers shouldn't notice any delays.

Tomas

Posted 2013-11-23T16:28:51.810

Reputation: 5 107

"The files are given to rm at once (up to the limit of the command line"—so when the shell is ordered to rm *, it expands * into the line with all of the filenames and pass it to rm? That's incredibly stupid. Why would shell expand wildcards? – None – 2013-11-23T23:15:18.113

:-D @Joker_vD, are you joking, as your name suggests? :-) – Tomas – 2013-11-23T23:28:33.993

@Joker_vD: Compatibility with a Unix decision from 1970 or so. Windows doesn't do it. There, programs can pass wildcards to FindFirstFile/FindNextFile, so they get the results one at a time. – MSalters – 2013-11-23T23:30:42.950

@Tomas Not in this case. Honestly, I can see 2 problems with such a design immediately: first, the command line isn't rubber; second, the program can't tell whether it was called with * or /*, so it can't second-guess that decision by the user. – None – 2013-11-24T15:20:35.470

@MSalters I personally took advantage of this on Windows, to filter out "special" files/folders from processing unless specifically asked for. – None – 2013-11-24T15:23:26.360

@Joker_vD There are a lot of good things about the shell doing wildcard expansion. It's different from Windows, but don't jump to the conclusion that it's incredibly stupid merely because it's different from what you're used to. If you want to know more, I encourage you to Google it or post a question on the relevant Stack Exchange site. It's a huge derail for this comment area. – John Kugelman – 2013-11-24T15:27:07.083

5

If the number of files to be deleted vastly outnumbers the files left behind, it may not be the most efficient approach to walk the tree of files to be deleted and do all those filesystem updates. (It's analogous to doing clumsy reference-counted memory management, visiting every object in a large tree to drop its reference, instead of making everything unwanted into garbage in one step and then sweeping through what is reachable to clean up.)

That is to say, clone the parts of the tree that are to be kept to another volume. Re-create a fresh, blank filesystem on the original volume. Copy the retained files back to their original paths. This is vaguely similar to copying garbage collection.
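
A rough sketch of that procedure, where the device, mount point, and filesystem type are all illustrative assumptions rather than details from the question:

# copy only the files worth keeping to another volume
rsync -a /data/keep/ /mnt/scratch/keep/
umount /data
# recreate a blank filesystem on the original partition (destroys all remaining files)
mkfs.ext4 /dev/sdXN
mount /dev/sdXN /data
# restore the retained files to their original paths
rsync -a /mnt/scratch/keep/ /data/keep/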

There will be some downtime, but it could be better than continuous bad performance and service disruption.

It may be impractical in your system and situation, but it's easy to imagine obvious cases where this is the way to go.

For instance, suppose you wanted to delete all files in a filesystem. What would be the point of recursing and deleting one by one? Just unmount it and do a "mkfs" over top of the partition to make a blank filesystem.

Or suppose you wanted to delete all files except for half a dozen important ones? Get the half a dozen out of there and ... "mkfs" over top.

Eventually there is a break-even point: when enough files have to stay, it becomes cheaper to do the recursive deletion, taking into account other costs like downtime.

Kaz

Posted 2013-11-23T16:28:51.810

Reputation: 2 277

4

Have you tried:

find . -name "*.gif" -exec rm {} +

The + sign at the end causes find to pass many files to a single rm invocation rather than running rm once per file. Check this question for more details.
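
For comparison, the older form below runs one rm per file, which is exactly the overhead the + batching avoids:

find . -name "*.gif" -exec rm {} \;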

Bartosz Firyn

Posted 2013-11-23T16:28:51.810

Reputation: 148

It executes much faster than the -print0 | xargs solution because the rm process is not invoked for every file but for a large set of them, and therefore it causes a lower load. – None – 2013-11-23T16:42:40.513

@JohnKugelman You are correct, but it's a GNU extension that isn't always available with the native find command. – CodeGnome – 2013-11-23T17:04:32.063

OK, interesting, but this is quite a new thing (as is -delete), which isn't always available... – Tomas – 2013-11-23T17:09:24.760

However, this certainly brings nothing better than the OP's solution. – Tomas – 2013-11-23T17:11:00.597