59

The Unix find(1) utility is very useful: it lets me perform an action on many files that match certain specifications, e.g.

find /dump -type f -name '*.xml' -exec java -jar ProcessFile.jar {} \;

The above might run a script or tool over every XML file in a particular directory.

Let's say my script/program takes a lot of CPU time and I have 8 processors. It would be nice to process up to 8 files at a time.

GNU make allows for parallel job processing with the -j flag but find does not appear to have such functionality. Is there an alternative generic job-scheduling method of approaching this?

PP.

5 Answers

78

Use xargs with the -P option (the maximum number of processes to run at once). Say I wanted to compress all the logfiles in a directory on a 4-CPU machine:

find . -name '*.log' -mtime +3 -print0 | xargs -0 -P 4 bzip2

You can also pass -n <number> to set the maximum number of arguments (work units) handed to each process. So say I had 2500 files and I said:

find . -name '*.log' -mtime +3 -print0 | xargs -0 -n 500 -P 4 bzip2

This would start 4 bzip2 processes, each with 500 files; when the first one finished, a fifth would be started for the last 500 files.
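The batching described above can be checked without touching any real files. This is only a sketch: the numbers 1 to 2500 stand in for the 2500 log file names, and a small inline `sh` script (an illustration, not part of the original answer) stands in for bzip2 and reports how many arguments it received.

```shell
# Hypothetical stand-in for 2500 log files: the numbers 1..2500.
# Each inner sh invocation prints the number of arguments it was handed.
batches=$(seq 1 2500 | xargs -n 500 -P 4 sh -c 'echo "$#"' sh)

# One line per invocation, so the line count is the number of processes started.
echo "$batches"
```

Each line of output corresponds to one process that xargs started, which makes it easy to experiment with different -n and -P values before running the real command.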

I'm not sure why the previous answer uses both xargs and make; you have two parallel engines there!

Matt Kline
Gaius
  • 11
    With find/xargs, be careful: find defaults to newlines as output delimiters, but xargs defaults to any whitespace as input delimiters. Use -print0 on find and -0 on xargs to be safe, or switch to GNU parallel, which defaults to newlines as input delimiters (matching find's output). – ephemient Oct 23 '10 at 22:08
  • 1
    Wow, amazing! I just checked, and it's true, xargs has a `-P` option! – PP. Oct 25 '10 at 09:45
  • 2
    Beware of using `xargs -P` - it has a never-fixed bug of garbling the output (unlike `parallel`) whenever two processes happen to produce output at the exact same moment... – Vlad Jun 10 '19 at 19:19
  • @Vlad is this what you are referring to? https://unix.stackexchange.com/q/17673/77273 – Seanny123 Jun 23 '21 at 19:23
  • @Seanny123 - yes, kind of. Your example has a portion of out2 injected into middle of out1 line. In reality - it sometimes overwrites a portion of out1 with a portion of out2 (so you may lose some pieces of both, or so it appears on my screen). `parallel` is really superior in this matter and probably exists/popular thanks to that bug, plus the parallel got upgraded to support `--line-buffer` some years ago, which made it a total blast ;) – Vlad Jun 24 '21 at 04:00
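The delimiter caveat raised in the comments can be reproduced with a single file whose name contains a space. This is a minimal sketch under assumptions: the directory is a throwaway one from `mktemp -d`, the file name is made up, and `echo` stands in for the real command.

```shell
# Work in a throwaway directory; the file name is a made-up example.
dir=$(mktemp -d)
touch "$dir/my report.log"

# Newline-separated names split on whitespace: xargs passes two bogus arguments.
unsafe=$(find "$dir" -name '*.log' | xargs -n 1 echo | wc -l | tr -d ' ')

# NUL-separated names: the file name arrives as one argument.
safe=$(find "$dir" -name '*.log' -print0 | xargs -0 -n 1 echo | wc -l | tr -d ' ')

echo "arguments without -print0/-0: $unsafe; with them: $safe"
rm -rf "$dir"
```

The two counts differ because, without the NUL delimiters, xargs splits the single path at the space.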
48

GNU parallel can help too.

find /dump -type f -name '*.xml' | parallel -j8 java -jar ProcessFile.jar {}

Note that without the -j8 argument, parallel defaults to running one job per CPU core on your machine :-)

ephemient
7

No need to "fix" find - make use of make itself to handle the parallelism.

Have your process create a log file or some other output file, and then use a Makefile like this:

.SUFFIXES:  .xml .out

# The recipe line below must begin with a tab character, not spaces.
.xml.out:
	java -jar ProcessFile.jar $< > $@

and invoked thus:

find /dump -type f -name '*.xml' | sed -e 's/\.xml$/.out/' | xargs make -j8

Better yet, if you ensure that the output file only gets created on successful completion of the Java process you can take advantage of make's dependency handling to ensure that next time around only unprocessed files get done.
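One way to make the output file appear only on success is to write to a temporary name and rename it afterwards. The sketch below is an assumption about how that could be done, not the author's method: `write_on_success` is a hypothetical helper, `cat` stands in for `java -jar ProcessFile.jar`, and `false` simulates a failed run.

```shell
# Hypothetical helper: run a command and publish its stdout to $1 only on success.
write_on_success() {
    out=$1
    shift
    if "$@" > "$out.tmp"; then
        mv "$out.tmp" "$out"      # success: the rename makes the output visible
    else
        rm -f "$out.tmp"          # failure: leave no output file behind
        return 1
    fi
}

dir=$(mktemp -d)
printf '<a/>\n' > "$dir/good.xml"

# `cat` stands in for the real processing command; `false` simulates a failure.
write_on_success "$dir/good.out" cat "$dir/good.xml"
write_on_success "$dir/bad.out" false || echo "processing failed, no bad.out created"
```

With a wrapper like this as the recipe command, a failed run leaves no .out file, so a later make invocation would attempt that input again.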

Alnitak
4

Find has a parallel option you can use directly using the "+" symbol; no xargs required. Combined with grep, it can rip through your tree quickly looking for matches. For example, if I'm looking for all files in my sources directory containing the string 'foo', I can invoke:
find sources -type f -exec grep -H foo {} +

Mark Evans
    Reading the find manual, you can see that the `-exec command +` syntax doesn't run the command in parallel; it "groups" many files together and runs the command once with multiple files as arguments. It happens that grep can look through its targets in parallel. – Gyscos Mar 21 '16 at 23:27
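The grouping behaviour described in the comment is easy to observe. In this sketch (the directory and file names are made up for illustration), a small inline `sh` script prints one line per invocation, so counting the lines shows how many times find ran the command.

```shell
# Throwaway directory with three made-up files.
dir=$(mktemp -d)
touch "$dir/a.txt" "$dir/b.txt" "$dir/c.txt"

# With \; the command runs once per file.
per_file=$(find "$dir" -type f -exec sh -c 'echo "$#"' sh {} \; | wc -l | tr -d ' ')

# With + the file names are batched into a single invocation.
batched=$(find "$dir" -type f -exec sh -c 'echo "$#"' sh {} + | wc -l | tr -d ' ')

echo "one-per-file invocations: $per_file, batched invocations: $batched"
```

Fewer invocations means less process start-up overhead, but the batches still run one after another rather than concurrently.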
0

All the suggestions make the execution run in parallel, but if your file tree is large enough, the bottleneck may be find itself. A colleague of mine wrote locar, a parallel search tool that is very useful when your filesystem can do scans in parallel. It might not help if your filesystem is on a single HDD, but on a RAID device, an SSD or, better yet, a distributed filesystem it will help tremendously.

locar scans multiple directories in parallel, so you get the list of files faster, and you can then combine it with xargs or parallel to run the processing in parallel as well.

Baruch Even