using awk with parallel


I have about 3,000 files that are each 300MB, and I'd like to search them for a series of substrings as quickly as possible with my 16 core server.

This is what I tried, but it doesn't seem to parallelize searching the files.

sudo find /mnt2/preprocessed/preprocessed/mo* | sudo xargs awk '/substring/ {c++} END {print c}' | paste -sd+ | bc

It's pasted together from different how-tos, and I don't fully understand it. Do you have any suggestions for how I can split up the file processing?

kelorek

Posted 2013-02-26T05:44:11.830

Reputation: 210

You're likely I/O-bound, not CPU-bound. – Nicole Hamilton – 2013-02-26T06:14:47.830

It's a high-I/O instance (hi1.4xlarge ec2), but you're probably right. I still want to know how to use GNU parallel in this context but haven't been able to get it to work. – kelorek – 2013-02-26T06:45:55.960

Answers


  1. See whether you have the parallel program on your system (it may come from GNU). If you do, figure out how to use it. Otherwise:
  2. Run your find with its output redirected to a file. Using a text editor, or a script built on tools like head or split, divide that file into 16 fragment files with approximately equal numbers of lines (i.e., each referencing roughly the same number of found files). Then start 16 awk … | paste … | bc pipelines, one per fragment file, and add the 16 results.
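Step 2 above can be sketched as a small script. This is a hedged illustration, not a drop-in command: the sample files, the 4-way split (instead of 16), and all names are made up so it can run anywhere; on the real data you would feed the find output from the question into flist.txt and split 16 ways.

```shell
# Sketch of the manual fan-out: build a file list, split it, and run one
# background awk counter per fragment. Sample data is fabricated here.
mkdir -p demo && cd demo

# sample data: 8 files, each containing two lines that match /substring/
for i in 1 2 3 4 5 6 7 8; do
    printf 'foo substring bar\nsubstring baz\n' > "file$i.txt"
done

find . -name 'file*.txt' > flist.txt
split -n l/4 flist.txt frag.                  # GNU split: frag.aa .. frag.ad

for frag in frag.??; do
    # one awk pipeline per fragment, in the background;
    # c+0 prints 0 instead of nothing when a fragment has no matches
    xargs -r awk '/substring/ {c++} END {print c+0}' \
        < "$frag" > "$frag.count" &
done
wait                                          # let all counters finish
cat frag.??.count | paste -sd+ - | bc         # prints 16
```

The per-fragment output files avoid interleaved writes; `xargs -r` skips fragments that happen to be empty so awk never falls back to reading stdin.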

I’m wondering why you’re using awk to count occurrences of a string when grep -c is specifically designed to do that.

Scott

Posted 2013-02-26T05:44:11.830

Reputation: 17 653


GNU parallel is fairly compatible with xargs, and in your case it can replace it directly. If you are only counting occurrences of a substring, use grep -c as Scott suggests:

sudo find /mnt2/preprocessed/preprocessed/mo* | 
  sudo parallel grep -c substring | paste -sd+ | bc

Note that some GNU/Linux distributions install GNU parallel in "Tollef's parallel" compatibility mode. You can change that by adding --gnu to parallel's command-line arguments. To make the change permanent, add --gnu to ~/.parallel/config.

Thor

Posted 2013-02-26T05:44:11.830

Reputation: 5 178

Grep turns out to be much slower than awk for some reason, which is why I went with awk. – kelorek – 2013-02-26T16:29:03.777

This didn't work for me; it doesn't process anything when I just use parallel in place of xargs. – kelorek – 2013-02-26T16:36:38.080

Post the output of parallel --version. – Ole Tange – 2013-02-27T07:18:40.790

@kelorek: awk faster than grep? what versions of grep and awk are you using? In my tests counting occurrences in a 200M file awk takes 3.7s, grep -c takes 1.2s and grep -Fc takes 0.005s. – Thor – 2013-02-27T09:05:26.060