Why is the grep/-r/--include combination slower than the find/-exec/grep combination?


From my understanding, the two following commands roughly accomplish the same thing:

Command 1:

find -name "filename.xml" -exec grep someString {} \;

Command 2:

grep -r --include=filename.xml someString .

Still, when timing them after warming up in the same context, the first one was about 3 times faster than the second one (something like 4 seconds vs 12 seconds).

The number of files matching the filename pattern in the folder tree that I tested was very small, and each of these files was also very small. This makes me think that most of the time was spent finding the files that match the filename pattern, not grepping through those matching files.

So why is there such a big difference in performance of those two command lines?

killy971

Posted 2013-01-18T02:49:48.963

Reputation: 169

What exactly do you mean by warming up? Also, a little more information about the files and their directories might be necessary. I tried this with ten thousand 1 MB binary files in 1000 directories, each containing 10 files, and the find command was much slower. – Dennis – 2013-01-18T03:06:41.107

By warming up, I meant that I ran the command multiple times until stabilization of performance. Regarding the files and directories I used, it's a java code base, with xx,xxx files and just 10 to 20 small (< 100kb) xml files, which are the target of my grep command. – killy971 – 2013-01-18T03:19:26.313

So caching shouldn't be an issue. As I said, more information might be needed. Intuitively the find command should be slower, since it has to create a new process for every matching file. The test I performed confirms this. – Dennis – 2013-01-18T03:21:37.020

Interesting. Try strace -ff and see which one actually does more work at system level. – ckhan – 2013-01-18T06:22:06.227

Note that if you end the find command with + instead of \;, it will spawn a minimal number of grep processes (similar to how xargs works), which should make it even faster. See the -exec command {} + section of the man page. I would guess that the faster execution is some sort of caching anomaly, since in the current state with -exec \; the find version should be considerably slower, at least with many matches (and worse the more matches there are). – Daniel Andersson – 2013-01-18T08:08:27.580

As I mentioned, the directory tree I'm running these commands on contains a lot of files, likely between 10k and 100k, and there are only 10 files which match the filename pattern, so grep is run only on those 10 files. Moreover, each of these files is very small, and the time spent grepping them must be only a few ms. This is why I feel that it's the way "grep" visits the directory tree that is inefficient compared to "find". I'm still trying to figure out the meaning of what I see in the output of strace, but I can already tell that grep/incl performs a larger # of ops on files – killy971 – 2013-01-18T11:52:23.840

Try giving both commands a specific file name, as opposed to a pattern, and compare their times again. Do you still observe the same differences? – terdon – 2013-01-18T17:47:31.657

Can you post a total time for both, and a time for the command you exec? Something like time grep ..., time find ..., and find ... -exec time ...? – Florenz Kley – 2013-02-20T21:04:26.027

Answers


It's probably better to use a Wakizashi than a Katana to peel potatoes, but neither is a good tool for the job. The same applies to digital tools: use them wisely.

This may sound like an empty suggestion, but in this case, for instance, grep is executed once for every file that find finds, which is not wise performance-wise. If you substitute find's closing argument '\;' with '+', grep will run only once for all the files found.
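
As a sketch, the batched form of the question's first command would be:

$ find -name "filename.xml" -exec grep someString {} +

With '+', find appends as many file names as fit on a single command line and invokes grep once per batch, much like xargs does, instead of forking one grep per file.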

To answer with certainty in a case like this, one would have to compare the relevant parts of the source code for grep and find to see which is faster at matching (finding) file names. Frankly, this is beyond my skills.

Intuitively, I would say that find is optimized for looking up files in directories, while grep is optimized for looking up strings in files. Further, the --include option should work with both upper case and lower case files, while the `-name` test matches case-sensitively (find offers `-iname` for case-insensitive matching).
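
For example, with GNU find:

$ find . -iname "filename.xml"

matches filename.xml as well as FileName.XML, whereas -name matches only the exact case given.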

edit: (my findings were wrong)

Some basic investigation in a doc folder with ~35 thousand files:

$ strace find . -name "moo" -exec grep a {} \+ 2>&1 |grep ^open |wc -l
4448

$ strace grep -r --include=moo  a . 2>&1 | grep ^open | wc -l
2289

The find combination opens a lot more files. This suggests the opposite of your findings. I did some basic timing (like Tom Wijsman).
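
One caveat (my note, not part of the original answer): by default strace does not follow the grep children that find forks, so their open calls go uncounted in the numbers above. Adding -f follows them, and -c prints a per-syscall tally directly:

$ strace -f -c -e trace=open,openat find . -name "moo" -exec grep a {} \+ > /dev/null
$ strace -c -e trace=open,openat grep -r --include=moo a . > /dev/null

The summary table that strace prints on exit makes the two counts directly comparable.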

DIR=imagemagick-6.7.8.7
$ findhtml $DIR |& top10    $ grephtml $DIR |& top10
  1617 mmap2                  316 read
  1176 fstat64                173 close
  1176 close                  164 fstat64
   735 open                   157 openat
   608 read                   148 ioctl
   588 mprotect                63 fcntl64
   441 brk                     25 getdents64
   294 munmap                  16 fstatat64
   294 ioctl                   11 mmap2
   147 write                    5 write
time: Real 0m2.0s           time: Real 0m0.3s
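
The findhtml, grephtml and top10 helpers used above are not defined in this answer; a plausible reconstruction (the names, search pattern and output format are my guesses) would be something like:

# Hypothetical definitions - the answer does not show the originals.
findhtml() { strace find "$1" -name '*.html' -exec grep -l html {} \+ >/dev/null; }
grephtml() { strace grep -rl --include='*.html' html "$1" >/dev/null; }
top10()    { sed 's/[({].*//' | sort | uniq -c | sort -rn | head -n 10; }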

I found that the find strace points to /usr/lib/locale/locale-archive a lot; presumably each grep process that find spawns has to map the locale data at startup, but I'm not really sure what the implications are.

Ярослав Рахматуллин

Posted 2013-01-18T02:49:48.963

Reputation: 9 076

There is a similar tool written in Perl called ack that is designed to do both: search for files and grep them. Perhaps that would be better in your case. – Ярослав Рахматуллин – 2013-03-03T08:12:00.690

Interesting, but in my case the file-handle count is almost the same, and it's actually even higher for find (~7% higher):

  • count for find: 156389
  • count for grep: 145734

– killy971 – 2013-03-04T01:25:22.817

@killy971: This depends on the data set and the search task; you can get quite different results depending on the number of files you end up including in the search. – Tamara Wijsman – 2013-03-16T08:23:14.160


It is actually the opposite way around; the grep command tends to be more efficient in general.

I'll work on a Portage tree snapshot from Gentoo; these snapshots are publicly available if you want to try this yourself.

 $ time find /usr/portage/sys-apps/ -name '*.ebuild' -exec grep DEPEND {} \; > /dev/null

real    0m1.184s
user    0m0.033s
sys     0m0.130s

 $ time grep -r --include '*.ebuild' DEPEND /usr/portage/sys-apps/ > /dev/null

real    0m0.017s
user    0m0.007s
sys     0m0.010s

Let's look at which system calls are made most often by each:

 $ (strace find /usr/portage/sys-apps/ -name '*.ebuild' -exec grep DEPEND {} \; > /dev/null) |& sed 's/[({].*//g' | sort | uniq -c | sort -r | head -n 10
   3574 fcntl
   1597 close
    794 newfstatat
    794 getdents
    689 wait4
    689 clone
    689 --- SIGCHLD 
    404 fstat
    397 openat
     20 mmap

 $ (strace grep -r --include '*.ebuild' DEPEND /usr/portage/sys-apps/ > /dev/null) |& sed 's/[({].*//g' | sort | uniq -c | sort -r | head -n 10
   2779 fcntl
   1493 close
   1382 read
   1096 fstat
   1087 openat
    794 getdents
    792 newfstatat
    691 ioctl
    689 lseek
     25 write

And let's also look at the calls that took the longest:

 $ (strace -T find /usr/portage/sys-apps/ -name '*.ebuild' -exec grep DEPEND {} \; > /dev/null) |& sed 's/\(.*\)<\(.*\)>/\2 \1/g' | sort -nk1r | head -n10
exit_group(0)                           = ?
0.001884 wait4(29725, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 29725 
0.001879 wait4(29475, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 29475 
0.001813 wait4(29430, [{WIFEXITED(s) && WEXITSTATUS(s) == 1}], 0, NULL) = 29430 
0.001812 wait4(30089, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 30089 
0.001807 wait4(29722, [{WIFEXITED(s) && WEXITSTATUS(s) == 1}], 0, NULL) = 29722 
0.001795 wait4(29645, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 29645 
0.001794 wait4(29848, [{WIFEXITED(s) && WEXITSTATUS(s) == 1}], 0, NULL) = 29848 
0.001759 wait4(30032, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 30032 
0.001754 wait4(30093, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 30093

 $ (strace -T grep -r --include '*.ebuild' DEPEND /usr/portage/sys-apps/ > /dev/null) |& sed 's/\(.*\)<\(.*\)>/\2 \1/g' | sort -nk1r | head -n10
exit_group(0)                           = ?
0.002336 fcntl(3, F_SETFD, FD_CLOEXEC)           = 0 
0.000460 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\200\30`C6\0\0\0"..., 832) = 832 
0.000313 close(3)                                = 0 
0.000295 execve("/bin/grep", ["grep", "-r", "--include", "*.ebuild", "DEPEND", "/usr/portage/sys-apps/"], [/* 75 vars */]) = 0 
0.000276 fcntl(3, F_SETFD, FD_CLOEXEC)           = 0 
0.000265 getdents(3, /* 244 entries */, 32768)   = 7856 
0.000233 fstat(3, {st_mode=S_IFREG|0644, st_size=826, ...}) = 0 
0.000162 open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3 
0.000137 lseek(3, 1402, 0x4 /* SEEK_??? */)      = -1 ENXIO (No such device or address) 

Quite interesting: you can see in this duration output that find spends its time waiting, whereas grep's longest calls are merely the work required to start and stop the process. The wait calls take more than 0.001s each, whereas find's other calls settle down to a steady ~0.0002s.

If you look at the wait4 calls in the count output, you will notice that there is an equal number of clone calls and SIGCHLD signals occurring; this is because find spawns a grep process for each file it comes across, and this is where its efficiency suffers, as cloning a process and waiting for it are costly.
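
To get a feel for that per-process cost on its own, one can time just the spawning (a rough sketch; 689 mirrors the clone count above, and the numbers will vary by machine):

 $ time for i in $(seq 689); do grep DEPEND /dev/null; done

Each iteration forks and execs one grep without doing any real searching, much like find -exec ... \; does for every file.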

There are occasions where it doesn't suffer: the set of files could be small enough that there isn't much overhead in starting multiple grep processes, or a very slow disk could mask the overhead of starting a new process, and there are probably other reasons as well. But when comparing speed, we usually look at how well one approach or the other scales, rather than at special corner cases.

In your case, you mentioned that "it's the way grep visits the directory tree that is inefficient compared to find"; this may indeed be the case. As you can see, 1382 read calls have been made that find does not make, which makes grep more I/O intensive for you.

TL;DR: To see why your timings came out the way they did, try to redo this analysis and pinpoint the issue in your own case, so that you know why your specific data and task are slow with grep; you'll discover how differently grep can behave in your corner case...

So, as others have suggested, you will want to make sure that grep isn't called for each file individually, which can be done by replacing \; with + near the end:

 $ time find /usr/portage/sys-apps/ -name '*.ebuild' -exec grep DEPEND {} + > /dev/null

real    0m0.027s
user    0m0.010s
sys     0m0.013s

As you can see, 0.027s comes quite close to 0.017s; the difference is mostly attributable to the fact that it still has to run both find and grep rather than grep alone. Or, as shown in the comments, on some systems the + form even allows find to improve over grep.
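
For completeness, the same batching is available through xargs; assuming GNU find and xargs, the null-delimited form is safe for arbitrary file names:

 $ time find /usr/portage/sys-apps/ -name '*.ebuild' -print0 | xargs -0 grep DEPEND > /dev/null

This should perform in the same ballpark as the -exec ... + variant, since both hand grep large batches of files at once.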

Tamara Wijsman

Posted 2013-01-18T02:49:48.963

Reputation: 54 163

If you want to make this even close to fair, call grep once on all files as opposed to calling grep once for every file. I should have mentioned this in my answer too. Here is your test with find-first doing less: http://pastebin.mozilla.org/2220884

– Ярослав Рахматуллин – 2013-03-16T08:18:02.803

I did so at the end, but that's not what the question stated. If you want to be really fair, you would need to get all kinds of different data sets and different search tasks within those data sets; then draw graphs and be able to determine the point at which one approach becomes better than the other, but then again, that would require a huge amount of time... – Tamara Wijsman – 2013-03-16T08:19:32.750

You're right. I forgot to flush cache in the previous pastebin. The results still look better for find with that in mind: http://pastebin.mozilla.org/2220941

– Ярослав Рахматуллин – 2013-03-16T08:30:56.490