Efficient way to search for a string within files with find and grep

I am searching for all files containing a specific string on a filer (an old HP-UX workstation).

I do not know where the files are located in the file system (there are many directories, with a huge number of scripts, plain-text and binary files).

Note that the grep -R option does not exist on this system, so I am using find and grep to retrieve which files contain my string:

find . -type f -exec grep -i "mystring" {} \;

I am not satisfied with this command: it is too slow, and it does not print the name and path of the files in which grep matched my string. Moreover, any error is echoed to my console output.

So I thought that I could do better:

find . -type f -exec grep -l -i "mystring" {} 2>/dev/null \;

But it is very slow.

Do you have a more efficient alternative to this command?

Thank you.

zeropouet

Posted 2013-06-19T15:45:11.963

Reputation: 53

You want the -H option to print the file name along with the match. – nik – 2013-06-19T16:00:17.983

Think of reducing the file-set; work from sub-directories under your ., one at a time; see if you can reduce to specific file extensions or name patterns. – nik – 2013-06-19T16:03:54.797
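
For instance, a sketch of that narrowing, assuming a hypothetical ./scripts sub-directory and that only shell scripts and config files need checking:

# hypothetical narrowing: one sub-directory, two name patterns
find ./scripts -type f \( -name "*.sh" -o -name "*.cfg" \) -exec grep -il "mystring" {} \; 2>/dev/null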

You should be able to make some assumptions about your files. For example, they have a minimum size of 1kb, a maximum of 1GB, they are not owned by root, they are writeable by user X, they have been created at least 3 days ago but no more than 10 years ago, they are not pdfs or .log files. All these can be encoded in a find command using ! and -or etc. – terdon – 2013-06-19T16:17:32.020
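
A sketch of those assumptions encoded as find tests (all thresholds are the hypothetical ones from the comment; -size uses the portable c suffix, i.e. bytes):

# between 1kB and 1GB, not owned by root, 3 days to ~10 years old, not .pdf or .log
find . -type f -size +1024c -size -1073741824c ! -user root \
    -mtime +3 -mtime -3650 ! -name "*.pdf" ! -name "*.log" \
    -exec grep -il "mystring" {} \; 2>/dev/null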

@nik (ignore my previous comment, wrong man page) the -l option should already do what -H does: -l prints the file name and stops at the first match. – terdon – 2013-06-19T16:31:05.767

The -H option does not exist on my workstation (HP-UX Release 11i). It might be the right option on a Linux system. – zeropouet – 2013-06-20T07:10:37.590

Answers

The fastest I can come up with is to use xargs, which batches many file names into each grep invocation:

find . -type f -print0  | xargs -0 grep -Fil "mypattern" 

Running some benchmarks on a directory containing 3631 files:

$ time find . -type f -exec grep -l -i "mystring" {} 2>/dev/null \;

real    0m15.012s
user    0m4.876s
sys     0m1.876s

$ time find . -type f -exec grep -Fli "mystring" {} 2>/dev/null \;

real    0m13.982s
user    0m4.328s
sys     0m1.592s


$ time find . -type f -print0  | xargs -0 grep -Fil "mystring" >/dev/null 

real    0m3.565s
user    0m3.508s
sys     0m0.052s

Your other option is to streamline the search, either by limiting the file list using find (see the sketch after the excerpts below):

   -executable
          Matches files which are executable and directories
          which are searchable (in a file name resolution
          sense).
   -writable
          Matches files which are writable.

   -mtime n
          File's data was last modified n*24 hours ago.  See
          the comments for -atime to understand how rounding
          affects the interpretation of file modification
          times.
   -group gname
          File belongs to group gname (numeric group ID
          allowed).
   -perm /mode
          Any of the permission bits mode are set for the
          file.  Symbolic modes are accepted in this form.
          You must specify `u', `g' or `o' if you use a
          symbolic mode.
   -size n[cwbkMG]  <-- you can set a minimum or maximum size
          File uses n units of space.
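
For example, a sketch combining a few of these tests (the size and age limits here are made up; the k/M suffixes are GNU find, older finds may only accept c for bytes):

# skip tiny files, huge files, and anything untouched for over a year
find . -type f -size +1k -size -100M -mtime -365 -exec grep -Fil "mystring" {} + 2>/dev/null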

Or by tweaking grep:

You are already using grep's -l option, which causes the file name to be printed and, more importantly, stops scanning at the first match:

   -l, --files-with-matches
       Suppress normal output; instead print the name of each input file  from
       which  output would normally have been printed.  The scanning will stop
       on the first match.  (-l is specified by POSIX.)

The only other thing I can think of to speed things up would be to make sure your pattern is not interpreted as a regex (as suggested by @suspectus) by using the -F option.
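
For reference, a combined sketch folding all of the above together (assuming GNU find and xargs for -print0/-0, which may not exist on HP-UX):

# fixed-string (-F), case-insensitive (-i), stop at first match (-l), errors discarded
find . -type f -print0 | xargs -0 grep -Fli "mystring" 2>/dev/null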

terdon

Posted 2013-06-19T15:45:11.963

Reputation: 45 216

Thanks for xargs, I didn't think about it. It's a lot faster. I think the -exec option is not really fast. I found another solution to speed up my search: I built an index of all files returned by find -type f, then used a loop to search for the string over that index. – zeropouet – 2013-06-20T07:12:03.390
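
A sketch of that index-then-loop approach, using a hypothetical index file /tmp/filelist (file names with newlines or leading blanks will misbehave):

find . -type f > /tmp/filelist              # build the index once
while read -r f; do                         # reuse it for every subsequent search
    grep -Fil "mystring" "$f" 2>/dev/null
done < /tmp/filelist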

@zeropouet the -exec option is not slow as such; it is just that xargs batches many files into each grep invocation, so far fewer processes are launched. Have a look at its -P option too for running greps in parallel; specifically, try -P 0. – terdon – 2013-06-20T23:21:51.327
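
For example, with GNU xargs (the -P flag is not part of POSIX, so check your xargs first):

# up to 4 greps running at once, 100 files handed to each invocation
find . -type f -print0 | xargs -0 -P 4 -n 100 grep -Fil "mystring" 2>/dev/null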

Use grep -F, which tells grep to interpret the pattern as a fixed string and not a regular expression (which I assume you do not require). It can be appreciably quicker than a regular-expression grep, depending on the size of the files being parsed.

On Ubuntu and RHEL Linux, the -H option will display the file path of each matched file.

find . -type f -exec grep -FHi "mystring" {} +
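
Since -H is missing on the asker's HP-UX box, a sketch of the same idea leaning on -l alone, which prints the file name anyway (the -exec ... {} + form batches files per grep much like xargs):

find . -type f -exec grep -Fil "mystring" {} + 2>/dev/null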

suspectus

Posted 2013-06-19T15:45:11.963

Reputation: 3 957