Strange execution time for recursive grep

I'm on a Linux Ubuntu machine, grepping a large number of files. Can anyone explain these timings?

>time grep -nIr acct_margin *
[... 3 results ...]
real    0m0.555s
user    0m0.400s
sys     0m0.150s

Case insensitive:

>time grep -niIr acct_margin *
[... 3 results ...]

real    0m45.739s
user    0m45.300s
sys     0m0.240s

With an "or" operator in the regex:

>time grep -nIr "acct_margin\|MARGIN_TABLE" *
[... 5 results ...]

real    0m15.892s
user    0m15.540s
sys     0m0.260s

Case insensitive:

>time grep -niIr "acct_margin\|MARGIN_TABLE" *
[... 5 results ...]

real    0m52.433s
user    0m52.050s
sys     0m0.200s

I've run the commands enough times that the files should be cached (htop confirms there's plenty of free RAM).

Case insensitivity incurs a slowdown of nearly 100x? That should be a constant-factor increase in the amount of processing. The alternation ("pipe") operator a slowdown of 30x? I guess I can't argue with the results, but I'm just curious how grep is written such that it performs this way, and whether there's any way around it. High-performance grep, anyone?

Scott

Posted 2013-03-05T21:04:02.127

Reputation: 135

Do the files contain Unicode or plain ASCII characters? – Dennis – 2013-03-05T21:07:03.947

Plain ASCII. Mostly C++ files. – Scott – 2013-03-05T21:08:46.660

Answers

Case insensitivity incurs a slowdown of nearly 100x? That should be a constant-factor increase in the amount of processing.

Well, no. For starters, there are 1024 different strings that match acct_margin with the -i switch.
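
The underscore has no case, but each of the other ten characters can independently be upper- or lowercase, so the count is 2^10:

$ # 10 case-varying characters, 2 choices each:
$ echo $((2**10))
1024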

Matching acct_margin case-sensitively is easy: if the current character isn't an a, skip ahead. With case-folding, you have to check whether the current character is an a or an A. That's not only twice as many tests; there are also far more positions you can't skip directly.1 It gets even more difficult for non-ASCII characters.

Nevertheless, grep -i shouldn't be this slow. This seems to be a bad implementation of case-folding when multi-byte characters are anticipated (even though none are present in the files).
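
One quick way to test that hypothesis on your own tree (a sketch, assuming bash and the same working directory as in the question) is to force the single-byte C locale for a single run; if the timing drops back near the case-sensitive figure, the multi-byte case-folding path is to blame:

$ # Force the single-byte C locale for this run only (LC_ALL overrides LANG):
$ time LC_ALL=C grep -niIr acct_margin *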

Possible workarounds

  1. Do not use the -i switch and construct the case-insensitive regular expression by hand.

  2. If the files contain only ASCII characters, temporarily set the encoding identifier of the environment variable LANG to ISO-8859-1 (a.k.a. Western European, Latin-1 or Code Page 819).

  3. Select one of the non-default regular expression types (--fixed-strings, --extended-regexp or --perl-regexp).

  4. If the files contain only ASCII characters and case-folding the output is admissible, do not use the -i switch and pipe case-folded characters to grep.
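
Applied to the recursive search from the question, the first three workarounds might look like this (a sketch, not benchmarked; assuming bash, that LC_ALL isn't set, and that a single-byte locale such as en_US.ISO-8859-1 is available on the system):

$ # Workaround 1: spell out the case variants by hand:
$ grep -nIr '[Aa][Cc][Cc][Tt]_[Mm][Aa][Rr][Gg][Ii][Nn]' *
$ # Workaround 2: single-byte locale for this run only:
$ LANG=en_US.ISO-8859-1 grep -niIr acct_margin *
$ # Workaround 3: Perl-compatible regular expressions:
$ grep -niIrP acct_margin *

Workaround 4 doesn't map cleanly onto a recursive, multi-file search, since case-folding the concatenated input with tr discards the file names and per-file line numbers before grep sees them.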

Test setup

I've created a 500 MB file of Base64-encoded pseudo-random bytes on my computer2, killed all CPU- or memory-intensive processes and compared the execution times of the above workarounds to a plain grep or grep -i.

I've fed the input file to grep and tr via input redirection and discarded the output to eliminate potential sources of bias.

I've run every command several times; the execution times were stable (maximum variation of ±0.01s).

Benchmarks

$ # Make time output only elapsed time:
$ TIME="  %es"
$ # Create a 500 MB file with 76 printable ASCII characters per line:
$ < /dev/urandom base64 | head -c 500MB > file
$ # Verify that the current character encoding is UTF-8:
$ echo "  $LANG"
  en_US.UTF-8
$ # Benchmark of case-sensitive `grep' for comparison:
$ < file time grep test > /dev/null
  0.45s
$ # Benchmark of case-sensitive `grep --perl-regexp' for comparison:
$ < file time grep -P test > /dev/null
  0.82s
$ # Benchmark of case-insensitive `grep' for comparison:
$ < file time grep -i test > /dev/null
  21.08s
$ # Workaround 1:
$ < file time grep [Tt][Ee][Ss][Tt] > /dev/null
  1.53s
$ # Workaround 2:
$ (LANG=en_US.ISO-8859-1; < file time grep -i test > /dev/null)
  1.53s
$ # Workaround 3:
$ < file time grep -Pi test > /dev/null
  1.33s
$ # Workarounds 2 and 3 combined:
$ (LANG=en_US.ISO-8859-1; < file time grep -Pi test > /dev/null)
  1.25s
$ # Workaround 4:
$ < file time tr [A-Z] [a-z] | grep test > /dev/null
  0.46s

There was no measurable difference between the default and the other non-default regular expression types.

There was no measurable difference between the last version of grep from the standard Ubuntu repositories (2.10) and the latest stable version of GNU grep (2.14).

Conclusions

  • Workarounds 1 and 2 result in the same speed-up.

    This suggests that workaround 1 is similar to how grep's case-folding works when no multi-byte characters are anticipated.

  • While --perl-regexp is a lot slower for case-sensitive matching, it's orders of magnitude faster for case-insensitive matching.

  • --perl-regexp becomes even faster if no multi-byte characters are anticipated.

  • At the expense of case-folding the output, workaround 4 can be as fast as case-sensitive matching.


1 I'm not suggesting that this is how grep works internally. This is merely meant to illustrate why case-insensitive matching is more complicated.

2 Intel Core i5-3570K, 16 GiB RAM, Ubuntu 12.10

Dennis

Posted 2013-03-05T21:04:02.127

Reputation: 42 934

Dennis, grep -i should NOT be that much slower if it's coded right. My own grep runs imperceptibly slower with -i and always faster than the Cygwin grep. For example, on just over 89KLOC of C, here were the times to search for "include" on my Win7 core i7 machine: My grep 0.03s with/without -i. Cygwin grep 0.09s, Cygwin grep -i 0.16s. – Nicole Hamilton – 2013-03-05T22:51:57.977

Bug/bad implementation was actually my first guess, especially since Ubuntu's grep is over a year old. – Dennis – 2013-03-05T23:16:31.220

They've probably chosen a poor compilation scheme which then doesn't execute well. (I deliberately don't read GNU code so as to ensure no GNU code or ideas creep into my own code, so I don't know how our algorithms compare internally.) But I don't think the problem is year-old code. grep has been around for decades and should be really stable. I haven't touched the source for my own grep for 4 years, and that was for something unrelated to my regex. I probably haven't changed my regex in about a decade. – Nicole Hamilton – 2013-03-05T23:31:49.310

Exactly: it should be stable by now. Yet it's been almost seven years since GNU's Wget started hanging indefinitely when an HTTPS connection times out (bug #385506)... – Dennis – 2013-03-05T23:41:02.780

@NicoleHamilton: You were right. I've tried the latest stable version of grep; it made no difference. – Dennis – 2013-03-06T12:45:11.777