grepping large amounts of text

2

1

I have a few gigabytes of source code.

Using recursive grep to search for a term can take a while.

I am using ext3.

Is there a faster way? Would using find be faster, and if so, why? Would using a filesystem like XFS give noticeably better results?

Joshxtothe4

Posted 2009-12-16T20:33:47.923

Reputation:

1 With exec it can use grep, which would be faster than just using grep – None – 2009-12-16T20:46:41.563

and why would that be faster? – akira – 2009-12-16T22:57:04.150

Answers

5

Have you tried ack? It works pretty well here, on a 1mm+ sized codebase.
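
As a rough sketch of how that looks (assuming ack is installed; on some distributions the package is named ack-grep, and path/to/source and pattern are placeholders):

ack pattern path/to/source

ack recurses by default and skips VCS directories, backup files, and binaries, which is where most of its speed advantage over a plain recursive grep comes from.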

Jeff Paquette

Posted 2009-12-16T20:33:47.923

Reputation: 150

ack is easier and faster than using find | grep, and I often use it, but it doesn't index the results anywhere for later use. – njd – 2010-02-09T12:49:17.390

3

You can get better performance with agrep, which uses a novel bitmasking algorithm for search.

If you're looking for symbols, ctags or etags might work well enough to build an index for search.
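
A hedged sketch of both routes (agrep doesn't recurse on its own, so it is paired with find here; path/to/source, pattern, and the *.c filter are just placeholders):

find path/to/source -type f -name '*.c' -print0 | xargs -0 agrep -l pattern

For the symbol-index route, Exuberant Ctags can build a tags file recursively:

ctags -R path/to/source

after which your editor (e.g. Vim's :tag command) jumps straight to the definition instead of scanning file contents.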

pestilence669

Posted 2009-12-16T20:33:47.923

Reputation: 216

Ctags indexes the results, so you can search quickly from your editor. Darren Hiebert's Exuberant Ctags (ctags.sourceforge.net) has improved options for recursive searching. – njd – 2010-02-09T12:54:19.963

2

The only way you'll get a significant improvement over grep is to use an indexed search system like Strigi. The filesystem makes very little difference unless you have a huge number of very small files.

Tim Sylvester

Posted 2009-12-16T20:33:47.923

Reputation: 273

1

This should likely be on superuser.

Grepping is not the ideal solution to your problem since it performs a linear search.

Index your files for search using a desktop indexing solution such as Beagle or Google Desktop.
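
As a sketch of that workflow once the index has been built (this assumes the beagle-query command-line tool shipped with Beagle; term is a placeholder):

beagle-query term

The search then hits the prebuilt index instead of re-reading gigabytes of source on every query.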

Ben S

Posted 2009-12-16T20:33:47.923

Reputation: 1 902

1

I don't think the FS is going to make a big difference; chances are it's compute bound. You could check this using top to see if your CPUs are smoking.

You could also post your regexp here and let the smart people of SO have a crack at optimizing it. There are a variety of techniques for avoiding backtracking, etc.
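
A quick way to check, as a sketch (term and path/to/source are placeholders):

time grep -r term path/to/source

If user + sys comes out close to real, the search is CPU-bound and simplifying the pattern (or using grep -F for fixed strings) will help; if real is much larger, you're waiting on the disk and an index will buy you more than a faster regexp.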

Carl Smotricz

Posted 2009-12-16T20:33:47.923

Reputation: 629

fgrep is faster because it doesn't use regular expressions; it only searches for fixed strings. It's just an alias for grep -F. – Tim Sylvester – 2009-12-16T20:44:46.117

Right you are, thanks. I removed that part of my suggestion. – Carl Smotricz – 2009-12-16T20:50:12.427

1

If you only need to grep a subset of the files, then use find first. For example, to grep only the .h header files:

find path/to/source -name '*.h' -print0 | xargs -0 grep pattern

This will be faster because most of the traversal only touches filenames (directory entries) rather than file contents, so grep reads far fewer files and the disk does far less work.

James

Posted 2009-12-16T20:33:47.923

Reputation: 270

1 Better: find path/to/source -name '*.h' -exec grep pattern {} \; – Ewan – 2009-12-16T20:50:40.790

2 Even better: find path/to/source -name '*.h' -exec grep pattern {} \+ (fewer grep invocations) – None – 2009-12-16T22:39:22.157

1

Here is what I understand:

  • You are searching source code for a term
  • You'd like to see which source files use that term
  • You probably have thousands of files (adding up to GBs)
  • Do you want to know all the occurrences of 'term' within each file, or just a yes/no indication of whether it's used in a file at all? (The -l flag does the latter.)

You can use a divide-and-conquer approach: partition your files into multiple sets and run several greps in parallel.
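
A minimal sketch of that using GNU xargs and its -P option (pattern and path/to/source are placeholders):

find path/to/source -type f -print0 | xargs -0 -n 100 -P 4 grep -l pattern

This runs up to four grep processes at a time, each over a batch of 100 files, and prints just the names of the files that match.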

Not sure if your need is a one-off thing or something repetitive in nature.

blispr

Posted 2009-12-16T20:33:47.923

Reputation: 127