I'm looking for any open source command line tool or tools which will allow me to index and search a large number of plain text files. Approximate search would be a plus. The tool only needs to print the files that match, although some match context would be useful. A GUI tool isn't useful for my application, nor is anything that searches files one by one (grep for example). I'm basically targeting unix platforms (osx, linux, bsd).
EDIT: I'm not interested in any sort of tool that is system-wide, or needs to run in the background. Basically, I want to build an index for a directory tree full of text files and then later be able to search against it. Preferably the index is one or a few files that I can specify the location of.
Any ideas?
Just about any way you do it, you will have to scan each file for matches. Even if you dump everything into a DB, as one answer proposes, you still have to feed each file into the DB one by one. I don't know why grep won't work for you, but it will give you exactly the results you're asking for: the matching file and the context of the match. Just redirect the output to a file and you have a searchable index.
grep -r searchterm /somedir/* > index.txt
– None – 2011-03-13T23:42:45.753

@Deleted Account, A query using grep is O(n) where n is the number of files. An index usually implies a data structure that gives you better than O(n) for most searches. Your index.txt idea is worse than grep by itself as it is an extra step, and I'm really not sure what the point would be. I don't have a problem with a database, I'd just prefer a lightweight one like sqlite or similar. – ergosys – 2011-03-14T01:28:43.727
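Since the comment above mentions sqlite as an acceptable lightweight backend, here is a minimal sketch of that approach using Python's standard `sqlite3` module with the FTS5 full-text extension (available in most modern SQLite builds, though not guaranteed in all). The index lives in a single database file whose location you choose, and queries use the inverted index rather than rescanning every file. The function and table names here are illustrative, not from any existing tool.

```python
import os
import sqlite3

def build_index(root, db_path="index.db"):
    """Walk a directory tree of text files and store their contents
    in a single-file SQLite FTS5 full-text index."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(path, body)")
    con.execute("DELETE FROM docs")  # rebuild the index from scratch
    for dirpath, _, names in os.walk(root):
        for name in names:
            full = os.path.join(dirpath, name)
            try:
                with open(full, encoding="utf-8", errors="replace") as f:
                    con.execute(
                        "INSERT INTO docs (path, body) VALUES (?, ?)",
                        (full, f.read()),
                    )
            except OSError:
                pass  # skip unreadable files
    con.commit()
    return con

def search(con, query):
    """Return (path, context snippet) pairs for files matching the query,
    best matches first; snippet() marks the hit and trims surrounding text."""
    return con.execute(
        "SELECT path, snippet(docs, 1, '[', ']', '...', 8) "
        "FROM docs WHERE docs MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
```

FTS5's `MATCH` also accepts prefix queries like `index*`, which gives a rough form of approximate search; true fuzzy matching would need something like the spellfix1 extension or an external engine.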