Indexing PDF files on Ubuntu

Question

I'm looking for a solution on Ubuntu that indexes PDF (and PostScript?) files so they can be searched later.

The criteria would be:

Compatibility: Text extraction often varies depending on the software used to create the PDF. Some PDFs can also be "locked", which I guess one should respect.
Search functionality: wildcards, regexes, "fuzzy" matching.
Speed of search

In my case I want to index a folder of academic journal articles, hence the requirement that it works consistently regardless of what software created the PDF. I'm already using a reference manager so would rather not replace that.

For example, a good front-end to Beagle plus a plugin that lets it index PDFs would be perfect.

You need to clarify your question. Best setup for what? Indexing it where? Displaying it how? — pauska, Jul 01 '09 at 13:54
There are GUI frontends to Beagle for all the major desktop environments, I think... — David Z, Jul 01 '09 at 16:02

score 2 · Accepted Answer · answered Jul 03 '09 at 06:51


Tracker does the same thing as Beagle and Strigi, but unlike Beagle it's written in pure C (Beagle is a Mono application). It is allegedly a lot faster than Beagle, though I haven't benchmarked it myself.

I can't find you a link to Tracker, but I'm sure it's in the default Ubuntu repositories.


wzzrd


Installed (and cron'd) on default ubuntu-desktop meta-package. – LiraNuna Jul 03 '09 at 07:21
Confirmed, Tracker works great without any tweaking. – pufferfish Jul 09 '09 at 14:38

sleske · Answer 2 · 2009-07-01T16:00:53.313

1

Lucene does full-text indexing of content from PDF, HTML, Microsoft Word, and OpenDocument files (the text itself is usually extracted by a separate parser). It's just a library, but several applications and CMSs use it, or you could use it as a base for your own solution.

It is free software (Apache license).
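To give a rough idea of what "a base for your own solution" involves, here is a minimal Python sketch of the same idea: extract text, build an inverted index, then query it with wildcard, regex, or fuzzy matching. This is not Lucene (which is a Java library) and not how Lucene is implemented; the function names `extract_text`, `build_index`, and `search` are made up for illustration, and `extract_text` assumes the `pdftotext` tool from poppler-utils is installed.

```python
import fnmatch
import re
import subprocess
from collections import defaultdict
from difflib import get_close_matches


def extract_text(pdf_path):
    """Extract text via the pdftotext CLI (assumed installed); '-' means stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def build_index(documents):
    """Map each lowercased word to the set of document names containing it.

    `documents` is a dict of {name: already-extracted text}.
    """
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index


def search(index, query, mode="exact"):
    """Return the sorted names of documents with an indexed word matching `query`."""
    if mode == "regex":
        # The whole word must match; case is ignored rather than lowercasing
        # the pattern, which would corrupt escapes such as \D.
        pattern = re.compile(query, re.IGNORECASE)
        terms = [t for t in index if pattern.fullmatch(t)]
    else:
        query = query.lower()
        if mode == "exact":
            terms = [query] if query in index else []
        elif mode == "wildcard":
            # Shell-style patterns: * and ? as in a file glob.
            terms = [t for t in index if fnmatch.fnmatchcase(t, query)]
        elif mode == "fuzzy":
            # Similarity-based matching; the 0.8 cutoff is an arbitrary choice.
            terms = get_close_matches(query, list(index), n=5, cutoff=0.8)
        else:
            raise ValueError(f"unknown mode: {mode}")
    hits = set()
    for term in terms:
        hits |= index[term]
    return sorted(hits)
```

To index a folder you would call `extract_text` on each file and pass the resulting `{filename: text}` dict to `build_index`. A real engine adds ranking, stemming, and on-disk storage on top of this, which is the part a library like Lucene saves you from writing.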

Edit:

If you are looking for something with a frontend, you might consider Beagle or Strigi:

Beagle

Strigi

edited Jul 01 '09 at 16:00

answered Jul 01 '09 at 14:44


AFAIK Lucene is just a storage engine, albeit a very good one. I'm looking for something that has a front-end. – pufferfish Jul 01 '09 at 14:54

score 0 · Answer 3 · answered Jul 03 '09 at 02:18


I use Google Desktop for searching on Linux. Not free, but it's the best I've found.

