Simple open-source solution for PDF document storage and search-based retrieval

Question

I am involved with a very small residential management company which has a lot of documents that I want to digitise into PDF and put on the web for all residents to access. Many people are not technical, so something simple to use is essential.

I have the skills to set up a LAMP-based server, but the solution should not cost significant amounts of money to purchase or maintain, hence open source, preferably with a small memory footprint. Everything I have looked at so far, though (such as Alfresco, KnowledgeTree and LogicalDOC), seems like major overkill, and complex both to set up and for users.

I was thinking along the lines of something like AjaxExplorer, which seems to do the file browsing part of what I want to do admirably. In terms of full-text searching, is there a product that will work with AjaxExplorer, or something else that can work alongside it, that people would recommend as a relatively easy to configure tool for indexing and subsequently searching a document repository?

It would be acceptable to have separate areas of the front-end for browsing the file tree, and simple searching by filename / metadata and full text search, if, as I suspect, there is no suitable integrated solution.

What is the format of the files you wish to search? That will likely be the biggest factor in finding your search tool. — jeffatrackaid, Jan 04 '12 at 17:16
You may want to look at CMS systems. For example, I know someone that uses Plone for this purpose. I think it has a module for full-text PDF. — jeffatrackaid, Jan 04 '12 at 17:27
Thanks for the answers - not sure that they're exactly what I'm looking for. I may just use http://swish-e.org/ as an indexer as it seems pretty simple and capable. — Stev_k, Jan 05 '12 at 00:10
If you are using PDF as in "PDF file output for scanned paper documents", you might need [OCR to run over the data first](http://blog.konradvoelkel.de/2010/01/linux-ocr-and-pdf-problem-solved/) for indexing purposes. — the-wabbit, Jan 05 '12 at 00:13
There is a Linux tool called pdftotext (if memory serves). For small deployments I've seen scripts that run it and then just grep the output. There is also a tool called pdfgrep. — jeffatrackaid, Jan 05 '12 at 16:52
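
The pdftotext-and-grep approach from the comment above could look roughly like this. It is a minimal sketch, assuming poppler's `pdftotext` is installed and on the PATH; the function names are invented for illustration. Scanned PDFs with no text layer will return nothing until OCR has been run over them.

```python
import subprocess
import sys
from pathlib import Path


def pdf_to_text(pdf_path):
    """Return the text layer of one PDF by shelling out to pdftotext."""
    # The trailing "-" tells pdftotext to write to stdout instead of a .txt file.
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def matching_lines(text, term):
    """Case-insensitive grep: return (line_number, line) pairs containing term."""
    needle = term.lower()
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if needle in line.lower()
    ]


def search_tree(root, term):
    """Search every *.pdf under root; yield (path, line_number, line)."""
    for pdf_path in sorted(Path(root).rglob("*.pdf")):
        for number, line in matching_lines(pdf_to_text(pdf_path), term):
            yield pdf_path, number, line


# Only runs when called with exactly two arguments: a directory and a term.
if __name__ == "__main__" and len(sys.argv) == 3:
    for path, number, line in search_tree(sys.argv[1], sys.argv[2]):
        print(f"{path}:{number}: {line}")
```

This re-extracts every PDF on each search, which is tolerable for a few hundred documents. For larger collections, a real indexer such as the Swish-e mentioned above is likely the better fit.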

score 1 · Answer 1 · answered Jan 04 '12 at 17:27

Personally, I would just use a regular distribution of Apache (without PHP) and let its directory listing serve the .pdf documents.

There are many different ways you could do this. For example, place directives like these inside the "Directory" node in your httpd.conf file:

<Directory "C:/Apache2.2/htdocs">
    Options Indexes Includes MultiViews
    IndexOptions +ScanHTMLTitles -IconsAreLinks FancyIndexing FoldersFirst NameWidth=*
    AddIcon (IMG,/webicons/image3.gif) .gif .png .jpeg .jpg .xbm .PNG .JPG .GIF .tiff .bmp
    AddIcon (IMG,/webicons/compressed.gif) .7z .zip .cab .tar .jar .mdb .ldf .mdf .CAB
    AddIcon (IMG,/webicons/binary.gif) .exe .msi .rdp .pcf .dia .class .ks .keystore .scc
    AddIcon (IMG,/webicons/a.gif) .txt .log .properties .doc .xls .xml .ts .msg .dat .sql .csv .pem .sh .py .tlp .java .der .csr .key .crt .bat .cmd .inf
    AddIcon (IMG,/webicons/link.gif) .lnk .htm .url .URL
    AddIcon (IMG,/webicons/pdf.gif) .pdf
    AddIcon /webicons/folder.png ^^DIRECTORY^^
    #ForceType application/octet-stream
    ....
    ....
</Directory>

Then type a document's URL into the browser, for example: http://domain.com/pdf/blah.pdf

If you really must have a search feature, you could install PHP and use a PHP flat-file search script.
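
To give an idea of how little a flat-file search over file names involves, here is a minimal sketch. It is written in Python rather than PHP and is not the script referred to above; the function name and the "every word must appear in the file name" rule are assumptions made for illustration.

```python
from pathlib import Path


def find_by_name(root, query):
    """Return sorted relative paths of PDFs whose file name contains every query word."""
    words = query.lower().split()
    root = Path(root)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
        and path.suffix.lower() == ".pdf"               # only list PDF documents
        and all(word in path.name.lower() for word in words)
    )
```

A call such as `find_by_name("/var/www/pdf", "agm 2011")` (path hypothetical) returns paths relative to the document root, which a front end can turn into links. It matches on file names only, not on document contents.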

score 1 · Answer 2 · answered Jan 04 '12 at 21:28

I have used MNOGOsearch for indexing a pile of PDF files. It does full-text searches of PDFs and many other document types. You may also find the search front end quite familiar.

The *nix versions are licensed under the GNU GPL.

http://www.mnogosearch.org/

Tim

Cool software but it should be noted that beyond 3,000 documents, it costs around $1000 for a license. – djangofan Jan 05 '12 at 19:14
Never pushed that many documents before, thanks for the heads up! – Tim Jan 05 '12 at 19:39

score 1 · Answer 3 · bmaupin · 2013-10-14T19:53:56.003

ownCloud is an open-source solution for storing files that can run on LAMP. It has a very clean interface, and while it has other features (calendar, contacts, music, pictures), they can all be easily disabled. As of version 3 it has an integrated PDF viewer. As of version 5, it has full-text PDF searching powered by Lucene.

edited Oct 14 '13 at 19:53

answered Jan 31 '12 at 16:56
