
I am involved with a very small residential management company which has a lot of documents that I want to digitise into PDF and put on the web for all residents to access. Many people are not technical, so something simple to use is essential.

I have the skills to set up a LAMP-based server solution, albeit one that should not cost significant amounts of money to purchase or maintain, hence open-source, preferably with a small memory footprint. Everything I have looked at so far, though (such as Alfresco, KnowledgeTree and LogicalDOC), seems like major overkill, and complex both in terms of setup and for users.

I was thinking along the lines of something like AjaxExplorer, which seems to do the file browsing part of what I want to do admirably. In terms of full-text searching, is there a product that will work with AjaxExplorer, or something else that can work alongside it, that people would recommend as a relatively easy to configure tool for indexing and subsequently searching a document repository?

It would be acceptable to have separate areas of the front-end for browsing the file tree, and simple searching by filename / metadata and full text search, if, as I suspect, there is no suitable integrated solution.

ewwhite
Stev_k
  • What is the format of the files you wish to search? That will likely be the biggest factor in finding your search tool. – jeffatrackaid Jan 04 '12 at 17:16
  • Mainly PDF - I'll edit the question – Stev_k Jan 04 '12 at 17:18
  • You may want to look at CMS systems. For example, I know someone that uses Plone for this purpose. I think it has a module for full-text PDF. – jeffatrackaid Jan 04 '12 at 17:27
  • Thanks for the answers - not sure that they're exactly what I'm looking for. I may just use http://swish-e.org/ as an indexer as it seems pretty simple and capable. – Stev_k Jan 05 '12 at 00:10
  • If you are using PDF as in "PDF file output for scanned paper documents", you might need [OCR to run over the data first](http://blog.konradvoelkel.de/2010/01/linux-ocr-and-pdf-problem-solved/) for indexing purposes. – the-wabbit Jan 05 '12 at 00:13
  • There is a Linux tool called pdftotext (if memory serves). For small deployments I've seen scripts that run it and then just do a grep. There is also a pdfgrep floating about. – jeffatrackaid Jan 05 '12 at 16:52
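
The pdftotext-then-grep approach from the last comment could be sketched roughly as follows. This is only an illustration, not anyone's actual script: it assumes the `pdftotext` command (from poppler-utils) is installed and on the PATH, and the names `pdf_to_text` and `search_pdfs` are made up for this example. It also lists PDFs that yield no text at all, which are likely scanned images that would need OCR first.

```python
import subprocess
import sys
from pathlib import Path
from typing import Callable, Dict, List


def pdf_to_text(pdf_path: Path) -> str:
    """Extract text with the pdftotext CLI; "-" sends the text to stdout.

    Assumes pdftotext (poppler-utils) is installed and on the PATH.
    """
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def search_pdfs(
    root: Path,
    term: str,
    extract: Callable[[Path], str] = pdf_to_text,
) -> Dict[str, List[str]]:
    """Case-insensitive substring search over every *.pdf under root.

    Returns a report with two lists of paths:
      "matches": PDFs whose extracted text contains the term
      "no_text": PDFs with no extractable text (probably need OCR)
    Note: the "*.pdf" glob is case-sensitive on most Linux filesystems.
    """
    needle = term.lower()
    report: Dict[str, List[str]] = {"matches": [], "no_text": []}
    for pdf in sorted(root.rglob("*.pdf")):
        text = extract(pdf)
        if not text.strip():
            report["no_text"].append(str(pdf))
        elif needle in text.lower():
            report["matches"].append(str(pdf))
    return report


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python search_pdfs.py /srv/docs "service charge"
    found = search_pdfs(Path(sys.argv[1]), sys.argv[2])
    for path in found["matches"]:
        print("match:", path)
    for path in found["no_text"]:
        print("no text layer (OCR needed?):", path)
```

This re-extracts every file on each search, which is fine for a few hundred documents; beyond that a proper indexer is the better fit.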

3 Answers


Personally, I would just use a regular distribution of Apache (without PHP) and then configure it to serve just the .pdf documents.

There are many different ways you could do this. For example, you could place directives like these inside the <Directory> block in your httpd.conf file:

<Directory "C:/Apache2.2/htdocs">
    Options Indexes Includes MultiViews
    IndexOptions +ScanHTMLTitles -IconsAreLinks FancyIndexing FoldersFirst NameWidth=*
    AddIcon (IMG,/webicons/image3.gif) .gif .png .jpeg .jpg .xbm .PNG .JPG .GIF .tiff .bmp
    AddIcon (IMG,/webicons/compressed.gif) .7z .zip .cab .tar .jar .mdb .ldf .mdf .CAB
    AddIcon (IMG,/webicons/binary.gif) .exe .msi .rdp .pcf .dia .class .ks .keystore .scc
    AddIcon (IMG,/webicons/a.gif) .txt .log .properties .doc .xls .xml .ts .msg .dat .sql .csv .pem .sh .py .tlp .java .der .csr .key .crt .bat .cmd .inf
    AddIcon (IMG,/webicons/link.gif) .lnk .htm .url .URL
    AddIcon (IMG,/webicons/pdf.gif) .pdf
    AddIcon /webicons/folder.png ^^DIRECTORY^^
    #ForceType application/octet-stream
    ....
    ....

Then browse to a document's URL, for example: http://domain.com/pdf/blah.pdf

If you really must have a search feature, you could install PHP and use a PHP flat-file search script.
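
To give an idea of how little a flat-file search by filename involves, here is a minimal sketch of the same idea, written in Python rather than PHP purely for illustration. The function name `find_by_name` is hypothetical; it simply walks the document folder and keeps the PDFs whose filename contains every word of the query.

```python
from pathlib import Path
from typing import List


def find_by_name(root: Path, query: str) -> List[str]:
    """Return PDFs under root whose filename contains all query words.

    Matching is case-insensitive; results are paths relative to root,
    using forward slashes so they can be appended to a base URL.
    """
    words = query.lower().split()
    hits: List[str] = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() == ".pdf":
            name = path.name.lower()
            if all(word in name for word in words):
                hits.append(path.relative_to(root).as_posix())
    return hits
```

Something like `find_by_name(Path("/srv/docs"), "agm minutes")` (the path is just a placeholder) would then return relative paths that a small web page could turn into links.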

djangofan

I have used MNOGOsearch for indexing a pile of PDF files. It does full-text searches of PDFs and many other document types. You may also find the search front end quite familiar.

The *nix versions are released under the GNU GPL.

http://www.mnogosearch.org/

Tim

ownCloud is an open-source solution for storing files that can run on LAMP. It has a very clean interface, and while it has other features (calendar, contacts, music, pictures), they can all be easily disabled. As of version 3 it has an integrated PDF viewer. As of version 5, it has full-text PDF searching powered by Lucene.

bmaupin