
Open source preferred, but not necessary.

I've got Adobe Acrobat 8, and really like the OCR feature which can essentially put an invisible layer of OCR'd text on top of a scanned document. Thus what you see on screen is the original scanned document, but the result is searchable.

What I'm looking for is a way to automate this process. I've currently got a few scripts that we use for processing and archiving scanned files, and I'm looking for something that I can plug right into this batch process to do OCR in a manner similar to what I can do with Acrobat.

All suggestions welcome, thanks!

HopelessN00b
Boden
  • P.S. - I do try to keep userland questions on superuser. However, the implementation that results from this question will definitely live on the server that I've got processing scanned documentation... so it was a tossup. – Boden Aug 14 '09 at 19:44

3 Answers


I have this implemented in a company document archiving project. Each scanned file is a single-page TIFF. I use Cuneiform to create an hOCR file from each TIFF, then hocr2pdf to output the PDF file. For multi-page scans, I use gs (Ghostscript) to combine the per-page PDFs into a single PDF document. It works really well: the OCR is good enough for our needs, and the result is searchable in any PDF viewer.
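The steps above can be sketched roughly as follows. This is only an illustration, not the answerer's actual script: the function names are hypothetical, and the command-line flags shown (`cuneiform -l/-f/-o`, `hocr2pdf -i/-o` with the hOCR on standard input, and the usual `gs` pdfwrite options) are assumptions based on common usage of these tools, so check them against the versions you have installed.

```python
import subprocess
from pathlib import Path
from typing import List, Sequence


def cuneiform_cmd(tif: Path, hocr: Path, lang: str = "eng") -> List[str]:
    # Assumed invocation: OCR one TIFF page and write hOCR markup to `hocr`.
    return ["cuneiform", "-l", lang, "-f", "hocr", "-o", str(hocr), str(tif)]


def hocr2pdf_cmd(tif: Path, pdf: Path) -> List[str]:
    # Assumed invocation: hocr2pdf takes the image via -i, the output via -o,
    # and reads the hOCR markup from standard input.
    return ["hocr2pdf", "-i", str(tif), "-o", str(pdf)]


def gs_merge_cmd(pdfs: Sequence[Path], merged: Path) -> List[str]:
    # Assumed invocation: Ghostscript's pdfwrite device concatenates the inputs.
    return ["gs", "-dBATCH", "-dNOPAUSE", "-q", "-sDEVICE=pdfwrite",
            f"-sOutputFile={merged}", *[str(p) for p in pdfs]]


def ocr_pages(tifs: Sequence[Path], merged: Path) -> None:
    """OCR each single-page TIFF, then merge the per-page PDFs into one file."""
    pdfs = []
    for tif in tifs:
        hocr = tif.with_suffix(".hocr")
        pdf = tif.with_suffix(".pdf")
        subprocess.run(cuneiform_cmd(tif, hocr), check=True)
        # Feed the hOCR file to hocr2pdf on stdin.
        with hocr.open("rb") as markup:
            subprocess.run(hocr2pdf_cmd(tif, pdf), stdin=markup, check=True)
        pdfs.append(pdf)
    subprocess.run(gs_merge_cmd(pdfs, merged), check=True)
```

With the three tools on the PATH, a call such as `ocr_pages([Path("page1.tif"), Path("page2.tif")], Path("scan.pdf"))` would be expected to leave a searchable `scan.pdf` alongside the per-page intermediates.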

xeon
  • Interesting. Before I spend too much time looking at it, is the resulting PDF the image from the original scan with an embedded text layer, or is it text only? – Boden Aug 14 '09 at 19:45
  • It's the image of the original scan with an embedded text layer. The hOCR file is the text output with HTML markup. – xeon Aug 14 '09 at 21:17
  • Excellent. I'm going to give it a shot. If it looks like it'll work I'll mark your answer accepted. Thanks! – Boden Aug 14 '09 at 23:46
  • Thanks again. A bit of a pain to install these two guys, but it's working. I wrote a simple script to check an FTP folder for new .tif files, on which it runs cuneiform and hocr2pdf, then uploads the results into a SharePoint document library using curl. Thus people can archive documents right from the copy machine, and the archives are fully text searchable. Question: do you know what the "resolution overwrite" option in hocr2pdf does? – Boden Aug 21 '09 at 19:29
  • I am glad it is working out for you. I do not know what the -r argument does. – xeon Aug 21 '09 at 20:49
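
The "check a folder for new .tif files" step mentioned in the comments could look something like the sketch below. It assumes the files have already been fetched to a local folder and that a page counts as processed once a PDF with the same base name sits next to it; both are illustrative conventions, not details from the thread. The FTP download and the curl upload to SharePoint are left out, and `pending_tifs` is a hypothetical helper name.

```python
from pathlib import Path
from typing import List


def pending_tifs(folder: Path) -> List[Path]:
    """Return the .tif files in `folder` that have no same-named .pdf yet."""
    return sorted(
        tif for tif in folder.glob("*.tif")
        # Assumed convention: the finished PDF is written beside the TIFF.
        if not tif.with_suffix(".pdf").exists()
    )
```

Running this from cron or a simple loop and passing each result to the OCR step would approximate the batch behaviour described.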

Have you looked at WatchOCR? You can download it from http://www.watchocr.com. It is a free and open source OCR server that transforms image-only PDFs into text-searchable PDFs from a watched folder or network share.

rlangner

I like the sound of xeon's answer, though OCRopus sounds like a lot of fun.

Kara Marfia
  • When I was researching and testing different solutions, I tried that and tesseract-ocr, and neither had a good way to output to PDF at the time. I have not looked into whether they have those features now... I know tesseract-ocr has it on their timeline... – xeon Aug 14 '09 at 18:28