Any tools to automate OCR of scanned PDF files in a manner similar to Acrobat's OCR feature?

Question

Open source preferred, but not necessary.

I've got Adobe Acrobat 8, and really like the OCR feature which can essentially put an invisible layer of OCR'd text on top of a scanned document. Thus what you see on screen is the original scanned document, but the result is searchable.

What I'm looking for is a way to automate this process. I've currently got a few scripts that we use for processing and archiving scanned files, and I'm looking for something I can plug right into this batch process to do OCR in a manner similar to what I can do with Acrobat.

All suggestions welcome, thanks!

P.S. - I do try to keep userland questions on superuser. However, the implementation that results from this question will definitely live on the server that I've got processing scanned documentation... so it was a tossup. – Boden Aug 14 '09 at 19:44

score 8 · Accepted Answer · answered Aug 14 '09 at 18:19

I have this implemented in a company document archiving project. Each scanned file is a single-page TIFF. I use Cuneiform to create an hOCR file from each TIFF, then hocr2pdf to output the PDF file. If the scan has multiple pages, I use gs to combine the PDFs into a single PDF document. It works really well: the OCR is good enough for our needs, and the result is searchable in any PDF viewer.
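
For anyone wiring this into a batch script, here is a rough Python sketch of the three steps described above (Cuneiform, then hocr2pdf, then gs). It is not the answerer's actual script. The function names and file naming scheme are made up for illustration, and it assumes `cuneiform`, `hocr2pdf` and `gs` are on the PATH and that your builds accept the common invocations shown; check each tool's help output before relying on it.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def cuneiform_cmd(tif: Path, hocr: Path) -> list[str]:
    # -f hocr asks Cuneiform for hOCR output; -o names the output file.
    return ["cuneiform", "-f", "hocr", "-o", str(hocr), str(tif)]


def hocr2pdf_cmd(tif: Path, pdf: Path) -> list[str]:
    # hocr2pdf takes the page image via -i and reads the hOCR markup on stdin.
    return ["hocr2pdf", "-i", str(tif), "-o", str(pdf)]


def gs_merge_cmd(pages: list[Path], merged: Path) -> list[str]:
    # Ghostscript's pdfwrite device concatenates the input PDFs in order.
    return ["gs", "-q", "-dBATCH", "-dNOPAUSE", "-sDEVICE=pdfwrite",
            f"-sOutputFile={merged}", *[str(p) for p in pages]]


def ocr_page(tif: Path) -> Path:
    """OCR one single-page TIFF into a searchable PDF next to it."""
    hocr = tif.with_suffix(".hocr")
    pdf = tif.with_suffix(".pdf")
    subprocess.run(cuneiform_cmd(tif, hocr), check=True)
    with hocr.open("rb") as markup:
        subprocess.run(hocr2pdf_cmd(tif, pdf), stdin=markup, check=True)
    return pdf


def ocr_document(tifs: list[Path], merged: Path) -> Path:
    """OCR each page, then combine the per-page PDFs into one document."""
    pages = [ocr_page(tif) for tif in tifs]
    subprocess.run(gs_merge_cmd(pages, merged), check=True)
    return merged
```

Something like `ocr_document(sorted(Path("scan").glob("*.tif")), Path("scan.pdf"))` would then produce one searchable PDF per scan job. The folder name `scan` is only an example.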

xeon

Interesting. Before I spend too much time looking at it, is the resulting PDF the image from the original scan with an embedded text layer, or is it text only? – Boden Aug 14 '09 at 19:45
It's the image of the original scan with an embedded text layer. The hocr file is text output with HTML markup. – xeon Aug 14 '09 at 21:17
Excellent. I'm going to give it a shot. If it looks like it'll work I'll mark your answer accepted. Thanks! – Boden Aug 14 '09 at 23:46
Thanks again. A bit of a pain to install these two guys, but it's working. I wrote a simple script that checks an FTP folder for new .tif files, runs cuneiform and hocr2pdf on them, then uploads the results into a SharePoint document library using curl. Thus people can archive documents right from the copy machine, and the archives are fully text searchable. Question: do you know what the "resolution overwrite" option in hocr2pdf does? – Boden Aug 21 '09 at 19:29
I am glad it is working out for you. I do not know what the -r argument does. – xeon Aug 21 '09 at 20:49
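
The "check a folder for new .tif files" part of the script described in the comment above could look roughly like this in Python. This is only a sketch under assumptions: it polls a local directory (for example the FTP server's upload directory), treats a .tif as new when no .pdf with the same name sits next to it, and takes a hypothetical `process` callback that does the OCR. The upload to SharePoint via curl is left out because the thread gives no details about it.

```python
from __future__ import annotations

import time
from pathlib import Path
from typing import Callable, Optional


def pending_scans(inbox: Path) -> list[Path]:
    """Return the .tif files in inbox that have no matching .pdf yet."""
    return sorted(
        tif for tif in inbox.glob("*.tif")
        if not tif.with_suffix(".pdf").exists()
    )


def poll(inbox: Path, process: Callable[[Path], None],
         interval: float = 30.0, rounds: Optional[int] = None) -> None:
    """Call process() on each pending scan, sleeping between passes.

    rounds=None loops forever; a number limits the passes (handy for testing).
    """
    done = 0
    while rounds is None or done < rounds:
        for tif in pending_scans(inbox):
            process(tif)
        time.sleep(interval)
        done += 1
```

Using the presence of the output PDF as the "already processed" marker keeps the loop stateless, so the script can be restarted at any time without reprocessing old scans.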

score 1 · Answer 2 · answered Jul 06 '10 at 12:40

Have you looked at WatchOCR? You can download it from http://www.watchocr.com. It is a free and open source OCR server that transforms image-only PDFs into text-searchable PDFs from a watched folder or network share.

rlangner

score 0 · Answer 3 · answered Aug 14 '09 at 18:26

I like the sounds of xeon's answer, though OCRopus sounds like a lot of fun.

Kara Marfia

When I was researching and testing different solutions, I tried that and tesseract-ocr, and neither had a good way to output to PDF at the time. I have not looked into whether they have those features now... I know tesseract-ocr has it in their timeline... – xeon Aug 14 '09 at 18:28
