How can I convert a PDF of scanned images to a searchable PDF file?

19

13

I have a PDF of a scanned book.

I'm looking for free software that will perform OCR and then let me save the result as a PDF or document again.

Is there one?

yuval

Posted 2009-10-04T04:36:01.840

Question was closed 2015-06-15T16:01:43.993

You mean you want to convert the images in the pdf to text? – DaveParillo – 2009-10-04T05:01:03.303

Yes, but I don't want a txt file as output. I want to see the exact same PDF, but with the option to press Ctrl+F, highlight words, etc. – None – 2009-10-04T05:03:44.920

You will have a very hard time converting this PDF without losing text formatting and style. I have yet to find OCR software able to properly preserve a document from scanned images. Prepare for some donkey work (e.g. proofreading, etc.) :) – None – 2009-10-04T16:22:16.500

Answers

5

You could download the 30 day trial of Adobe Acrobat Pro and use the 'OCR Text Recognition' function ('Document > OCR Text Recognition > Recognise Text Using OCR...'). In the settings dialog, choose 'Searchable Image' as the output style. This will keep the page image but embed the OCR'ed text so the document will be searchable and allow text to be selected, copied and pasted.

After running the OCR you'll need to confirm or correct words that the OCR is unsure about using the 'Find OCR Suspects' functions.

pelms

Posted 2009-10-04T04:36:01.840

Reputation: 8 283

Although Adobe isn't free, it's by far the most capable OCR solution out there – James Healy – 2012-03-27T09:32:17.207

4

If you have a Google Account then Google Docs now includes the functionality to upload a PDF file and perform OCR on it.

I've tried it myself, and it makes a fair stab at an admittedly well-formatted PDF.

The formatting is pretty much destroyed but the text seems to survive.

Richard Lucas

Posted 2009-10-04T04:36:01.840

Reputation: 2 744

4

The following products were found listed on the Internet, but I haven't used them.

Online OCR

OCR Terminal

OCR Terminal is an online OCR service that performs Optical Character Recognition (OCR) on your scanned images and pdf files and renders them into editable and text searchable documents.

Free OCR

Free-OCR.com is a free online OCR (Optical Character Recognition) tool. You can use this to perform OCR on any image you supply.
This service is free, no registration necessary. We also do not need your email address.
Just upload your image files. Free-OCR takes either a JPG, GIF, TIFF, BMP or PDF (only first page). The only restriction is that the images must not be larger than 2MB, no wider or higher than 5000 pixels and there is a limit of 10 image uploads per hour.

Maestro Recognition Server is commercial, but has an online try-it demo.

Free software

FreeOCR - for images only.

FreeOCR is a scan & OCR program including the Tesseract free OCR engine, also known as a Tesseract GUI. It includes a Windows installer, is very simple to use, and supports multi-page TIFFs, fax documents, and most image types, including compressed TIFFs, which the Tesseract engine on its own cannot read. It now has TWAIN scanning.

pdfsandwich - PDF -> PDF converter.

pdfsandwich is a command-line tool to OCR scanned books or journals. It is able to recognize the page layout even for multicolumn text.

Essentially, pdfsandwich is a wrapper script which calls the following binaries: convert, cuneiform, gs, and hocr2pdf. It is known to run on Unix systems and has been tested on Linux and MacOS X. It supports parallel processing on multiprocessor systems.
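As a rough illustration of how the tool described above might be driven from a script, here is a minimal Python sketch. It assumes pdfsandwich is installed and on PATH and that the installed version accepts "-o" to name the output file; the function names and the "_ocr.pdf" naming are hypothetical choices made for this example.

```python
import shutil
import subprocess
from pathlib import Path


def pdfsandwich_command(input_pdf, output_pdf=None):
    """Build the argument list for a pdfsandwich run (does not execute it)."""
    src = Path(input_pdf)
    if output_pdf is None:
        # Hypothetical naming choice for this sketch: book.pdf -> book_ocr.pdf
        output_pdf = src.with_name(src.stem + "_ocr.pdf")
    # Assumption: "-o" names the output file in the installed version.
    return ["pdfsandwich", "-o", str(output_pdf), str(src)]


def run_pdfsandwich(input_pdf, output_pdf=None):
    """Run pdfsandwich if it is on PATH; raise a clear error otherwise."""
    if shutil.which("pdfsandwich") is None:
        raise FileNotFoundError("pdfsandwich is not installed or not on PATH")
    subprocess.run(pdfsandwich_command(input_pdf, output_pdf), check=True)
```

Calling run_pdfsandwich("book.pdf") would then be expected to leave the scanned page images in place and add a text layer, writing the result next to the input file.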

harrymc

Posted 2009-10-04T04:36:01.840

Reputation: 306 093

Looks like pdfsandwich has moved? http://www.tobias-elze.de/pdfsandwich/ – pioto – 2015-06-11T18:21:57.540

@pioto: It's not me that added pdfsandwich above, but I fixed the link as you suggested. – harrymc – 2015-06-11T19:10:46.033

I've just used pdfsandwich. It works and it's free! :) This will certainly help in my thesis, thanks! – Eddy – 2011-10-19T10:53:16.027

2

Cuneiform + hocr2pdf + Ghostscript: A DIY open-source solution.

I posted an answer outlining a solution involving a version of the now open-source Cuneiform OCR system and hocr2pdf, together with Ghostscript for putting the PDF pages together.

That was specifically for Linux but you can get Cuneiform and Ghostscript for Windows, too. I am not sure about hocr2pdf or an equivalent, though.
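The linked answer is not reproduced here, so the following is only a sketch of how such a Cuneiform + hocr2pdf + Ghostscript chain could look, not the original recipe. It assumes all three tools are on PATH, that Cuneiform can read the page image format you give it (BMP is the safest choice), and that hocr2pdf reads the hOCR data from standard input. The function names and file naming are hypothetical.

```python
import subprocess
from pathlib import Path


def page_commands(image):
    """Return (cuneiform argv, hocr2pdf argv, hOCR path, page PDF path) for one page."""
    img = Path(image)
    hocr = img.with_suffix(".hocr")
    page_pdf = img.with_suffix(".pdf")
    # Cuneiform writes the recognised text with position data as hOCR.
    ocr = ["cuneiform", "-f", "hocr", "-o", str(hocr), str(img)]
    # hocr2pdf places the hOCR text (fed on stdin) behind the page image.
    overlay = ["hocr2pdf", "-i", str(img), "-o", str(page_pdf)]
    return ocr, overlay, hocr, page_pdf


def merge_command(page_pdfs, output_pdf):
    """Ghostscript call that concatenates the per-page PDFs into one file."""
    return ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite",
            "-sOutputFile=" + str(output_pdf)] + [str(p) for p in page_pdfs]


def build_searchable_pdf(images, output_pdf):
    """Run OCR, overlay and merge for a list of page images, in order."""
    page_pdfs = []
    for image in images:
        ocr, overlay, hocr, page_pdf = page_commands(image)
        subprocess.run(ocr, check=True)
        with open(hocr, "rb") as hocr_file:
            subprocess.run(overlay, stdin=hocr_file, check=True)
        page_pdfs.append(page_pdf)
    subprocess.run(merge_command(page_pdfs, output_pdf), check=True)
```

Keeping command construction separate from execution makes it easy to print and inspect the exact command lines before running them on a whole book.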

Jukka Matilainen

Posted 2009-10-04T04:36:01.840

Reputation: 2 304

1

Here is a very strange method, which involves letting Google index and OCR it for you on a website, then retrieving it.

jtbandes

Posted 2009-10-04T04:36:01.840

Reputation: 8 350

Yeah, I saw that too... strange indeed :) I might end up doing it... – None – 2009-10-04T05:19:37.633

0

Try PDFCubed.com. There is nothing to install; it is all done online. You can send your documents to be processed via the web, email, or Dropbox. Scanned PDFs and TIFs are converted into searchable text PDFs and can then be retrieved via the web, email, or Dropbox.

rlangner

Posted 2009-10-04T04:36:01.840

Reputation: 38

0

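Install ImageMagick. Open a cmd window or terminal and split the PDF into one image per page (-density 300 renders the pages at 300 dpi instead of the default 72 dpi, which is too coarse for most OCR):

convert -density 300 myfile.pdf myfile-%02d.jpg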

The output will be 1 jpg file for each page in your pdf, myfile-00.jpg, myfile-01.jpg, etc.

Pass each image through an OCR program. I don't have much experience with this, but there seem to be a lot of choices.

Convert each page of text back into PDF. You could do this again with ImageMagick, but there are other ways as well:

convert -density 300 page-*.txt -compress jpeg final.pdf
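For the OCR step in the middle, one possible choice (an assumption here, since the answer does not name a program) is Tesseract, whose "pdf" output option writes a searchable PDF per page that keeps the page image. The Python sketch below only builds the command lines for the split and OCR steps; the function names are hypothetical, PNG is used instead of JPG to avoid compression artifacts, and on ImageMagick 7 the executable is typically "magick" rather than "convert".

```python
from pathlib import Path


def split_command(pdf, prefix, density=300):
    """ImageMagick call that renders each PDF page to prefix-00.png, prefix-01.png, ..."""
    # -density must come before the input file to affect how the PDF is rasterised.
    return ["convert", "-density", str(density), str(pdf), f"{prefix}-%02d.png"]


def page_names(prefix, page_count):
    """File names the split step is expected to produce (numbering starts at 00)."""
    return [f"{prefix}-{i:02d}.png" for i in range(page_count)]


def ocr_command(image):
    """Tesseract call that writes <image name without extension>.pdf with a text layer."""
    base = Path(image).with_suffix("")
    # Assumption: a Tesseract version that supports the "pdf" output config.
    return ["tesseract", str(image), str(base), "pdf"]
```

Each command list can be passed to subprocess.run(..., check=True). The resulting per-page PDFs would still need to be merged into one file, for example with Ghostscript.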

DaveParillo

Posted 2009-10-04T04:36:01.840

Reputation: 13 402

0

Your request seems to be a complicated solution to the problem, although I may not understand the problem correctly. At any rate:

Why not get a PDF writer that will allow you to enter the data directly onto the PDF page?

Xavierjazz

Posted 2009-10-04T04:36:01.840

Reputation: 7 993