OCR Image based PDF

2

Possible Duplicate:
Extracting text from a .PDF scanned book
How to do OCR on a PDF document?

I have a PDF manual of more than 200 pages that was produced by scanning a hard copy. I'd like to convert it to a searchable text format, but I haven't had any success finding a tool to do so. Google's search results are heavily polluted with crippleware trial software that only processes the first few pages of a file. The only truly free application I found was FreeOCR, and its PDF renderer fails to handle anything beyond the first few pages of the file.

Google's PDF viewer does OCR, but it doesn't appear to provide any export option other than copy/paste. Besides being very tedious, copying puts only plain text on the clipboard, which means I'd lose all of the line art, along with significant formatting that depends on horizontal placement.

Dan is Fiddling by Firelight

Posted 2012-05-20T15:44:00.860

Reputation: 2 677

Question was closed 2012-05-22T03:10:50.480

@DanielAndersson Unfortunately, none of those were helpful. Blowing the file apart into hundreds of image files and then gluing them back together would be a massive waste of time (1st and 3rd link). I already have plenty of tools that claim they'd do the job if I paid for them, but I can't verify those claims because the problematic parts of the file lie beyond the pages they'll process for free (2nd link). – Dan is Fiddling by Firelight – 2012-05-20T17:46:10.967

Then put that info in your question as well, so people know what you have and haven't tried. People aren't at this site because they like guessing :-) – Daniel Andersson – 2012-05-20T19:05:22.307
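
For reference, the split, OCR and rejoin approach mentioned in the comments above does not have to be done by hand. The following is only a rough sketch, assuming the poppler tools `pdftoppm` and `pdfunite` and a `tesseract` build with PDF output are installed; the function names and file names are made up for illustration. It only builds the command lines and runs nothing itself.

```python
from pathlib import Path


def render_command(pdf, prefix, dpi=300):
    """Command that renders every page of `pdf` to <prefix>-<n>.png.

    Assumes pdftoppm (poppler) is on PATH; 300 dpi is an arbitrary default.
    """
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_commands(images, out_pdf):
    """Commands that OCR each page image and merge the results.

    `images` is the sorted list of page images produced by the render step.
    Each tesseract call is expected to write <image name without .png>.pdf;
    pdfunite then joins those per-page PDFs into `out_pdf`.
    """
    per_page = []
    parts = []
    for image in images:
        base = Path(image).with_suffix("")  # page-001.png -> page-001
        per_page.append(["tesseract", str(image), str(base), "pdf"])
        parts.append(str(base) + ".pdf")
    merge = ["pdfunite", *parts, str(out_pdf)]
    return per_page + [merge]


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(" ".join(render_command("manual.pdf", "page")))
    for cmd in ocr_commands(["page-001.png", "page-002.png"], "manual_ocr.pdf"):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the rendered images exist. Whether the result preserves line art and layout well enough for a given manual would have to be checked on a few sample pages.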

Answers

2

If you upload your PDF to Google Drive (Docs) with the upload conversion settings set to convert images to text and to convert the document to a Google Doc (this can all be done at upload), you should then be able to open the doc, click File > Download as, and select the format you want.

I just tried this with a magazine page and it worked okay, although not all of the fonts were recognised.

sgtbeano

Posted 2012-05-20T15:44:00.860

Reputation: 575

The upload converter maxes out at a 2MB file size. If I import the file by emailing it to myself (what I tried originally), I don't run into the limit, but I don't get the conversion options either. – Dan is Fiddling by Firelight – 2012-05-20T16:27:37.417

How about this service? It says it doesn't have any upload limits: http://www.newocr.com/ – sgtbeano – 2012-05-20T16:45:46.820

That service sort of works, but by trashing everything that's not a letter it breaks a moderate amount of formatting (most seriously, some structural formulas for chemicals). – Dan is Fiddling by Firelight – 2012-05-20T17:31:23.363

I used a PDF splitter to cut the file down below the upload limit, but the Google Docs converter didn't OCR the text at all, unlike their PDF viewer. – Dan is Fiddling by Firelight – 2012-05-20T18:16:49.340
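
Splitting a large scan to fit under a size cap such as the 2MB limit mentioned above can be planned before running any splitter. This is a minimal sketch, assuming a rough per-page size estimate is available; `plan_chunks` and its inputs are hypothetical, and summed page sizes only approximate the size of the resulting files.

```python
def plan_chunks(page_sizes, limit):
    """Group consecutive pages into chunks whose summed size fits in `limit`.

    page_sizes: estimated size in bytes of each page (an assumed input).
    limit: maximum size in bytes allowed per chunk.
    Returns a list of (first_page, last_page) tuples, 1-based and inclusive.
    """
    chunks = []
    start = 1   # first page of the chunk being built
    total = 0   # running size of the chunk being built
    for page, size in enumerate(page_sizes, start=1):
        if size > limit:
            raise ValueError(f"page {page} alone exceeds the limit")
        if total + size > limit:
            # Close the current chunk and start a new one at this page.
            chunks.append((start, page - 1))
            start = page
            total = 0
        total += size
    if page_sizes:
        chunks.append((start, len(page_sizes)))
    return chunks


if __name__ == "__main__":
    limit = 2 * 1024 * 1024          # the 2MB cap from the comments
    sizes = [150_000] * 200          # made-up estimate: 200 pages of 150 kB
    for first, last in plan_chunks(sizes, limit):
        print(f"pages {first}-{last}")
```

The resulting page ranges can then be fed to whichever PDF splitter is at hand. Leaving some headroom below the real limit is probably wise, since each output file also carries its own overhead.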