Extracting text from a .PDF scanned book

I have a scanned book in PDF format, but the quality is rather poor:


(The language is Romanian and it's a medical physiology book, in case you were wondering)

I want to extract text from the book (1500 pages) but keep the images the way they are. I really don't think I have any chance to find a solution, so I'll surely buy the book.

On the off chance, is there any powerful software that can do what I'm looking for? It also has to recognize Romanian.

ChristianM

Posted 2009-11-01T22:33:33.173

Reputation: 655

Buy it, it's legal. :) – None – 2009-11-01T23:02:13.273

What if this is a really old book he can't buy anymore? :) – Botond Balázs – 2009-11-03T07:55:26.807

@Botond, that is in fact a huge issue with Google Book Search. An estimated 70% of its books are in-copyright, but out-of-print. A class action settlement (negotiated between Google and a few lawyers working for the Authors Guild and AAP) states that for out-of-print books Google does not need permission, unless the rights owners specifically opt out of the agreement. And, the way US law works, this is binding on every work of literature ever produced. As long as other companies do not get a similar deal, Google has a monopoly on old literature :-( See Boing Boing at http://tinyurl.com/yl5rlts

– Arjan – 2009-11-03T09:47:45.130

The problem of the OP is to extract text from a book. This is still a problem even if he has bought the book. Legal issues, though worth considering, are out of scope here. – mouviciel – 2009-11-10T08:42:39.177

Answers

I bought the book!

ChristianM

Posted 2009-11-01T22:33:33.173

Reputation: 655

I posted an answer earlier detailing how to use Cuneiform (open source software) to do OCR on PDF files and how to create a PDF file with the recognized text in a hidden text layer "behind" the original image. As far as I know, Cuneiform supports Romanian as well.

While that particular solution was for Linux, Cuneiform is also available for Windows.
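
As a rough illustration of the OCR step only (not the hidden-text-layer step), here is a minimal sketch of calling the Cuneiform command-line tool on one page image. It assumes the Linux build of `cuneiform` is on the PATH, that `rum` is the language code it reports for Romanian (run `cuneiform -l` to check on your install), and that the PDF pages have already been exported as images the build can read. The function names and file names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def cuneiform_command(image: Path, output: Path, language: str = "rum") -> List[str]:
    """Build the Cuneiform command line for one page image (hOCR output)."""
    # -l: recognition language, -f: output format, -o: result file
    return ["cuneiform", "-l", language, "-f", "hocr", "-o", str(output), str(image)]


def ocr_page(image: Path, output: Path, language: str = "rum") -> bool:
    """Run Cuneiform on one page; return False if the binary is not installed."""
    if shutil.which("cuneiform") is None:
        return False
    # check=True raises CalledProcessError if Cuneiform exits with an error
    subprocess.run(cuneiform_command(image, output, language), check=True)
    return True
```

Calling `ocr_page(Path("page-001.bmp"), Path("page-001.html"))` for each exported page would give one hOCR file per page, which is the kind of input the text-layer step needs.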

Jukka Matilainen

Posted 2009-11-01T22:33:33.173

Reputation: 2 304

Adobe Acrobat Professional can do that. I'm not sure if there is a Romanian version...

Lukas

Posted 2009-11-01T22:33:33.173

Reputation: 1 156

ABBYY FineReader is very strong OCR software. It handles very complex layouts and supports a lot of formats (including PDF). Romanian is supported with a dictionary, i.e. the software uses a dictionary to prioritize hypotheses during recognition. (here).

In any case, OCR-ing scientific literature with poor scan quality is a difficult task. Be ready to spend a lot of time helping the software by checking the results and fixing the layout. On your scan I see a lot of very poor-quality text :(. I don't think any OCR software could work normally with it.
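
To give an idea of what dictionary-based prioritizing means, here is a toy sketch. It is not how FineReader works internally; it only shows the general idea of preferring a dictionary word that is close to what the recognizer produced. The word list, the function name and the 0.6 similarity cutoff are all assumptions made for illustration.

```python
import difflib

# Tiny illustrative word list; a real OCR dictionary holds many thousands of entries.
DICTIONARY = ["fiziologie", "sânge", "inimă", "țesut", "celulă"]


def best_hypothesis(ocr_word, dictionary=DICTIONARY, cutoff=0.6):
    """Return the closest dictionary word, or the OCR output unchanged if nothing is close."""
    matches = difflib.get_close_matches(ocr_word.lower(), dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else ocr_word
```

With this word list, a misread such as `fizio1ogie` (digit 1 instead of the letter l) is mapped back to `fiziologie`, while a string that resembles nothing in the list is returned as it came in. On a very poor scan too many words fall into the second case, which is why manual checking is still needed.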

Konstantin Tenzin

Posted 2009-11-01T22:33:33.173

Reputation: 149

Recognita OmniPage is by far the best OCR program I've ever used. I'm sure it will recognize Romanian text; it had no problem with my native Hungarian. You can download a trial version from the link and use it to convert your book. The full version is unfortunately pretty pricey ($499.99)...

Botond Balázs

Posted 2009-11-01T22:33:33.173

Reputation: 304

Well, for text recognition one usually searches for OCR (optical character recognition) programs. There is a variety of them around, so a simple Google search will do more good than I can here.

I didn't understand the last part, "recognize Romanian": do you mean it has to recognize the Romanian language, or be localized (translated) into Romanian? In the first case, I believe there will be no problem; in the second, I'm not so sure.

Also, if it is not a book by one of your countrymen, then there is a chance it has already been translated into English ... so if you have it as a PDF in Romanian, try searching for an English version ... the only problem is that it's, you know ... illegal (sometimes one doesn't have a choice, though).

Rook

Posted 2009-11-01T22:33:33.173

Reputation: 21 622

I mean it has to recognize the Romanian Font/Romanian Characters. Someone edited my post.. don't really know why. :| – ChristianM – 2009-11-02T09:35:32.023

I don't think you should have any problems with that (only for really badly scanned text, when it cannot decide whether something is a letter or a blob, then you'll maybe have to manually correct) - I've used a variety of software on the Croatian language (we have some weird characters in our alphabet) and it worked out fine. – Rook – 2009-11-02T13:20:31.057

OCR often uses some spellcheck to make up for scan errors. So, that spellcheck must support Romanian then. (Yes, some OCR yields better results than the original text, due to this spellcheck mechanism.) – Arjan – 2009-11-03T08:34:33.470

These characters are always tricky when using OCR software: ă, â, î, ş, ţ, Ă, Â, Î, Ş, Ţ. You'd be surprised how badly they come out when scanning a book. – alex – 2009-11-10T11:58:40.950

-1

Try PDFCubed.com. It's an online OCR service that makes creating a searchable text PDF easy. Scanned documents can be submitted via the web, email, or dropbox.

rlangner

Posted 2009-11-01T22:33:33.173

Reputation: 38