OCR that adds the generated text to the original PDF and DjVu files?

2

My OS is Ubuntu.

I found some applications that can OCR a PDF or DjVu file and generate a separate text file.

But how can I add the OCRed text to the original PDF or DjVu file, so that the text becomes selectable in the original file, as Adobe Acrobat does on Windows?

Tim

Posted 2011-05-07T19:59:34.563

Reputation: 12 647

Answers

2

For PDF, there is pdfsandwich:

pdfsandwich generates "sandwich" OCR pdf files, i.e. pdf files which contain only images (no text) will be processed by optical character recognition (OCR) and the text will be added to each page invisibly "behind" the images.

It's a two-step process:

  1. Add the OCR text to a new PDF (here I use the Tesseract OCR engine with French as the language):

    pdfsandwich -sloppy_text -tesseract /path/to/tesseractbin -lang fra ./original.pdf -o ./ocr.pdf

  2. Then convert the OCRed PDF to DjVu with:

    pdf2djvu -o ./ocr.djvu ./ocr.pdf
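The two steps above can be chained in a small shell function. This is only a sketch, assuming pdfsandwich and pdf2djvu are installed and on the PATH; the function name `ocr_to_djvu` and the `_ocr` output suffix are made up for illustration, and the language is selected with pdfsandwich's `-lang` option.

```shell
#!/bin/sh
# Sketch: run both steps for one input PDF.
# ocr_to_djvu is a hypothetical helper name, not part of either tool.

ocr_to_djvu() {
    local input="$1"
    local lang="${2:-fra}"            # Tesseract language code (French by default, as above)
    local base="${input%.pdf}"        # strip a trailing .pdf, if any
    local ocr_pdf="${base}_ocr.pdf"   # assumed naming scheme for the outputs
    local ocr_djvu="${base}_ocr.djvu"

    # Step 1: write a new PDF with an invisible OCR text layer.
    pdfsandwich -lang "$lang" "$input" -o "$ocr_pdf" || return 1

    # Step 2: convert the OCRed PDF to DjVu.
    pdf2djvu -o "$ocr_djvu" "$ocr_pdf" || return 1

    # Print the names of the two files that were produced.
    printf '%s\n' "$ocr_pdf" "$ocr_djvu"
}

# Run only when a file name is given, so the file can also be sourced.
if [ "$#" -ge 1 ]; then
    ocr_to_djvu "$@"
fi
```

Saved as, say, `ocr2djvu.sh`, it could be called as `sh ocr2djvu.sh original.pdf fra`. The function stops after step 1 if pdfsandwich fails, so pdf2djvu never runs on a missing file.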

meda beda

Posted 2011-05-07T19:59:34.563

Reputation: 23

2

I started a Bash project on GitHub to help convert PDF to PDF+OCR and DjVu+OCR. It's based on the reply by @meda-beda, plus some edits of my own.

It is a wrapper around pdfsandwich and pdf2djvu.

It was developed and tested under Ubuntu 12.10. I reckon there is still work to do on the options for tweaking the resulting file, which is sometimes bigger than the original.
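To see how much bigger a result is, the two file sizes can be compared. The helper below is a rough sketch and not taken from the project; `size_ratio` is a hypothetical name, and it only relies on `wc -c`.

```shell
#!/bin/sh
# size_ratio ORIGINAL RESULT
# Hypothetical helper: prints the size of RESULT as a percentage of ORIGINAL.

size_ratio() {
    local orig_size new_size
    orig_size=$(wc -c < "$1") || return 1   # size of the original in bytes
    new_size=$(wc -c < "$2") || return 1    # size of the OCRed result in bytes

    # Refuse to divide by zero for an empty original.
    [ "$orig_size" -gt 0 ] || return 1

    # Integer percentage; a value above 100 means the result is bigger.
    echo $(( new_size * 100 / orig_size ))
}
```

For example, `size_ratio original.pdf ocr.pdf` prints a number above 100 when the OCRed file came out larger than the input.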

Édouard Lopez

Posted 2011-05-07T19:59:34.563

Reputation: 220