How to create PDF with scanned pages but selectable text?

32

7

Today I received a PDF from our supplier that contained several printed and scanned pages with signatures etc. I opened it in Acrobat Reader DC, and to my surprise the text in the evidently scanned images could be selected and copied as text. See the screenshot:

PDF scanned with selectable text

There is evidently some OCR behind this since the copied text contains mistakes. But how is this possible? I have never seen this before, how can this be created?

Vojtěch Dohnal

Posted 2018-02-09T09:16:41.567

Reputation: 2 586

4

Packages like https://github.com/gkovacs/pdfocr allow this to happen for already existing image PDFs.

– exussum – 2018-02-10T09:06:33.503

How is this different from the results you get in "Batch-OCR many PDFs"?

– Dmitry Grigoryev – 2018-02-12T09:43:53.303

@DmitryGrigoryev I had never seen this type of PDF before, so I asked what it was. There is nothing about a printer's firmware OCR or OCRmyPDF in the answers; both the question and the answers are very different. I do not see anything duplicated except that both questions are about OCR and PDFs. – Vojtěch Dohnal – 2018-02-12T10:07:19.787

Well, I have never seen an OCR PDF which is different from what you have posted, that's why your question feels strange to me. – Dmitry Grigoryev – 2018-02-12T10:15:12.660

Answers

52

This has (contrary to some other answers here) most probably nothing to do with Acrobat at all.

Most (all?!) professional document scanners and most semi-professional ones will automatically perform OCR when you choose "Save as PDF" and have the "searchable" checkbox ticked in the settings. The cheaper "consumer grade" models will do the OCR on the attached PC, while typical network scanners do it internally.

The word "searchable" means nothing more and nothing less than that the scanner will perform OCR, then generate a page with the scanned bitmaps within, and overlay them with invisible characters from the OCR, each placed over the respective character on the bitmap.

That way, you can search, and also select, copy, and paste the "bitmap" as if by magic. It's no magic at all, however. In reality, you're just copying invisible text.
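The invisible overlay described above can be sketched in terms of PDF content-stream operators. This is only an illustration, not what any particular scanner firmware does: it assumes the scanned image is registered on the page as an XObject named /Im0 and a font as /F1 (both hypothetical resource names), and it relies on text rendering mode 3 (`3 Tr`), which the PDF specification defines as text that is neither filled nor stroked, so it is invisible but still selectable.

```python
def escape_pdf_string(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return (text.replace("\\", "\\\\")
                .replace("(", "\\(")
                .replace(")", "\\)"))


def invisible_word(text, x, y, font_size):
    """Return operators that place one invisible word at (x, y), in points."""
    return "\n".join([
        "BT",                               # begin text object
        f"/F1 {font_size:g} Tf",            # select font (hypothetical name /F1)
        "3 Tr",                             # rendering mode 3: invisible text
        f"1 0 0 1 {x:g} {y:g} Tm",          # position the text
        f"({escape_pdf_string(text)}) Tj",  # the OCR result for this word
        "ET",                               # end text object
    ])


def page_content(page_width, page_height, words):
    """Draw the scan over the whole page, then overlay the OCR words."""
    parts = [
        "q",                                           # save graphics state
        f"{page_width:g} 0 0 {page_height:g} 0 0 cm",  # scale image to page
        "/Im0 Do",                                     # paint the scan (hypothetical name /Im0)
        "Q",                                           # restore graphics state
    ]
    parts.extend(invisible_word(*word) for word in words)
    return "\n".join(parts)


# Two made-up OCR results on an A4-sized page (595 x 842 points).
stream = page_content(595, 842, [("Invoice", 72, 770, 12), ("(copy)", 130, 770, 12)])
print(stream)
```

A real generator would also have to embed the image and font objects and write the surrounding PDF file structure; the sketch only shows why selecting "the bitmap" actually selects text.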

The scanner may also do some additional magic, such as compositing the large image from many small tiles that are also reused. This results in a much smaller document than would otherwise be possible, but may also lead to funny surprises (not so funny if they happen to you!) such as the "Xerox alters your bills" story, which, ironically, can happen even when no OCR is done, depending on the firmware.

Damon

Posted 2018-02-09T09:16:41.567

Reputation: 4 002

Yes, this is most probably how they created it, I very much doubt they use full Adobe Acrobat. – Vojtěch Dohnal – 2018-02-09T13:46:32.393

We did it by placing all the text behind the scanned image, at the positions where the OCR reported each text node. – Thorbjørn Ravn Andersen – 2018-02-10T14:40:24.913
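The placement step described in this comment boils down to a coordinate conversion. Here is a rough sketch, assuming the OCR engine reports each word's box in image pixels with the origin at the top-left corner (common, but not stated in the comment), and that the PDF page is unrotated and uses the default coordinate system of 1/72 inch units with the origin at the bottom-left. The function name is made up for illustration.

```python
def pixel_box_to_pdf_points(box, image_height_px, dpi):
    """Convert an OCR box (left, top, right, bottom) in pixels, origin top-left,
    to (x, y, width, height) in PDF points, origin bottom-left."""
    left, top, right, bottom = box
    scale = 72.0 / dpi                      # PDF points per pixel
    x = left * scale
    y = (image_height_px - bottom) * scale  # flip the vertical axis
    width = (right - left) * scale
    height = (bottom - top) * scale
    return x, y, width, height


# An A4 page scanned at 300 dpi is roughly 2480 x 3508 pixels.
print(pixel_box_to_pdf_points((300, 150, 900, 210), 3508, 300))
```

The resulting x and y are where the text for that word would be positioned, and the height gives a reasonable first guess for the font size.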

10

But how is this possible?

Basically, a program performs OCR on the input file and then it places an invisible layer of text over the picture. Alternatively, it might also place a visible layer of text under the picture, giving the same effect.

When you select something, the picture doesn't matter because the text layer gets selected.

how can this be created?

There are several ways. Given that Acrobat has already been suggested, I will add some free options (and luckily you are not forced to have Windows to use them).

PDF-XChange Viewer

This is a native Windows program by Tracker Software. The freeware version runs fine under Wine if you use the 32-bit edition in a 32-bit prefix, therefore you can use it on Windows, macOS and Linux. In the last two cases, you would need PlayOnMac or PlayOnLinux respectively.

Here's a picture from this answer I left on Ask Ubuntu:

Screenshot of PDF-XChange Viewer under Wine

OCRmyPDF

This is a multiplatform program written in Python, based on Ghostscript, Tesseract and Unpaper. From the docs:

What OCRmyPDF does

OCRmyPDF analyzes each page of a PDF to determine the colorspace and resolution (DPI) needed to capture all of the information on that page without losing content. It uses Ghostscript to rasterize the page, and then performs OCR on the rasterized image to create an OCR “layer”. The layer is then grafted back onto the original PDF.

It can be easily installed on Debian and Ubuntu derivatives:

apt-get install ocrmypdf

Or on macOS:

brew tap jbarlow83/ocrmypdf
brew install ocrmypdf

On Windows you would need to use the Docker image. See the official docs for details.

Usage is very simple, and I suggest you use the optional -d (deskew) and -c (clean) parameters for better results. They straighten every page and clean up small dots/imperfections before the OCR process runs.

You can (and should) provide the language with -l.

Here's an example taken from this skewed document written in Italian:

Example for OCRmyPDF

The command I used was:

ocrmypdf -l ita -d -c input.pdf output.pdf
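If you want to run the same command from a script, something like the following Python sketch could work. It only wraps the command-line flags already shown above (-l, -d, -c); the helper names `build_ocrmypdf_command` and `run_ocr` are made up for this example, and it assumes `ocrmypdf` is on your PATH.

```python
import shutil
import subprocess


def build_ocrmypdf_command(input_pdf, output_pdf, language="ita",
                           deskew=True, clean=True):
    """Build the argument list for the ocrmypdf command line tool."""
    command = ["ocrmypdf", "-l", language]
    if deskew:
        command.append("-d")  # straighten skewed pages
    if clean:
        command.append("-c")  # clean up specks before OCR
    command += [input_pdf, output_pdf]
    return command


def run_ocr(input_pdf, output_pdf, **options):
    """Run ocrmypdf, failing early if it is not installed."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    subprocess.run(build_ocrmypdf_command(input_pdf, output_pdf, **options),
                   check=True)


command = build_ocrmypdf_command("input.pdf", "output.pdf")
print(" ".join(command))  # prints: ocrmypdf -l ita -d -c input.pdf output.pdf
```

Calling `run_ocr("input.pdf", "output.pdf", language="eng")` would then process an English document, provided the matching Tesseract language data is installed.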

Online tools

There are a few online tools that do the same. Notably, PDF24 hosts a free web-based version of OCRmyPDF that can be used without limitations.

Andrea Lazzarotto

Posted 2018-02-09T09:16:41.567

Reputation: 772

Thank you for this answer. I tried OCRmyPDF and it worked very well, but unfortunately the language support that I need is not yet mature, so the results were not very usable. – Vojtěch Dohnal – 2018-02-12T13:43:06.733

@VojtěchDohnal which language are you interested in? Did you install the relevant language pack for Tesseract? See the list here: https://www.macports.org/ports.php?by=name&substr=tesseract-

– Andrea Lazzarotto – 2018-02-12T14:14:29.793

4

This is possibly because of an Acrobat OCR feature:

Acrobat can recognize text in any PDF or image file in dozens of languages. All you have to do is open the scanned document or image that you'd like to OCR, then click the blue Tools button in the top right of the toolbar. In that sidebar, select the Recognize Text tab, then click the In This File button.

...

With the text recognized, you can now markup the PDF using all the normal markup tools — you can highlight, cross out text, and more. You can even copy the text with the detected formatting, though that's often less accurate than the text recognition itself.

duDE

Posted 2018-02-09T09:16:41.567

Reputation: 14 097

This works in Reader as well? Other documents do not work this way for me... – Vojtěch Dohnal – 2018-02-09T09:26:30.467

I fear not, but take a look at this article: https://pdf.wondershare.com/pdf-software-comparison/adobe-reader-ocr.html

– duDE – 2018-02-09T09:41:58.853

3

From Adobe's website

Recognize text in a Scanned PDF file

When you scan paper documents to PDF, you’re really just taking pictures of those documents. That’s great for photos and other printed images, but what if you’ve got a 200-page document in which you need to find a particular word or phrase? Use Acrobat to recognize the text in that scanned file, making the text content searchable and usable.

  1. With your scanned document open in Acrobat, open up the Tools pane and expand the Text Recognition panel. If you can’t see “Text Recognition” in the Tools pane, you can add it by selecting the menu in the upper right corner (image below – see where that little red arrow is pointing? Click there).
  2. Click on “In This File” to scan the document you’ve got open. You can just accept the default settings and click “Okay” when the Recognize Text box pops up. Acrobat will convert the image into usable text; to test it out, just try editing a word or sentence with the Content Editing panel. Isn’t that awesome!?

Máté Juhász

Posted 2018-02-09T09:16:41.567

Reputation: 16 807

Thanks but I have just opened the PDF in Reader DC and did nothing special with it, other PDF documents with scanned pages do not work this way automatically... – Vojtěch Dohnal – 2018-02-09T09:31:04.843

5

OCR was done before you received the file. When text is recognized, it gets saved together with the PDF. – Máté Juhász – 2018-02-09T09:37:24.923

@VojtěchDohnal You probably need the full Acrobat, not just the Reader. – Thorbjørn Ravn Andersen – 2018-02-10T14:41:04.153