But how is this possible?
Basically, a program performs OCR on the input file and then it places an invisible layer of text over the picture. Alternatively, it might also place a visible layer of text under the picture, giving the same effect.
When you select something, the picture doesn't matter because the text layer gets selected.
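To make the "invisible layer" idea concrete, here is a minimal sketch of the PDF drawing instructions such a layer boils down to. It relies on the PDF text rendering mode operator (`3 Tr`), which draws text with neither fill nor stroke, so it stays selectable but unseen. The function names, the `/F1` font resource name, and the input tuple format are assumptions made for illustration. This is not how any particular OCR program is implemented, and a real tool would also have to embed the result in a page that declares the font.

```python
def pdf_escape(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_layer(words, font="/F1"):
    """Build a PDF content-stream fragment that draws words invisibly.

    `words` is a list of (text, x, y, font_size) tuples in PDF points,
    i.e. hypothetical OCR results already mapped to page coordinates.
    `font` is an assumed resource name that the page would have to define.
    """
    ops = ["BT", "3 Tr"]  # begin text object; render mode 3 = invisible
    for text, x, y, size in words:
        ops.append(f"{font} {size} Tf")         # select font and size
        ops.append(f"1 0 0 1 {x} {y} Tm")       # place the word at (x, y)
        ops.append(f"({pdf_escape(text)}) Tj")  # show the (invisible) text
    ops.append("ET")  # end text object
    return "\n".join(ops)


# Example: one recognized word positioned over the scanned image.
print(invisible_text_layer([("Hello", 72, 700, 12)]))
```

A viewer that draws the scanned image and then this fragment shows only the picture, yet selecting or searching works on the hidden text.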
How can this be created?
There are several ways. Given that Acrobat has already been suggested, I will add some free options (and luckily you are not forced to have Windows to use them).
PDF-XChange Viewer
This is a native Windows program by Tracker Software. The freeware version runs fine under Wine if you use the 32-bit edition in a 32-bit prefix, so you can use it on Windows, macOS and Linux. On the last two, you would need PlayOnMac or PlayOnLinux respectively.
Here's a picture from this answer I left on Ask Ubuntu:
![Screenshot of PDF-XChange Viewer under Wine](../../I/static/images/4c36627faa7933a0fac066239596191187dc7065c787c177f8548f2b70043854.png)
OCRmyPDF
This is a multiplatform program written in Python, based on Ghostscript, Tesseract and Unpaper. From the docs:
What OCRmyPDF does
OCRmyPDF analyzes each page of a PDF to determine the colorspace and
resolution (DPI) needed to capture all of the information on that page
without losing content. It uses Ghostscript to rasterize the page, and
then performs on OCR on the rasterized image to create an OCR “layer”.
The layer is then grafted back onto the original PDF.
It can be easily installed on Debian and Ubuntu derivatives:
apt-get install ocrmypdf
Or on macOS:
brew tap jbarlow83/ocrmypdf
brew install ocrmypdf
On Windows you would need to use the Docker image. See the official docs for details.
Usage is very simple and I suggest you use the optional -d (deskew) and -c (clean) parameters for better results. It will straighten every page and clean up small dots/imperfections before running the OCR process.
You can (and should) provide the language with -l.
Here's an example taken from this skewed document written in Italian:
![Example for OCRmyPDF](../../I/static/images/6cd3fb9b534d7d6b5619c55bdf6835bfa2579b640728a33c3795880b901cc18e.png)
The command I used was:
ocrmypdf -l ita -d -c input.pdf output.pdf
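If you have many files, the same command can be scripted. The following is a small sketch, assuming `ocrmypdf` is installed and on your `PATH`. The helper names `build_command` and `ocr_folder` are made up for this example; only the `-l`, `-d` and `-c` flags come from the command above.

```python
import subprocess
from pathlib import Path


def build_command(src, dst, language="ita"):
    """Return the ocrmypdf invocation shown above as an argument list."""
    return ["ocrmypdf", "-l", language, "-d", "-c", str(src), str(dst)]


def ocr_folder(in_dir, out_dir, language="ita"):
    """Run ocrmypdf on every PDF in in_dir, writing results to out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(in_dir).glob("*.pdf")):
        # check=True raises CalledProcessError if ocrmypdf reports a failure
        subprocess.run(build_command(src, out / src.name, language), check=True)
```

Calling `ocr_folder("scans", "searchable")` would then process every PDF in a (hypothetical) `scans` folder and write the results to `searchable`.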
Online tools
There are a few online tools that do the same. Notably, PDF24 hosts a free web-based version of OCRmyPDF that can be used without limitations.
Packages like https://github.com/gkovacs/pdfocr allow this to happen for already existing image PDFs – exussum – 2018-02-10T09:06:33.503
How is this different from results you get by Batch-OCR many PDFs? – Dmitry Grigoryev – 2018-02-12T09:43:53.303
@DmitryGrigoryev I had never seen this type of PDF before, so I asked what it was. There is nothing about printer firmware OCR or OCRmyPDF in the answers there; both the question and the answers are very different. I do not see anything duplicate except that both questions are about OCR and PDFs. – Vojtěch Dohnal – 2018-02-12T10:07:19.787
Well, I have never seen an OCR PDF which is different from what you have posted, that's why your question feels strange to me. – Dmitry Grigoryev – 2018-02-12T10:15:12.660