Copy pdf text layer to another pdf

Suppose you've got 2 "scanned" pdf files.

1) Large, but without a text layer.
2) Smaller (with lower quality images), but with a correct text layer.

Both files contain the same page images, differing only in their compression.

The goal is to embed the text layer from the 2nd pdf into the 1st pdf.

"Just OCR the 1st file" is not a solution. I know Acrobat (and some other tools) can OCR without altering the image layer, but I'm not happy with their OCR quality.

So, I see two possible ways:

1) Export and import the text layer somehow.
2) Replace the images in the image layer somehow.

Concerning the 1st way, I've found nothing. Concerning the 2nd way, I've found two tools that come quite close, hocr2pdf and pdf2text, but as far as I understand they are still not enough. :(

PS: Use case:

I've just found another example where such an operation is useful in a systematic manner.

If you've got a scanned pdf-1 (without a text layer) with, say, "jpg" image compression, ABBYY FineReader gives you an OCR'd pdf, pdf-2. It will either be quite large, if you choose lossless image compression, or it will have image quality significantly lower than pdf-1. In many cases, the best choice is to keep the source image compression as-is and not recompress the image.

i3v

Posted 2013-11-24T09:58:32.293

Reputation: 970

How is it possible that the scanned document has better quality than the original one? Am I missing something? – gronostaj – 2013-11-24T10:36:49.927

@gronostaj Em.. I've not said so... There's pdf-1, which has better quality images and pdf-2, which has lower quality images but features text layer. Neither of them is "original". In fact, one may treat pdf-2 like OCR'd and compressed pdf-1. If so, I'd like to combine both, to obtain both text layer and higher quality images. – i3v – 2013-11-24T13:13:44.927

Answers

This answer on Stack Overflow has a solution. You can extract the text with coordinates from your pdf-2 using pdftotext -bbox or the Python package PDFMiner, then write this hidden text into a new PDF with the Python package ReportLab, then merge this hidden-text PDF with your pdf-1 using PDFtk. (There's a GUI for Windows at the webpage; the command line for Unix is called PDFtk Server now.)
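As a rough illustration of the first two steps, here is a minimal Python sketch. It assumes the text was exported with something like `pdftotext -bbox pdf-2.pdf words.html`, which writes XHTML containing `<page width=… height=…>` elements with `<word xMin=… yMin=… xMax=… yMax=…>` children. The names `Word`, `parse_bbox` and `write_hidden_text` are made up for this example, and using the box height as the font size with the built-in Helvetica font is only a crude guess (it will not cope with text outside Latin-1).

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float       # left edge, in PDF points
    y: float       # bottom edge, in PDF points (origin at bottom-left)
    height: float  # box height, used below as a rough font size


def parse_bbox(xhtml: str) -> list:
    """Return [(page_width, page_height, [Word, ...]), ...], one tuple per page."""
    pages = []
    for elem in ET.fromstring(xhtml).iter():
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XHTML namespace, if any
        if tag == "page":
            pages.append((float(elem.get("width")), float(elem.get("height")), []))
        elif tag == "word" and pages:
            _, page_h, words = pages[-1]
            y_min, y_max = float(elem.get("yMin")), float(elem.get("yMax"))
            # -bbox measures y from the top of the page; PDF measures from the bottom.
            words.append(Word(elem.text or "", float(elem.get("xMin")),
                              page_h - y_max, y_max - y_min))
    return pages


def write_hidden_text(pages, out_path: str) -> None:
    """Write every word as invisible text at its original position."""
    from reportlab.pdfgen import canvas  # third-party: pip install reportlab

    c = canvas.Canvas(out_path)
    for width, height, words in pages:
        c.setPageSize((width, height))
        for w in words:
            t = c.beginText(w.x, w.y)
            t.setFont("Helvetica", max(w.height, 1.0))  # assumption: size ~ box height
            t.setTextRenderMode(3)  # render mode 3 = invisible (still searchable)
            t.textOut(w.text)
            c.drawText(t)
        c.showPage()
    c.save()
```

Usage would be along the lines of `write_hidden_text(parse_bbox(open("words.html", encoding="utf-8").read()), "hidden-text.pdf")`, after which hidden-text.pdf can be merged with pdf-1 using PDFtk.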

Or, you could try directly merging pdf-1 and pdf-2 using PDFtk. Run pdftk pdf-2 multistamp pdf-1 output out.pdf. This will put each page of pdf-1 in front of the corresponding page of pdf-2, so you will only see the images from pdf-1 (assuming they are scans, and do not have a transparent background), but the hidden text from pdf-2 will be included. The downside is that the output may be very large, since it will include two copies of each page image. I have verified that this works, and the size of the output pdf is the sum of the sizes of the inputs.
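For a batch of files, the same command can be scripted. The following is a minimal sketch that assumes pdftk is installed and on the PATH; the function names and file names are placeholders chosen for this example.

```python
import subprocess


def multistamp_command(text_pdf: str, image_pdf: str, out_pdf: str) -> list:
    """Build the pdftk call that stamps image_pdf's pages on top of text_pdf's."""
    return ["pdftk", text_pdf, "multistamp", image_pdf, "output", out_pdf]


def merge(text_pdf: str, image_pdf: str, out_pdf: str) -> None:
    # check=True raises CalledProcessError if pdftk exits with an error.
    subprocess.run(multistamp_command(text_pdf, image_pdf, out_pdf), check=True)
```

Calling `merge("pdf-2.pdf", "pdf-1.pdf", "out.pdf")` is then equivalent to the command above, and can be put in a loop over pairs of files.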

Nick Matteo

Posted 2013-11-24T09:58:32.293

Reputation: 586

Thanks for the hint, PDFtk actually allowed me to solve this issue. By the way, the Windows “PDFtk Free” version is able to do the same thing, since it includes a “pdftk.exe” command-line tool as well. – i3v – 2014-03-15T13:18:36.963

To avoid the output pdf being as large as the two original pdfs combined, I’ve simply “optimized” the pdf with the text layer using Adobe Acrobat’s pdf optimization feature (File->Save As->Save as type->Adobe PDF Files, Optimized (*.pdf)->Settings->Images->[set lowest possible ppi, for all image types, 9 for me, and etc.]->OK->Save). This results in a relatively small pdf, so adding its size to the size of the original pdf is not a big deal. This workflow even works almost OK for pdfs with embedded comments (the issue is that highlighting becomes “not transparent”). – i3v – 2014-03-15T13:18:53.680

If it's an isolated case, LibreOffice + GIMP should do the job. First, use LibreOffice Draw to extract the high-quality scans. Then edit them with GIMP to remove the scanned text. Finally, add the images to the OCRed file on a lower layer.

But if you're going to do this as part of some routine, then you probably have a problem with your workflow.

gronostaj

Posted 2013-11-24T09:58:32.293

Reputation: 33 047

1) In fact, I don't need to "remove scanned text". The text layer should be below the image layer. 2) I'm able to replace page images one-by-one... The only question is: how do I automate this for a large number of pages?

I'm sorry my initial question was so misleading. I've just added a "PS" to the initial question; I hope it shows a possible application. – i3v – 2013-11-26T19:51:12.573