How do I extract text from a PDF that wasn't built with an index? It's all text, but I can't search or select anything. I'm running Kubuntu, and Okular doesn't have this feature.
Answer (score 26)
I have had success with the BSD-licensed Linux port of the Cuneiform OCR system.
No binary packages seem to be available, so you need to build it from source. Be sure to have the ImageMagick C++ libraries installed to have support for essentially any input image format (otherwise it will only accept BMP).
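For reference, the port builds with CMake; a typical out-of-source build looks like the sketch below. This is just the generic CMake procedure, so check the bundled README for the exact steps for your version:
mkdir builddir
cd builddir
cmake ..
make
sudo make install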
While it appears to be essentially undocumented apart from a brief README file, I've found the OCR results quite good. The nice thing about it is that it can output position information for the OCR text in hOCR format, so that it becomes possible to put the text back in the correct position in a hidden layer of a PDF file. This way you can create "searchable" PDFs from which you can copy text.
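For reference, hOCR is ordinary HTML with the page layout encoded in title attributes; a minimal, made-up fragment looks like this (element class names vary a bit by engine; tesseract, for example, emits ocrx_word):
<div class='ocr_page' title='bbox 0 0 2480 3508'>
  <span class='ocrx_word' title='bbox 393 484 535 523'>searchable</span>
</div>
Those bounding boxes are what let the recognized text be placed back at the right coordinates in the PDF.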
I have used hocr2pdf to recreate PDFs out of the original image-only PDFs and OCR results. Sadly, the program does not appear to support creating multi-page PDFs, so you might have to create a script to handle them:
#!/bin/bash
# Run OCR on a multi-page PDF file and create a new pdf with the
# extracted text in a hidden layer. Requires cuneiform, hocr2pdf, gs.
# Usage: ./dwim.sh input.pdf output.pdf
set -e
input="$1"
output="$2"
tmpdir="$(mktemp -d)"
# extract images of the pages (note: resolution hard-coded)
gs -SDEVICE=tiffg4 -r300x300 -sOutputFile="$tmpdir/page-%04d.tiff" -dNOPAUSE -dBATCH -- "$input"
# OCR each page individually and convert into PDF
for page in "$tmpdir"/page-*.tiff
do
    base="${page%.tiff}"
    cuneiform -f hocr -o "$base.html" "$page"
    hocr2pdf -i "$page" -o "$base.pdf" < "$base.html"
done
# combine the pages into one PDF
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$output" "$tmpdir"/page-*.pdf
rm -rf -- "$tmpdir"
Please note that the above script is very rudimentary. For example, it does not retain any PDF metadata.
@GökhanSever When I use your version, I get this error: "Tesseract Open Source OCR Engine v3.03 with Leptonica / OSD: Weak margin (0.00) for 571 blob text block, but using orientation anyway: 0 / /usr/bin/pdf2text: line 23: /tmp/tmp.XksXutALLp/page-0001.html: No such file or directory". Any idea what I'm doing wrong? – Wikunia – 2015-02-11T21:48:33.380
@Wikunia change $base.html to $base.hocr – David Milovich – 2018-07-20T23:37:03.657
Any ideas on improving this script to add a spell-checking stage that corrects errors from the recognition step? – Gökhan Sever – 2011-06-21T21:49:09.067
@Gökhan Sever, do you mean adding interactive spell-checking where the user is prompted for a replacement for misspelled/unknown words? I think you could do that by adding something like aspell check --mode=html "$base.html" in the script right after running cuneiform. – Jukka Matilainen – 2011-06-21T22:48:47.060
This is one solution. However, without seeing the whole context of the text, it is hard to make corrections. It would be nicer to see an interface built within ocrfeeder. – Gökhan Sever – 2011-06-22T00:22:04.647
By the way, I use tesseract for character recognition, replacing the cuneiform line with: tesseract "$page" "$base" hocr – Gökhan Sever – 2011-06-22T00:22:31.727
Small correction: the line for tesseract, at least for languages other than English (here e.g. German = deu), is: tesseract "$page" "$base" -l deu hocr (of course you have to remove the ). – Keks Dose – 2012-10-12T15:45:28.883
As I had problems with not-so-accurate PDFs, I changed the device in gs from "tiffg4" to "tiffgray", and the result was very good: gs -SDEVICE=tiffgray -r300x300 -sOutputFile="$tmpdir/page-%04d.tiff" -dNOPAUSE -dBATCH -- "$input" – None – 2013-02-02T08:35:54.233
Answer (score 15)
See if pdftotext will work for you. If it's not on your machine, you'll have to install the poppler-utils package:
sudo apt-get install poppler-utils
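Basic usage looks like this; the output filename is optional (it defaults to the input name with a .txt extension), and the -layout flag tries to preserve the original layout of the text:
pdftotext -layout input.pdf output.txt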
You might also find the pdf toolkit of use.
A full list of pdf software here on wikipedia.
Edit: Since you do need OCR capabilities, I think you'll have to try a different tack (i.e., I couldn't find a Linux pdf2text converter that does OCR).
Convert pdf to image
gs: The command below should convert a multi-page pdf to individual tiff files.
gs -SDEVICE=tiffg4 -r600x600 -sPAPERSIZE=letter -sOutputFile=filename_%04d.tif -dNOPAUSE -dBATCH -- filename
ImageMagick utilities: There are other questions on the SuperUser site about using ImageMagick that might help you do the conversion.
convert foo.pdf foo.png
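Note that ImageMagick rasterizes PDFs at a fairly low density by default (72 dpi), which is usually too coarse for OCR; raising it helps, for example:
convert -density 300 foo.pdf foo-%04d.png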
Convert image to text with OCR
Taken from Wikipedia's list of OCR software.
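For example, with tesseract (one of the engines on that list), the second argument is the output base name, so this writes filename_0001.txt for the first page image produced by the gs command above:
tesseract filename_0001.tif filename_0001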
Does this program also work for handwritten text documents? – Ivo Flipse – 2009-08-24T09:31:05.627
No, I don't think it has OCR capabilities. It can just extract the text embedded in the pdf. Man page: http://linux.die.net/man/1/pdftotext – None – 2009-08-24T10:50:59.800
Yeah, this works for pdf documents that already come with the text embedded. My case is exactly one where it doesn't. – Helder S Ribeiro – 2009-08-27T03:28:39.533
@obvio171 Added the best option I could find for getting OCR to work in your case. – None – 2009-08-27T06:53:37.203
Answer (score 13)
Google docs will now use OCR to convert your uploaded image/pdf documents to text. I have had good success with it.
They are using the OCR system that is used for the gigantic Google Books project.
However, it must be noted that only PDFs up to 2 MB in size will be accepted for processing.
Update
1. To try it out, upload a <2MB pdf to google docs from a web browser.
2. Right click on the uploaded document and click "Open with Google Docs".
Google Docs will convert the document to text and save the output as a new file of Google Docs type, with the same name, in the same folder.
This was really helpful :) I uploaded a 50 MB file yesterday and it worked. Looks like they've increased the size limit. – Gaurav – 2018-04-20T20:27:54.117
The answer is not really Ubuntu-specific but I want to really thank you: BRILLIANT solution! :) – Pitto – 2012-03-28T16:34:59.543
Answer (score 4)
The best and easiest way out there is to use pypdfocr; it doesn't change the pdf:
pypdfocr your_document.pdf
At the end you will have another file, your_document_ocr.pdf, the way you want it: with searchable text. The app doesn't change the quality of the image; it increases the size of the file a bit by adding the overlay text.
pypdfocr has not been supported since 2016, and I noticed some problems due to it not being maintained. ocrmypdf does a similar job and can be used like this:
ocrmypdf in.pdf out.pdf
To install:
pip install ocrmypdf
or
apt install ocrmypdf
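ocrmypdf also takes a few options worth knowing about; for example, to OCR a German-language document and straighten crooked scans (see ocrmypdf --help for the full list):
ocrmypdf --language deu --deskew in.pdf out.pdf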
Answer (score 3)
Geza Kovacs has made an Ubuntu package that is basically a script using hocr2pdf as Jukka suggested, but it makes things a bit faster to set up.
From Geza's Ubuntu forum post with details on the package...
Adding the repository and installing in Ubuntu
sudo add-apt-repository ppa:gezakovacs/pdfocr
sudo apt-get update
sudo apt-get install pdfocr
Running ocr on a file
pdfocr -i input.pdf -o output.pdf
GitHub repository for the code: https://github.com/gkovacs/pdfocr/
Answer (score 2)
PDFBeads works well for me. The thread “Convert Scanned Images to a Single PDF File” got me up and running. For a b&w book scan, put the page images into a new folder; then, from inside that folder, run
pdfbeads * > ../Output.pdf
This will put the collated, OCR'd PDF in the parent directory.
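Putting it together, a sketch of the full round trip might look like this; the gs rasterization step and the 300 dpi resolution are assumptions borrowed from the other answers here, since pdfbeads itself operates on the page images:
mkdir book
cd book
gs -sDEVICE=tiffgray -r300x300 -sOutputFile=page-%04d.tiff -dNOPAUSE -dBATCH -- ../scan.pdf
pdfbeads * > ../Output.pdf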
Answer (score 2)
Another script, using tesseract:
#!/bin/bash
# Run OCR on a multi-page PDF file and write the extracted text
# to a plain-text file. Requires tesseract, gs.
# Usage: ./pdf2ocr.sh input.pdf output.txt
set -e
input="$1"
output="$2"
tmpdir="$(mktemp -d)"
# extract images of the pages (note: resolution hard-coded)
gs -SDEVICE=tiff24nc -r300x300 -sOutputFile="$tmpdir/page-%04d.tiff" -dNOPAUSE -dBATCH -- "$input"
# OCR each page individually into a text file
for page in "$tmpdir"/page-*.tiff
do
    base="${page%.tiff}"
    tesseract "$page" "$base"
done
# combine the pages into one txt
cat "$tmpdir"/page-*.txt > $output
rm -rf -- "$tmpdir"
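For example, assuming the script above is saved as pdf2ocr.sh:
chmod +x pdf2ocr.sh
./pdf2ocr.sh scanned.pdf extracted.txt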
Answer (score 2)
Asprise OCR Library works on most versions of Linux. It can take PDF input and produce searchable PDF output.
It's a commercial package. Download a free copy of Asprise OCR SDK for Linux here and run it this way:
aocr.sh input.pdf pdf
Note: the trailing 'pdf' argument specifies the output format.
Disclaimer: I am an employee of the company producing above product.
This post states that the product can do it, which is a helpful hint that should be posted as a comment. It doesn't explain how to actually solve the problem, which is what answers should do. Can you expand your answer so that someone can see how to do the solution? – fixer1234 – 2015-03-12T05:42:02.840
Thanks @fixer1234, I've edited it to include the command. – Asprise Support – 2015-03-12T10:17:48.050
Answer (score 1)
Try Apache PDFBox to extract text content from a PDF file. If the text is embedded as images in the PDF, use ABBYY FineReader Engine CLI for Linux to extract it.
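PDFBox ships a standalone command-line app, so the text-extraction part is a single command; the jar file name depends on the version you download:
java -jar pdfbox-app-2.0.27.jar ExtractText input.pdf output.txt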
I found ABBYY OCR to be pretty pitiful, one of the least capable programs I've tried. It might be adequate with a really clean image of standard font text of typical body text size, with no mixed fonts, mixed sizes, complex layout, graphics, lines, etc. – fixer1234 – 2015-01-03T08:56:46.117
Yeah, I also tried it; it works fine. I have some doubts, can you help me? – Praveen Kumar K R – 2015-01-03T09:01:19.570
If what you need isn't covered in other answers here, the best thing to do is ask your own question. That will get it exposure to a lot of eyes. – fixer1234 – 2015-01-03T16:21:17.963
Also see: https://softwarerecs.stackexchange.com/q/3412/26815 – None – 2018-03-07T11:38:25.410