How to extract text from a PDF in a script on Linux?

23

5

On Linux, how can I extract text from a .pdf in which the text really is text, not a scanned image? I want something I can use on the command line or in a script, not interactively. (I don't want to convert to .tif and use OCR: the text is already available in the .pdf file, so why introduce inaccuracies from imperfect OCR?)

RobM

Posted 2010-11-05T19:30:38.543

Reputation: 231

similar question at askubuntu – Trevor Boyd Smith – 2018-05-01T12:18:30.093

Answers

25

pdftotext, which comes with poppler, will try to extract any text found in the PDF.
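A minimal sketch of how this might look in a script, assuming pdftotext from poppler is installed. The function names and the "INV-" pattern are made up for illustration (substitute whatever your documents actually contain); -layout and the trailing "-" (write to stdout) are standard pdftotext options.

```shell
#!/bin/sh

# Print the text of a PDF on stdout.
# "-layout" tries to preserve the physical layout of the page;
# the trailing "-" sends the text to stdout instead of a .txt file.
pdf_to_stdout() {
    pdftotext -layout "$1" -
}

# Read text on stdin and print each distinct invoice number once.
# ASSUMPTION: invoice numbers look like "INV-" followed by digits.
find_invoice_numbers() {
    grep -Eo 'INV-[0-9]+' | sort -u
}

# Typical use (invoice.pdf is a placeholder file name):
#   pdf_to_stdout invoice.pdf | find_invoice_numbers
```

Keeping the extraction and the matching in separate functions means the matching step can be tried on plain text before any PDF is involved.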

Ignacio Vazquez-Abrams

Posted 2010-11-05T19:30:38.543

Reputation: 100 516

Thanks for your quick response, Ignacio! I was already checking out the pdftotext that comes with xpdf (from foolabs.com) - your answer prompted me to take another look, and I got it working. Poppler appears to have evolved from xpdf, so I will take a look at that too. Thanks again! – RobM – 2010-11-05T19:56:16.507

10

Ignacio's answer is just fine. In fact, it'd be the first thing on my list. I'd also suggest the pdftohtml tool that comes with poppler, combined with pdfreflow if you want to try to reassemble the text into paragraphs, etc. (Of course, this will give you HTML output, but converting HTML to plain text can be done in many ways.)

Here are some other options too.

The ebook-convert command line tool from Calibre, which can convert PDFs to plain text (or RTF, or a number of ebook formats such as ePub).

podofotxtextract from PoDoFo

AbiWord can be called from the command line to convert between any formats it can import and export, and with the appropriate import plugin, this includes PDFs:

abiword --to=txt file.pdf

(In fairness, I think AbiWord and calibre both use the poppler libraries, but I'm not positive.)
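If the script has to run on machines where different tools are installed, one possible approach is to try them in order of preference. This is only a sketch: the helper names first_available and pdf_to_txt are hypothetical, and the ebook-convert line assumes Calibre picks the output format from the .txt extension.

```shell
#!/bin/sh

# Print the name of the first command in the argument list that is
# installed; return non-zero if none of them is found.
first_available() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            return 0
        fi
    done
    return 1
}

# Convert file.pdf to file.txt next to it, using whichever tool exists.
pdf_to_txt() {
    in=$1
    out=${in%.pdf}.txt
    case $(first_available pdftotext ebook-convert abiword) in
        pdftotext)     pdftotext "$in" "$out" ;;
        ebook-convert) ebook-convert "$in" "$out" ;;
        abiword)       abiword --to=txt "$in" ;;   # names the output itself
        *) echo "no PDF-to-text tool found" >&2; return 1 ;;
    esac
}

# Typical use (report.pdf is a placeholder file name):
#   pdf_to_txt report.pdf
```

The output of the three tools will not be identical, so anything that parses the text afterwards should be checked against each tool you expect to encounter.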

frabjous

Posted 2010-11-05T19:30:38.543

Reputation: 9 044

Thanks frabjous! In this case, I'm just extracting the text so that I can scan for specific strings (vendor names, account numbers) and patterns (invoice numbers and dates) - so no need to reformat or redisplay it. I appreciate the corroboration and the alternatives - and I'm sure others will too! -- Rob – RobM – 2010-11-17T17:15:24.280