Cleaning up pdftotext font issues

I'm using pdftotext to make an ASCII version of a PDF document (made with LaTeX), because collaborators prefer a simple document in MS Word.

The plain text version I see looks good, but upon closer inspection the f character seems to be frequently mis-converted depending on what characters follow. For example, fi and fl often seem to become one special character, which I will try to paste here: ﬁ and ﬂ.

What is the best way to clean up the output of pdftotext? I am thinking sed might be the right tool, but I'm not sure how to detect these special characters.

user31752

Posted 2010-12-09T23:06:55.840

fl, fi, ff, ffl, and ffi are common typographic ligatures, commonly replaced by a single character (and definitely with TeX): http://en.wikipedia.org/wiki/Typographic_ligature#Computer_typesetting - perhaps you just need to check that the font you're viewing the output in has them, and that the encoding is right.

– frabjous – 2010-12-10T03:28:34.073

oh, and you mean pdftotext from poppler, right, not pdftotex ? – frabjous – 2010-12-10T03:28:54.900

Do you have the original TeX source? Why not use, e.g., latex2rtf or oolatex (from TeX4ht) to generate a Word Processor file for the Word junkies? Compiling to PDF and then converting to plain text seems like a very weird route for conversion. – frabjous – 2010-12-10T03:40:55.310

Oh, and if you DO want to convert PDF to plain text, consider using ebook-convert from calibre (http://calibre-ebook.com) rather than pdftotext. It allows plain text output (and a variety of other formats), and handles ligatures for you.

– frabjous – 2010-12-10T03:43:15.903

I did mean pdftotext. Typo fixed. I have original TeX source, but latex2rtf and oolatex do not work as well as pdftotext. I use additional packages like siunitx and glossaries, and therefore it seems like going via the PDF is the best solution. I wish there were a better way. – None – 2010-12-10T18:06:40.283

Thanks for the ebook-convert suggestion, that seems to work better than pdftotext. – None – 2010-12-10T18:07:14.603

Answers

By default, pdftotext outputs Unicode (UTF-8) data. Ligatures such as "fi" and "fl" each exist as a single Unicode character, so if your terminal or text editor doesn't support UTF-8 (or the font lacks those glyphs), they will appear strangely, as you have noticed.

The simple fix is to tell pdftotext to output ASCII instead of Unicode:

pdftotext -enc ASCII7 input.pdf output.txt

This should produce clean ASCII output, removing your need to clean it up manually afterwards.
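If you want to confirm that nothing outside ASCII slipped through, here is a minimal sketch that counts the non-ASCII bytes on stdin. The function name count_non_ascii is made up for this example; it only relies on standard tr and wc.

```shell
#!/bin/sh
# count_non_ascii: hypothetical helper name, not part of pdftotext.
# Reads stdin and prints how many bytes fall outside the 7-bit ASCII range.
count_non_ascii() {
    # Delete every byte from 0x00 to 0x7F, then count what is left.
    # LC_ALL=C makes tr work on raw bytes rather than multibyte characters.
    # The final tr strips the padding some wc implementations add.
    LC_ALL=C tr -d '\000-\177' | wc -c | tr -d ' '
}
```

Something like `pdftotext -enc ASCII7 input.pdf - | count_non_ascii` should then print 0 (the `-` sends the text to stdout). A single leftover ligature such as ﬁ would show up as 3, since it is three bytes in UTF-8.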

davidg

Posted 2010-12-09T23:06:55.840

Reputation: 389

This solution will also not work if you actually need Unicode characters in your output. – amenthes – 2018-08-22T11:31:29.330

Thanks. I found the ebook-convert suggestion above to be the best. Your advice might improve the default behavior of pdftotext, but I think my terminal does support UTF-8, and ebook-convert seems to handle superscripts and other things better. – None – 2011-01-11T16:00:21.390

Assuming you're on some kind of Unix-based system, you could run this on the output of pdftotext:

sed -i -e 's/ﬃ/ffi/g' -e 's/ﬁ/fi/g' -e 's/ﬀ/ff/g' -e 's/ﬂ/fl/g' -e 's/ﬄ/ffl/g' output.txt

That should replace the ligatures with the individual letters they break into. (See my comments above for what ligatures have to do with this.)

I tested that on a text file generated through pdftotext from a LaTeX-generated PDF, and it worked fine. But if the LaTeX used a nonstandard encoding or a font with additional ligatures, there may be more to do.

You'll probably want to make sure the font you're using in your terminal has characters for the f-series ligatures. DejaVu Sans Mono is a good choice.
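If you'd rather not edit the file in place (the -i option behaves differently between GNU and BSD sed), the same substitutions can be wrapped in a small filter. This is only a sketch: expand_ligatures is a name invented here, and it assumes the script is saved as UTF-8. It also covers the two st-ligatures, U+FB05 and U+FB06, which sit next to the f-ligatures in Unicode.

```shell
#!/bin/sh
# expand_ligatures: hypothetical helper name, not part of pdftotext or sed.
# Reads text on stdin and writes it to stdout with the Latin ligatures
# U+FB00 to U+FB06 replaced by the plain letters they stand for.
expand_ligatures() {
    sed -e 's/ﬀ/ff/g' \
        -e 's/ﬁ/fi/g' \
        -e 's/ﬂ/fl/g' \
        -e 's/ﬃ/ffi/g' \
        -e 's/ﬄ/ffl/g' \
        -e 's/ﬅ/st/g' \
        -e 's/ﬆ/st/g'
    # U+FB05 is "long s + t"; it is written as plain "st" here.
}
```

Usage would be along the lines of `pdftotext input.pdf - | expand_ligatures > output.txt`, which skips the intermediate file entirely.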

frabjous

Posted 2010-12-09T23:06:55.840

Reputation: 9 044

In case your terminal is not UTF-8 (for example Windows cmd.exe), you can also do this with the byte representation (the \xHH escapes need GNU sed): sed -e 's/\xEF\xAC\x80/ff/g' -e 's/\xEF\xAC\x81/fi/g' -e 's/\xEF\xAC\x82/fl/g' -e 's/\xEF\xAC\x83/ffi/g' -e 's/\xEF\xAC\x84/ffl/g' -e 's/\xEF\xAC\x85/st/g' -e 's/\xEF\xAC\x86/st/g'. – amenthes – 2018-08-22T12:46:42.453
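
The byte sequences in that comment are just the UTF-8 encodings of U+FB00 through U+FB06. If you want to check one yourself, or look up the bytes for some other character that pdftotext produced, a rough sketch using od follows. The name utf8_hex is invented for this example.

```shell
#!/bin/sh
# utf8_hex: hypothetical helper name.
# Prints the bytes of its first argument as lowercase hex, with no separators.
utf8_hex() {
    # od -An -tx1 dumps each byte as two hex digits without the offset column;
    # tr then removes the spaces and newlines od puts between them.
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}
```

For instance, `utf8_hex 'ﬁ'` prints efac81 on a UTF-8 system, which corresponds to the EF AC 81 pattern in the sed command.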