I'm using pdftotext to make an ASCII version of a PDF document (made with LaTeX), because collaborators prefer a simple document in MS Word. The plain text version I see looks good, but upon closer inspection the f character seems to be frequently mis-converted depending on what characters follow. For example, fi and fl often seem to become one special character, which I will try to paste here: ﬁ and ﬂ.
What is the best way to clean up the output of pdftotext? I am thinking sed might be the right tool, but I'm not sure how to detect these special characters.
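A minimal sketch of the kind of sed substitutions that could work here, assuming pdftotext emitted the Unicode ligature code points (U+FB00 through U+FB04) and that you are running GNU sed in a UTF-8 locale; the filenames `input.txt` and `output.txt` are placeholders:

```shell
# Replace the common Unicode ligature characters with their plain ASCII
# letter sequences. Assumes a UTF-8 locale and that the converter emitted
# the ligature code points U+FB00..U+FB04.
sed -e 's/ﬀ/ff/g' \
    -e 's/ﬁ/fi/g' \
    -e 's/ﬂ/fl/g' \
    -e 's/ﬃ/ffi/g' \
    -e 's/ﬄ/ffl/g' \
    input.txt > output.txt
```

To see which of these characters actually occur in your file first, something like `grep -o '[ﬀﬁﬂﬃﬄ]' input.txt | sort | uniq -c` should list them with counts.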
fl, fi, ff, ffl, and ffi are common typographic ligatures, commonly replaced by a single character (and definitely with TeX): http://en.wikipedia.org/wiki/Typographic_ligature#Computer_typesetting – perhaps you just need to check that the font you're viewing the output in has them, and that the encoding is right. – frabjous – 2010-12-10T03:28:34.073

Oh, and you mean pdftotext from poppler, right, not pdftotex? – frabjous – 2010-12-10T03:28:54.900

Do you have the original TeX source? Why not use, e.g., latex2rtf or oolatex (from TeX4ht) to generate a word-processor file for the Word junkies? Compiling to PDF and then converting to plain text seems like a very weird route for conversion. – frabjous – 2010-12-10T03:40:55.310
Oh, and if you DO want to convert PDF to plain text, consider using ebook-convert from calibre (http://calibre-ebook.com) rather than pdftotext. It allows plain text output (and a variety of other formats), and handles ligatures for you. – frabjous – 2010-12-10T03:43:15.903

I did mean pdftotext; typo fixed. I have the original TeX source, but latex2rtf and oolatex do not work as well as pdftotext. I use additional packages like siunitx and glossaries, and therefore it seems like going via the PDF is the best solution. I wish there were a better way. – None – 2010-12-10T18:06:40.283

Thanks for the ebook-convert suggestion; that seems to work better than pdftotext. – None – 2010-12-10T18:07:14.603
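As an alternative to enumerating each ligature in sed by hand, Unicode NFKC compatibility normalization decomposes the ligature code points (U+FB00–U+FB04) back into plain letter sequences. A sketch invoking Python from the shell, assuming `python3` is available; `input.txt` and `output.txt` are placeholder filenames:

```shell
# NFKC compatibility normalization maps ligature code points such as
# U+FB01 (ﬁ) back to their plain-letter sequences, along with other
# compatibility characters.
python3 -c 'import sys, unicodedata; sys.stdout.write(unicodedata.normalize("NFKC", sys.stdin.read()))' < input.txt > output.txt
```

Note that NFKC also normalizes other compatibility characters (superscripts, fraction glyphs, and so on), which may or may not be what you want for a whole document.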