Is it possible to remove ligatures from copied text?

I have a few PDFs that contain ligatures in the text (e.g., ff is combined into a single character, ﬀ).

Is there an easy way to remove them when copying the text from the PDF? (i.e., when I paste, I'd like the ﬀ to be pasted as ff).

I copy a lot of text from these PDFs into answers on Stack Overflow and I find the ligatures at best obnoxious (ok, I admit, I'm really picky :-P); the ligatures also do not show up correctly when copied into other places (e.g., if I copy them into Notepad, they show up as blocks).

I cannot modify the PDFs.

I use both Adobe Acrobat Reader and Foxit Reader, but I'd be open to trying a new PDF reader.

James McNellis

Posted 2010-07-18T19:54:42.183

Reputation: 261

Answers

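In Python this would be:

import unicodedata
# U+FB00 is the "ff" ligature; NFKD splits it into "f" + "f".
# Encoding to ASCII with 'ignore' then drops any characters that are still non-ASCII.
unicodedata.normalize('NFKD', u'\uFB00').encode('ascii', 'ignore')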

You could combine this with pyPdf to read the PDF files.
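One caveat with the snippet above is that the ASCII step also discards accented letters and other non-ASCII characters. A minimal sketch that expands only the Latin ligature characters might look like the following; the function name `expand_ligatures` is made up for illustration, and the range U+FB00 to U+FB06 is assumed to cover the ligatures of interest.

```python
import unicodedata

# Latin ligatures in the Alphabetic Presentation Forms block (U+FB00..U+FB06).
LIGATURES = {chr(cp) for cp in range(0xFB00, 0xFB07)}

def expand_ligatures(text):
    """Expand only Latin ligature characters, leaving all other text untouched."""
    return "".join(
        # NFKD turns a ligature into its component letters.
        unicodedata.normalize("NFKD", ch) if ch in LIGATURES else ch
        for ch in text
    )
```

For example, `expand_ligatures("e\ufb03cient")` gives back the plain spelling with a separate f, f, and i, while a character such as é passes through unchanged.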

SiggyF

Posted 2010-07-18T19:54:42.183

Reputation: 266

The Evince reader seemed to decode ligatures when I tested it.

By the way, for pdflatex documents you can put this in the preamble to display ligatures in the PDF but have them copy as individual characters:

\input{glyphtounicode.tex}
\pdfgentounicode=1 %

till

Posted 2010-07-18T19:54:42.183

Reputation: 21

One possibility would be to use your favorite text editor and simply replace them.

Another way would be to write a script that uses sed. ~~...but that would be *NIX systems only, I fear.~~
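For anyone without sed at hand, the same search-and-replace idea can be sketched in Python with an explicit table. This is only an illustration: the names `LIGATURE_TABLE` and `replace_ligatures` are hypothetical, and the table lists just the five f-ligatures mentioned in this thread.

```python
# Explicit replacement table, the Python analogue of a list of sed "s/x/y/g" rules.
LIGATURE_TABLE = str.maketrans({
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
})

def replace_ligatures(lines):
    """Yield each input line with the ligatures above spelled out."""
    for line in lines:
        yield line.translate(LIGATURE_TABLE)
```

Passing `sys.stdin` as `lines` and writing the results to `sys.stdout` should make this usable as a filter in a pipeline, much like sed.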

Bobby

Posted 2010-07-18T19:54:42.183

Reputation: 8 534

GnuWin32 and you have sed on Windows. – mbq – 2010-07-18T21:18:15.267

@mbq: It's also included in that? very good. Thx. – Bobby – 2010-07-18T21:26:00.070

My approach was simply to copy and paste from the PDF into Notepad (to remove any formatting) and then from Notepad into Microsoft Word.

In Word, all the ligatures show up in a different font from the rest of the text.

I use find and replace for each of them (using special codes such as ^l for a manual line break and ^m for a manual page break; the full list is easy to find online) and replace each one with the correct letters.

In four or five steps I cover all the possibilities quite quickly. It is useful to remove extra paragraph breaks too (^P).

Gentili Giuliano

Posted 2010-07-18T19:54:42.183

Reputation: 1

I answered a similar question in more depth: Why does the text `fi` get cut when I copy from a PDF or print a document?

You can replace the "broken" words in the copied text if you have a mapping from broken words to original words. I wrote a script to generate this mapping by removing ligatures from words and checking whether the resulting word is unique. For my dictionary of English words, 99.5% of all possible broken words are replaceable, and 92.3% of words that contain a ligature sequence (ff, fi, fl, ffi, or ffl) can be recovered. The difference between these two percentages is due to the surprisingly large number of legitimate words that are created by removing ligatures from other legitimate words (like butterfly --> buttery, fluffs --> us, and misfits --> mists).

Here's a CSV of guaranteed-replaceable "broken" words (and the words they used to be): http://www.filedropper.com/brokenligaturewordfixes
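The mapping generation described above could be sketched roughly as follows. This is not the original script. It assumes a broken word counts as replaceable only when exactly one dictionary word produces it and the broken form is not itself a dictionary word, and the name `build_fix_map` is made up for illustration.

```python
import re
from collections import defaultdict

# Longest sequences first so that "ffi" is not consumed as "ff" followed by "i".
LIGATURE_RE = re.compile(r"ffi|ffl|ff|fi|fl")

def build_fix_map(words):
    """Map each unambiguous broken word to the single word it came from."""
    vocabulary = set(words)
    candidates = defaultdict(set)
    for word in vocabulary:
        # Simulate a copy that drops every ligature sequence from the word.
        broken = LIGATURE_RE.sub("", word)
        if broken != word:
            candidates[broken].add(word)
    return {
        broken: next(iter(originals))
        for broken, originals in candidates.items()
        # Assumed rule: skip ambiguous results and results that are real words.
        if len(originals) == 1 and broken not in vocabulary
    }
```

With a word list containing both butterfly and buttery, the pair is left out of the map, which matches the unrecoverable cases mentioned above.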

Jan Van Bruggen

Posted 2010-07-18T19:54:42.183

Reputation: 91

It's great that you're offering the file. Realistically, though, nobody with common sense would download an unknown file (especially from a brand new user). Don't take it personally if the file doesn't get much traffic. It doesn't mean your efforts aren't appreciated. – fixer1234 – 2015-08-28T08:00:44.740

Yeah, I understand. I wish there was a simple way to verify links like that, or even just to guarantee the file type. Thanks! – Jan Van Bruggen – 2015-08-28T16:35:38.747