Why does the text `fi` get cut when I copy from a PDF or print a document?


When I copy text from an Adobe Reader PDF file that contains

Define an operation

I instead see

Dene an operation

when I paste it. Why does this happen?

How can I remedy this annoying problem?

I've also seen this happen in the past when printing a Microsoft Office Word file to my printer.

Tamara Wijsman

Posted 2012-01-06T15:45:37.360

Reputation: 54 163

Answers

13

This sounds like a font issue. The PDF is probably using the OpenType fi ligature in the word define, and the current font of the destination application is missing that glyph.

I don't know if there's an easy way to get Acrobat to decompose the ligature on copy.
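If the ligature at least survives the copy as the single Unicode character U+FB01 (rather than being dropped entirely, as in your case), you can decompose it after pasting. Here's a minimal Python sketch using standard compatibility normalization, which maps the ligature characters ff, fi, fl, ffi, and ffl back to plain letter sequences:

import unicodedata

pasted = 'De\ufb01ne an operation'  # the "fi" arrived as the U+FB01 ligature
fixed = unicodedata.normalize('NFKC', pasted)
print(fixed)  # -> "Define an operation"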

Your printing problems are probably also font-related. Something is probably allowing the printer to substitute the document's font with one of its own built-in fonts, and the printer's version of the font is also missing that particular glyph. You'd have to tell Windows to always download fonts to the printer to work around this problem.

Another possibility when printing: UniScribe may not be enabled. MS KB 2642020 talks about this and some possible workarounds (namely, using RAW rather than EMF printing). Though the context is slightly different from your specific problem, the cause may be the same and the same workarounds may apply.

afrazier

Posted 2012-01-06T15:45:37.360

Reputation: 21 316

@afrazier, the solution you wrote in your comment beginning "From an app's print dialog:" worked for me. I suggest putting that text into your answer. (I could edit it, but I think the decision should be up to you.) – Alan – 2018-07-20T16:06:43.063

Interesting about the ligatures; I wonder if it can somehow be configured to behave properly. Perhaps I could look at how other PDF readers behave. Where exactly do I configure it so that the fonts get sent to the printer? – Tamara Wijsman – 2012-01-06T18:13:34.530

From an app's print dialog: Click Properties (or Preferences, depending on the dialog version) for the printer, make sure you're on the Layout or Quality tabs, click the Advanced button. In the Graphic group, change the TrueType Font option to Download as Softfont. This covers most PostScript printers and printers using Windows built-in dialogs (I think), but other drivers may have things moved around, or flat out missing. – afrazier – 2012-01-06T18:25:36.633

You may find MS KB 2642020 of some use. I've edited my answer with that information. – afrazier – 2012-01-10T19:41:32.297

Thanks for describing the problem. I haven't tried to solve this yet, but I will surely try when I encounter a printing problem again. I guess one of the two solutions would surely solve this very specific problem... :) – Tamara Wijsman – 2012-01-10T21:30:48.693

9

You can replace most of these "broken" words with the originals. You can safely replace a word if:

  • like dene or rey, it is not a real word
  • like define or firefly, there is only one way to re-add ligature sequences (ff, fi, fl, ffi, or ffl) and make a real word

Most ligature problems fit these criteria. However, you cannot replace:

  • us because it is a real word, even though it might originally have been fluffs
    • also affirm, butterfly, fielders, fortifies, flimflam, misfits...
  • cus because it could become either cuffs or ficus
    • also stiffed/stifled, rifle/riffle, flung/fluffing...
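To make these criteria concrete, here is a small sketch of the single-ligature case (my own illustration, separate from the full script below): insert each ligature sequence at every position of a broken word and keep the candidates that are dictionary words. A broken word is safely fixable exactly when one candidate survives; words like firefly that lost multiple ligatures need repeated insertions, which the full script handles.

LIGATURES = ('ffi', 'ffl', 'ff', 'fi', 'fl')

def candidate_fixes(broken_word, dictionary):
    """Re-insert one ligature sequence at every position, keep real words."""
    candidates = set()
    for position in range(len(broken_word) + 1):
        for ligature in LIGATURES:
            candidate = (broken_word[:position] + ligature +
                         broken_word[position:])
            if candidate in dictionary:
                candidates.add(candidate)
    return candidates

words = {'define', 'cuffs', 'ficus'}
print(candidate_fixes('dene', words))  # {'define'}: safe to replace
print(candidate_fixes('cus', words))   # {'cuffs', 'ficus'}: ambiguous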

In this 496-thousand-word English dictionary, there are 16055 words that contain at least one ff, fi, fl, ffi, or ffl, which turn into 15879 words when their ligatures are removed. 173 of those missing words collided like cuffs and ficus, and the last 3 are because that dictionary contains the words ff, fi, and fl.

790 of these "ligature-removed" words are real words, like us, but 15089 are broken words. 14960 of the broken words can be safely replaced with the original word, which means 99.1% of the broken words are fixable and 93.2% of the original words that contain a ligature can be recovered after copy-pasting a PDF. 6.8% of words containing ligature sequences are lost to the collisions (cus) and sub-words (us), unless you pick some way (word/document context?) to choose the best replacement for each of the words that don't have a guaranteed replacement.

Below is my Python script that generated the above statistics. It expects a dictionary text file with one word per line. At the end it writes a CSV file that maps fixable broken words to their original words.

Here's a link to download the CSV: http://www.filedropper.com/brokenligaturewordfixes Combine this mapping with something like a regex replacement script in order to replace most of the broken words; a sketch of that step follows the script below.

import csv
import itertools
import operator
import re


dictionary_file_path = 'dictionary.txt'
broken_word_fixes_file_path = 'broken_word_fixes.csv'
ligatures = 'ffi', 'ffl', 'ff', 'fi', 'fl'


with open(dictionary_file_path, 'r') as dictionary_file:
    dictionary_words = list(set(line.strip()
                                for line in dictionary_file.readlines()))
# The membership tests below run millions of times; a set makes them O(1)
dictionary_word_set = set(dictionary_words)


broken_word_fixes = {}
ligature_words = set()
ligature_removed_words = set()
broken_words = set()
multi_ligature_words = set()


# Find broken word fixes for words with one ligature sequence
# Example: "dene" --> "define"
words_and_ligatures = list(itertools.product(dictionary_words, ligatures))
for i, (word, ligature) in enumerate(words_and_ligatures):
    if i % 50000 == 0:
        print('1-ligature words {percent:.3g}% complete'
              .format(percent=100 * i / len(words_and_ligatures)))
    for ligature_match in re.finditer(ligature, word):
        if word in ligature_words:
            multi_ligature_words.add(word)
        ligature_words.add(word)
        if word == ligature:
            break
        # Skip words that contain a larger ligature
        if (('ffi' in word and ligature != 'ffi') or
                ('ffl' in word and ligature != 'ffl')):
            break
        # Replace ligatures with dots to avoid creating new ligatures
        # Example: "offline" --> "of.ine" to avoid creating "fi"
        ligature_removed_word = (word[:ligature_match.start()] +
                                 '.' +
                                 word[ligature_match.end():])
        # Skip words that contain another ligature
        if any(ligature in ligature_removed_word for ligature in ligatures):
            continue
        ligature_removed_word = ligature_removed_word.replace('.', '')
        ligature_removed_words.add(ligature_removed_word)
        if ligature_removed_word not in dictionary_word_set:
            broken_word = ligature_removed_word
            broken_words.add(broken_word)
            if broken_word not in broken_word_fixes:
                broken_word_fixes[broken_word] = word
            else:
                # Ignore broken words with multiple possible fixes
                # Example: "cus" --> "cuffs" or "ficus"
                broken_word_fixes[broken_word] = None


# Find broken word fixes for words with multiple ligature sequences
# Example: "rey" --> "firefly"
multi_ligature_words = sorted(multi_ligature_words)
numbers_of_ligatures_in_word = 2, 3
for number_of_ligatures_in_word in numbers_of_ligatures_in_word:
    ligature_lists = itertools.combinations_with_replacement(
        ligatures, r=number_of_ligatures_in_word
    )
    words_and_ligature_lists = list(itertools.product(
        multi_ligature_words, ligature_lists
    ))
    for i, (word, ligature_list) in enumerate(words_and_ligature_lists):
        if i % 1000 == 0:
            print('{n}-ligature words {percent:.3g}% complete'
                  .format(n=number_of_ligatures_in_word,
                          percent=100 * i / len(words_and_ligature_lists)))
        # Skip words that contain a larger ligature
        if (('ffi' in word and 'ffi' not in ligature_list) or
                ('ffl' in word and 'ffl' not in ligature_list)):
            continue
        ligature_removed_word = word
        for ligature in ligature_list:
            ligature_matches = list(re.finditer(ligature, ligature_removed_word))
            if not ligature_matches:
                break
            ligature_match = ligature_matches[0]
            # Replace ligatures with dots to avoid creating new ligatures
            # Example: "offline" --> "of.ine" to avoid creating "fi"
            ligature_removed_word = (
                ligature_removed_word[:ligature_match.start()] +
                '.' +
                ligature_removed_word[ligature_match.end():]
            )
        else:
            # Skip words that contain another ligature
            if any(ligature in ligature_removed_word for ligature in ligatures):
                continue
            ligature_removed_word = ligature_removed_word.replace('.', '')
            ligature_removed_words.add(ligature_removed_word)
            if ligature_removed_word not in dictionary_word_set:
                broken_word = ligature_removed_word
                broken_words.add(broken_word)
                if broken_word not in broken_word_fixes:
                    broken_word_fixes[broken_word] = word
                else:
                    # Ignore broken words with multiple possible fixes
                    # Example: "ung" --> "flung" or "fluffing"
                    broken_word_fixes[broken_word] = None


# Remove broken words with multiple possible fixes
for broken_word, fixed_word in broken_word_fixes.copy().items():
    if not fixed_word:
        broken_word_fixes.pop(broken_word)


number_of_ligature_words = len(ligature_words)
number_of_ligature_removed_words = len(ligature_removed_words)
number_of_broken_words = len(broken_words)
number_of_fixable_broken_words = len(
    [word for word in set(broken_word_fixes.keys())
     if word and broken_word_fixes[word]]
)
number_of_recoverable_ligature_words = len(
    [word for word in set(broken_word_fixes.values())
     if word]
)
print(number_of_ligature_words, 'ligature words')
print(number_of_ligature_removed_words, 'ligature-removed words')
print(number_of_broken_words, 'broken words')
print(number_of_fixable_broken_words,
      'fixable broken words ({percent:.3g}% fixable)'
      .format(percent=(
          100 * number_of_fixable_broken_words / number_of_broken_words
      )))
print(number_of_recoverable_ligature_words,
      'recoverable ligature words ({percent:.3g}% recoverable)'
      ' (for at least one broken word)'
      .format(percent=(
          100 * number_of_recoverable_ligature_words / number_of_ligature_words
      )))


with open(broken_word_fixes_file_path, 'w+', newline='') as broken_word_fixes_file:
    csv_writer = csv.writer(broken_word_fixes_file)
    sorted_broken_word_fixes = sorted(broken_word_fixes.items(),
                                      key=operator.itemgetter(0))
    for broken_word, fixed_word in sorted_broken_word_fixes:
        csv_writer.writerow([broken_word, fixed_word])
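As suggested above, the generated CSV can be combined with a regex replacement pass. Here is a minimal sketch of that step (the whole-word regex and the absence of case handling are my simplifications, not part of the script above):

import csv
import re

with open('broken_word_fixes.csv', newline='') as broken_word_fixes_file:
    fixes = dict(csv.reader(broken_word_fixes_file))

def fix_broken_ligatures(text):
    # Replace whole words only, leaving unknown or ambiguous words alone
    return re.sub(r'\b\w+\b',
                  lambda match: fixes.get(match.group(), match.group()),
                  text)

print(fix_broken_ligatures('please dene an operation'))
# -> "please define an operation"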

Jan Van Bruggen

Posted 2012-01-06T15:45:37.360

Reputation: 91

The link to the .csv is broken. It would be great if you could upload it again! In any case, thanks for the code. – MagTun – 2017-04-18T11:26:55.600

@Enora I re-uploaded the CSV at the same link - hope it helps! I also noticed a few problems in the code/results (using periods as placeholders while the new dictionary has periods in its words, and not lowercasing words before comparing them). I believe all of the replacements are correct, but take them with a grain of salt and know that more good replacements are possible. I recommend automating the replacements with regex but then confirming each replacement is good with your own eyes. – Jan Van Bruggen – 2017-08-25T16:28:51.810

8

The issue here is, as the other answer notes, with ligatures. However, it has nothing at all to do with OpenType. The fundamental problem is that PDF is a pre-print format that concerns itself little with content and semantics and is instead geared towards faithfully representing a page as it would be printed.

Text is laid out not as text but as runs of glyphs from a font at certain positions. So you get something like »Place glyph number 72 there, glyph number 101 there, glyph number 108 there, ...«. On that level there is fundamentally no notion of text at all. It's just a description of how it looks. There are two problems with extracting meaning from a bunch of glyphs:

  1. The spatial layout. Since the PDF already contains specific information about where to place each glyph, there is no actual text underlying it, as there normally would be. Another side effect is that there are no spaces. Sure, if you look at the text there are, but not in the PDF. Why emit a blank glyph when you could just emit none at all? The result is the same, after all. So PDF readers have to carefully piece the text together again, inserting a space whenever they encounter a larger gap between glyphs (a toy sketch of this heuristic follows the list).

  2. PDF renders glyphs, not text. Most of the time the glyph IDs in the embedded fonts correspond to Unicode code points, or at least ASCII codes, which means you can often get ASCII or Latin 1 text back well enough, depending on who created the PDF in the first place (some producers garble everything in the process). But often even PDFs that let you get ASCII text out just fine will mangle everything that is not ASCII. This is especially horrible with complex scripts such as Arabic, which contain only ligatures and alternate glyphs after the layout stage, which means that Arabic PDFs almost never contain actual text.
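To make the first problem concrete, here is a toy sketch of that gap heuristic (the glyph positions and threshold are invented for illustration; real readers work from the PDF's text-positioning operators):

glyphs = [  # (character, x_position, width) -- hypothetical extracted values
    ('D', 0.0, 7.0), ('e', 7.2, 6.0), ('n', 13.4, 6.5), ('e', 20.1, 6.0),
    ('a', 31.0, 6.0), ('n', 37.2, 6.5),
]

def reassemble(glyphs, gap_threshold=3.0):
    text = [glyphs[0][0]]
    for (_, x, width), (character, next_x, _) in zip(glyphs, glyphs[1:]):
        if next_x - (x + width) > gap_threshold:
            text.append(' ')  # large gap between glyphs: assume a word break
        text.append(character)
    return ''.join(text)

print(reassemble(glyphs))  # -> "Dene an"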

The second problem is the one you are facing. A common culprit here is LaTeX, which utilises an estimated 238982375 different fonts (each of which is restricted to 256 glyphs) to achieve its output. Different fonts for normal text, math (which uses more than one), etc. make things very difficult, especially as Metafont predates Unicode by almost two decades, so there never was a Unicode mapping. Umlauts are also rendered as a diaeresis superimposed on a letter, e.g. you get »¨a« instead of »ä« when copying from a PDF (and of course you cannot search for it either).
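If the diaeresis comes through as the standalone character U+00A8 in front of the letter, that particular mangling can at least be repaired after pasting. A minimal sketch, assuming that exact form (other producers emit other sequences):

import re
import unicodedata

pasted = 'M\u00a8adchen'  # »Mädchen« as copied from such a PDF

# Move the standalone diaeresis behind the letter as a combining mark
# (U+0308), then let canonical composition fold the pair into one character.
repaired = re.sub('\u00a8(\\w)', '\\1\u0308', pasted)
print(unicodedata.normalize('NFC', repaired))  # -> "Mädchen"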

Applications producing PDFs can opt to include the actual text as metadata. If they don't, you're left at the mercy of how the embedded fonts are handled and of whether the PDF reader can piece the original text together again. But »fi« being copied as a blank or not at all is usually a sign of a LaTeX PDF. You should paint Unicode characters on stones and throw them at the producer, in the hope that they switch to XeLaTeX and thus finally arrive in the 1990s of character encodings and font standards.

Joey

Posted 2012-01-06T15:45:37.360

Reputation: 36 381