How and why do PDF renderers replace characters?

Answers

Source PDF: http://download.microsoft.com/download/8/0/1/801a191c-029d-4af3-9642-555f6fe514ee/cff.pdf

The character code actually stored in the compressed content stream is 0xDE. How that byte is displayed depends on the encoding your text editor assumes.

BT
/F4 1 Tf
9.5 0 0 9.5 210 664.663 Tm
(Appendix B)Tj
1.2632 -1.3158 TD
-0.0002 Tc
-0.0021 Tw
(PredeÞned Encodings)Tj
ET
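
To illustrate why the display varies, here is a minimal Python sketch that decodes the single byte 0xDE under a few legacy encodings an editor might assume. The choice of encodings is only illustrative, and the helper name `show_as` is made up for this example.

```python
# Decode the same byte under several single-byte encodings.
# The encodings listed here are just examples of what an editor might assume.
RAW = bytes([0xDE])

def show_as(encoding: str) -> str:
    """Return the character that byte 0xDE maps to under the given codec."""
    return RAW.decode(encoding)

for enc in ("latin-1", "cp1252", "cp437"):
    ch = show_as(enc)
    print(f"{enc:8} -> {ch!r} (U+{ord(ch):04X})")
```

None of these encodings is PDF MacRomanEncoding, which is why none of them shows the glyph the PDF actually intends.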

We have a character code; now, what is the font? /F4 resolves to object 4322, which is a non-embedded simple font (single-byte encoding) that uses MacRomanEncoding.

This encoding is defined in the PDF standard, in the appendix "Latin Character Set and Encodings".

Note that the values in that table are given in octal, so 0xDE becomes 336. Looking under the MAC column, we find that it is the ligature "fi" (U+FB01).
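
The hex-to-octal step and the table lookup can be checked with a short Python sketch. It assumes that Python's `mac_roman` codec (Mac OS Roman) agrees with PDF MacRomanEncoding for this particular code; the two are not identical everywhere, so treat the codec as a stand-in rather than the PDF table itself.

```python
import unicodedata

code = 0xDE
octal = format(code, "o")  # octal digits, as used in the PDF encoding table

# Assumption: Python's mac_roman codec stands in for PDF MacRomanEncoding here.
glyph = bytes([code]).decode("mac_roman")

print(octal)
print(f"U+{ord(glyph):04X}", unicodedata.name(glyph))
```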

Why is "fi" replaced with Þ?

It is not. The "Þ" is simply how your editor displays character code 0xDE (octal 336), which under PDF MacRomanEncoding is the ligature "fi". If you had a text editor that supported PDF MacRomanEncoding, you would see the ligature.
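
As a rough illustration of what a text extractor might do with this string, the sketch below decodes the raw Tj operand bytes with Python's `mac_roman` codec (again only a stand-in for PDF MacRomanEncoding) and then applies NFKC normalization to split the ligature into plain letters. The variable names are made up for this example.

```python
import unicodedata

# Raw bytes of the string operand from the content stream; 0xDE is the "fi" code.
raw = b"Prede\xdened Encodings"

as_editor_sees_it = raw.decode("latin-1")    # what a Latin-1 editor displays
as_pdf_means_it = raw.decode("mac_roman")    # U+FB01 ligature in place of 0xDE
searchable = unicodedata.normalize("NFKC", as_pdf_means_it)  # ligature -> "f" + "i"

print(as_editor_sees_it)
print(as_pdf_means_it)
print(searchable)
```

The bytes never change; only the interpretation applied to them does.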

Ryan

Posted 2019-09-18T13:58:33.033

That is not the binary data that I see in the PDF file you linked to. Where did you get that text from? What application? Did you process the PDF somehow? – Ryan – 2019-09-19T17:49:18.507

Mea culpa, I linked to the wrong version of the document. http://download.microsoft.com/download/8/0/1/801a191c-029d-4af3-9642-555f6fe514ee/cff.pdf

To answer your question, I use PdfSharp.Pdf.Content.ContentReader.ReadContent(page) to read the file.

– Christoph Bruns – 2019-09-20T12:50:32.950