Source PDF: http://download.microsoft.com/download/8/0/1/801a191c-029d-4af3-9642-555f6fe514ee/cff.pdf
The actual character code used is 0xDE in the compressed content stream. How that appears in your text editor of choice can vary.
BT
/F4 1 Tf
9.5 0 0 9.5 210 664.663 Tm
(Appendix B)Tj
1.2632 -1.3158 TD
-0.0002 Tc
-0.0021 Tw
(PredeÞned Encodings)Tj
ET
We have a character code, now what is the font? /F4 takes us to Obj 4322, which is a non-embedded simple font (single byte encoding), with MacRomanEncoding.
This encoding is defined in the PDF standard, in the appendix "Latin character set and encodings".
Note these values are in octal, so 0xDE becomes o336, and looking under the MAC column we find that it is the ligature "fi" U+FB01.
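The lookup above can be checked with a small Python sketch. It assumes that Python's built-in mac_roman codec is an acceptable stand-in for the PDF MacRomanEncoding; the two agree for this byte, but they are not guaranteed to match for every code. The variable names are illustrative only.

```python
import unicodedata

# The byte found in the content stream.
code = 0xDE

# The encoding tables in the PDF standard list codes in octal.
octal_label = format(code, "o")

# Assumption: Python's mac_roman codec stands in for PDF MacRomanEncoding.
glyph = bytes([code]).decode("mac_roman")

# Official Unicode name and code point of the decoded character.
glyph_name = unicodedata.name(glyph)
code_point = f"U+{ord(glyph):04X}"

print(octal_label, code_point, glyph_name)
```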
Why is "fi" replaced with þ?
It is not. The "þ" is actually the octal character code o336, which combined with the PDF MacRomanEncoding is the ligature "fi". What you see is your text editor interpreting that byte with a different encoding. If you had a text editor that supported PDF MacRomanEncoding, you would see the ligature.
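To illustrate, here is a minimal sketch that decodes the string operand from the content stream in two ways. It assumes the editor falls back to Latin-1 (other editors may pick a different encoding), and again uses Python's mac_roman codec as an approximation of the PDF MacRomanEncoding. The final normalization step is an optional extra, not something the PDF requires, for cases where you want plain searchable text.

```python
import unicodedata

# Raw bytes of the string operand shown in the content stream.
raw = b"Prede" + bytes([0xDE]) + b"ned Encodings"

# What an editor that assumes Latin-1 would display (thorn instead of "fi").
as_latin1 = raw.decode("latin-1")

# What the font's encoding intends (assumption: mac_roman approximates
# the PDF MacRomanEncoding for this byte).
as_macroman = raw.decode("mac_roman")

# Optional: NFKC normalization expands the ligature into the letters "f" + "i".
searchable = unicodedata.normalize("NFKC", as_macroman)

print(as_latin1)
print(as_macroman)
print(searchable)
```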
That is not the binary data that I see in the PDF file you linked to. Where did you get that text from? What application? Did you process the PDF somehow? – Ryan – 2019-09-19T17:49:18.507
mea culpa, I linked to the wrong version of the document. http://download.microsoft.com/download/8/0/1/801a191c-029d-4af3-9642-555f6fe514ee/cff.pdf
To answer your question, I use PdfSharp.Pdf.Content.ContentReader.ReadContent(page) to read the file.