How to get CJK Unicode characters from a PDF that uses supplementary private use characters?

I have several PDF documents (such as this one) that appear to be written using standard Chinese ideograms, but when I extract the text, it turns out to be encoded using characters from the Unicode Supplementary Private Use Areas.

Is there any reliable way to map from the private use characters back to the appropriate CJK characters?

pdf
unicode
chinese

Ben

Posted 2015-10-13T15:51:05.727

Reputation: 11

Answers

The general flow is probably:

Extract the embedded font from the PDF.
Compare the font against known encodings to see whether it matches any of them.
Alternatively, the code points may be genuinely private use assignments that are specific to that font.
If the encoding is known, build the reverse mapping from its conversion table; otherwise, work the mapping out from the font extracted from the PDF.
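
As a rough sketch of the last step, assuming you have already worked out a reverse table by one of the routes above: the snippet below lists which private use characters occur in the extracted text (they have Unicode general category "Co") and then substitutes the ones you have a mapping for. The function names and the two code points in `sample_table` are made up for illustration; the real table has to come from the known encoding or from inspecting the glyphs in the extracted font.

```python
import unicodedata


def private_use_chars(text):
    """Return the distinct private use characters in text, sorted by code point."""
    # General category "Co" covers the BMP Private Use Area and both
    # Supplementary Private Use Areas (planes 15 and 16).
    return sorted({ch for ch in text if unicodedata.category(ch) == "Co"})


def remap(text, table):
    """Replace each character found in table; leave everything else untouched."""
    return "".join(table.get(ch, ch) for ch in text)


# Hypothetical reverse table: these private use code points and their CJK
# targets are placeholders, not taken from any real font.
sample_table = {
    "\U000F0001": "中",
    "\U000F0002": "文",
}

extracted = "\U000F0001\U000F0002!"

# Inspect which code points still need a mapping.
for ch in private_use_chars(extracted):
    status = "mapped" if ch in sample_table else "unmapped"
    print(f"U+{ord(ch):04X} {status}")

print(remap(extracted, sample_table))
```

Listing the distinct private use code points first gives you a checklist, so you can tell how many glyphs need to be identified before the text is fully recovered.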

user930067

Posted 2015-10-13T15:51:05.727

Reputation: 141

Asked: 2015-10-13T15:51:05.727

Viewed: 178 times

Active: 2017-11-14T01:26:04.220