How to get CJK Unicode characters from a PDF that uses supplementary private use characters?

1

2

I have several PDF documents (such as this one) that appear to be written using standard Chinese ideograms, but when I extract the text, it turns out to be encoded with characters from the Unicode Supplementary Private Use Areas.

Is there any reliable way to map from the private use characters back to the appropriate CJK characters?

Ben

Posted 2015-10-13T15:51:05.727

Reputation: 11

Answers

0

The general flow is probably:

  • Extract the embedded font from the PDF.
  • Compare the font against known encodings to see whether it matches any of them.
  • Alternatively, the code points may be genuinely private, i.e. assigned only by that particular font.
  • Work out the reverse mapping: if the encoding is known, use its conversion table; otherwise, build the mapping from the font extracted from the PDF.
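
Once you have a reverse table, applying it to the extracted text is the easy part. Here is a minimal sketch of that last step, using only the Python standard library. The two range constants are the Unicode Supplementary Private Use Area-A and -B; everything else (the function names, `EXAMPLE_TABLE` and its entries, the sample string) is made up for illustration. The real table has to come from the known encoding or from the font itself.

```python
# Supplementary Private Use Area-A and Area-B, as defined by Unicode.
SPUA_RANGES = ((0xF0000, 0xFFFFD), (0x100000, 0x10FFFD))


def is_supplementary_pua(ch):
    """Return True if ch lies in a supplementary private use area."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in SPUA_RANGES)


def find_pua_code_points(text):
    """Return the sorted, de-duplicated PUA code points used in text."""
    return sorted({ord(ch) for ch in text if is_supplementary_pua(ch)})


def remap(text, table):
    """Replace mapped PUA characters; leave everything else unchanged."""
    return text.translate(table)


# Hypothetical entries: a real table must be derived from the font
# or from a published conversion table for the encoding in use.
EXAMPLE_TABLE = {0xF0001: "中", 0xF0002: "文"}

# Hypothetical extracted text containing three PUA characters.
sample = "\U000F0001\U000F0002 PDF \U000F0003"

# List which PUA code points need a mapping at all.
print([hex(cp) for cp in find_pua_code_points(sample)])

# Apply the table; the unmapped U+F0003 is passed through as-is.
print(remap(sample, EXAMPLE_TABLE))
```

Running `find_pua_code_points` over the whole document first tells you how many distinct glyphs you actually need to identify, which is usually far fewer than the size of the font.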

user930067

Posted 2015-10-13T15:51:05.727

Reputation: 141