How can I find out the encoding of this corrupted Chinese text, which an online tool fixes correctly?

The corrupted text ´ÓºÜ¾ÃÒÔÇ°¿ªÊ¼ is 14 characters long. Since the correct Simplified Chinese text 从很久以前开始 is 7 characters long, that immediately suggests that each Simplified Chinese character might correspond to two characters in the corrupted text.

The characters in the corrupted text have the following hex values in UTF-16, omitting the leading 00 byte of each code unit (and also with cp936, as shown in the OP):

´ => b4
Ó => d3
º => ba
Ü => dc
¾ => be
Ã => c3
Ò => d2
Ô => d4
Ç => c7
° => b0
¿ => bf
ª => aa
Ê => ca
¼ => bc
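
For anyone who wants to reproduce the list above, here is a minimal Python sketch (the answer itself used Java; the names `corrupted` and `hex_values` are just illustrative). The string is built from its byte values rather than pasted, to avoid any further copy-and-paste damage:

```python
# Build the 14-character corrupted string from the hex values listed above.
# Latin-1 maps each byte 0x00-0xFF to the code point with the same number.
corrupted = bytes.fromhex("b4d3badcbec3d2d4c7b0bfaacabc").decode("latin-1")

# Map each character to its code point, written as two hex digits.
hex_values = [format(ord(ch), "02x") for ch in corrupted]

for ch, hx in zip(corrupted, hex_values):
    print(f"{ch} => {hx}")
```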

I did that translation using a trivial Java program, but there are online sites that can do the same thing.

So all the Mandarin Tool needs to do is combine the hex values of each pair of corrupted characters into one two-byte cp936 value, which gives one Simplified Chinese character:

´ + Ó => b4 + d3 => b4d3 => 从
º + Ü => ba + dc => badc => 很
¾ + Ã => be + c3 => bec3 => 久
Ò + Ô => d2 + d4 => d2d4 => 以
Ç + ° => c7 + b0 => c7b0 => 前
¿ + ª => bf + aa => bfaa => 开
Ê + ¼ => ca + bc => cabc => 始
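
The pairing above amounts to turning the corrupted characters back into raw bytes and decoding those bytes as cp936. A minimal Python sketch, assuming the original mis-decoding was Latin-1 (Windows-1252 would give the same result here, since the two agree on every byte in the range 0xA0-0xFF); the variable names are illustrative:

```python
# The corrupted text, rebuilt from the hex values in the table.
corrupted = bytes.fromhex("b4d3badcbec3d2d4c7b0bfaacabc").decode("latin-1")

# Step 1: undo the wrong decoding, one byte per corrupted character.
raw = corrupted.encode("latin-1")

# Step 2: decode the same bytes with the encoding they were written in.
fixed = raw.decode("cp936")

print(fixed)  # 从很久以前开始
```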

Presumably the Mandarin Tool verifies that the transformation of the corrupted text really does result in valid Simplified Chinese text.
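
How the Mandarin Tool actually validates its result is not known. One plausible heuristic, sketched below in Python purely as an assumption, is to accept a repair only if the byte round trip succeeds and every resulting character falls in the CJK Unified Ideographs block. The function name `try_repair` and its parameters are hypothetical:

```python
def try_repair(text, wrong="latin-1", right="cp936"):
    """Return the repaired text, or None if the repair looks invalid."""
    try:
        repaired = text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text cannot be the result of this particular mix-up.
        return None
    # Heuristic (an assumption): every character must be a CJK ideograph
    # in the range U+4E00-U+9FFF. Punctuation and ASCII would be rejected.
    if all("\u4e00" <= ch <= "\u9fff" for ch in repaired):
        return repaired
    return None


corrupted = bytes.fromhex("b4d3badcbec3d2d4c7b0bfaacabc").decode("latin-1")
print(try_repair(corrupted))
print(try_repair("hello"))
```

A real tool would need a looser check, since genuine Chinese text also contains punctuation, digits and Latin letters.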

Each Simplified Chinese cp936 value can then be mapped to its Unicode code point. For example, 从 is 0xB4D3 in cp936, which maps to code point U+4ECE. Once you have the Unicode code point, you can translate to any encoding you wish (cp936, GB 18030, UTF-16, etc.).
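
As a small illustration of that mapping, this Python sketch decodes the cp936 value b4d3 to a character, reads off its code point, and re-encodes it in a few encodings. The set of target encodings is just a sample, and the variable names are illustrative:

```python
# Decode the two-byte cp936 value to a single character.
ch = bytes.fromhex("b4d3").decode("cp936")

# Its Unicode code point.
code_point = ord(ch)
print(f"U+{code_point:04X}")

# Re-encode the same character in several encodings and show the bytes as hex.
encoded = {enc: ch.encode(enc).hex()
           for enc in ("cp936", "gb18030", "utf-16-be", "utf-8")}
for enc, hx in encoded.items():
    print(enc, hx)
```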

One point I am unclear on in your question is the first listing, showing the 32-bit representations of each Simplified Chinese character (e.g. c2b4 c393 从). That doesn't look right, since the code point for a character (e.g. 0x4ECE for 从) and its 32-bit representation are the same thing. Or am I misunderstanding something?

skomisa

Posted 2015-03-28T03:19:42.800

Reputation: 138

Thank you for answering this old question; it gave me valuable insight into fixing corrupted encodings in the future! The Wikipedia article on UTF-8 explains how code points are transformed into UTF-8 bytes. In this corrupted text, the cp936 code b4d3 of 从 is interpreted as two 11-bit code points, 00b4 and 00d3, and the 11-bit code point 00b4 (binary 00010110100) is encoded as the UTF-8 bytes c2b4 from my first listing. – rubystallion – 2018-12-31T11:24:52.033
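
The chain described in this comment can be checked with a short Python sketch. It assumes the bytes were misread as Latin-1 before being re-encoded as UTF-8, and the variable names are illustrative:

```python
original = "\u4ece"                      # 从

# The character as stored in cp936: bytes b4 d3.
cp936_bytes = original.encode("cp936")

# Misreading each byte as its own character gives U+00B4 and U+00D3.
misread = cp936_bytes.decode("latin-1")

# Encoding those two characters as UTF-8 yields two bytes each.
utf8_bytes = misread.encode("utf-8")
print(utf8_bytes.hex())                  # c2b4c393

# The 11-bit view of U+00B4 that UTF-8 splits into 5 + 6 payload bits.
bits = format(0xB4, "011b")
print(bits[:5], bits[5:])
```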

@rubystallion OK, got it. I had completely misinterpreted "fixed length 32-bit" as "UTF-32", but that wasn't what you were saying at all - my mistake. – skomisa – 2018-12-31T21:06:35.617
