How can I find out the encoding of this corrupted Chinese text, which an online tool fixes correctly?



I have a text in Simplified Chinese which, when read as UTF-8, begins with ´ÓºÜ¾ÃÒÔÇ°¿ªÊ¼. The online tool from MandarinTools (first search result for "Repair Corrupted Chinese Email") fixes it to the correct 从很久以前开始, but it's not clear how it does that. From using the online tool and a hex editor I know that each character is stored as a fixed-length 32 bits (four bytes):

c2b4 c393 从
c2ba c39c 很
c2be c383 久
c392 c394 以
c387 c2b0 前
c2bf c2aa 开
c38a c2bc 始

This also shows that a character is encoded as two 16-bit words in the c2**-c3** range. With UTF-16 the first 16-bit word is always 0 for these characters. UTF-8 only uses 24 bits per character for these and Codepage 936 only uses 16 bits per character here. Which method can I use to determine the correct encoding conversion?

UTF-8 representation:

e4bb 8e 从
e5be 88 很
e4b9 85 久
e4bb a5 以
e589 8d 前
e5bc 80 开
e5a7 8b 始

cp936 representation:

b4d3 从
badc 很
bec3 久
d2d4 以
c7b0 前
bfaa 开
cabc 始
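
For reference, the three hex listings above can be reproduced with a short Python sketch. It assumes the corruption path is "cp936 bytes misread as Latin-1, then saved as UTF-8", which is consistent with the hex dump but is not confirmed by the tool itself. `representations` is a name made up for illustration.

```python
def representations(ch):
    """Return (utf-8, cp936, corrupted) hex strings for one character."""
    utf8 = ch.encode("utf-8").hex()
    cp936 = ch.encode("cp936").hex()
    # Assumed corruption: each cp936 byte is misread as one Latin-1
    # character, and those characters are then written out as UTF-8.
    corrupted = ch.encode("cp936").decode("latin-1").encode("utf-8").hex()
    return utf8, cp936, corrupted

for ch in "从很久以前开始":
    print(ch, *representations(ch))
# first line printed: 从 e4bb8e b4d3 c2b4c393
```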


Posted 2015-03-28T03:19:42.800




The corrupted text ´ÓºÜ¾ÃÒÔÇ°¿ªÊ¼ is 14 characters long. Since the correct Simplified Chinese text 从很久以前开始 is 7 characters long, that immediately suggests that each Simplified Chinese character might correspond to two characters in the corrupted text.

The characters in the corrupted text have the following hex code points in UTF-16 (the same byte values appear, in pairs, in the cp936 listing in the OP):

´ => b4
Ó => d3
º => ba
Ü => dc
¾ => be
Ã => c3
Ò => d2
Ô => d4
Ç => c7
° => b0
¿ => bf
ª => aa
Ê => ca
¼ => bc
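
A minimal Python sketch of that lookup (this is not the Java program mentioned in this answer, and `code_points` is a hypothetical helper name). It relies on the fact that every corrupted character here lies below U+0100, so its code point fits in a single byte:

```python
corrupted = "´ÓºÜ¾ÃÒÔÇ°¿ªÊ¼"

def code_points(s):
    """Return the code point of each character as a two-digit hex string."""
    return [format(ord(ch), "02x") for ch in s]

for ch, cp in zip(corrupted, code_points(corrupted)):
    print(f"{ch} => {cp}")
```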

I did that translation using a trivial Java program, but there are online sites that can do the same thing.


So all the Mandarin Tool needs to do is combine the hex values of the first two corrupted characters to get the first Simplified Chinese character using CP 936, and so on:

´ + Ó => b4 + d3 => b4d3 => 从
º + Ü => ba + dc => badc => 很
¾ + Ã => be + c3 => bec3 => 久
Ò + Ô => d2 + d4 => d2d4 => 以
Ç + ° => c7 + b0 => c7b0 => 前
¿ + ª => bf + aa => bfaa => 开
Ê + ¼ => ca + bc => cabc => 始 
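
The pairing above amounts to turning each corrupted character back into a single byte and then decoding the byte stream as cp936. A sketch in Python, assuming the wrong decoding was Latin-1 (for these particular bytes Windows-1252 would give the same result); `repair` is a made-up name, and this is not necessarily how the MandarinTools page is implemented:

```python
def repair(corrupted: str) -> str:
    # Undo the wrong decoding: one character -> one byte (b4, d3, ba, ...).
    raw = corrupted.encode("latin-1")
    # Decode the recovered bytes with the encoding they were written in.
    return raw.decode("cp936")

print(repair("´ÓºÜ¾ÃÒÔÇ°¿ªÊ¼"))  # 从很久以前开始
```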

Presumably the Mandarin Tool verifies that the transformation of the corrupted text really does result in valid Simplified Chinese text.
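
How the tool actually verifies its output is not known. One possible approach, sketched here in Python under that caveat, is to try several (wrong encoding, original encoding) pairs with strict error handling and keep only results made up entirely of CJK ideographs. The candidate lists and the names `MISREAD_AS`, `ORIGINALLY`, `is_cjk` and `candidate_repairs` are assumptions for illustration.

```python
from itertools import product

# Hypothetical candidate lists; extend them for other languages.
MISREAD_AS = ["latin-1", "cp1252"]
ORIGINALLY = ["cp936", "big5", "shift_jis", "euc_kr", "utf-8"]

def is_cjk(ch):
    # CJK Unified Ideographs block only; a deliberately rough filter.
    return "\u4e00" <= ch <= "\u9fff"

def candidate_repairs(corrupted):
    found = []
    for wrong, right in product(MISREAD_AS, ORIGINALLY):
        try:
            fixed = corrupted.encode(wrong).decode(right)
        except UnicodeError:
            continue  # this pair cannot explain the corrupted text
        if all(is_cjk(ch) for ch in fixed):
            found.append((wrong, right, fixed))
    return found

for wrong, right, fixed in candidate_repairs("´ÓºÜ¾ÃÒÔÇ°¿ªÊ¼"):
    print(f"misread as {wrong}, originally {right}: {fixed}")
```

A filter like this only narrows the field. More than one pair can survive, so a human reader (or a character-frequency check) may still be needed to pick the right one.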

Each Simplified Chinese cp936 value can then be mapped to its Unicode code point. For example, 从 = 0xB4D3 = code point 0x4ECE. And once you have the Unicode code point you can translate to any encoding you wish (cp936, GB 18030, UTF-16, etc.).
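
As a small illustration in Python (a sketch; `code_point_of` is a hypothetical helper), the mapping from a cp936 value to a code point and on to other encodings looks like this:

```python
def code_point_of(cp936_hex):
    """Decode one cp936 value given as hex; return (character, code point)."""
    ch = bytes.fromhex(cp936_hex).decode("cp936")
    return ch, ord(ch)

ch, cp = code_point_of("b4d3")
print(ch, hex(cp))  # 从 0x4ece

# Re-encode the same character in several target encodings.
for enc in ("cp936", "gb18030", "utf-8", "utf-16-be", "utf-32-be"):
    print(enc, ch.encode(enc).hex())
```

The big-endian variants are used so that no byte order mark is prepended to the output.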

One point I am unclear on in your question is the first listing, showing the 32-bit representations of each Simplified Chinese character (e.g. c2b4 c393 从). That doesn't look right, since the code point for a character (e.g. 0x4ECE for 从) and its 32-bit representation are the same thing. Or am I misunderstanding something?




Thank you for answering this old question, this gave me valuable insights to fix any corrupt encodings in the future! The Wikipedia article for UTF-8 explains how code points get transformed to UTF-8 bytes. With this corrupted text, the cp936 code b4d3 of 从 is interpreted as two 11-bit code points, 00b4 and 00d3, and the 11-bit code point 00b4 (in binary 00010110100) has the UTF-8 bytes c2b4 of my first listing. – rubystallion – 2018-12-31T11:24:52.033

@rubystallion OK, got it. I had completely misinterpreted "fixed length 32-bit" as "UTF-32", but that wasn't what you were saying at all - my mistake. – skomisa – 2018-12-31T21:06:35.617
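
The bit arithmetic described in the first comment can be checked with a short Python sketch. It covers only the two-byte UTF-8 form (code points U+0080 to U+07FF), and `utf8_two_byte` is a name invented for this example:

```python
def utf8_two_byte(code_point):
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes by hand."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte UTF-8 range is handled")
    lead = 0b11000000 | (code_point >> 6)          # 110xxxxx: top 5 of 11 bits
    trail = 0b10000000 | (code_point & 0b111111)   # 10xxxxxx: low 6 bits
    return bytes([lead, trail])

print(format(0xB4, "011b"))       # 00010110100
print(utf8_two_byte(0xB4).hex())  # c2b4
print(utf8_two_byte(0xD3).hex())  # c393
```

Together the two results give c2b4 c393, the four bytes shown for 从 in the first listing of the question.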