I have a text in Simplified Chinese which, when read as UTF-8, begins with ´ÓºÜ¾ÃÒÔÇ°¿ªÊ¼, which the online tool from MandarinTools (first search result for Repair Corrupted Chinese Email) fixes to the correct 从很久以前开始, but it's not clear how it fixed that. From using the online tool and a hex editor I know that each character is encoded as fixed length 32-bit:
c2b4 c393 从
c2ba c39c 很
c2be c383 久
c392 c394 以
c387 c2b0 前
c2bf c2aa 开
c38a c2bc 始
This also shows that each character is stored as two 16-bit words in the c2**-c3** range. That matches none of the encodings I know: UTF-16 needs only a single 16-bit word for each of these characters (and in UTF-32 the upper 16-bit word would always be 0), UTF-8 uses only 24 bits per character for these, and Codepage 936 uses only 16 bits per character here. Which method can I use to determine the correct encoding conversion?
UTF-8 representation:
e4bb 8e 从
e5be 88 很
e4b9 85 久
e4bb a5 以
e589 8d 前
e5bc 80 开
e5a7 8b 始
cp936 representation:
b4d3 从
badc 很
bec3 久
d2d4 以
c7b0 前
bfaa 开
cabc 始
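Comparing the first listing with the cp936 listing hints at what happened: every two-byte group in the file is the UTF-8 encoding of a code point between U+0080 and U+00FF, and that code point's value equals one cp936 byte (c2b4 is U+00B4, c393 is U+00D3, and b4d3 is 从). A minimal sketch of reversing this, assuming Python 3 and assuming the cp936 bytes were misread as Latin-1 (ISO 8859-1) before being saved as UTF-8; the helper name `repair_mojibake` is hypothetical, not part of any library:

```python
def repair_mojibake(garbled: str, misread_as: str = "latin-1", actual: str = "gbk") -> str:
    """Undo a wrong decode: recover the original bytes, then decode them properly.

    Assumption: `misread_as` is the codec that was wrongly applied and
    `actual` is the codec the bytes were really written in.
    """
    raw = garbled.encode(misread_as)  # each character maps back to one original byte
    return raw.decode(actual)         # "gbk" is Python's codec for Codepage 936


# The bytes of the file as shown in the hex editor (first listing).
file_bytes = bytes.fromhex("c2b4c393c2bac39cc2bec383c392c394c387c2b0c2bfc2aac38ac2bc")

garbled = file_bytes.decode("utf-8")   # the unreadable text the question starts with
raw = garbled.encode("latin-1")        # the cp936 bytes hidden inside it

print(raw.hex())                 # b4d3badcbec3d2d4c7b0bfaacabc
print(repair_mojibake(garbled))  # 从很久以前开始
```

If the intermediate step raises `UnicodeEncodeError` or the result is still unreadable, the assumed codec pair is probably wrong for that text, and other candidates (for example `cp1252` instead of `latin-1`) would need to be tried.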
Thank you for answering this old question, this gave me valuable insights to fix any corrupt encodings in the future! The Wikipedia article for UTF-8 explains how code points get transformed to UTF-8 bytes. With this corrupted text the cp936 code b4d3 of 从 is interpreted as two 11-bit code points 00b4 and 00d3, and the 11-bit code point 00b4 (in binary 00010110100) has the UTF-8 bytes c2b4 of my first listing. – rubystallion – 2018-12-31T11:24:52.033
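The bit-level step described in this comment can be sketched as follows, assuming Python 3. The function name `utf8_two_byte` is hypothetical and only covers the two-byte UTF-8 pattern `110xxxxx 10xxxxxx`, which holds the 11 bits of any code point from U+0080 to U+07FF:

```python
def utf8_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only code points U+0080..U+07FF use two UTF-8 bytes")
    lead = 0b11000000 | (code_point >> 6)          # 110xxxxx: upper 5 of the 11 bits
    cont = 0b10000000 | (code_point & 0b00111111)  # 10xxxxxx: lower 6 of the 11 bits
    return bytes([lead, cont])


# Treat each byte of the cp936 code b4d3 as a code point of its own.
for cp936_byte in bytes.fromhex("b4d3"):
    encoded = utf8_two_byte(cp936_byte)
    # Cross-check against Python's built-in UTF-8 encoder.
    assert encoded == chr(cp936_byte).encode("utf-8")
    print(f"{cp936_byte:02x} -> {cp936_byte:011b} -> {encoded.hex()}")

# b4 -> 00010110100 -> c2b4
# d3 -> 00011010011 -> c393
```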
@rubystallion OK, got it. I had completely misinterpreted "fixed length 32-bit" as "UTF-32", but that wasn't what you were saying at all - my mistake. – skomisa – 2018-12-31T21:06:35.617