UTF-8 bit representation

4

1

I'm learning about the UTF-8 standard, and this is what I have so far:

Definition and bytes used
UTF-8 binary representation         Meaning
0xxxxxxx                            1 byte for 1- to 7-bit chars
110xxxxx 10xxxxxx                   2 bytes for 8- to 11-bit chars
1110xxxx 10xxxxxx 10xxxxxx          3 bytes for 12- to 16-bit chars
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx 4 bytes for 17- to 21-bit chars
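
To make the table concrete, here is a minimal Python sketch that packs a code point into those bit patterns. The function name `utf8_encode` is made up for illustration, and the sketch only follows the table: it ignores the extra rules of real UTF-8 (the U+10FFFF ceiling and the excluded surrogate range).

```python
def utf8_encode(cp: int) -> bytes:
    """Pack one code point into the bit patterns from the table (illustrative only)."""
    if cp < 0x80:        # fits in 7 bits: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # fits in 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # fits in 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x200000:    # fits in 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point needs more than 21 bits")


# Compare against Python's built-in encoder for one character per length.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
```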

And I'm wondering: why doesn't the 2-byte UTF-8 code start with 10xxxxxx instead, thus gaining 1 bit at every length, all the way up to 22 bits with a 4-byte UTF-8 code? The way it is right now, 64 possible values are lost (from 10000000 to 10111111). I'm not trying to argue with the standard, but I'm wondering why this is so.

** EDIT **

Or even, why isn't it

UTF-8 binary representation         Meaning
0xxxxxxx                            1 byte for 1 to 7 bits chars
110xxxxx xxxxxxxx                   2 bytes for 8 to 13 bits chars
1110xxxx xxxxxxxx xxxxxxxx          3 bytes for 14 to 20 bits chars
11110xxx xxxxxxxx xxxxxxxx xxxxxxxx 4 bytes for 21 to 27 bits chars

...?

Thanks!

Yanick Rochon

Posted 2011-01-13T02:36:33.377

Reputation: 932

If you drop 10xxxxxx you can use 10xxxxxx xxxxxxxx 2 bytes for 8 - 14 bits? – ony – 2019-05-04T14:08:20.450

Answers

8

UTF-8 is self-synchronising. Something examining the bytes can tell if it's at the start of a UTF-8 character, or part-way through one.

Let's say you have two characters in your scheme: 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

If the parser picks up at the second octet, it has no way to tell that the second and third octets should not be read together as one character. With UTF-8, the parser can tell that it's in the middle of a character and skip ahead to the start of the next one, while emitting some state to flag the corrupted symbol.
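
As a rough illustration of that recovery step, here is a small Python sketch. The helper names `is_continuation` and `resync` are hypothetical, and the sketch assumes the input is otherwise well-formed UTF-8; a real decoder would also validate lead bytes and sequence lengths.

```python
def is_continuation(octet: int) -> bool:
    # Continuation octets are exactly those of the form 10xxxxxx.
    return (octet & 0xC0) == 0x80


def resync(data: bytes, pos: int) -> int:
    """Return the index of the first character start at or after pos."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


data = "a\u00e9\u20ac".encode("utf-8")  # octets: 61 | C3 A9 | E2 82 AC
start = resync(data, 2)                 # index 2 is the middle of the 2-octet character
print(start, data[start:].decode("utf-8"))
```

Starting at index 2 (the `A9` continuation octet), the loop skips one octet and lands on `E2`, the lead octet of the next character. In the question's scheme, where lead and trailing octets share the 10xxxxxx pattern, no such test is possible.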

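For the edit: if the top bit is clear, UTF-8 parsers know that they're looking at a character represented in one octet. If it is set, the octet is part of a multi-octet character. With unrestricted xxxxxxxx trailing octets, a trailing octet could have its top bit clear and would be indistinguishable from a single-octet character.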

It's all about error recovery and easy classification of octets.
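
The classification can be sketched in a few lines of Python. The function name `classify` and the label strings are invented for this example; the bit masks follow directly from the patterns in the question's first table.

```python
def classify(octet: int) -> str:
    """Classify a single octet by its leading bits alone."""
    if (octet & 0x80) == 0x00:
        return "single"        # 0xxxxxxx: a complete one-octet character
    if (octet & 0xC0) == 0x80:
        return "continuation"  # 10xxxxxx: middle or end of a character
    if (octet & 0xE0) == 0xC0:
        return "lead-2"        # 110xxxxx: starts a 2-octet character
    if (octet & 0xF0) == 0xE0:
        return "lead-3"        # 1110xxxx: starts a 3-octet character
    if (octet & 0xF8) == 0xF0:
        return "lead-4"        # 11110xxx: starts a 4-octet character
    return "invalid"           # 11111xxx matches no pattern in the table


kinds = [classify(b) for b in "a\u20ac".encode("utf-8")]
print(kinds)
```

Every octet falls into exactly one class without looking at its neighbours, which is what the 10xxxxxx prefix buys at the cost of those 64 lead values.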

Phil P

Posted 2011-01-13T02:36:33.377

Reputation: 1 773

This synchronization allows also traversing chars in UTF-8 strings backwards. – ony – 2019-05-04T14:09:10.043