I'm learning about the UTF-8 standard, and this is what I've learned:
Definition and bytes used:

    UTF-8 binary representation          Meaning
    0xxxxxxx                             1 byte  for  1 to  7 bit chars
    110xxxxx 10xxxxxx                    2 bytes for  8 to 11 bit chars
    1110xxxx 10xxxxxx 10xxxxxx           3 bytes for 12 to 16 bit chars
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  4 bytes for 17 to 21 bit chars
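To check my understanding, here is a minimal Python sketch of that table (`utf8_encode` is my own name; range validation and surrogate checks are omitted for brevity):

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a Unicode code point per the UTF-8 table above."""
    if cp < 0x80:          # 0xxxxxxx: up to 7 bits
        return bytes([cp])
    if cp < 0x800:         # 110xxxxx 10xxxxxx: up to 11 bits
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # 1110xxxx 10xxxxxx 10xxxxxx: up to 16 bits
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: up to 21 bits
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Matches Python's built-in encoder for each length class:
assert utf8_encode(ord('A')) == 'A'.encode('utf-8')            # 1 byte
assert utf8_encode(0x00E9) == '\u00e9'.encode('utf-8')         # 2 bytes
assert utf8_encode(0x20AC) == '\u20ac'.encode('utf-8')         # 3 bytes
assert utf8_encode(0x1F600) == '\U0001F600'.encode('utf-8')    # 4 bytes
```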
And I'm wondering: why isn't the 2-byte UTF-8 lead byte `10xxxxxx` instead, gaining 1 bit at every length, all the way up to 22 bits with a 4-byte UTF-8 code? The way it is now, 64 possible values (from `10000000` to `10111111`) are never used as a lead byte. I'm not trying to argue with the standard, but I'm wondering why it is this way?
** EDIT **
Or even, why isn't it:
    UTF-8 binary representation          Meaning
    0xxxxxxx                             1 byte  for  1 to  7 bit chars
    110xxxxx xxxxxxxx                    2 bytes for  8 to 13 bit chars
    1110xxxx xxxxxxxx xxxxxxxx           3 bytes for 14 to 20 bit chars
    11110xxx xxxxxxxx xxxxxxxx xxxxxxxx  4 bytes for 21 to 27 bit chars
...?
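To make the hypothetical table concrete, here is a sketch of it in Python (`hypothetical_encode` is my own name; this is deliberately NOT real UTF-8):

```python
def hypothetical_encode(cp: int) -> bytes:
    """Hypothetical layout from the table above -- NOT real UTF-8.
    Only the first byte carries a length prefix; continuation
    bytes use all 8 bits, so each length covers more code points."""
    if cp < (1 << 7):      # 0xxxxxxx: up to 7 bits
        return bytes([cp])
    if cp < (1 << 13):     # 110xxxxx xxxxxxxx: up to 13 bits
        return bytes([0xC0 | (cp >> 8), cp & 0xFF])
    if cp < (1 << 20):     # 1110xxxx xxxxxxxx xxxxxxxx: up to 20 bits
        return bytes([0xE0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
    # 11110xxx xxxxxxxx xxxxxxxx xxxxxxxx: up to 27 bits
    return bytes([0xF0 | (cp >> 24), (cp >> 16) & 0xFF,
                  (cp >> 8) & 0xFF, cp & 0xFF])

# It does pack more bits per length, e.g. U+1000 fits in 2 bytes:
assert hypothetical_encode(0x1000) == b'\xd0\x00'
# But a side effect is visible: the ASCII byte 0x41 ('A') can now
# appear as a continuation byte inside a multi-byte sequence, so a
# byte no longer tells you on its own whether it starts a character:
assert hypothetical_encode(0x0141) == b'\xc1\x41'
```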
Thanks!
If you drop `10xxxxxx`, you can use `10xxxxxx xxxxxxxx`: 2 bytes for 8 to 14 bits? – ony – 2019-05-04T14:08:20.450