Decode a UTF-8 character

3

The challenge

Your task is to decode the first character in a UTF-8 byte sequence. The input is a byte array or a byte string holding a UTF-8 byte sequence.

If the (start of the) sequence is valid, the output is the Unicode code point of the first character in the sequence which is a number between 0 and 0x10FFFF. Your program or function can return a numeric data type or return or print a textual representation of the number in any base.

If the sequence is invalid, your program must signal an error. You can return a unique value that is outside the range of Unicode code points, throw an exception, or whatever.

The UTF-8 string must be decoded according to RFC 3629. Here's the UTF-8 syntax from the RFC:

UTF8-octets = *( UTF8-char )
UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
UTF8-1      = %x00-7F
UTF8-2      = %xC2-DF UTF8-tail
UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
              %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
              %xF4 %x80-8F 2( UTF8-tail )
UTF8-tail   = %x80-BF

Any byte sequence that doesn't match the syntax above must result in an error. Besides the more obvious cases of invalid input, this means:

  • No overlong encodings.
  • No code points above 0x10FFFF.
  • No surrogates.

Trailing bytes after the first character are allowed and don't have to be checked for validity. Unexpected end of input must obviously result in an error but you can assume that the byte sequence isn't empty.

Zero-terminated strings are allowed. If you choose to work with zero-terminated strings, U+0000 can't be detected, so it's OK to only handle code points U+0001 to U+10FFFF. Otherwise, U+0000 must be handled.

Test input

Here are some hex byte sequences that can be used as test input. The valid sequences may optionally be followed by any other bytes.

01           Valid, U+0001   START OF HEADING
41           Valid, U+0041   LATIN CAPITAL LETTER A
7F           Valid, U+007F   DELETE
C3 9F        Valid, U+00DF   LATIN SMALL LETTER SHARP S
E2 80 94     Valid, U+2014   EM DASH
F0 9F A4 98  Valid, U+1F918  SIGN OF THE HORNS
F4 8F BF BF  Valid, U+10FFFF Noncharacter but valid

85           Invalid, starts with continuation byte
C0 80        Invalid, overlong two-byte sequence
C3 C0        Invalid continuation byte
D4           Invalid, unexpected end of input
E0 9F BF     Invalid, overlong three-byte sequence
E3 82        Invalid, unexpected end of input
ED A2 93     Invalid, surrogate U+D893
F0 8A B2 A0  Invalid, overlong four-byte sequence
F1 B3 B8     Invalid, unexpected end of input
F2 80 B2 53  Invalid continuation byte
F4 93 81 B3  Invalid, code point above U+10FFFF
F5           Invalid start byte
FF           Invalid start byte

Rules

  • Functions are allowed.
  • Using any UTF-8 decoding features of your programming language is not allowed. You can't just call ord in Perl, for example.
  • This is code golf. The shortest answer wins. No loopholes.

nwellnhof

Posted 2017-02-14T00:27:50.547

Reputation: 10 037

Question was closed 2017-02-14T21:07:10.727

3You say that we must reject invalid UTF-8, but you also say that we only need to decode one character. What if the first code point is valid UTF-8, but 1000 code points later the stream is invalid? – orlp – 2017-02-14T04:23:45.343

1How does the prohibition on UTF-8 decoding features work in cases where a language naturally does Unicode decoding as part of the process of reading input? (Not necessarily planning to answer in such a language, but I know they exist.) – None – 2017-02-14T05:35:43.737

@orlp Bytes after the first character don't have to be checked for validity. I just tried to make that point clearer in the description. – nwellnhof – 2017-02-14T15:13:15.717

@ais523 Then you have to find some way to get the input as a sequence of numbers representing the UTF-8 bytes. – nwellnhof – 2017-02-14T15:16:20.147

1This whole thing seems to be written for someone who already understands the specifications; for me it's pretty unclear. "Any byte sequence that doesn't match the syntax above", for example does not explain what the "syntax above" means. Also, what are "overlong encodings" and what are "surrogates"? – Jonathan Allan – 2017-02-14T19:53:37.520

No answers