Length of a UTF-8 byte sequence

15

1

Determine the length of a UTF-8 byte sequence given its first byte. The following table shows which ranges map to each possible length:

  Range    Length
---------  ------
0x00-0x7F    1
0xC2-0xDF    2
0xE0-0xEF    3
0xF0-0xF4    4

Notes on gaps in the table: 0x80-0xBF are continuation bytes, 0xC0-0xC1 would start an overlong, invalid sequence, 0xF5-0xFF would result in a codepoint beyond the Unicode maximum.

Write a program or function that takes the first byte of a UTF-8 byte sequence as input and outputs or returns the length of the sequence. I/O is flexible. For example, the input can be a number, an 8-bit character or a one-character string. You can assume that the first byte is part of a valid sequence and falls into one of the ranges above.

This is code golf. The shortest answer in bytes wins.

Test cases

0x00 => 1
0x41 => 1
0x7F => 1
0xC2 => 2
0xDF => 2
0xE0 => 3
0xEF => 3
0xF0 => 4
0xF4 => 4

nwellnhof

Posted 2018-10-06T12:35:14.103

Reputation: 10 037

Is an input of a list of the 8 bits acceptable? – Jonathan Allan – 2018-10-07T17:25:47.010

@JonathanAllan No, that would be taking flexible I/O too far. – nwellnhof – 2018-10-08T12:00:29.573

Answers

5

Forth, 6 bytes

x-size

see https://forth-standard.org/standard/xchar/X-SIZE

Input and output follows a standard Forth model:

Input

Memory address + length (i.e. 1) of a single-byte UTF-8 "string".

Output

UTF-8 sequence length in bytes.

Sample Code

Store 0xF0 in a memory cell, and invoke x-size:

variable v
0xF0 v !
v 1 x-size

Check the result:

.s <1> 4  ok

zeppelin

Posted 2018-10-06T12:35:14.103

Reputation: 7 884

Assuming this works in https://tio.run/#forth-gforth, could you show an example? I don't understand how you could have a single-byte UTF-8 string if the byte is 0xF0.

– Dennis – 2018-10-06T15:40:49.210

ockquote>

could you show an example? I don't understand how you could have a single-byte UTF-8 string if the byte is 0xF0.

I've added some sample code demonstrating how to do it.

Unfortunately, the TIO version of gforth does not seem to support the Unicode words (according to "see x-size", it is just hard-coded to return 1 there). – zeppelin – 2018-10-06T15:57:38.567

I see. That's not what I'd call an UTF-8 string though, as F0 alone is an invalid byte sequence, as far as UTF-8 is concerned. – Dennis – 2018-10-06T16:09:02.183

ockquote>

as F0 alone is an invalid byte sequence

True (that's why I've put the word "string" in quotes), but this task is specifically about recognizing the sequence by it's first byte, and Forth does not really care for it being invalid, which makes this solution possible, in turn. – zeppelin – 2018-10-06T18:40:47.240

6

Z80Golf, 19 14 bytes

00000000: 2f6f 3e10 37ed 6a3d 30fb ee07 c03c       /o>.7.j=0....<

Try it online!

-5 bytes thanks to @Bubbler

Example with input 0x41-Try it online! Assembly

Example with input 0xC2-Try it online!

Example with input 0xE0-Try it online!

Example with input 0xF4-Try it online!

Assembly:

;input: register a
;output: register a
byte_count:			;calculate 7^(log2(255^a))||1
	cpl			;xor 255
	ld l,a
	log2:
		ld	a,16
		scf
	log2loop:
		adc	hl,hl
		dec	a
		jr	nc,log2loop
	xor 7
	ret nz
	inc a

Try it online!

Logern

Posted 2018-10-06T12:35:14.103

Reputation: 845

Use Bash TIO to work with assembly, with easier-to-see examples. The link also has 15-byte version of your solution. Here are the improvements: xor 0xff -> cpl, no need to or a, jr nz, return -> ret nz, ld a,1 -> inc a.

– Bubbler – 2018-10-11T07:37:48.893

5

C (gcc), 39 bytes

t(char x){x=(__builtin_clz(~x)-24)%7u;}

Try it online!

user202729

Posted 2018-10-06T12:35:14.103

Reputation: 14 620

Why char and not int? – R.. GitHub STOP HELPING ICE – 2018-10-07T02:05:49.730

@R.. Because they get sign extended. For example ~(char)0xF0 == ~(int)0xFFFFFFF0 (assume char = signed char, sizeof(int) == 4) – user202729 – 2018-10-07T02:10:17.537

Ah, assuming char is signed. – R.. GitHub STOP HELPING ICE – 2018-10-07T02:12:00.120

4

Jelly,  8  7 bytes

+⁹BIITḢ

A monadic link accepting the byte as an integer.

Try it online! Or see all inputs evaluated.

If an input of a list of the 8 bits were acceptable then the method is only 6 bytes: 1;IITḢ, however it has been deemed as talking flexible I/O too far.

How?

+⁹BIITḢ - Link: integer       e.g.: 127 (7f)            223 (df)            239 (ef)            244 (f4)
 ⁹      - literal 256
+       - add                       383                 479                 495                 500
  B     - to a list of bits         [1,0,1,1,1,1,1,1,1] [1,1,1,0,1,1,1,1,1] [1,1,1,1,0,1,1,1,1] [1,1,1,1,1,0,1,0,0]
   I    - increments                [-1,1,0,0,0,0,0,0]  [0,0,-1,1,0,0,0,0]  [0,0,0,-1,1,0,0,0]  [0,0,0,0,-1,1,-1,0]
    I   - increments                [2,-1,0,0,0,0,0]    [0,-1,2,-1,0,0,0]   [0,0,-1,2,-1,0,0]   [0,0,0,-1,2,-2,1]
     T  - truthy indices            [1,2]               [2,3,4]             [3,4,5]             [4,5,6,7]
      Ḣ - head                      1                   2                   3                   4

Jonathan Allan

Posted 2018-10-06T12:35:14.103

Reputation: 67 804

3

Haskell, 28 bytes

f x=sum[1|y<-"Áßï",x>y]+1

Try it online!

ბიმო

Posted 2018-10-06T12:35:14.103

Reputation: 15 345

3

Jelly, 8 7 bytes

»Ø⁷Ba\S

Try it online!

How it works

»Ø⁷Ba\S  Main link. Argument: n (integer)

 Ø⁷      Yield 128.
»        Take the maximum of n and 128.
   B     Yield the array of binary digits.
    a\   Cumulatively reduce by AND, replacing 1's after the first 0 with 0's.
      S  Take the sum.

Dennis

Posted 2018-10-06T12:35:14.103

Reputation: 196 637

3

Python 2, 28 bytes

lambda x:1.4**(x/16-11)//1+1

Try it online!

ovs

Posted 2018-10-06T12:35:14.103

Reputation: 21 408

2

JavaScript (Node.js), 24 bytes

x=>7^Math.log2(255^x)||1

Try it online!

user202729

Posted 2018-10-06T12:35:14.103

Reputation: 14 620

2

Ruby, 27 23 bytes

->x{2+x[7]+(x/16<=>14)}

Try it online!

G B

Posted 2018-10-06T12:35:14.103

Reputation: 11 099

1

Charcoal, 12 bytes

I⌕⍘⌈⟦N¹²⁸⟧²0

Try it online! Link is to verbose version of code. Explanation:

     N          Input number
      ¹²⁸       Literal 128
   ⌈⟦    ⟧      Take the maximum
  ⍘       ²     Convert to base 2 as a string
 ⌕         0    Find the position of the first `0`
I               Cast to string
                Implicitly print

Neil

Posted 2018-10-06T12:35:14.103

Reputation: 95 035

1

05AB1E, 8 7 bytes

žy‚àb0k

Port of @Neil's Charcoal answer.
-1 byte thanks to @Grimy.

Input as integer.

Try it online or verify all test cases.

Explanation:

žy       # Push 128
  ‚      # Pair it with the (implicit) input-integer
   à     # Take the maximum of this pair (128 and input)
    b    # Convert it to a binary-string
     0k  # Get the 0-based first index of a "0" in this binary-string
         # (and output it implicitly as result)

Kevin Cruijssen

Posted 2018-10-06T12:35:14.103

Reputation: 67 575

1s) to for 7. Porting the other Jelly answer gives another 8: ₁+b¥η€ËO – Grimmy – 2019-10-04T12:06:09.533

@Grimy No idea why I didn't had in the first place.. :S But thanks for -1. – Kevin Cruijssen – 2019-10-04T12:19:49.693

1

Jelly, 7 bytes

»Ø⁷Bi0’

Port of my 05AB1E answer.

Try it online or verify all test cases.

Explanation:

 Ø⁷        # Push 128
»          # Take the max of 128 and the input
   B       # Convert it to binary
    i0     # Get the 1-indexed first index of a 0
      ’    # Decrease it by 1 to make it 0-indexed (and output it implicitly)

Kevin Cruijssen

Posted 2018-10-06T12:35:14.103

Reputation: 67 575

1

Perl 6, 18 bytes

{7-msb(255-$_)||1}

Try it online!

Port of user202729's JavaScript answer. Alternatives with WhateverCode:

(255-*).msb*6%34%7
-(255-*).msb%6%5+1

nwellnhof

Posted 2018-10-06T12:35:14.103

Reputation: 10 037

1

x86 Assembly, 11 bytes

00000000 <f>:
   0:   f6 d1                   not    %cl
   2:   0f bd c1                bsr    %ecx,%eax
   5:   34 07                   xor    $0x7,%al
   7:   75 01                   jne    a <l1>
   9:   40                      inc    %eax
0000000a <l1>:
   a:   c3                      ret

Try it online!

Port of user202729's JavaScript answer. Uses fastcall conventions.

nwellnhof

Posted 2018-10-06T12:35:14.103

Reputation: 10 037

1

Labyrinth, 35 bytes

? 28& 16/ )!@!
:_1 ";_ _3&""2
   @1

Try it online!

Unwrapped version of the code:

?:_128&1!@
      ;
      _16/_3&2!@
            )
            !
            @

Herman L

Posted 2018-10-06T12:35:14.103

Reputation: 3 611

0

C, 31 bytes

f(x){return(x-160>>20-x/16)+2;}

Try it online!

27 bytes with gcc (-O0)

f(x){x=(x-160>>20-x/16)+2;}

Alternatives, 31 and 33 bytes

f(x){return(10>>15-x/16)+7>>2;}
f(x){return x/128-(-3>>15-x/16);}

I found these expressions when playing around with the Aha! superoptimizer a few years ago.

nwellnhof

Posted 2018-10-06T12:35:14.103

Reputation: 10 037