Decode a UTF-8 string to character byte counts

3

Given a UTF-8 string, output a string that represents the byte count of each character.

Rules:

  1. As long as the input string is encoded in UTF-8 (without BOM), its type doesn't matter. In C++, it can be char[] or std::string. In Haskell, it can be [Int8] or [CChar].

  2. The output must be (in the usual case) a string of ASCII digit characters. Its encoding doesn't matter.

  3. When given an invalid UTF-8 sequence (including encoded surrogates), every byte of the sequence is to be represented as the replacement character �.

  4. Reserved characters and noncharacters are still considered valid UTF-8 sequences, and thus must have a corresponding output digit.

  5. As this is code-golf, the shortest code in bytes wins.

Example:

When given the following byte sequence:

00 80 C0 80 EB 80 80

The output must be:

1���3
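For reference, the rules above can be sketched in ungolfed Python. This is only one possible reading of the rules, not a competing entry: the function name byte_counts is hypothetical, and it assumes that Python's strict UTF-8 decoder matches the intended notion of validity (it rejects overlong forms such as C0 80 and encoded surrogates, and accepts noncharacters).

```python
def byte_counts(data: bytes) -> str:
    """Return one digit per valid UTF-8 character, U+FFFD per invalid byte."""
    out = []
    i = 0
    while i < len(data):
        # Try the shortest candidate first: a UTF-8 character is 1 to 4 bytes.
        for n in (1, 2, 3, 4):
            chunk = data[i:i + n]
            try:
                # Accept only a full-length chunk that decodes to one character.
                if len(chunk) == n and len(chunk.decode("utf-8")) == 1:
                    out.append(str(n))
                    i += n
                    break
            except UnicodeDecodeError:
                continue
        else:
            # No length worked, so this byte is invalid on its own.
            out.append("\ufffd")  # the replacement character
            i += 1
    return "".join(out)


print(byte_counts(bytes.fromhex("00 80 C0 80 EB 80 80")))  # 1���3
```

Because each failed position consumes exactly one byte, every byte of an invalid sequence ends up with its own replacement character, as rule 3 requires.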

Dannyu NDos

Posted 2019-09-29T05:30:55.623

Reputation: 2 087

Question was closed 2019-09-29T07:26:37.010

Recommended test case: b'Hej d\xc3\xc5!' (i.e. 48 65 6A 20 64 C3 C5 21) (invalid continuation byte (C5) after a valid start byte (C3)) (expected output: 11111��1) – pizzapants184 – 2019-09-29T06:42:07.677

Closely related – Adám – 2019-09-29T07:25:46.947

Answers

2

Python 3, 135 bytes

def f(b):
	B=b.decode(errors="ignore")[:1].encode();l=len(B)
	if B and b[:l]!=B:return"�"+f(b[1:])
	return str(l)+f(b[l:])if b else''

Explanation:

def f(b):
	B=b.decode(errors="ignore")[:1].encode();l=len(B)
	 # B is the UTF-8 encoding of the first character that decodes validly from b
	 # (any invalid bytes before it are skipped by errors="ignore"); l is its byte length
	if B and b[:l]!=B:return"�"+f(b[1:])
	 # If b does not start with B,
	 # then the first byte in b is invalid,
	 # so emit "�" for that byte and recurse on the rest
	return str(l)+f(b[l:])if b else''
	 # If the byte-string is not empty,
	 #  append the length of the first character and recurse.
	 # Else (the input is empty) return the empty string (recursion base case)

pizzapants184

Posted 2019-09-29T05:30:55.623

Reputation: 3 174