Chopping Unicode Bytes

0

Introduction

Each Unicode codepoint can be represented in UTF-8 as a sequence of 1 to 4 bytes. Because of this, the bytes of a 2-, 3-, or 4-byte character can be reinterpreted as multiple 1-byte characters. (See here for a UTF-8 to bytes converter.)
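As a rough illustration of how a codepoint maps to between 1 and 4 bytes, here is a minimal Python sketch that performs the UTF-8 encoding by hand. The function name `utf8_bytes` is made up for this example; in practice `chr(cp).encode("utf-8")` does the same job.

```python
def utf8_bytes(codepoint: int) -> bytes:
    """Encode a single Unicode codepoint as UTF-8 (hypothetical helper)."""
    if not 0 <= codepoint <= 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if codepoint < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([codepoint])
    if codepoint < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    if codepoint < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (codepoint >> 18),
                  0x80 | ((codepoint >> 12) & 0x3F),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])


if __name__ == "__main__":
    for cp in (0x61, 0x700, 0x8042):
        print(hex(cp), " ".join(f"0x{b:02x}" for b in utf8_bytes(cp)))
```

For 0x8042, the top 4 bits (1000) go into the first byte, the middle 6 bits (000001) into the second, and the low 6 bits (000010) into the third, giving 0xe8 0x81 0x82.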

Challenge

Given a UTF-8 character, output it split into a sequence of 1-byte characters. If the character is 1-byte already, return it unchanged.

  • Your program must take one, and exactly one, UTF-8 character as input. You may use any input method you wish, as long as it has been decided on meta that it is a valid method. You cannot take input as a bytearray or series of bytes; then the challenge would just be converting hex to ASCII.
  • Your program must output one or more ASCII 1-byte characters. Again, any output method is allowed as long as it has been marked valid on meta. Edit: As per the conversation in the comments, output should be in Code Page 850.

Note: see this post for valid I/O methods.

Example I/O

܀ (0x700)
܀ (0xdc 0x80)

a (0x61)
a (0x61)

聂 (0x8042)
Þüé (0xe8 0x81 0x82)
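One possible reading of the task, sketched in ungolfed Python: encode the input character as UTF-8, then decode each resulting byte as a Code Page 850 character. The name `chop` is invented for this sketch, and `cp850` is the codec name Python's standard library uses for that code page.

```python
def chop(ch: str) -> str:
    """Split one character into 1-byte characters, rendered in CP850."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    raw = ch.encode("utf-8")      # 1 to 4 bytes
    return raw.decode("cp850")    # one CP850 character per byte


if __name__ == "__main__":
    for ch in ("a", "\u8042"):
        raw = ch.encode("utf-8")
        print(chop(ch), " ".join(f"0x{b:02x}" for b in raw))
```

Under this reading, a 1-byte input such as `a` comes back unchanged, because CP850 agrees with ASCII for bytes below 0x80.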

Rules

This is code-golf, so shortest answer in bytes wins!

sugarfi

Posted 2019-11-25T00:04:51.553

Reputation: 1 239

Question was closed 2019-11-25T03:10:44.383

6

It looks like the hex values in the output examples map to the hex byte values of the split input, but I don't know where the actual output characters are coming from. I checked extended ASCII, UTF-8, UTF-16, Unicode... – Malivil – 2019-11-25T01:23:35.207

3

Can we input as a byte array (obviously allowed input method for a string) and output as a byte array (obviously allowed output method for a string)? :) – my pronoun is monicareinstate – 2019-11-25T01:57:21.530

Could you explain the splitting? How does the two byte number 0x8042 split into the three bytes 0xe8, 0x81, 0x82 for example? – Jonathan Allan – 2019-11-25T02:04:50.107

@JonathanAllan I believe 0x8042 is the codepoint of the character, which translates to those bytes in UTF-8 – Jo King – 2019-11-25T02:07:40.120

@JoKing "translates" how? Should be in the question IMO... – Jonathan Allan – 2019-11-25T02:08:30.473

5

@JoKing Hmmm - "Your program must take one, and exactly one, UTF-8 character as input" - if we can take that as bytes then this is just "convert to chars", and if we can output those chars as bytes it's a no-op :/ – Jonathan Allan – 2019-11-25T02:22:57.053

I believe the characters are taken from this image from asciitable.com (warning, there's like 20 ads on the page), though I have no idea what encoding it is

– Jo King – 2019-11-25T02:30:27.057

2

@JoKing Do you mean Code page 437?

– tsh – 2019-11-25T09:00:11.907

There are a lot of things about this question that need clarification. There's no such thing as a "UTF-8 codepoint"; maybe you meant a Unicode codepoint? Moreover, in UTF-8, it's not possible to interpret a 2-, 3-, or 4-byte character as multiple 1-byte characters. Moreover, given a Unicode character, it's not possible to "split [it] into a sequence of 1-byte characters," so you should clarify what you mean by that. Moreover, you say that the output must be "ASCII 1-byte characters" even though the vast majority of Unicode characters are impossible to represent as ASCII characters. – Tanner Swett – 2019-11-25T15:27:20.357

Moreover, you say that the input is "a UTF-8 character," but in the examples you list, the input characters are not in UTF-8. Moreover, since you wrote "1-byte characters" and not "bytes," and since the example outputs are given in the form of characters, it sounds like you're asking us to apply a character encoding to the output, but it's not clear which one we're supposed to use: you say ASCII, but that encoding isn't possible to use, and the examples don't use it. – Tanner Swett – 2019-11-25T15:32:22.097

3

I'm guessing that you're asking us to take a Unicode codepoint as a number and output its UTF-8 representation as a sequence of bytes, but that's only a guess. Is that right? – Tanner Swett – 2019-11-25T15:32:55.293

1

@tsh The third example, featuring Þ at 0xE8, suggests Code page 850, but nice pick indeed!

– Nacre – 2019-11-25T21:13:00.127

@TannerSwett - yes, you must take a Unicode codepoint and output it as a sequence of bytes. – sugarfi – 2019-11-25T21:34:48.247

1

@sugarfi With the tool at http://www.ltg.ed.ac.uk/~richard/utf-8.html, your first example (܀) turns into Ü <80>, as the tool relies on ISO-8859-1 rather than code page 850. ISO-8859-1 features no printable char at 0x80. Do you code under DOS?

– Nacre – 2019-11-25T22:02:11.813

@Nacre - funny, I looked up "ASCII 128" and the result was the character above. – sugarfi – 2019-11-26T00:52:02.567

2

Funny, I look up "ASCII 128" and the first result tells me "The ASCII table has 128 characters, with values from 0 through 127". You are going to have to specify which one byte code page you are talking about

– Jo King – 2019-11-26T10:38:55.630

As far as I understand that's the only thing holding this challenge up. Which code page should the output be in? Based on your example, @Nacre seems to be correct with code page 850. So either say "It's code page 850" or update the question and example to be something else (like UTF-8) – Malivil – 2019-11-26T13:10:41.820

@Malivil - will do. – sugarfi – 2019-11-26T20:36:26.673

I have updated my answer to conform to the confirmed code page – Malivil – 2019-11-26T20:51:36.580

The UTF-8 to bytes converter that you link in your question doesn't use CP850. I think it would be better to say that you can output the codepoint, e.g. 0x80, if CP850 is not supported by your language. – frank – 2019-11-26T21:22:40.473

If the input is the code point, and the output is a series of bytes, then what is the point of specifying the output encoding? – Jo King – 2019-12-02T05:25:00.517

I'm not sure, people on this stream asked me to, so I did – sugarfi – 2019-12-02T21:25:20.050

Answers

2

C# (Visual C# Interactive Compiler), 73 85 bytes

a=>Encoding.UTF8.GetBytes(a).Select(b=>Encoding.GetEncoding(850).GetString(new[]{b}))

Try it online!

+12 bytes to use the updated code page
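To illustrate why the choice of code page changes the output, here is a small Python sketch (not part of the golfed answer) that decodes the same UTF-8 bytes under three single-byte encodings mentioned in the comments. The helper name `render` is invented for this example; `cp850`, `cp437`, and `latin-1` are Python's codec names for Code Page 850, Code Page 437, and ISO-8859-1.

```python
def render(data: bytes, codec: str) -> str:
    """Decode raw bytes with a single-byte codec, one character per byte."""
    return data.decode(codec)


if __name__ == "__main__":
    raw = "\u8042".encode("utf-8")  # the bytes 0xe8 0x81 0x82
    for codec in ("cp850", "cp437", "latin-1"):
        # repr() keeps non-printable results (such as C1 controls) visible
        print(codec, repr(render(raw, codec)))
```

Only Code Page 850 maps 0xE8 to Þ, which matches the third example in the question; Code Page 437 gives Φ there, and ISO-8859-1 gives è followed by two non-printable control characters.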

Malivil

Posted 2019-11-25T00:04:51.553

Reputation: 345