Bytes/Character

28

1

Task

Given a UTF-8 string (by any means) answer (by any means) an equivalent list where every element is the number of bytes used to encode the corresponding input character.
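
As an ungolfed point of reference, the task can be sketched in a few lines of Python 3, assuming the input has already been decoded to a `str`. The function name `byte_counts` is illustrative and not part of the challenge.

```python
def byte_counts(text: str) -> list:
    """Return the UTF-8 byte length of each character in text."""
    return [len(ch.encode("utf-8")) for ch in text]


print(byte_counts("Ciao"))             # [1, 1, 1, 1]
print(byte_counts("\u0109a\u016d"))    # precomposed characters: [2, 1, 2]
print(byte_counts("c\u0302au\u0306"))  # combining marks: [1, 2, 1, 1, 2]
print(byte_counts(""))                 # []
print(byte_counts("\0"))               # [1]
```

Encoding one character at a time keeps the sketch independent of how the language stores strings internally.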

Examples

! → 1

Ciao → 1 1 1 1

tʃaʊ → 1 2 1 2

Adám → 1 1 2 1

ĉaŭ → 2 1 2 (single characters)

ĉaŭ → 1 2 1 1 2 (uses combining overlays)

チャオ → 3 3 3

(empty input) → (empty output)

!±≡𩸽 → 1 2 3 4

� (a null byte) → 1

Null bytes

If the only way to keep reading input beyond null bytes is by knowing the total byte count, you may get the byte count by any means (even user input).

If your language cannot handle null bytes at all, you may assume the input does not contain nulls.
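
The difference between stopping at a NUL and reading a known number of bytes can be illustrated from Python with `ctypes`: the `.value` accessor of a character buffer stops at the first NUL, much as a C string function would, while `.raw` exposes the whole buffer. This is only a sketch, and the sample bytes are made up for the illustration.

```python
import ctypes

data = b"a\x00b"                      # three bytes with an embedded NUL
buf = ctypes.create_string_buffer(data)

as_c_string = buf.value               # read like a C string: stops at the NUL
with_length = buf.raw[:len(data)]     # read exactly len(data) bytes

print(as_c_string)   # b'a'
print(with_length)   # b'a\x00b'
```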

Adám

Posted 2016-06-23T13:43:23.890

Reputation: 37 779

If the input is empty can we output 0 or another falsey value? – Alex A. – 2016-06-23T16:30:42.657

@AlexA. No, that would prevent stringing together multiple results, and I already gave the spec for empty input. – Adám – 2016-06-23T16:49:37.483

That's fine but I don't get what you mean regarding stringing together results. – Alex A. – 2016-06-23T16:51:02.957

@AlexA. Let's say we are receiving and counting multiple inputs, and each input gets run through the byte counter. The byte counts are appended to a result file. A non-empty answer to empty input would cause input and result file to get out of sync length-wise. – Adám – 2016-06-23T16:55:08.760

Can I print the byte counts without separation? The highest possible value is 6, so it's unambiguous. – Dennis – 2016-06-23T18:28:27.523

@Dennis Yes, that's fine. – Adám – 2016-06-23T18:29:43.160

You know what's amazing? Copying the two ĉaŭ test cases out of this question works and preserves the combining characters on the second one, even though they produce identical glyphs. – cat – 2016-06-23T19:19:43.017

@Adám I wish that had been added to the question in the first place, that will quite shorten some implementations – cat – 2016-06-23T19:22:53.317

@cat What had been added? – Adám – 2016-06-23T19:29:56.853

Do we have to support null bytes? Those can be a real pain in some languages... – Dennis – 2016-06-23T20:10:03.040

@Dennis Yes, but feel free to include the shorter version that doesn't. – Adám – 2016-06-23T21:06:02.393

You should add that to the post. I don't know most of the languages well enough to tell if it makes a difference, but I think it invalidates at least two of the answers. – Dennis – 2016-06-23T21:31:15.527

@Dennis I tried, but feel free to edit if you can make it better. – Adám – 2016-06-23T21:56:59.827

My language doesn't see a difference between a NUL byte and the end of a string. Can I request that the length of the string be given as a parameter? – cat – 2016-06-24T11:32:13.773

@cat That won't help you know where the null bytes are. See edit. – Adám – 2016-06-24T14:49:31.150

@Adám yes it will. In C, for example, C strings end with a NUL byte, so you stop reading as soon as you find one. If you know the length of the string, you stop reading after that many bytes, NUL and all. – cat – 2016-06-24T14:56:22.553

@cat Ah, ok, I'll add that you can get the byte count if so. – Adám – 2016-06-24T14:57:45.110

How strict are you on the output? Can the byte values be separated by newlines or do they have to be spaces? – JAL – 2016-06-27T22:13:13.387

@JAL OP: by any means. Dennis: Can I print the byte counts without separation? The highest possible value is 6, so it's unambiguous. Adám: Yes, that's fine. – Adám – 2016-06-28T05:32:02.757

Answers

10

Pyth, 9 7 bytes

Thanks to @Maltysen for saving 2 bytes!

mlc.Bd8

Test suite

Converts every character of the input to its binary representation and then splits this into chunks of length 8. The number of those chunks is then the number of bytes needed to encode that character.

Denker

Posted 2016-06-23T13:43:23.890

Reputation: 6 639

@Maltysen That's clever, thanks! – Denker – 2016-06-23T18:08:49.600

Same length answer that relies on a similar trick: mlhc8.B – FryAmTheEggman – 2016-06-23T19:07:06.363

@LeakyNun then it would be a simple thing to give a test case that fails, wouldn't it? – Lause – 2016-06-24T05:18:06.617

To save another byte, instead of splitting into chunks of 8, take every 8th: ml%8.B (now the d is implicit). – Anders Kaseorg – 2016-07-19T06:22:41.253

21

Python 3, 42 36 bytes

lambda x:[len(i.encode())for i in x]

atlasologist

Posted 2016-06-23T13:43:23.890

Reputation: 2 945

-1 byte: use map. lambda x:map(len,map(str.encode,x)) – NoOneIsHere – 2016-06-23T21:04:29.393

11

APL, 15 chars

≢¨'UTF-8'∘⎕ucs¨

In English: convert each character to UTF-8 (meaning: vector of bytes representation) and get its tally.

lstefano

Posted 2016-06-23T13:43:23.890

Reputation: 850

Save a byte: ≢¨'UTF-8'∘⎕ucs¨ – Adám – 2016-06-23T14:30:40.170

Indeed @Adám... Cheers. – lstefano – 2016-06-23T15:00:34.557

An interesting (but longer) array based approach: +⌿0 7 11 16∘.≤2⍟⎕UCS – Adám – 2017-01-05T11:44:58.433

Version 16.0: 0 7 11 16⍸2⍟⎕UCS – Adám – 2017-01-05T20:54:10.300
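
The threshold idea in the two comments above can be sketched in Python. Instead of comparing a base-2 logarithm against 0 7 11 16, this version compares the code point itself against the first values that need 2, 3 and 4 bytes, which avoids taking the logarithm of 0. The names are illustrative, and `bisect_right` stands in for the interval-index primitive.

```python
from bisect import bisect_right

# First code points that need 2, 3 and 4 bytes (2**7, 2**11, 2**16).
THRESHOLDS = [0x80, 0x800, 0x10000]


def utf8_len(code_point: int) -> int:
    """Count the thresholds at or below the code point, then add one."""
    return bisect_right(THRESHOLDS, code_point) + 1


print([utf8_len(ord(ch)) for ch in "Ad\u00e1m"])   # [1, 1, 2, 1]
```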

11

C, 68 65 bytes

b;main(c){for(;~c;b=c/64^2?b?putchar(b+48)/48:1:b+1)c=getchar();}

Thanks to @FryAmTheEggman for golfing off 3 bytes!

Test it on Ideone.

Dennis

Posted 2016-06-23T13:43:23.890

Reputation: 196 637

7

GolfScript, 16 bytes

{64/2=}%1,/{,)}*

Try it online!

Background

GolfScript doesn't have a clue what Unicode is; all strings (input, output, internal) are composed of bytes. While that can be pretty annoying, it's perfect for this challenge.

UTF-8 encodes ASCII and non-ASCII characters differently:

  • All code points below 128 are encoded as 0xxxxxxx.

  • All other code points are encoded as 11xxxxxx 10xxxxxx ... 10xxxxxx.

This means that the encoding of each Unicode character contains either a single 0xxxxxxx byte or a single 11xxxxxx byte and 1 to 5 10xxxxxx bytes.

By dividing all bytes of the input by 64, we turn 0xxxxxxx into 0 or 1, 11xxxxxx into 3, and 10xxxxxx into 2.

If we compare the quotient with 2 – pushing 1 for 2; and 0 for 0, 1, and 3 – each character will be turned into a 0, followed by 1 to 5 1's.

All that's left is to split the resulting string at occurrences of 0, count the number of 1's between those zeroes and add one to the amount.
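
The same byte-level reasoning can be sketched in Python, working on raw bytes rather than decoded text. This is an ungolfed illustration that assumes well-formed UTF-8 input (it would fail on input that starts with a continuation byte); the function name is hypothetical.

```python
def counts_by_continuation(data: bytes) -> list:
    """Count bytes per character by attaching 10xxxxxx bytes to the previous lead byte."""
    counts = []
    for byte in data:
        if byte // 64 == 2:        # 10xxxxxx: continuation byte
            counts[-1] += 1
        else:                      # 0xxxxxxx or 11xxxxxx: starts a new character
            counts.append(1)
    return counts


print(counts_by_continuation("t\u0283a\u028a".encode("utf-8")))   # [1, 2, 1, 2]
```

Because only the top two bits of each byte are inspected, a null byte is handled like any other single-byte character.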

How it works

{     }%          Map the following over all bytes in the input.
 64/                Divide the byte by 64.
    2=              Compare the quotient with 2, pushing 1 or 0.
        1,        Push range(1), i.e., [0].
          /       Split the array of Booleans around zeroes.
           {  }*  Fold; for each run of ones but the first:
            ,       Push its length.
             )      Increment.

Dennis

Posted 2016-06-23T13:43:23.890

Reputation: 196 637

6

JavaScript (ES6), 54 45 43 bytes

s=>[...s].map(c=>encodeURI(c).length/3-8&7)

Edit: Saved 2 bytes with help from @l4m2.

Neil

Posted 2016-06-23T13:43:23.890

Reputation: 95 035

s=>[...s].map(c=>encodeURI(c).length/3-4&3) – l4m2 – 2018-04-07T01:21:08.203

@l4m2 That fails for non-BMP characters but I was able to fix it up. – Neil – 2018-04-07T11:10:40.127

6

PowerShell v4, 58 bytes

[char[]]$args[0]|%{[Text.Encoding]::UTF8.GetByteCount($_)}

NB

OK, this should work, and does in almost all of the test cases except for 𩸽, which is somehow counted as 3,3 on my machine. That character even shows as 7 bytes on my computer. I suspect this is due to some sort of bug in the Windows or .NET version that I'm running locally, as @Mego doesn't have that issue. (Edit: @cat points out this is due to BOM. Thanks for solving that mystery, @cat!)

However, that still doesn't account for all of the problem. I think I know where some of the problems are coming from, though. Inside .NET, all strings are composed of UTF-16 code units (which is the System.Char type). With the very loose typecasting that PowerShell uses, there's a lot of implicit casting and conversion between types in the background. Likely this is a contributing factor to the behavior we're seeing -- for example, [system.text.encoding]::utf8.getchars([System.Text.UTF8Encoding]::UTF8.GetBytes('𩸽')) returns two unprintables, rather than a single character.
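
Both effects can be reproduced outside PowerShell. The sketch below uses Python and a four-byte character such as U+29E3D: with a BOM it occupies 7 bytes, and in UTF-16 it takes two code units. That each lone surrogate is then counted as 3 bytes because it gets replaced by U+FFFD is an assumption about .NET's default encoder behaviour, not something confirmed in this answer.

```python
ch = "\U00029E3D"                        # a non-BMP character, 4 bytes in UTF-8

plain = ch.encode("utf-8")
with_bom = ch.encode("utf-8-sig")        # the utf-8-sig codec prepends EF BB BF
print(len(plain), len(with_bom))         # 4 7

utf16_units = len(ch.encode("utf-16-le")) // 2
print(utf16_units)                       # 2 (a surrogate pair)

replacement = "\ufffd".encode("utf-8")   # U+FFFD REPLACEMENT CHARACTER
print([len(replacement)] * utf16_units)  # [3, 3]
```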


Explanation

Very straightforward code. Takes the input $args[0] and explicitly casts it as a char-array so we can loop through each component of the string |%{...}. Each iteration, we use the .NET call [System.Text.Encoding]::UTF8.GetByteCount() (the System. is implied) to get the byte count of the current character $_. That's placed on the pipeline for later output. Since it's a collection of [int]s that are returned, casting to an array is implicit.

Test Runs

PS C:\Tools\Scripts\golfing> .\bytes-per-character.ps1 'tʃaʊ'
1
2
1
2

PS C:\Tools\Scripts\golfing> .\bytes-per-character.ps1 'Adám'
1
1
2
1

PS C:\Tools\Scripts\golfing> .\bytes-per-character.ps1 'ĉaŭ'
2
1
2

PS C:\Tools\Scripts\golfing> .\bytes-per-character.ps1 'ĉaŭ'
1
2
1
1
2

PS C:\Tools\Scripts\golfing> .\bytes-per-character.ps1 'チャオ'
3
3
3

PS C:\Tools\Scripts\golfing> .\bytes-per-character.ps1 '!±≡'
1
2
3
3
3

Edited to add This does properly account for the null-bytes requirement that was added to the challenge after I originally posted, provided you pull the data from a text file and pipe it as follows:

PS C:\Tools\Scripts\golfing> gc .\z.txt -Encoding UTF8|%{.\bytes-per-character.ps1 $_}
2
1
1
1

z.txt

AdmBorkBork

Posted 2016-06-23T13:43:23.890

Reputation: 41 581

That character even shows as 7 bytes on my computer. Yes, that's because of the Byte-Order Mark, which is what you get on Windows with UTF-8. Tell Notepad++ to use UTF-8 without BOM (as you should *always avoid the BOM*, especially for compatibility with Unices) and you will find the file has a size of 4 bytes, because the BOM is 3 and 4 + 3 = 7 – cat – 2016-06-23T19:04:06.063

@cat Ah, yes, that makes sense. OK, so that accounts for the difference in file sizes. However, that still doesn't account for the differing behavior inside the shell itself. For example, saving it as UTF-8 without BOM, and running get-content -Encoding UTF8 .\z.txt|%{.\bytes-per-character.ps1 $_} still returns 3,3. – AdmBorkBork – 2016-06-23T19:13:10.070

The -Encoding parameter does not appear to be supported. – Mego – 2016-06-23T19:16:50.807

But apparently it still works fine anyway – AdmBorkBork – 2016-06-23T19:28:45.417

5

Java 10, 100 96 95 67 61 bytes

a->{for(var c:a)System.out.print(c.getBytes("utf8").length);}

-4 bytes removing spaces because this is allowed in the comments
-1 byte changing UTF-8 to utf8
-28 bytes going from Java 7 to 8 (a->{...} instead of void c(char[]i)throws Exception{...})
-3 bytes taking the input as String-array instead of character-array, and
-3 bytes going from Java 8 to 10 (var instead of String)

Explanation:

Try it online.

a->{                      // Method with String-array parameter and no return-type
  for(var c:a)            //  Loop over the input-array
    System.out.print(     //   Print:
      c.getBytes("utf8")  //    The bytes as array in UTF-8 of the current item,
       .length);}         //    and print the amount of bytes in this array

Kevin Cruijssen

Posted 2016-06-23T13:43:23.890

Reputation: 67 575

Does it work for null bytes? – cat – 2016-06-24T01:39:59.067

@cat The test case for null-bytes was later added. But yes, it does also work for null-bytes and I've added the test case. – Kevin Cruijssen – 2016-06-24T06:48:43.920

5

Perl 6,  77 69  63 bytes

put +$0 if $_».base(2).fmt("%8d")~~/^(1)**2..*|^(" ")/ while $_=$*IN.read: 1
put +$0 if $_».fmt("%8b")~~/^(1)**2..*|^(" ")/ while $_=$*IN.read: 1

put 1+$0 if $_».fmt("%8b")~~/^1(1)+|^" "/while $_=$*IN.read: 1
put 1+$0 if $_».fmt("%0.8b")~~/^1(1)+|^0/while $_=$*IN.read: 1

Since Perl 6 uses NFG strings I have to pull in the bytes directly, which sidesteps the feature.
(NFG is like NFC except it also creates synthetic composed codepoints)
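
Why normalization matters here can be shown with Python's `unicodedata` module. Python only offers NFC, not NFG, so this is an analogy rather than a reproduction of Perl 6 behaviour: composing the combining-mark form of ĉaŭ changes the per-character byte counts, which is why reading raw bytes sidesteps the issue.

```python
import unicodedata

decomposed = "c\u0302au\u0306"   # c + combining circumflex, a, u + combining breve
composed = unicodedata.normalize("NFC", decomposed)


def byte_counts(text):
    return [len(ch.encode("utf-8")) for ch in text]


print(byte_counts(decomposed))   # [1, 2, 1, 1, 2]
print(byte_counts(composed))     # [2, 1, 2]
```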

The output is separated by newlines.

Test:

for text in '!' 'Ciao' 'tʃaʊ' 'Adám' 'ĉaŭ' 'ĉaŭ' 'チャオ' '' '!±≡' '\0';
do
  echo -en $text |
  perl6 -e 'put 1+$0 if $_».fmt("%8b")~~/^1(1)+|^" "/while $_=$*IN.read: 1' |

  # combine all of the lines into a single one for display purposes
  env text=$text perl6 -e 'put qq["%*ENV<text>"], "\t\t", lines.gist'
done
"!"     (1)
"tʃaʊ"      (1 2 1 2)
"Adám"      (1 1 2 1)
"ĉaŭ"       (2 1 2)
"ĉaŭ"     (1 2 1 1 2)
"チャオ"       (3 3 3)
""      ()
"!±≡"     (1 2 3 4)
"\0"        (4 1 4)

Explanation:

# turns the list in 「$0」 into a count, and adds one
# 「put」 prints that with a trailing newline
put 1+$0 

   # if the following is true
   if

       # format the input byte to base 2 and pad it out to 8 characters
       $_».fmt("%8b")

       ~~ # smart match against

       # check to see if it starts with more than one 1s, or a space
       # ( also sets 「$0」 to a list that is 1 shorter
       # than the number of bytes in this codepoint )
       / ^1 (1)+ | ^" " /

           # for every byte in STDIN
           while
               $_ = $*IN.read: 1

This works because the first byte in a multi-byte codepoint has the number of bytes encoded inside of it, and the other bytes in the codepoint have the highest bit set but not the next highest, while single-byte codepoints don't have the highest bit set.
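
The matching step can be sketched in Python with the same idea: format each byte as eight binary digits and match it against a pattern that only lead bytes satisfy. The pattern below is a zero-padded analogue of the one in the answer, and the function name is illustrative.

```python
import re

# A lead byte is either 0xxxxxxx or starts with at least two 1 bits.
LEAD = re.compile(r"1(1+)|0")


def counts_by_lead_byte(data: bytes) -> list:
    counts = []
    for byte in data:
        match = LEAD.match(format(byte, "08b"))
        if match:                                   # continuation bytes (10xxxxxx) do not match
            extra_ones = match.group(1) or ""       # 1 bits after the first one
            counts.append(1 + len(extra_ones))
    return counts


print(counts_by_lead_byte("Ad\u00e1m".encode("utf-8")))   # [1, 1, 2, 1]
```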

Brad Gilbert b2gills

Posted 2016-06-23T13:43:23.890

Reputation: 12 713

Can't do read:1 and/or /while$ instead? And if that works, if$? – Erik the Outgolfer – 2016-06-23T17:41:38.493

@EʀɪᴋᴛʜᴇGᴏʟғᴇʀ No because that would be parsed as something different. I can remove the space before while though. – Brad Gilbert b2gills – 2016-06-24T02:00:33.487

Can you explain the NFG countermeasures? – JDługosz – 2016-06-24T09:33:16.530

If I echo a NUL byte to this program's STDIN, it prints \n1\n1\n, is that intentional? Basically, does this handle NUL bytes? – cat – 2016-06-24T11:24:46.747

@cat Why wouldn't it? When I do this: perl -e 'print "\0"' | perl6 -e '...' I get 4 1 4 just like I would expect. ( The part about nuls was added after I posted though ) – Brad Gilbert b2gills – 2016-06-24T13:46:33.637

Ok, I just wasn't sure if I was doing something wrong, cheers – cat – 2016-06-24T13:48:51.287

@JDługosz The only countermeasure is to not use strings (Str class), and use buffers (Buf class) instead. (read returns a Buf[uint8]) – Brad Gilbert b2gills – 2016-06-24T13:50:32.580

I thought it was something about the difference between the two unstriked variations. You mean calling read rather than just letting $_ just appear on its own? – JDługosz – 2016-06-25T03:51:58.650

@JDługosz I mean calling $*IN.read instead of get $*IN.get $*IN.comb lines $*IN.lines slurp $*IN.slurp-rest $*IN.readchars or something else which returns a Str rather than a Buf[uint8] – Brad Gilbert b2gills – 2016-06-26T13:21:58.977

5

Ruby, 33 bytes

Barely edges out Python, yay! Try it online.

->s{s.chars.map{|c|c.bytes.size}}

Value Ink

Posted 2016-06-23T13:43:23.890

Reputation: 10 608

5

Python 3, 82 bytes

import math
lambda x:[ord(i)<128and 1or int((math.log2(ord(i))-1)//5+1)for i in x]

This is much longer than the other Python answer, and the majority of the other answers, but uses an approach involving logarithms that I haven't yet seen.

An anonymous function that takes input, via argument, as a string and returns a list.

Try it on Ideone

How it works

This method relies on the way in which UTF-8 encodes the code-point of a character. If the code-point is less than 128, the character is encoded as in ASCII:

0xxxxxxx

where x represents the bits of the code point. However, for code-points greater than or equal to 128, the first byte is padded with the same number of 1s as the total number of bytes, and subsequent bytes begin 10. The bits of the code-point are then entered to give the shortest possible multibyte sequence, and any remaining bits become 0.

No. of bytes  Format
1             0xxxxxxx
2             110xxxxx 10xxxxxx
3             1110xxxx 10xxxxxx 10xxxxxx
4             11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
...           ...

and so forth.

It can now be noticed that for each number of bytes n, the upper limit for the number of code-point bits is given by (-n+7)+6(n-1) = 5n+1. Hence, the upper limit code-point c for each n is given, in decimal, by c= 2^(5n+1). Rearranging this gives n = (log2(c)-1)/5. So for any code-point, the number of bytes can be found by evaluating the above expression, and then taking the ceiling.

However, this does not work for code points in the range 64 <= c <= 127, since the lack of a padding 1 due to the ASCII-like encoding for 1 byte characters means that the wrong upper limit is predicted, and log2 is undefined for c = 0, which happens if a null byte is present in the input. Therefore, if c <= 127, a value of 1 is returned for n.

This is exactly what the code is doing; for each character i in the string x, the code-point is found using the ord function, and the ceiling of the expression is found by using integer rather than float division by 5 and then adding 1. Since Python's float type always represents integers as x.0, even after integer division, the result is passed to the int function to remove the trailing zero. If ord(i) <= 127, logical short-circuiting means that 1 is instead returned. The number of bytes for each character is stored as an element in a list, and this list is returned.
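
An ungolfed sketch of the same calculation, with the boundary values checked against Python's own encoder. The function name is illustrative; the formula is the one derived above.

```python
import math


def utf8_len_by_log(code_point: int) -> int:
    """Byte count from n = floor((log2(c) - 1) / 5) + 1, with ASCII special-cased."""
    if code_point < 128:
        return 1
    return int((math.log2(code_point) - 1) // 5) + 1


# Compare against the built-in encoder at the edges of each length class.
for cp in (0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    assert utf8_len_by_log(cp) == len(chr(cp).encode("utf-8"))

print([utf8_len_by_log(ord(ch)) for ch in "t\u0283a\u028a"])   # [1, 2, 1, 2]
```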

TheBikingViking

Posted 2016-06-23T13:43:23.890

Reputation: 3 674

3

Bash, 74 bytes

Golfed

xxd -p|fold -2|cut -c1|tr -d '89ab'|echo `tr -t '01234567cbef' '[1*]2234'`

Algorithm

hexdump input string, fold 2 chars per line, cut the first char only

echo -ne '!±≡' | xxd -p|fold -2|cut -c1

2
c
b
e
8
a
f
a
b
b

(the 4 high-order bits of each input byte as a hex char, one per line)

Remove "continuation bytes" 0x80..0xBF

tr -d '89ab'

2
c

e


f

(what is left is the high 4 bits of the first byte of each Unicode char)

map the first bits into the char length, collapse the output and print

echo `tr -t '01234567cbef' '[1*]2234'`

1 2 3 4
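
The three steps above can be sketched in Python for comparison. The lookup table is an assumption based on how UTF-8 lead bytes are laid out (high nibbles c and d for two-byte sequences, e for three, f for four) rather than a transcription of the tr sets, and the function name is hypothetical.

```python
def counts_from_high_nibbles(data: bytes) -> list:
    lengths = {"c": 2, "d": 2, "e": 3, "f": 4}   # lead-byte nibble -> sequence length
    counts = []
    for byte in data:
        nibble = format(byte >> 4, "x")          # first hex digit of the byte
        if nibble in "89ab":                     # continuation bytes 0x80..0xBF
            continue
        counts.append(lengths.get(nibble, 1))    # nibbles 0..7 are single-byte characters
    return counts


print(counts_from_high_nibbles("\u30c1\u30e3\u30aa".encode("utf-8")))   # [3, 3, 3]
```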

Test

 U() { xxd -p|fold -2|cut -c1|tr -d '89ab'|echo `tr -t '01234567cbef' '[1*]2234'`;}

 echo -ne '!' | U 
 1

 echo -ne 'Ciao' | U
 1 1 1 1

 echo -ne 'tʃaʊ' | U
 1 2 1 2

 echo -ne 'Adám' | U
 1 1 2 1

 echo -ne 'ĉaŭ' | U
 2 1 2

 echo -ne 'ĉaŭ' | U
 1 2 1 1 2

 echo -ne 'チャオ' | U
 3 3 3
 echo -ne '!±≡' | U
 1 2 3 4

 echo -ne "\x0" | U
 1

 echo -ne '' | U

zeppelin

Posted 2016-06-23T13:43:23.890

Reputation: 7 884

+1 Nice approach. You actually read the result directly from the input. – Adám – 2016-11-17T14:09:43.597

The -t option to tr was unfamiliar to me, and is apparently a GNU extension. Piping to the command substitution after echo might also be worth a slightly more detailed explanation. – tripleee – 2018-05-02T10:30:00.667

3

Julia, 34 bytes

s->s>""?map(sizeof,split(s,"")):[]

This is an anonymous function that accepts a string and returns an integer array. To call it, assign it to a variable.

The approach is quite straightforward: If the input is empty, the output is empty. Otherwise we map the sizeof function, which counts the number of bytes in a string, to each one-character substring.

Try it online! (includes all test cases)

Alex A.

Posted 2016-06-23T13:43:23.890

Reputation: 23 761

s->[sizeof("$c")for c=s] saves a few bytes. – Dennis – 2016-06-24T00:33:51.000

Odd; does split("","") not return []? (JavaScript's "".split("") does.) – Neil – 2016-06-24T07:52:08.157

@Neil split("","") appears to give "" (unlike in Python which gives an exception) but I don't know anything about the compatibility of [] and "" in julia. – cat – 2016-06-24T11:26:50.210

@Neil No, split("", "") == [""], i.e. a one-element array containing an empty string, but the issue is that sizeof("") == 0, which the OP said is not allowed. – Alex A. – 2016-06-24T20:45:05.233

@Dennis That will fail for non-indexable strings. (Can't think of an example offhand though.) – Alex A. – 2016-06-24T20:46:04.733

Is that really a problem? for c=s doesn't iterate over indices, but over characters. – Dennis – 2016-06-25T01:59:19.530

3

JavaScript (Node), 27 bytes

s=>s.map(Buffer.byteLength)

This takes input as an array of individual characters, and returns an array of byte counts.

Buffer is a method of representing raw binary data. Buffer.byteLength(string) gives the number of bytes in the string. UTF-8 is the default encoding. Note that only Node.js has buffers, not browser JS. The rough browser equivalent is called Blob, which comes in at 31 bytes:

s=>s.map(e=>new Blob([e]).size)

Test

Save this file and run it through node, or try it online.

var f =
  s=>s.map(Buffer.byteLength)

var tests = [
  ["!"],
  ["C","i","a","o"],
  ["t","ʃ","a","ʊ"],
  ["A","d","á","m"],
  ["ĉ","a","ŭ"],
  ["c","̂","a","u","̆"],
  ["チ","ャ","オ"],
  [],
  ["!","±","≡",""]
];

tests.forEach(test => {
  console.log(test, f(test));
});

This should be the result:

$ node bytes.js
[ '!' ] [ 1 ]
[ 'C', 'i', 'a', 'o' ] [ 1, 1, 1, 1 ]
[ 't', 'ʃ', 'a', 'ʊ' ] [ 1, 2, 1, 2 ]
[ 'A', 'd', 'á', 'm' ] [ 1, 1, 2, 1 ]
[ 'ĉ', 'a', 'ŭ' ] [ 2, 1, 2 ]
[ 'c', '̂', 'a', 'u', '̆' ] [ 1, 2, 1, 1, 2 ]
[ 'チ', 'ャ', 'オ' ] [ 3, 3, 3 ]
[] []
[ '!', '±', '≡', '�' ] [ 1, 2, 3, 4 ]

NinjaBearMonkey

Posted 2016-06-23T13:43:23.890

Reputation: 9 925

3

PHP, 92 57 bytes

On second thought you can do this with much less faffing around:

<?php for(;$a=strlen(mb_substr($argv[1],$i++,1));)echo$a;

Try it online. Note that this is slightly longer, as it uses stdin rather than a program argument.
This version requires you to ignore notices sent to stderr, but that's fine.

old version:
Uses a rather different approach to the other php answer. Relies on the lack of native support for multi-byte strings in php.

<?php for($l=strlen($a=$argv[1]);$a=mb_substr($a,1);$l=$v)echo$l-($v=strlen($a));echo$l?:'';

user55641

Posted 2016-06-23T13:43:23.890

Reputation: 171

Nice answer! I think you can drop the opening tag entirely, or change it to <?= – cat – 2016-06-24T11:33:10.213

Without the tag it's a code snippet rather than a program, and even if that's allowed it makes me feel vaguely dirty. With the alternate tag you get a parse error (or at least I did on php 5.5 which is what I'm used to). – user55641 – 2016-06-24T14:21:43.843

Okay :) I don't know PHP (nor do I want to, cough) but I'll point you here: https://codegolf.stackexchange.com/questions/2913 – cat – 2016-06-24T14:27:25.537

3

Emacs Lisp, 55 49 bytes

(lambda(s)(mapcar'string-bytes(mapcar'string s)))

First dissects the string into a list of characters with (mapcar 'string s). The string function in Emacs Lisp takes a list of characters and builds a string out of them. Due to the way Emacs splits strings with mapcar (i.e. into a list of integers, not characters or strings), this explicit conversion is needed. Then maps the string-bytes function onto that list of strings.

Example:

(mapcar 'string "abc") ; => ("a" "b" "c")
(mapcar 'string-bytes '("a" "b" "c")) ; => (1 1 1) 

Testcases:

(mapcar
 (lambda(s)(mapcar'string-bytes(mapcar'string s)))
 '("!""Ciao""tʃaʊ""Adám""ĉaŭ""ĉaŭ""チャオ""""!±≡""\0"))
;; ((1) (1 1 1 1) (1 2 1 2) (1 1 2 1) (2 1 2) (1 2 1 1 2) (3 3 3) nil (1 2 3 4) (1))

Old answer:

(lambda(s)(mapcar(lambda(s)(string-bytes(string s)))s))

Ungolfed:

 (lambda (s)
   (mapcar
    ;; we can't use string-bytes directly,
    ;; since Emacs mapcar yields a list of ints instead of characters
    ;; therefore we need a wrapper function here. 
    (lambda (s)
      (string-bytes (string s)))
    s))

Testcases:

(mapcar
 (lambda(s)(mapcar(lambda(s)(string-bytes(string s)))s))
 '("!""Ciao""tʃaʊ""Adám""ĉaŭ""ĉaŭ""チャオ""""!±≡""\0"))
;; ((1) (1 1 1 1) (1 2 1 2) (1 1 2 1) (2 1 2) (1 2 1 1 2) (3 3 3) nil (1 2 3 4) (1))

Lord Yuuma

Posted 2016-06-23T13:43:23.890

Reputation: 587

What happens to the nil if you flatten the result? – Adám – 2016-06-27T19:07:58.053

@Adám nil is an empty list (and the only way to say "false" in Emacs). While there is no standard flatten in Emacs (you can use dash's -flatten) any possible implementation would eliminate it. – Lord Yuuma – 2016-06-28T12:01:12.950

2

Haskell, 85 bytes

import Data.ByteString as B
import Data.ByteString.UTF8
(B.length.fromString.pure<$>)

Angs

Posted 2016-06-23T13:43:23.890

Reputation: 4 825

A little late, but this would be shorter as map$... – H.PWiz – 2018-01-25T12:24:48.703

2

PHP, 126 bytes

<?php $s=fgets(STDIN);echo $s!=''?implode(' ',array_map(function($x){return strlen($x);},preg_split('/(?<!^)(?!$)/u',$s))):'';

Try it online!

Michał Perłakowski

Posted 2016-06-23T13:43:23.890

Reputation: 520

You can start your code with <?=($s=fgets(STDIN))? – Marco – 2016-06-23T18:28:36.810

2

C#, 89 82 bytes

I=>{var J="";foreach(char c in I){J+=Encoding.UTF8.GetByteCount(c+"");}return J;};

A simple C# lambda that iterates through the string and returns the space separated list.

Edit: saved 6 bytes thanks to some very nice comments.

AstroDan

Posted 2016-06-23T13:43:23.890

Reputation: 171

pretty sure you can do var J="";... – cat – 2016-06-24T11:29:32.787

Also, the OP states in a comment that you do not need to space-separate the output so 1121 and 1 2 1 2 are both OK – cat – 2016-06-24T11:30:22.070

@cat Thanks, saved me 6 bytes – AstroDan – 2016-06-24T12:50:09.623

Also, you have an extra space in } return J;}; – cat – 2016-06-24T12:51:24.250

Seems like you need to using System.Text or thereabouts -- imports are not free. – cat – 2016-06-24T20:38:35.713

1

Rust, 53 bytes

|s:&str|for c in s.chars(){print!("{}",c.len_utf8())}

Rust has utf-8 char primitives, iterators, and lambdas, so this was straightforward. Test code:

fn main() {
    let s = "Löwe 老虎 Léopard";
    let f =|s:&str|for c in s.chars(){print!("{}",c.len_utf8())};
    f(s);
}

Outputs

1211133112111114444 

Harald Korneliussen

Posted 2016-06-23T13:43:23.890

Reputation: 430

1

jq, 26 characters

(23 characters code + 3 characters command line option)

(./"")[]|utf8bytelength

Hopefully competing. Although utf8bytelength was added 9++ months before this question, it is still not included in released version.

Sample run:

bash-4.3$ ./jq -R '(./"")[]|utf8bytelength' <<< 'tʃaʊ'
1
2
1
2

bash-4.3$ ./jq -R '(./"")[]|utf8bytelength' <<< 'ĉaŭ '
1
2
1
1
2
1

bash-4.3$ ./jq -R '(./"")[]|utf8bytelength' <<< 'チャオ'
3
3
3

bash-4.3$ ./jq -R '(./"")[]|utf8bytelength' <<< ''

bash-4.3$ ./jq -R '(./"")[]|utf8bytelength' <<< '!±≡'
1
2
3
4

manatwork

Posted 2016-06-23T13:43:23.890

Reputation: 17 865

1

C (gcc), 53 bytes

k=49;f(char*s){*++s/64-2?k=puts(&k)+47:++k;*s&&f(s);}

Try it online!

l4m2

Posted 2016-06-23T13:43:23.890

Reputation: 5 985

1

SmileBASIC, 69 bytes

DEF C B
WHILE I<LEN(B)Q=INSTR(BIN$(B[I],8),"0")I=I+Q+!Q?Q+!Q
WEND
END

Input is an array of bytes.

The number of bytes in a UTF-8 character is equal to the number of leading 1 bits in the first byte (unless there are no 1s, in which case the character is 1 byte). To find the number of leading 1s, the program finds the first 0 in the binary representation, then adds 1 if this was 0.

0xxxxxxx - no leading ones, 1 byte
110xxxxx 10xxxxxx - 2 leading ones, 2 bytes
1110xxxx 10xxxxxx 10xxxxxx - 3 leading ones, 3 bytes
etc.
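
A Python sketch of the same walk over a byte array: read the number of leading 1 bits from the lead byte, report it (or 1 if there are none), and skip that many bytes. It assumes well-formed UTF-8, and the function name is illustrative.

```python
def counts_by_leading_ones(data: bytes) -> list:
    counts = []
    i = 0
    while i < len(data):
        # Index of the first 0 bit equals the number of leading 1 bits.
        leading_ones = format(data[i], "08b").find("0")
        length = max(leading_ones, 1)   # 0 leading ones still means a 1-byte character
        counts.append(length)
        i += length                     # jump over the continuation bytes
    return counts


print(counts_by_leading_ones("\u0109a\u016d".encode("utf-8")))   # [2, 1, 2]
```

Unlike the approaches that inspect every byte, this one never looks at continuation bytes at all, so it depends on each lead byte being truthful about its length.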

12Me21

Posted 2016-06-23T13:43:23.890

Reputation: 6 110

1

05AB1E, 15 bytes

ÇεDžy‹i1ë.²<5÷>

Try it online.
Header ε is used to for-each over all the test cases;
Footer ï]J]» to pretty-print the output character-lists (ï: decimals and characters to integers; ]: close if-else and for-each; J: Join digits together; }: close header foreach; »: Join by new-lines).

Explanation:

Ç                   # Convert each character to its unicode value
 εD                 # Foreach over this list
      i             #  If the current item
     ‹              #  is smaller than
   žy               #  128
       1            #   Use 1
        ë           #  Else
         .²         #   Use log_2
           <        #   minus 1
            5÷      #   integer-divided by 5
              >     #   plus 1

Since 05AB1E doesn't have any builtins to convert characters to amount of bytes used, I use Ç to convert the characters to their unicode values, and in a for-each do the following in pseudo-code:

if(unicodeValue < 128)
  return 1
else
  return (log_2(unicodeValue)-1)//5+1    # (where // is integer-division)

Inspired by @TheBikingViking's Python 3 answer.

Kevin Cruijssen

Posted 2016-06-23T13:43:23.890

Reputation: 67 575

1

Pyth, 17 bytes

mhxS+11+16,7lCdlC

Try it online!

Use the code-point of the characters with some arithmetics.

Leaky Nun

Posted 2016-06-23T13:43:23.890

Reputation: 45 011

4

There is a shorter answer already. – Erik the Outgolfer – 2016-06-23T17:44:42.327

1

Factor, 57 87 82 80 bytes

[ [ dup zero? [ drop "1"] [ >bin length 4 /i 10 >base ] if ] { } map-as ""join ]

Explained:

USING: kernel math math.parser sequences ;
IN: byte-counts

: string>byte-counts ( str -- counts )
  [                  ! new quotation: takes a char as a fixnum
    dup zero?        ! true if this is a NUL byte
    [ drop "1" ]     ! NUL bytes have length 1
    [ >bin           ! else, convert to binary string
      length         ! length of binary string
      4              ! the constant 4
      /i             ! integer division
      number>string  ! 4 -> "4"
    ] if             ! conditionally execute one of the previous quotations
  ]                  ! end
  { } map-as         ! map and clone-like an { } array
  "" join ;          ! join array of 1strings on empty string

Unit tests:

USING: tools.test byte-counts ;
IN: byte-counts.tests

{ "1" } [ "!" string>byte-counts ] unit-test
{ "1111" } [ "Ciao" string>byte-counts ] unit-test
{ "1212"} [ "tʃaʊ" string>byte-counts ] unit-test
{ "1121" } [ "Adám" string>byte-counts ] unit-test
{ "212" } [ "ĉaŭ" string>byte-counts ] unit-test
{ "12112" } [ "ĉaŭ" string>byte-counts ] unit-test
{ "333" } [ "チャオ" string>byte-counts ] unit-test
{ "" } [ "" string>byte-counts ] unit-test
{ "1234" } [ "!±≡" string>byte-counts ] unit-test
{ "1" } [ "\0" string>byte-counts ] unit-test

They all pass, now. c:

cat

Posted 2016-06-23T13:43:23.890

Reputation: 4 989

1

C, 85 bytes.

l(unsigned char* c){while(*c){int d=(*c>>4)-11;
d=d<0?1:d+(d==1);putchar(48+d);c+=d;}}

Examines the high 4 bits of each byte to determine the encoding and the number of subsequent bytes to skip.

AShelly

Posted 2016-06-23T13:43:23.890

Reputation: 4 281

Does this work on null bytes? – cat – 2016-06-24T01:42:15.047

Yes, the while *c exits on an empty string, and the 'c+=d' skips nulls in the middle of a multi byte codepoint. – AShelly – 2016-06-24T01:45:06.560

That's incorrect. The end of a string (char*, really) in C is marked with a null byte. It is impossible to distinguish null bytes from the actual end of the string. – Dennis – 2016-06-24T02:04:07.457

@Dennis Precisely because there is no difference :) – cat – 2016-06-24T02:40:07.400

Right, but the OP said that null bytes must be supported. – Dennis – 2016-06-24T02:41:28.633

ug. That was added after I wrote my answer. I honestly don't see a way to do that while still handling c-style strings. – AShelly – 2016-06-24T07:11:47.237

You really don't need the unsigned type specifier in there, regardless of whether chars are signed on a given platform or not -- modern platforms shouldn't have signed chars and anyways you can just say only works if your platform has unsigned chars – cat – 2016-06-24T11:36:33.733

The OP stated in a comment (and it's now in the post) that you can request the length of the string in bytes as an argument, so do that and this will be valid again – cat – 2016-06-24T15:22:10.623

1

F#, 59 54 66 bytes

(s)=seq{for c in s->System.Text.Encoding.UTF8.GetByteCount([|c|])}

Technically, s is a char sequence, but it turns out there's an implicit conversion that allows a string to be passed in.

When testing this in the console with !±≡𩸽, it splits the kanji into two characters, each 3 bytes long. All the other test cases work fine.

Edit: It turns out common namespace imports are not implicit. Up another 12 chars.

sealed interface

Posted 2016-06-23T13:43:23.890

Reputation: 21

1) Timmy D's PowerShell answer has the same 6-bytes-per-kanji problem. I would attribute it to Windows being dumb and useless at Unicode. 2) If you get 6 bytes for the kanji when reading from a file encoded with UTF-8 without BOM then this is wrong and should be fixed. 3) Seems like F# needs statements like let f(x)= ... to end in ;;, like SML. 4) You can leave off assigning this anonymous function a name, i.e. (s)=seq{for c in s->Encoding.UTF8.GetByteCount([|c|])}. – cat – 2016-06-24T11:09:17.157

Also, I get error FS0039: The namespace or module 'Encoding' is not defined when trying to run this. What am I doing wrong? – cat – 2016-06-24T11:21:16.637

Also, welcome to Programming Puzzles and Code Golf, this is a nice first answer! :D – cat – 2016-06-24T11:21:55.577

@cat You need to open the System.Text namespace. I'm assuming namespace opens and entry code are included, coming from AstroDan's C# answer. – sealed interface – 2016-06-24T20:35:18.540

You need to count the bytes of any import, #include, open, load, require, using, USING: etc here on PPCG. AstroDan's C# answer is similarly erroneous, and I notified them of that. – cat – 2016-06-24T20:40:51.910

1

Swift 2.2, 67 52 50 bytes

for c in i.characters{print(String(c).utf8.count)}

Horribly ugly. There's no way to get the UTF-8 length of a Character in Swift, so I need to iterate through the string by character, convert the Character to a String, and find the count of that single-character String (hey, at least there's a built-in method to do that). Looking for optimizations, possibly using a scanner.

Revision 1: Saved 15 bytes by using count instead of underestimateCount().

Revision 2: Saved another 2 characters by using a for-in loop instead of a for each closure.

JAL

Posted 2016-06-23T13:43:23.890

Reputation: 304

0

Zsh, 41 bytes

for c (${(s::)1})set +o multibyte&&<<<$#c

Try it online!

Zsh is UTF-8 aware, so we split the string on characters, then disable multibyte and print each character's length.

GammaFunction

Posted 2016-06-23T13:43:23.890

Reputation: 2 838