Count the bytes of a program

21

4

Note 2: I accepted @DigitalTrauma's 6-byte answer. If anyone can beat that I will change the accepted answer. Thanks for playing!

Note: I will be accepting an answer at 6:00pm MST on 10/14/15. Thanks to all that participated!

I am very surprised that this has not been asked yet (or I didn't search hard enough). Either way, this challenge is very simple:

Input: A program in the form of a string. Additionally, the input may or may not contain:

  • Leading and trailing spaces
  • Trailing newlines
  • Non-ASCII characters

Output: Two integers, one representing the UTF-8 character count and one representing the byte count, in either order. Trailing newlines are allowed. Output can be to STDOUT or returned from a function. It can be in any format as long as the two numbers are distinguishable from each other (2327 is not valid output).

Notes:

  • You may consider newline as \n or \r\n.
  • Here is a nice byte & character counter for your tests. Also, here is a meta post with the same thing (Thanks to @Zereges).

Sample I/O: (All outputs are in the form {characters} {bytes})

Input: void p(int n){System.out.print(n+5);}

Output: 37 37

Input: (~R∊R∘.×R)/R←1↓ιR

Output: 17 27

Input:


friends = ['john', 'pat', 'gary', 'michael']
for i, name in enumerate(friends):
    print "iteration {iteration} is {name}".format(iteration=i, name=name)

Output: 156 156

This is code golf - shortest code in bytes wins!

Leaderboards

Here is a Stack Snippet to generate both a regular leaderboard and an overview of winners by language.

To make sure that your answer shows up, please start your answer with a headline, using the following Markdown template:

# Language Name, N bytes

where N is the size of your submission. If you improve your score, you can keep old scores in the headline, by striking them through. For instance:

# Ruby, <s>104</s> <s>101</s> 96 bytes

If you want to include multiple numbers in your header (e.g. because your score is the sum of two files or you want to list interpreter flag penalties separately), make sure that the actual score is the last number in the header:

# Perl, 43 + 2 (-p flag) = 45 bytes

You can also make the language name a link which will then show up in the leaderboard snippet:

# [><>](http://esolangs.org/wiki/Fish), 121 bytes

var QUESTION_ID=60733,OVERRIDE_USER=36670;function answersUrl(e){return"http://api.stackexchange.com/2.2/questions/"+QUESTION_ID+"/answers?page="+e+"&pagesize=100&order=desc&sort=creation&site=codegolf&filter="+ANSWER_FILTER}function commentUrl(e,s){return"http://api.stackexchange.com/2.2/answers/"+s.join(";")+"/comments?page="+e+"&pagesize=100&order=desc&sort=creation&site=codegolf&filter="+COMMENT_FILTER}function getAnswers(){jQuery.ajax({url:answersUrl(answer_page++),method:"get",dataType:"jsonp",crossDomain:!0,success:function(e){answers.push.apply(answers,e.items),answers_hash=[],answer_ids=[],e.items.forEach(function(e){e.comments=[];var s=+e.share_link.match(/\d+/);answer_ids.push(s),answers_hash[s]=e}),e.has_more||(more_answers=!1),comment_page=1,getComments()}})}function getComments(){jQuery.ajax({url:commentUrl(comment_page++,answer_ids),method:"get",dataType:"jsonp",crossDomain:!0,success:function(e){e.items.forEach(function(e){e.owner.user_id===OVERRIDE_USER&&answers_hash[e.post_id].comments.push(e)}),e.has_more?getComments():more_answers?getAnswers():process()}})}function getAuthorName(e){return e.owner.display_name}function process(){var e=[];answers.forEach(function(s){var r=s.body;s.comments.forEach(function(e){OVERRIDE_REG.test(e.body)&&(r="<h1>"+e.body.replace(OVERRIDE_REG,"")+"</h1>")});var a=r.match(SCORE_REG);a&&e.push({user:getAuthorName(s),size:+a[2],language:a[1],link:s.share_link})}),e.sort(function(e,s){var r=e.size,a=s.size;return r-a});var s={},r=1,a=null,n=1;e.forEach(function(e){e.size!=a&&(n=r),a=e.size,++r;var t=jQuery("#answer-template").html();t=t.replace("{{PLACE}}",n+".").replace("{{NAME}}",e.user).replace("{{LANGUAGE}}",e.language).replace("{{SIZE}}",e.size).replace("{{LINK}}",e.link),t=jQuery(t),jQuery("#answers").append(t);var o=e.language;/<a/.test(o)&&(o=jQuery(o).text()),s[o]=s[o]||{lang:e.language,user:e.user,size:e.size,link:e.link}});var t=[];for(var o in s)s.hasOwnProperty(o)&&t.push(s[o]);t.sort(function(e,s){return 
e.lang>s.lang?1:e.lang<s.lang?-1:0});for(var c=0;c<t.length;++c){var i=jQuery("#language-template").html(),o=t[c];i=i.replace("{{LANGUAGE}}",o.lang).replace("{{NAME}}",o.user).replace("{{SIZE}}",o.size).replace("{{LINK}}",o.link),i=jQuery(i),jQuery("#languages").append(i)}}var ANSWER_FILTER="!t)IWYnsLAZle2tQ3KqrVveCRJfxcRLe",COMMENT_FILTER="!)Q2B_A2kjfAiU78X(md6BoYk",answers=[],answers_hash,answer_ids,answer_page=1,more_answers=!0,comment_page;getAnswers();var SCORE_REG=/<h\d>\s*([^\n,]*[^\s,]),.*?(\d+)(?=[^\n\d<>]*(?:<(?:s>[^\n<>]*<\/s>|[^\n<>]+>)[^\n\d<>]*)*<\/h\d>)/,OVERRIDE_REG=/^Override\s*header:\s*/i;
body{text-align:left!important}#answer-list,#language-list{padding:10px;width:290px;float:left}table thead{font-weight:700}table td{padding:5px}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script> <link rel="stylesheet" type="text/css" href="//cdn.sstatic.net/codegolf/all.css?v=83c949450c8b"> <div id="answer-list"> <h2>Leaderboard</h2> <table class="answer-list"> <thead> <tr><td></td><td>Author</td><td>Language</td><td>Size</td></tr></thead> <tbody id="answers"> </tbody> </table> </div><div id="language-list"> <h2>Winners by Language</h2> <table class="language-list"> <thead> <tr><td>Language</td><td>User</td><td>Score</td></tr></thead> <tbody id="languages"> </tbody> </table> </div><table style="display: none"> <tbody id="answer-template"> <tr><td>{{PLACE}}</td><td>{{NAME}}</td><td>{{LANGUAGE}}</td><td>{{SIZE}}</td><td><a href="{{LINK}}">Link</a></td></tr></tbody> </table> <table style="display: none"> <tbody id="language-template"> <tr><td>{{LANGUAGE}}</td><td>{{NAME}}</td><td>{{SIZE}}</td><td><a href="{{LINK}}">Link</a></td></tr></tbody> </table>

GamrCorps

Posted 2015-10-14T01:54:16.973

Reputation: 7 058

does the output have to be space-separated? – Maltysen – 2015-10-14T03:24:15.803

no, it can be in any format as long as the numbers are distinguishable from each other (2327 is not valid output) – GamrCorps – 2015-10-14T03:25:10.017

Aren't there some UTF-8 characters that depending on the interpretation can be split into two other characters that generate the same byte values? How do we count those then? – Patrick Roberts – 2015-10-14T03:51:33.740

Honestly, I do not know what you mean. Therefore, count as you wish. – GamrCorps – 2015-10-14T03:52:15.280

@GamrCorps UTF-8 characters include non-ASCII characters, which are basically characters that cannot be represented by one byte but must be represented by two or even four bytes. Depending on how the characters are read in by a program, it is up to the program to choose how to interpret the stream of bytes. For example, a 2 byte UTF-8 can be interpreted as 2 sequential ASCII characters each of which are represented by the two bytes making up the originally intended character. – Patrick Roberts – 2015-10-14T03:56:04.350

@PatrickRoberts I would say to use the higher value. But my final judgement would have to go to whatever https://mothereff.in/byte-counter says. Just put a questionable character in there and see what it reads as, and use that as the foundation. – GamrCorps – 2015-10-14T04:02:50.083

Some of the answers count the character `` as two characters due (presumably) to the use of UTF-16 and its surrogate pairs rather than UTF-8. (Note that the byte count will be the same either way.) To confirm, since you've specified UTF-8 specifically, that makes such answers invalid, correct? – Alex A. – 2016-02-05T07:18:02.580

@AlexA. Yes. If answers count characters based on a non-UTF-8 encoding, the answer would be invalid. – GamrCorps – 2016-02-05T13:08:22.553

Nitpick: There's no such thing as a UTF-8 character. UTF-8 is an encoding that permits us to store Unicode characters as byte sequences. You are asking for the character and byte count of a UTF-8 data stream. – Dennis – 2016-02-05T18:50:53.247
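To make the distinction in the last few comments concrete, here is a minimal Python sketch. The helper name `counts` and the sample character U+1D306 are assumptions chosen for illustration (any code point above U+FFFF behaves the same way); it is not taken from any answer.

```python
def counts(s):
    """Return (code points, UTF-16 code units, UTF-8 bytes) for a string."""
    code_points = len(s)                           # Python 3 strings are sequences of code points
    utf16_units = len(s.encode("utf-16-le")) // 2  # each UTF-16 code unit is 2 bytes
    utf8_bytes = len(s.encode("utf-8"))
    return code_points, utf16_units, utf8_bytes

# U+1D306 lies outside the Basic Multilingual Plane, so UTF-16 needs a surrogate pair.
print(counts("\U0001D306"))  # (1, 2, 4)
```

A language whose string length reports UTF-16 code units would give 2 for this input, while the challenge expects 1.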

Answers

32

Shell + coreutils, 6

This answer becomes invalid if an encoding other than UTF-8 is used.

wc -mc

Test output:

$ printf '%s' "(~R∊R∘.×R)/R←1↓ιR" | ./count.sh 
     17      27
$ 
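As a rough illustration of why the answer depends on the encoding, the following Python sketch mimics the two counts on raw bytes. The function name `wc_mc` is hypothetical, and this is only an approximation of what `wc -mc` reports, not its implementation.

```python
def wc_mc(data, encoding="utf-8"):
    """Return (character count, byte count) for raw input bytes."""
    return len(data.decode(encoding)), len(data)

sample = "(~R∊R∘.×R)/R←1↓ιR".encode("utf-8")

print(wc_mc(sample))             # (17, 27) when the bytes are read as UTF-8
print(wc_mc(sample, "latin-1"))  # (27, 27) when every byte is treated as one character
```

The byte count is the same in both cases; only the character count changes with the assumed encoding.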

In case the output format is strictly enforced (just one space separating the two integers), then we can do this:

Shell + coreutils, 12

echo`wc -mc`

Thanks to @immibis for suggesting to remove the space after the echo. It took me a while to figure that out - the shell will expand this to echo<tab>n<tab>m, and tabs by default are in $IFS, so are perfectly legal token separators in the resulting command.

Digital Trauma

Posted 2015-10-14T01:54:16.973

Reputation: 64 644

Definitely the right tool for the job. – Alex A. – 2015-10-14T03:02:31.040

Can you remove the space after "echo"? – user253751 – 2015-10-14T22:37:26.997

@immibis Yes - nice - I couldn't see how that worked right away. – Digital Trauma – 2015-10-14T23:37:00.810

21

GolfScript, <s>14</s> 12 bytes

.,p{64/2^},,

Try it online on Web GolfScript.

Idea

GolfScript doesn't have a clue what Unicode is; all strings (input, output, internal) are composed of bytes. While that can be pretty annoying, it's perfect for this challenge.

UTF-8 encodes ASCII and non-ASCII characters differently:

  • All code points below 128 are encoded as 0xxxxxxx.

  • All other code points are encoded as 11xxxxxx 10xxxxxx ... 10xxxxxx.

This means that the encoding of each Unicode character contains either a single 0xxxxxxx byte or a single 11xxxxxx byte (and 0 to 5 10xxxxxx bytes).

By dividing all bytes of the input by 64, we turn 0xxxxxxx into 0 or 1, 11xxxxxx into 3, and 10xxxxxx into 2. All that's left is to count the bytes whose quotient is not 2.

Code

                (implicit) Read all input and push it on the stack.
.               Push a copy of the input.
 ,              Compute its length (in bytes).
  p             Print the length.
   {     },     Filter; for each byte in the original input:
    64/           Divide the byte by 64.
       2^         XOR the quotient with 2.
                If the result is non-zero, keep the byte.
           ,    Count the kept bytes.
                (implicit) Print the integer on the stack.
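The same filter can be sketched in Python for readers who do not know GolfScript. This is an illustrative translation under the assumption that the input is valid UTF-8; the function name `count_utf8` is made up here.

```python
def count_utf8(data):
    """Return (byte count, character count) for UTF-8 encoded bytes."""
    # A byte b is a continuation byte (10xxxxxx) exactly when b // 64 == 2.
    # Every other byte starts a new character, so count those.
    chars = sum(1 for b in data if b // 64 != 2)
    return len(data), chars

print(count_utf8("(~R∊R∘.×R)/R←1↓ιR".encode("utf-8")))  # (27, 17)
```

Like the GolfScript program, it reports the byte count first and the character count second.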

Dennis

Posted 2015-10-14T01:54:16.973

Reputation: 196 637

9

Python, <s>42</s> 40 bytes

lambda i:[len(i),len(i.encode('utf-8'))]

Thanks to Alex A. for the two bytes off.

Straightforward: given the argument i, it returns the length of i in characters, followed by the length of i encoded as UTF-8. Note that in order to accept multiline input, the function argument should be surrounded by triple quotes: '''.

EDIT: It didn't work for multiline input, so I just made it a function instead.

Some test cases (separated by blank newlines):

f("Hello, World!")
13 13

f('''
friends = ['john', 'pat', 'gary', 'michael']
for i, name in enumerate(friends):
    print "iteration {iteration} is {name}".format(iteration=i, name=name)
''')
156 156

f("(~R∊R∘.×R)/R←1↓ιR")
17 27

The_Basset_Hound

Posted 2015-10-14T01:54:16.973

Reputation: 1 566

And here all this time I've been using just len() like a sucker. This is clearly superior. – Status – 2015-10-14T02:42:36.860

Since output can be returned from a function, you could save a few bytes by making this lambda i:[len(i),len(i.encode('utf-8'))]. – Alex A. – 2015-10-14T03:24:05.300

@AlexA. Alright, changing. Never touched lambda before. – The_Basset_Hound – 2015-10-14T10:55:23.607

Your lambda isn't formed quite correctly. If you give it a definition, it would be f=lambda i:[len(i),len(i.encode('utf-8'))], but since you're using an anonymous lambda function, it should just be lambda i:[len(i),len(i.encode('utf-8'))]. – Kade – 2015-10-14T13:17:18.763

You can save a few bytes with U8 instead of utf-8. – Mego – 2016-01-05T07:53:37.157

5

Julia, 24 bytes

s->(length(s),sizeof(s))

This creates a lambda function that returns a tuple of integers. The length function, when called on a string, returns the number of characters. The sizeof function returns the number of bytes in the input.

Try it online

Alex A.

Posted 2015-10-14T01:54:16.973

Reputation: 23 761

4

Rust, 42 bytes

let c=|a:&str|(a.chars().count(),a.len());

jus1in

Posted 2015-10-14T01:54:16.973

Reputation: 81

3

Pyth - <s>12</s> 9 bytes

Will try to get shorter.

lQh/l.BQ8

Test Suite.

Maltysen

Posted 2015-10-14T01:54:16.973

Reputation: 25 023

This gives a byte too much for the UTF-8 byte count. It's currently floor(… / 8) + 1, should be ceil(… / 8) – PurkkaKoodari – 2015-10-14T11:42:20.780

This helped me catch a bug in .B. Also, lQlc.BQ8 fixes the bug @Pietu1998 mentions while saving 1 byte, I think. – isaacg – 2016-01-05T10:01:04.013
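The off-by-one described in the first comment can be seen with plain integer arithmetic. In this Python sketch, `n` is a generic non-negative integer standing in for the quantity elided in that comment, and both helper names are hypothetical.

```python
def floor_plus_one(n):
    """floor(n / 8) + 1, the formula the comment says is currently used."""
    return n // 8 + 1

def ceil_div(n):
    """ceil(n / 8), the formula the comment says should be used."""
    return -(-n // 8)

for n in (15, 16, 17):
    print(n, floor_plus_one(n), ceil_div(n))
# 15 2 2
# 16 3 2
# 17 3 3
```

The two formulas agree except when n is an exact multiple of 8, where the first one is too large by one.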

3

Java, <s>241</s> <s>90</s> 89 bytes

int[]b(String s)throws Exception{return new int[]{s.length(),s.getBytes("utf8").length};}

SuperJedi224

Posted 2015-10-14T01:54:16.973

Reputation: 11 342

Love that you got Java to under 100 bytes. – GamrCorps – 2015-10-15T14:41:29.190

Well, it is just a method... – SuperJedi224 – 2015-10-15T19:12:18.570

You could change getBytes("UTF-8") to getBytes("utf8"). And why throws Exception? – RAnders00 – 2016-01-02T19:20:58.540

Because getBytes throws an UnsupportedEncodingException when you give it an invalid encoding name. – SuperJedi224 – 2016-01-02T20:03:05.537

2

PowerShell, 57 bytes

$args|%{$_.Length;[Text.Encoding]::UTF8.GetByteCount($_)}

Andrew

Posted 2015-10-14T01:54:16.973

Reputation: 271

2

R, 47 bytes

a<-commandArgs(TRUE);nchar(a,"c");nchar(a,"b")

Input: (~R∊R∘.×R)/R←1↓ιR

Output:

[1] 17
[2] 27

If printing line numbers alongside output isn't allowable under the "any format" then cat can fix the issue:

R, 52 bytes

a<-commandArgs(TRUE);cat(nchar(a,"c"),nchar(a,"b"))

Input: (~R∊R∘.×R)/R←1↓ιR

Output: 17 27

SnoringFrog

Posted 2015-10-14T01:54:16.973

Reputation: 1 709

As a function, 39 bytes: function(s)c(nchar(s,"c"),nchar(s,"b")) – Alex A. – 2016-02-05T06:52:22.293

Also just some general R golfing tips: You can use T in place of TRUE, = in place of <-, and input can come from scan, readline, or function, all of which are shorter than commandArgs. – Alex A. – 2016-02-05T06:56:13.147

2

C, <s>68</s> 67 bytes

b,c;main(t){for(;t=~getchar();b++)c+=2!=~t/64;printf("%d %d",c,b);}

This uses the same idea as my other answer.

Try it online on Ideone.

Dennis

Posted 2015-10-14T01:54:16.973

Reputation: 196 637

1

Milky Way 1.6.2, 7 bytes (non-competing)

':y!^P!

Explanation

'        ` read input from the command line
 :       ` duplicate the TOS
  y      ` push the length of the TOS
   !  !  ` output the TOS
    ^    ` pop the TOS
     P   ` push the length of the TOS in bytes

Usage

./mw <path-to-code> -i <input>

Zach Gates

Posted 2015-10-14T01:54:16.973

Reputation: 6 152

I marked this as non-competing since the challenge predates the language. – Mego – 2016-01-05T07:51:40.630

1

Perl 6, 33 bytes

$x=get;say $x.chars," ",$x.codes;

Based on this blog post at Perl6Advent.

cat

Posted 2015-10-14T01:54:16.973

Reputation: 4 989

1

Brainfuck, 163 bytes

,[>+<,]>[>>+>+<<<-]>>>[<<<+>>>-]<<+>[<->[>++++++++++<[->-[>+>>]>[+[-<+>]>+>>]<<<<<]>[-]++++++++[<++++++>-]>[<<+>>-]>[<<+>>-]<<]>]<[->>++++++++[<++++++>-]]<[.[-]<]<

With linebreaks for readability:

,[>+<,]>
[>>+>+<<<-]>>>[<<<+>>>-]<<+>[<->[
>++++++++++<[->-[>+>>]>[+[-<+>]>
+>>]<<<<<]>[-]++++++++[<++++++>-
]>[<<+>>-]>[<<+>>-]<<]>]<[->>+++++
+++[<++++++>-]]<[.[-]<]<

The most important part is the first line. This counts the number of characters inputted. The rest is just the long junk required to print a number greater than 9.

EDIT: Since BF cannot input/output anything but ASCII numbers from 1-255, there would be no way to measure the UTF-8 chars.

vasilescur

Posted 2015-10-14T01:54:16.973

Reputation: 341

This looks like it could be golfed more. But it probably can't. +1. – wizzwizz4 – 2016-03-23T17:47:55.757

0

beeswax, <s>99</s> 87 bytes

A more compact version, 12 bytes shorter than the first:

p~5~q")~4~p")~7~g?<
>)'qq>@PPq>@Pp>Ag'd@{
     >@PPPq  @dNp"?{gAV_
     >@PPPP>@>?b>N{;

The same program in an easier-to-follow hexagonal layout:

 p ~ 5 ~ q " ) ~ 4 ~ p " ) ~ 7 ~ g ? <
> ) ' q q > @ P P q > @ P p > A g ' d @ {
         > @ P P P q     @ d N p " ? { g A V _ 
        > @ P P P P > @ > ? b > N { ;

Output as characters, then bytecount, separated by a newline.

Example: the small letter s at the beginning of the line just tells the user that the program wants a string as input.

julia> beeswax("utf8bytecount.bswx")
s(~R∊R∘.×R)/R←1↓ιR
17
27
Program finished!

Empty string example:

julia> beeswax("utf8bytecount.bswx")
s
0
0
Program finished!

Beeswax pushes the characters of a string that’s entered at STDIN onto the global stack, coded as the values of their Unicode code points.

For easier understanding, here is the unwrapped version of the program above:

             >@{;    >@P@p >@PP@p>@P p
_VAg{?"pN>Ag"d?g~7~)"d~4~)"d~5~)"d@PPp
    ;{N< d?              <      < @PP<

For this example, the character α is entered at STDIN (code point U+03B1, decimal:945)

                                        gstack     lstack

_VA                                     [945,1]•   [0,0,0]•    enter string, push stack length on top of gstack
   g                                               [0,0,1]•    push gstack top value on top of local stack (lstack)
    {                                                          lstack 1st value to STDOUT (num. of characters)
     ?                                  [945]•                 pop gstack top value
      "                                                        skip next if lstack 1st >0
        N>                                                     print newline, redirect to right
          Ag                            [945,1]•   [0,0,1]•    push gstack length on top of gstack, push that value on lstack.
            "                                                  skip if lstack 1st > 0
              ?                         [945]•                 pop gstack top value
               g                                   [0,0,945]•  push gstack top value on lstack
                ~                                  [0,945,0]•  flip lstack 1st and 2nd
                 7                                 [0,945,7]•  lstack 1st=7
                  ~                                [0,7,945]•  flip lstack 1st and 2nd
                   )                               [0,7,7]•    lstack 1st = lstack 1st >>> 2nd  (LSR by 7)
                    "                                          skip next if top >0
                      ~4~)                         [0,0,0]•            flip,1st=4,flip,LSR by 4
                          "d                                   skip next if top >0... redirect to upper right
                           >@                                  redirect to right, flip lstack 1st and 3rd
                             PP@                   [2,0,0]•    increment lstack 1st twice, flip 1st and 3rd
                                p                              redirect to lower left
                                "                              (ignored instruction, not relevant)
         d?              <      <       []•                       redirect to left... pop gstack, redirect to upper right

         >Ag"d                          [0]•       [2,0,0]•    redir. right, push gstack length on gstack
                                                               push gstack top on lstack, skip next if lstack 1st > 0
                                                               redir. to upper right.
         >@                                        [0,0,2]•    redir right, flip lstack 1st/3rd
           {;                                                  output lstack 1st to STDOUT, terminate program

Basically, this program checks each codepoint value for the 1-byte, 2-byte, 3-byte and 4-byte codepoint limits.

If n is the codepoint value, then these limits for proper UTF-8 strings are:

codepoint 0...127         1-byte: n>>>7 = 0
          128...2047      2-byte: n>>>11= 0  → n>>>7>>>4
          2048...65535    3-byte: n>>>16= 0  → n>>>7>>>4>>>5
          65536...1114111 4-byte: the 3 byte check result is >0

You can find the numbers 7,4 and 5 for the shift instructions in the code above. If a check results in 0, the lstack counter is incremented appropriately to tally the number of bytes of the entered string. The @PP...@ constructs increment the byte counter. After each tally, the topmost Unicode point is popped from the gstack until it is empty. Then the byte count is output to STDOUT and the program terminated.
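The successive shifts by 7, 4 and 5 can be sketched in Python as follows. This is an illustration of the arithmetic only, not a translation of the beeswax control flow, and the names `utf8_len` and `byte_count` are assumptions.

```python
def utf8_len(cp):
    """Number of UTF-8 bytes needed for a valid Unicode code point."""
    if cp >> 7 == 0:            # below 128
        return 1
    if cp >> 7 >> 4 == 0:       # same as cp >> 11: below 2048
        return 2
    if cp >> 7 >> 4 >> 5 == 0:  # same as cp >> 16: below 65536
        return 3
    return 4                    # everything else up to 0x10FFFF

def byte_count(s):
    """Tally the UTF-8 byte length of a string from its code points."""
    return sum(utf8_len(ord(ch)) for ch in s)

print(utf8_len(945))                    # 2, the code point of α
print(byte_count("(~R∊R∘.×R)/R←1↓ιR"))  # 27
```

As in the original program, nothing here rejects invalid code points; the sketch assumes the input is a proper Unicode string.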

There are no checks for improper encoding like overlong ASCII encoding and illegal code points beyond 0x10FFFF, but I think that’s fine ;)

M L

Posted 2015-10-14T01:54:16.973

Reputation: 2 865

0

Swift 3, 37

{($0.characters.count,$0.utf8.count)} // where $0 is String

Usage

Test

{($0.characters.count,$0.utf8.count)}("Hello, world")

Apollonian

Posted 2015-10-14T01:54:16.973

Reputation: 61