Which language is shortest?

26

2

Create a program that finds the latest 50 challenges with the code-golf tag that have at least 20 answers. Then extract the scores for each language in each of the challenges. If there is more than one answer using the same language, count all scores. Thereafter, take the top 20 most common languages and output a list with the language names, the number of answers, the average byte counts and the median byte counts. The list should be sorted by number of answers, in descending order.

You must account for variations in capitalization (for instance: Matlab = MATLAB).

In languages with many different version numbers (e.g. Python), count them as unique languages, so: Python != Python 2 != Python 2.7 != Python 3.x

Example output (output format is optional):

cJam,       66,  12.4,  8.5
Pyth,       58,   15.2,  19
Ruby,       44,   19.2,  22.5
Python,     34,   29.3,  32
Python 2.7, 22,   31.2,  40
...
...
Java,       11,   115.5, 94.5
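The aggregation step can be sketched in Python as follows. This is only an illustration under stated assumptions: the answers are taken to be already extracted as (language, byte count) pairs, and `summarize` and `sample` are hypothetical names that do not come from the challenge or from any answer.

```python
from collections import defaultdict
from statistics import mean, median

def summarize(answers, top=20):
    """answers: iterable of (language, byte_count) pairs (assumed input shape)."""
    by_lang = defaultdict(list)
    for language, size in answers:
        # Fold case so that Matlab and MATLAB are counted together;
        # version suffixes are left alone, so Python != Python 2.
        by_lang[language.upper()].append(size)
    rows = [(lang, len(sizes), mean(sizes), median(sizes))
            for lang, sizes in by_lang.items()]
    rows.sort(key=lambda row: -row[1])  # most answers first
    return rows[:top]

sample = [("Pyth", 10), ("pyth", 20), ("Ruby", 30),
          ("Python 2", 50), ("Python", 40), ("PYTH", 60)]
for lang, count, avg, med in summarize(sample):
    print(f"{lang}, {count}, {avg:.1f}, {med}")
```

Every score is kept, so a language with several answers on the same challenge contributes all of them to its count, average and median.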

Header formats that must be supported:

  • Starts with # Language name, or #Language name
  • Ends with xx bytes, xx Bytes or just xx
  • There can be a lot of garbage between the first comma and the last number.
  • If the language name is a link ([Name](link)), it can be skipped

If the answer has another header format, you may choose to skip it (or include it if your code can handle it).

As an example, all of the below headers must be supported:

# Language Name, N bytes
# Ruby, <s>104</s> <s>101</s> 96 bytes 
# Perl, 43 + 2 (-p flag) = 45 Bytes
# MATLAB, 5
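One possible way to parse these headers is a single regular expression. The sketch below is an assumption about how the rules could be implemented, not a reference solution; `HEADER` and `parse_header` are hypothetical names. It skips headers whose language name is a link and headers with suffixes other than bytes, both of which the rules allow.

```python
import re

# "#" + optional space + language name (no comma, no "[") + first comma,
# then anything, then the last number, optionally followed by bytes/Bytes.
HEADER = re.compile(r"^#\s*([^,\[]+?)\s*,.*?(\d+)(?:\s*[Bb]ytes)?\s*$")

def parse_header(line):
    """Return (LANGUAGE, byte_count), or None if the header is not supported."""
    m = HEADER.match(line)
    if m is None:
        return None
    # Upper-case the name so that capitalization variants compare equal.
    return m.group(1).strip().upper(), int(m.group(2))

for header in ["# Ruby, <s>104</s> <s>101</s> 96 bytes ",
               "# Perl, 43 + 2 (-p flag) = 45 Bytes",
               "# MATLAB, 5",
               "#Python 2.7, 22 bytes"]:
    print(parse_header(header))
```

Because the number must be followed only by an optional bytes suffix and the end of the line, struck-through scores and intermediate sums such as 43 + 2 are ignored and only the final count is kept.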

Rules:

  • It's OK to use the API or just the website URL
    • The following can be subtracted from the byte count (nothing else), so there is no need to use a URL shortener (maximum 44 bytes):
      • https:// (or http://)
      • codegolf
      • .stackexchange.com
      • /questions
  • The program can take input. The input will be included in the byte count.
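The 44-byte maximum is simply the combined length of the four deductible URL fragments. A quick check, where `net_score` is a hypothetical helper used only for illustration:

```python
# The four fragments listed in the rules, using the longer https:// form.
deductible = ["https://", "codegolf", ".stackexchange.com", "/questions"]
max_deduction = sum(len(part) for part in deductible)
print(max_deduction)  # 8 + 8 + 18 + 10

def net_score(raw_bytes, deduction=max_deduction):
    """Score after subtracting the URL fragments that actually appear in the code."""
    return raw_bytes - deduction

print(net_score(821), net_score(934))
```

The two raw sizes in the last line are the ones reported by the answers to this challenge, which claim 777 and 890 bytes after the deduction.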

Other than that, standard rules apply.

Stewie Griffin

Posted 2015-10-27T20:28:33.450

Reputation: 43 471

I could tell you it's Pyth without having to do this challenge at all. – Alex A. – 2015-10-27T20:43:17.370

Is the " bytes" suffix common, let alone universal, enough to require it? – Sparr – 2015-10-27T20:54:15.177

@StewieGriffin I think Sparr is saying that, while it is common, it's not always used. – Celeo – 2015-10-27T21:00:07.603

As far as I can see, xx bytes is very common on recent challenges (at least since the leaderboard snippet was created). – Stewie Griffin – 2015-10-27T21:02:54.630

I've seen many cases where the user omits the comma, and even a few times, 103 <s>108</s> <s>110</s> bytes is used instead of left-to-right. Do we need to support these? – ETHproductions – 2015-10-27T21:08:04.770

@ETHproductions, 103 <s>108</s> doesn't have to be counted. I couldn't find a "rule" for cases where the comma was omitted because cases like Python 2 3 + 12 = 14 could be hard to handle. I didn't want to specify 100 different formats that must be supported, since that's just too cumbersome and there are a lot of corner cases (for instance score in parentheses, Pyth (5)). So no, you don't have to support headers without the comma (but you can if you want to). – Stewie Griffin – 2015-10-27T21:15:17.260

I usually use "chars" or "characters" instead of "bytes" – Doorknob – 2015-10-27T21:38:15.533

Define latest: Is it creation date? Last activity? – pppery – 2015-10-31T01:59:13.943

Creation date... – Stewie Griffin – 2015-10-31T09:48:59.580

Is it necessary to preserve the capitalization of the language name? If so, which one should I preserve? – pppery – 2015-10-31T14:17:06.910

@AlexA. You're wrong. Again. It's APL. :P – mbomb007 – 2015-11-03T22:05:05.190

Answers

13

R, 821 - 44 = 777 bytes

Updated results: please see the edit history to make sense of all the comments below.

           language num_answers avg_count median_count
1              RUBY          49  49.97959         30.0
2              CJAM          48  32.64583         22.0
3              PYTH          48  21.02083         14.0
4          PYTHON 2          46  86.78261         77.0
5             JULIA          43  58.90698         45.0
6           HASKELL          41  74.65854         56.0
7               PHP          40  73.52500         48.0
8              PERL          36  53.30556         34.0
9          PYTHON 3          34  90.91176         90.5
10       POWERSHELL          33  60.24242         44.0
11                C          32 221.84375         79.5
12                R          32  77.40625         62.5
13             JAVA          29 170.68966        158.0
14 JAVASCRIPT (ES6)          29  90.79310         83.0
15       JAVASCRIPT          28  68.39286         61.0
16               C#          25 193.92000        130.0
17      MATHEMATICA          23  56.04348         47.0
18           MATLAB          22  67.45455         55.0
19         TI-BASIC          19  47.05263         37.0
20              APL          18  16.55556         15.0

The code, which I could shorten a bit more:

W=library;W(XML);W(plyr)
X=xpathSApply;Y=xmlValue;D=data.frame;H=htmlParse;S=sprintf
Z="http://codegolf.stackexchange.com/"
R=function(FUN,...)do.call(rbind,Map(FUN,...))
G=function(url){d=H(url)
a=as.double(sub(".*?(\\d+)a.*","\\1",X(d,"//div[starts-with(@class,'status')]",Y)))
u=paste0(Z,X(d,"//*[contains(@class,'question-hyperlink')]",xmlGetAttr,"href"))
D(u,a)}
u=S("%s/questions/tagged/code-golf?page=%i",Z,1:50)
q=R(G,u)
u=with(q,head(u[a>20],50))
A=function(url){u=S("%s?page=%i",url,1:10)
f=function(u){d=H(u)
h=X(d, "//div[@class='post-text']//h1",Y)
p="^(.*?),.*? (\\d+)( [Bb]ytes)?$"
k=grep(p,h,v=T)
l=toupper(sub(p,"\\1",k))
c=as.double(sub(p,"\\2",k))
D(l,c)}
R(f,u)}
a=R(A,u)
L=names(tail(sort(table(a$l)),20))
x=subset(a,l%in%L)
arrange(ddply(x, "l",summarise,n=length(c),a=mean(c),m=quantile(c,0.5)),-n)

De-golfed:

library(XML)
library(plyr)

# Apply FUN to each argument and row-bind the resulting data frames.
LoopBind <- function(FUN, ...) do.call(rbind, Map(FUN, ...))

# One page of the question list -> data frame of question URL (u) and answer count (a).
GetQuestions <- function(url) {
  d <- htmlParse(url)
  a <- as.double(sub(".*?(\\d+)a.*", "\\1",
                     xpathSApply(d, "//div[starts-with(@class, 'status')]", xmlValue)))
  u <- paste0("http://codegolf.stackexchange.com/",
              xpathSApply(d, "//*[contains(@class, 'question-hyperlink')]", xmlGetAttr, "href"))
  data.frame(u, a)
}
u <- sprintf("http://codegolf.stackexchange.com/questions/tagged/code-golf?page=%i", 1:50)
q <- LoopBind(GetQuestions, u)
# Keep the first 50 questions with more than 20 answers (a > 20).
u <- with(q, head(u[a > 20], 50))

# All matching answer headers of one question -> data frame of language and byte count (c).
GetAnswers <- function(url) {
  u <- sprintf("%s?page=%i", url, 1:10)  # up to 10 pages of answers
  f <- function(u) {
    d <- htmlParse(u)
    h <- xpathSApply(d, "//div[@class='post-text']//h1", xmlValue)
    # Language name up to the first comma, then the last number, optionally followed by " bytes".
    p <- "^(.*?),.*? (\\d+)( [Bb]ytes)?$"
    k <- grep(p, h, value = TRUE)
    l <- toupper(sub(p, "\\1", k))
    c <- as.double(sub(p, "\\2", k))
    data.frame(language = l, c)
  }
  LoopBind(f, u)
}
a <- LoopBind(GetAnswers, u)
L <- names(tail(sort(table(a$language)), 20))  # the 20 most common languages
x <- subset(a, language %in% L)
arrange(ddply(x, "language", summarise, num_answers = length(c), avg_count = mean(c), median_count = quantile(c, 0.5)),
        -num_answers)

flodel

Posted 2015-10-27T20:28:33.450

Reputation: 2 345

How is the average length for C# over 6000 bytes? – SuperJedi224 – 2015-10-28T13:16:27.903

@SuperJedi224 - There might be some extremely long submissions that are skewing the average. That's why median is a useful statistic because it is resistant to outliers. – None – 2015-10-28T15:17:32.757

I read somewhere that C# is the least golfable language. Now I know why... – ev3commander – 2015-10-28T19:03:08.140

@ev3commander - C# pales in comparison to Unary... – Comintern – 2015-10-29T04:42:45.203

@Comintern: Eek... – ev3commander – 2015-10-29T19:06:17.293

Now golfed, with updated results. Apparently the C# outlier has dropped. I have stored the full results in a csv at https://drive.google.com/open?id=0BwaTHVmeEmT-RUNablBLZ0FyTDg if you want to have a look at the data, maybe to explain differences with @pppery. (I am not claiming my code is free of errors...) – flodel – 2015-11-01T18:18:17.613

Does your code handle questions with more than one page of answers? – pppery – 2015-11-02T20:59:56.147

Yes. Up to 10 pages. – flodel – 2015-11-03T00:18:36.007

JAVASCRIPT (ES6): 83.0, JAVASCRIPT: 61.0 Seriously? Well, I guess that's because we're more likely to use ES6 on challenges that need more golfing. – ETHproductions – 2015-11-04T15:28:01.210

7

Python 2, 934 - 44 (url stuff) = 890 bytes

Using the API:

from urllib2 import urlopen as u
from gzip import GzipFile as f
from StringIO import StringIO as s;x="https://api.stackexchange.com/2.2%s&site=codegolf"
import re;j=u(x%'/search/advanced?pagesize=50&order=desc&sort=creation&answers=20&tagged=code-golf');q=s(j.read());g=f(fileobj=q);true=1;false=0;l=';'.join(str(a['question_id'])for a in eval(g.read())['items']);w=[]
def r(p):
 j=u(x%('/questions/%s/answers?page=%s&filter=!9YdnSMlgz&pagesize=100'%(l,p)));g.seek(0);q.truncate();q.write(j.read());q.seek(0);k=eval(g.read());w.extend(a['body_markdown']for a in k['items'])
 if k['has_more']:r(p+1)
r(1);x={};s=sorted
for m in w:
 try:
  l,n=re.match("(.*?),.*?([0-9]+)[^0-9]*$",m.splitlines()[0]).groups();l=re.subn("# ?","",l,1)[0].upper()
  if l not in x:x[l]=[]
  x[l]+=[(l,int(n))]
 except:pass
for l in s(x,cmp,lambda a:len(x[a]),1)[:20]:
 v=s(x[l])
 print l,len(v),sum(map(lambda a:a[1],v))/len(v),v[len(v)/2][1]

Note that this code does not pay attention to the API throttling.
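For reference, the Stack Exchange API can include a `backoff` field (a number of seconds) in its response wrapper when a client should slow down. A minimal, ungolfed Python 3 sketch of a paging loop that honours it is shown below. It is not part of the golfed answer; `fetch_all` and `fetch_page` are hypothetical names, and `fetch_page(page)` is assumed to return the decoded JSON wrapper as a dict.

```python
import time

def fetch_all(fetch_page, sleep=time.sleep):
    """Collect items from every page, waiting whenever the API asks for it."""
    items, page = [], 1
    while True:
        reply = fetch_page(page)
        items.extend(reply.get("items", []))
        if not reply.get("has_more"):
            return items
        if "backoff" in reply:
            sleep(reply["backoff"])  # seconds to wait before the next request
        page += 1

# Offline demonstration with canned replies instead of real HTTP requests.
pages = {1: {"items": ["a", "b"], "has_more": True, "backoff": 2},
         2: {"items": ["c"], "has_more": False}}
waits = []
print(fetch_all(pages.__getitem__, sleep=waits.append), waits)
```

Passing the sleep function as a parameter keeps the loop testable without real delays.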

Output:

RUBY 60 430 32
PYTH 57 426 16
CJAM 56 35 23
C 52 170 76
PYTHON 2 51 88 79
JULIA 42 63 48
HASKELL 42 81 63
JAVASCRIPT (ES6) 41 96 83
PERL 40 44 27
PYTHON 3 37 91 89
PHP 36 98 59
JAVASCRIPT 36 743 65
POWERSHELL 35 86 44
JAVA 32 188 171
R 30 73 48
MATLAB 25 73 51
MATHEMATICA 24 57 47
APL 22 14 13
SCALA 21 204 59
TI-BASIC 21 42 24

pppery

Posted 2015-10-27T20:28:33.450

Reputation: 3 987

@StewieGriffin Interestingly, I had to add one extra slash to the second recursive query to qualify for the /questions reduction. – pppery – 2015-11-01T14:14:50.337

The differences are because @flodel disallows suffixes other than bytes, while mine will handle other suffixes like chars. – pppery – 2015-11-02T14:14:33.630

Is it possible that your code combines C, C# and possibly C++? It seems unlikely that there are 73 C-answers. – Stewie Griffin – 2015-11-02T20:57:22.833

No, I don't think so. I end the language name on the first comma. – pppery – 2015-11-02T20:58:48.987

Ok, but it is quite strange don't you think? Especially since it is such a big mismatch between your and flodel's answer. In his answer, there are 32 C-answers, 25 C# answers, and C++ is not listed. In your list, C# is not on the list, and there are 73 C-answers. – Stewie Griffin – 2015-11-02T21:03:05.657

I don't know which answer is most right though, might be yours. And I'm wondering where the extremely long Pyth and Ruby answers are =P There must be one answer in each language that completely messes up the average. – Stewie Griffin – 2015-11-02T21:08:18.473

Maybe C programmers use suffixes other than bytes a lot. – pppery – 2015-11-02T21:14:11.897

Might be, but still very unlikely that there are 73 C-answers on 50 questions...? – Stewie Griffin – 2015-11-02T21:25:55.343

Looks like l=re.sub("# ?|,","",l) is what replaces C# with C. – flodel – 2015-11-03T01:17:04.780

@flodel Now recognizes C and C# as separate. – pppery – 2015-11-03T21:08:05.223
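The C/C# merge discussed in these comments can be reproduced in a few lines. The first substitution is the one quoted in flodel's comment and the second is the one in the answer's current code; the variable names `header_name`, `old` and `new` are only for illustration.

```python
import re

header_name = "# C#"  # the text before the first comma of a header like "# C#, 120 bytes"

# Earlier version: removes every "#" (and any comma), so "C#" collapses into "C".
old = re.sub("# ?|,", "", header_name)

# Current version: removes only the first "# ", i.e. the Markdown heading marker.
new = re.subn("# ?", "", header_name, 1)[0]

print(repr(old), repr(new))
```

Limiting the substitution to one occurrence strips the heading marker while leaving a `#` that belongs to the language name intact.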