Guess the language

Introduction

This site is rapidly building up a huge dataset of code snippets, so let's do something with it!

Here's a data file. It contains 9,066 unique language+snippet pairs for 113 languages, all taken from this site. The format is tab-separated (language-TAB-snippet), with all newlines in the snippets replaced with <LF>, and all tabs replaced with 4 spaces. There are at least 5 snippets for each language.
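A minimal sketch of reading that format, assuming the file is UTF-8 text whose first line is a header row. The names `load_pairs` and `has_header` are hypothetical and not part of the challenge:

```python
def load_pairs(lines, has_header=True):
    """Yield (language, snippet) tuples from tab-separated lines.

    Assumption: each data line is "language<TAB>snippet", and the
    snippet has its newlines encoded as the literal text "<LF>".
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\r\n")
        # Skip the assumed header row and any blank line (e.g. at the end).
        if (has_header and i == 0) or not line:
            continue
        # Split on the first tab only; tabs inside snippets were replaced
        # with 4 spaces, so none should remain in the snippet field.
        language, snippet = line.split("\t", 1)
        yield language, snippet.replace("<LF>", "\n")
```

With a real file this could be called as `list(load_pairs(open(path, encoding="utf-8")))`, where `path` is wherever the data file was saved.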

[update: I've made a minor change to the data file to merge some Python & RegExp versions I missed before - the link above has been updated]

Challenge

Write a program or function which takes a code snippet and outputs the language it is written in (see below for details). The total size of your source + any data you require must be 300 bytes or less, and your program must output the correct language when given its own source code. Highest accuracy (most correct answers on the dataset above) wins.

Rules

  • The total size of your source code, resources, and any required compilation / runtime flags must not exceed 300 bytes.
  • Your answer will be tested against the dataset above; it will be given one of the "Snippet" values as input and its output will be compared against the "correct" output according to the dataset. This will be repeated for all entries in the dataset and the final number of correct answers is your score.
  • You can choose the input encoding - I'll assume UTF-8, so if you need another encoding specify it in your answer.
  • You don't need to use the <LF> replacement for newlines; if your entry expects to receive newlines as literal newlines (char 10), specify it in your answer.
  • Your entry must output the language it thinks the input snippet is written in. To avoid the need to compress lots of language strings, I'll allow mappings (if you want to output 3 for "Java", that's fine); just note the mappings in your answer.
  • You can only have 1 output mapping for each language (i.e. if 3 means "Java", you can't also have 4 meaning "Java").
  • When given its own source code, your program must produce the correct answer (must output the language it is written in).
  • You don't need to support all languages in the dataset, and you can support extra languages if you want to (e.g. if your entry isn't in one of the languages in the dataset).
  • Your program must be deterministic (providing the same input twice must produce the same output).
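The scoring procedure in the rules above can be sketched roughly as follows. This is an illustration rather than the actual test harness: `score`, `classify` and `mapping` are hypothetical names, and the comparison assumes the dataset's language names are matched exactly.

```python
def score(classify, pairs, mapping=None):
    """Return (hits, total) for a classifier over (language, snippet) pairs.

    `classify` takes a snippet and returns either a language name or a
    code that `mapping` translates to a language name.
    """
    mapping = mapping or {}
    # Rule: only one output mapping is allowed per language.
    assert len(set(mapping.values())) == len(mapping), "duplicate mapping"
    hits = total = 0
    for language, snippet in pairs:
        total += 1
        guess = classify(snippet)
        # Outputs without a mapping entry are compared as-is.
        if mapping.get(guess, guess) == language:
            hits += 1
    return hits, total
```

The self-identification rule is then one more call: `classify(own_source)` has to map to the language the entry is written in.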

Tie-Breaking

  • Ties will be decided by reducing the dataset until one entry wins. The dataset will be reduced by removing all snippets for the most popular language (i.e. ties are broken by accuracy on rarer languages). For example, if A and B score 70% on the full dataset, all Python snippets will be removed. If A and B now both score 60%, CJam will be removed. If A now scores 50% but B scores 55%, B is the winner.
  • If 100% accuracy is achieved, ties will be decided using a second (blind) dataset containing more samples for the same languages.
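The dataset-reduction tie-break can be sketched like this. It assumes both classifiers already return language names, and it picks whichever language `Counter.most_common` reports first when two languages are equally popular, which the rules do not specify. `break_tie` is a hypothetical name:

```python
from collections import Counter

def break_tie(classify_a, classify_b, pairs):
    """Return "A" or "B" for the winner, or None if the tie never breaks."""
    pairs = list(pairs)
    while pairs:
        hits_a = sum(classify_a(s) == lang for lang, s in pairs)
        hits_b = sum(classify_b(s) == lang for lang, s in pairs)
        if hits_a != hits_b:
            return "A" if hits_a > hits_b else "B"
        # Still tied: drop every snippet of the most popular remaining language.
        top = Counter(lang for lang, _ in pairs).most_common(1)[0][0]
        pairs = [(lang, s) for lang, s in pairs if lang != top]
    return None
```

Comparing raw hit counts is equivalent to comparing percentages here, since both entries are always scored on the same reduced dataset.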

Example 1

The Python script:

print("python")

This script successfully produces "python" when given its own source code, so it is valid. On the dataset, it scores 1008/9066 = 11.1%

Example 2

The JavaScript function:

function f(s){return /function/.test(s)?1:2}

With the mappings 1 → javascript, 2 → python. Again it successfully produces 1 ("javascript") for its own source, and on the dataset it scores 1092/9066 = 12.0%


Where did the data come from?

I created a SEDE query to pull samples from [code-golf] challenges on this site. From the resulting 10,000 answers, I used a hacked-together Python script to find the code & language name for each, then filtered out any language with fewer than 5 examples. The data isn't 100% clean (I know it pulled in some non-code snippets), but it should be good enough.


Inspired by this challenge from earlier in the year: Who said that? 2016 Presidential election

Also partly related to What's the Language?

Dave

Posted 2016-07-09T00:34:08.140

Reputation: 7 519

Seems incredibly challenging to do in 300 bytes. Perhaps allot more bytes? – Rohan Jhunjhunwala – 2016-07-09T00:38:05.483

@RohanJhunjhunwala yes I imagine it will be quite a challenge! I'm not expecting anybody to get 100% accuracy though; the challenge is to get as high as possible. In the "who said that?" challenge I linked to, we all peaked at ~30% accuracy. I've tried to set the byte limit so that 50–70% should be possible here. Hopefully I got the balance right. Of course if you find a great solution which uses more bytes, post it! It just won't be competing (you can cut it down for a competing version). – Dave – 2016-07-09T00:49:19.040

Have you removed polyglots from the dataset, or will those just be "gotchas"? – Geobits – 2016-07-09T00:54:04.450

@Geobits any answers which marked themselves as polyglots have been removed (e.g. titles like "4 languages"), but otherwise no I haven't filtered them, so for example an answer listed as C++ which is also valid C would be a "gotcha". More generally, I haven't manually filtered these beyond checking the results repeatedly and tweaking my automated filtering to remove the worst of the bad data. – Dave – 2016-07-09T00:57:25.123

Do we have to differentiate between different versions for languages, i.e. Python 2 vs. Python 3? – R. Kap – 2016-07-09T01:28:58.750

@R.Kap no - I specifically removed versions where I found them since it would be nearly impossible in most cases to distinguish them (essentially it would add polyglots as Geobits mentioned). Having said that, looking at the data now I see that "python2" and "python3" made it through (I filtered out "python 2" / "python 3"). I think I'll try to fix that quickly before too many people start working on a dataset which is much harder than it ought to be… (sorry to anyone who's started! — hopefully this will be a minor change) – Dave – 2016-07-09T01:32:24.930

It's gonna be so hard to differentiate among golflangs... – busukxuan – 2016-07-09T01:46:08.970

The fixed dataset is now available — link has been updated. I've checked over them all and the languages seem sane now, so I shouldn't need to change the data again. Sorry about that. – Dave – 2016-07-09T01:51:48.897

One possible way to improve the challenge would be to allow a program to know the origin of the first half of the snippets. The challenge would be to use those snippets as data for inferring the language of the second half of the snippets. – DavidC – 2016-07-09T01:54:09.927

@DavidC that would be pretty cool. As I implied in the intro, I'm hoping this sort of data can be used more now that it's building up. – Dave – 2016-07-09T02:00:01.420

Related: http://codegolf.stackexchange.com/q/21319/31716 – James – 2016-07-09T05:35:53.943

Interesting facts: the 20 most commonly used (at least in your dataset) languages make up 81% of your dataset, and the 10 most commonly used ones make up 61%. Even just detecting differences between Javascript, Pyth, CJam, and Python is enough to get about 35%. – helloworld922 – 2016-07-09T21:31:05.410

Does the program actually have to parse the input? In other words, can we use any method to detect the programming language? – R. Kap – 2016-07-10T01:13:50.213

@R.Kap not sure what you're getting at, but no you don't HAVE to do anything with the input. See example 1; it just returns "Python" without looking at what it's given. – Dave – 2016-07-10T08:04:09.117

If anyone wants the source language distribution: http://pastebin.com/rvCbm7ik – Jarmex – 2016-07-10T18:13:03.530

If I specify the result is the return value, can all other output be ignored? – Conor O'Brien – 2016-07-10T20:26:28.847

@CᴏɴᴏʀO'Bʀɪᴇɴ as long as your result is on a distinct channel, the others can be ignored. So if your result is to stdout, anything on stderr and the return code can be ignored. If your result is by return code, stdout/stderr can be ignored, etc. Same type of logic applies to functions. – Dave – 2016-07-10T21:34:39.587

Answers

17

C, 297 bytes, 43.194351% matched (v2)

This is the first non-golf challenge I've competed in. Surprisingly, golfing languages are actually rather easy to separate, with about 60% matching accuracy per language.

The code expects its input as a UTF-8 string, and the results are based on version 2 of the supplied dataset. It does not require <LF> to be replaced with actual newlines.

#define S(x)!!strstr(p,#x)
f(char*p){return S(#d)?:S(voi)?0:S(mai)|S(utc)?:S(mbd)|S(impor)|S(input)|S(def)|S(rang)?2:S(log)|S(fun)|S(=>)|S(lert)?3:S(<?)?4:S(echo)|S(sed)?5:S(+++)?6:S(<-)?7:S($_)|S(say)?8:S(\342)|S(\303)?9:S(->)|S(map)?10:S(@#)|S(]])|S([#)?11:S(V)|S(Q)?12:S(Z)|S(Y)?13:S(.)?14:15;}

Mapping table:

 0. java
 1. c
 2. python
 3. javascript
 4. php
 5. bash
 6. brainf*
 7. haskell
 8. perl
 9. apl
10. ruby
11. wolfram
12. pyth
13. matl
14. golfscript
15. cjam

The percentage is based on my hits/total calculation: 3916 hits/9066 total.
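For readers who find the golfed ternary chain hard to follow, here is a readable re-expression of the same first-match cascade in Python. It is a sketch transcribed from the C function above, not something scored against the dataset. `RULES`, `NAMES` and `classify` are names made up for this illustration:

```python
# Each rule is (substrings, code); the first rule with any substring present
# wins. Matching is done on UTF-8 bytes because \342 and \303 in the C source
# are raw lead bytes (0xE2, 0xC3) rather than whole characters.
RULES = [
    ((b"#d",), 1),                                       # c
    ((b"voi",), 0),                                      # java
    ((b"mai", b"utc"), 1),                               # c
    ((b"mbd", b"impor", b"input", b"def", b"rang"), 2),  # python
    ((b"log", b"fun", b"=>", b"lert"), 3),               # javascript
    ((b"<?",), 4),                                       # php
    ((b"echo", b"sed"), 5),                              # bash
    ((b"+++",), 6),                                      # brainf*
    ((b"<-",), 7),                                       # haskell
    ((b"$_", b"say"), 8),                                # perl
    ((b"\xe2", b"\xc3"), 9),                             # apl
    ((b"->", b"map"), 10),                               # ruby
    ((b"@#", b"]]", b"[#"), 11),                         # wolfram
    ((b"V", b"Q"), 12),                                  # pyth
    ((b"Z", b"Y"), 13),                                  # matl
    ((b".",), 14),                                       # golfscript
]
NAMES = ["java", "c", "python", "javascript", "php", "bash", "brainf*",
         "haskell", "perl", "apl", "ruby", "wolfram", "pyth", "matl",
         "golfscript", "cjam"]

def classify(snippet: str) -> int:
    data = snippet.encode("utf-8")
    for needles, code in RULES:
        if any(needle in data for needle in needles):
            return code
    return 15  # nothing matched: fall through to cjam
```

The order of the rules matters as much as the substrings themselves: the checks for the most common languages come first, and the one- and two-character checks that separate the golfing languages come last.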

owacoder

Posted 2016-07-09T00:34:08.140

Reputation: 1 556

Off to a great start; that was quick! As for "My test code and text editor see 9068 entries instead of 9065 for some reason" - did you skip the headers and the blank line at the end? That would account for 2 extra lines. – Dave – 2016-07-09T02:04:52.613

Sorry; this isn't passing rule #7 (must output correct answer for own source code): right now it says 0 which is Python (I'm guessing it wouldn't be too much work to re-order the checks to fix that) – Dave – 2016-07-09T03:00:00.053

You can lose the space after define S(x) and the 1 wherever you have ?1. – feersum – 2016-07-11T13:48:08.137

Thanks! I didn't know about the ?1 trick, and forgot about the other. ;-) – owacoder – 2016-07-11T14:07:18.887

Wow this has been quietly improving! Verified the latest 43.19% :) – Dave – 2016-07-11T18:15:29.167

@Dave - Yes, it has improved, but I think it's pretty much optimal now. I can't get much more out of it. I probably could hit >50% if I had 400 bytes to work with. – owacoder – 2016-07-11T20:03:35.697

2

Python 3, 278 bytes (previously 271), 25.049636% matched (v2, unverified)

def f(c):
 try:compile(c,'','exec');return 5
 except:
  for j in range(9):
   if any(l in c for l in [['echo'],['require'],['Main','string'],['document','alert','var ','function'],['String'],['def ','lambda','print '],['main','int','char'],['+++','<<<'],[]][j]):break
 return j

map:

0 = bash
1 = ruby
2 = c#
3 = javascript
4 = java
5 = python
6 = c
7 = brainf*
8 = cjam

Much better golfed now (probably still not great), and it finally broke the 25% barrier! Inputs must have <LF> replaced with a newline (\n).

helloworld922

Posted 2016-07-09T00:34:08.140

Reputation: 2 503

I like the exec attempt. By the way, you're allowed to ask for the input to have <LF> pre-replaced with \n, so you can save a few bytes there to potentially add some more fine-tuning. – Dave – 2016-07-10T08:47:23.707

Finally got around to verifying this. I only see 2103 (23.19%) for some reason - any idea why the difference? – Dave – 2016-07-11T18:36:40.540

hmmm... I'm not sure. Perhaps it has to do with how you're giving the input? I use decode('utf-8') to convert the raw input byte array to a native Python 3 unicode (utf-16?) string before giving it to my function. – helloworld922 – 2016-07-12T16:46:03.620

Ah OK. Could be an encoding issue; I'm just relying on whatever fileinput.input() does under-the-hood. I'll investigate. – Dave – 2016-07-12T17:38:54.067