Detect what programming language a snippet is

23

3

Your challenge is to take some source code as input, and output which programming language it is written in.

For example, you could have the input

class A{public static void main(String[]a){System.out.println("Hello, World!");}}

And output

Java

Your two main goals are diversity (how many programming languages you can detect) and accuracy (how good you are at detecting these languages).

For polyglots (programs valid in more than one language), you can decide what to do. You could just output the language that your program thinks is more likely, or you could output an error, or you could output an array of possible choices (which would probably get more upvotes than just an error!).

This is a popularity contest, because it would be very difficult to specify a different objective winning criterion. Voters, please vote on how many languages it can detect and how accurate it is.

Doorknob

Posted 2014-02-18T22:14:39.623

Reputation: 68 138

That is impossible, cause print("") can be used in many languages. – Ismael Miguel – 2014-02-18T22:18:15.067

@IsmaelMiguel I never said "unambiguously." Sure, some programs run in many languages, but the goal is to identify as many programs as possible. – Doorknob – 2014-02-18T22:21:50.387

What you are asking is impossible in my opinion. – Ismael Miguel – 2014-02-18T22:22:36.050

With your edit, now it seems more possible. – Ismael Miguel – 2014-02-18T22:27:49.223

What about languages that are valid for EVERY input? Like Whitespace. This sentence is a valid Whitespace program. This whole page is a valid Whitespace program. – Ismael Miguel – 2014-02-18T23:40:16.353

@IsmaelMiguel You could try detecting how likely the program is Whitespace (tabs, arrangement of chars, etc.). (That's the whole point of this challenge.) Or you could just not detect whitespace ;) – Doorknob – 2014-02-18T23:44:32.513

To do this properly you'd need a huge test set and train a classifier. – marinus – 2014-02-19T00:10:27.753

@marinus Not necessarily; for example, Python could be easily detected by searching for : in the right place. Similarly, PHP/Perl could be detected by $s in certain places, XML/CSS by obviousness, objective-C by its unique syntax, APL by special chars, etc. – Doorknob – 2014-02-19T00:17:10.413

Reasonable job for Amazon Mechanical Turk. – Darren Stone – 2014-02-19T03:30:17.503

Is the input guaranteed to be a valid program? Like some input could be class A{public static void main(String[]a){System.println.out("Hello, World!");}} which is invalid. – Gaurang Tandon – 2014-02-19T05:29:58.273

Or likewise will HTML input always start with <!DOCTYPE html> followed by the <html>,<body> and other tags (like meta) in their correct order? – Gaurang Tandon – 2014-02-19T11:23:56.580

@Gaurang Yes, but there could be different class name, different doctype, etc. – Doorknob – 2014-02-19T12:17:38.867

@Doorknob And, though I have 5-6 languages ready, I just want to confirm: will the given code always produce output, i.e. have a print() statement? – Gaurang Tandon – 2014-02-20T11:12:17.327

Related: http://codegolf.stackexchange.com/questions/15372/write-a-program-in-disguise (try testing your submissions on these) – None – 2014-08-07T08:18:58.063

Answers

18

234 text formats - Unix Shell

(not all of them languages - I need to count them carefully)

file "$1"

I hesitate to post this somewhat smart-a$$ answer, but I don't see anything in the rules banning it, and the file shell utility really does do a good job of this, e.g.:

$ file golfscript.rb 
golfscript.rb: Ruby module source, ASCII text
$ file template.c 
template.c: ASCII C program text
$ file adams.sh
adams.sh: Bourne-Again shell script, ASCII text executable
$ 

Furthermore you can use the -k option to "keep going" when testing a polyglot:

 -k, --keep-going
         Don't stop at the first match, keep going.  Subsequent matches
         will be have the string ‘\012- ’ prepended.  (If you want a new‐
         line, see the -r option.)
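
A polyglot detector built on this could split the -k output back into individual guesses. Here is a minimal Python sketch, assuming the literal separator quoted in the man page excerpt above. The helper names split_matches and guess_all and the sample line are made up for illustration, and a file name that itself contains ": " would confuse the split.

```python
import subprocess

# Separator that file -k puts between matches, per the man page excerpt:
# a literal backslash, "012", a dash and a space (not a real newline).
SEPARATOR = "\\012- "


def split_matches(output):
    """Split one line of `file -k` output into (filename, [descriptions])."""
    name, _, rest = output.strip().partition(": ")
    parts = [part.strip() for part in rest.split(SEPARATOR)]
    return name, [part for part in parts if part]


def guess_all(path):
    """Run `file -k` on path and return every description it reports."""
    out = subprocess.check_output(["file", "-k", path], universal_newlines=True)
    return split_matches(out)[1]


# Hypothetical output line, not captured from a real run.
sample = "poly.c: C source, ASCII text\\012- data"
print(split_matches(sample))
```

Only split_matches is exercised here; guess_all needs the file utility on the PATH.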

Also, the -l option lists the magic patterns together with their strength, which gives you an idea of how good the algorithm is for different languages:

$ file -l | grep shell
unknown, 0: Warning: using regular magic file `/etc/magic'
Strength = 280 : shell archive text [application/octet-stream]
Strength = 250 : Tenex C shell script text executable [text/x-shellscript]
Strength = 250 : Bourne-Again shell script text executable [text/x-shellscript]
Strength = 240 : Paul Falstad's zsh script text executable [text/x-shellscript]
Strength = 240 : Neil Brown's ash script text executable [text/x-shellscript]
Strength = 230 : Neil Brown's ae script text executable [text/x-shellscript]
Strength = 210 : Tenex C shell script text executable [text/x-shellscript]
Strength = 210 : Bourne-Again shell script text executable [text/x-shellscript]
Strength = 190 : Tenex C shell script text executable [text/x-shellscript]
Strength = 190 : Bourne-Again shell script text executable [text/x-shellscript]
Strength = 180 : Paul Falstad's zsh script text executable [text/x-shellscript]
Strength = 150 : Tenex C shell script text executable [text/x-shellscript]
Strength = 150 : Bourne-Again shell script text executable [text/x-shellscript]
Strength = 140 : C shell script text executable [text/x-shellscript]
Strength = 140 : Korn shell script text executable [text/x-shellscript]
Strength = 140 : Paul Falstad's zsh script text executable [text/x-shellscript]
Strength = 130 : POSIX shell script text executable [text/x-shellscript]
Strength = 130 : Plan 9 rc shell script text executable []
$ 

This is file-5.09 (on Ubuntu 12.04)

Digital Trauma

Posted 2014-02-18T22:14:39.623

Reputation: 64 644

Looks like a standard loophole, i.e. delegating solution to existing program. – Vi. – 2015-02-13T15:57:06.763

This actually does pretty well on a 16-language polyglot - https://gist.github.com/riking/9088817 – Riking – 2014-02-19T09:34:42.347

You might as well cut out the middle man and avoid the shell entirely: ln -s /usr/bin/file /usr/local/bin/myspecialtool. If your answer counts, then doesn't this count just as well? (Don't worry, I'm not serious.) – hvd – 2014-02-19T10:01:46.680

10

Bash - about 35 bytes (down from 50) per compilable language

The trick is to compile only (-c) without linking, so you don't have to worry about linking errors from missing libraries, and it's more forgiving if you only have code snippets.

Thanks to Shahbaz for shorter forms!

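# { ...; } runs in the current shell, so exit really ends the script;
# with ( ... ) the exit only leaves a subshell and the next compiler still runs.
gcc -c "$1" && { echo C; exit 0; }
g++ -c "$1" && { echo C++; exit 0; }
gpc -c "$1" && { echo Pascal; exit 0; }
gfortran -c "$1" && { echo Fortran; exit 0; }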

etc...
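
The same "first compiler that accepts it wins" idea can be sketched in Python. This is only a sketch under a few assumptions: it uses the -fsyntax-only flag suggested in the comments instead of -c, it forces the language with -x because gcc otherwise guesses from the file extension, and the names COMPILERS, run_quietly and detect are made up for illustration.

```python
import os
import subprocess

# (language, command prefix) pairs, tried in order.
COMPILERS = [
    ("C", ["gcc", "-x", "c", "-fsyntax-only"]),
    ("C++", ["g++", "-x", "c++", "-fsyntax-only"]),
]


def run_quietly(command):
    """Return True if the command exists and exits with status 0."""
    try:
        with open(os.devnull, "w") as sink:
            return subprocess.call(command, stdout=sink, stderr=sink) == 0
    except OSError:  # compiler not installed
        return False


def detect(path, compilers=COMPILERS, succeeds=run_quietly):
    """Return the first language whose compiler accepts path, else None."""
    for language, prefix in compilers:
        if succeeds(prefix + [path]):
            return language
    return None
```

Because valid C is often valid C++ too, the order of the list decides which of the two gets reported.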

user15259

Posted 2014-02-18T22:14:39.623

Reputation:

Since you mention number of bytes per compilable language, you might be interested in lines like: gcc -c $1 && (echo C; exit 0) – Shahbaz – 2014-02-20T12:51:16.150

Thank you, I'm not very good at really squeezing code! – None – 2014-02-20T14:16:30.180

Sure. The && and || in bash are really useful and help cleanup the code a lot. They are by no means used for obfuscation, so you'd do well to learn them. – Shahbaz – 2014-02-20T14:20:09.530

You can also pass -fsyntax-only to only check the syntax and skip actual compilation. – peppe – 2014-02-21T21:59:15.910

7

18 programming languages, 1002 bytes, accuracy: test for yourself :)

(yep I know this is not code golf, but for the fun of it)

The program searches for iconic code snippets. The checks are ordered so that the most clear-cut ones come first, and languages that can embed other languages are tested before the languages they embed (e.g. PHP before HTML).

This obviously fails for programs like System.out.println('<?php');

t = (p) ->
    h = (x) -> -1 != p.indexOf x
    s = (x) -> 0 == p.indexOf x

    if h "⍵" then "APL"
    else if h "<?php" then "PHP"
    else if h("<?xml") and h "<html" then "XHTML"
    else if h "<html" then "HTML"
    else if h "<?xml" then "XML"
    else if h("jQuery") or h "document.get" then "JavaScript"
    else if h "def __init__(self" then "Python"
    else if h "\\documentclass" then "TeX"
    else if h("java.") or h "public class" then "Java"
    else if s("SELE") or s("UPDATE") or s "DELE" then "SQL"
    else if /[-\+\.,\[\]\>\<]{9}/.test p then "Brainfuck"
    else if h "NSString" then "Objective-C"
    else if h "do |" then "Ruby"
    else if h("prototype") or h "$(" then "JavaScript"
    else if h "(defun" then "Common Lisp"
    else if /::\s*[a-z]+\s*->/i.test p then "Haskell"
    else if h "using System" then "C#"
    else if h "#include"
        if h("iostream") or h "using namespace" then "C++"
        else "C"
    else "???"

program = ""
process.stdin.on 'data', (chunk) -> program += chunk
process.stdin.on 'end', -> console.log t program

Usage on node: coffee timwolla.coffee < Example.java

Demo (Online-Demo on JSFiddle):

[timwolla@~/workspace/js]coffee puzzle.coffee < ../c/nginx/src/core/nginx.c 
C
[timwolla@~/workspace/js]coffee puzzle.coffee < ../ruby/github-services/lib/service.rb
Ruby
[timwolla@~/workspace/js]coffee puzzle.coffee < ../python/seafile/python/seaserv/api.py
Python

TimWolla

Posted 2014-02-18T22:14:39.623

Reputation: 1 878

On my computer this outputs nothing, not even on input that should obviously work. Granted, I might be doing something wrong as I've never used Coffeescript before. – marinus – 2014-02-19T02:39:52.480

@marinus Note that when manually inputting code you need to send an EOF (Ctrl+D) to trigger execution. Generally: The detector should at least spit out three question marks. – TimWolla – 2014-02-19T02:41:23.520

Nope, nothing. Do I need to pass coffee any arguments? I had just tried redirecting files into it, but just running it and going ^D doesn't do anything either. – marinus – 2014-02-19T02:49:46.773

@marinus Try: npm install coffee-script && node_modules/.bin/coffee timwolla.coffee < timwolla.coffee in a temporary folder, this should spit out APL. (assuming you have a recent version of node and npm installed) – TimWolla – 2014-02-19T02:51:56.237

@marinus I just added an online demo on JSFiddle: http://jsfiddle.net/TimWolla/wkJKY/ – TimWolla – 2014-02-19T02:56:47.197

npm wouldn't work either so obviously it's something about my system. But all right, the fiddle convinced me, have an upvote. – marinus – 2014-02-19T03:17:09.960

I'll start using lowercase omega more in my non-APL programs. – John Dvorak – 2014-02-19T08:44:31.353

Sorry dude, but that code is too inaccurate in my opinion. – Ismael Miguel – 2014-02-19T20:12:58.890

4

Just a few broad generalizations.

I think it's fairly accurate.

This is Ruby btw. Takes (multiline) input from stdin.

puts case $<.read
when /\)\)\)\)\)/
  "Lisp"
when /}\s+}\s+}\s+}/
  "Java"
when /<>/
  "Perl"
when /\|\w+\|/ # block parameters like |x|; unescaped pipes would match everything
  "Ruby"
when /\w+ :- \w+ \./
  "Prolog"
when /^[+\-<>\[\],.]+$/ # escaped '-', otherwise +-< is a range that includes digits
  "brainfuck"
when /\[\[.*\]\]/
  "Bash"
when /~]\.{,/
  "golfscript"
end

daniero

Posted 2014-02-18T22:14:39.623

Reputation: 17 193

I would think #include is a better predictor for c. What about #!/bin/(ba)?sh for bash/shell scripts? – Digital Trauma – 2014-02-19T04:01:12.107

@DigitalTrauma Yea, I think you're right about #include. For artistic reasons I'm not going to just catch the hash-bang where the name of the language is explicitly spelled out tho. – daniero – 2014-02-19T04:18:35.707

#include is a comment in ini files and php – Ismael Miguel – 2014-02-19T13:51:30.293

@IsmaelMiguel true, but generally it would appear more often in a C-program and I'm generalizing here. But, it messes up the Feng Shui of the program so I took it away. Now it doesn't look at any words at all, just typical traits of the syntax. – daniero – 2014-02-19T14:46:11.570

Now you have my upvote. And yes, you are right. But i like your answer btw. Try to include vb (searching for /^dim / should do it). And you can go even more crazy and add /^#include <luac(?:.h)?>/ to detect lua inside c. – Ismael Miguel – 2014-02-19T15:14:22.887

"/}\s+}\s+}\s+}/" Looks like I'm no longer a C# programmer... – NPSF3000 – 2014-02-19T22:50:03.290

+1 for having prolog, but no C :) – SztupY – 2014-02-19T23:48:24.193

I would add \$\w+ after the perl one to detect PHP. Also (\w+)::~\1 is usually a C++ destructor – SztupY – 2014-02-19T23:53:32.367

4

This answer is a proof of concept that is unlikely to receive any more work from me.

It falls short in several ways:

  • The output is not exactly as the question requests, but close enough and could easily be modified to produce the exact output required.
  • There are several ways to make the code perform better and/or better ways to represent the data structures.
  • and more

The idea is to set a list of keywords/characters/phrases that can identify a specific language and assign a score to that keyword for each language. Then check the source file(s) for these keywords, and tally up the scores for each language that you find keywords for. In the end the language with the highest score is the likely winner. This also caters for polyglot programs as both (or all) the relevant languages will score high.

All you need to do to add more languages is identify their "signatures" and add them to the mapping.

You can also assign different scores to different keywords per language. For example, if you feel volatile is used more in Java than in C, set the score for volatile keyword to 2 for Java and 1 for C.

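import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class SourceTest {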

  public static void main(String[] args) {
    if (args.length < 1) {
      System.out.println("No file provided.");
      System.exit(0);
    }
    SourceTest sourceTest = new SourceTest();
    for (String fileName : args) {
      try {
        sourceTest.checkFile(fileName);
      } catch (FileNotFoundException e) {
        System.out.println(fileName + " : not found.");
      } catch (IOException e) {
        System.out.println(fileName + " : could not read");
      }
    }
    System.exit(0);
  }

  private Map<String, LanguagePoints> keyWordPoints;
  private Map<LANGUAGES, Integer> scores;

  private enum LANGUAGES {
    C, HTML, JAVA;
  }

  public SourceTest() {
    init();
  }

  public void checkFile(String fileName) throws FileNotFoundException, IOException {
    scores.clear(); // start from zero so scores don't carry over between files
    String fileContent = getFileContent(fileName);
    testFile(fileContent);
    printResults(fileName);
  }

  private void printResults(String fileName) {
    System.out.println(fileName);
    for (LANGUAGES lang : scores.keySet()) {
      System.out.println("\t" + lang + "\t" + scores.get(lang));
    }
  }

  private void testFile(String fileContent) {
    for (String key : keyWordPoints.keySet()) {
      if (fileContent.indexOf(key) != -1) {
        LanguagePoints points = keyWordPoints.get(key);
        for (LANGUAGES lang : points.keySet()) {
          // Add the weight configured for this keyword and language, not a flat 1
          Integer current = scores.get(lang);
          scores.put(lang, (current == null ? 0 : current) + points.get(lang));
        }
      }
    }
  }

  private String getFileContent(String fileName) throws FileNotFoundException, IOException {
    File file = new File(fileName);
    FileReader fr = new FileReader(file);// Using 1.6 so no Files
    BufferedReader br = new BufferedReader(fr);
    try {
      StringBuilder fileContent = new StringBuilder();
      String line = br.readLine();
      while (line != null) {
        // readLine() strips the line terminator, so put one back
        fileContent.append(line).append('\n');
        line = br.readLine();
      }
      return fileContent.toString();
    } finally {
      br.close(); // always release the file handle
    }
  }

  private void init() {
    scores = new HashMap<LANGUAGES, Integer>();

    keyWordPoints = new HashMap<String, LanguagePoints>();
    keyWordPoints.put("public class", new LanguagePoints().add(LANGUAGES.JAVA, 1));
    keyWordPoints.put("public static void main", new LanguagePoints().add(LANGUAGES.JAVA, 1));
    keyWordPoints.put("<html", new LanguagePoints().add(LANGUAGES.HTML, 1));
    keyWordPoints.put("<body", new LanguagePoints().add(LANGUAGES.HTML, 1));
    keyWordPoints.put("cout", new LanguagePoints().add(LANGUAGES.C, 1));
    keyWordPoints.put("#include", new LanguagePoints().add(LANGUAGES.C, 1));
    keyWordPoints.put("volatile", new LanguagePoints().add(LANGUAGES.JAVA, 1).add(LANGUAGES.C, 1));
  }

  private class LanguagePoints extends HashMap<LANGUAGES, Integer> {
    public LanguagePoints add(LANGUAGES l, Integer i) {
      this.put(l, i);
      return this;
    }
  }
}
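
For comparison, the weighted scoring described above fits in a few lines of Python. This is only a sketch, not a port of the class above: the table reuses some of the keywords from init(), the weight of 2 for volatile follows the example in the text, and the names KEYWORD_POINTS, score and ranked are made up for illustration.

```python
from collections import defaultdict

# keyword -> {language: weight}; the weights are illustrative.
KEYWORD_POINTS = {
    "public class": {"Java": 1},
    "public static void main": {"Java": 1},
    "<html": {"HTML": 1},
    "<body": {"HTML": 1},
    "#include": {"C": 1},
    "volatile": {"Java": 2, "C": 1},
}


def score(source, keyword_points=KEYWORD_POINTS):
    """Return {language: total weight of the keywords found in source}."""
    totals = defaultdict(int)
    for keyword, points in keyword_points.items():
        if keyword in source:
            for language, weight in points.items():
                totals[language] += weight
    return dict(totals)


def ranked(source):
    """Languages sorted by descending score, ties broken by name."""
    return sorted(score(source).items(), key=lambda item: (-item[1], item[0]))


print(ranked("public class A { volatile int x; }"))
```

A polyglot shows up as several languages near the top of the ranking rather than one clear winner.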

ufis

Posted 2014-02-18T22:14:39.623

Reputation: 51

2

Javascript - 6 languages - high accuracy

Current Languages: Java, C, HTML, PHP, CSS, Javascript

I work on the principle that whenever an input satisfies a criterion, it is given a point for that language, and the results are reported based on those scores.

Features:

  • No built-in functions that determine language type used.
  • Does not straightaway declare the input text is x language on seeing a keyword.
  • Suggests other probable languages also.

If any input in the languages I have covered so far is not caught or gets a wrong result, please report it and I'd be happy to fix it.

Sample Input 1:

class A{public static void main(String[]a){System.out.println("<?php");}}

Sample Output 1:

My program thinks you have :
Java with a chance of 100%
Php with a chance of 25%
----------------

Explanation:

A detector that stops at the first keyword it sees would be fooled by the "<?php" string and print PHP. Since my program works on the basis of scores, nothing fails: it identifies Java in first place and lists the other possible results after it.
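
To see the difference between stopping at the first keyword and scoring on this input, here is a small Python sketch. It is not the answer's code: the rule list, the regexes and the function names are simplified stand-ins chosen for illustration.

```python
import re

# First-keyword-wins rules, checked in order.
FIRST_MATCH_RULES = [("<?php", "PHP"), ("System.out", "Java")]

# Four checks per language, loosely modelled on the checks in this answer.
CHECKS = {
    "Java": [
        r"class\s+[\w$]+\s*\{",
        r"public\s+static\s+void\s+main",
        r"\}\s*\}\s*$",
        r"System\s*\.\s*out",
    ],
    "PHP": [r"<\?php", r"\?>", r"\becho\b", r"\$\w+\s*="],
}


def first_match(source):
    """Return the language of the first rule whose needle occurs in source."""
    for needle, language in FIRST_MATCH_RULES:
        if needle in source:
            return language
    return None


def chances(source):
    """Percentage of checks passed per language, highest first, zeros dropped."""
    result = {}
    for language, patterns in CHECKS.items():
        passed = sum(1 for pattern in patterns if re.search(pattern, source))
        result[language] = 100 * passed // len(patterns)
    hits = [(lang, pct) for lang, pct in result.items() if pct]
    return sorted(hits, key=lambda item: -item[1])


sample = 'class A{public static void main(String[]a){System.out.println("<?php");}}'
print(first_match(sample))  # PHP
print(chances(sample))      # [('Java', 100), ('PHP', 25)]
```

The string literal still earns PHP one point out of four, but it cannot outvote the four Java checks.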

Sample Input 2:

class A{public static void main(String[]a){System.out.println("HelloWorld!");}}

Sample Output 2:

Java
----------------

Sample Input 3:

ABCDEFGHIJKLMNOPQRSTUVWXYZ

Sample Output 3:

Language not catched! Sorry.
----------------

The code:

// Helper functions

String.prototype.m = function(condition){
  return this.match(condition);
};

String.prototype.capitalize = function(){
  return this[0].toUpperCase() + this.substr(1);
};

function getFuncName(func){
  var temp =  func.toString();
  temp = temp.substr( "function ".length);
  temp = temp.substr( 0, temp.indexOf("("));
  return temp.capitalize();
}

// Get input
var lang_input = prompt("Enter programming language");

// Max score of 4 per lang

function java(input){
  var score = 0;
  score += input.m(/class[\s\n]+[\w$]+[\s\n]*\{/) ? 1 : 0;
  score += input.m(/public[\s\n]+static[\s\n]+void[\s\n]+main[\s\n]*/) ? 1 : 0;
  score += input.m(/\}[\s\n]*\}[\s\n]*$/) ? 1 : 0;
  score += input.m(/System[\s\n]*[.][\s\n]*out/) ? 1 : 0;
  return score;
}

function c(input){
  var score = 0;
  // if Java has already passed (3 or more points), don't score C
  if(checks[0][1] >= 3)return 0;

  score += input.m(/^#include\s+<[\w.]+>\s*\n/) ? 1 : 0;
  score += input.m(/main[\s\n]*\([\s\n]*(void)?[\s\n]*\)[\s\n]*\{/) ? 1 : 0;
  // whitespace before the parenthesis is optional, so printf( matches too
  score += input.m(/printf[\s\n]*\(/) || input.m(/%d/) ? 1 : 0;
  score += input.m(/#include\s+<[\w.]+>\s*\n/) || input.m(/(%c|%f|%s)/) ? 1 : 0;
  return score;
}

function PHP(input){
  var score = 0;
  score += input.m(/<\?php/) ? 1 : 0;
  score += input.m(/\?>/) ? 1 : 0;
  score += input.m(/echo/) ? 1 : 0;
  // the $ must be escaped, otherwise it is an end-of-input anchor
  score += input.m(/\$[\w]+\s*=\s*/) ? 1 : 0;
  return score;
}

function HTML(input){
  var score = 0;
  // if php has passed
  if(checks[2][1] >= 2) return 0;

  score += input.m(/<!DOCTYPE ["' \w:\/\/]*>/) ? 1 : 0;
  score += input.m(/<html>/) && input.m(/<\/html>/) ? 1 : 0;
  score += input.m(/<body>/) && input.m(/<\/body/) ? 1 :  0;
  score += input.m(/<head>/) && input.m(/<\/head>/) ? 1 : 0;
  return score;
}

function javascript(input){
  var score = 0;
  // whitespace before "(" is optional in both the log call and function definitions
  score += input.m(/console[\s\n]*[.][\s\n]*log[\s\n]*\(/) ? 1 : 0;
  score += input.m(/[\s\n]*var[\s\n]+/) ? 1 : 0;
  score += input.m(/[\s\n]*function[\s\n]+[\w]+[\s\n]*\(/) ? 1 : 0;
  score += input.m(/document[\s\n]*[.]/) || 
           ( input.m(/\/\*/) && input.m(/\*\//) ) ||
           ( input.m(/\/\/.*\n/) )? 1 : 0;
  return score;
}

function CSS(input){
  var score = 0;
  score += input.m(/[a-zA-Z]+[\s\n]*\{[\w\n]*[a-zA-Z\-]+[\s\n]*:/) ? 1 : 0;
  // since color is more common, I give it a separate place
  score += input.m(/color/) ? 1 : 0;          
  score += input.m(/height/) || input.m(/width/) ? 1 : 0;
  score += input.m(/#[a-zA-Z]+[\s\n]*\{[\w\n]*[a-zA-Z\-]+[\s\n]*:/) ||
           input.m(/[.][a-zA-Z]+[\s\n]*\{[\w\n]*[a-zA-Z\-]+[\s\n]*:/) ||
           ( input.m(/\/\*/) && input.m(/\*\//) ) ? 1 : 0;
  return score;
}

// [Langs to check, scores]
var checks = [[java, 0], [c, 0], [PHP, 0], [HTML, 0], [javascript, 0], [CSS, 0]];
//Their scores

// Assign scores
for(var i = 0; i < checks.length; i++){
  var func = checks[i][0];
  checks[i][1] = func(lang_input);
}

// Sort the scores
checks.sort(function(a,b){ return b[1] - a[1]; });

var all_zero = true;

function check_all_zero(index){
  if(checks[index][1] > 0){ all_zero = false; return 0; } // someone is above zero

  // check next index only if it defined, else return zero
  if(checks[index + 1])
    check_all_zero(index + 1);
}

check_all_zero(0);

if(all_zero){
  console.log("Language not catched! Sorry.");
}else {
  var new_arr = [];                   // temp

  checks.map(function(value, index){
    if(value[1] > 0){
      var temp = [getFuncName(value[0]), value[1]];
      new_arr.push(temp);
    }
  });

  checks = new_arr.slice(0);          // array copy, because of mutation

  if(checks.length === 1){
    console.log(checks[0][0]);
  }else{
    console.log("My program thinks you have :");
    checks.map(function(value){
      var prob = (value[1]/4 * 100);
      console.log(value[0] + " with a chance of " + prob + "%");
    });
  }

} // Main else block finish

console.log("----------------");

Gaurang Tandon

Posted 2014-02-18T22:14:39.623

Reputation: 837