What's the Language?

17

3

Recently, the PPCG design leaderboard has been having some trouble parsing answer HTML headers.

In this challenge you'll be taking your own shot at parsing answer headers.


Example Test Cases

These are example inputs (NOT actual test cases), just to give you an idea of what inputs might look like.

Input: <h1>Python 3, 32 bytes</h1>
Desired Output: Python 3

Input: <h1>JavaScript, 13 chars / 32 bytes</h1>
Desired Output: JavaScript

Input: <b>VeryBadlyFormattedHeader v3 : (32 bytes)</b>

Spec

Your program should be 150 bytes or below

You will be given one line of an answer header's HTML, and you should do your best to extract the language from it. Input may contain Unicode characters.

Output case matters.

Tests

Github Gist with test cases

There is one test case per line. The format is:

<lang_name> - <rest_of_the_line_is_the_header>
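A minimal sketch of reading one line in that format, assuming the separator is the first " - " on the line (the same split the grep commands in the answers' verification steps perform). The helper name parse_test_line and the sample line are hypothetical; the sample is built from the first example above, not taken from the gist.

```python
def parse_test_line(line):
    """Split '<lang_name> - <header>' at the first ' - ' separator."""
    # maxsplit=1 keeps any later ' - ' inside the header untouched.
    lang, header = line.rstrip("\n").split(" - ", 1)
    return lang, header


# Hypothetical line assembled from the first example input.
lang, header = parse_test_line("Python 3 - <h1>Python 3, 32 bytes</h1>")
print(lang)
print(header)
```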

Scoring

Your score is:

 Number Correct
----------------
  Total Number

(which is a percent)
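The score can be computed with a small harness along these lines. This is only a sketch: score and naive_extract are hypothetical names, the two sample lines are built from the examples above rather than taken from the gist, and the comparison is exact because output case matters.

```python
def score(extract, test_lines):
    """Return the percentage of test lines whose language is extracted exactly."""
    correct = 0
    total = 0
    for line in test_lines:
        # Each line is '<lang_name> - <header>'; split at the first ' - '.
        expected, header = line.split(" - ", 1)
        total += 1
        if extract(header) == expected:  # exact, case-sensitive match
            correct += 1
    return 100.0 * correct / total


def naive_extract(header):
    # Toy extractor: assumes a leading '<h1>' and a comma after the name.
    return header[4:].split(",")[0]


sample = [
    "Python 3 - <h1>Python 3, 32 bytes</h1>",
    "JavaScript - <h1>JavaScript, 13 chars / 32 bytes</h1>",
]
print(score(naive_extract, sample))
```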

Tie-breaker is shortest code.

Downgoat

Posted 2016-01-23T05:28:30.280

Reputation: 27 116

There should be a tie-breaker like shortest code since 100% score is very achievable. – user81655 – 2016-01-23T07:15:18.047

The most common header style # is not even present in the test cases?! – edc65 – 2016-01-23T08:20:07.273

So the language is typically the first word in the input? – TanMath – 2016-01-23T09:43:43.643

@edc65 I believe this is about parsing the HTML output of the answer, not the Markdown source. – Kroltan – 2016-01-23T15:11:17.000

@Kroltan uh thanks for clarifying – edc65 – 2016-01-23T15:15:37.177

Parsing HTML with regex? Whatever next? – Neil – 2016-01-23T20:22:18.190

You should have made them output the byte number as well, that's been causing us just as much trouble... – ETHproductions – 2016-01-24T03:16:43.477

@ETHproductions That could be made into a separate challenge, considering it's different enough – Downgoat – 2016-01-24T03:17:36.913

I would highly recommend changing the tie-breaker from shortest code to earlier submission. Here's why.

– Mego – 2016-01-24T21:40:21.893

Answers

11

Retina 0.8.2, 100%, 75 71 70 68 67 64 59 53 51 bytes

<.*?>

(,| [-&(–5]| [0-7]\d)(?! W|...\)).*

2 |:

This is essentially code golf now, so I had to switch languages.

Try it online!

Verification

$ wget -q https://gist.githubusercontent.com/vihanb/1d99599b50c82d4a6d7f/raw/cd8225de96e9920db93613198b012749f9763e3c/testcases
$ grep -Po '(?<= - ).*' < testcases > input
$ grep -Po '^.*?(?= - )' < testcases > output
$ mono retina/Retina.exe headers.ret < input | head -n -1 | diff -s - output
Files - and output are identical

How it works

The code consists of three simple substitutions (or eliminations). Instead of trying to match the language name, we get rid of all parts of the input string that do not form part of it.

  1. <.*?> will match all HTML tags, so the substitution will eliminate them from the input.

    .*? matches any number of characters, but since ? makes the quantifier lazy, it will match as few as possible while still allowing the entire pattern to match. This avoids deleting the entire input, which will always begin with a < and end with a >.

    The language name now begins with the first character of the remaining modified input string.

  2. After the language's name, we will almost always find one of the following endings:

    a comma, or a space followed by -, &, (, –, 5, or two digits.

    The comma and the space-hyphen endings are rather common; beyond those, Python 2 &amp; PuLP... should be parsed as Python 2, Ruby (2.2.2p95)... as Ruby, >PHP – 3302 bytes as PHP, and Perl 5... as Perl.

    (,| [-&(–5]| \d\d).* would match all these endings (and all characters after them), but it will result in a few false positives:

    • , will match the comma in the language name Help, WarDoq!.

    • ( will match the version of JavaScript (ESx) and Java (1.8).

    • \d\d will match the version in Ti-Basic 84.

    We can fix the third problem case by using [0-7]\d instead of \d\d, to avoid matching the 8 in 84.

    For the other problem cases, we use the negative lookahead (?! W|...\)) that will prevent the preceding pattern from matching if it is followed by W (as in Help, WarDoq!) or by exactly three characters and a closing parenthesis (as in (ES6) or (1.8)).

    Putting it all together, (,| [-&(–5]| [0-7]\d)(?! W|...\)).* matches everything after the language name.

  3. We're left with two problem cases:

    <h1>Python <s>2</s> 3, <s>255</s> <s>204</s> <s>180</s> 178 bytes</h1>
    <h1><a href="http://sylwester.no/zozotez/" rel="nofollow">Zozotez Lisp</a>: 73</h1>
    

    get parsed as

    Python 2 3
    Zozotez Lisp:
    

    We can fix the first by removing the 2 and the space after it, and the second by removing : from the output.

    This is achieved by replacing 2 |: with the empty string.
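As a rough cross-check of the three stages, here is a Python sketch that applies the same patterns with re.sub. It is a port under assumptions, not the Retina program itself: Python's re engine stands in for the .NET engine that Retina uses, and extract_language is a hypothetical name. The last sample header is made up to exercise the lookahead; the others come from this post.

```python
import re


def extract_language(header):
    """Apply the three substitutions described above, in order."""
    # Stage 1: drop every HTML tag; the lazy .*? stops at the first '>'.
    text = re.sub(r"<.*?>", "", header)
    # Stage 2: cut from the first "ending" onwards, unless the lookahead
    # sees " W" (Help, WarDoq!) or three characters and ")" (e.g. (1.8)).
    text = re.sub(r"(,| [-&(–5]| [0-7]\d)(?! W|...\)).*", "", text)
    # Stage 3: clean up the two leftover problem cases.
    return re.sub(r"2 |:", "", text)


samples = [
    "<h1>Python 3, 32 bytes</h1>",
    "<h1>Python <s>2</s> 3, <s>255</s> <s>204</s> <s>180</s> 178 bytes</h1>",
    '<h1><a href="http://sylwester.no/zozotez/" rel="nofollow">Zozotez Lisp</a>: 73</h1>',
    "<h1>Java (1.8), 100 bytes</h1>",  # hypothetical header, not from the gist
]
for sample in samples:
    print(extract_language(sample))
```

Whether this port reaches the same 100% on the full gist has not been checked here; it only illustrates how the stages compose.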

Dennis

Posted 2016-01-23T05:28:30.280

Reputation: 196 637

16

Bash, 100%, 100 bytes

sed sX..s.2./s.XX|grep -Po '(?<=>)[^<]+?(?=(,(?! W)| [-&–5]| ?<| [0-79]\d| ?\((?!E|1\.)))'|head -1

Try it online on Ideone.

Verification

$ wget -q https://gist.githubusercontent.com/vihanb/1d99599b50c82d4a6d7f/raw/cd8225de96e9920db93613198b012749f9763e3c/testcases
$ grep -Po '(?<= - ).*' < testcases > input
$ grep -Po '^.*?(?= - )' < testcases > output
$ while read line; do bash headers.sh <<< "$line"; done < input | diff -s - output
Files - and output are identical

Dennis

Posted 2016-01-23T05:28:30.280

Reputation: 196 637

Huh, I really thought this would be very difficult... nice job! – Downgoat – 2016-01-23T07:01:01.307

4

CJam, 78.38% (76 bytes)

l{_'>#)>_c'<=}g_'<#<_{",-"&}#)_{_1$=',=+(<}{;}?

Try it online! or count the correct headers.

Dennis

Posted 2016-01-23T05:28:30.280

Reputation: 196 637

4

Jolf, 13 bytes, 85.94%, noncompeting

ρΥpΔid=',H',E
  pΔi         remove all HTML tags from input, preserving content
 Υ   d=',H     slice from beginning until a member is a comma
ρ         ',E  replace said comma with the empty string  
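For readers who do not know Jolf, the same idea (strip the tags, then keep everything before the first comma) might look like this in Python. This is an approximation of the approach described in the annotations, not a translation of the Jolf builtins, and language_before_comma is a hypothetical name.

```python
import re


def language_before_comma(header):
    """Strip HTML tags, then keep the text before the first comma."""
    text = re.sub(r"<[^>]*>", "", header)  # remove tags, preserve content
    return text.split(",", 1)[0]           # everything before the first comma


print(language_before_comma("<h1>Python 3, 32 bytes</h1>"))
print(language_before_comma("<b>VeryBadlyFormattedHeader v3 : (32 bytes)</b>"))
```

The second call shows the limitation: with no comma in the header, the whole stripped text comes back, which is consistent with this approach falling short of 100%.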

I had this update on my computer. It was to my chagrin that I had forgotten to update the respective code for the interpreter. I don't feel like shooting for 100%. Maybe everyone should format their headers the same ¯\_(ツ)_/¯

Conor O'Brien

Posted 2016-01-23T05:28:30.280

Reputation: 36 228

Stripping tags is a good idea. That saved three bytes in my Retina answer. – Dennis – 2016-01-23T18:54:13.020

@Dennis Thanks! Glad it helped. – Conor O'Brien – 2016-01-23T18:54:55.293