Meta regex golf

29

8

In the spirit of this xkcd

enter link description here

Write a program that plays regex golf with arbitrary pairs of lists. The program should at least attempt to make the regex short, a program that just outputs /^(item1|item2|item3|item4)$/ or similar is not allowed.

Scoring is based on the ability to generate the shortest regex. The test lists are those of successful and unsuccessful US presidential candidates, found here (thanks @Peter). Of course, the program must work for all disjoint lists, so simply returning an answer for the president one doesn't count.

Manishearth

Posted 2014-01-06T08:10:43.477

Reputation: 709

3/^item1|atem2|item3|item4$/ probably has unintended precedence (string has to either begin with item1, contain atem2, contain item3, or end with item4). – Konrad Borowski – 2014-01-06T08:43:19.237

7This would be a more interesting challenge if it had a scoring system based primarily on the size of the regexes generated. – Peter Taylor – 2014-01-06T10:33:51.140

@PeterTaylor Have any example lists? – Manishearth – 2014-01-06T10:35:02.287

1

In the spirit of the XKCD title text, successful and unsuccessful US presidential candidates. (NB I made this list by hand following Wikipedia, so there might be small errors; I've removed from the list of losers all surnames matching a winner, because otherwise distinguishing the lists is impossible, but I've deliberately not otherwise deduplicated).

– Peter Taylor – 2014-01-06T11:15:53.977

4I wonder if Randall Munroe is a better writer of code-golf challenges than we are... – Johannes Kuhn – 2014-01-06T12:36:00.710

6I wonder if Randall Munroe is gonna throw down on this question. – boothby – 2014-01-07T08:00:21.707

1Yay, a list of answers to run the program from panel 2 and produce the grep-regex for panel 3! – Bergi – 2014-01-07T19:59:01.630

Answers

8

Perl (111 110 122 characters)

use Regexp::Assemble;@ARGV=shift;my$r=new Regexp::Assemble;chomp,add$r "^\Q$_\E\$"while<>;$_=as_string$r;s/\(\?:/(/g;print

This uses CPAN module called Regexp::Assemble in order to optimize the regular expressions. Because what is the better language for regular expressions then Perl.

Also, readable version, just for fun (made with help of -MO=Deparse).

use Regexp::Assemble;
my $r = Regexp::Assemble->new;
while (<>) {
    chomp($_);
    $r->add("^\Q$_\E\$");
}
$_ = $r->as_string;
# Replace wasteful (?:, even if it's technically correct.
s/\(\?:/(/g;
print $_;

Sample output (I did CTRL-D after item4).

$ perl assemble.pl
item1
atem2
item3
item4
^(item[134]|atem2)$

Also, as a bonus, I'm writing the regex for every word in the question.

^(a((ttemp)?t|llowed\.|rbitrary)?|\/\^item1\|atem2\|item3\|item4\$\/|s(ho(rt,|uld)|imilar)|p((air|lay)s|rogram)|(Writ|mak|Th)e|l(ists\.|east)|o([fr]|utputs)|t(h(at|e)|o)|(jus|no)t|regex|golf|with|is)$

Also, list of presidents (262 bytes).

^(((J(effer|ack|ohn)s|W(ashingt|ils)|Nix)o|Van Bure|Lincol)n|C(l(eveland|inton)|oolidge|arter)|H(a(r(rison|ding)|yes)|oover)|M(cKinley|adison|onroe)|T(a(ylor|ft)|ruman)|R(oosevelt|eagan)|G(arfield|rant)|Bu(chanan|sh)|P(ierce|olk)|Eisenhower|Kennedy|Adams|Obama)$

Konrad Borowski

Posted 2014-01-06T08:10:43.477

Reputation: 11 185

This appears to read stdin for one list and force the other to be empty. Surely that's not what the question is asking for? – Peter Taylor – 2014-01-06T10:23:32.163

1@PeterTaylor: Well, it's not that the second list is of any importance. Unless second list has duplicates of first list, the regexp is valid. It would be nice to have a shorter regexp, but I'm sort of lazy. – Konrad Borowski – 2014-01-06T10:26:48.573

IMO you should at least have a way to take it as input, even if you then discard it. – Peter Taylor – 2014-01-06T10:32:47.810

@PeterTaylor: If you say so. My program now takes two arguments, one of them being the first list. – Konrad Borowski – 2014-01-06T10:34:40.027

This is nice! The question has changed slightly, could you run it on the given list pair and count the characters in the result? – Manishearth – 2014-01-06T12:38:14.400

@Manishearth: Done. – Konrad Borowski – 2014-01-06T12:40:57.163

4This is cool; but it produces needlessly long expressions since it creates exclusion (for any other list) by matching every possible full word. Which is not quite the same spirit as the original golf. – Nicole – 2014-01-07T04:57:38.600

4

Not my solution (obviously I'm not peter norvig!) but here's a solution of the (slightly modified) question courtesy of him: http://nbviewer.ipython.org/url/norvig.com/ipython/xkcd1313.ipynb

the program he gives is as follows (his work, not mine):

def findregex(winners, losers):
    "Find a regex that matches all winners but no losers (sets of strings)."
    # Make a pool of candidate components, then pick from them to cover winners.
    # On each iteration, add the best component to 'cover'; finally disjoin them together.
    pool = candidate_components(winners, losers)
    cover = []
    while winners:
        best = max(pool, key=lambda c: 3*len(matches(c, winners)) - len(c))
        cover.append(best)
        pool.remove(best)
        winners = winners - matches(best, winners)
    return '|'.join(cover)

def candidate_components(winners, losers):
    "Return components, c, that match at least one winner, w, but no loser."
    parts = set(mappend(dotify, mappend(subparts, winners)))
    wholes = {'^'+winner+'$' for winner in winners}
    return wholes | {p for p in parts if not matches(p, losers)}

def mappend(function, *sequences):
    """Map the function over the arguments.  Each result should be a sequence. 
    Append all the results together into one big list."""
    results = map(function, *sequences)
    return [item for result in results for item in result]

def subparts(word):
    "Return a set of subparts of word, consecutive characters up to length 4, plus the whole word."
    return set(word[i:i+n] for i in range(len(word)) for n in (1, 2, 3, 4)) 

def dotify(part):
    "Return all ways to replace a subset of chars in part with '.'."
    if part == '':
        return {''}  
    else:
        return {c+rest for rest in dotify(part[1:]) for c in ('.', part[0]) }

def matches(regex, strings):
    "Return a set of all the strings that are matched by regex."
    return {s for s in strings if re.search(regex, s)}

answer = findregex(winners, losers)
answer
# 'a.a|i..n|j|li|a.t|a..i|bu|oo|n.e|ay.|tr|rc|po|ls|oe|e.a'

where winners and losers are the winners and losers lists respectively (or any 2 lists of course) see the article for detailed explanations.

Mike H-R

Posted 2014-01-06T08:10:43.477

Reputation: 131

8While the linked article is interesting and I enjoyed reading it, this would have been better posted as a comment on the question instead of as an answer since it doesn't answer the posed question. – Gareth – 2014-01-07T11:04:39.633

1

You're right, it might have been better as a comment, I posted it as an answer simply because it answers the question perfectly. I didn't copy out the solution as I thought that'd be disingenuous and trying to take credit for someone elses work, in addition to providing a program that plays regex golf with 2 pairs of lists it also gives a fitness function and detailed code explanation along with the parallel to the set cover problem which I hadn't considered. If you still think it's not relevant, let me know, I'll delete and post as a comment.

– Mike H-R – 2014-01-07T20:54:01.223

1If you're worried about taking credit for someone else's work, flag and ask for a mod to make your answer "Community wiki". – Peter Taylor – 2014-01-08T08:44:45.380

1@PeterTaylor cool, I didn't know that was the protocol, done. – Mike H-R – 2014-01-08T11:24:00.647

2

My solution written in Factor:

USING:
    formatting fry
    grouping
    kernel
    math math.combinatorics math.ranges
    pcre
    sequences sets ;
IN: xkcd1313

: name-set ( str -- set )
    "\\s" split members ;

: winners ( -- set )
    "washington adams jefferson jefferson madison madison monroe
monroe adams jackson jackson vanburen harrison polk taylor pierce buchanan
lincoln lincoln grant grant hayes garfield cleveland harrison cleveland     mckinley
 mckinley roosevelt taft wilson wilson harding coolidge hoover roosevelt
roosevelt roosevelt roosevelt truman eisenhower eisenhower kennedy johnson     nixon
nixon carter reagan reagan bush clinton clinton bush bush obama obama" name-set ;

: losers ( -- set )
    "clinton jefferson adams pinckney pinckney clinton king adams
jackson adams clay vanburen vanburen clay cass scott fremont breckinridge
mcclellan seymour greeley tilden hancock blaine cleveland harrison bryan bryan
parker bryan roosevelt hughes cox davis smith hoover landon wilkie dewey dewey
stevenson stevenson nixon goldwater humphrey mcgovern ford carter mondale
dukakis bush dole gore kerry mccain romney" name-set winners diff
    { "fremont" } diff "fillmore" suffix ;

: matches ( seq regex -- seq' )
    '[ _ findall empty? not ] filter ;

: mconcat ( seq quot -- set )
    map concat members ; inline

: dotify ( str -- seq )
    { t f } over length selections [ [ CHAR: . rot ? ] "" 2map-as ] with map ;

: subparts ( str -- seq )
    1 4 [a,b] [ clump ] with mconcat ;

: candidate-components ( winners losers -- seq )
    [
        [ [ "^%s$" sprintf ] map ]
        [ [ subparts ] mconcat [ dotify ] mconcat ] bi append
    ] dip swap [ matches empty? ] with filter ;

: find-cover ( winners candidates -- cover )
    swap [ drop { } ] [
        2dup '[ _ over matches length 3 * swap length - ] supremum-by [
            [ dupd matches diff ] [ rot remove ] bi find-cover
        ] keep prefix
    ] if-empty ;

: find-regex ( winners losers -- regex )
    dupd candidate-components find-cover "|" join ;

: verify ( winners losers regex -- ? )
    swap over [
        dupd matches diff "Error: should match but did not: %s\n"
    ] [
        matches "Error: should not match but did: %s\n"
    ] 2bi* [
        dupd '[ ", " join _ printf ] unless-empty empty?
    ] 2bi@ and ;

: print-stats ( legend winners regex -- )
    dup length rot "|" join length over /
    "separating %s: '%s' (%d chars %.1f ratio)\n" printf ;

: (find-both) ( winners losers legend -- )
    -rot 2dup find-regex [ verify t assert= ] 3keep nip print-stats ;

: find-both ( winners losers -- )
    [ "1 from 2" (find-both) ] [ swap "2 from 1" (find-both) ] 2bi ;



IN: scratchpad winners losers find-both 
separating 1 from 2: 'a.a|a..i|j|li|a.t|i..n|bu|oo|ay.|n.e|ma|oe|po|rc|ls|l.v' (55 chars 4.8 ratio)
separating 2 from 1: 'go|e..y|br|cc|hu|do|k.e|.mo|o.d|s..t|ss|ti|oc|bl|pa|ox|av|st|du|om|cla|k..g' (75 chars 3.3 ratio)

It's the same algorithm as Norvig's. If hurting readability is the goal then you can probably golf away many more characters.

Björn Lindqvist

Posted 2014-01-06T08:10:43.477

Reputation: 590

1

FYI, you are missing a bunch of the losers from the official list (Burr, Jay, Badnarik, probably others I'm not seeing). So, your results are incorrect; for example, the first regex doesn't work, because it matches Burr and Jay.

– elixenide – 2014-01-09T20:56:04.270

1

My code is not in a very golf-ish and condensed fashion, but you can check it at https://github.com/amitayd/regexp-golf-coffeescript/ (or specifically the algorithm at src/regexpGolf.coffee).

It's based on Peter Norvig's algorithm, with two improvements:

  1. Create parts to use with character sets (i.e. use [ab]z, [ac]z, and [bc]z if valid parts are az, bz and cz).
  2. Allow to construct "top optimal paths" of covers, and not just a cover made of the best candidate at each iteration.

(And also threw in an optional randomness)

For the sets of winners/losers in this quiz I found a 76 chars regex using it:

[Jisn]e..|[dcih]..o|[AaG].a|[sro].i|T[ar]|[PHx]o|V|[oy]e|lev|sh$|u.e|rte|nle

Some more details in my blog post about porting the solver to coffeescript.

Amitay Dobo

Posted 2014-01-06T08:10:43.477

Reputation: 111

2Could you please contain your code in your answer? Otherwise, we can't see the code without clicking on the link! – wizzwizz4 – 2016-03-01T17:05:56.700