Partial Ordering of Regex Patterns

25

6

For the purpose of this challenge, we say that a regex pattern matches a string if the entire string is matched by the pattern, not just a substring.

Given two regex patterns A and B, we say that A is more specialized than B  if every string that is matched by A is also matched by B  but not the other way around. We say that A is equivalent to B if both patterns match exactly the same set of strings. If neither pattern is more specialized than the other nor are they equivalent, we say that A and B are incomparable.

For example, the pattern Hello, .*! is more specialized than .*, .*!; the patterns (Hello|Goodbye), World! and Hello, World!|Goodbye, World! are equivalent; and the patterns Hello, .*! and .*, World! are incomparable.

The relation "more specialized than" defines a strict partial order on the set of regex patterns. In particular, for all patterns A and B, exactly one of the following is true:

  • A is more specialized than B (A < B).
  • B is more specialized than A (A > B).
  • A and B are equivalent (A = B).
  • A and B are incomparable (AB).

When A and B are incomparable, we can further distinguish between two cases:

  • A and B are disjoint (AB), meaning that no string is matched by both of them.
  • A and B are intersecting (AB), meaning that some strings are matched by both.

Challenge

Write a program or a function that takes a pair of regex patterns and compares them using the above order. That is, if the two patterns are A and B, the program should determine whether A < B, A > B,
A = B or AB.

×92% Bonus An additional bonus is given if, when the patterns are incomparable, the program determines whether they're intersecting or disjoint.

Input and Output

The program should accept two regex patterns, as strings, in the flavor defined below. You may read the input through STDIN, the command line, as function arguments or an equivalent method. You may assume that the patterns are valid.

The program should produce one of exactly four distinct outputs (or five distinct outputs if you're going for the above bonus,) depending on the result of the comparison (the exact outputs are up to you.) You may write the output to STDOUT, return it as the function's result or use an equivalent method.

Regex Flavor

You may support whatever regex features you like, but you must support the following ones:

  • Alternation with |.
  • Quantification with *.
  • Grouping with ( and ).
  • Matching any character (possibly excluding newlines) with ..
  • (Optional: ×80% Bonus) Matching simple and negated character classes with […] and [^…], respectively. You don't have to support any predefined character classes (e.g. [:digit:]) but you should support character ranges.
  • Character escaping with \. It should at least be possible to esacpe special characters (i.e. |*().[^-]\) and preferably also common special characters in other flavors (e.g. {}), but the behavior when escaping nonspecial characters is unspecified. In particular, you don't have to support special escape sequences such as \n for a newline and the like. A possible implementation is simply to take the character following the \ as a literal.

You may assume the existance of an input character that cannot be matched by any literal (i.e. it can only be matched by . and negated character classes.)

Additional Rules

  • You may use regex libraries or builtin regex functionality only for the purpose of string matching and replacement.
  • You may ignore any locale-related issues, such as collation rules.
  • To state the obvious: your program must terminate. It should execute in a reasonable amount of time given typical patterns (definitely no more than an hour, preferably a lot less.)

Scoring

This is code-golf. Your score is the product of the code size, in bytes, and any of the bonuses. The lowest score wins.

Test Cases

The format of the test cases is as follows:

<Test ID>
<Pattern A>
<Ordering>
<Pattern B>

<Test ID>
<Pattern A>
<Ordering>
<Pattern B>

...

Where <Test ID> is an identifier for the test case, <Pattern A> and <Pattern B> are the regex patterns and <Ordering> is the ordering between them, and is one of:

  • <: <Pattern A> is more specialized than <Pattern B>.
  • >: <Pattern B> is more specialized than <Pattern A>.
  • =: The patterns are equivalent.
  • |: The patterns are incomparable and disjoint.
  • X: The patterns are incomparable and intersecting.

The special value <empty pattern> stands for the empty pattern.

A. Basic patterns

A1
Hello
=
Hello

A2
Hello
|
World

A3
Hell(|o)
=
Hell|Hello

A4
x
<
.

A5
.
>
x

A6
\.
<
.

A7
x|y
=
y|x

A8
x|y
X
y|z

A9
x|y
|
a|b

A10
x|y|z
>
x|z

A11
x.
X
.x

A12
.*
>
<empty pattern>

A13
()*
=
<empty pattern>

A14
x*
>
xxxx

A15
x*
>
x|xx|xxxx*

A16
x*
=
|x|xx|xxxx*

A17
x*
<
.*

A19
(.|..)*
=
.*

A20
(...)*
X
(..)*

A21
...*
X
(..)*

A22
...*
<
..*

A23
(...)*
|
Four

A24
a*
X
a.*

A25
a*
|
b.*

A26
x*
=
(x|)*

A27
(((.*)*)*)*
=
.*

B. Complex patterns

B1
aaa*
=
aa*a

B2
aa*a
=
a*aa

B3
a*
X
aaax||a|aa|aaa

B4
aaaaaaaaaaaaaa*
X
(aa)*|(aaa)*

B5
a*b*
<
(a|b)*

B6
((a|b)(a|b))*
>
(ab|ba)*

B7
((a|b)(a|b))*
=
(aa|ab|ba|bb)*

B8
.*.*
=
.*

B9
(x*y*z*)*
=
(x|y|z)*

B10
(...|.....)*
>
.........*

B11
(...|.....)*
=
|...|.....|......|.........*

B12
(...|......)*
<
|...|.......*

B13
(...|....|.....)*
=
|....*

B14
(...|x...............|x..x)*
>
(...|x..............|x..x)*

B15
(...|x...............|x..x)*
<
(..|...|x..............|x..x)*

B16
(____)*_x__(____)*
|
(____)*x___(____)*

B17
(____)*_x__(____)*
X
(_____)*x_____(_____)*

B18
(____)*_x__(____)*
|
(______)*x______(______)*

B19
(____)*_x__(____)*
<
_*x__*

C. Basic patterns with character classes

C1
[x]
=
x

C2
[xy]
|
xy

C3
[abc]
=
a|b|c

C4
[abc]
>
[ab]

C5
[abc]
<
[abcd]

C6
[ab]
X
[bc]

C7
[ab]
|
[cd]

C8
[abc]*
>
(a|b)*

C9
[^xyz]
>
a

C10
[^xyz]
|
x

C11
[^xyz]
|
[xyz]

C12
[^xyz]
X
[wxyz]

C13
[^xyz]
X
[^abc]

C14
[0-9]
=
[0123456789]

C15
[a-z]
>
[b-y]

C16
[a-z]
=
a|[b-y]|z

C17
[^0-9]*
>
Hello, World!

C18
[^0-9]*
|
123

C19
[^abc]
<
.

C20
[^abc]*
X
.

C21
[^a]|a
=
.

C22
[^abcd]|[a-c]|d
=
.

C23
[^a]*
X
b[^c]*

C24
[^a]*
|
a[^c]*

D. Complex patterns with character classes

D1
http://[^/][^/]*/.*
>
http://(www\.|)google\.com/.*

D2
http://[A-Za-z][A-Za-z]*\.wikipedia\.org/
X
(http://|)en\.wikipedia\.org/(|wiki(|/.*))

D3
[A-Za-z0-9._%+\-][A-Za-z0-9._%+\-]*@[A-Za-z0-9.\-][A-Za-z0-9.\-]*\.[A-Za-z][A-Za-z](|[A-Za-z]|[A-Za-z][A-Za-z])
X
email@address.com

D4
[A-Za-z0-9._%+\-][A-Za-z0-9._%+\-]*@[A-Za-z0-9.\-][A-Za-z0-9.\-]*\.[A-Za-z][A-Za-z](|[A-Za-z]|[A-Za-z][A-Za-z])
>
email@address\.com

D5
[A-Za-z0-9._%+\-][A-Za-z0-9._%+\-]*@[A-Za-z0-9.\-][A-Za-z0-9.\-]*\.[A-Za-z][A-Za-z](|[A-Za-z]|[A-Za-z][A-Za-z])
>
[A-Za-z0-9._%+\-][A-Za-z0-9._%+\-]*@gmail\.com

D6
JohnDoe@[A-Za-z0-9.\-][A-Za-z0-9.\-]*\.[A-Za-z][A-Za-z](|[A-Za-z]|[A-Za-z][A-Za-z])
<
[A-Za-z0-9._%+\-][A-Za-z0-9._%+\-]*@[A-Za-z0-9.\-][A-Za-z0-9.\-]*\.[A-Za-z][A-Za-z](|[A-Za-z]|[A-Za-z][A-Za-z])

D7
[A-Za-z0-9._%+\-][A-Za-z0-9._%+\-]*@gmail\.com
X
JohnDoe@[A-Za-z0-9.\-][A-Za-z0-9.\-]*\.[A-Za-z][A-Za-z](|[A-Za-z]|[A-Za-z][A-Za-z])

D8
[^, a-z][^,]*(|(, [^, ][^,]*)* and [^, ][^,]*)\.
>
One, two and three\.

D9
[^, a-z][^,]*(|(, [^, ][^,]*)* and [^, ][^,]*)\.
|
one, two and three\.

D10
[^, a-z][^,]*(|(, [^, ][^,]*)* and [^, ][^,]*)\.
|
One, two\.

D11
[^, a-z][^,]*(|(, [^, ][^,]*)* and [^, ][^,]*)\.
<
(.*(,|))*

D12
[A-Z][^,]*(|(, [a-z][^,]*)* & [a-z][^,]*)\.
<
[^, a-z][^,]*(|(, [^, ][^,]*)* (and|et|[&+]) [^, ][^,]*)\.

Test Program

The following snippet can be used to compare regex patterns:

<style>#main {display: none;}#main[loaded] {display: inline;}.pattern_container {position: relative;}.pattern_underlay, .pattern {font: 12pt courier, monospace;overflow: hidden;white-space: pre;padding: 7px;box-sizing: border-box;}.pattern_underlay {background-color: #dddddd;color: #707070;border-radius: 4px;box-shadow: 0.5px 0.5px 2.5px #aaaaaa;}.pattern_underlay[error] {background-color: #ffccbb;}.pattern {position: absolute;left: 0px;top: 0px;background: none;border: none;width: 100%;height: 100%;resize: none;white-space: normal;}#ordering {min-width: 28pt;text-align: center;font-size: 16pt;}#status {padding: 5px;background-color: #fffdce;box-shadow: 1.5px 1.5px 3.5px #aaaaaa;font-size: 10pt;white-space: pre;display: none;}#status[error] {display: inline;background-color: #ffe8df;}#status[loading] {display: inline;}.inline_code {background-color: #dddddd;font: 12pt courier, monospace;padding: 2px;}.placeholder {visibility: hidden;}.error_text {background-color: #fffcb7};</style><span id="main"><table><tr><td><div class="pattern_container"><div class="pattern_underlay" id="pattern1_underlay"></div><textarea class="pattern" id="pattern1" oninput="compare()">Hello, .*!</textarea></div></td><td id="ordering"></td><td><div class="pattern_container"><div class="pattern_underlay" id="pattern2_underlay"></div><textarea class="pattern" id="pattern2" oninput="compare()">.*, .*!</textarea></div></td></tr></table><br></span><span id="status" loading>Loading...</span><script type='text/javascript'>var Module = {setStatus: function (status) {document.getElementById("status").innerHTML = status;if (status == "") {compare();document.getElementById("status").removeAttribute("loading");document.getElementById("main").setAttribute("loaded", 1);}}};function underlay_chars(str) {if (/^\n*$/.exec(str))return str.split("\n").map(function () { return '<span class="placeholder"> \n</span>'; });if (str.indexOf("\n") >= 0)str = str.replace(/\s*$/gm, function (m) { return m.replace(/[^\n]/g, "\0"); });return (str + "\n").split("").map(function (c) {if (c == "\0") return "·";else return '<span class="placeholder">' + c + '</span>';});}function underlay_str(str) {return underlay_chars(str).join("");}function str_to_array32(str) {a = [];for (c of str) {n = c.charCodeAt(0);a.push(n & 0xff, (n >> 8) & 0xff, (n >> 16) & 0xff, n >> 24);}a.push(0, 0, 0, 0);return a;}function compare() {try {for (e of ["pattern1_underlay", "pattern2_underlay", "status"])document.getElementById(e).removeAttribute("error");for (e of ["pattern1", "pattern2"])document.getElementById(e + "_underlay").innerHTML = underlay_str(document.getElementById(e).value);c = Module.ccall("regex_compare", "number", ["array", "array"], [str_to_array32(document.getElementById("pattern1").value),str_to_array32(document.getElementById("pattern2").value)]);if (c >= 0)document.getElementById("ordering").innerHTML = "∥≬<>="[c];else {i = Module.ccall("regex_error_index", "number", [], []);l = Module.ccall("regex_error_length", "number", [], []);e = document.getElementById("pattern" + -c + "_underlay");t = underlay_chars(document.getElementById("pattern" + -c).value);e.setAttribute("error", 1);e.innerHTML =t.slice(0, i).join("") +'<span class="error_text">' + t.slice(i, i + l).join("") + "</span>" +t.slice(i + l).join("");e.setAttribute("error", 1);throw "Pattern error: " + Module.ccall("regex_error", "string", [], []).replace(/`(.*?)'/g, '<span class="inline_code">$1</span>');}} catch (e) {document.getElementById("ordering").innerHTML = "??";document.getElementById("status").innerHTML = e;document.getElementById("status").setAttribute("error", 1);}}</script><script async type="text/javascript" src="https://gist.githack.com/anonymous/91f27d6746566c7b4e4c/raw/c563bf84a01c3a1c6e5f021369a3e730a2e74a1a/rpo.js"></script>

Ell

Posted 2014-12-05T16:20:41.363

Reputation: 7 317

10Wow. Any answer submitted to this gets an automatic +1 from me. Determining whether two formal languages are isomorphic is hard enough. Determining whether one is a sublanguage of another is a complete undergraduate CS project. @___@ – COTO – 2014-12-05T21:49:26.750

Is there any specified behavior for invalid regex patterns? – Paul Guyot – 2014-12-06T08:10:34.767

@PaulGuyot No. You can assume the patterns are valid. – Ell – 2014-12-06T10:43:12.187

1I wonder - did you write won yourself (to see how feasible it is for a code golf question) or didn't you? – proud haskeller – 2014-12-07T18:10:13.143

1@proudhaskeller I did; I wrote the test snippet. It's a tough challenge, and there aren't going to be any one-liners here, but it's golfable. – Ell – 2014-12-07T19:41:10.333

Answers

10

Python 2, 522 bytes * .92 = 480.24 537.28

Edit 2: -60 bytes

Edit: Added explanation.

Defined is the function f which takes the two pattern strings as arguments and returns '<', '=', '>', '|', or 'X'. Less than one minute is needed to process all test cases.

The code consists of the following, but with \r, \n \t, and hex escapes (but not \0) replaced with their actual byte values.

#encoding=Latin
exec"""x\xda]RMo\xdb0\x0c\xbd\xe7Wx\'K\x96\x92\xc5mOR\xb8\xdf1@%|\x98%X\x80a\x19\x96\x02\x03n\xf2\xdfG:i;\xec$\x92z|\x8f_\x1b\x84%m~\xca\xbe\x1c\x0e\xbd\x0fU\x10Agi\x0e\x87\xea\n\x1f\xf9n{=\xea\0\x93\x08\xd2\xaez\xd0\x99\xcc,m\x07g\xbb\x80s\x9b\x08\xee\x8cRo"\xf3\x8bHy!-\x95\xd7\xa9\x8aS\xb50O5\xc3&\xb68\x0b\xe7\xb1\x19t\x92\x8a\x1d\xaf]\xc2f\x94\x92\x111T\xf3\xf1j\xba\x1b\x081r\xa2\x97\xea\xa5\x11\x03\x9bI\xca\xe6\xed\xe7\xab\xbd\xde`\xb6\x8b"\xd1\xc5\xf7\xd7?^l\xa7\xaeKK\xd7i\x91\x92\x8b\xaaE\x16\x8e\x9c\x12#3\x86\xe0"\xc6\xc9\x15\xfd\x86\xae\\\xde\xcc^\xa7\x94;,\xea\x94t\x08\x84\xa6J\x82\xee%\xb1\xe8\xacW\xb9\xb3\x14f\xd9\x84\xeb\x89\xe1\x8b\xd5\xa3r\xeb\xbf\x81D\rS\xf5\x8b/\xd7e\xaao\xf0\xeb\xf2\xbbv\xdd\xf1\x15\x1f\x93\xe4Aq\xff\x19\xc6\x98\x8b\xa8E\xad\xb2\xaae-m\x843\xc5\xd7!\x8e\xbe\xca.\x1a4\x01\xe8E;@-\xe4\xad9\xd5\xa7\x10\xa7\x9eg\xcea\x10\x83\x0e\xd2\r\x973\xb2o\xb8\xd7\x06\xc2\x0f\xa8\xdf\xdfk\x1b\x15\xb4v\x84H\xc9\xad]\xc1\x83C;\x03m\xc3\x16p\x1f\xe3\x1d\xbf\xa4\xe2\xbe\x8d\x1eX)\x1e\t\x9dv\xf3\xa9\xcd\xe8xGU\x9e\x0b\t\x97\xd6\x0c\x8c\xf2\nxa\xa9\x19u\xaf\xf2iN3\r\xd1\xae\x0f\xe3\x13\x0c@h\xb5W\xb0\xaad3\xef\t\x91s]R=~\xc3^Lv\xc7\x16\x15\xf4\xfb\xa7\x88ze_~B\x06\x80\x99\x03\x86\x7f\x0bY\x06U\xd2.\xeaV\x95\x87$\xd1\xce\xff\x8b\xbf\x9a\x99\xe0\x03u\xa1 =o0<n\xd0\xef]s`b\xb7\x98\x89\xael\xd2\x85\xceO:>\x80j\xfa\xdeb\x95\x95k\x91N\xbe\xfc'5\xac\xe7\xe8\x15""".decode('zip')

The above statement causes the following code to be executed:

z=frozenset
def f(f,s):
 u={s};d,l,f=n(f);w,h,s=n(s);_=0;r=[[z(f[0]),z(s[0])]]
 for e,o in r:
  p=z(zip([e]*h,o)+zip(e,[o]*l))
  if p-u:_|=((l in e)+2*(h in o))*4/3;u|=p;r+=[[reduce(z.__or__,(oo[i+1]for i in ii if ff[i]in[t,4][t<4:]),z())for ii,oo,ff in(e,f,d),(o,s,w)]for t in z([d[i]for i in e]+[w[i]for i in o])]
 return'|=><X'[_-3]
def n(s):
 s=list('('+s+')');i=0
 while s[i:]:f=s[i];h='()|*.'.find(f);s[i]=(h,f)[h<0];s[i:i+1]*=f!='\\';i+=1;l=i;h=1;w=e=[];p=[0];t=[{l}]
 while i:
  d=[i];i-=1;o=[i];f=s[i];t=[{i}]+t
  if f<1:h-=1;e+=zip(o*l,d+s.pop());w.pop()
  if f==1:h+=1;w=w+o;s+=[[]];e+=[o+d]
  if f==2:s[-1]+=d;e+=[(i,w[-1])]
  if h==p[-1]:e+=[t[-1:]+o,[i,1+t.pop()]];p.pop()
  if f==3:p+=[h];t+=o
 for f,o in e:
  for n in t:n|=(n,t[o])[f in n]
 return s+[1],l,t

If the second code sample is stored in the string s, something similar to the first can be produced by the expression '#encoding=Latin\nexec"""%s"""'%__import__('zlib').compress(s). It may be necessary to fix some characters such as null bytes or backslashes. The unzipped code is nearly 1000 800 bytes so perhaps it is more obfuscated than golfed...but at least I managed to golf the encoding a bit: from Latin1 to Latin.

Explanation

The program works by using the indices of the string as a crude way to keep track of the states of parsing a string. The n function builds transition tables. First it parses the string and finds all instances of two kinds of transitions. First, there are transitions that don't involve adding another letter to the string. For example, jumping from a * to the beginning of a quantified expression. Secondly, there are transitions of adding a character, which simply move forward by one index. Then the transitive closure of the no-character transitions is calculated, and those are substituted for the destinations of the 1-character transitions. So it returns a mapping of an index and a character to a set of indices.

The main function does a BFS over possible input strings. It tracks a set of all possible indices for whatever fragment of a string it is considering. What we are interested in finding is states that are either accepted by both regexes, or by one and not the other. To show that a regex is accepted, it is only necessary to find one possible path of transitions to the end of the pattern. But to show that one is rejected, it is necessary to have analyzed all possible paths. Therefore, to determine if the sets of possible states for pattern A and pattern B are already covered by ones that have been analyzed before, pairs of a single state in one regex and the set of all possible states in the other are recorded. If there are no new ones, then it is OK to go back. Of course, it would also be possible, and less characters, to simply record pairs of the sets of all states for both patterns, but then the computer would explode.

feersum

Posted 2014-12-05T16:20:41.363

Reputation: 29 566

Very nice! Passes all tests in groups A and B (no character classes, it seems.) I can't get the compressed version to work, though, or get the same byte count. Either way, you can claim the x 0.92 bonus when you calculate your score. And, of course, an explanation is welcome! – Ell – 2014-12-09T17:14:56.713

4

Haskell, 560 553 618

might get some of the bonuses done in the future.

this is not fully golfed yet.

import Data.List
c%j|'\\':h:s<-j=[s|c==h]|(a,b):_<-[(a,b)|x<-[0..length j],(a,'|':b)<-[splitAt x j],snd(k b)==[]]=c%a++c%b|'(':s<-j,(a,_:'*':b)<-k s=map(++j)(c%a)++c%b|'(':s<-j,(a,_:b)<-k s=map(++b)(c%a)|h:'*':s<-j=map(++j)(c%[h])++c%s
c%"."=[""|c>'\0']
c%s@[_]=c%('\\':s)
c%(a:b)=map(++b)(c%[a])
c%s=[""|c>'\0']
a&b=nub[(x,nub$b>>=(c%))|c<-[' '..'~'],x<-c%a]
g e(k@(a,l):r)|j<-a&l\\e=g(k:e)(j++r)
g e[]=e
a#b=or[all(null.('\0'%))m|(x,m)<-g[][(a,[b])],""<-'\0'%x]
a!b|a#b,b#a='x'|a#b='>'|b#a='<'|0<1='='
k"("=("","(")
k(c:s)|'('<-c,(x,y)<-k$tail b=('(':a++')':x,y)|')'<-c=("",')':s)|0<1=(c:a,b)where(a,b)=k s
k j=(j,j)

a hand-wavy explanation of the algorithm:

reg!reg' returns the required char, one of "=<>x".

a#b is true iff not every string that a matches is also matched by b.

c%reg is a list of regular expressions such that reg matches c:s iff one of the regexps in the output matches s. i is basically partially matches the regex. except if c is '\0'. then it forces reg to not get any more input, returning [] if reg must get more input to match and [""] otherwise.

# works by generating a finite list of all possible "regex-state" the two regexps will be in after an arbitrary string. then to check if a<b we check weather there is a "regex-state" in which a is fully matched but b isn't fully matched.

proud haskeller

Posted 2014-12-05T16:20:41.363

Reputation: 5 866

Cool! You're obviously on the right track. However, right now it fails test B4. – Ell – 2014-12-10T23:47:32.327