25
6
For the purpose of this challenge, we say that a regex pattern matches a string if the entire string is matched by the pattern, not just a substring.
Given two regex patterns A and B, we say that A is more specialized than B if every string that is matched by A is also matched by B but not the other way around. We say that A is equivalent to B if both patterns match exactly the same set of strings. If neither pattern is more specialized than the other nor are they equivalent, we say that A and B are incomparable.
For example, the pattern Hello, .*!
is more specialized than .*, .*!
; the patterns (Hello|Goodbye), World!
and Hello, World!|Goodbye, World!
are equivalent; and the patterns Hello, .*!
and .*, World!
are incomparable.
The relation "more specialized than" defines a strict partial order on the set of regex patterns. In particular, for all patterns A and B, exactly one of the following is true:
- A is more specialized than B (A < B).
- B is more specialized than A (A > B).
- A and B are equivalent (A = B).
- A and B are incomparable (A ∥ B).
When A and B are incomparable, we can further distinguish between two cases:
- A and B are disjoint (A ∥ B), meaning that no string is matched by both of them.
- A and B are intersecting (A ≬ B), meaning that some strings are matched by both.
Challenge
Write a program or a function that takes a pair of regex patterns and compares them using the above order.
That is, if the two patterns are A and B, the program should determine whether A < B, A > B,
A = B or A ∥ B.
×92% Bonus An additional bonus is given if, when the patterns are incomparable, the program determines whether they're intersecting or disjoint.
Input and Output
The program should accept two regex patterns, as strings, in the flavor defined below. You may read the input through STDIN, the command line, as function arguments or an equivalent method. You may assume that the patterns are valid.
The program should produce one of exactly four distinct outputs (or five distinct outputs if you're going for the above bonus,) depending on the result of the comparison (the exact outputs are up to you.) You may write the output to STDOUT, return it as the function's result or use an equivalent method.
Regex Flavor
You may support whatever regex features you like, but you must support the following ones:
- Alternation with
|
. - Quantification with
*
. - Grouping with
(
and)
. - Matching any character (possibly excluding newlines) with
.
. - (Optional: ×80% Bonus) Matching simple and negated character classes with
[…]
and[^…]
, respectively. You don't have to support any predefined character classes (e.g.[:digit:]
) but you should support character ranges. - Character escaping with
\
. It should at least be possible to esacpe special characters (i.e.|*().[^-]\
) and preferably also common special characters in other flavors (e.g.{}
), but the behavior when escaping nonspecial characters is unspecified. In particular, you don't have to support special escape sequences such as\n
for a newline and the like. A possible implementation is simply to take the character following the\
as a literal.
You may assume the existance of an input character that cannot be matched by any literal (i.e. it can only be matched by .
and negated character classes.)
Additional Rules
- You may use regex libraries or builtin regex functionality only for the purpose of string matching and replacement.
- You may ignore any locale-related issues, such as collation rules.
- To state the obvious: your program must terminate. It should execute in a reasonable amount of time given typical patterns (definitely no more than an hour, preferably a lot less.)
Scoring
This is code-golf. Your score is the product of the code size, in bytes, and any of the bonuses. The lowest score wins.
Test Cases
The format of the test cases is as follows:
<Test ID>
<Pattern A>
<Ordering>
<Pattern B>
<Test ID>
<Pattern A>
<Ordering>
<Pattern B>
...
Where <Test ID>
is an identifier for the test case, <Pattern A>
and <Pattern B>
are the regex patterns and <Ordering>
is the ordering between them, and is one of:
<
:<Pattern A>
is more specialized than<Pattern B>
.>
:<Pattern B>
is more specialized than<Pattern A>
.=
: The patterns are equivalent.|
: The patterns are incomparable and disjoint.X
: The patterns are incomparable and intersecting.
The special value <empty pattern>
stands for the empty pattern.
A. Basic patterns
A1
Hello
=
Hello
A2
Hello
|
World
A3
Hell(|o)
=
Hell|Hello
A4
x
<
.
A5
.
>
x
A6
\.
<
.
A7
x|y
=
y|x
A8
x|y
X
y|z
A9
x|y
|
a|b
A10
x|y|z
>
x|z
A11
x.
X
.x
A12
.*
>
<empty pattern>
A13
()*
=
<empty pattern>
A14
x*
>
xxxx
A15
x*
>
x|xx|xxxx*
A16
x*
=
|x|xx|xxxx*
A17
x*
<
.*
A19
(.|..)*
=
.*
A20
(...)*
X
(..)*
A21
...*
X
(..)*
A22
...*
<
..*
A23
(...)*
|
Four
A24
a*
X
a.*
A25
a*
|
b.*
A26
x*
=
(x|)*
A27
(((.*)*)*)*
=
.*
B. Complex patterns
B1
aaa*
=
aa*a
B2
aa*a
=
a*aa
B3
a*
X
aaax||a|aa|aaa
B4
aaaaaaaaaaaaaa*
X
(aa)*|(aaa)*
B5
a*b*
<
(a|b)*
B6
((a|b)(a|b))*
>
(ab|ba)*
B7
((a|b)(a|b))*
=
(aa|ab|ba|bb)*
B8
.*.*
=
.*
B9
(x*y*z*)*
=
(x|y|z)*
B10
(...|.....)*
>
.........*
B11
(...|.....)*
=
|...|.....|......|.........*
B12
(...|......)*
<
|...|.......*
B13
(...|....|.....)*
=
|....*
B14
(...|x...............|x..x)*
>
(...|x..............|x..x)*
B15
(...|x...............|x..x)*
<
(..|...|x..............|x..x)*
B16
(____)*_x__(____)*
|
(____)*x___(____)*
B17
(____)*_x__(____)*
X
(_____)*x_____(_____)*
B18
(____)*_x__(____)*
|
(______)*x______(______)*
B19
(____)*_x__(____)*
<
_*x__*
C. Basic patterns with character classes
C1
[x]
=
x
C2
[xy]
|
xy
C3
[abc]
=
a|b|c
C4
[abc]
>
[ab]
C5
[abc]
<
[abcd]
C6
[ab]
X
[bc]
C7
[ab]
|
[cd]
C8
[abc]*
>
(a|b)*
C9
[^xyz]
>
a
C10
[^xyz]
|
x
C11
[^xyz]
|
[xyz]
C12
[^xyz]
X
[wxyz]
C13
[^xyz]
X
[^abc]
C14
[0-9]
=
[0123456789]
C15
[a-z]
>
[b-y]
C16
[a-z]
=
a|[b-y]|z
C17
[^0-9]*
>
Hello, World!
C18
[^0-9]*
|
123
C19
[^abc]
<
.
C20
[^abc]*
X
.
C21
[^a]|a
=
.
C22
[^abcd]|[a-c]|d
=
.
C23
[^a]*
X
b[^c]*
C24
[^a]*
|
a[^c]*
D. Complex patterns with character classes
D1
http://[^/][^/]*/.*
>
http://(www\.|)google\.com/.*
D2
http://[A-Za-z][A-Za-z]*\.wikipedia\.org/
X
(http://|)en\.wikipedia\.org/(|wiki(|/.*))
D3
[A-Za-z0-9._%+\-][A-Za-z0-9._%+\-]*@[A-Za-z0-9.\-][A-Za-z0-9.\-]*\.[A-Za-z][A-Za-z](|[A-Za-z]|[A-Za-z][A-Za-z])
X
email@address.com
D4
[A-Za-z0-9._%+\-][A-Za-z0-9._%+\-]*@[A-Za-z0-9.\-][A-Za-z0-9.\-]*\.[A-Za-z][A-Za-z](|[A-Za-z]|[A-Za-z][A-Za-z])
>
email@address\.com
D5
[A-Za-z0-9._%+\-][A-Za-z0-9._%+\-]*@[A-Za-z0-9.\-][A-Za-z0-9.\-]*\.[A-Za-z][A-Za-z](|[A-Za-z]|[A-Za-z][A-Za-z])
>
[A-Za-z0-9._%+\-][A-Za-z0-9._%+\-]*@gmail\.com
D6
JohnDoe@[A-Za-z0-9.\-][A-Za-z0-9.\-]*\.[A-Za-z][A-Za-z](|[A-Za-z]|[A-Za-z][A-Za-z])
<
[A-Za-z0-9._%+\-][A-Za-z0-9._%+\-]*@[A-Za-z0-9.\-][A-Za-z0-9.\-]*\.[A-Za-z][A-Za-z](|[A-Za-z]|[A-Za-z][A-Za-z])
D7
[A-Za-z0-9._%+\-][A-Za-z0-9._%+\-]*@gmail\.com
X
JohnDoe@[A-Za-z0-9.\-][A-Za-z0-9.\-]*\.[A-Za-z][A-Za-z](|[A-Za-z]|[A-Za-z][A-Za-z])
D8
[^, a-z][^,]*(|(, [^, ][^,]*)* and [^, ][^,]*)\.
>
One, two and three\.
D9
[^, a-z][^,]*(|(, [^, ][^,]*)* and [^, ][^,]*)\.
|
one, two and three\.
D10
[^, a-z][^,]*(|(, [^, ][^,]*)* and [^, ][^,]*)\.
|
One, two\.
D11
[^, a-z][^,]*(|(, [^, ][^,]*)* and [^, ][^,]*)\.
<
(.*(,|))*
D12
[A-Z][^,]*(|(, [a-z][^,]*)* & [a-z][^,]*)\.
<
[^, a-z][^,]*(|(, [^, ][^,]*)* (and|et|[&+]) [^, ][^,]*)\.
Test Program
The following snippet can be used to compare regex patterns:
<style>#main {display: none;}#main[loaded] {display: inline;}.pattern_container {position: relative;}.pattern_underlay, .pattern {font: 12pt courier, monospace;overflow: hidden;white-space: pre;padding: 7px;box-sizing: border-box;}.pattern_underlay {background-color: #dddddd;color: #707070;border-radius: 4px;box-shadow: 0.5px 0.5px 2.5px #aaaaaa;}.pattern_underlay[error] {background-color: #ffccbb;}.pattern {position: absolute;left: 0px;top: 0px;background: none;border: none;width: 100%;height: 100%;resize: none;white-space: normal;}#ordering {min-width: 28pt;text-align: center;font-size: 16pt;}#status {padding: 5px;background-color: #fffdce;box-shadow: 1.5px 1.5px 3.5px #aaaaaa;font-size: 10pt;white-space: pre;display: none;}#status[error] {display: inline;background-color: #ffe8df;}#status[loading] {display: inline;}.inline_code {background-color: #dddddd;font: 12pt courier, monospace;padding: 2px;}.placeholder {visibility: hidden;}.error_text {background-color: #fffcb7};</style><span id="main"><table><tr><td><div class="pattern_container"><div class="pattern_underlay" id="pattern1_underlay"></div><textarea class="pattern" id="pattern1" oninput="compare()">Hello, .*!</textarea></div></td><td id="ordering"></td><td><div class="pattern_container"><div class="pattern_underlay" id="pattern2_underlay"></div><textarea class="pattern" id="pattern2" oninput="compare()">.*, .*!</textarea></div></td></tr></table><br></span><span id="status" loading>Loading...</span><script type='text/javascript'>var Module = {setStatus: function (status) {document.getElementById("status").innerHTML = status;if (status == "") {compare();document.getElementById("status").removeAttribute("loading");document.getElementById("main").setAttribute("loaded", 1);}}};function underlay_chars(str) {if (/^\n*$/.exec(str))return str.split("\n").map(function () { return '<span class="placeholder"> \n</span>'; });if (str.indexOf("\n") >= 0)str = str.replace(/\s*$/gm, function (m) { return m.replace(/[^\n]/g, "\0"); });return (str + "\n").split("").map(function (c) {if (c == "\0") return "·";else return '<span class="placeholder">' + c + '</span>';});}function underlay_str(str) {return underlay_chars(str).join("");}function str_to_array32(str) {a = [];for (c of str) {n = c.charCodeAt(0);a.push(n & 0xff, (n >> 8) & 0xff, (n >> 16) & 0xff, n >> 24);}a.push(0, 0, 0, 0);return a;}function compare() {try {for (e of ["pattern1_underlay", "pattern2_underlay", "status"])document.getElementById(e).removeAttribute("error");for (e of ["pattern1", "pattern2"])document.getElementById(e + "_underlay").innerHTML = underlay_str(document.getElementById(e).value);c = Module.ccall("regex_compare", "number", ["array", "array"], [str_to_array32(document.getElementById("pattern1").value),str_to_array32(document.getElementById("pattern2").value)]);if (c >= 0)document.getElementById("ordering").innerHTML = "∥≬<>="[c];else {i = Module.ccall("regex_error_index", "number", [], []);l = Module.ccall("regex_error_length", "number", [], []);e = document.getElementById("pattern" + -c + "_underlay");t = underlay_chars(document.getElementById("pattern" + -c).value);e.setAttribute("error", 1);e.innerHTML =t.slice(0, i).join("") +'<span class="error_text">' + t.slice(i, i + l).join("") + "</span>" +t.slice(i + l).join("");e.setAttribute("error", 1);throw "Pattern error: " + Module.ccall("regex_error", "string", [], []).replace(/`(.*?)'/g, '<span class="inline_code">$1</span>');}} catch (e) {document.getElementById("ordering").innerHTML = "??";document.getElementById("status").innerHTML = e;document.getElementById("status").setAttribute("error", 1);}}</script><script async type="text/javascript" src="https://gist.githack.com/anonymous/91f27d6746566c7b4e4c/raw/c563bf84a01c3a1c6e5f021369a3e730a2e74a1a/rpo.js"></script>
10Wow. Any answer submitted to this gets an automatic +1 from me. Determining whether two formal languages are isomorphic is hard enough. Determining whether one is a sublanguage of another is a complete undergraduate CS project. @___@ – COTO – 2014-12-05T21:49:26.750
Is there any specified behavior for invalid regex patterns? – Paul Guyot – 2014-12-06T08:10:34.767
@PaulGuyot No. You can assume the patterns are valid. – Ell – 2014-12-06T10:43:12.187
1I wonder - did you write won yourself (to see how feasible it is for a code golf question) or didn't you? – proud haskeller – 2014-12-07T18:10:13.143
1@proudhaskeller I did; I wrote the test snippet. It's a tough challenge, and there aren't going to be any one-liners here, but it's golfable. – Ell – 2014-12-07T19:41:10.333