Dyalog APL, 166 bytes
{l←!!5*≢⍵⋄z←⊂''⋄n←,∘.,⍨⋄k←{⊃,/n\l/⊂⍵}⋄f←{⊃⊃{⍺∊';|':(2↓⍵),⍨⍺{';'=⍺:n/⍵⋄,/⍵}2↑⍵⋄⍺='_':⍵,⍨⊂z⋄'+'=⍺:⊂∘k∘⊃@1,⍵⋄⍺,⍵}/⌽1,⍵}⋄w←f⍵⋄⊃q/⍨{0=≢((z,k'01')~w∪f⍵)∪w∩f⍵}¨q←k'10!_|;+'}
The solution is a function that works by brute-forcing. It cannot be run in reasonable time nor with reasonable memory on any of the inputs, so no TIO link is provided; this is due to looping over every possible regex over the factorial of the factorial of 5 to the power of the input length. Note that the numbers involved go out of Dyalog's maximum number size for an input regex of length greater than 1. To test this, take a look at the function f
, which returns a value that includes all matches less than the calculated upper bound l
, and test it on the complementarity check {0=≢ ... ⍵∩f⍵}
. Using smaller values of l
can make the function provide wrong results, so note that before reducing its value.
There is an upper bound for the minimum length of the complementary regex as a function of the length of the input regex, and an upper bound for the length of the longest string the two regexes need to be tested against to determine their complementarity. These parameters are used to brute force the complementary regex, by comparing the strings accepted by the input regex and the regex being tested for complementarity.
We start by comparing the length of a regular expression to the number of states of its respective DFA.
The non-deterministic finite automaton (NFA) for the regex 1
is the following (using (Sn)
to denote accepting states)
1
Si-->(Sf)
The regex _
representing an \$\epsilon\$-transition can be represented by an NFA that has less than 2 states, as can the regex !
representing non-acceptance.
_
(Si)
!
Si
Thus, a regex with only characters from 01_!
with length n
will have an NFA with max size n+1
.
The regex ab;
, where a
and b
are sub-regexes, denotes the concatenation of a
and b
and can be described as the following. (Si#
denotes an initial state and (S##)
denotes a final state.)
Si1-->...-->(Sf1) Si2-->...-->(Sf2) ;
ε
NFA= Si1-->...-->Sf1-->Si2-->...-->(Sf2)
The regex a+
denotes one or more occurrences of a
.
Si1-->...-->(Sf1) +
ε
+------------+
v |
NFA= Si1-->...-->(Sf1)
This produces an NFA with the same size as that of a
. Again, a regex with only characters from 01_!;
with length n
will have an NFA with max size n+1
.
The regex ab|
denotes an alternative of either a
or b
.
Si1-->...-->(Sf1) | Si2-->...-->(Sf2)
Si1-->...-->(Sf1)
NFA=
Si2-->...-->(Sf2)
Both Si1
and Si2
are starting states in the above NFA. This does not fit the traditional definition of an NFA, where an NFA has only 1 starting state, but the size of the resulting deterministic finite automaton (DFA) does not increase. If needed, the NFA for +
can be changed so that there is an ε-transition from the final state to the second state, and the NFAs for _
and !
changed to have 2 states each, so that a traditional NFA can be constructed for alternation with ε-transitions between Si1 and Si2.
An NFA resulting from alternation will be as large as the sum of the two sub-regexes that make it up. A regex of length n
will have an NFA with a maximum size of n+1
.
By the powerset construction, an NFA with size \$n\$ will result in a DFA with size \$2^n\$. So a regex of length \$n\$ will have a minimal DFA upper-bounded by \$2^{n+1}\$. A DFA can be inverted by simply switching the final and non-final states, so its complement will have the same number of states as itself.
With this information, we need an upper limit on the size of the complementary regex as a function of the length of the number of states of the DFA. To do this, I use an algorithm to convert a DFA to a regex; the idea is to represent every possible path from Si to any other state, as well as accounting for loops by tracking every loop that starts and ends at Si, in a recursive manner. For every state that can be reached in 1 step from Si, the process is repeated and a path is constructed from this new state to every other state except the states that have already been encountered, i.e. Si in the first iteration, since loops back to Si would already be covered by tracking loops. An example of the construction can make this clearer.
0<c0>;1<c1>;|+_|
0<r0>;1<r1>;<a>||
;
This is a regex that represents a DFA that I will call the "base form". The base form for a state is a regex that traverses paths from that state to every state in the DFA except for, as we will see, states that have been explicitly excluded for a particular iteration. The accept/reject status of each path is denoted by <a>
, being _
for accept or !
for reject. The second line of the regex accounts for this "breadth-first" filling, by transitioning to states that are 1 step way, and recursing the base form (as represented by <r0>
and <r1>
) on those states. Each time we recurse, we exclude the original "parent" state from the graph traversal to prevent the algorithm from infinitely looping. Loops in the regex are tracked in the first line, where, also, every possible path is traversed, halting when reaching the parent state, but these sub-regexes, as denoted by <c0>
and <c1>
, consider only the parent state to be the accepting state.
At each level of recursion, the number of excluded states increases by 1, until every possible state is covered at which point the algorithm ends.
The base form, without the recursive parts, is of length 16. Each iteration spawn 4 new sub-regexes, and there are as many iterations as states in the DFA.
If a DFA has n states, an upper bound for the regex length is \$16 + 16×4 + 16×4×4 + ... + 16×4^{n-1}\$. This is a geometric series and can also be written as
$$\frac{16}{3}\left(4^n-1\right)$$
Now given a regex with length s, the length of its complement is bounded by
$$h=\frac{16}{3}\left(4^{2^{n+1}}-1\right)$$
(I felt that this upper bound could be improved to a mere exponential instead of a double exponential, potentially saving some bytes, but I was proven wrong by this answer.)
We can check if two regexes are equivalent by brute force, but to do that we need to know the upper limit of the input length to test. If the DFAs of the two regexes are a and b, we can construct another DFA c whose states are the Cartesian product of the states of a and b, so each state of DFA c can be labeled as a 2-tuple (a_j,b_k)
where a_j
and b_k
are states of DFA a and b respectively. The transition of DFA c's state (a_j,b_k)
under input letter l
is the tuple of the transition of each state a_j
and b_k
in DFAs a and b respectively under letter l
. The final states of DFA c are all the states (a_j,b_k)
where either both a_j
and b_k
are final states or they are both non-final states. DFA c accepts a word if both DFAs a and b accept this word, i.e. DFAs a and b do not yield complementary results on this word. DFAs a and b are complementary iff DFA c does not accept any word. By the pumping lemma, to check for this, we only need to check of input word sizes up to the number of states of DFA c, minus 1.
The size of the DFA of the complementary regex is 2^(h+1)
and that of the input regex is 2^(n+1)
. Thus we only need to check for input lengths up to 2^(h+1)×2^(n+1)-1
.
$$2^{\frac{16}{3}\left(4^{2^{n+1}}-1\right)+1+n+1}-1$$
We need to find a golfy way to write a function that is always greater than the triple exponential above for n≥1. Factorials grow faster than exponentials and are 1 byte shorter and so can be useful. However, the factorial of 1 is still 1, so we cannot simply chain 3 factorials together. The first function can instead be an exponential b*n
, APL for b to the power of n, after which two factorials can be added in place of the remaining double exponential, yielding !!b*n
. \$b=3\$ is too small for the base case of n=1, \$b=4\$ is slightly smaller, but \$b=5\$ goes above the function.
$$l=(5^{≢\omega}!)!$$
This gives the first statement of the APL code, l←!!5*≢⍵
assign to the variable l
the factorial !
of the factorial !
of 5
to the power of *
the length ≢
of ⍵
the right argument.
The submission
{l←!!5*≢⍵⋄z←⊂''⋄n←,∘.,⍨⋄k←{⊃,/n\l/⊂⍵}⋄f←{⊃⊃{⍺∊';|':(2↓⍵),⍨⍺{';'=⍺:n/⍵⋄,/⍵}2↑⍵⋄⍺='_':⍵,⍨⊂z⋄'+'=⍺:⊂∘k∘⊃@1,⍵⋄⍺,⍵}/⌽1,⍵}⋄w←f⍵⋄⊃q/⍨{0=≢((z,k'01')~w∪f⍵)∪w∩f⍵}¨q←k'10!_|;+'}
The regex is parsed and its matches returned by the function f
. It works on the reversed regex, and 'fold'ing ('reducing' in APL-speak) over it. It returns all matches of a regex, with the special case of the regex a+
only returning up to the first l
repetitions of sub-regex a
, thus keeping the list of matches finite. By having this behaviour of +
, all input matches up to a length of l
are included in the result of f
, i.e. nothing less than what is needed to test for complementarity. APL reduces from right to left, due to order of operations being from right to left. Each intermediate results of the reduce is formed by the matches of each sub-regex encountered thus far. For the initial case, a 1 is prepended to the regex before reversing it, after which the reduce operation is applied.
f←{⊃⊃{⍺∊';|':(2↓⍵),⍨⍺{';'=⍺:n/⍵⋄,/⍵}2↑⍵⋄⍺='_':⍵,⍨⊂z⋄'+'=⍺:⊂∘k∘⊃@1,⍵⋄⍺,⍵}/⌽1,⍵}
The right argument ⍵
contains the matches of each sub-regex encountered so far. The left argument ⍺
is the next character of the regex that is being processed. Each case of the regex character is checked using =
equality or ∊
belongs-to when multiple cases can be joined into 1.
Of particular note, !
in the regex is treated the same as 0
or 1
, resulting in a match of the character !
. Due to how the regexes are checked for complementarity, this works fine. Its exact mechanism is ⍺,⍵
, concatenate ⍺
, a single character being either of 0
, 1
, or !
, with ⍵
.
_
matches the empty string, z
in the APL code refers to z←⊂''
present earlier in the entire submission; ⍵,⍨⊂z
the enclosed empty string, i.e. signifying that only the empty string is being matched, is prepended to ⍵
.
The mechanism of +
is dependent on the function k←{⊃,/n\l/⊂⍵}
, which given a list of strings ⍵
, performs the Cartesian concatenation (Cartesian product, but using concatenation instead of set union) of l
copies of ⍵
, where the Cartesian concatenation is defined earlier as n←,∘.,⍨
. This is achieved using the @
operator, applying the function ⊂∘k∘⊃
on the list of matches at index 1
, the ⊂
enclose and ⊃
pick functions are required due to the behaviour of @
.
Concatenation and alternation are considered together. On 2↑⍵
the list of the first two elements of ⍵
, n/
(reduce by n
Cartesian concatenation) is applied in the case of concatenation, and ,/
(reduce by concatenation). The remaining ⍵
is concatenated at the end of this result.
After the execution of the reduce operation, its first set of matches is retrieved using ⊃
.
The regex does not give errors even on incorrect regexes. The handling of +
uses the @
operator which expects the argument to be indexable. When +
is by itself, ⍵
is the scalar 1
, so @
on 1
would not work, but the ,
applied on ⍵
converts it into a list, preventing an error. +
then takes the Cartesian concatenation of 1
with l
copies of itself.
;
and |
with too few arguments are implicitly handled by the functionality of the ↑
take function and the ↓
function. 2↑
takes the first 2 elements of the right argument, but if the right argument is shorter, the result is padded with sufficient prototype values, for example the prototype of a strictly numeric array is 0, while that of a character array is the space character ' '
.
However, even with this treating of anomalous regexes, these regexes will not be returned by the entire complementary regex function as a complementary regex. This is due to the fact that they cannot be shorter than the respective shortest valid postfix regex, and even if these incorrect regexes are of the same length as the shortest valid regex, they are ordered lexicographically later than a valid regex, with the order being specified later while performing the complementarity check by 10!_|;+
, and only the first (again, ordered lexicographically) complementary regex is picked as a result to the function submission.
Complementarity check
The checking for complementarity is done by the following part. q←k'10!_|;+'
generates all possible regexes up to a length of l
(k
computes this function as mentioned earlier) and assigns this to q
. The ordering specified by this part is what prevents incorrect regexes from being returned by the function.
For each of these strings ¨
, the following function is computed, (⍵
denotes the right argument, in this case, each string for which the following function is applied on)
{0=≢((z,k'01')~w∪f⍵)∪w∩f⍵}
w
is assigned earlier to the matches of the input regex. w∩f⍵
computes the intersection of w
, the matches of the input, with f⍵
, the matches of ⍵
. For complementary regexes, this should be the empty set.
w∪f⍵
computes the union of w
and f⍵
. ~
performs set complement. z
is ⊂''
, the enclosed empty string. k'01'
computes all binary strings with length up ≤ l
, so z,k'01'
is a list containing all these binary strings, along with the empty string.
The result of (z,k'01')~w∪f⍵
is all the binary strings that are not present in w∪f⍵
, so it contains the strings that are matches by neither the input regex nor ⍵
. For complementary regexes this should be the empty set.
((z,k'01')~w∪f⍵)∪w∩f⍵
is the union of the two sets above. This should be empty if the regexes are complementary. Its emptiness is checked by counting the number of elements contained within it using ≢
, and comparing this result with 0
. (Using ⍬
doesn't work because in Dyalog APL, empty lists can have different prototype lists.)
This function explained above returns 1 is the regexes are complementary, and 0 if they are not. The first truthy regex is picked by ⊃q/⍨
, being returned by the function.
Alternative approaches
Initially, I thought that using the approach recommended in the challenge of doing the conversion regex → NFA → DFA → regex would yield a short solution in APL because of APL's strength with matrices (by storing the NFA/DFA as a 3-dimensional array, with each of its 2-dimensional arrays being adjacency matrices for transitions involving each input character and ε-transitions). I got to implement a part of this for the regex matching challenge that popped up recently, but I got a solution much longer than what I had expected at over 250 bytes. It's the handling of individual matrix entries that made the program long, for example, I had to include specific transitions between specific states, like an ε-transition from state 1 to states 2 and 3 for alternation. However the handling of the entire array or its component matrices was short, as was needed to execute the NFA on the input string. And besides, all that the >250 byte program did was only to determine if 1 string matches the input regex. Maybe storing the NFA in a "sparse" format with list of lists of numbers representing states might end up being shorter that the matrix-oriented approach.
I tried another approach to that regex matching challenge by listing all the matches of the input regex, with +
repetitions bounded by the input length, and checking if the input string belongs to this list, which yielded a >140-byter, almost half the length of my previous approach. This approach ended up being adapted into function f
of my submission to this challenge. Unlike the approach mentioned in the previous paragraph, this approach generates the list of all input matches, so I do not have to separately run the regex matching function on each of the possible inputs to see which ones it matches, thus saving a great deal of bytes.
Moreover, I had also tried ⎕r
, Dyalog APL's PCRE replace function, for this challenge but that resulted in a postfix → PCRE converter at >110 bytes. Besides, similarly to the approach where I stored the NFA in an array, implementing the matching as well as finding all of the matches of this regex on all (bounded) input strings added too many bytes.
It was when I stumbled upon ankh-morpork's Prolog submission to the regex decomposition challenge and by the power of Prolog DCGs that I started working on this regex complement challenge in Prolog. However my Prolog skills was not good enough to make it work, and seeing how my approach ended up becoming un-Prolog-y and more APL-y I switched to APL and started this solution with the knowledge I had gathered from attempting the regex matching challenge. This brute-force approach proved to be significantly short than any other approach I could think of, so I went ahead with this. I left the calculation of the upper bound to the very end, i.e. the l←
bit, and golfed the rest of my program from an initial size of >210 bytes.
The complementary check might be further golfable, perhaps there's a nice way to get rid of the parentheses, or an alternative logic. In function f
, under the case when ⍺∊';|'
, I feel the 2↑⍵
and 2↓⍵
surrounding the code has a shorter alternative; I tried @1 2⊢⍵
, but this results in an error for invalid regexes requiring me to use an error guard.
But
/0*(10*)*/
is not an regex for matching any string with an even number. – tsh – 2018-04-03T06:50:37.7571That regex should probably have been
0*(10*10*)*
. – Martin Ender – 2018-04-03T07:17:12.1174Your first two test cases seem wrong. The outputs should also match strings which aren't of length 1. – Martin Ender – 2018-04-03T07:18:48.397
Related – Martin Ender – 2018-04-03T07:27:11.277
How should a regex that contains
!
behave? Always fail? – Erik the Outgolfer – 2018-04-03T11:41:59.807@MartinEnder I think that the first test case should be
001|+_|;101|+;|_|
and the second should be101|+_|;001|+;|_|
. – Erik the Outgolfer – 2018-04-03T11:48:10.963@MartinEnder I think it's fixed now. – Esolanging Fruit – 2018-04-03T18:01:36.343
@EriktheOutgolfer Force backtracking or trying the alternative to a case. It's like JavaScript's
[]
that always fails. – Esolanging Fruit – 2018-04-03T18:01:47.143I'd like to put a bounty on this, but I don't feel the test cases are sufficient and don't know enough regex to generate nontrivial ones that are still computationally feasible. Could you add a few? – lirtosiast – 2018-11-22T01:11:24.387
@lirtosiast The test cases aren't that useful anyway, because they were hand-written and probably don't reflect what an answer to this challenge would output given the provided inputs. I have included a program to test regexes on strings, though. – Esolanging Fruit – 2018-12-07T00:29:19.303