Complement of a Regex


For the purposes of this challenge, we define a postfix regex dialect with alphabet {1, 0} and the following operations:

  • 1 and 0 match themselves literally.
  • _ matches the empty string.
  • ! always fails (i.e. it does not match anything).
  • ab; matches a, followed by b.
  • ab| matches either a or b.
  • a+ matches one or more instances of a.

The regex must match the entire string, from start to end (in other words, it is implicitly anchored as if wrapped in ^ and $).

As an example, the following regex matches any string with an even number of 1s:

0+_|10+_|;10+_|;;+_|;

This is equivalent to the regex 0*(10*10*)*.

The following regex matches any nonempty string of ASCII letters, expressed in binary:

110|010|10|10|10|;;;;1010|10|10|;;;10010|;10;|;;|;|;;+

This is equivalent to the regex (1.(0....|1(0...|10(0.|10))))* (where . matches any character).

You can test regexes with this program.
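The dialect is simple enough to translate mechanically into a conventional engine. The following is a minimal Python sketch of such a tester (an illustration written for this writeup, not the linked program):

```python
import re

def postfix_to_python(rx):
    """Translate the postfix dialect into an anchored Python regex."""
    stack = []
    for c in rx:
        if c in '01':
            stack.append(c)                        # literal
        elif c == '_':
            stack.append('(?:)')                   # matches the empty string
        elif c == '!':
            stack.append('(?!)')                   # always fails
        elif c == ';':
            b, a = stack.pop(), stack.pop()
            stack.append('(?:%s%s)' % (a, b))      # ab;  ->  a then b
        elif c == '|':
            b, a = stack.pop(), stack.pop()
            stack.append('(?:%s|%s)' % (a, b))     # ab|  ->  a or b
        elif c == '+':
            stack.append('(?:%s)+' % stack.pop())  # a+   ->  one or more a
    return re.compile('^(?:%s)$' % stack.pop())

even_ones = postfix_to_python('0+_|10+_|;10+_|;;+_|;')
```

Here even_ones behaves like 0*(10*10*)*, matching exactly the strings with an even number of 1s.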

The Challenge

Given a regex in the format described above, return a regex in this format that matches precisely those strings that are not matched by the input.

One potential method to do this would be to convert the regex into a nondeterministic finite automaton, then convert that into a deterministic finite automaton, flip which states are designated as accept states, and then convert it back to a regex.

Test Cases

format:
input   => potential output
           equivalent infix regex

0       => _101|+_|;|001|+;|
           |1.*|0.+
1       => _001|+_|;|101|+;|
           |0.*|1.+
_       => 01|+
           .+
!       => 01|+_|
           .*
01|+    => _

01|+_|  => !
           [] (empty character class)
1001;;; => _01|01|01|;01|01|01|;;001|01|01|01|+_|;;;;01|101|01|01|+_|;;;;01|01|101|01|+_|;;;;01|01|01|001|+_|;;;;|||||||
           |.|..|...|0....*|.1...*|..1..*|...0.*

Esolanging Fruit

Posted 2018-04-03T05:08:24.303

Reputation: 13 542

But /0*(10*)*/ is not a regex for matching any string with an even number. – tsh – 2018-04-03T06:50:37.757

That regex should probably have been 0*(10*10*)*. – Martin Ender – 2018-04-03T07:17:12.117

Your first two test cases seem wrong. The outputs should also match strings which aren't of length 1. – Martin Ender – 2018-04-03T07:18:48.397

Related – Martin Ender – 2018-04-03T07:27:11.277

How should a regex that contains ! behave? Always fail? – Erik the Outgolfer – 2018-04-03T11:41:59.807

@MartinEnder I think that the first test case should be 001|+_|;101|+;|_| and the second should be 101|+_|;001|+;|_|. – Erik the Outgolfer – 2018-04-03T11:48:10.963

@MartinEnder I think it's fixed now. – Esolanging Fruit – 2018-04-03T18:01:36.343

@EriktheOutgolfer It forces backtracking, or trying the alternative to a case. It's like JavaScript's [] that always fails. – Esolanging Fruit – 2018-04-03T18:01:47.143

I'd like to put a bounty on this, but I don't feel the test cases are sufficient and don't know enough regex to generate nontrivial ones that are still computationally feasible. Could you add a few? – lirtosiast – 2018-11-22T01:11:24.387

@lirtosiast The test cases aren't that useful anyway, because they were hand-written and probably don't reflect what an answer to this challenge would output given the provided inputs. I have included a program to test regexes on strings, though. – Esolanging Fruit – 2018-12-07T00:29:19.303

Answers


Dyalog APL, 166 bytes

{l←!!5*≢⍵⋄z←⊂''⋄n←,∘.,⍨⋄k←{⊃,/n\l/⊂⍵}⋄f←{⊃⊃{⍺∊';|':(2↓⍵),⍨⍺{';'=⍺:n/⍵⋄,/⍵}2↑⍵⋄⍺='_':⍵,⍨⊂z⋄'+'=⍺:⊂∘k∘⊃@1,⍵⋄⍺,⍵}/⌽1,⍵}⋄w←f⍵⋄⊃q/⍨{0=≢((z,k'01')~w∪f⍵)∪w∩f⍵}¨q←k'10!_|;+'}

The solution is a function that works by brute force. It cannot be run in reasonable time or memory on any input, so no TIO link is provided: it loops over every possible regex up to length l, where l is the factorial of the factorial of 5 raised to the power of the input length. The numbers involved exceed Dyalog's maximum number size for any input regex longer than 1 character. To test the solution, look at the function f, which returns all matches of a regex up to the calculated upper bound l, and exercise it with the complementarity check {0=≢ ... ⍵∩f⍵}. Using smaller values of l can make the function give wrong results, so keep that in mind before reducing it.


There is an upper bound for the minimum length of the complementary regex as a function of the length of the input regex, and an upper bound for the length of the longest string the two regexes need to be tested against to determine their complementarity. These parameters are used to brute force the complementary regex, by comparing the strings accepted by the input regex and the regex being tested for complementarity.
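Concretely, the brute force enumerates every string over a given alphabet up to a length bound, both for candidate regexes and for test inputs. This is what k does in the APL code; a Python sketch of the same enumeration (the max_len parameter name is illustrative):

```python
from itertools import product

def all_strings(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len,
    shortest first: the brute-force search space."""
    for n in range(max_len + 1):
        for combo in product(alphabet, repeat=n):
            yield ''.join(combo)
```

For example, all_strings('01', 2) yields the empty string, then the two strings of length 1, then the four of length 2.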

We start by relating the length of a regular expression to the number of states of its corresponding automaton.

The non-deterministic finite automaton (NFA) for the regex 1 is the following (using (Sn) to denote accepting states)

   1
Si-->(Sf)

The regex _, representing an \$\epsilon\$-transition, can be represented by an NFA with a single state, as can the regex ! representing non-acceptance.

 _


(Si)
 !

Si

Thus, a regex with only characters from 01_! with length n will have an NFA with max size n+1.

The regex ab;, where a and b are sub-regexes, denotes the concatenation of a and b and can be described as the following. (Si# denotes an initial state and (S##) denotes a final state.)

  Si1-->...-->(Sf1)   Si2-->...-->(Sf2)    ;

                     ε
NFA= Si1-->...-->Sf1-->Si2-->...-->(Sf2)

The regex a+ denotes one or more occurrences of a.

  Si1-->...-->(Sf1)   +

            ε
      +------------+
      v            |
NFA= Si1-->...-->(Sf1)

This produces an NFA with the same size as that of a. Again, a regex with only characters from 01_!; with length n will have an NFA with max size n+1.

The regex ab| denotes an alternative of either a or b.

  Si1-->...-->(Sf1)   |   Si2-->...-->(Sf2)

      Si1-->...-->(Sf1)
NFA=
      Si2-->...-->(Sf2)

Both Si1 and Si2 are starting states in the above NFA. This does not fit the traditional definition of an NFA, where an NFA has only 1 starting state, but the size of the resulting deterministic finite automaton (DFA) does not increase. If needed, the NFA for + can be changed so that there is an ε-transition from the final state to the second state, and the NFAs for _ and ! changed to have 2 states each, so that a traditional NFA can be constructed for alternation with ε-transitions between Si1 and Si2.

An NFA resulting from alternation is as large as the sum of the sizes of the NFAs of its two sub-regexes. So, in all cases, a regex of length n has an NFA of size at most n+1.

By the powerset construction, an NFA with size \$n\$ will result in a DFA with size \$2^n\$. So a regex of length \$n\$ will have a minimal DFA upper-bounded by \$2^{n+1}\$. A DFA can be inverted by simply switching the final and non-final states, so its complement will have the same number of states as itself.
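The complement-by-flipping step is easy to see concretely. A minimal Python sketch, using a hypothetical tuple representation for complete DFAs:

```python
def complement_dfa(states, start, accept, trans):
    """Complement a complete DFA: same states and transitions,
    accept set flipped."""
    return states, start, states - accept, trans

def accepts(dfa, word):
    """Run a DFA (states, start, accept, trans) on a word."""
    states, start, accept, trans = dfa
    s = start
    for c in word:
        s = trans[(s, c)]
    return s in accept

# DFA for "even number of 1s" over {0, 1}; its complement is "odd".
even = ({'e', 'o'}, 'e', {'e'},
        {('e', '0'): 'e', ('e', '1'): 'o',
         ('o', '0'): 'o', ('o', '1'): 'e'})
odd = complement_dfa(*even)
```

Since the transition function is unchanged, the complement has exactly as many states as the original, as stated above.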

With this information, we need an upper bound on the size of the complementary regex as a function of the number of states of the DFA. To get one, I use an algorithm that converts a DFA to a regex. The idea is to represent every possible path from Si to any other state, and to account for loops by recursively tracking every loop that starts and ends at Si. For every state reachable in 1 step from Si, the process is repeated, constructing a path from this new state to every other state except those already encountered (i.e. Si in the first iteration, since loops back to Si are already covered by the loop tracking). An example of the construction makes this clearer.

0<c0>;1<c1>;|+_|
0<r0>;1<r1>;<a>||
;

This is a regex that represents a DFA, in what I will call the "base form". The base form for a state is a regex that traverses paths from that state to every state in the DFA, except for states that have been explicitly excluded for a particular iteration, as we will see. The accept/reject status of each path is denoted by <a>: _ for accept, ! for reject. The second line of the regex accounts for this "breadth-first" filling by transitioning to states that are 1 step away and recursing the base form (represented by <r0> and <r1>) on those states. Each time we recurse, we exclude the original "parent" state from the graph traversal to prevent the algorithm from looping forever. Loops in the regex are tracked in the first line, where every possible path is also traversed, halting upon reaching the parent state; these sub-regexes, denoted by <c0> and <c1>, consider only the parent state to be accepting.

At each level of recursion, the number of excluded states increases by 1, until every state is excluded, at which point the algorithm ends.

The base form, without the recursive parts, has length 16. Each iteration spawns 4 new sub-regexes, and there are as many iterations as there are states in the DFA.

If a DFA has n states, an upper bound for the regex length is \$16 + 16×4 + 16×4×4 + ... + 16×4^{n-1}\$. This is a geometric series and can also be written as

$$\frac{16}{3}\left(4^n-1\right)$$
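The closed form is easy to check numerically; a quick Python sanity check of the series:

```python
# Closed form of the geometric series 16 + 16*4 + ... + 16*4^(n-1),
# the regex-length bound for an n-state DFA.
def regex_length_bound(n):
    return 16 * (4 ** n - 1) // 3

# The closed form agrees with the series term-by-term sum.
assert all(sum(16 * 4 ** i for i in range(n)) == regex_length_bound(n)
           for n in range(1, 10))
```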

Now, given a regex of length n, the length of its complementary regex is bounded by

$$h=\frac{16}{3}\left(4^{2^{n+1}}-1\right)$$

(I felt that this upper bound could be improved to a mere exponential instead of a double exponential, potentially saving some bytes, but I was proven wrong by this answer.)

We can check whether two regexes are complementary by brute force, but to do that we need an upper limit on the input lengths to test. If the DFAs of the two regexes are a and b, we can construct another DFA c whose states are the Cartesian product of the states of a and b, so each state of c can be labeled as a 2-tuple (a_j,b_k) where a_j and b_k are states of a and b respectively. The transition of c's state (a_j,b_k) under input letter l is the tuple of the transitions of a_j and b_k under l in their respective DFAs. The final states of c are the states (a_j,b_k) where either both a_j and b_k are final, or both are non-final. DFA c therefore accepts a word iff DFAs a and b agree on it (both accept, or both reject), i.e. iff a and b do not yield complementary results on this word. DFAs a and b are complementary iff c does not accept any word. By the pumping lemma, to check this we only need to test input words of length up to the number of states of c, minus 1.

The size of the DFA of the complementary regex is 2^(h+1) and that of the input regex is 2^(n+1). Thus we only need to check for input lengths up to 2^(h+1)×2^(n+1)-1.

$$2^{\frac{16}{3}\left(4^{2^{n+1}}-1\right)+1+n+1}-1$$
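The product construction described above can also be sketched directly. Instead of enumerating words up to the pumping-lemma bound, this version explores the reachable state pairs, which is equivalent; the (states, start, accept, trans) tuple representation is an assumption for illustration:

```python
def complementary(dfa_a, dfa_b, alphabet='01'):
    """Product construction over reachable state pairs: the two
    languages are complementary iff no reachable pair agrees
    (both accept, or both reject). DFAs are (states, start,
    accept, trans) tuples with trans[(state, symbol)] -> state."""
    seen, stack = set(), [(dfa_a[1], dfa_b[1])]
    while stack:
        pair = stack.pop()
        if pair in seen:
            continue
        seen.add(pair)
        a, b = pair
        if (a in dfa_a[2]) == (b in dfa_b[2]):
            return False  # some word is accepted (or rejected) by both
        for c in alphabet:
            stack.append((dfa_a[3][(a, c)], dfa_b[3][(b, c)]))
    return True

# "Even number of 1s" and its complement "odd number of 1s".
trans = {('e', '0'): 'e', ('e', '1'): 'o',
         ('o', '0'): 'o', ('o', '1'): 'e'}
even = ({'e', 'o'}, 'e', {'e'}, trans)
odd = ({'e', 'o'}, 'e', {'o'}, trans)
```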

We need to find a golfy way to write a function that always exceeds the triple exponential above for n≥1. Factorials grow faster than exponentials and are 1 byte shorter, so they are useful here. However, the factorial of 1 is still 1, so we cannot simply chain 3 factorials together. The first function can instead be an exponential b*n (APL for b to the power of n), after which two factorials take the place of the remaining double exponential, yielding !!b*n. \$b=3\$ is too small for the base case n=1, \$b=4\$ is still slightly too small, but \$b=5\$ stays above the function.

$$l=\left(\left(5^{≢\omega}\right)!\right)!$$

This gives the first statement of the APL code, l←!!5*≢⍵: assign to the variable l (←) the factorial (!) of the factorial (!) of 5 to the power of (*) the length (≢) of the right argument (⍵).

The submission

{l←!!5*≢⍵⋄z←⊂''⋄n←,∘.,⍨⋄k←{⊃,/n\l/⊂⍵}⋄f←{⊃⊃{⍺∊';|':(2↓⍵),⍨⍺{';'=⍺:n/⍵⋄,/⍵}2↑⍵⋄⍺='_':⍵,⍨⊂z⋄'+'=⍺:⊂∘k∘⊃@1,⍵⋄⍺,⍵}/⌽1,⍵}⋄w←f⍵⋄⊃q/⍨{0=≢((z,k'01')~w∪f⍵)∪w∩f⍵}¨q←k'10!_|;+'}

The regex is parsed and its matches returned by the function f, which works on the reversed regex by 'fold'ing ('reducing' in APL-speak) over it. It returns all matches of a regex, with the special case that a+ only returns up to the first l repetitions of the sub-regex a, keeping the list of matches finite. With this behaviour of +, all matches up to length l are included in the result of f, which is exactly what is needed to test for complementarity. APL reduces from right to left, following its right-to-left order of operations. Each intermediate result of the reduce consists of the matches of each sub-regex encountered so far. To seed the initial case, a 1 is prepended to the regex before reversing, after which the reduce is applied.

f←{⊃⊃{⍺∊';|':(2↓⍵),⍨⍺{';'=⍺:n/⍵⋄,/⍵}2↑⍵⋄⍺='_':⍵,⍨⊂z⋄'+'=⍺:⊂∘k∘⊃@1,⍵⋄⍺,⍵}/⌽1,⍵}

The right argument ⍵ contains the matches of each sub-regex encountered so far. The left argument ⍺ is the next character of the regex being processed. Each case of the regex character is checked using = (equality), or ∊ (membership) when multiple cases can be joined into one.

Of particular note, ! in the regex is treated the same as 0 or 1, resulting in a match of the literal character !. Due to how the regexes are checked for complementarity, this works fine. Its exact mechanism is ⍺,⍵: concatenate (,) the single character ⍺ (one of 0, 1, or !) with ⍵.

_ matches the empty string; z in the APL code refers to z←⊂'' defined earlier in the submission. In ⍵,⍨⊂z, the enclosed empty string, signifying that only the empty string is matched, is prepended to ⍵.

The mechanism of + depends on the function k←{⊃,/n\l/⊂⍵}, which, given a list of strings ⍵, performs the Cartesian concatenation (like the Cartesian product, but combining elements by concatenation) of l copies of ⍵, where the Cartesian concatenation is defined earlier as n←,∘.,⍨. This is achieved using the @ operator, applying the function ⊂∘k∘⊃ to the list of matches at index 1; the enclose (⊂) and pick (⊃) functions are required due to the behaviour of @.

Concatenation and alternation are considered together. On 2↑⍵, the first two elements of ⍵, n/ (reduce by the Cartesian concatenation n) is applied in the case of concatenation (;), and ,/ (reduce by concatenation ,) in the case of alternation (|). The remaining elements, 2↓⍵, are appended to this result.

After the reduce completes, its first set of matches is retrieved using ⊃.

The function does not error even on malformed regexes. The handling of + uses the @ operator, which expects its argument to be indexable. When + appears by itself, ⍵ is the scalar 1, so @ on it would fail, but the , applied to ⍵ converts it into a list, preventing an error. + then takes the Cartesian concatenation of 1 with l copies of itself.

; and | with too few arguments are implicitly handled by the behaviour of the take function ↑: 2↑ takes the first 2 elements of the right argument, but if the argument is shorter, the result is padded with prototype values; for example, the prototype of a strictly numeric array is 0, while that of a character array is the space character ' '.

However, even with this handling of anomalous regexes, such regexes will never be returned by the complement function. They cannot be shorter than the shortest valid postfix regex, and even when a malformed regex has the same length as the shortest valid one, it is ordered lexicographically after a valid regex, with the order specified by 10!_|;+ in the complementarity check; only the first complementary regex in that order is picked as the result of the submission.

Complementarity check

The check for complementarity is done by the following part. q←k'10!_|;+' generates all possible regexes up to length l (computed by k, as described earlier) and assigns them to q. The character ordering specified by this string is what prevents malformed regexes from being returned by the function.

For each of these strings (¨), the following function is computed (⍵ denotes the right argument, in this case each string the function is applied to):

{0=≢((z,k'01')~w∪f⍵)∪w∩f⍵}

w is assigned earlier to the matches of the input regex. w∩f⍵ computes the intersection of w, the matches of the input, with f⍵, the matches of ⍵. For complementary regexes, this should be the empty set.

w∪f⍵ computes the union of w and f⍵. ~ performs set difference. z is ⊂'', the enclosed empty string. k'01' computes all binary strings of length ≤ l, so z,k'01' is a list containing all these binary strings along with the empty string.

The result of (z,k'01')~w∪f⍵ is all the binary strings not present in w∪f⍵, i.e. the strings matched by neither the input regex nor ⍵. For complementary regexes this should be the empty set.

((z,k'01')~w∪f⍵)∪w∩f⍵ is the union of the two sets above. This should be empty if the regexes are complementary. Its emptiness is checked by counting its elements with ≢ and comparing the result with 0. (Directly comparing with an empty list doesn't work because, in Dyalog APL, empty lists can have different prototypes.)

This function returns 1 if the regexes are complementary and 0 if they are not. The first regex for which it returns 1 is picked by ⊃q/⍨ and returned by the function.
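The set logic of this check translates directly into Python sets (names are illustrative):

```python
def complementary_matches(universe, a_matches, b_matches):
    """The APL check as set algebra: a and b are complementary over
    `universe` iff nothing is left uncovered and nothing is matched
    by both."""
    uncovered = universe - (a_matches | b_matches)  # (z,k'01') ~ w∪f⍵
    overlap = a_matches & b_matches                 # w ∩ f⍵
    return not (uncovered | overlap)                # 0 = ≢ of the union
```

For example, over the universe {'', '0', '1'}, the languages {'0'} and {'', '1'} are complementary, while {'0'} and {''} leave '1' uncovered.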

Alternative approaches

Initially, I thought that the approach recommended in the challenge, converting regex → NFA → DFA → regex, would yield a short solution in APL because of APL's strength with matrices (storing the NFA/DFA as a 3-dimensional array whose 2-dimensional slices are adjacency matrices for transitions on each input character and on ε). I implemented part of this for the regex-matching challenge that popped up recently, but the solution came out much longer than expected, at over 250 bytes. Handling individual matrix entries is what made the program long; for example, I had to include specific transitions between specific states, like an ε-transition from state 1 to states 2 and 3 for alternation. Handling the entire array or its component matrices was short, though, as was executing the NFA on the input string. And all that the >250-byte program did was determine whether a single string matches the input regex. Storing the NFA in a "sparse" format, as lists of lists of numbers representing states, might end up shorter than the matrix-oriented approach.

I tried another approach to that regex matching challenge by listing all the matches of the input regex, with + repetitions bounded by the input length, and checking if the input string belongs to this list, which yielded a >140-byter, almost half the length of my previous approach. This approach ended up being adapted into function f of my submission to this challenge. Unlike the approach mentioned in the previous paragraph, this approach generates the list of all input matches, so I do not have to separately run the regex matching function on each of the possible inputs to see which ones it matches, thus saving a great deal of bytes.

Moreover, I had also tried ⎕r, Dyalog APL's PCRE replace function, for this challenge but that resulted in a postfix → PCRE converter at >110 bytes. Besides, similarly to the approach where I stored the NFA in an array, implementing the matching as well as finding all of the matches of this regex on all (bounded) input strings added too many bytes.

It was when I stumbled upon ankh-morpork's Prolog submission to the regex decomposition challenge, and the power of Prolog DCGs, that I started working on this regex complement challenge in Prolog. However, my Prolog skills were not good enough to make it work, and seeing my approach becoming un-Prolog-y and more APL-y, I switched to APL and started this solution with the knowledge gathered from attempting the regex-matching challenge. This brute-force approach proved significantly shorter than any other approach I could think of, so I went ahead with it. I left the calculation of the upper bound, the l← bit, to the very end, and golfed the rest of the program down from an initial size of >210 bytes.

The complementarity check might be further golfable; perhaps there's a nice way to get rid of the parentheses, or an alternative logic. In function f, under the case when ⍺∊';|', I feel the 2↑⍵ and 2↓⍵ surrounding the code have a shorter alternative; I tried @1 2⊢⍵, but it errors on invalid regexes, requiring an error guard.

user41805

Posted 2018-04-03T05:08:24.303

Reputation: 16 320


I never got around to golfing it, but here's regex complement implemented in Prolog

– ankh-morpork – 2020-02-20T01:44:56.417


Python 2, 1996 1960 bytes

lambda s:L(J(I(s)))
_=set;Z=sorted;x,y,z={0},{1},{2}
A=lambda a,b,c:{e for d in b for e in a[c][d]}
def B(a,b):b=_(b);c=b|A(a,b,2);return B(a,c)if b<c else c
C=lambda a,b:any(c in a for c in b)
def D(a):
 b,c,d,e=a;f=_();g=tuple(Z(B(e,{c})));h=_();i={0:{},1:{},2:{}};j=[g]
 while j:
  l=j.pop()
  if l in f:continue
  f.add(l);C(d,l)and h.add(l)
  for k in i:i[k][l]=_()
  for k in[0,1]:m=tuple(Z(B(e,A(e,l,k))));j.append(m);i[k][l].add(m)
 return f,g,h,i
def E(b,a):c,d,e,f=b;return{(a,g)for g in c},(a,d),{(a,g)for g in e},{k:{(a,g):{(a,h)for h in f[k][g]}for g in f[k]}for k in f}
def F(a,b):c,d,e,f=E(a,0);g,h,i,j=E(b,1);l=f.copy();[l[k].update(j[k])for k in l];[l[2][m].add(h)for m in e];return c|g,d,i,l
def G(a,b):c,d,e,f=E(a,0);g,h,i,j=E(b,1);l=f.copy();[l[k].update(j[k])or l[k].update({0:_(),1:_()})for k in l];l[2][0]|={d,h};[l[2][m].add(1)for m in e|i];return c|g|x|y,0,y,l
def H(a):b,c,d,e=a;f=_(d);g={k:e[k].copy()for k in e};[g[2][h].add(c)for h in f];return _(b),c,f,g
def I(s):
 a=[]
 for c in s:
  if c in'10_!':a.append({'1':(x|y|z,0,y,{0:{0:z,1:z,2:z},1:{0:y,1:z,2:z},2:{0:_(),1:_(),2:_()}}),'0':(x|y|z,0,y,{0:{0:y,1:z,2:z},1:{0:z,1:z,2:z},2:{0:_(),1:_(),2:_()}}),'_':(x|y,0,x,{0:{0:y,1:y},1:{0:y,1:y},2:{0:_(),1:_()}}),'!':(x,0,_(),{0:{0:x},1:{0:x},2:{0:_()}})}[c])
  elif c in'|;':a[-2:]=[{'|':G,';':F}[c](a[-2],a[-1])]
  else:a[-1]=H(a[-1])
 return D(a[-1])
def J(a):b,c,d,e=a;f={};[f.update({(g,h):f.get((g,h),'!')+str(k)+'|'})for k in e for g in e[k]for h in e[k][g]];return b,c,b-d,f
def K(a,b,c,d):
 for e in a-{c,d}:
  g={f:b[(f,t)]for f,t in b if t==e!=f};h={t:b[(f,t)]for f,t in b if f==e!=t};i={(f,t):b[(f,t)]for f,t in b if f!=e!=t}
  for f in g:
   for t in h:j=i.get((f,t),'!');i[(f,t)]=g[f]+b.get((e,e),'!')+'+_|;'+h[t]+';'+j+'|'
  b=i
 l,m,n,o=[b.get(pair,'!')for pair in[(c,c),(c,d),(d,d),(d,c)]];return l+'+_|'+(m+';'+o+l+'+_|;'+m+';'+n+'|+_|;')*(c!=d)
def L(a):b,c,d,e=a;return''.join(K(b,e,c,f)for f in d)+'!'+'|'*len(d)

Try it online!

Since user41805 came up with a very short but only theoretically-working solution, here is a practical one (practical as in "you can actually observe the output for small inputs"). The TIO link also includes a small test battery to check for complementarity.

It should be theoretically possible to solve Complement of POSIX ERE by reusing large chunks of code here, but I don't think I'll tackle it anytime soon. Feel free to solve it and take the 200 bounty.

Ungolfed, and how it works

The code is mainly divided into three parts: construct an NFA from the input regex, convert it to a DFA and invert its accept states, and convert it back to a regex.

Preliminary definitions

from collections import namedtuple

NFA = namedtuple('NFA', ['nodes', 'start', 'accept', 'trans'])
DFA = namedtuple('DFA', ['nodes', 'start', 'accept', 'trans'])

The definitions for NFA and DFA are there just to help reading the code. NFA and DFA have the same field names, but the trans fields contain different data structures.

Input regex to NFA

def step_once(trans, nodes_from, step):
    return {n for x in nodes_from for n in trans[step][x]}

def e_closure(trans, nodes_from):
    nodes_from = set(nodes_from)
    nodes_to = nodes_from | step_once(trans, nodes_from, '')
    while nodes_from != nodes_to:
        nodes_from = nodes_to
        nodes_to = nodes_from | step_once(trans, nodes_from, '')
    return nodes_to

def is_accept(accept, nodes):
    return any(node in accept for node in nodes)

def to_dfa(nfa):
    new_nodes = set()
    new_start = tuple(sorted(e_closure(nfa.trans, {nfa.start})))
    new_accept = set()
    new_trans = {0:{}, 1:{}, '':{}}
    stack = [new_start]
    while stack:
        node_set = stack.pop()
        if node_set in new_nodes: continue
        new_nodes.add(node_set)
        if is_accept(nfa.accept, node_set): new_accept.add(node_set)
        for k in new_trans: new_trans[k][node_set] = set()
        for k in [0, 1]:
            node_step = e_closure(nfa.trans, step_once(nfa.trans, node_set, k))
            node_step = tuple(sorted(node_step))
            stack.append(node_step)
            new_trans[k][node_set].add(node_step)
    return NFA(new_nodes, new_start, new_accept, new_trans)

to_dfa eliminates ϵ-moves and non-deterministic moves from the given ϵ-NFA. The algorithm is pretty standard: starting from the ϵ-closure of the start state, record all the reachable state sets by traversing through 0- and 1-moves followed by ϵ-closure.

one = NFA({0,1,2}, 0, {1}, {
    0: {0:{2}, 1:{2}, 2:{2}},
    1: {0:{1}, 1:{2}, 2:{2}},
    '': {0:set(), 1:set(), 2:set()}
})

zero = NFA({0,1,2}, 0, {1}, {
    0: {0:{1}, 1:{2}, 2:{2}},
    1: {0:{2}, 1:{2}, 2:{2}},
    '': {0:set(), 1:set(), 2:set()}
})

empty = NFA({0,1}, 0, {0}, {
    0: {0:{1}, 1:{1}},
    1: {0:{1}, 1:{1}},
    '': {0:set(), 1:set()}
})

null = NFA({0}, 0, set(), {
    0: {0:{0}},
    1: {0:{0}},
    '': {0:set()}
})

The NFA definitions for the four base regexes 1, 0, _, and ! respectively.

def prefix(nfa, a):
    new_nodes = {(a,node) for node in nfa.nodes}
    new_start = (a,nfa.start)
    new_accept = {(a,node) for node in nfa.accept}
    new_trans = {k: {(a,node): {(a,dest) for dest in nfa.trans[k][node]} for node in nfa.trans[k]} for k in nfa.trans}
    return NFA(new_nodes, new_start, new_accept, new_trans)

The prefix helper function converts an NFA into an equivalent one, but with each node key n wrapped into a 2-tuple (a,n). This is used when combining two NFAs to avoid key collisions. In Python, a tuple is hashable whenever all of its elements are hashable (thus appropriate for a dict key).

def concat(a, b):
    a = prefix(a, 0)
    b = prefix(b, 1)
    new_nodes = a.nodes | b.nodes
    new_start = a.start
    new_accept = b.accept
    new_trans = a.trans.copy()
    for k in new_trans: new_trans[k].update(b.trans[k])
    for a_acc in a.accept: new_trans[''][a_acc].add(b.start)
    return NFA(new_nodes, new_start, new_accept, new_trans)

def branch(a, b):
    a = prefix(a, 0)
    b = prefix(b, 1)
    new_nodes = a.nodes | b.nodes | {0, 1}
    new_start = 0
    new_accept = {1}
    new_trans = a.trans.copy()
    for k in new_trans: new_trans[k].update(b.trans[k]); new_trans[k].update({0:set(), 1:set()})
    new_trans[''][0] |= {a.start, b.start}
    for acc in a.accept | b.accept: new_trans[''][acc].add(1)
    return NFA(new_nodes, new_start, new_accept, new_trans)

def plus(nfa):
    new_nodes = set(nfa.nodes)
    new_start = nfa.start
    new_accept = set(nfa.accept)
    new_trans = {k: nfa.trans[k].copy() for k in nfa.trans}
    for acc in new_accept: new_trans[''][acc].add(new_start)
    return NFA(new_nodes, new_start, new_accept, new_trans)

Three combiner functions corresponding to ab;, ab|, a+ respectively.

ab;: Connect the accept states of a to the start state of b through ϵ-moves.

            e
S1 ... A1 -----> S2 ... A2

ab|: Introduce the overall start and accept states, and connect them through ϵ-moves like the following.

     e                e
   -----> S1 ... A1 ----->
S {                       } A
   -----> S2 ... A2 ----->
     e                e

a+: Connect the accept states back to the start states through ϵ-moves.

S ... A
|     |
\<----/
   e

Finally, we combine all of them to convert the input regex into a deterministic NFA.

def input_to_nfa(s):
    stack = []
    for c in s:
        if c == '1': stack.append(one)
        elif c == '0': stack.append(zero)
        elif c == '_': stack.append(empty)
        elif c == '!': stack.append(null)
        elif c == '|': stack[-2:] = [branch(stack[-2], stack[-1])]
        elif c == ';': stack[-2:] = [concat(stack[-2], stack[-1])]
        elif c == '+': stack[-1] = plus(stack[-1])
    return to_dfa(stack[-1])

NFA to DFA

def nfa_to_dfa(nfa):
    nodes = nfa.nodes
    start = nfa.start
    accept = nfa.nodes - nfa.accept
    trans = {}
    for k in nfa.trans:
        for node_from in nfa.trans[k]:
            for node_to in nfa.trans[k][node_from]:
                trans[(node_from, node_to)] = trans.get((node_from, node_to), '!') + str(k) + '|'
    return DFA(nodes, start, accept, trans)

Invert the accept states by set difference, and convert the data structure of the transition table so that it is suitable for regex construction.

DFA to regex

# dfa-style trans
def path_to_regex(nodes, trans, node_from, node_to):
    for node in nodes - {node_from, node_to}:
        left_nodes = {f: trans[(f,t)] for f,t in trans if t == node != f}
        right_nodes = {t: trans[(f,t)] for f,t in trans if f == node != t}
        loop = trans.get((node, node), '!')
        new_trans = {(f, t): trans[(f,t)] for f,t in trans if f != node != t}
        for f in left_nodes:
            for t in right_nodes:
                existing = new_trans.get((f, t), '!')
                new_trans[(f,t)] = left_nodes[f] + loop + '+_|;' + right_nodes[t] + ';' + existing + '|'
        trans = new_trans
    ff, ft, tt, tf = [trans.get(pair, '!') for pair in [(node_from, node_from), (node_from, node_to), (node_to, node_to), (node_to, node_from)]]
    if node_from == node_to: return ff + '+_|'
    return ff + '+_|' + ft + ';' + tf + ff + '+_|;' + ft + ';' + tt + '|+_|;'

def dfa_to_regex(dfa):
    regexes = [path_to_regex(dfa.nodes, dfa.trans, dfa.start, acc) for acc in dfa.accept]
    if not regexes: return '!'
    return ''.join(regexes) + '|' * (len(regexes) - 1)

For each pair of start -> accept states, eliminate all the other states by regex chaining and convert the result into a single regex. If we have multiple accept states, link the resulting regexes by |. If we have zero accept states, return a single '!'.
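The elimination step can be illustrated in isolation. This sketch uses ordinary infix notation for readability, while path_to_regex above performs the same rerouting in the postfix dialect:

```python
def eliminate_state(trans, state):
    """One step of state elimination. `trans` maps (from, to) -> regex
    string; returns a new table with `state` removed and every path
    through it rerouted as  A (loop)* B, alternated with any existing
    direct edge."""
    loop = trans.get((state, state))
    loop_part = '(%s)*' % loop if loop else ''
    into = {f: r for (f, t), r in trans.items() if t == state != f}
    outof = {t: r for (f, t), r in trans.items() if f == state != t}
    new = {(f, t): r for (f, t), r in trans.items() if state not in (f, t)}
    for f, a in into.items():
        for t, b in outof.items():
            path = a + loop_part + b
            new[(f, t)] = new[(f, t)] + '|' + path if (f, t) in new else path
    return new

# Eliminating state 1 from  0 --a--> 1 --c--> 2  with loop b on 1
t = eliminate_state({(0, 1): 'a', (1, 1): 'b', (1, 2): 'c'}, 1)
```

Here t reroutes the path through state 1 as a(b)*c, exactly the A loop* B pattern the answer builds in postfix as left + loop + '+_|;' + right.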

def solution(s):
    return dfa_to_regex(nfa_to_dfa(input_to_nfa(s)))

Solves the entire task by chaining the three core functions.

Bubbler

Posted 2018-04-03T05:08:24.303

Reputation: 16 616

Would creating another variable for {2}, and likewise for {1}, work for fewer bytes? – user41805 – 2020-02-20T19:15:29.037