Regex in reverse - decompose regular expressions

18

2

The Problem

I have a bunch of regular expressions that I need to use in some code, but I'm using a programming language that doesn't support regex! Luckily, I know that the test string will have a maximum length and will be composed of printable ASCII only.

The Challenge

You must input a regex and a number n, and output every string composed of printable ASCII (ASCII codes 32 to 126 inclusive, to ~, no tabs or newlines) of length less than or equal to n that matches that regex. You may not use built-in regular expressions or regex matching functions in your code at all. Regular expressions will be limited to the following:

  • Literal characters (and escapes, which force a character to be literal, so \. is a literal ., \n is a literal n (equivalent to just n), and \w is equivalent to w. You do not need to support escape sequences.)
  • . - wildcard (any character)
  • Character classes, [abc] means "a or b or c" and [d-f] means anything from d to f (so, d or e or f). The only characters that have special meaning in a character class are [ and ] (which will always be escaped, so don't worry about those), \ (the escape character, of course), ^ at the beginning of the character class (which is a negation), and - (which is a range).
  • | - the OR operator, alternation. foo|bar means either foo or bar, and (ab|cd)e matches either abe or cde.
  • * - match the previous token repeated zero or more times, greedy (it tries to repeat as many times as possible)
  • + - repeated one or more times, greedy
  • ? - zero or one times
  • Grouping with parentheses, to group tokens for |, *. +, or ?

The input regex will always be valid (i.e., you do not have to handle input like ?abc or (foo or any invalid input). You may output the strings in any order you would like, but each string must appear only once (don't output any duplicates).

The Test Cases

Input: .*, 1
Output: (empty string), , !, ", ..., }, ~

Input: w\w+, 3
Output: ww, www

Input: [abx-z][^ -}][\\], 3
Output: a~\, b~\, x~\, y~\, z~\

Input: ab*a|c[de]*, 3
Output: c, cd, ce, aa, cde, ced, cdd, cee, aba

Input: (foo)+(bar)?!?, 6
Output: foo, foo!, foofoo, foobar

Input: (a+|b*c)d, 4
Output: ad, cd, aad, bcd, aaad, bbcd

Input: p+cg, 4
Output: pcg, ppcg

Input: a{3}, 4
Output: a{3}

The Winner

This is , so the shortest code in bytes will win!

Doorknob

Posted 2014-04-05T13:31:07.293

Reputation: 68 138

Anybody been able to do this with R? – Kaleb – 2017-02-08T10:29:58.803

Are we allowed to support escape sequences? Then this is trivial. – John Dvorak – 2014-04-05T13:39:37.427

@JanDvorak No, you are not, as specified in the challenge and as demonstrated in one of the test cases. – Doorknob – 2014-04-05T13:40:16.863

@JanDvorak Actually, I thought I had a rule against built-in regex functions, but apparently not. Edited it in now – Doorknob – 2014-04-05T13:42:17.517

3Your explanation of | makes very little sense. It doesn't seem to handle nested groups or a|b|c. What's wrong with using the standard explanations in terms of how strongly concatenation and alternation bind? (And you have no excuse for not using the sandbox) – Peter Taylor – 2014-04-05T13:44:17.270

@PeterTaylor Because I have no idea what those standard explanations are? ;) You could edit my post with the "standard" terms, or just link me to a resource on this so I can do it myself. – Doorknob – 2014-04-05T13:45:34.120

1

@PeterTaylor Actually, he has an excuse: http://meta.codegolf.stackexchange.com/q/1305/9498

– Justin – 2014-04-05T15:15:38.867

http://www.regular-expressions.info/quickstart.html – Peter Taylor – 2014-04-05T16:01:02.213

@PeterTaylor Ok, edited to just say "alternation" – Doorknob – 2014-04-05T16:15:26.180

I like this Regex Debugger

– F. Hauri – 2014-04-05T17:10:45.253

2Judging by your examplesthe pattern has to match the entire string? (As opposed to a substring) – Martin Ender – 2014-04-05T21:01:09.970

@m.buettner Yes, it does. – Doorknob – 2014-04-06T17:39:03.367

@Doorknob You should probably put the {\d,\d} quantifier in the The Challenge section if you're going to have it in the The Test Cases section – Mouq – 2014-04-07T00:00:31.337

@Mouq Why is that? It's a test case, not a rule. – Doorknob – 2014-04-07T00:22:42.430

@Doorknob It's contradictory if your rules don't say you need it but the test cases say you do – Mouq – 2014-04-07T00:36:18.533

@Mouq "need it"? Need what? I suppose you could say it falls under the first category ("Literal characters"), but I don't see the need to clarify. – Doorknob – 2014-04-07T00:37:47.840

I wasn't trying to nit-pick; I just don't see how "Literal characters" covers "a{3}" – Mouq – 2014-04-07T00:59:42.117

@Mouq Umm... a is a literal character, { is a literal character, .... What would you suggest I add to the post? Sorry, I don't see your point. – Doorknob – 2014-04-07T01:01:23.627

Hmm. Puzzles like this make me think I ought to learn regular expressions. – Kyle Kanos – 2014-04-07T13:28:35.353

3

@KyleKanos It's a shame real world problems don't make you think you ought to learn regular expressions. :P But they are not as inaccessible as they might seem: http://www.regular-expressions.info/tutorial.html

– Martin Ender – 2014-04-07T19:34:26.730

Noticed a wee typo in the rules, going over them to check my entry: '...of length less than n'. You meant less than or equal to, going by the examples, and both entries work that way. – bazzargh – 2014-04-13T01:24:41.153

@bazzargh Thanks; fixed – Doorknob – 2014-04-13T01:25:57.657

What about abcd, 2? Do we have to handle cases like that? – Justin – 2014-04-13T02:32:41.030

How come ab*a|c[de]* matches the empty string? Is this a bug in the test cases? – John Dvorak – 2014-11-20T09:45:24.407

Are we required to only output each matched string once? May we output ["aa","a","aa"] for aa|a+? – John Dvorak – 2014-11-20T09:48:28.330

@JanDvorak Yep, I'm assuming that empty string was a mistake. There cannot be duplicate outputs; edited. – Doorknob – 2014-11-20T22:56:53.487

Answers

8

Haskell 757 705 700 692 679 667

import Data.List
data R=L Char|A R R|T R R|E
h=[' '..'~']
k(']':s)a=(a,s)
k('^':s)_=l$k[]s
k('-':c:s)(a:b)=k([a..c]++b)s
k('\\':c:s)a=k s$c:a
k(c:s)a=k s$c:a
l(a,b)=(h\\a,b)
c#E=L c
c#r=A(L c)r
o(a,b)=(foldr(#)E a,b)
t%0=E
t%n=A(t%(n-1))$T t$t%(n-1)
d s n=m(fst$r s)[[]] where{m E a=a;m(L c)a=[b++[c]|b<-a,length b<n];m(A r s)x=nub$(m r x)++m s x;m(T r s)a=m s$m r a;r s=w$e s E;w(u,'|':v)=(\(a,b)->(A u a,b))$r v;w x=x;e(')':xs)t=(t,xs);e s@('|':_)t=(t,s);e s@(c:_)t=g t$f$b s;e[]t=(t,[]);g t(u,v)=e v$T t u;f(t,'*':s)=(t%n,s);f(t,'+':s)=(T t$t%n,s);f(t,'?':s)=(A t E,s);f(t,s)=(t,s);b('(':s)=r s;b('\\':s:t)=(L s,t);b('.':s)=o(h,s);b('[':s)=o$k s[];b(s:t)=(L s,t)}

output:

ghci> d ".*" 1
[""," ","!","\"","#","$","%","&","'","(",")","*","+",",","-",".","/","0","1","2","3","4","5","6","7","8","9",":",";","<","=",">","?","@","A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z","[","\\","]","^","_","`","a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z","{","|","}","~"]
ghci> d "w\\w+" 3
["ww","www"]
ghci> d "[abx-z][^ -}][\\\\]" 3
["x~\\","y~\\","z~\\","b~\\","a~\\"]
ghci> d "ab*a|c[de]*" 3
["aa","aba","c","ce","cd","cee","cde","ced","cdd"]
ghci> d "(foo)+(bar)?!?" 6
["foo!","foobar","foo","foofoo"]
ghci> d "(a+|b*c)d" 4
["ad","aad","aaad","cd","bcd","bbcd"]
ghci> d "p+cg" 4
["pcg","ppcg"]
ghci> d "a{3}" 4
["a{3}"]

Explanation: this one's a textbook regex implementation. R is the regex type, with constructors A (alternate), L (literal), T (then) and E (empty/epsilon). The usual 'Star' doesn't appear because I inline it as alternates during the parse (see '%'). 'm' runs the simulation. The parser (start at 'r s=....') is just recursive descent; 'k' parses ranges. The function '#' expands ranges into alternations.

bazzargh

Posted 2014-04-05T13:31:07.293

Reputation: 2 476

9

Python v2.7 1069 1036 950 925 897 884 871 833 822

This answer seems rather long for a code golf, but there are a lot operators that need to be handled and I know what purpose each byte in this answer does. Since there is no existing answer I submit this as a target for other users to beat. See if you can make a shorter answer :).

The two main functions are f which parses the regex starting at the ith character, and d which generates the matching strings, using r the sub-regexes we can recursed into, 'a' the array representing the part of the current sub-regex not yet processed, and a string suffix s which represents the part of the string generated so far.

Also check out the sample output and a test harness.

import sys;V=sys.argv;n=int(V[2]);r=V[1];S=len;R=range;C=R(32,127)
Z=[];z=-1;D='d(r,p,';F='for j in '
def f(i,a):
 if i>=S(r):return a,i
 c=r[i];x=0;I="|)]".find(c)
 if c in"([|":x,i=f(i+1,Z)
 if I+1:return([c,a,x],[a],[c,a])[I],i
 if'\\'==c:i+=1;x=c+r[i]
 return f(i+1,a+[x or c])
def d(r,a,s):
 if S(s)>n:return
 while a==Z:
        if r==Z:print s;return
        a=r[z];r=r[:z]
 e=a[z];p=a[0:z]
 if'|'==a[0]:d(r,a[1],s);d(r,a[2],s)
 elif']'==a[0]:
        g=a[1];N=g[0]=='^';g=(g,g[1:])[N];B=[0]*127;O=[ord(c[z])for c in g]
        for i in R(0,S(g)):
         if'-'==g[i]:exec F+'R(O[i-1],O[i+1]):B[j]=1'
         else:B[O[i]]=1
        for c in C:N^B[c]<1or d(r,Z,chr(c)+s)
 elif' '>e:d(r+[p],e,s)
 else:c=p[:z];exec{'.':F+'C:'+D+'chr(j)+s)','?':D+'s);d(r,p[:z],s)','*':F+'R(0,n+1):d(r,c,s);c+=[p[z]]','+':"d(r,p+['*',p[z]],s)"}.get(e,D+'e[z]+s)')
d(Z,f(0,Z)[0],"")

Note that tabs in the original solution have been expanded. To count the number of characters in the original use unexpand < regex.py | wc.

gmatht

Posted 2014-04-05T13:31:07.293

Reputation: 676

9I've never seen python look that horrible. – user80551 – 2014-04-12T18:17:43.993

Can't you change def E(a,b):c=a[:];c.extend(b);return c to E=lambda a,b:a[:].extend(b) , ditto for A – user80551 – 2014-04-12T18:21:52.427

Apparently not, as .extend(b) does not return anything. – gmatht – 2014-04-12T18:51:05.330

How about E=lambda a,b:a[:]+b? – Justin – 2014-04-12T19:15:33.863

1

For the elif isinstance(e,str):, I believe you could change the code inside to: exec{'.':'for c in C:d(r,p,s+chr(c))','?':'d(r,p,s);d(r,p[:z],s)','*':'''c=p[:z]#newline for i in R(0,n+1):d(r,c,s);c+=[p[z]]''','+':"d(r,p+['*',p[z]],s)",'\\':'d(r,p,e[1]+s)'}.get(e,'d(r,p,e+s)')(note that the #newline is a newline) (source: http://stackoverflow.com/a/103081/1896169)

– Justin – 2014-04-12T19:30:45.523

1If you could find more places to use the exec trick, we could easily change your code into unreadable code :-) – Justin – 2014-04-12T19:50:28.197

1

Prolog (SWI), 586 bytes

The built in backtracking ability of Prolog makes it an awesome choice for this challenge. By taking advantage of backtracking, generating strings for a regex becomes exactly the same task as testing if a string is matched by a regex. Unfortunately, much of my golfing effort went into writing a shorter regex parsers. The actual task of decomposing a regular expression we easily golfed.

R-L-S:-R*A,-(B,A,[]),setof(Z,(0/L/M,length(C,M),C+B+[],Z*C),S).
-R-->e+S,S-R.
R-T-->{R=T};"|",e+S,u+R+S-T.
Z+Y-->(".",{setof(C,32/126/C,R)};"[^",!,\E,"]",{setof(C,(32/126/C,\+C^E),R)};"[",\R,"]";"(",-R,")";{R=[C]},([C],{\+C^`\\.[|+?*(`};[92,C])),("*",{S=k*R};"+",{S=c+R+k*R};"?",{S=u+e+R};{S=R}),({Y=c+Z+S};c+Z+S+Y).
\C-->{C=[H|T]},+H,\T;{C=[]};+A,"-",+B,\T,{setof(C,A/B/C,H),append(H,T,C)}.
+C-->[92,C];[C],{\+C^`\\]-`}.
S+e+S.
[C|S]+D+S:-C^D.
S+(B+L+R)+T:-B=c,!,S+L+U,U+R+T;S+L+T;S+R+T.
S+k*K+U:-S=U;S+K+T,S\=T,T+k*K+U.
A/B/C:-between(A,B,C).
A^B:-member(A,B).
A*B:-string_codes(A,B).

Try it online!

Ungolfed Code

generate_string(R, L, S) :-
    % parse regex
    string_codes(R, RC),
    regex_union(RE, RC, []),

    % bound string length
    between(0, L, M),
    length(SC, M),

    % find string
    match(SC, RE, []),

    string_codes(S, SC).

% Parsers
%%%%%%%%%  

regex_union(R) -->regex_concat(S), regex_union1(S, R).

regex_union1(R,T) --> [124], regex_concat(S), regex_union1(regex_union(R,S), T).
regex_union1(R, R) --> [].

regex_concat(R) --> regex_op(S), regex_concat1(S, R).

regex_concat1(R, T) --> regex_op(S), regex_concat1(regex_concat(R,S), T).
regex_concat1(R, R) --> [].

regex_op(regex_kleene(R)) --> regex_lit(R), [42].
regex_op(regex_concat(R,regex_kleene(R))) --> regex_lit(R), [43].
regex_op(regex_union(regex_empty,R)) --> regex_lit(R), [63].
regex_op(R) --> regex_lit(R).

regex_lit(regex_char([C])) --> [C], {\+ regex_ctrl(C)}.
regex_lit(regex_char([C])) --> [92], [C].

regex_lit(regex_char(CS)) --> [46], {findall(C, between(32,126, C), CS)}.

regex_lit(regex_char(DS)) --> 
    [91], [94], !, class_body(CS), [93],
    {findall(C, (between(32, 126, C), \+ member(C, CS)), DS)}.
regex_lit(regex_char(CS)) --> [91], class_body(CS), [93].

regex_lit(R) --> [40], regex_union(R), [41].

class_body([C|T]) --> class_lit(C),class_body(T).
class_body(CS) -->
    class_lit(C0), [45], class_lit(C1), class_body(T),
    {findall(C, between(C0, C1, C), H), append(H,T,CS)}.
class_body([]) --> [].

class_lit(C) --> [C], {\+ class_ctrl(C)}.
class_lit(C) --> [92], [C].

class_ctrl(C) :- string_codes("\\[]-", CS), member(C, CS).
regex_ctrl(C) :- string_codes("\\.[]|+?*()", CS), member(C, CS).

% Regex Engine
%%%%%%%%%%%%%% 

% The regex empty matches any string without consuming any characters.
match(S, regex_empty, S).

% A regex consisting of a single character matches any string starting with
% that character. The chanter is consumed.
match([C|S], regex_char(CS), S) :- member(C, CS).

% A union of two regex only needs to satisify one of the branches.
match(S, regex_union(L,R), T) :- match(S, L, T); match(S, R, T).     

% A concat of two regex must satisfy the left and then the right.
match(S, regex_concat(L, R), U) :- match(S, L, T), match(T, R, U).

% The kleene closure of a regex can match the regex 0 times or it can the regex
% once before matching the kleene closure again.
match(S, regex_kleene(_), S).
match(S, regex_kleene(K), U) :- match(S, K, T), S \= T, match(T, regex_kleene(K), U).

ankh-morpork

Posted 2014-04-05T13:31:07.293

Reputation: 1 350

I am not familiar with Prolog DCGs, how does regex_union(RE, RC, []) match against regex_union(R) -->...? – user41805 – 2020-01-24T17:41:08.100

@KritixiLithos DCG predicates have two implicit parameters that are supplied automatically when called from within a DCG or as part of phrase\2. The first is the string the DCG is applied to and the second is some suffix of the string that remains after satisfying the DCG predicate. An equivalent construct would be phrase(regex_union(RE), RC). – ankh-morpork – 2020-01-24T17:56:24.287

By the way, you might be interested in solving https://codegolf.stackexchange.com/q/161108/, the challenge asks for a simpler regex, so parsing would be easier, and this sounds like something prolog can do nicely

– user41805 – 2020-01-26T11:35:40.357

@KritixiLithos yeah, regex complement is definitely an interesting task. I've implemeneted the regex->nfa->dfa->complement->regex pipeline and it's not exactly short. I want to look into some more terse implementations before posting. – ankh-morpork – 2020-01-28T19:56:16.297

I have been trying to go about it using brute-force; I have an algorithm in mind, but implementing the brute-force part in prolog hasn't been successful so far. Hm, it would be nice if there is a reversible predicate to convert regex <=> dfa. Interesting to see what comes of both our solutions – user41805 – 2020-01-29T18:22:30.387