Compile Regexes

17

3

In this task you have to write a program that reads a regular expression and generates another program that outputs whether an input string is accepted by that regular expression. The output must be a program written in the same language as your submission.

Input

The input is a regular expression r matching the following ABNF (the initial production rule is REGEX):

REGEX       = *( STAR / GROUP / LITERAL / ALTERNATIVE )
STAR        = REGEX '*'
GROUP       = '(' REGEX ')'
LITERAL     = ALPHA / DIGIT
ALTERNATIVE = REGEX '|' REGEX

If the input does not match this grammar, the behavior of your program is undefined.

Interpretation

Interprete the input as a regular expression, where * is the Kleene-star (meaning repeat left argument zero or more times), | is an alternative, ( and ) group and no operator at all concatenated. Grouping takes precedence over star, star takes precedence over concatenation, concatenation takes precedence over alternative.

A string is said to be accepted if the regex matches the whole string.

Output

The program's output is another program written in the same language as your submission that reads a string s in an implementation defined way at runtime, outputs whether r accepts s and then terminates. Output may be done in a user-defined way although there must be only two distinct outputs for accepted and rejected programs.

You may assume that the input of your output program is never longer than 216-1 Bytes.

Restrictions

Neither your submission nor any program generated by your submission may use builtin functionality or libraries that

  • match regexes
  • transform regular expressions
  • compile regular expressions
  • generate parsers from a grammar
  • simplify the problem in a way that your submission becomes trivial

Scoring

The score of your submission is the number of characters it. The submission with the lowest score wins.

Testcases

All testcases contain a regular expression, a set of accepted strings, a set of rejected strings and an example program in C99 which is a valid output of a (hyptothetical) C99 submission.

(empty regular expression)

Accepted strings

  1. (empty input)

Rejected strings

  1. foo
  2. bar
  3. baz
  4. quux

Example program

#include <stdio.h>

int main() {
    char input[65536];
    gets(input);

    return input[0] != 0;
}

(b|)(ab)*(a|) (a and b alternating)

accepted strings

  1. a
  2. ba
  3. abababababa
  4. abab

rejected strings

  1. afba
  2. foo
  3. babba

example program

#include <stdio.h>

int main() {
  char input[65536];
  int state = 0;

  for (;;) switch (state) {
    case 0: switch (getchar()) {
      case 'a': state = 1; break;
      case 'b': state = 2; break;
      case EOF: return 0;
      default:  return 1;
    } break;
    case 1: switch (getchar()) {
      case 'b': state = 2; break;
      case EOF: return 0;
      default:  return 1;
    } break;
    case 2: switch (getchar()) {
      case 'a': state = 1; break;
      case EOF: return 0;
      default:  return 1;
    } break;
}

(0|1(0|1)*)(|A(0|1)*1) (binary floating point numbers)

accepted strings

  1. 10110100
  2. 0
  3. 1A00001

rejected strings

  1. 011
  2. 10A
  3. 1A00
  4. 100A010

FUZxxl

Posted 2012-08-18T22:37:13.253

Reputation: 9 656

1I presume that having a program like return (regex.match(stdin) is not null) is not allowed. – beary605 – 2012-08-18T23:25:22.090

1You say that "The output must be a program written in the same language as the input", but the input is a regex. And the grammar you provide doesn't include the rule GROUP, which presumably defines literal brackets. – Peter Taylor – 2012-08-19T07:24:16.067

@Peter Sorry, I meant to write they same language as the submission. – FUZxxl – 2012-08-19T10:04:52.583

@beary605 Yeah, you're right. See the section Restrictions: Neither your submission nor any program generated by your submission may use builtin functionality or libraries that match regexes (...). – FUZxxl – 2012-08-19T10:13:04.143

I think your second example program is incorrect, it's missing a loop around the outer switch – Hasturkun – 2012-08-21T11:09:54.100

Answers

8

Ruby, 641 651 543 characters

H=Hash.new{|h,k|[k]}
d=[[i=0,0,[]]]
o=[?(]
L="t,u,v=d.pop;q,r,s=d.pop;o.pop<?|&&(H[r]<<=t)||(H[q]<<=t;H[r]<<=u);d<<[q,u,s+v]"
gets.chop.chars.map{|c|c==?*&&(q,r,s=d.pop;H[r]|=[q,i+=1];d<<=[r,i,s];next)
eval(L)while/[|)]/=~c ?o[-1]>?(:o[-1]==?.
/[|(]/=~c&&d<<[i+=1,i,o<<c&&[]]||c!=?)&&d<<[i+=1,i+1,["s==#{o<<?.;i}&&c=='#{c}'&&#{i+=1}"]]||o[-1]=?.}
eval(L)while o.size>1
H.map{H.map{|k,v|v.map{|v|H[k]|=H[v]}}}
t,u,v=d[0]
$><<"s=#{H[t]};gets.chop.chars.map{|c|s=s.map{|s|#{v*'||'}}-[!0];#{H.map{|k,v|"s&[#{k}]!=[]&&s|=#{v}"}*?;}};p s&[#{u}]!=[]"

This ruby version became quite long because of several corner cases in the regex parser (maybe I should try a different approach). It expects the regular expression on STDIN and outputs the corresponding ruby code for the matcher to STDOUT.

The program directly generates code for a NFA-ε which is then executed in the matcher.

Test case 1: (output includes additional newlines and indentation)

>>>

s=[0];
gets.chop.chars.map{|c|
  s=s.map{|s|}-[!0];
};
p s&[0]!=[]

Test case 2:

>>> (b|)(ab)*(a|)

s=[0, 1, 2, 4, 9, 5, 10, 6, 11, 12, 14];
gets.chop.chars.map{|c|
  s=s.map{|s|s==2&&c=='b'&&3||s==6&&c=='a'&&7||s==8&&c=='b'&&9||s==12&&c=='a'&&13}-[!0];
  s&[1]!=[]&&s|=[1, 2, 4, 9, 5, 10, 6, 11, 12, 14];
  s&[3]!=[]&&s|=[3, 4, 9, 5, 10, 6, 11, 12, 14];
  s&[0]!=[]&&s|=[0, 1, 2, 4, 9, 5, 10, 6, 11, 12, 14];
  s&[5]!=[]&&s|=[5, 6];
  s&[7]!=[]&&s|=[7, 8];
  s&[9]!=[]&&s|=[9, 5, 10, 6, 11, 12, 14];
  s&[4]!=[]&&s|=[4, 9, 5, 10, 6, 11, 12, 14];
  s&[11]!=[]&&s|=[11, 12, 14];
  s&[13]!=[]&&s|=[13, 14];
  s&[10]!=[]&&s|=[10, 11, 12, 14]
};
p s&[14]!=[]

Another example:

>>> a|bc

s=[0, 1, 3, 4];
gets.chop.chars.map{|c|
  s=s.map{|s|s==1&&c=='a'&&2||s==4&&c=='b'&&5||s==6&&c=='c'&&7}-[!0];
  s&[0]!=[]&&s|=[0, 1, 3, 4];
  s&[3]!=[]&&s|=[3, 4];
  s&[5]!=[]&&s|=[5, 6];
  s&[2]!=[]&&s|=[2, 7]
};
p s&[7]!=[]

Edit: Added a transition to fix the bug PleaseStand noted in the comments. Also changed initialisation of state.

Howard

Posted 2012-08-18T22:37:13.253

Reputation: 23 109

Input 011 for regex (0|1(0|1)*)(|A(0|1)*1) results in true - it should be false. – PleaseStand – 2012-08-23T02:23:44.623

@PleaseStand Fixed. Please see my edit. – Howard – 2012-08-25T07:10:48.077

12

C, 627 characters

This program treats its first command-line argument as the input and generates C code as its output.

#define A(v) F[v]+strlen(F[v])
#define S sprintf
char*B="&&f%d(s)||f%d(s)",*J="&&f%d(s+%d)",*r,F[256][65536];h=2;e(f,n,o,R,C,O,t,g){for(C=O=f;*r;++r)switch(*r){case 40:r++;e(g=h++,C=h++,0,0);r[1]^42?t=g:(t=C,S(A(t),B,g,C=h++),r++);o=!S(A(O),J,t,o);O=C;break;case 41:goto L;case'|':S(A(C),J,n,o);o=!S(A(O=f),"||1");break;default:r[1]^42?S(A(C),"&&s[%d]==%d",o++,*r,O^f||R++):(o=!S(A(O),J,t=h++,o),O=C=h++,g=h++,S(A(g),"&&*s==%d&&f%d(s+1)",*r++,t),S(A(t),B,g,C));}L:S(A(C),J,n,o);}main(int c,char**v){r=v[1];for(e(1,!S(*F,"&&!*s"),0,0);h--;)printf("f%d(char*s){return 1%s;}",h,F[h]);puts("main(int c,char**v){exit(f1(v[1]));}");}

Here is its output for (0|1(0|1)*)(|A(0|1)*1) (with newlines added):

f11(char*s){return 1&&s[0]==49&&f7(s+1);}
f10(char*s){return 1&&s[0]==48&&f9(s+1)||1&&s[0]==49&&f9(s+1);}
f9(char*s){return 1&&f10(s)||f11(s);}
f8(char*s){return 1&&f7(s+0)||1&&s[0]==65&&f9(s+1);}
f7(char*s){return 1&&f0(s+0);}
f6(char*s){return 1&&f2(s+0);}
f5(char*s){return 1&&s[0]==48&&f4(s+1)||1&&s[0]==49&&f4(s+1);}
f4(char*s){return 1&&f5(s)||f6(s);}
f3(char*s){return 1&&s[0]==48&&f2(s+1)||1&&s[0]==49&&f4(s+1);}
f2(char*s){return 1&&f8(s+0);}
f1(char*s){return 1&&f3(s+0);}
f0(char*s){return 1&&!*s;}
main(int c,char**v){exit(f1(v[1]));}

If you provide valid input as its first command-line argument, it returns exit status 1. Otherwise, it returns exit status 0.

$ ./regexcompiler '(0|1(0|1)*)(|A(0|1)*1)' > floatprog.c
$ gcc -o floatprog floatprog.c
floatprog.c: In function ‘main’:
floatprog.c:1:519: warning: incompatible implicit declaration of built-in function ‘exit’ [enabled by default]
$ ./floatprog '1A00001' && echo invalid || echo valid
valid
$ ./floatprog '100A010' && echo invalid || echo valid
invalid

Either program, if you fail to provide the command-line argument, dereferences a null pointer, causing a segmentation fault. A sufficiently long regex will overflow the buffers of this submission, and the size of input to a generated program is limited by the size of the stack. However, all the test cases work.

Note that e(g=h++,C=h++,0,0); introduces undefined behavior. If, for example, generated programs do not compile, you can try replacing the statement with h+=2;e(g=h-1,C=h-2,0,0);, which is five characters longer.

PleaseStand

Posted 2012-08-18T22:37:13.253

Reputation: 5 369