Compile Regexes (By Substitution)

21

1

Your task is to compile regexes... by specifying a substitution for each character in a regex.

Regexes

The regexes support these

REGEX       = (LITERAL REGEX / GROUP REGEX / STAR REGEX / ALTERNATIVE)
LITERAL     = 1 / 0
GROUP       = '(' REGEX ')'
STAR        = (LITERAL / GROUP) '*'
ALTERNATIVE = '('REGEX ('|' REGEX)*')'

Why only 1 or 0? It is for simplification. The regex thus has only the following characters:

*()|10

It is interpreted as follows:

  1. * is Kleene star (repeat left group or literal 0 or more times).
  2. | is alternation (match if either the regex to the left or the regex to the right matches).
  3. () is grouping.
  4. 1 matches character 1.
  5. 0 matches character 0.

How to compile?

You specify six code snippets: one to replace each regex character. For example, if your answer is:

* : FSAGFSDVADFS
| : GSDGSAG
( : GSDG
) : GDSIH
1 : RGIHAIGH
0 : GIHEBN

Then you replace each regex with its respective code snippet, so:

(0|11)*

is turned into:

GSDGGIHEBNGSDGSAGRGIHAIGHRGIHAIGHGDSIHFSAGFSDVADFS

What is the resulting program supposed to do?

Your program will:

  1. Take the input.
  2. Output a truthy value if the regex matches whole input.
  3. Else output a falsy value.

Input outside 01 is consindered undefined behavior. Input can be empty.

Additional rules

  1. For a given regex character, the resulting snippet must always be the same.
  2. There is no prefix or suffix character added afterwards.
  3. The regex is guaranteed to be nonempty.

Scoring

The least combined snippet is the winner. So the score for the example case would be calculated as follows:

FSAGFSDVADFS + GSDGSAG + GSDG + GDSIH + RGIHAIGH + GIHEBN

12 + 7 + 4 + 5 + 8 + 6 = 42

Akangka

Posted 2015-11-17T14:42:10.640

Reputation: 1 859

Is each snippet to be at least 1 character long? – trichoplax – 2015-11-17T15:50:16.803

The snippet can have zero length. The edit is OK. – Akangka – 2015-11-18T09:09:45.027

Is the language RegEx valid for this challenge? :P – Loovjo – 2015-11-18T09:31:42.147

I consider RegEx have RegEx build-in. I am forced to do this. I want to exclude Retina and regex, however, according to Mego, it is not allowed. Still, I don't know about Snails and friends. – Akangka – 2015-11-18T10:59:47.543

@ChristianIrwan Interestingly, I'm still not sure this is even solvable in Retina, and even it is, it will be far from competitive. – Martin Ender – 2015-11-18T11:45:23.370

Actually, I want to exclude Retina, RegEx, Snail, SnakeEx, etc. However, I don't know about Snail, SnakeEx, ..., so someone post the loophole. Since I can't prevent it, now you can post in it. – Akangka – 2015-11-18T12:16:40.380

I want to see answer without regex nor other pattern-matching language. I will give him a bounty. – Akangka – 2015-11-18T12:19:41.343

Does The regex is guaranteed to be nonempty. apply only to the entire input, or two REGEX as well? In particular, do we have to support (0|)? – Dennis – 2015-11-18T22:34:47.777

Only to entire input. – Akangka – 2015-11-19T15:18:21.497

Answers

7

Snails, 48 bytes

0 -> )0(\0!(l.)(~

1 -> )0(\1!(l.)(~

( -> )0({{(

) -> )0}}(~

| -> )0}|{(

* -> )0),(~

If we had to search for partial matches rather than matching only the full input, then it would be very easy. 0 would become \0, 1 would become \1, * would become ,, and the others would map to themselves. Instead there are a lot of shenanigans to prevent matches from starting somewhere other than the beginning or ending somewhere other than the end. !(l.) is an assertion that will fail if the beginning of the match is not at the beginning of the input. ~ matches a cell outside of the input, so it is added to all the characters that are allowed to be at the end of the regex. If there is another regex character following, it is canceled out by a numerical quantifier 0 which requires it to be matched 0 times, essentially commenting it out. To allow * (,) to work correctly despite the dummy out-of-bounds test being in the way, the language's bracket matching rules are heavily used. From the documentation:

Matching pairs of parentheses () or curly braces {} will behave as expected (like parentheses in regex), but it is also possible to leave out one half of a pair and have it inferred, according to the following rules. ) or } groups everything to the left until the nearest unclosed group opening instruction of the same type(( or { respectively), or the beginning of the pattern if none exists. It closes any unclosed opening instructions of the opposite type in the middle of this range. An otherwise unmatched ( or { is closed by the end of the pattern.

Clear as mud, right?

feersum

Posted 2015-11-17T14:42:10.640

Reputation: 29 566

Sigh, I forget that there is even matching language outside regex. Nice job, but sorry, no upvote (no downvote, either) – Akangka – 2015-11-18T08:27:55.410

@ChristianIrwan there's actually a whole challenge on this site for developing 2d matching languages, most of which has 1d degenerate uses. http://codegolf.stackexchange.com/questions/47311/language-design-2-d-pattern-matching

– Sparr – 2015-11-18T08:39:44.970

7

CJam, 151 bytes

{]Na/Saf+{:m*:sSf-~}%}:J{+:MU{]W=~Jea&,}|}:TM0sa`T
{]Na/Saf+{:m*:sSf-~}%}:J{+:MU{]W=~Jea&,}|}:TM1sa`T
M{{+:M];eas!}:T}|U):UM'[T
MN`T
U(:UM'JT
M\"S+ea`,m*\"T

The lines correspond to the characters 01(|)* (in that order). Try it online!

This uses no built-in regular expression or other types of pattern matching. In fact, CJam has neither of these features. Instead, it starts from the regular expression it represents and builds all possible strings it could match, to finally check if the user input is one of them.

Test runs

The following uses a program that reads a regular expression from STDIN, replaces each of its characters by the proper snippet, and finally evaluates the generated code to see if it matches the input specified in the command line argument.

$ cat regex.cjam
l"01(|)*""

{]Na/Saf+{:m*:sSf-~}%}:J{+:MU{]W=~Jea&,}|}:TM0sa`T
{]Na/Saf+{:m*:sSf-~}%}:J{+:MU{]W=~Jea&,}|}:TM1sa`T
M{{+:M];eas!}:T}|U):UM'[T
MN`T
U(:UM'JT
M\"S+ea`,m*\"T

"N%ers~
$ cjam regex.cjam '' <<< '(|)'
1
$ cjam regex.cjam '0' <<< '(|)'
0
$ cjam regex.cjam '' <<< '0(|)'
0
$ cjam regex.cjam '0' <<< '0(|)'
1
$ cjam regex.cjam '' <<< '(0|11)*'
1
$ cjam regex.cjam '0' <<< '(0|11)*'
1
$ cjam regex.cjam '11' <<< '(0|11)*'
1
$ cjam regex.cjam '011011000' <<< '(0|11)*'
1
$ cjam regex.cjam '1010' <<< '(0|11)*'
0

Unfortunately, this isn't particularly fast. It will choke rather quickly if there are more than 9 characters in the input or more than a single Kleene star in the regex.

At the cost of 5 extra bytes – for a total of 156 bytes – we can generate shorter strings to match the potential input against and deduplicate them. This doesn't change how the code works; it just makes it more efficient.

$ cat regex-fast.cjam 
l"01(|)*""

{]Na/Saf+{:m*:sSf-~}%}:J{+:MU{]W=~Jea&,}|}:TM0sa`T
{]Na/Saf+{:m*:sSf-~}%}:J{+:MU{]W=~Jea&,}|}:TM1sa`T
M{{+:M];eas!}:T}|U):UM'[T
MN`T
U(:UM'JT
M\"S+eas,)m*:sSf-L|\"T

"N%ers~
$ cjam regex-fast.cjam '0101001010' <<< '(01|10)*'
0
$ cjam regex-fast.cjam '011001101001' <<< '(01|10)*'
1
$ cjam regex-fast.cjam '0' <<< '(0*1)*'
0
$ time cjam regex-fast.cjam '101001' <<< '(0*1)*'
1

Dennis

Posted 2015-11-17T14:42:10.640

Reputation: 196 637

I still have some idea how I could make this shorter and/or faster. I'll add an explanation when I'm satisfied with the results. – Dennis – 2015-11-19T01:50:46.900

There seems to be a superfluous \-escaping of the " in the pattern for *. Regardless of that, I couldn't get this program to accept any input, even for the simplest case where the regex consists only of a 0 (see test in online interpreter). Am I doing it wrong?

– matz – 2015-11-23T07:26:07.967

1

@matz My code uses command-line arguments, which are not implemented in that interpreter. Try this one instead.

– Dennis – 2015-11-23T07:37:56.773