Check if simple regex matches string

20

2

A simple regex is either:

  • _ (which matches the empty string)
  • Any lowercase letter a to z (which matches that letter)
  • r*, where r is a regex (which matches r any number of times)
  • (r|s), where r and s are regexes (which matches either r or s)
  • (r+s), where r and s are regexes (which matches r followed by s)

Note that due to the recursive definition, * can occur multiple times in a row.

Here are some examples of regexes and their matches:

  • (a+b) matches only ab
  • ((a|_)+b*) matches , a, b, ab, bb, abb, but not baaa, aab
  • (c+(o+((l+(o+(u|_)))+r))) matches only color and colour
  • (a|b)* matches only strings containing letters a and b (so , ab, bab, but not abc)
  • (_***|(a+b***)) matches only the empty string or a followed by any number of bs.

Your task is to write a program that takes such a regex and a string of lowercase letters, and outputs whether or not the regex matches the entire string (output should be as described here).

The shortest code in bytes wins.

Stefan

Posted 2019-12-14T20:09:09.947

Reputation: 309

Welcome to the site! You have tagged this as code-golf but there is no indication of the scoring criterion in the challenge. Is code-golf with bytes the intended criterion? – Post Rock Garf Hunter – 2019-12-14T20:13:00.963

1Thanks! Yes, the shortest code wins. Added that to the question. – Stefan – 2019-12-14T20:13:56.813

Additionally output formats for decision problems can vary. Here is a meta post that goes into the options you as a challenge writer can choose.

– Post Rock Garf Hunter – 2019-12-14T20:14:46.707

3Can we assume that the regex consists entirely of lowercase letters and _*(|)? – Adám – 2019-12-14T20:27:48.107

2Yes, you can assume that it's a string of lower case letters, and that the input regex will be a valid regex as described above. – Stefan – 2019-12-14T20:29:52.297

@Stefan Valid sure, but nowhere do you state that there won't be any other characters in the regex. – Adám – 2019-12-14T20:31:12.227

There won't be any other characters, because any regex must be one of those five cases above (and nothing else), which are composed of only characters ()|+* and a to z. I'll add some examples of cases that don't match. – Stefan – 2019-12-14T20:33:35.257

The entire string has to match. – Stefan – 2019-12-14T20:36:26.980

Already added it. Also, abc in the fourth example is a partial match (ab) but not a full match. – Stefan – 2019-12-14T20:38:28.750

Suggested rewording of entire post: Given a `[a-z]string and a[a-z_()|+*]+PCRE regex, output whether the entirety of the string would be matched by the regex after all_s and+`s are removed from the regex.* Strictly speaking, this isn't the same as it allows many more regexes, but most answers will probably use this method anyway. – Adám – 2019-12-14T20:57:17.900

What is the criterion that defines what regex rule to use, I mean that since it's a code-golf I guess most answers will use the simplest and shortest form of a possible regex, is it allowed? – game0ver – 2019-12-14T21:12:11.080

1@game0ver There's never any ambiguity: A sub-regex is either a single character or a parenthesis, optionally followed by an asterisk. – Adám – 2019-12-14T21:20:08.187

@Adám: That rewrite might make an implementation in a language without a built-in regex library even less likely if each "atom" is a string of letters instead of just a single character. It's a good summary of one valid implementation strategy, though. Although I was kind of hoping this challenge wouldn't be so easily solved just by translating this into standard regex so more answers would have to implement a simple regex engine, maybe with backtracking via recursion for example. – Peter Cordes – 2019-12-15T19:24:15.133

1I don't think (c+(o+(l+((o+u)|o)+r))) strictly matches the grammar. (l+((o+u)|o)+r) requires defining the associativity of + to have a well defined parse. – ankh-morpork – 2019-12-16T03:52:49.060

Hi Stefan. If you want to sharpen your regex skills in a fun way try "www.regexcrossword.com". It has tutorials that gently introduce you to using regex, getting progressively harder. After that it has a "user generated" regex crossword level which range from being difficult to utterly insanely devious. But going through the tutorials is enough for everyday regex usage. – Robin – 2019-12-16T09:41:17.563

@Robin Please note that the OP is not asking for any help here. :) Questions on Code Golf SE are just meant to be recreational programming challenges. – Arnauld – 2019-12-17T03:11:28.030

@Arnauld: Yup I understand, I thought it was a tad off topic, I have nothing to do with that web site other than enjoying playing it. A regex crossword is a very close relation to a programming challenge though :)! – Robin – 2019-12-17T08:56:31.490

5

@ankh-morpork has pointed out that under the question's definition of a simple regex, any number of * in a row results in a valid simple regex; for example, qa***z should match qz and qaaz. Is this the definition we should stick to? If so, 8+ answers will need to be edited, as they are currently incorrect.

– Deadcode – 2020-01-09T01:44:18.070

4In addition to Deadcode's comment above, _* is also a valid "simple regex" that matches only an empty string. Deleting _ would leave single *, which is a grammar error in most regex flavors (if not all). – Bubbler – 2020-01-09T04:26:56.657

Answers

18

Haskell, 203 bytes

Nobody had done this by implementing a small regex engine yet, and I felt like it had to be done. This obviously won't win. but I'm hoping it will inspire someone to write an even more golfed regex engine.

I've rewritten my solution to avoid directly parsing the regular expression into its AST. Instead, the parsing process constructs a function that is used to match a string against the input regex.

The main function is (&) :: String -> String -> Bool which takes a string representation of a regex and a string to test, returning a boolean value. This calls to the next function which handles most of the work in parsing the regex and matching the string.

Function p :: String -> ([String] -> [String], String) takes a string representation of a regex and returns as the first element of a tuple a function that returns a list of all possible unmatched suffixes of strings in the input list after satisfying the regex parsed from the input string. The regex fully matches the string if the empty string is contained in the list of possible unmatched suffixes.

r&s=elem""$fst(p r)[s]
p(c:t)|c>'`'=t% \s->[t|h:t<-s,c==h]|c>'^'=t%id|(l,o:t)<-p t,(r,_:u)<-p t=u%last(r.l:[\s->r s++l s|o>'+'])
m#s=s++filter(`notElem`s)(m s)
('*':t)%m=t%until(\s->s==m#s)(m#)
s%m=(m,s)

Try it online!

To get rid of one byte, I replaced import Data.List; m#s=nub$s++m s with m#s=s++filter(`notElem`s)(m s). These functions aren't equivalent if there are duplicate elements in either s of m s. The new function does, however, remove all elements from m s that already exist in s, so until still terminates once no new suffixes are discovered by the application of m.

Ungolfed Code

import Data.List

match :: String -> String -> Bool
match r s =elem ""$(fst $ parseRegex r)[s]

parseRegex :: String -> ([String] -> [String], String)
parseRegex ('_':t) = parseKleene id t
parseRegex (c:t) | c >= 'a' = parseKleene (>>=p) t
  where p (c':t')| c==c' = [t']
        p _ = []
parseRegex ('(':t) =
  let (l, (o:t')) = parseRegex t in
  let (r, (_:t'')) = parseRegex t' in
  parseKleene (if o=='+' then (r.l) else (\ss-> (r ss)++(l ss))) t''

parseKleene :: ([String] -> [String]) -> String -> ([String] -> [String], String)
parseKleene p ('*':t) = parseKleene p' t
  where
    p' ss
      | ss' <- nub$ p ss,
        ss /= ss' = ss ++ (p' ss')
      | otherwise = ss
parseKleene p s = (p,s)

GolfScript, 198 bytes

I was able to beat my Haskell solution by implementing the first algorithm I tried in GolfScript instead of Haskell. I don't think it's interesting enough for a separate answer, so I'll just leave it here. There are likely some golfing opportunities since I learned GolfScript just for this.

This solution is in the form of a block that expects the test string on the top of the stack followed by the regex string.

{[.;]\1+{(.96>{0[\]}{2%0{r\(\r\:s;[@]s(;\}i}{if}:i~\{(.3%}{;\2[\]\}until[.;]\+\}:r~\;{.{(.{.4%{2%{)@\m\)\;m}{)\;{.@.@m 1$|.@={\;}{\o}i}:o~}i}{;)@.@m@@\)\;m|}i}{;(:c;;{,},{(\;c=},{(;}%}i}{;}i}:m~""?}

Try it online!

ankh-morpork

Posted 2019-12-14T20:09:09.947

Reputation: 1 350

3Fantastic work and great golfing! It even works with ((((a|d)|(g|j))|((((c|f)|i)+((a|d)|(g|j))*)+((b|e)|h)))|(((((b|e)|h)|((((c|f)|i)+((a|d)|(g|j))*)+((c|f)|i)))+(((a|d)|(g|j))|((((b|e)|h)+((a|d)|(g|j))*)+((c|f)|i)))*)+(((c|f)|i)|((((b|e)|h)+((a|d)|(g|j))*)+((b|e)|h)))))*, the regex that tests decimal numbers for divisibility by 3 (where 0-9 is replaced with a-j). – Deadcode – 2020-01-07T09:58:29.000

1@Deadcode thanks for bringing some attention to my solution. It's good to know that people are seeing it since I put a good bit of time into trying out different golfing strategies. – ankh-morpork – 2020-01-09T00:03:32.490

Incidentally I was wondering, why is _***** included as a test case, given that this is not a valid pattern as described by the problem? Not a big deal, just curious. My guess is that it's meant to show a pattern that works despite being invalid. (I notice if _ is replaced with a letter, it hangs.) If so, perhaps add a comment to that test case. Alternative guess: Something added during debugging/testing that was meant to be removed later. – Deadcode – 2020-01-09T00:47:13.703

1@Deadcode The way I understand the problem, that pattern should be valid: _ is defined as a regex and r* is defined to be a regex for all regex r, so we can chain * as much as we like on top of _. With regards to hanging, I think (but am not confident) that it will terminate but in an embarrassingly exponential amount of time. – ankh-morpork – 2020-01-09T01:05:02.693

Very interesting point. Under this interpretation, every other answer would currently be incorrect. – Deadcode – 2020-01-09T01:27:30.840

@Deadcode It looks like you were correct that my code doesn't always terminate for chained stars on characters (some cases work due to lazy evaluation). There's an easy fix for 16 bytes although I may be able to do better. I might also just leave it as is given that such inputs aren't handled by other answers. – ankh-morpork – 2020-01-09T03:30:54.363

I do think you should fix it. I'm trying to get everybody else to fix theirs. Also, it appears it's not just chained stars that hang your code. What about ((a|_)*|_)* or (_+(_+a*)*)* attempting to match a? – Deadcode – 2020-01-09T04:23:47.397

1@Deadcode The easy fix required importing Data.List for nub, but I'm fairly confident the problem is solved. – ankh-morpork – 2020-01-09T15:21:25.143

1You can change %(\s->...) to % \s->... – H.PWiz – 2020-01-10T15:26:02.307

10

APL (Dyalog Unicode), 39 bytesSBCS

Edit: now works with runs of * even after _

Full program. Prompts stdin for string and then for regex. Returns a list consisting of an empty list (by default, this prints as two spaces) for matches and an empty list (empty line) for non-matches.

(1⌽'$^','\*+' '_'⎕R'*' '()'⊢⍞~'+')⎕S⍬⊢⍞

Try it online! (output made easier to read by converting all output to JSON)

 prompt stdin (for string)

 on that, apply the following:

()⎕S⍬ PCRE Search for the following, returning an empty list for each match

~'+' remove all plusses from the following:

 prompt stdin (for regex)

 on that, apply the following:

'\*+' '_'⎕R'*' '()' PCRE Replace runs of * with * and _ with ()

'$^', prepend dollar sign and caret (indicating end and start)

1⌽ rotate the first character ($) to the end

Adám

Posted 2019-12-14T20:09:09.947

Reputation: 37 779

_*** is a valid simple regex. Please revise your answer to allow it. – Deadcode – 2020-01-09T05:40:35.803

@Deadcode Fixed. – Adám – 2020-01-09T06:46:34.603

6

APL (Dyalog Unicode), 295 277 bytes

a←819⌶⎕A
E←{⍵{(⍺⊆⍨~⍵),⍺[⍸⍵]}(⍵∊'|+')∧0=+\-⌿'()'∘.=⍵}1↓¯1↓⊢
M←{c←⊃⌽⍵⋄c∊'0',a:0⋄c∊'_*':1⋄r s o←E⍵⋄o='|':∨/∇¨r s⋄∧/∇¨r s}
D←{c←⊃⌽⍵⋄c∊'0_':'0'⋄c=⍺:'_'⋄c∊a:'0'⋄c='*':1⌽∊')('(⍺∇¯1↓⍵)'+'⍵⋄r s o←E⍵⋄o='|':1⌽∊')('(⍺∇r)'|',⍺∇s⋄M r:1⌽∊')(('(⍺∇r)'+'s')|',⍺∇s⋄1⌽∊')('(⍺∇r)'+'s}
{M⊃D/(⌽⍵),⊂⍺}

Try it online!

-18 bytes thanks to @ngn.

This is a proof of concept that we can do a "simple regex matching" without any backtracking, thus avoiding possible infinite loops due to _* or r**. This is also a showcase that APL is a general-purpose programming language.

The anonymous function at the last line does the regex matching; use it as (regex) f (input string). The return value is 1 if the match is successful, 0 otherwise.

Concept

Given a simple regex R and the first character c of input string, we can construct (or derive) another simple regex R' that matches exactly the strings s where the original R matches c+s.

$$ \forall R \in \text{simple regex}, c \in \text{[a-z]}, s \in \text{[a-z]*}, \\ \exists R' \in \text{simple regex}, R' =\sim s \iff R =\sim c+s $$

Combine this with a tester which checks if r matches an empty string (epsilon), and we get a fully working simple regex matcher: given a regex \$ R_0 \$ and string \$ s = c_1 c_2 \cdots c_n \$, sequentially derive \$ R_0, c_1 \rightarrow R_1, c_2 \rightarrow R_2 \cdots \rightarrow R_n \$ and then test if \$ R_n \$ matches epsilon.

My code uses the following algorithm for testing epsilon match (MatchEps) and computing R' from R and c (Derive).

T = True, F = False
0 = null regex (never matches)
_ = "empty string" regex
a = single-char regex
r, s = any (sub-)regex

MatchEps :: regex -> bool
MatchEps 0 = F    # Null regex can't match empty string
MatchEps _ = T    # Empty-string regex trivially matches empty string
MatchEps a = F    # Single-char can't match
MatchEps r* = T   # Kleene matches as zero iteration
MatchEps (r|s) = MatchEps r or MatchEps s
MatchEps (r+s) = MatchEps r and MatchEps s

Derive :: char -> regex -> regex
# No matching string at all
Derive c 0 = 0
# _ can't match any string that starts with c
Derive c _ = 0
# Single-char regex only matches itself followed by empty string
Derive c a = if c == 'a' then _ else 0
# r* matches either _ or (r+r*);
# _ can't start with c, so it must be first `r` of (r+r*) that starts with c
Derive c r* = ([Derive c r]+r*)
# r or s; simply derive from r or derive from s
Derive c (r|s) = ([Derive c r]|[Derive c s])
# r followed by s; it matters if r can match _
Derive c (r+s) =
  # if r matches _, either [r starts with c] or [r matches _ and s starts with c]
  if MatchEps r then (([Derive c r]+s)|[Derive c s])
  # otherwise, r always starts with c
  else ([Derive c r]+s)

Ungolfed, with comments

⍝ Unwrap single layer of (...) and extract (r, s, op) from (r|s) or (r+s)
ExtractRS←{⍵{(⍺⊆⍨~⍵),⍺[⍸⍵]}(⍵∊'|+')∧0=+\-⌿'()'∘.=⍵}1↓¯1↓⊢
  ⍝ 1↓¯1↓⊢    Drop the outermost ()
  ⍝ {...}     Pass the result to the function as ⍵...
  ⍝   +\-⌿'()'∘.=⍵    Compute the layers of nested ()s
  ⍝   (⍵∊'|+')∧0=     Locate the operator (`|` or `+`) as bool vector
  ⍝   ⍵{...}          Pass to inner function again ⍵ as ⍺, above as ⍵
  ⍝     ⍺[⍸⍵]     Extract the operator
  ⍝     (⍺⊆⍨~⍵),  Prepend the left and right regexes

⍝ Tests if the given regex matches an empty string (epsilon, eps)
MatchEps←{
    c←⊃⌽⍵                 ⍝ Classify the regex by last char
    c∊'0',819⌶⎕A:0        ⍝ 0(no match) or lowercase: false
    c∊'_*':1              ⍝ _(empty) or Kleene: true
    r s op←ExtractRS ⍵    ⍝ The rest is (r|s) or (r+s); extract it
    op='|': ∨/∇¨r s       ⍝ (r|s): r =~ eps or s =~ eps
    ∧/∇¨r s               ⍝ (r+s): r =~ eps and s =~ eps
}

⍝ Derives regex `R'` from original regex `R` and first char `c`
Derive←{
    c←⊃⌽⍵             ⍝ Classify the regex by last char
    c∊'0_':,'0'       ⍝ 0 or _ doesn't start with any c
    c=⍺:,'_'          ⍝ Single char that matches
    c∊819⌶⎕A:'0'      ⍝ Single char that doesn't match
    c='*': '(',(⍺∇¯1↓⍵),'+',⍵,')'    ⍝ One char from Kleene: (R*)' = (R'+R*)
    r s op←ExtractRS ⍵               ⍝ Extract (r|s) or (r+s)
    op='|': '(',(⍺∇r),'|',(⍺∇s),')'  ⍝ (r|s): one char from either branch
    MatchEps r: '((',(⍺∇r),'+',s,')|',(⍺∇s),')'   ⍝ (r+s) and r =~ eps: ((r'+s)|s')
    '(',(⍺∇r),'+',s,')'                           ⍝ (r+s) but not r =~ eps: (r'+s)
}

⍝ Main function: Fold the string by Derive with initial regex,
⍝                and then test if the result matches eps
f←{MatchEps⊃Derive/(⌽⍵),⊂⍺}

Final note

This is not an original idea of mine; it is part of a series of exercises on a theorem proving textbook. I can claim that the algorithm is proven to work (because I did complete the correctness proofs), though I can't open the entire proof to public.

Bubbler

Posted 2019-12-14T20:09:09.947

Reputation: 16 616

1Very cool solution. I remember working through those exercises when I took a software verification course - Lots of fun. – ankh-morpork – 2020-01-09T16:33:04.660

5

Python 3, 58 56 bytes

lambda r,s:re.match(re.sub('[_+]','',r)+'$',s)
import re

Try it online!

Simple - just convert it to ordinary regex, using ordinary regex!

-2 bytes thanks to Deadcode


Note from author: invalid due to chained repeats and repeats of nothing being allowed. Working on it.

Artemis still doesn't trust SE

Posted 2019-12-14T20:09:09.947

Reputation: 525

I think this is actually a 56 byte answer, because you don't have to count the f= for lambda answers, and can put the import after it: Try it online!

– Deadcode – 2020-01-09T00:18:42.220

1_*** is a valid simple regex. Please revise your answer to allow it. – Deadcode – 2020-01-09T05:40:51.130

4

JavaScript (ES6), 45 bytes

Takes input as (regex)(string). Returns a Boolean value.

Applies the straightforward method of removing [_+] from the simple regex to turn it into a standard regex.

r=>s=>!!s.match(`^${r.replace(/[_+]/g,"")}$`)

Try it online!

Or 43 bytes by returning either null or an object.

Arnauld

Posted 2019-12-14T20:09:09.947

Reputation: 111 334

_*** is a valid simple regex. Please revise your answer to allow it. – Deadcode – 2020-01-09T05:41:56.857

4

Retina, 38 35 bytes

*1A`
1G`
^
a`
_
()
\*+
*
"$-5"~`\+

Try it online! Takes the simple regex on the first line and the string to match on the second. Explanation:

*1A`

Delete the first line, but don't actually change the working string. The string to match still gets stored in the history, which allows us to refer to it later.

1G`

Keep only the first line.

^
a`

Prefix the a modifier to anchor the pattern to the whole string.

_
()

Turn the _s into ()s to match an empty string that can be "repeated" with *.

\*+
*

Reduces runs of * to a single *.

\+

Delete any +s.

"$-5"~`

Execute that as a stage, using the history as the working string.

Neil

Posted 2019-12-14T20:09:09.947

Reputation: 95 035

As specified in the problem, this is supposed to match the entire string (implied ^...$ around the pattern), but this is matching as if there is neither a ^ nor a $. – Deadcode – 2020-01-09T00:24:25.853

@Deadcode Sorry, I might not have noticed that edit by the time I finished answering the question. – Neil – 2020-01-09T01:05:16.797

Figured that might be it, which means you opened the question within within 27 minutes of when it was posted. – Deadcode – 2020-01-09T01:09:30.670

_*** is a valid simple regex. Please revise your answer to allow it. – Deadcode – 2020-01-09T05:42:49.923

@Deadcode It being the weekend, I probably had the "Newest Questions" tab open all day. – Neil – 2020-01-09T10:41:19.833

4

R, 55 75 bytes

function(x,y)grepl(paste0("^",gsub("([+_]|(?<=\\*))\\**","",x,pe=T),"$"),y)

Try it online!

A function that takes a simple regex x and a vector of strings y and returns a vector of logical values the same length as y indicating whether x matches.

Nick Kennedy

Posted 2019-12-14T20:09:09.947

Reputation: 11 829

_*** is a valid simple regex. Please revise your answer to allow it. – Deadcode – 2020-01-09T05:40:45.380

1@Deadcode fixed to allow this and ab****. – Nick Kennedy – 2020-01-09T07:18:25.653

3

Java 8, 55 bytes

r->s->s.matches(r.replaceAll("\\+|(_|(\\*))\\**","$2"))

Try it online.

Removes all +; all _ with zero or more trailing *; and changes all sequences of more than one subsequent * with a single *. Then it check if the String matches this modified regex. Note that in Java, the String#matches method implicitly adds a leading and trailing ^...$ to check the entire String.

Kevin Cruijssen

Posted 2019-12-14T20:09:09.947

Reputation: 67 575

_*** is a valid simple regex. Please revise your answer to allow it. – Deadcode – 2020-01-09T05:42:10.783

1@Deadcode Both _*** and ab**** are fixed now at the cost of 15 bytes. Can probably be golfed a bit, though. – Kevin Cruijssen – 2020-01-09T07:50:36.090

2

PHP, 983 976 954 930 910 892 838 bytes

<?php list(,$s,$i)=$argv;$p=0;$u=[2,[3,[2,[1,'('],$o=[2,[4,[2,&$u,[1,'|+']]],&$u],[1,')']],[1,'_'.join(range('a','z'))]],[5,[1,'*']]];m($o,$a);$s=$i;$p=0;echo m(o($a))&&$p==strlen($s);function m($m,&$a=[]){global$p,$s;$r=$p;$n=array_shift($m);foreach($m as$t){$b=[];if($n==1)if(($c=$s[$p]??0)&&strpos($t,$c)!==!1){$a[]=$c;$p++;return 1;}if($n==2){if(!m($t,$b)){$p=$r;return!1;}$a[]=$b;}if($n==3){if(m($t,$b)){$a[]=$b;return 1;}}if($n==4){k:$b=[];$r=$p;if(!m($t,$b)){$p=$r;return 1;}$a[]=$b;goto k;}if($n==5){if(m($t,$b))$a[]=$b;else{$a[]=[];$p=$r;}return 1;}if($n==6)return 1;}return $n==2?:$p!=$p=$r;}function o($a){$e=$b=u($a[1]);if($a[0]){$e=[2];foreach($a[0]as$u){$e[]=u($u[0]);$e[0]=$u[1][0]=='+'?2:3;}$e[]=$b;}return$e;}function u($u){$w=$u[0][0];$v=$w[0][0];$t=$v!='('?($v=='_'?[6,0]:[1,$v]):o($w[1]);return$u[1][0]==[]?$t:[4,$t];}

Try it online!

Ungolfed

<?php

list($dummy,$string,$user_test)=$argv;

$pointer = 0;

//production rules
$unit = [];
$char = ['char','_abcdefghijklmnopqrstuvwxyz'];
$separator = ['char','|+'];
$unit_and_separator = ['and',&$unit,$separator];
$operators_list = ['list',$unit_and_separator];
$operators = ['and',$operators_list,&$unit];
$open_bracket = ['char','('];
$close_bracket = ['char',')'];
$brackets = ['and',$open_bracket,$operators,$close_bracket];
$atom = ['or',$brackets,$char];
$star = ['opt',['char','*']];
$unit = ['and',$atom,$star];

$ast = [];
match($operators, $ast);

$user_regex = buildoperators($ast);

$user_ast = [];
$string = $user_test;
$pointer = 0;

// answer here 1=matched blank=not matched
echo match($user_regex, $user_ast)&&($pointer==strlen($string));

// recursive descent parser 
function match($test_match, &$ast) {
    global $pointer,$string;

    $original_pointer = $pointer;

    foreach (array_slice($test_match,1) as $test) {
        switch ($test_match[0]) {
            case 'and':
                $sub_match = [];
                $pass = match($test,$sub_match);
                if (!$pass) {$pointer = $original_pointer;return false;}
                $ast[] = $sub_match;
                break;
            case 'or':
                $sub_match = [];
                $pass = match($test, $sub_match);
                if ($pass) {
                    $ast[] = $sub_match;
                    return true;
                }
                break;
            case 'list':
                do {
                    $sub_match = [];
                    $original_pointer=$pointer;
                    $pass = match($test, $sub_match);
                    if (!$pass) {
                        $pointer = $original_pointer;
                        return true;
                    }
                    $ast[] = $sub_match;
                } while (true);
                break;
            case 'char':
                $char = substr($string,$pointer,1);
                if ($char && @strpos($test,$char)!==false) {
                    $ast[]=substr($string,$pointer,1);
                    $pointer++;
                    return true;
                }
                break;
            case 'emptystring':
                return true;
                break;
            case 'opt':
                $pass = match($test, $sub_match);
                if ($pass) {$ast[] = $sub_match;} else {$ast[] = []; $pointer = $original_pointer;}
                return true;
                break;
        }
    }

    if ($test_match[0] == 'and') {
        return true;
    } else {
        $pointer = $original_pointer;
        return false;
    }
}

// build user production rules
function buildoperators($ast) {
    if ($ast[0]) {  
        $engine = ['and'];
        foreach ($ast[0] as $unit_and_separator) {
            $engine[] = buildunit($unit_and_separator[0]);
            switch ($unit_and_separator[1][0]) {
                case '+':
                    $engine[0]='and';
                    break;
                case '|':
                    $engine[0]='or';
                    break;
            }

        }
        $engine[] = buildunit($ast[1]);
    } else {
        $engine = buildunit($ast[1]);
    }
    return $engine;
}

function buildunit($unit) {
    $star = !empty($unit[1][0]);
    if ($star) {
        return ['list',buildatom($unit[0][0])];
    } else {
        return buildatom($unit[0][0]);
    }
}

function buildatom($atom) {
    if ($atom[0][0]=='(') {
        return buildoperators($atom[1]);
    } elseif ($atom[0][0]=='_') {
        return ['emptystring',''];
    } else {
        return ['char',$atom[0][0]];
    }
}

Guillermo Phillips

Posted 2019-12-14T20:09:09.947

Reputation: 561

Very cool that you did this, and welcome to PPCG! Seems it needs some bugfixes though. It is missing the implied $ at the end of the pattern, e.g. (a|b) matches ac. And it seems using * anywhere except after the outermost expression results in error messages, e.g. (a*|b) crashes, but (a|b)* works. – Deadcode – 2020-01-09T00:11:58.180

Also, (((a+a)|a)+(a+b)) should be able to backtrack and match aab, but currently only ((a|(a+a))+(a+b)) is matching aab. – Deadcode – 2020-01-09T03:30:00.797

Thanks @Deadcode. Both bugs are fixed in the golfed version. It's larger, not sure how much more I can squeeze from the lemon. The ungolfed version was working correctly with (a*|b). The backtracking is not compatible with my overly simplistic recursive descent parser. As you've noted treat it gently and check shorter productions first to make it work. – Guillermo Phillips – 2020-01-09T11:55:50.127

Just a few extras. It's worth noting that the original question did not specify the behaviour of something like a+b+c|d, since there are no precedence rules. My code will accept it, but treat it as a|b|c|d, which is probably undesirable. Also, my regex engine makes up for its deficiency in that it is fact more powerful than conventional regex, as it also returns an abstract syntax tree of the results (in $user_ast). – Guillermo Phillips – 2020-01-09T12:15:29.743

1

Perl 5, 34 + -p flag = 35 bytes

Full program. Takes the simple regex pattern, followed by string to match against, from stdin as two separate lines, then loops and does it again, until EOF is encountered. Prints 1 for a match or nothing for a non-match (with no newline in either case).

@ankh-morpork has pointed out that technically, given the question's description of simple regexes, any number of * in a row makes a valid simple regex. @Bubbler has pointed out that _* also needs to work (and be equivalent to _). The other answers haven't taken these things into account yet, but I will do so:

s/[_+]/()/g;s/\*+/*/g;$_=<>=~/^$_/

Try it online!

To allow simple regexes such as (_***+a) to work, _ is changed to () instead of . For golf reasons, + is also changed to (), although changing it to  would work.

This solution exploits the fact that valid input won't contain newlines, input can be assumed to be valid, and both the implicit <> (from -p) and the explicit <> include the terminating newline read from stdin, thus $ doesn't need to be added at the end of the regex (as both the pattern and the string ), only ^ needs to be inserted at the beginning.


Perl 5, 20 + -p flag = 21 bytes (looser, obsolete interpretation of question)

y/_+//d;$_=<>=~/^$_/

Try it online!

Like most of the other solutions, deletes the _ and + characters to turn the simple regex into a standard regex. This means that the simple regex (_*+a) will not work, as it becomes (*a) after the deletion. Anything containing ** won't work either; in standard regex, an already-quantified expression cannot be quantified again.

Deadcode

Posted 2019-12-14T20:09:09.947

Reputation: 3 022

You don't need to count the flags into bytes. Instead, you can write e.g. Perl 5 `-p`, 34 bytes. – Bubbler – 2020-01-09T05:31:29.960

@Bubbler I know, but I've intentionally done this so that it is considered to be answer in the "Perl" language, and thus directly comparable to other answers in this language, rather than being in a different language, "Perl + -p". – Deadcode – 2020-01-09T05:38:23.527

1

C# (Visual C# Interactive Compiler), 535 bytes

a=>b=>{int y=0,e=a.Length,k=0,o,q;var z=new int[e];for(;k<e;k++)if(a[k]<41){for(o=q=1;o>0;q++)o+=a[q+k]<41?1:a[q+k]==41?-1:0;q--;z[k+q]=k+1;z[k]=k+q+2;}void t(string s,int j){for(;j<e;){var l=a[j++];var w=j<e&&a[j]==42;if(j>1&&a[j-2]<41)for(int r=j,d=0;r<z[j-2]-1;)if(a[r++]>123&z.Take(r).Skip(j-2).Count(x=>x>0)%2>0)t(s,r);if(l>96&l<124){do{if(w)t(s,j+1);if(s==""||s[0]!=l)return;s=s.Remove(0,1);}while(w);}if(l==42&&a[j-2]==41||l<41&z[j-1]-1<e&&a[z[j-1]-1]==42)t(s,z[j-1]);j=l>123?a.IndexOf(')',j)+1:j;}y=s==""?1:y;}t(b,0);return y;}

Try it online!

Inspired by @ankh-morpork's solution, this is another simple regex engine. Unfortunately, it is nowhere as terse as that solution.

Embodiment of Ignorance

Posted 2019-12-14T20:09:09.947

Reputation: 7 014

0

Japt , 13 bytes

è^+'$iVr"_|%+

Try it

true

Shaggy

Posted 2019-12-14T20:09:09.947

Reputation: 24 623

_*** is a valid simple regex. Please revise your answer to allow it. – Deadcode – 2020-01-09T05:41:22.887

0

Perl 5, 47 + -alp flag = 50 bytes

$_=$F[0];s/[_+]/()/g;s/\*+/*/g;$_=$F[1]=~/^$_$/

Try it online!


Perl 5, 41 + -alp flag = 44 bytes

Obsolete: it not supports _***-like regexes

$_=eval'$F[1]=~/^'.($F[0]=~y/_+//rd).'$/'

Try it online!

Denis Ibaev

Posted 2019-12-14T20:09:09.947

Reputation: 876

1_*** is a valid simple regex. Please revise your answer to allow it. – Deadcode – 2020-01-09T05:42:25.457

@Deadcode Fixed. – Denis Ibaev – 2020-01-09T18:33:14.457