a←819⌶⎕A
E←{⍵{(⍺⊆⍨~⍵),⍺[⍸⍵]}(⍵∊'|+')∧0=+\-⌿'()'∘.=⍵}1↓¯1↓⊢
M←{c←⊃⌽⍵⋄c∊'0',a:0⋄c∊'_*':1⋄r s o←E⍵⋄o='|':∨/∇¨r s⋄∧/∇¨r s}
D←{c←⊃⌽⍵⋄c∊'0_':'0'⋄c=⍺:'_'⋄c∊a:'0'⋄c='*':1⌽∊')('(⍺∇¯1↓⍵)'+'⍵⋄r s o←E⍵⋄o='|':1⌽∊')('(⍺∇r)'|',⍺∇s⋄M r:1⌽∊')(('(⍺∇r)'+'s')|',⍺∇s⋄1⌽∊')('(⍺∇r)'+'s}
{M⊃D/(⌽⍵),⊂⍺}
Try it online!
-18 bytes thanks to @ngn.
This is a proof of concept that we can do a "simple regex matching" without any backtracking, thus avoiding possible infinite loops due to _*
or r**
. This is also a showcase that APL is a general-purpose programming language.
The anonymous function at the last line does the regex matching; use it as (regex) f (input string)
. The return value is 1 if the match is successful, 0 otherwise.
Concept
Given a simple regex R
and the first character c
of input string, we can construct (or derive) another simple regex R'
that matches exactly the strings s
where the original R
matches c+s
.
$$
\forall R \in \text{simple regex}, c \in \text{[a-z]}, s \in \text{[a-z]*}, \\
\exists R' \in \text{simple regex}, R' =\sim s \iff R =\sim c+s
$$
Combine this with a tester which checks if r
matches an empty string (epsilon), and we get a fully working simple regex matcher: given a regex \$ R_0 \$ and string \$ s = c_1 c_2 \cdots c_n \$, sequentially derive \$ R_0, c_1 \rightarrow R_1, c_2 \rightarrow R_2 \cdots \rightarrow R_n \$ and then test if \$ R_n \$ matches epsilon.
My code uses the following algorithm for testing epsilon match (MatchEps
) and computing R'
from R
and c
(Derive
).
T = True, F = False
0 = null regex (never matches)
_ = "empty string" regex
a = single-char regex
r, s = any (sub-)regex
MatchEps :: regex -> bool
MatchEps 0 = F # Null regex can't match empty string
MatchEps _ = T # Empty-string regex trivially matches empty string
MatchEps a = F # Single-char can't match
MatchEps r* = T # Kleene matches as zero iteration
MatchEps (r|s) = MatchEps r or MatchEps s
MatchEps (r+s) = MatchEps r and MatchEps s
Derive :: char -> regex -> regex
# No matching string at all
Derive c 0 = 0
# _ can't match any string that starts with c
Derive c _ = 0
# Single-char regex only matches itself followed by empty string
Derive c a = if c == 'a' then _ else 0
# r* matches either _ or (r+r*);
# _ can't start with c, so it must be first `r` of (r+r*) that starts with c
Derive c r* = ([Derive c r]+r*)
# r or s; simply derive from r or derive from s
Derive c (r|s) = ([Derive c r]|[Derive c s])
# r followed by s; it matters if r can match _
Derive c (r+s) =
# if r matches _, either [r starts with c] or [r matches _ and s starts with c]
if MatchEps r then (([Derive c r]+s)|[Derive c s])
# otherwise, r always starts with c
else ([Derive c r]+s)
Ungolfed, with comments
⍝ Unwrap single layer of (...) and extract (r, s, op) from (r|s) or (r+s)
ExtractRS←{⍵{(⍺⊆⍨~⍵),⍺[⍸⍵]}(⍵∊'|+')∧0=+\-⌿'()'∘.=⍵}1↓¯1↓⊢
⍝ 1↓¯1↓⊢ Drop the outermost ()
⍝ {...} Pass the result to the function as ⍵...
⍝ +\-⌿'()'∘.=⍵ Compute the layers of nested ()s
⍝ (⍵∊'|+')∧0= Locate the operator (`|` or `+`) as bool vector
⍝ ⍵{...} Pass to inner function again ⍵ as ⍺, above as ⍵
⍝ ⍺[⍸⍵] Extract the operator
⍝ (⍺⊆⍨~⍵), Prepend the left and right regexes
⍝ Tests if the given regex matches an empty string (epsilon, eps)
MatchEps←{
c←⊃⌽⍵ ⍝ Classify the regex by last char
c∊'0',819⌶⎕A:0 ⍝ 0(no match) or lowercase: false
c∊'_*':1 ⍝ _(empty) or Kleene: true
r s op←ExtractRS ⍵ ⍝ The rest is (r|s) or (r+s); extract it
op='|': ∨/∇¨r s ⍝ (r|s): r =~ eps or s =~ eps
∧/∇¨r s ⍝ (r+s): r =~ eps and s =~ eps
}
⍝ Derives regex `R'` from original regex `R` and first char `c`
Derive←{
c←⊃⌽⍵ ⍝ Classify the regex by last char
c∊'0_':,'0' ⍝ 0 or _ doesn't start with any c
c=⍺:,'_' ⍝ Single char that matches
c∊819⌶⎕A:'0' ⍝ Single char that doesn't match
c='*': '(',(⍺∇¯1↓⍵),'+',⍵,')' ⍝ One char from Kleene: (R*)' = (R'+R*)
r s op←ExtractRS ⍵ ⍝ Extract (r|s) or (r+s)
op='|': '(',(⍺∇r),'|',(⍺∇s),')' ⍝ (r|s): one char from either branch
MatchEps r: '((',(⍺∇r),'+',s,')|',(⍺∇s),')' ⍝ (r+s) and r =~ eps: ((r'+s)|s')
'(',(⍺∇r),'+',s,')' ⍝ (r+s) but not r =~ eps: (r'+s)
}
⍝ Main function: Fold the string by Derive with initial regex,
⍝ and then test if the result matches eps
f←{MatchEps⊃Derive/(⌽⍵),⊂⍺}
Final note
This is not an original idea of mine; it is part of a series of exercises on a theorem proving textbook. I can claim that the algorithm is proven to work (because I did complete the correctness proofs), though I can't open the entire proof to public.
Welcome to the site! You have tagged this as code-golf but there is no indication of the scoring criterion in the challenge. Is code-golf with bytes the intended criterion? – Post Rock Garf Hunter – 2019-12-14T20:13:00.963
1Thanks! Yes, the shortest code wins. Added that to the question. – Stefan – 2019-12-14T20:13:56.813
Additionally output formats for decision problems can vary. Here is a meta post that goes into the options you as a challenge writer can choose.
– Post Rock Garf Hunter – 2019-12-14T20:14:46.7073Can we assume that the regex consists entirely of lowercase letters and
_*(|)
? – Adám – 2019-12-14T20:27:48.1072Yes, you can assume that it's a string of lower case letters, and that the input regex will be a valid regex as described above. – Stefan – 2019-12-14T20:29:52.297
@Stefan Valid sure, but nowhere do you state that there won't be any other characters in the regex. – Adám – 2019-12-14T20:31:12.227
There won't be any other characters, because any regex must be one of those five cases above (and nothing else), which are composed of only characters
()|+*
anda
toz
. I'll add some examples of cases that don't match. – Stefan – 2019-12-14T20:33:35.257The entire string has to match. – Stefan – 2019-12-14T20:36:26.980
Already added it. Also,
abc
in the fourth example is a partial match (ab
) but not a full match. – Stefan – 2019-12-14T20:38:28.750Suggested rewording of entire post: Given a `[a-z]
string and a
[a-z_()|+*]+PCRE regex, output whether the entirety of the string would be matched by the regex after all
_s and
+`s are removed from the regex.* Strictly speaking, this isn't the same as it allows many more regexes, but most answers will probably use this method anyway. – Adám – 2019-12-14T20:57:17.900What is the criterion that defines what regex rule to use, I mean that since it's a code-golf I guess most answers will use the simplest and shortest form of a possible regex, is it allowed? – game0ver – 2019-12-14T21:12:11.080
1@game0ver There's never any ambiguity: A sub-regex is either a single character or a parenthesis, optionally followed by an asterisk. – Adám – 2019-12-14T21:20:08.187
@Adám: That rewrite might make an implementation in a language without a built-in regex library even less likely if each "atom" is a string of letters instead of just a single character. It's a good summary of one valid implementation strategy, though. Although I was kind of hoping this challenge wouldn't be so easily solved just by translating this into standard regex so more answers would have to implement a simple regex engine, maybe with backtracking via recursion for example. – Peter Cordes – 2019-12-15T19:24:15.133
1I don't think
(c+(o+(l+((o+u)|o)+r)))
strictly matches the grammar.(l+((o+u)|o)+r)
requires defining the associativity of+
to have a well defined parse. – ankh-morpork – 2019-12-16T03:52:49.060Hi Stefan. If you want to sharpen your regex skills in a fun way try "www.regexcrossword.com". It has tutorials that gently introduce you to using regex, getting progressively harder. After that it has a "user generated" regex crossword level which range from being difficult to utterly insanely devious. But going through the tutorials is enough for everyday regex usage. – Robin – 2019-12-16T09:41:17.563
@Robin Please note that the OP is not asking for any help here. :) Questions on Code Golf SE are just meant to be recreational programming challenges. – Arnauld – 2019-12-17T03:11:28.030
@Arnauld: Yup I understand, I thought it was a tad off topic, I have nothing to do with that web site other than enjoying playing it. A regex crossword is a very close relation to a programming challenge though :)! – Robin – 2019-12-17T08:56:31.490
5
@ankh-morpork has pointed out that under the question's definition of a simple regex, any number of
– Deadcode – 2020-01-09T01:44:18.070*
in a row results in a valid simple regex; for example,qa***z
should matchqz
andqaaz
. Is this the definition we should stick to? If so, 8+ answers will need to be edited, as they are currently incorrect.4In addition to Deadcode's comment above,
_*
is also a valid "simple regex" that matches only an empty string. Deleting_
would leave single*
, which is a grammar error in most regex flavors (if not all). – Bubbler – 2020-01-09T04:26:56.657