Not possible in Ruby!

I wanted to write a regular expression in Ruby that matches valid regular expressions in Ruby, but I must conclude that writing one is not possible.

At first, Ruby seems like a good choice. If I want to check for matching parentheses, I can write a recursive expression like (?<mparen>([^()]|$\g<mparen>$)*) where mparen matches zero or more things that are either not parentheses, or are a set of parentheses with mparen inside.

So I listed things that are valid and not valid in Ruby regular expressions. Then I found things that I cannot check with regular expressions.

Empty ranges of characters

In Ruby, [a-b] is a valid, but [b-a] is not valid, because a comes before b. This rule is common to several flavors of regular expressions:

$ ruby21 -e '/[b-a]/'
-e:1: empty range in char class: /[b-a]/
$ perl -e 'qr/[b-a]/'
Invalid [] range "b-a" in regex; marked by <-- HERE in m/[b-a <-- HERE ]/ at -e line 1.
$ sed '/[b-a]/!d'   
sed: 1: "/[b-a]/!d": RE error: invalid character range

So how do I write a regular expression that matches a-b but not b-a, and works with any range? I can't.

I would like to write (?<begin>.)-[\k<begin>-z] using a backreference to the first character. This would become a-[a-z] or b-[b-z] or so on. (Later I would change z to the highest character, but for now I test lowercase letters.) This fails because I can't use backreferences in ranges! [\k<begin>-z] is the same as [<>-z], as the range >-z includes the letters begkin.

I can write in 202 characters:

a-[a-z]|b-[b-z]|c-[c-z]|d-[d-z]|e-[e-z]|f-[f-z]|g-[g-z]|h-[h-z]|i-[i-z]|j-[j-z]|k-[k-z]|l-[l-z]|m-[m-z]|n-[n-z]|o-[o-z]|p-[p-z]|q-[q-z]|r-[r-z]|s-[s-z]|t-[t-z]|u-[u-z]|v-[v-z]|w-[w-z]|x-[x-z]|y-[yz]|z-z

This covers 26 lowercase letters, but only in their literal forms. Octal and hex escapes are also part of the regular expression syntax. Now a is also \141 and \x61, and z is also \172 and \x7a. My regular expression would need to match \141-\172 and \x61-\x7a, and also a-\172 and \x61-z and so on.

(a|\\141|\\x61)-([a-z]|\\1(4[2-7]|[5-7][0-7])|\\x(4[1-9a-fA-F]|[5-7]\h))

I would not try to extend this to the 1112064 codepoints of Unicode.

Missing capture groups

A regular expression is not valid if it has a backreference to an undefined capture group. For example, (?<a>)\k is not valid because it has no capture group b. So, my regular expression for valid regular expressions would need to check \k against the set of valid names. I can't think of a way to do this in one regular expression.

The answer is not (?<=\(\?.*)\\k. This would use the lookbehind feature to match \k if there was a (? earlier in the string. It fails because variable quantifiers like * are not valid in lookbehind expressions.

Another problem happens with subexpression calls. It is valid for two capture groups to share the same name, but it is not valid for a subexpression call to use a multiplex name. So (?<a>)(?<a>) is valid but (?<a>)(?<a>)\g<a> is not.

Encoding troubles

A valid regular expression must also be valid in its character encoding. If the encoding is UTF-8, then \xc4\x80 is valid but \x80\x80 is not. My regular expression for regular expressions would also need to validate UTF-8. It is valid to mix octal and hex escapes, so \xc4\200 and \304\x80 are valid too.

Ruby has other encodings. My regular expression would also need to validate EUC-JP, Windows-31J, GB18030, and every other ASCII-compatible encoding known to Ruby. That's not possible! My regular expression can't know which encoding to validate.

Is \200 a valid regular expression? Regexp.new '\200'.encode('binary') is valid, but Regexp.new '\200'.encode('utf-8') is not valid. My regular expression can't see the encoding tag. It can only see the four ASCII characters in \200, so it can't decide whether \200 is valid.

I might fix this if I decree that my regular expression for regular expressions only works with one encoding. I might pick 'us-ascii', which has only 128 characters. This does not fix my problem with capture groups.

kernigh

Posted 2014-05-29T13:37:21.903

Reputation: 2 615

+1 for effort. – primo – 2014-05-30T03:08:00.893

The first problem is solvable by writing a regex generator; the second is probably solvable by using negative lookahead. The biggest problem is matching properly parenthesised expressions. – Peter Taylor – 2014-05-30T07:34:46.350

@PeterTaylor I think in general, numbered capturing groups provide the biggest problem, because you can't really check that there are enough parentheses in the pattern for any given backreference-number. – Martin Ender – 2014-05-30T19:26:17.697

@m.buettner, IIRC some regex libraries will let you get away with that and either make it a no-op or treat it as a literal number. – Peter Taylor – 2014-05-30T19:48:26.777

@PeterTaylor hm interesting. Unfortunately, it doesn't work in either .NET or PCRE. – Martin Ender – 2014-05-30T19:50:27.293

@PeterTaylor damn it, Lua doesn't allow it either... that was my last hope. – Martin Ender – 2014-06-04T23:14:03.717

@MarcoPrins I voted to reopen now, as it's a proper (solvable) challenge now. I still suggest you think about limiting the scope of patterns to matched to the simple theoretical flavour. In the meantime, I'll start cleaning up my comments. – Martin Ender – 2014-05-29T15:10:22.840

2I second the suggestion made by @m.buettner. It would be more fun if you formally identified a single grammar and asked us to write a regex that matches any possible string that could possibly be produced by that grammar. – Rainbolt – 2014-05-29T15:15:39.487

@MarcoPrins Just to be completely clear, I was suggesting limiting the patterns to the syntactical features listed in this section.

– Martin Ender – 2014-05-29T15:16:59.850

1I don't think this is theoretically possible. I believe it could be possible for a certain flavor of regex to match a different, simpler one, though. – primo – 2014-05-29T15:34:05.817

@Rusher Are you guys sure I should limit this to a single grammar? The idea of asking for any language is so that more people can participate.. and have more potential answers – Marco Prins – 2014-05-29T15:38:40.247

@MarcoPrins it only makes it insanely hard. Also, any flavour that supports backreferences is definitely impossible to support completely - I think - (and that would exclude most actually implemented flavours). Hence the suggestion to limit it to the simple theoretical flavour. That would be hard enough I think. – Martin Ender – 2014-05-29T15:39:57.713

Okay, which one do you think would be most fun? @m.buettner – Marco Prins – 2014-05-29T15:43:28.970

@m.buettner If your 'flavor' supports executing arbitrary code... take perlre. Attempt to compile the input as a regular expression, wrapped in an eval. Check $@ (eval error). If it's set, match something impossible, else, match nothing.

– primo – 2014-05-29T15:43:44.080

@MarcoPrins Single grammar, not single language. Here is an example of a grammar (although not very formally written): "The alphabet is periods and asterisks. A period can only follow an asterisk. An asterisk can only follow a period unless it is the first letter." This would lead to strings like *.*.* or *.*.*.*.*.*.. Obviously it could be much more complicated. I think there is some confuson here about what flavor, language, grammar, etc. means. Google the terminology if you are unsure. – Rainbolt – 2014-05-29T15:45:46.883

@MarcoPrins I told you several times which flavour, and I even provided a link. The flavour that consists only of *, ?, (), | and literal characters. – Martin Ender – 2014-05-29T15:53:27.667

@primo I'm not sure I'd count evaluating arbitrary Perl code as part of the regex flavour. – Martin Ender – 2014-05-29T15:53:50.843

4The core problem with this challenge: Regular expressions include matched delimiters, but are incapable of parsing matched delimiters. – AJMansfield – 2014-05-29T16:09:33.163

@AJMansfield We discussed this before (in now deleted comments). This is why only .NET and PCRE are capable of solving the challenge at all (as both of them can parse matched and nested parentheses). – Martin Ender – 2014-05-29T16:40:31.523

in fact, this has actually been asked before on PPCG: http://codegolf.stackexchange.com/q/3596/8478

– Martin Ender – 2014-05-30T19:30:26.897

Regex that matches any regex

Answers

Not possible in Ruby!

Empty ranges of characters

Missing capture groups

Encoding troubles