Ruby
I tried to match the actual syntax of Ruby's regex flavor as much as possible, but there are a few quirks: it accepts a few lookbehinds that are actually invalid (like (?<=(?<!))
), and it recognizes empty character ranges like D-A
. The latter could be fixed for ASCII, but the regex is long enough as it is.
\A(?<main>
(?!
\{(\d+)?,(\d+)?\} # do not match lone counted repetition
)
(?:
[^()\[\]\\*+?|<'] | # anything but metacharacters
(?<cclass>
\[ \^? (?: # character class
(?: # character class
[^\[\]\\-] | # anything but square brackets, backslashes or dashes
\g<esc> |
\[ : \^? (?: # POSIX char-class
alnum | alpha | word | blank | cntrl | x?digit | graph | lower | print | punct | space | upper
) : \] |
- (?!
\\[dwhsDWHS]
) # range / dash not succeeded by a character class
)+ |
\g<cclass> # more than one bracket as delimiter
) \]
) |
(?<esc>
\\[^cuxkg] | # any escaped character
\\x \h\h? | # hex escape
\\u \h{4} | # Unicode escape
\\c . # control escape
) |
\\[kg] (?:
< \w[^>]* (?: > | \Z) |
' \w[^']* (?: ' | \Z)
)? | # named backrefs
(?<! (?<! \\) \\[kg]) [<'] | # don't match < or ' if preceded by \k or \g
\| (?! \g<rep> ) | # alternation
\( (?: # group
(?:
\?
(?:
[>:=!] | # atomic / non-capturing / lookahead
(?<namedg>
< [_a-zA-Z][^>]* > |
' [_a-zA-Z][^']* ' # named group
) |
[xmi-]+: # regex options
)
)?
\g<main>*
) \) |
\(\?<[!=] (?<lbpat>
(?! \{(\d+)?,(\d+)?\} )
[^()\[\]\\*+?] |
\g<esc> (?<! \\[zZ]) |
\g<cclass> |
\( (?: # group
(?:
\?: |
\? \g<namedg> |
\? <[!=]
)?
\g<lbpat>*
) \) |
\(\?\# [^)]* \)
)* \)
|
\(\? [xmi-]+ \) # option group
(?! \g<rep> )
|
\(\?\# [^)]*+ \) # comment
(?! \g<rep> )
)+
(?<rep>
(?:
[*+?] | # repetition
\{(\d+)?,(\d+)?\} # counted repetition
)
[+?]? # with a possessive/lazy modifier
)?
)*\Z
Unreadable version:
\A(?<main>(?!\{(\d+)?,(\d+)?\})(?:[^()\[\]\\*+?|<']|(?<cclass>\[\^?(?:(?:[^\[\]\\-]|\g<esc>|\[:\^?(?:alnum|alpha|word|blank|cntrl|x?digit|graph|lower|print|punct|space|upper):\]|-(?!\\[dwhsDWHS]))+|\g<cclass>)\])|(?<esc>\\[^cuxkg]|\\x\h\h?|\\u\h{4}|\\c.)|\\[kg](?:<\w[^>]*(?:>|\Z)|'\w[^']*(?:'|\Z))?|(?<!(?<!\\)\\[kg])[<']|\|(?!\g<rep>)|\((?:(?:\?(?:[>:=!]|(?<namedg><[_a-zA-Z][^>]*>|'[_a-zA-Z][^']*')|[xmi-]+:))?\g<main>*)\)|\(\?<[!=](?<lbpat>(?!\{(\d+)?,(\d+)?\})[^()\[\]\\*+?]|\g<esc>(?<!\\[zZ])|\g<cclass>|\((?:(?:\?:|\?\g<namedg>|\?<[!=])?\g<lbpat>*)\)|\(\?#[^)]*\))*\)|\(\?[xmi-]+\)(?!\g<rep>)|\(\?#[^)]*+\)(?!\g<rep>))+(?<rep>(?:[*+?]|\{(\d+)?,(\d+)?\})[+?]?)?)*\Z
I've voted to close this question as it has no winning criteria, making it impossible to objectively identify a winner, and thus off-topic. – caird coinheringaahing – 2018-02-04T18:27:59.560
1@BrianVandenberg regular expressions implemented in modern languages are pretty much all non-regular ... as soon as you add backreferences, you can match non-regular languages. Moreover, both Perl/PCRE and .NET are powerful enough to match correct nesting. – Martin Ender – 2014-06-04T23:16:21.113
8not possible, the arbitrary nesting brackets make a regex a context free grammar, (replacing it with a polish notation also needs a stack) – ratchet freak – 2011-08-29T23:27:51.257
@ratchet Augh, you might be right. – Mateen Ulhaq – 2011-08-29T23:45:54.523
1some extensions on the regular languages exist that may allow matching the brackets but I don't know how to do it – ratchet freak – 2011-08-29T23:57:17.997
8It's bound to be possible with Perl regexes. – Peter Taylor – 2011-08-30T05:46:31.017
@ratchet freak is correct; regular expressions are a context-free grammar, so a regular expression cannot accurately capture context-sensitive features of the language -- eg, balancing arbitrarily nested parentheses, etc. – Brian Vandenberg – 2011-09-15T17:46:53.837