Regex validating regex

17

3

Build a regex that will accept a regex string as input and check if it's valid. Basically, your regex should be able to validate itself. (Any invalid regex should not be validated, so you can't use .*. ;))

Your flavor must be fully supported by well known implementations (Perl, sed, grep, gawk, etc), and it must fully support what those implementations support. [Don't worry about the lawyer speak; I'm just trying to remove any possible loopholes for the smart***es.]


I'd it, but I'm worried it'll give an edge to those who know and use non-feature-rich flavors. Or are my worries unfounded?

Mateen Ulhaq

Posted 2011-08-29T23:06:39.147

Reputation: 1 889

Question was closed 2018-02-04T19:48:39.400

I've voted to close this question as it has no winning criteria, making it impossible to objectively identify a winner, and thus off-topic. – caird coinheringaahing – 2018-02-04T18:27:59.560

1@BrianVandenberg regular expressions implemented in modern languages are pretty much all non-regular ... as soon as you add backreferences, you can match non-regular languages. Moreover, both Perl/PCRE and .NET are powerful enough to match correct nesting. – Martin Ender – 2014-06-04T23:16:21.113

8not possible, the arbitrary nesting brackets make a regex a context free grammar, (replacing it with a polish notation also needs a stack) – ratchet freak – 2011-08-29T23:27:51.257

@ratchet Augh, you might be right. – Mateen Ulhaq – 2011-08-29T23:45:54.523

1some extensions on the regular languages exist that may allow matching the brackets but I don't know how to do it – ratchet freak – 2011-08-29T23:57:17.997

8It's bound to be possible with Perl regexes. – Peter Taylor – 2011-08-30T05:46:31.017

@ratchet freak is correct; regular expressions are a context-free grammar, so a regular expression cannot accurately capture context-sensitive features of the language -- eg, balancing arbitrarily nested parentheses, etc. – Brian Vandenberg – 2011-09-15T17:46:53.837

Answers

22

Ruby

I tried to match the actual syntax of Ruby's regex flavor as much as possible, but there are a few quirks: it accepts a few lookbehinds that are actually invalid (like (?<=(?<!))), and it recognizes empty character ranges like D-A. The latter could be fixed for ASCII, but the regex is long enough as it is.

\A(?<main>
    (?!
        \{(\d+)?,(\d+)?\} # do not match lone counted repetition
    )
    (?:
        [^()\[\]\\*+?|<'] | # anything but metacharacters
        (?<cclass>
            \[ \^? (?: # character class
                (?: # character class
                    [^\[\]\\-] | # anything but square brackets,  backslashes or dashes
                    \g<esc> |
                    \[ : \^? (?: # POSIX char-class
                        alnum | alpha | word | blank | cntrl | x?digit | graph | lower | print | punct | space | upper
                    ) : \] |
                    - (?!
                        \\[dwhsDWHS]
                    ) # range / dash not succeeded by a character class
                )+ |
                \g<cclass> # more than one bracket as delimiter
            ) \]
        ) |
        (?<esc>
            \\[^cuxkg] | # any escaped character
            \\x \h\h? | # hex escape
            \\u \h{4} | # Unicode escape
            \\c . # control escape
        ) |
        \\[kg] (?:
            < \w[^>]* (?: > | \Z) |
            ' \w[^']* (?: ' | \Z)
        )? | # named backrefs
        (?<! (?<! \\) \\[kg]) [<'] | # don't match < or ' if preceded by \k or \g
        \| (?! \g<rep> ) | # alternation
        \( (?: # group
            (?:
                \?
                (?:
                    [>:=!] | # atomic / non-capturing / lookahead
                    (?<namedg>
                        < [_a-zA-Z][^>]* > |
                        ' [_a-zA-Z][^']* ' # named group
                    ) |
                    [xmi-]+: # regex options
                )
            )?
            \g<main>*
        ) \) |
        \(\?<[!=] (?<lbpat>
            (?! \{(\d+)?,(\d+)?\} )
            [^()\[\]\\*+?] |
            \g<esc>  (?<! \\[zZ]) |
            \g<cclass> |
            \( (?: # group
                (?:
                    \?: |
                    \? \g<namedg> |
                    \? <[!=]
                )?
                \g<lbpat>*
            ) \) |
            \(\?\# [^)]* \)
        )* \)
        |
        \(\? [xmi-]+ \) # option group
        (?! \g<rep> ) 
        |
        \(\?\# [^)]*+ \) # comment
        (?! \g<rep> )
    )+
    (?<rep>
        (?:
            [*+?] | # repetition
            \{(\d+)?,(\d+)?\} # counted repetition
        )
        [+?]? # with a possessive/lazy modifier
    )?
)*\Z

Unreadable version:

\A(?<main>(?!\{(\d+)?,(\d+)?\})(?:[^()\[\]\\*+?|<']|(?<cclass>\[\^?(?:(?:[^\[\]\\-]|\g<esc>|\[:\^?(?:alnum|alpha|word|blank|cntrl|x?digit|graph|lower|print|punct|space|upper):\]|-(?!\\[dwhsDWHS]))+|\g<cclass>)\])|(?<esc>\\[^cuxkg]|\\x\h\h?|\\u\h{4}|\\c.)|\\[kg](?:<\w[^>]*(?:>|\Z)|'\w[^']*(?:'|\Z))?|(?<!(?<!\\)\\[kg])[<']|\|(?!\g<rep>)|\((?:(?:\?(?:[>:=!]|(?<namedg><[_a-zA-Z][^>]*>|'[_a-zA-Z][^']*')|[xmi-]+:))?\g<main>*)\)|\(\?<[!=](?<lbpat>(?!\{(\d+)?,(\d+)?\})[^()\[\]\\*+?]|\g<esc>(?<!\\[zZ])|\g<cclass>|\((?:(?:\?:|\?\g<namedg>|\?<[!=])?\g<lbpat>*)\)|\(\?#[^)]*\))*\)|\(\?[xmi-]+\)(?!\g<rep>)|\(\?#[^)]*+\)(?!\g<rep>))+(?<rep>(?:[*+?]|\{(\d+)?,(\d+)?\})[+?]?)?)*\Z

Lowjacker

Posted 2011-08-29T23:06:39.147

Reputation: 4 466

1How does this ensure that there are no invalid numeric backreferences? – Martin Ender – 2014-06-04T23:21:08.180

1I guess it doesn't. Then again, it's not the only limitation it has (see above). Some things could be fixed, but the regex would become ridiculously long. – Lowjacker – 2014-06-08T14:21:14.137

28Aren't they both the unreadable version? – Kibbee – 2011-09-07T01:59:16.467

2@Kibbee The first one is reasonably readable if you know regex well. – Lowjacker – 2011-09-08T23:15:06.143