Complement a POSIX Extended Regular Expression

19

6

Given a POSIX Extended Regular Expression as input, output its complement (also a POSIX ERE). If the given regular expression matches a string, its complement should not match it, and the regex: (regex given)|(complement) should match any string. In other words, negate the regex.

The regex for format input and for output, POSIX ERE, is basically regex without character classes, lookarounds, backreferences, etc. You can assume that we only care about ASCII characters. Also, assume that character classes do not contain - for use as ranges (no [a-d] as input; instead, the input would be [abcd]). Further assume that ? and + and {} will not be used in the input as metacharacters. x? will be shown with x| and x+ by xx* where x is an arbitrary regular expression. The variations of {m,n} will be similarly written out (e.g.: instead of x{1,4}, you'll get x|xx|xxx|xxxx).

Sample Input:

^([helo][world])([helo][world])*$

Possible Output (Thanks Peter Taylor):

^(([helo]([world][helo])*)?$|([helo][world])*[^helo]|[helo]([world][helo])*[^wo‌​rld])

Please post a few sample Input/Outputs

Justin

Posted 2014-03-17T03:14:07.297

Reputation: 19 757

3Your sample input uses +, but you state that + will not be used – Mego – 2016-10-04T16:47:33.603

@mbomb007 I think that was meant to be "complement", not "compliment"... At any rate, I didn't answer or comment on this question until now... So I'm a little confused, why refer to me...? – WallyWest – 2017-01-05T22:41:38.360

Not quite a duplicate because it allows regex syntaxes other than ERE, which actually makes a very large difference here. – None – 2017-03-01T23:24:33.283

1

@WallyWest yeah you did...

– mbomb007 – 2017-03-02T15:21:28.467

@Justin Can we assume that every input regex will start with ^ and end with $ (i.e. no "fuzzy" matching)? If so, I think I will have a solution soon. – mbomb007 – 2018-02-01T17:19:39.843

Though honestly, supporting printable ASCII is going to be rough, if meta-characters have to be included in the alphabet. – mbomb007 – 2018-02-01T18:15:59.073

@mbomb007 TBH, it's been so long since I posted this challenge that I'm unsure what is reasonable. I can revisit this and carefully revise it to make it simpler (stuff like only printable ascii or something). Also, is we have a regex regex, if it doesn't start with ^, isn't it exactly equivalent to ^.*regex? I don't know what I originally intended; it will take some work to figure that out – Justin – 2018-02-01T18:27:25.477

Yeah, you're right. I guess that's fine. I guess it depends what you want the challenge to be. Supporting escaped meta-characters is honestly going to be a challenge in itself if they are part of the alphabet. If what you really want is for the complementing the regex to be the entirety of the challenge, maybe supporting the characters [A-Za-z0-9] plus spaces is enough. Idk. If that was enough, I could (probably) have an answer today. – mbomb007 – 2018-02-01T19:40:42.973

The parts I haven't finished yet are replacing character classes [abc] with (a|b|c) so that I can work with them, and supporting (certain) escaped meta-chars. I think I get can printable ASCII to work, but it'll just take a while. – mbomb007 – 2018-02-01T19:43:14.843

Also, golfing will take a long time. The code I have so far is basically 30KB. – mbomb007 – 2018-02-01T20:41:58.890

@mbomb007 write an automatic golfer for it? (Or just do a size-optimization compile) – None – 2018-10-29T21:52:40.963

1@Rogem I just haven't been working on this lately. My previous comment still stands, as I'm not finished with supporting all of the necessary characters yet. As I said, if all I need was to support [A-Za-z0-9], I'd probably have submitted an answer already. – mbomb007 – 2018-10-29T23:19:06.600

2The string ao is neither matched by the input nor the given output. A possible output would be ^(|(..)*.|(..)*[^helo].(..)*|(..)*.[^world](..)*)$ (assuming . matches all characters). – Heiko Oberdiek – 2014-03-17T03:51:02.867

@HeikoOberdiek Thanks, I decided to go with Peter Taylor's version instead. – Justin – 2014-03-17T04:01:43.617

2As far as I can see, there is no way to complement the empty pattern that matches everything. – Heiko Oberdiek – 2014-03-17T04:33:26.400

@HeikoOberdiek See http://codegolf.stackexchange.com/q/18393/9498 . Example: .^ It's not possible for a character to be before the start of a string.

– Justin – 2014-03-17T04:36:18.837

16Oh "complement", not "compliment"... I was going to answer something on the lines of "Nice word boundaries!" – WallyWest – 2014-03-17T10:41:50.593

I feel like this will be too hard unless you give a simpler regex grammar than POSIX ERE. – Peter Olson – 2014-03-17T18:53:59.367

@PeterOlson What do you suggest then? Also, next time, try to bring this up in the sandbox.

– Justin – 2014-03-17T18:58:13.843

2

I'd suggest a simple grammar that can be described in a few lines. You could use the formal definition of a regular expression, which only uses concatenation (RS), alternation (R|S), and Kleene star (R*).

– Peter Olson – 2014-03-17T19:06:31.020

@PeterOlson That looks good, but I would like to include .. Would that make it too difficult? – Justin – 2014-03-17T19:12:42.017

I don't think that would make it much harder – Peter Olson – 2014-03-17T19:17:50.327

@PeterOlson I had another thought; anything in this POSIX ERE can be replaced with something else to make it part of the formal definition. Instead of using the formal definition, I say use POSIX ERE, but restrict input as follows: no ranges in character classes; no character class negations; no use of ? or +; () will only be used when necessary. Output can use any of those. – Justin – 2014-03-18T00:06:14.603

If you include ., what will be the complement of e.g. AB.D? – Antonio Ragagnin – 2014-03-30T09:26:29.823

@AntonioRagagnin Note I said assume that we are using ASCII. This means that . is equivalent to [x-~] where the x is the null character (ASCII 0). Basically, it is possible to translate POSIX ERE into the formal definition of regular expressions. I'm not sure what the complement would be; I'm working on it. – Justin – 2014-03-30T09:38:02.940

@AntonioRagagnin A complement of your regular expression is ^([A][B].[^D]|[A][B].$|[A][B]$|[A][^B]|[A]$|[^A])*$. I also found a tool for testing POSIX ERE expressions: http://www.regexplanet.com/advanced/golang/index.html

– Justin – 2014-03-30T09:54:18.560

1

There's an answer here which describes how to do this: http://stackoverflow.com/a/14817545/1396205

– bazzargh – 2014-03-31T09:47:25.557

4Hmm had more of a look at this. It's possible, but crazy tedious. You need to parse the regexp into an NFA (eg Thompson's algorithm), convert the NFA to a DFA (powerset construction), complete the DFA, find the complement, then convert the DFA to a RE (eg Brzozowski's method). ie slightly harder than writing a complete RE engine! I'd expect a golfed version would need to use library functions for each step. – bazzargh – 2014-04-01T14:33:13.863

No answers