RegEx-golf: match all contents in a string

10

1

Your task is to write a RegEx that matches everything inside strings.

A string is defined as everything surrounded by (but not including) two unescaped ".

A " can be escaped by \, which can also be escaped again.

Testcases

string:  ab\c"defg\\\"hi"jkl"mn\\\\"opqrst""
matches:      ^^^^^^^^^^     ^^^^^^        ^ (the empty string)
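
For checking answers against the testcase, the definition above can also be written as a hand-rolled scanner. This is only a sketch, not a golfed answer: the name `stringContents` is made up for illustration, and it assumes the quotes are balanced. It treats a backslash as escaping whatever character follows it, inside or outside a string.

```javascript
// Hypothetical reference helper, not a golfed answer.
// Assumes the input has balanced, unescaped double quotes.
function stringContents(input) {
  const contents = [];
  let inString = false;
  let start = 0;
  for (let i = 0; i < input.length; i++) {
    const ch = input[i];
    if (ch === "\\") {
      i++; // skip the escaped character, whatever it is
    } else if (ch === '"') {
      if (inString) {
        contents.push(input.slice(start, i)); // closing quote: emit contents
      } else {
        start = i + 1; // opening quote: contents start after it
      }
      inString = !inString;
    }
  }
  return contents;
}

// The testcase from the question; String.raw keeps the backslashes literal.
const testcase = String.raw`ab\c"defg\\\"hi"jkl"mn\\\\"opqrst""`;
console.log(stringContents(testcase));
```

On the testcase this should yield the three contents marked above: defg\\\"hi, mn\\\\ and the empty string.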

Scoring

Shortest solution wins.

Specs

  • Please specify the flavour used.
  • The input will have balanced ".
  • There will be no \ that immediately precedes a string-beginning-delimiter. For example, you would not need to handle abc\"def"

Leaky Nun

Posted 2016-05-20T10:04:24.667

Reputation: 45 011

Will there be \ before a string? For example abc\"def". – jimmy23013 – 2016-05-20T10:29:05.827

Should it match each string in one group? For example, could I write something that has two matches in abc"de", one is d and the other is e? – jimmy23013 – 2016-05-20T10:43:53.373

It is allowed. – Leaky Nun – 2016-05-20T10:44:29.947

Will there be empty strings? – Martin Ender – 2016-05-20T10:45:46.280

Yes, there will be empty strings. – Leaky Nun – 2016-05-20T10:48:37.420

Answers

3

PCRE, 21 20 15 19 bytes

(.|^)"\K(\\.|[^"])*

Try it here.

This matches a character (or the beginning of the input) before the opening double quote and then resets the match with \K, so that the double quote isn't shared with another match.
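
JavaScript regexes have no \K, but the same idea can be tried there by putting the contents in a capture group instead of resetting the match. The sketch below is an adaptation for illustration, not the PCRE answer itself. The leading (?:.|^) still consumes the character before the opening quote, so when the search resumes at a closing quote, that quote can only serve as the preceding character and never as an opening quote.

```javascript
// Adaptation of the PCRE idea for JavaScript (an assumption for illustration):
// group 1 stands in for what \K would leave as the whole match.
const re = /(?:.|^)"((?:\\.|[^"])*)/g;

// The testcase from the question; String.raw keeps the backslashes literal.
const testcase = String.raw`ab\c"defg\\\"hi"jkl"mn\\\\"opqrst""`;
const contents = Array.from(testcase.matchAll(re), (m) => m[1]);
console.log(contents);
```

On the testcase, group 1 should hold defg\\\"hi, mn\\\\ and the empty string, in that order. As in the original, . does not match a newline unless a flag is added.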

PCRE, 25 23 bytes

Thanks to Martin Büttner for golfing off 2 bytes.

(\\.|[^"])*+(?!"(?R)|$)

Try it here.

Explanation

(
    \\.|[^"]     # An escaped character, or a character that isn't a double quote
)*+              # Possessive zero-or-more quantifier: once the group has matched,
                 # the engine cannot backtrack into it. So if \\. matched, it
                 # is never retried as [^"], because stopping right after the
                 # \\. already counts as a match.
(?!"(?R)|$)      # Make sure it is not followed by a double quote and another
                 # match, or by the end of the input.

Note that the possessive quantifier (*+) ensures that the negative lookahead always begins after a whole string, or after a whole segment of non-string.

There are 4 cases:

  • The match begins anywhere outside of a string. \\. would never match a double quote according to the clarification. It could only end just before the next double quote which begins a string, or at the end of input. Both cases fail the negative lookahead.
  • The match begins at the beginning of a string. (\\.|[^"])*+ would match a complete string. The next character must be a double quote, and couldn't be the end of input. After the double quote it is outside of the string, so it couldn't be another match. So it passes the negative lookahead.
  • The match begins at the end of a string. It matches an empty string in the same way as the previous case. But it doesn't matter according to the clarification.
  • The match begins in the middle of a string. Impossible because matches don't overlap.

jimmy23013

Posted 2016-05-20T10:04:24.667

Reputation: 34 042

Would (\\.|[^"]) work? – Martin Ender – 2016-05-20T10:50:38.490

@MartinBüttner that matches everything except " – Bálint – 2016-05-20T10:52:29.277

@Bálint I meant in place of ([^\\"]|\\.), not as the complete solution. – Martin Ender – 2016-05-20T10:52:53.413

@MartinBüttner Oh, ok – Bálint – 2016-05-20T10:56:22.017

Martin's suggestion should work, since \\. only fails when there is no character after the \ (or when that character is a newline, but that can be fixed with a flag), and that case is covered by the negative look-behind. The possessive quantifier prevents backtracking, so there is no other case to consider. – n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳ – 2016-05-20T11:03:47.550

@MartinBüttner Yes, it works thanks to the +. The quick reference on Regex101 "backtracking can't reduce the number of characters matched" seemed to be a lie. – jimmy23013 – 2016-05-20T11:10:01.857

Technically not wrong, since there isn't any backtracking at all. – jimmy23013 – 2016-05-20T11:11:01.620

I would love it if you could include an explanation. After all, the recursion obfuscates it. – Leaky Nun – 2016-05-20T11:12:29.407

@KennyLau Edited. – jimmy23013 – 2016-05-20T11:42:50.070

@mbomb007 The input will have balanced ". – jimmy23013 – 2017-02-16T09:14:19.397

0

JavaScript, 24 bytes

"([^"\\]*(?:\\.[^"\\]*)*)"

Group 1 is the contents of the string.
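
A minimal usage sketch, assuming the regex is run with the global flag and that only group 1 is read from each match (the whole match still includes the surrounding quotes). The variable names are chosen for illustration.

```javascript
// The answer's regex, with the g flag added so every string is found.
const re = /"([^"\\]*(?:\\.[^"\\]*)*)"/g;

// The testcase from the question; String.raw keeps the backslashes literal.
const testcase = String.raw`ab\c"defg\\\"hi"jkl"mn\\\\"opqrst""`;
const contents = Array.from(testcase.matchAll(re), (m) => m[1]);
console.log(contents);
```

On the testcase, group 1 should hold defg\\\"hi, mn\\\\ and the empty string, in that order.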

Whothehellisthat

Posted 2016-05-20T10:04:24.667

Reputation: 129

This doesn't at all work with escaped quotes, and thus fails to meet the spec. – ATaco – 2017-02-15T21:12:46.967

Ah yes--sorry. How about that? – Whothehellisthat – 2017-02-15T22:20:30.803

Close but no cigar, you shouldn't be matching the outer "s – ATaco – 2017-02-15T22:22:57.373

Yeah, that's what I was afraid of. No way of doing it in JavaScript, I'm guessing? – Whothehellisthat – 2017-02-15T22:52:29.760

You can capture it in a subgroup – ATaco – 2017-02-15T22:53:15.733

But it would still match the whole thing, right? – Whothehellisthat – 2017-02-15T22:54:43.130

The "Whole thing" may match it, but the contents need to be in a distinguishable capture group separate from the "s. – ATaco – 2017-02-15T22:55:30.753

So that would be okay? It was unclear from the original post, as it only talked about "matches." – Whothehellisthat – 2017-02-15T22:56:44.743

To my knowledge, yes. – ATaco – 2017-02-15T22:57:23.963

Basically, I knew how to group just the contents; I didn't know what was allowed to complete the task. – Whothehellisthat – 2017-02-15T22:57:39.547

Cool. Thanks for your help, @ATaco! – Whothehellisthat – 2017-02-15T22:57:53.503

0

JavaScript, 21 15 13 12 bytes

"((\\?.)*?)"

String contents are in group 1.
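
A minimal usage sketch, assuming the global flag is added and only group 1 is read from each match. Group 2 merely holds the last repetition of the inner group and is ignored here. The variable names are chosen for illustration.

```javascript
// The answer's regex, with the g flag added so every string is found.
const re = /"((\\?.)*?)"/g;

// The testcase from the question; String.raw keeps the backslashes literal.
const testcase = String.raw`ab\c"defg\\\"hi"jkl"mn\\\\"opqrst""`;
const contents = Array.from(testcase.matchAll(re), (m) => m[1]);
console.log(contents);
```

On the testcase, group 1 should hold defg\\\"hi, mn\\\\ and the empty string, in that order. The lazy quantifier stops at the first quote that is not consumed as part of a \\?. pair, which is why an escaped quote does not end the match.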

"   #start of string
(    #capturing group
 (
  \\?. #match character or escaped character
 )*?  #match as few as possible
)        
"   #end of string

12Me21

Posted 2016-05-20T10:04:24.667

Reputation: 6 110