sed: match a string between two different consecutive matches on all occurrences

2

I have:

bananaOPENqwertyCLOSErandomtextOPENgrapesCLOSEwhateverOPENsunshineCLOSEgreymoon

This line could have many more OPEN and CLOSE strings in it.

I want to print only the text between each consecutive OPEN and CLOSE pair and discard everything else, i.e. I want output like this:

qwertygrapessunshine

The closest I can think of is: sed -n 's/OPEN\(.*\)CLOSE/\1/g;p' which obviously doesn't work.

cablewelo2ma

Posted 2019-08-02T21:01:47.773

Reputation: 33

Answers

0

Because sed matches are "greedy" (more precisely, leftmost-longest), this is tricky. Try:

$ sed 's/OPEN/\n/g; s/[^\n]*\n//; s/CLOSE[^\n]*\n//g; s/CLOSE.*$//' file
qwertygrapessunshine

The above was tested on GNU sed. If you are on BSD/MacOS, some minor but annoying changes will likely be required.
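To make the greedy behaviour concrete, here is a small sketch that runs both the attempt from the question and the command above on the sample line. It assumes GNU sed; the variable names `line`, `greedy`, and `fixed` are only illustrative.

```shell
line='bananaOPENqwertyCLOSErandomtextOPENgrapesCLOSEwhateverOPENsunshineCLOSEgreymoon'

# Attempt from the question: .* runs from the first OPEN to the LAST CLOSE,
# so only the outermost pair of markers is stripped.
greedy=$(printf '%s\n' "$line" | sed 's/OPEN\(.*\)CLOSE/\1/g')
printf '%s\n' "$greedy"
# bananaqwertyCLOSErandomtextOPENgrapesCLOSEwhateverOPENsunshinegreymoon

# Newline-marker command from this answer (GNU sed).
fixed=$(printf '%s\n' "$line" | sed 's/OPEN/\n/g; s/[^\n]*\n//; s/CLOSE[^\n]*\n//g; s/CLOSE.*$//')
printf '%s\n' "$fixed"
# qwertygrapessunshine
```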

How it works

Remember that sed, by default, reads in one line at a time into its pattern space. This means that, when we start processing a pattern space, it will never contain a newline character. Thus, we can use a newline character, \n, as a marker with no possibility of ambiguity.

  • s/OPEN/\n/g

    Replace every OPEN with a newline.

    Since the pattern space starts out with no newline in it, every newline now marks a spot where an OPEN used to be.

  • s/[^\n]*\n//

    Remove everything up to and including the first newline, that is, the text before the first OPEN together with the marker itself.

    Note that [^\n]* matches zero or more of anything except a newline character. Consequently, [^\n]*\n matches zero or more non-newline characters followed by a newline. This means it matches up to and including the next newline. By contrast, because sed expressions are "greedy" (leftmost-longest), .*\n matches anything up to and including the last newline in the pattern space.

  • s/CLOSE[^\n]*\n//g

    Remove everything from each CLOSE up to and including the next newline (the marker for the next OPEN).

  • s/CLOSE.*$//

    Remove the one remaining CLOSE (the last one) and everything after it, to the end of the line.
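The four steps can be traced one at a time. This is only a sketch, assuming GNU sed: the helper `show` is a hypothetical name, and it appends an extra `s/\n/|/g` purely for display, so that any newline markers still in the pattern space print as `|`.

```shell
line='bananaOPENqwertyCLOSErandomtextOPENgrapesCLOSEwhateverOPENsunshineCLOSEgreymoon'

# show SCRIPT: run SCRIPT on the sample line, then print any newline markers
# still in the pattern space as '|' so everything stays on one line.
show() { printf '%s\n' "$line" | sed "$1; s/\n/|/g"; }

show 's/OPEN/\n/g'
# banana|qwertyCLOSErandomtext|grapesCLOSEwhatever|sunshineCLOSEgreymoon

show 's/OPEN/\n/g; s/[^\n]*\n//'
# qwertyCLOSErandomtext|grapesCLOSEwhatever|sunshineCLOSEgreymoon

show 's/OPEN/\n/g; s/[^\n]*\n//; s/CLOSE[^\n]*\n//g'
# qwertygrapessunshineCLOSEgreymoon

show 's/OPEN/\n/g; s/[^\n]*\n//; s/CLOSE[^\n]*\n//g; s/CLOSE.*$//'
# qwertygrapessunshine

# Contrast for step 2: greedy .*\n runs to the LAST marker instead of the first.
show 's/OPEN/\n/g; s/.*\n//'
# sunshineCLOSEgreymoon
```

The last call shows why the bounded form [^\n]*\n is used: with .*\n, the first two extracted strings are thrown away along with the leading text.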

John1024

Posted 2019-08-02T21:01:47.773

Reputation: 13 893

Beautifully explained. Could you also tell me what [^\n]* is and how it differs from things like .*? – cablewelo2ma – 2019-08-02T21:41:46.263

Thank you. [^\n]* matches zero or more of anything except a newline character. [^\n]*\n matches zero or more non-newline characters followed by a newline. This means it matches up to and including the next newline. By contrast, .*\n matches up to and including the last newline in the pattern space. – John1024 – 2019-08-02T21:55:30.323

Understood. Thank you – cablewelo2ma – 2019-08-02T22:07:59.890