Text deletion between patterns across multiple lines with respect to text inside pattern

I have a block of text that I need to delete, but only if the block contains specific text:

...
<script language="JavaScript">
    var somethingA = 0;
    var somethingB = 0;
    var somethingC = 0;
    // do some stuff
</script>

<script language="JavaScript">
    var somethingA = 0;
    var somethingC = 0;
    var somethingD = 0;
    // do some stuff
</script>
....

I want to remove only the <script> block that has var somethingB in it. There could be any number of <script> blocks in the file in any position.

I was hoping to use sed doing something like:

sed 's/<script/,/<\/script>/ D'

However, I can't figure out how to only delete the block with var somethingB in it.

PS: I could also use perl or awk. I would rather use sed for consistency's sake, but if it is easier in perl and/or awk I would switch gears pretty quickly at this point. Thanks!

Matt

Posted 2014-12-29T20:57:29.600

Reputation: 101

The canonical "don't parse html with regex" answer...

– glenn jackman – 2014-12-30T02:32:38.227

@glennjackman Let's pretend it's not HTML ;-) – Matt – 2014-12-30T13:56:50.447

Answers

If a partial solution in vim is acceptable:

:%s/<script [^<]*\(\n[^<]*\)*somethingB.*\(\n[^<]*\)*<\/script>//g

but it won't work if there are other tags inside the <script> block: [^<] cannot match a < character, so the block itself must not contain one.

Sébastien Guarnay

Posted 2014-12-29T20:57:29.600

Reputation: 21

I do not have a simple solution. This one uses awk and codes the needed algorithm in awk's C-like language. It assumes the text to filter is in a file called 'filename':

awk 'BEGIN { curr = 0 }
     # Start of a block: begin buffering and reset the delete flag
     /<script .*>/ { in_block = 1; del_block = 0 }
     # End of a block: stop buffering and request block end handling
     /<\/script>/ { in_block = 0; blockend = 1 }
     # Mark the block for deletion if the text occurs inside it
     /var[[:space:]]+somethingB/ { if (in_block == 1) del_block = 1 }
     {
       if (in_block == 1) {
         # In a block. Save the current line for later
         line[curr] = $0
         curr++
       } else if (blockend == 1) {
         # End of a block reached. Do block end handling
         # just this one time. Block end flag off
         blockend = 0
         if (del_block == 0) {
           # End of block and no delete. Print it out
           for (i = 0; i < curr; i++)
             print line[i]
           print   # Print the </script> line
         }
         # Otherwise the buffered lines are simply thrown away.
         # Either way, reuse the line array for the next block
         curr = 0
       } else {
         # Neither in a block nor block end reached.
         # Just print the line
         print
       }
     }' filename

The pattern for </script> (the end marker of a block) is rather strict: it expects the tag to be written exactly like that, without any spaces. If the tag can contain whitespace, you may want to write the pattern like this:

/<[[:space:]]*\/script[[:space:]]*>/

The pattern for var somethingB is var, followed by one or more whitespace characters, followed by somethingB, which is probably what you are looking for. If you want to require exactly one space between var and somethingB, it is simpler: /var somethingB/
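
To illustrate how the buffering and the two patterns fit together, here is a more compact sketch of the same idea. It is a variant written under assumptions, not the script above: the block is collected in a string (`buf`, a name chosen here) instead of an array, the sample input is made up, and `[ \t]` stands in for `[[:space:]]` so that it also runs on awk versions without POSIX character classes.

```shell
# Made-up sample input; the first block uses extra spaces after "var"
# and a space inside the closing tag.
sample=$(cat <<'EOF'
before
<script language="JavaScript">
    var   somethingB = 0;
</script >
middle
<script language="JavaScript">
    var somethingD = 0;
</script>
after
EOF
)

result=$(printf '%s\n' "$sample" | awk '
    # An opening tag starts a new, empty buffer
    /<script/ { in_block = 1; buf = "" }
    in_block {
        buf = buf $0 "\n"
        # Whitespace-tolerant end marker
        if ($0 ~ /<[ \t]*\/script[ \t]*>/) {
            # Print the buffered block only if it lacks the marker text
            if (buf !~ /var[ \t]+somethingB/) printf "%s", buf
            in_block = 0
        }
        next
    }
    # Lines outside any block pass through unchanged
    { print }
')
printf '%s\n' "$result"
```

Because the decision is made when the closing tag is seen, lines outside a block are printed immediately and only one block is held in memory at a time.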

Markus

Posted 2014-12-29T20:57:29.600

Reputation: 11

This should be doable in sed directly. As I'm no sed wizard, I need two runs.

In the first run we prepare the file to ensure that the <script>...</script> blocks are surrounded by blank lines:
```
sed -e '/<script/i\ ' -e '/script>/a\ ' code.js
```
It's no rocket science: i inserts a line before each line matching a pattern, and a appends a line after each line matching a pattern. In both cases the line consists of a single blank only.

This is needed so that sed detects every block separately (i.e. non-greedily) in the second step.
The second run kills the blocks with var somethingB in them:
```
sed '/<script/,/script>/{H;d;};x;/var somethingB/d'
```
- /<script/,/script>/{H;d;} moves a block into sed's hold space (H appends the pattern space to the hold space, d deletes the pattern space)
- x exchanges the hold space with the pattern space
- if the pattern /var somethingB/ matches, d deletes the pattern space, which holds the complete <script> block
- finally, sed implicitly prints the pattern space.
  
My reference here was the Unix Sed Tutorial.

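
One consequence of the x step is worth seeing in isolation: because x swaps rather than copies, every line leaves sed one cycle late, and whatever is still in the hold space at the end of input is never printed. A minimal sketch with made-up input:

```shell
# Three input lines; x swaps each one with the current hold space content.
result=$(printf 'a\nb\nc\n' | sed 'x')
printf '%s\n' "$result"
# The first output line is the initially empty hold space, followed by
# "a" and "b"; "c" is still in the hold space when the input ends.
```

The padding lines added in the first run help here, since a line that gets swallowed together with a deleted block, or left behind at the end, may then be one of the added blanks rather than real content.
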
So, in one command line with a nice pipe:
```
sed -e '/<script/i\ ' -e '/script>/a\ ' code.js | sed '/<script/,/script>/{H;d;};x;/var somethingB/d'
```
If you want to, use a third sed instance to get rid of the additional empty lines:
```
sed '/^ $/d'
```
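If the padding lines and the extra sed processes are unwanted, the hold-space idea can also be sketched in a single pass. This is a variant under assumptions, not the pipeline above: the sample input is made up, and the sketch assumes that <script and </script> always sit on separate lines. Lines outside a block are printed right away, so nothing is held back at the end of the input.

```shell
# Made-up sample input with one block to delete and one to keep.
sample=$(cat <<'EOF'
before
<script language="JavaScript">
    var somethingA = 0;
    var somethingB = 0;
</script>
middle
<script language="JavaScript">
    var somethingA = 0;
    var somethingD = 0;
</script>
after
EOF
)

result=$(printf '%s\n' "$sample" | sed '/<script/,/<\/script>/{
  # First line of a block: start a fresh hold space with it
  /<script/{h;d;}
  # Any later line of the block: append it to the hold space
  H
  # Not yet at the closing tag: print nothing and read the next line
  /<\/script>/!d
  # Closing tag reached: bring the whole block into the pattern space
  x
  # Drop the block if it mentions the variable, else let sed print it
  /var somethingB/d
}')
printf '%s\n' "$result"
```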

mpy

Posted 2014-12-29T20:57:29.600

Reputation: 20 866

@Matt: Could you please let me know whether you are satisfied with my sed approach? – mpy – 2015-01-22T17:41:43.100

This solution is awesome! For some reason I missed it in my Inbox and never saw it until now. The only downside is it removes the last line in the file. But otherwise it works perfect. – Matt – 2016-03-20T22:15:36.920