Read XML tags and then remove XML tags using a shell script

You can use pup, a command line tool for processing HTML. For XML you can use xpup.

For example, to find parts for removal, run:

$ pup ':parent-of(:parent-of(:contains("154")))' <file.html
<b>
 <c>
  <st>
   154
  </st>
 </c>
 <d>
  <st>
   1457954
  </st>
 </d>
</b>

To remove this section from the input using sed (where file.html is your HTML file), run:

 sed "s@$(pup ':parent-of(:parent-of(:contains("154")))' <file.html | xargs | tr -d " ")@@g" <(xargs <file.html | tr -d " ")

Notes:

We use xargs <file.html | tr -d " " to flatten the file into a single line without spaces.
We use the pup command shown above to find the pattern to remove.
We use sed to remove the pattern: sed "s@PATTERN@@g" <(input).
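To edit in place, add -i for GNU sed, or -i '' for BSD sed (-i.bak works with both and keeps a backup). In-place editing needs a regular file, so it does not work on the <(...) process substitution used above; write the flattened input to a file first.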
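The flatten-and-delete step can also be tried on its own, without pup installed. The following is only a sketch: the sample file content is made up, and the pattern is hard-coded as a stand-in for what the pup command would return after flattening.

```shell
# Hypothetical sample input; the nesting mirrors the pup output above.
tmp=$(mktemp)
cat >"$tmp" <<'EOF'
<a>
 <b>
  <c><st>154</st></c>
  <d><st>1457954</st></d>
 </b>
 <e><st>7</st></e>
</a>
EOF

# Flatten: xargs joins all whitespace-separated tokens with single spaces,
# then tr deletes those spaces.
flat=$(xargs <"$tmp" | tr -d ' ')

# Hard-coded stand-in for the flattened pup output (an assumption).
remove='<b><c><st>154</st></c><d><st>1457954</st></d></b>'

# Delete the pattern. Using @ as the delimiter means the slashes in the
# closing tags need no escaping.
result=$(printf '%s\n' "$flat" | sed "s@$remove@@g")
echo "$result"   # prints <a><e><st>7</st></e></a>

rm -f "$tmp"
```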

To make the steps easier to follow, the same thing can be written as a script:

function flat_it() { xargs | tr -d " "; }
input=$(flat_it <file.html)
remove=$(pup ':parent-of(:parent-of(:contains("154")))' <<<"$input" | flat_it)
sed "s@$remove@@g" <<<"$input"
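One caveat with this script: $remove is interpolated into sed as a regular expression, so characters such as . * [ or the @ delimiter inside the matched content are not taken literally. Below is a minimal sketch of an escaping step, assuming basic regular expressions. The name escape_bre and the sample strings are hypothetical and not part of the original script.

```shell
# Hypothetical helper: backslash-escape the characters that are special in a
# basic regular expression, plus the @ delimiter used in the s@...@@ command.
escape_bre() { sed 's/[[\.*^$@]/\\&/g'; }

flat='<a><st>1.5*[x]@y</st><st>1x5</st></a>'   # made-up flattened input
remove='<st>1.5*[x]@y</st>'                     # made-up flattened pattern

# Escape the pattern first, then delete it from the flattened input.
escaped=$(printf '%s\n' "$remove" | escape_bre)
result=$(printf '%s\n' "$flat" | sed "s@$escaped@@g")
echo "$result"
```

Without the escaping step, the @ inside the pattern would end the sed expression early and the command would fail.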

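Note: The disadvantage of flattening with xargs | tr -d " " is that it removes all spaces, including those inside the content. xargs also interprets quotes and backslashes, so quoted attribute values lose their quotes. To avoid this, the input has to be flattened some other way.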

Instead of xargs | tr -d " ", you can use sed, ex or paste.
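As a rough sketch of the sed and paste route (an assumption about how the two could be combined, not code from the original answer), this flat_it trims the indentation on each line and joins the lines, leaving spaces inside the content alone:

```shell
# Sketch of a flattener built from sed and paste.
# sed trims leading and trailing whitespace on every line; paste -s joins all
# lines into one, and the '\0' delimiter stands for the empty string.
flat_it() {
  sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' | paste -sd '\0' -
}

# Made-up sample input with a space inside the content.
out=$(printf '<c>\n  <st>\n    hello world\n  </st>\n</c>\n' | flat_it)
echo "$out"   # prints <c><st>hello world</st></c>
```

Because both the file and the pup output go through the same function, the flattened pattern still matches the flattened input.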

Here is the example using ex:

ex +%j +"s/[><]\zs //g" +%p -scq! file.html

And here is the version with shell function (which can replace the previous version):

function flat_it() { ex +%j +"s/[><]\zs //g" +%p -scq! /dev/stdin; }

kenorb

Posted 2017-01-18T06:57:11.803

