Read XML tags and then remove XML tags using a shell script

1

Given the following input:

<start>
   <header>
      This is header section
   </header>
   <body>
      <body_start>
         This is body section
         <a>
            <b>
               <c>
                  <st>111</st>
               </c>
               <d>
                  <st>blank</st>
               </d>
            </b>
         </a>
      </body_start>
      <body_section>
         This is body section
         <a>
            <b>
               <c>
                  <st>5</st>
               </c>
               <d>
                  <st>666</st>
               </d>
            </b>
            <b>
               <c>
                  <st>154</st>
               </c>
               <d>
                  <st>1457954</st>
               </d>
            </b>
            <b>
               <c>
                  <st>845034</st>
               </c>
               <d>
                  <st>blank</st>
               </d>
            </b>
         </a>
      </body_section>
   </body>
</start>

I'd like to perform the following parsing.

If the st value inside a c tag is 154, then the whole enclosing <b> to </b> block needs to be removed. Note that the value 154 may or may not be present in the file.

So, if the value 154 is present, then the removal of the following part is needed:

<b>
   <c>
      <st>154</st>
   </c>
   <d>
      <st>1457954</st>
   </d>
</b>

I want to do this in a shell script. I cannot use XSLT because my system does not support it.

rjg

Posted 2017-01-18T06:57:11.803

Reputation: 11

Question was closed 2019-02-28T11:51:05.697

I think sed isn't the ideal tool for this task. You should use Perl, PHP or a similar language, or an XML-related tool. – uzsolt – 2017-01-18T08:33:54.820

Why reinvent the wheel when nearly every Unix-based system has xmlstarlet in its repos? – Alex – 2017-01-18T08:35:45.727

Answers

0

You can use pup, a command line tool for processing HTML. For XML you can use xpup.

For example, to find parts for removal, run:

$ pup ':parent-of(:parent-of(:contains("154")))' <file.html
<b>
 <c>
  <st>
   154
  </st>
 </c>
 <d>
  <st>
   1457954
  </st>
 </d>
</b>

To remove this section from the input using sed (where file.html is your HTML file), run:

 sed "s@$(pup ':parent-of(:parent-of(:contains("154")))' <file.html | xargs | tr -d " ")@@g" <(xargs <file.html | tr -d " ")

Notes:

  • We use xargs <file.html | tr -d " " to flatten the file into a single line without spaces.
  • We use the pup command shown above to find the pattern for removal.
  • We use sed to remove the pattern by: sed "s@PATTERN@@g" <(input).
  • To replace in-place (by modifying the file), add -i for GNU sed, or -i'.bak' for BSD sed. Note that -i needs a regular file as input, not a process substitution such as <(...).
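To see the substitution step on its own, here is a minimal sketch that skips pup and hard-codes both strings. The values of flat and remove are made up for illustration; in the real command they come from the flattened file and from the pup output. Using @ as the sed delimiter means the / in the closing tags needs no escaping.

```shell
# Hypothetical, already-flattened input and the block to delete.
flat='<a><b><c><st>5</st></c></b><b><c><st>154</st></c></b></a>'
remove='<b><c><st>154</st></c></b>'

# Delete every occurrence of the block; @ is the s-command delimiter.
result=$(printf '%s\n' "$flat" | sed "s@$remove@@g")
printf '%s\n' "$result"
```

This only works while the block contains no characters that are special in a sed regex (such as . or *), which holds for the sample data in the question.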

For easier understanding, the following script can be used:

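# Flatten stdin to a single line and strip all spaces (bash).
flat_it() { xargs | tr -d " "; }
input=$(flat_it <file.html)
# Find the <b> block to remove and flatten it the same way.
remove=$(pup ':parent-of(:parent-of(:contains("154")))' <<<"$input" | flat_it)
sed "s@$remove@@g" <<<"$input"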

Note: The disadvantage of the above method is that all spaces are removed, including those inside the content. To avoid that, the input has to be flattened in some other way.

So instead of xargs | tr -d " ", you can use sed, ex or paste.
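As a rough sketch of the sed route (my own variant, not tested against pup's output), the function below trims the indentation on each line and then drops the newlines with tr, so spaces inside the text content survive. It assumes that whitespace at the start and end of a line is never significant.

```shell
# Sketch: join lines without touching spaces inside the text content.
flat_it() {
  # Trim leading/trailing whitespace on every line, then delete the newlines.
  sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' | tr -d '\n'
}

printf '<header>\n   This is header section\n</header>\n' | flat_it
```

The sample call prints <header>This is header section</header> on one line, with the spaces between the words kept.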

Here is an example using ex:

ex +%j +"s/[><]\zs //g" +%p -scq! file.html

And here is the same command as a shell function (which can replace the earlier flat_it):

function flat_it() { ex +%j +"s/[><]\zs //g" +%p -scq! /dev/stdin; }
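If pup is not available, a different technique is to skip flattening entirely and filter the file line by line with awk. This is only a sketch under strong assumptions: the file is laid out as in the question (each tag on its own line, <b> blocks not nested inside each other), and remove_b_block is a name I made up. It buffers every <b> block and prints it only if no <st> inside a <c> holds the given value.

```shell
# Sketch: drop each <b>...</b> block whose <c> contains <st>VALUE</st>.
# Assumes one tag per line and no nested <b> elements.
remove_b_block() {
  awk -v val="$1" '
    # A line holding only <b> starts a new buffered block.
    /^[[:space:]]*<b>[[:space:]]*$/ { inb = 1; buf = ""; inc = 0; hit = 0 }
    inb {
      buf = buf $0 "\n"
      if ($0 ~ /<c>/) inc = 1
      # Only an exact <st>VALUE</st> inside <c> marks the block for removal.
      if (inc && index($0, "<st>" val "</st>")) hit = 1
      if ($0 ~ /<\/c>/) inc = 0
      # At </b>, emit the buffered block unless it was marked.
      if ($0 ~ /<\/b>/) { if (!hit) printf "%s", buf; inb = 0 }
      next
    }
    { print }
  '
}
```

Usage would look like remove_b_block 154 <file.html >file.new.html (the output file name is just an example). Because the match is on the full <st>154</st> element inside <c>, a value such as 1457954 in <d> does not trigger a removal, and the original indentation is left alone.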

kenorb

Posted 2017-01-18T06:57:11.803

Reputation: 16 795