Match and remove first and second pattern within xml tags

0

How can I match and remove first and second pattern within xml tags using sed or awk?

Here is the example

<data>A78-1-1134-HI-1</data>
<data>T78-12-1346-AG-2</data>
<data>G78-4-2156-Ag-6</data>
<data>A78-10-1971-Hh-10</data>

This is the result I am trying to get:

<data>1134</data>
<data>1346</data>
<data>2156</data>
<data>1971</data>

Can it be done in one line? This is what I tried:

sed 's/^.*<data>[[:alnum:]]-[0-9]-/<data>/g;s/-[a-Z].*<\/data>$//g'

Removing just the first pattern works when I use sed -n with the p flag:

sed -n 's/^.*<data>.*[[:alnum:]]-[0-9]-/<data>/p' file.xml | grep data

But then this command will not work:

sed 's/^.*<data>.*[[:alnum:]]-[0-9]-/<data>/' file.xml

milan_K

Posted 2013-04-20T15:58:27.740

Reputation: 3

Answers

0

Here are a few solutions:

  1. If your file is really as simple as your example, you can do it with this gawk scriptlet. This assumes that your file consists of nothing but data entries as described in your question.

    gawk -F"-" '{print "<data>"$3"</data>"}' file.xml
    
    • -F"-" tells gawk to use - as the field separator; the script then prints the 3rd field wrapped in <data> tags.


  2. For slightly more complex files that include lines you do not want, this will print a line only if both its first field ($1~/data/) and its last field ($NF~/data/) contain the string data:

    gawk -F"-" '($1~/data/ && $NF~/data/){print "<data>"$3"</data>"}' file.xml
    
  3. If your file can have many <data> entries and you only care about those that look like A1-2B-C3-4D:

    perl -ne '/(<data>).+?-.+?-(.+?)\-.+(<\/data>)/ && do{print "$1$2$3\n"}' file.xml
    

    -n runs the script once for each line of the input file without printing anything automatically, and -e supplies the script on the command line. In Perl (and many other tools), parentheses capture parts of a regular expression match. Here I capture three groups: the opening and closing tags ($1 and $3), so I don't need to type them twice, and the value we are looking for ($2).

    If you need to be more specific, use this to allow only alphanumeric characters in the first field and only digits in the others:

    perl -ne '/(<data>)[\w\d]+?-\d+?-(\d+?)\-.+(<\/data>)/ && do{print "$1$2$3\n"}' file.xml
    
  4. This all assumes that your <data> and </data> tags are on the same line. If they are not, you can do something like this:

    perl -ne '
     $d++ if /<data>/; 
      /[\w\d]+?-\d+?-(\d+?)\-.+/ && do{
                 print "<data>$1</data>\n" if $d>0
            }; 
     $d-- if /<\/data>/; 
    ' file.xml
    

    $d will be positive if we are within <data></data> tags. If we are and find a line that matches the regular expression, print.
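The asker's file has other tags and indentation in front of <data>. Options 1 and 2 rebuild each line from scratch, so the indentation is dropped, and option 1 also rewrites lines that are not <data> entries. The following is only a minimal sketch of one way around that, assuming a POSIX awk and a made-up sample held in a shell variable; the <root> and <name> tags are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample: indented <data> lines mixed with other tags.
sample='<root>
  <name>A78-1-1134-HI-1</name>
  <data>A78-1-1134-HI-1</data>
  <data>T78-12-1346-AG-2</data>
</root>'

# Split on "-" and rewrite only lines whose first field holds <data> and
# whose last field holds </data>. The indentation is kept by reusing
# everything in $1 up to and including the opening tag.
result=$(printf '%s\n' "$sample" | awk -F'-' '
  $1 ~ /<data>/ && $NF ~ /<\/data>/ {
    prefix = $1
    sub(/<data>.*/, "<data>", prefix)   # keep indentation + opening tag
    print prefix $3 "</data>"
    next
  }
  { print }                             # pass every other line through
')
printf '%s\n' "$result"
```

Only the two <data> lines change; the <name> line, which has the same dash-separated shape, is passed through untouched.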


UPDATE:

If you want to change the original file rather than just print the result, add -i to edit it in place:

perl -i -ne 's/(<data>).+?-.+?-(.+?)\-.+(<\/data>)/$1$2$3/; print' file.xml

terdon

Posted 2013-04-20T15:58:27.740

Reputation: 45 216

The perl command from your solution #3 prints the correct data. How do I change the command so that it actually modifies the XML file? <data> tags are not the only tags in the file, and there are spaces in front of the <data> tags. – milan_K – 2013-04-20T19:03:56.120

@user2302372, see updated answer. – terdon – 2013-04-20T19:36:40.197

Perfect! That was exactly what I need. – milan_K – 2013-04-20T19:45:40.547

1

You're using the wrong tools for the job. Don't parse XML with regular expressions; you will get it wrong. That's (a) because it's theoretically impossible (XML is not a regular language) and (b) because your practical attempts might work on some XML documents but will inevitably fail on others.

With XSLT 2.0 this is a trivial transformation.

<xsl:template match="data">
  <xsl:copy>
    <xsl:value-of select="tokenize(., '-')[3]"/>
  </xsl:copy>
</xsl:template>

Michael Kay

Posted 2013-04-20T15:58:27.740

Reputation: 379

0

It appears that your repetitions are not specified correctly. Also, I find it easier to use subexpressions to extract substrings. I don't know your exact specifications for matching data, but this works for your sample data in the question (I think this is POSIX compliant):

sed 's/<data>[[:alnum:]]\{1,\}-[0-9]\{1,\}-\([0-9]\{1,\}\)-[[:alnum:]]\{1,\}-[0-9]\{1,\}/<data>\1/' file.xml

If you have GNU sed at your disposal, you can use its -r option to switch to Extended Regular Expressions, which gives a simpler expression:

sed -r 's/^.*<data>[[:alnum:]]+-[0-9]+-([0-9]+)-[[:alnum:]]+-[0-9]+/<data>\1/' file.xml
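Note that both commands print the edited text to standard output and leave file.xml itself untouched. A minimal, portable sketch of writing the result back, assuming a throwaway file with made-up contents (GNU sed also offers -i for this, but the redirect-and-move form works with any POSIX sed):

```shell
#!/bin/sh
# Throwaway file with indented <data>, similar to what the asker describes.
tmpdir=$(mktemp -d)
file="$tmpdir/file.xml"
printf '%s\n' '<root>' '    <data>G78-4-2156-Ag-6</data>' '</root>' > "$file"

# sed writes to stdout, so capture the output in a temporary file and
# move it over the original only if sed succeeded.
sed 's/<data>[[:alnum:]]\{1,\}-[0-9]\{1,\}-\([0-9]\{1,\}\)-[[:alnum:]]\{1,\}-[0-9]\{1,\}/<data>\1/' "$file" > "$file.new" &&
  mv "$file.new" "$file"

cat "$file"
```

Because the pattern is not anchored with ^, the spaces in front of <data> are left in place.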

depquid

Posted 2013-04-20T15:58:27.740

Reputation: 335

I tried both commands without success; nothing changes. This is a rather large XML file, and I only sampled the tags that need changing. The tags are on the same line and there are spaces in front of the <data> tag. – milan_K – 2013-04-20T18:47:27.740