using sed/awk to parse XML tags

For the record, I've spent several days working on this with no luck.

I'm working with XML files with data containing something like the following:

<row id="67581917031" name="4022" filesize="22425" file_content_id="67581868031" lastmodify_datetime="1187126570050" group_id="67581916031"/> <row id="254115371041" name="4022" filesize="49471" file_content_id="254115361041" lastmodify_datetime="1220512827666" group_id="253405951041"/> <row id="286104505041" name="4022" filesize="3802672" file_content_id="286104455041" lastmodify_datetime="1223348052489" group_id="286104504041"/> <row id="289541609041" name="4022" filesize="42235" file_content_id="264826268041" lastmodify_datetime="1223587308419" group_id="289541607041"/> <row id="306643757002" name="4022" filesize="392560" file_content_id="243411753011" lastmodify_datetime="1218251898489" group_id="67581916031"/> <row id="367316910041" name="4022" filesize="381083" file_content_id="367316830041" lastmodify_datetime="1232592570004" group_id="74169006021"/>

If you look carefully, you will find that two of these records have the same "name" and "group_id". I'm trying to write a script that will find these rows and spit out the row ID, name, and group_id in question. What I was hoping to do was use sed to pick up on the end of each "row" and insert a newline (\n), so that I could then use nl to print out the number of lines and store that number in a variable. I would then use a for loop to run an awk command that pattern matches each row id, name, and group_id, somehow checks whether the name and group_id match any other row, and, if they do, prints out the row id and name.

Chris Olin

Posted 2013-06-11T18:20:10.240

Reputation: 167

It might be easier to use an XML parser for this. – suspectus – 2013-06-11T18:52:03.843

Answers

If you are looking for rows that have the same name AND group_id, you could do something like this. I am assuming you are on a *nix OS, since you don't say in your question; if so, you can paste this directly into the command line:

sed 's#/>#/>\n#g' simple_file.xml |
    perl -ne 'if (/row id=.(.+?)\".+name=.(.+?)\".+group_id=.(.+?)\"/) {
                  push @{$k{join("\t",$2,$3)}},$1;
              }
              END {
                  foreach (keys(%k)) {
                      if ($#{$k{$_}} > 0) {
                          print "$_\t", pop @{$k{$_}}, "\n";
                      }
                  }
              }'

EXPLANATION:

sed 's#/>#/>\n#g' simple_file.xml : Add a newline after each entry (after each />) to facilitate parsing.
perl -ne : process the file, line by line
/row id=.(.+?)\".+name=.(.+?)\".+group_id=.(.+?)\"/ : use a regex (which is generally a bad idea for [X]HTML files, you may have the blood of fluffy kittens on your hands) to get the row_id, name and group_id. Each . before a capture group matches the opening quote, and each (.+?)\" captures everything up to the next double quote. The three values are saved as $1, $2 and $3 respectively.
push @{$k{join("\t",$2,$3)}},$1; : This is a bit more complex. It builds a hash of arrays called %k. join connects the name and group_id with a tab to form the hash key, and push appends the row_id to the array stored under that key. In other words, if your row_id is 123, your name is 456 and your group_id is 789, the array stored in %k under the key "456<TAB>789" gains the element 123.
The END{} block is executed once, when the rest of the file has been processed. It goes through each of the keys of the hash (whose values are arrays) and prints those cases where the array has more than one entry, in other words, the duplicates. The pop function removes and returns the last element of an array, so only the most recently seen row_id of each duplicate group is printed.

I ran this on your example and got this output:

4022    67581916031 306643757002
----    ----------  ------------
 |           |           |---------------> row id
 |           |---------------------------> group id
 |---------------------------------------> name
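
Since the question asked about awk, here is a rough awk-only sketch of the same grouping idea. It is not the perl command above, just an illustration under some assumptions: attribute values are double-quoted and contain no > or " characters, and the function name find_dupes is made up for this example. It is still regex-based, so the same warnings apply. Unlike the perl version, it prints every row id in a duplicate group instead of only the last one.

```sh
#!/bin/sh
# Sketch only: find_dupes is a hypothetical helper name, not part of any tool.
# Assumes double-quoted attribute values that contain no ">" or double quotes.
find_dupes() {
    awk '
    # Return the value of attribute `name` in record `rec`, or "" if absent.
    function attr(rec, name,    re) {
        re = "[ \t\n]" name "=\"[^\"]*\""
        if (!match(rec, re)) return ""
        # The match is: one whitespace char, the name, =", the value, ".
        return substr(rec, RSTART + length(name) + 3, RLENGTH - length(name) - 4)
    }
    BEGIN { RS = ">" }   # one record per tag, even if the file is a single line
    /<row[ \t\n]/ {
        key = attr($0, "name") "\t" attr($0, "group_id")
        id  = attr($0, "id")
        if (key in ids) ids[key] = ids[key] " " id
        else            ids[key] = id
        count[key]++
    }
    END {
        # Print name, group_id and all row ids for keys seen more than once.
        for (key in count)
            if (count[key] > 1) print key "\t" ids[key]
    }' "$@"
}
```

Calling find_dupes simple_file.xml on the six example rows from the question should print one tab-separated line: 4022, 67581916031, and then the two row ids 67581917031 306643757002. Setting RS to > means no separate sed step is needed to split the single-line file, at the cost of breaking on any attribute value that contains a > character.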

If you did not see the link in the second bullet point, I would just like to stress that You. Should. Never. Parse. [X]HTML. With. Regular. Expressions.

terdon

Posted 2013-06-11T18:20:10.240

Reputation: 45 216

Could you try running this script against this and see if you experience the same results?

I did not. When running the script that you provided, it didn't return anything (and yes, I am running a *NIX environment).

– Chris Olin – 2013-06-11T19:52:06.780

EDIT: I think I see why this isn't working for me. The code I provided appears to have automatically been formatted to put each row on a separate line. The file I'm running this against has everything on one line. – Chris Olin – 2013-06-11T19:58:13.333

Close, but it's getting caught up on rows with spaces in the 'name' value. I tried putting $2 in quotes, but it didn't seem to fix it. I'm more familiar with BASH than perl, so I'm not sure if quotes work the same way. – Chris Olin – 2013-06-11T20:12:21.883

Nevermind. It's the regex. I'm trying to rework the name regex to properly parse the various scenarios. – Chris Olin – 2013-06-11T20:19:23.830

@BinaryMan this kind of thing is why you should not parse HTML. Anyway, I updated it to look for all characters up to the first " for each of the fields, which works as long as your field values are always quoted. I also added an if so it only saves matching lines. Try the latest, it should work. – terdon – 2013-06-11T20:19:35.623

That's what I came up with too. Lesson learned. Thanks a bunch. It's much appreciated. – Chris Olin – 2013-06-11T20:33:13.410

The reason you spent several days on it without success is that you are using the wrong technology. If you want to parse XML, use an XML parser. – Michael Kay – 2013-06-12T08:01:44.520

It's axiomatic that it's not possible to safely parse XML with regular expressions. You need an XML parser.

You can parse known subsets of XML, but in practice that often turns out to be much harder than just learning to use an XML parser.

Medievalist

Posted 2013-06-11T18:20:10.240

Reputation: 109