Trimming an XML file the dumb way

(see solution below)

I have XML files which I parse with a Python script (I did not write it, but it does the job perfectly). The problem is that the XML file is large (~1 GB) and the parsing takes ages due to memory congestion. The XML file is full of useless information in certain elements. What would be the best way to get rid of them? I tried xmlstarlet, but it is too "XML-oriented", i.e. it takes ages for the same reasons as the Python script.

All I need to do is get rid of the given elements in a dumb way: remove everything between <mytag> and </mytag> throughout the file (there are multiple <mytag>...</mytag> pairs, all to be removed).

I would really appreciate your ideas since I am sure there are good ways to do that without reinventing the wheel.

Thank you!

EDIT: I finally ended up with

perl -pe "undef $/;s/<mytag>.*?<\/mytag>//msg" < inputfile.xml > outputfile.xml

which I did not realize @Vlad posted as well.
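For anyone who would rather stay in Python, here is a rough equivalent of that one-liner. It is only a sketch under the same assumptions as the Perl version: the whole file fits in memory, the tags are written exactly as <mytag> and </mytag> (no attributes), and they are not nested. The names strip_mytag and strip_file are made up for illustration.

```python
import re

# Non-greedy ".*?" stops at the nearest </mytag>; re.DOTALL lets "." match
# newlines, which is what the /s flag does in the Perl one-liner.
MYTAG = re.compile(rb"<mytag>.*?</mytag>", re.DOTALL)


def strip_mytag(data: bytes) -> bytes:
    """Return data with every <mytag>...</mytag> span removed."""
    return MYTAG.sub(b"", data)


def strip_file(src_path: str, dst_path: str) -> None:
    """Read the whole input file into memory, strip it, write the result."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(strip_mytag(data))
```

Note the non-greedy quantifier: with a greedy .* the match would run from the first <mytag> to the last </mytag> and take everything in between with it.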

WoJ

Posted 2012-02-16T10:12:23.850

Reputation: 1 580

Answers

When working with very large XML files, the recommended approach is to use a SAX-style, event-driven parser, which handles the document as a stream instead of loading all of it into memory. lxml can do that in Python; here's an excellent article on the topic: High-performance XML parsing in Python with lxml.
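To give an idea of what event-driven parsing looks like, here is a minimal sketch. It uses iterparse from the standard library's xml.etree.ElementTree rather than lxml (lxml has an iterparse of its own with a similar interface). The function name iter_records and the element names item and name are assumptions for the example, not taken from the actual file.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, record_tag):
    """Yield each <record_tag> element once it is complete, then release it."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            yield elem
            # Detach everything parsed so far from the root so that
            # already-processed records can be garbage collected.
            root.clear()


if __name__ == "__main__":
    sample = (
        b"<root>"
        b"<item><name>a</name><mytag>junk</mytag></item>"
        b"<item><name>b</name><mytag>more junk</mytag></item>"
        b"</root>"
    )
    for record in iter_records(io.BytesIO(sample), "item"):
        print(record.findtext("name"))
```

The point of the design is that only the record currently being handled stays reachable, so the unwanted elements never pile up in memory and do not have to be stripped from the file beforehand.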

Another option would be to use something like sed to remove those tags from the file.

Or a Perl script:

perl -i.bak -pe 'BEGIN{undef $/;} s/<mytag>.*?<\/mytag>//smg' file.xml

Vlad

Posted 2012-02-16T10:12:23.850

Reputation: 768

I tried to use sed (sed 's-<mytag>.*</mytag>--g' a.xml) but no luck so far – WoJ – 2012-02-16T12:40:19.717

@WoJ sed is line-based. There are tricks to use multi-line patterns but they mostly involve concatenating all lines into one and applying substitution over it. In your case this would probably not be very efficient. See http://austinmatzko.com/2008/04/26/sed-multi-line-search-and-replace/ and http://stackoverflow.com/questions/5710424/sed-multiline-range-match for a Perl alternative.

– Vlad – 2012-02-16T12:49:36.553
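The line-based limitation described in this comment can also be worked around without joining all lines, by remembering from one line to the next whether we are currently inside the element. The following is a sketch of that idea in Python, not a tested tool; it assumes the tags appear literally as <mytag> and </mytag> (no attributes, no nesting) and that the file does contain line breaks. The name strip_tag_stream is made up.

```python
def strip_tag_stream(lines, tag="mytag"):
    """Yield the input lines with every <tag>...</tag> span removed.

    Only one line is held in memory at a time; the "inside" flag carries
    the state across line boundaries.
    """
    open_tok, close_tok = f"<{tag}>", f"</{tag}>"
    inside = False
    for line in lines:
        kept = []
        pos = 0
        while True:
            if inside:
                end = line.find(close_tok, pos)
                if end == -1:
                    break  # the rest of this line belongs to the element
                pos = end + len(close_tok)
                inside = False
            else:
                start = line.find(open_tok, pos)
                if start == -1:
                    kept.append(line[pos:])  # no further opening tag
                    break
                kept.append(line[pos:start])
                pos = start + len(open_tok)
                inside = True
        yield "".join(kept)
```

Since an open text file iterates line by line, it can be used roughly like this:

```python
with open("inputfile.xml", encoding="utf-8") as src, \
        open("outputfile.xml", "w", encoding="utf-8") as dst:
    dst.writelines(strip_tag_stream(src))
```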

Sorry - I did not realize you posted the perl solution! This is what I also ended up with :) – WoJ – 2012-02-16T13:54:01.747

@WoJ - I'm curious how it went performance-wise. Care to share the details: input file size, output file size, processing time, OS, CPU and memory? – Vlad – 2012-02-16T14:26:07.230

Search and replace with a text editor that supports wildcards? Preferably one that doesn't try to load the whole file on opening (or it will take ages). Most hex editors also have text search-and-replace capabilities.

Rainer

Posted 2012-02-16T10:12:23.850

Reputation: 81