Efficiently extracting a few fields from a large XML file


I need to extract a few field contents from a large XML file. I currently do this through a combination of xmlstarlet and a Python script (using ElementTree). The idea was to trim the useless data from the XML file with xmlstarlet and then process the smaller file with Python (using Python directly on the original file was not doable: memory and CPU were hogged and some files never got processed). It basically works, but:

  • it is not efficient
  • it is not particularly flexible
  • it is quite ugly (the least of my concerns, but a concern nevertheless from a maintenance perspective)

I am looking for advice on how best to handle such a case (the amount of extracted data is about 5% of the initial file). I am open to anything reasonable: a specific language, maybe dumping the XML file into a DB and then extracting what I need before dropping the DB, ...

WoJ

Posted 2013-01-30T09:58:41.430

Reputation: 1 580

Answers


Are you using ElementTree's iterparse? It should be able to handle large inputs efficiently without building the whole tree in memory (which is usually where the wheels come off an XML parser).

You can find plenty of use cases and examples on Stack Overflow.
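A minimal sketch of the technique, assuming a hypothetical document made of repeated `<record>` elements from which only the `<name>` text is wanted (your element names and structure will differ):

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for a large file; in practice you
# would pass a filename or file object to iterparse instead.
xml_data = io.BytesIO(b"""<root>
  <record id="1"><name>alpha</name></record>
  <record id="2"><name>beta</name></record>
</root>""")

names = []
# iterparse yields each element as its end tag is seen, so the
# whole tree never has to be held in memory at once.
for event, elem in ET.iterparse(xml_data, events=("end",)):
    if elem.tag == "record":
        names.append(elem.findtext("name"))
        # Discard the element's children once processed to keep
        # memory usage flat across the whole file.
        elem.clear()

print(names)  # ['alpha', 'beta']
```

The `elem.clear()` call is the important part: without it, already-processed elements stay attached to the tree and memory grows with the file size anyway.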

mr.spuratic

Posted 2013-01-30T09:58:41.430

Reputation: 2 163

No, I am not. Thanks for the hint -- I will read & implement and be back with feedback (and mark the question as answered) – WoJ – 2013-01-30T11:46:41.490

The solution with iterparse works great. It improved the parsing time by at least an order of magnitude. I stumbled, however, on a problem, but I will open a separate question – WoJ – 2013-02-02T23:06:13.847