Efficiently extracting a few fields from a large XML file


I need to extract a few field contents from a large XML file. I currently do this through a combination of xmlstarlet and a Python script (using ElementTree). The idea was to trim the useless data from the XML file with xmlstarlet and then process the smaller file with Python (using Python directly on the original file was not doable: memory and CPU were hogged and some files never got processed). It basically works, but:

  • it is not efficient
  • it is not particularly flexible
  • it is quite ugly (the least of my concerns, but a concern nevertheless from a maintenance perspective)

I am looking for advice on how best to handle such a case (the amount of extracted data is about 5% of the initial file). I am open to anything reasonable: a specific language, maybe dumping the XML file into a DB and then extracting what I need before dropping the DB, ...

WoJ

Posted 2013-01-30T09:58:41.430

Reputation: 1 580

Answers


Are you using ElementTree's iterparse? It should be able to handle large inputs efficiently without building the whole tree in memory (which is usually where the wheels come off an XML parser).

You can find plenty of use cases and examples on Stack Overflow.
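A minimal sketch of the technique, assuming a hypothetical document made of repeated `<record>` elements from which only the `<name>` text is wanted (your element names and structure will differ):

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for a large file; in practice you
# would pass a filename or file object to iterparse instead.
xml_data = io.BytesIO(b"""<root>
  <record id="1"><name>alpha</name></record>
  <record id="2"><name>beta</name></record>
</root>""")

names = []
# iterparse yields each element as its end tag is seen, so the
# whole tree never has to be held in memory at once.
for event, elem in ET.iterparse(xml_data, events=("end",)):
    if elem.tag == "record":
        names.append(elem.findtext("name"))
        # Discard the element's children once processed to keep
        # memory usage flat across the whole file.
        elem.clear()

print(names)  # ['alpha', 'beta']
```

The `elem.clear()` call is the important part: without it, already-processed elements stay attached to the tree and memory grows with the file size anyway.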

mr.spuratic

Posted 2013-01-30T09:58:41.430

Reputation: 2 163

No, I am not. Thanks for the hint -- I will read & implement and be back with feedback (and mark the question as answered) – WoJ – 2013-01-30T11:46:41.490

The solution with iterparse works great. It improved the parsing time by at least an order of magnitude. I stumbled, however, on a problem, but I will open a separate question – WoJ – 2013-02-02T23:06:13.847