Python: Parsing XML with lxml

Last time, we looked at one of Python’s built-in XML parsers. In this article, we will look at a fun third-party package, lxml, from codespeak. It implements the ElementTree API, among other things. The lxml package has XPath and XSLT support, and includes an API for SAX and a C-level API for compatibility with C/Pyrex modules. We’ll just do a few simple things with it, though.

Anyway, for this article, we will use the examples from the minidom parsing example and see how to parse those with lxml. Here’s an XML example from a program that was written for keeping track of appointments:

The XML above shows two appointments. The beginning time is in seconds since the epoch; the uid is generated based on a hash of the beginning time and a key (I think); the alarm time is also in seconds since the epoch and should be earlier than the beginning time; and the state indicates whether the appointment has been snoozed, dismissed, or neither. The rest are pretty self-explanatory. Now let’s see how to parse it.

First off, we import the needed modules, namely the etree module from the lxml package and the StringIO class from the standard library’s StringIO module. Our parseXML function accepts one argument: the path to the XML file in question. We open the file, read it and close it. Now comes the fun part! We wrap the XML string in a StringIO object and hand that to etree’s parse function. The wrapping is needed because parse expects a file name or a file-like object rather than a string of XML.

Next, we iterate over the context (i.e. the lxml.etree.iterparse object) and pull out each element’s tag and text. A conditional if statement replaces empty fields with the word “None” to make the output a little clearer. And that’s it.
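The steps above can be sketched roughly as follows. This is a minimal Python 3 version, so io.BytesIO stands in for the old StringIO module, and it goes straight to iterparse without a separate parse call. The sample XML, its tag names and values, and the function name parse_xml_bytes are all assumptions made for illustration; they are not the original appointment file. The import falls back to the standard library's ElementTree, which shares the small part of the API used here.

```python
import io

try:
    from lxml import etree
except ImportError:  # stdlib fallback; same API for the calls used here
    import xml.etree.ElementTree as etree

# Hypothetical sample data; tag names and values are assumed for illustration.
SAMPLE_XML = b"""<?xml version="1.0"?>
<appointments>
    <appointment>
        <begin>1700000000</begin>
        <alarmTime>1699999100</alarmTime>
        <state></state>
        <subject>Bring pizza home</subject>
    </appointment>
</appointments>
"""


def parse_xml_bytes(xml_bytes):
    """Return (tag, text) pairs, using "None" for empty fields."""
    results = []
    # iterparse wants a file name or file-like object, so wrap the bytes.
    context = etree.iterparse(io.BytesIO(xml_bytes))
    for action, elem in context:
        # Treat missing or whitespace-only text as an empty field.
        text = (elem.text or "").strip() or "None"
        results.append((elem.tag, text))
    return results


results = parse_xml_bytes(SAMPLE_XML)
for tag, text in results:
    print(tag + " => " + text)
```

Because iterparse reports elements as they end by default, child tags such as begin show up before their enclosing appointment element.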

Parsing the Book Example

Well, the result of that example was kind of lame. Most of the time, you want to save the data you extract and do something with it, not just print it out to stdout. So for our next example, we’ll create a data structure to contain the results. Our data structure for this example will be a list of dicts. We’ll use the MSDN book example here:

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.</description>
   </book>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology
      society in England, the young survivors lay the
      foundation for a new society.</description>
   </book>
   <book id="bk104">
      <author>Corets, Eva</author>
      <title>Oberon's Legacy</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-03-10</publish_date>
      <description>In post-apocalypse England, the mysterious
      agent known only as Oberon helps to create a new life
      for the inhabitants of London. Sequel to Maeve
      Ascendant.</description>
   </book>
   <book id="bk105">
      <author>Corets, Eva</author>
      <title>The Sundered Grail</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-09-10</publish_date>
      <description>The two daughters of Maeve, half-sisters,
      battle one another for control of England. Sequel to
      Oberon's Legacy.</description>
   </book>
   <book id="bk106">
      <author>Randall, Cynthia</author>
      <title>Lover Birds</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-09-02</publish_date>
      <description>When Carla meets Paul at an ornithology
      conference, tempers fly as feathers get ruffled.</description>
   </book>
   <book id="bk107">
      <author>Thurman, Paula</author>
      <title>Splish Splash</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>A deep sea diver finds true love twenty
      thousand leagues beneath the sea.</description>
   </book>
   <book id="bk108">
      <author>Knorr, Stefan</author>
      <title>Creepy Crawlies</title>
      <genre>Horror</genre>
      <price>4.95</price>
      <publish_date>2000-12-06</publish_date>
      <description>An anthology of horror stories about roaches,
      centipedes, scorpions and other insects.</description>
   </book>
   <book id="bk109">
      <author>Kress, Peter</author>
      <title>Paradox Lost</title>
      <genre>Science Fiction</genre>
      <price>6.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems
      of being quantum.</description>
   </book>
   <book id="bk110">
      <author>O'Brien, Tim</author>
      <title>Microsoft .NET: The Programming Bible</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-09</publish_date>
      <description>Microsoft's .NET initiative is explored in
      detail in this deep programmer's reference.</description>
   </book>
   <book id="bk111">
      <author>O'Brien, Tim</author>
      <title>MSXML3: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-01</publish_date>
      <description>The Microsoft MSXML3 parser is covered in
      detail, with attention to XML DOM interfaces, XSLT processing,
      SAX and more.</description>
   </book>
   <book id="bk112">
      <author>Galos, Mike</author>
      <title>Visual Studio 7: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>49.95</price>
      <publish_date>2001-04-16</publish_date>
      <description>Microsoft Visual Studio 7 is explored in depth,
      looking at how Visual Basic, Visual C++, C#, and ASP+ are
      integrated into a comprehensive development
      environment.</description>
   </book>
</catalog>

This example is pretty similar to our last one, so we’ll just focus on the differences present here. Right before we start iterating over the context, we create an empty dictionary object and an empty list. Then inside the loop, we create our dictionary like this:

book_dict[elem.tag] = text

The text is either elem.text or “None”. Finally, if the tag happens to be “book”, then we’re at the end of a book section and need to add the dict to our list as well as reset the dict for the next book. As you can see, that is exactly what we have done. A more realistic example would be to put the extracted data into a Book class. I have done the latter with json feeds before.
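Here is a rough sketch of that list-of-dicts approach, written for Python 3. The function name parse_books and the two-book excerpt are assumptions for illustration. It also departs slightly from the description above: it does not store an entry for the book or catalog tags themselves, and it treats whitespace-only text as empty. The import falls back to the standard library's ElementTree, which offers the same iterparse call.

```python
import io

try:
    from lxml import etree
except ImportError:  # stdlib fallback; same API for the calls used here
    import xml.etree.ElementTree as etree

# A two-book excerpt in the shape of the catalog above.
CATALOG_XML = b"""<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <price>44.95</price>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <price>5.95</price>
   </book>
</catalog>
"""


def parse_books(source):
    """Collect each <book> into a dict and return a list of those dicts."""
    books = []
    book_dict = {}
    for action, elem in etree.iterparse(source):
        if elem.tag == "book":
            # End of a book section: store the dict and start a fresh one.
            books.append(book_dict)
            book_dict = {}
        elif elem.tag != "catalog":
            book_dict[elem.tag] = (elem.text or "").strip() or "None"
    return books


books = parse_books(io.BytesIO(CATALOG_XML))
for book in books:
    print(book)
```

Starting a new dict after each append matters: reusing the same dict object would leave every list entry pointing at the last book.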

Refactoring the Code

As pointed out by my vigilant readers, I wrote some pretty crappy code. So I have cleaned the code up a bit and hope this is a little better:

As you can see, we dropped the StringIO module entirely and put all the file I/O stuff right in the lxml method calls. The rest is the same. Cool huh? As usual, Python rocks!
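A minimal sketch of what such a refactoring might look like in Python 3 is below. The function name parse_book_file and the one-book temporary file are assumptions for illustration, not the original listing. The point it shows is that iterparse takes a file name directly, so there is no manual open, read, close, or StringIO wrapping.

```python
import os
import tempfile

try:
    from lxml import etree
except ImportError:  # stdlib fallback; same API for the calls used here
    import xml.etree.ElementTree as etree


def parse_book_file(xml_path):
    """Parse a catalog file by path and return a list of book dicts."""
    books = []
    book_dict = {}
    # iterparse accepts a file name directly and handles the file I/O.
    for action, elem in etree.iterparse(xml_path):
        if elem.tag == "book":
            books.append(book_dict)
            book_dict = {}
        elif elem.tag != "catalog":
            book_dict[elem.tag] = (elem.text or "").strip() or "None"
    return books


# Write a tiny hypothetical catalog to a temporary file to try it out.
with tempfile.NamedTemporaryFile("wb", suffix=".xml", delete=False) as tmp:
    tmp.write(
        b"<catalog><book id='bk101'>"
        b"<title>XML Developer's Guide</title>"
        b"</book></catalog>"
    )
    tmp_path = tmp.name

try:
    books = parse_book_file(tmp_path)
    print(books)
finally:
    os.remove(tmp_path)
```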

Wrapping Up

Did you learn anything in this article? I certainly hope so. Python has lots of cool parsing libraries both in its standard library and outside of it. Be sure to check them out and see which one fits your way of programming the best.


The next time you read the contents of a file into a variable only to turn around and put those contents back into a file-like object, I’m going to strangle you! 🙂

Either go with etree.parse(open('file.xml')) or, if you’re really insistent on reading a file out to a variable, just use etree.fromstring(myvar).

olt

You don’t need to call etree.parse if you are using iterparse.

Anonymous

Thanks to both you and brutimus, I was inspired to try to fix the code. I’ve added another section with some refactored code that hopefully won’t “offend” anyone else. Thanks a lot for the constructive feedback!

Anonymous

Ah…Thanks for the info. I updated my last example to reflect this. Thanks again!

– Mike

Anonymous

I do show an example of the parse command…but not quite in the way you’re talking about. Thanks for the suggestion though. I’m still a little green when it comes to XML parsing, I guess.