If you’re a longtime reader, you may remember that I started programming in Python in 2006. Within a year or so, my employer decided to move away from Microsoft Exchange to the open source Zimbra client. Zimbra is a decent client, but it was missing a good way to alert users to upcoming appointments, so I had to create a way to query Zimbra for that information and show a dialog. What does all this mumbo jumbo have to do with XML though? Well, I thought that using XML would be a great way to keep track of which appointments had been added, deleted, snoozed or whatever. It turned out that I was wrong, but that’s not the point of this story.

In this article, we’re going to look at my first foray into parsing XML with Python. If you do a little research on this topic, you’ll soon discover that Python ships with XML parsers in the standard library’s xml package. I ended up using the minidom submodule of that package (xml.dom.minidom)…at least at first. Eventually I switched to lxml, which implements the ElementTree API, but that’s outside the scope of this article. Let’s take a quick look at some ugly XML that I came up with:

If I recall correctly, this code was based on an example from the Python documentation (or maybe a chapter in Dive Into Python). I still don’t like this code. The url parameter you see in the ApptParser class can be either a URL or a file path. I had an XML feed from Zimbra that I would check periodically for changes and compare against the last copy of that XML that I had downloaded. If there was something new, I would add the changes to the downloaded copy. Anyway, let’s unpack this code a little.

In the getXml method, we use an exception handler to try to open the URL. If that raises an error, then we assume that the url is actually a file path. Next we use minidom’s parse function to parse the XML. Then we pull out a node from the XML. We’ll ignore the conditional as it isn’t important to this discussion (it has to do with my program). Finally, we return the node object.
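The URL-or-file fallback described above can be sketched roughly as follows. This is a minimal Python 3 illustration, not the original ApptParser code: the function name get_xml is hypothetical, and returning the document's root element is an assumption, since which node the original pulled out is not stated.

```python
import urllib.error
import urllib.request
from xml.dom import minidom


def get_xml(source):
    """Parse XML from a URL, falling back to a local file path."""
    try:
        # urlopen raises ValueError (or URLError) for strings that
        # are not usable URLs, such as a plain file path.
        stream = urllib.request.urlopen(source)
    except (ValueError, urllib.error.URLError):
        # Not a usable URL, so assume it is a path on disk.
        stream = open(source, "rb")
    with stream:
        doc = minidom.parse(stream)
    # Assumption: the root element is the node we care about.
    return doc.documentElement
```

Note that a real URL that fails for network reasons also falls through to open(), which will then raise its own error. That matches the "if it fails, treat it as a file" logic described above, but it is probably not what you would want in production code.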

Technically, the node is XML and we pass it on to the handleXml method. To grab all the appointment instances in the XML, we do this: xml.getElementsByTagName("appointment"). Then we pass that information to the handleAppts method. Yes, there is a lot of passing around various values here and there. It drove me crazy trying to follow this and debug it later on. Anyway, all the handleAppts method does is loop over each appointment and call the handleAppt method to pull some additional information out of it, add the data to a list and add that list to another list. The idea was to end up with a list of lists that held all the pertinent data regarding my appointments.
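To make the list-of-lists idea concrete, here is a small sketch under stated assumptions. Only the appointment tag name comes from the text above; the child tags subject and begin, the sample data, and the snake_case function names are all hypothetical stand-ins for whatever the Zimbra feed actually contained.

```python
from xml.dom import minidom

# Hypothetical sample data; the real Zimbra feed looked different.
SAMPLE = """<appointments>
  <appointment><subject>Standup</subject><begin>09:00</begin></appointment>
  <appointment><subject>Lunch</subject><begin>12:00</begin></appointment>
</appointments>"""

# Hypothetical field names to pull out of each appointment.
FIELDS = ("subject", "begin")


def handle_appt(appt):
    """Return one appointment's fields as a flat list of strings."""
    row = []
    for name in FIELDS:
        element = appt.getElementsByTagName(name)[0]
        # Assumes each field holds exactly one text node.
        row.append(element.firstChild.data)
    return row


def handle_appts(root):
    """Build a list of lists, one inner list per appointment."""
    return [handle_appt(appt)
            for appt in root.getElementsByTagName("appointment")]


root = minidom.parseString(SAMPLE).documentElement
appointments = handle_appts(root)
```

Returning the lists instead of stashing them on the instance keeps the data flow visible, which helps with the "passing values around" confusion mentioned above.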

You will notice that the handleAppt method calls the getElement method which calls the getText method. I don’t know why the original author did it that way. I would have just called the getText method and skipped the getElement one. I guess that can be an exercise for you, dear reader.
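As a starting point for that exercise, a text-gathering helper in the style of the one in the Python documentation might look like the sketch below. The name get_text and the sample appointment are hypothetical; the point is that you can hand the element's childNodes to the helper directly, with no getElement step in between.

```python
from xml.dom import minidom


def get_text(nodelist):
    """Join the data of every text node in nodelist."""
    parts = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            parts.append(node.data)
    return "".join(parts)


# Hypothetical one-appointment document for illustration.
appt = minidom.parseString(
    "<appointment><subject>Dentist</subject></appointment>"
).documentElement

# Call get_text directly on the child nodes of the element we want.
subject = get_text(appt.getElementsByTagName("subject")[0].childNodes)
```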

Now you know the basics of parsing with minidom. Personally I never liked this method, so I decided to try to come up with a cleaner way of parsing XML with minidom.

Making minidom Easier to Follow

I’m not going to claim that my code is any good, but I will say that I think I came up with something much easier to follow. I’m sure some will argue that the code is not as flexible, but oh well. Here’s a new XML example that we will parse (found on MSDN):

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.</description>
   </book>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology
      society in England, the young survivors lay the
      foundation for a new society.</description>
   </book>
   <book id="bk104">
      <author>Corets, Eva</author>
      <title>Oberon's Legacy</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-03-10</publish_date>
      <description>In post-apocalypse England, the mysterious
      agent known only as Oberon helps to create a new life
      for the inhabitants of London. Sequel to Maeve
      Ascendant.</description>
   </book>
   <book id="bk105">
      <author>Corets, Eva</author>
      <title>The Sundered Grail</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-09-10</publish_date>
      <description>The two daughters of Maeve, half-sisters,
      battle one another for control of England. Sequel to
      Oberon's Legacy.</description>
   </book>
   <book id="bk106">
      <author>Randall, Cynthia</author>
      <title>Lover Birds</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-09-02</publish_date>
      <description>When Carla meets Paul at an ornithology
      conference, tempers fly as feathers get ruffled.</description>
   </book>
   <book id="bk107">
      <author>Thurman, Paula</author>
      <title>Splish Splash</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>A deep sea diver finds true love twenty
      thousand leagues beneath the sea.</description>
   </book>
   <book id="bk108">
      <author>Knorr, Stefan</author>
      <title>Creepy Crawlies</title>
      <genre>Horror</genre>
      <price>4.95</price>
      <publish_date>2000-12-06</publish_date>
      <description>An anthology of horror stories about roaches,
      centipedes, scorpions and other insects.</description>
   </book>
   <book id="bk109">
      <author>Kress, Peter</author>
      <title>Paradox Lost</title>
      <genre>Science Fiction</genre>
      <price>6.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems
      of being quantum.</description>
   </book>
   <book id="bk110">
      <author>O'Brien, Tim</author>
      <title>Microsoft .NET: The Programming Bible</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-09</publish_date>
      <description>Microsoft's .NET initiative is explored in
      detail in this deep programmer's reference.</description>
   </book>
   <book id="bk111">
      <author>O'Brien, Tim</author>
      <title>MSXML3: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-01</publish_date>
      <description>The Microsoft MSXML3 parser is covered in
      detail, with attention to XML DOM interfaces, XSLT processing,
      SAX and more.</description>
   </book>
   <book id="bk112">
      <author>Galos, Mike</author>
      <title>Visual Studio 7: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>49.95</price>
      <publish_date>2001-04-16</publish_date>
      <description>Microsoft Visual Studio 7 is explored in depth,
      looking at how Visual Basic, Visual C++, C#, and ASP+ are
      integrated into a comprehensive development
      environment.</description>
   </book>
</catalog>

For this example, we’ll just parse the XML, extract the book titles and print them to stdout. Are you ready? Here we go!

This code is just one short function that accepts one argument, the XML file. We import xml.dom.minidom and alias it as minidom to make it easier to reference. Then we parse the XML. The first two lines in the function are pretty much the same as the previous example. We use getElementsByTagName to grab the book elements, then iterate over the result and extract the title from each one. This actually gives us title objects rather than strings, so we need to iterate over those as well and pull out the plain text, which is what the second nested for loop is for.
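A minimal sketch of a function along these lines is shown below. The names get_titles and print_titles are illustrative, and splitting the work into a collecting function plus a printing one is a choice made here so the result is easy to check; the book and title tag names come from the sample catalog above.

```python
import xml.dom.minidom as minidom


def get_titles(xml):
    """Return the text of every <title> found inside a <book>.

    xml may be a file path or an open file object, since
    minidom.parse accepts either.
    """
    doc = minidom.parse(xml)
    titles = []
    for book in doc.getElementsByTagName("book"):
        for title in book.getElementsByTagName("title"):
            # A title element holds text nodes, not a plain string.
            for node in title.childNodes:
                if node.nodeType == node.TEXT_NODE:
                    titles.append(node.data)
    return titles


def print_titles(xml):
    """Print each book title on its own line."""
    for title in get_titles(xml):
        print(title)
```

Assuming the catalog above is saved as example.xml (a hypothetical filename), calling print_titles("example.xml") would print one title per line.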

That’s it. There is no more.

Wrapping Up

Well, I hope this rambling article taught you a thing or two about parsing XML with Python’s built-in XML parser. We will be looking at XML parsing some more in future articles. If you have a method or module that you like, feel free to point me to it and I’ll take a look.