Login

Working with XML Documents and Python

XML can be used for describing data without needing a database. However, this leaves us with the problem of interpreting the data embedded within the XML. This is where Python comes to the rescue, as Peyton explains.

Introduction

XML, or eXtensible Markup Language, is a useful tool for describing all sorts of data. Take a look at this example:

We describe four articles in the DevShed backend. We start with a tag named devshed. Inside it, the tag categories contains two names, python and php. Inside each of these is an article tag, containing a title tag and an author tag. With this system, we are able to describe the data contained very easily and with no database. The data can be modified later with little effort and by someone with little experience.

However, we are faced with the problem of interpreting the data embedded within the XML. We need something to parse it with. Python includes a few tools that can aid us in parsing the data, and we’ll take a look at them in this article, learning about them through example.

{mospagebreak title=Organizing a Book Collection}

Let’s say we want to organize a book collection using XML to describe it all. We don’t need anything fancy. We only need to store the title, author, and genre of the book. Let’s go ahead and create the markup for a few books:

Now we’re left with parsing the data and turning it into something presentable. If you examine the way the data is stored, you will notice that it is similar to a dictionary in Python. Therefore, a dictionary would be an ideal type to store the data in. We’ll create a chunk of code that does just this. SAX, Simple API for XML, will be used for this project. It is contained in xml.sax:

import xml.sax

# Create a collection list
collection = []

# This handles the parsing of the content
class HandleCollection ( xml.sax.ContentHandler ):

As you can see, there’s really not much work involved. All we have to do is write the instructions that organize each book into a dictionary and put all the dictionaries into the collection list. We start by subclassing xml.sax.ContentHandler. The class we create is charged with handling the content of the document we parse. In our class’s __init__ method, we define a few variables. The book dictionary will, of course, house the book’s information. The title variable will be used by the characters method to determine whether we are dealing with the title tag’s content. The same goes for the author variable and the genre variable. These are set to True in startElement if we’re dealing with that particular element. They are then set to False when we have finished using them in endElement. Finally, we instruct Python to parse the file in the last three lines.

We are now free to present this information to the user in whichever way we see fit. For example, if we wanted to just output the book information without dressing it up too much, we could simply append some code to the above script that sorts through the list of dictionaries that the script creates:

While the above XML structure is fine for many things, compacting things into attributes is often very helpful and a much better idea. Let’s consider a music library (consisting of individual songs rather than whole albums). Instead of creating indivual tags for the album the song comes from, the artist of the song, the name of the song and length of the song, which would get tiresome to type, we could simply create attributes for each of these properties. Here’s an XML file that does this:

If we had given every property of the song its own tag, then the file would have been much longer than it is now. However, we are able to reduce the length of the file by putting properties in attributes.

Now, however, we are left with the task of parsing the data. The process is similar to what we did above, but there are some differences since we are dealing with attributes this time around. Here’s how it all works:

It’s not a very complex script, and it’s not very lengthy. It strongly resembles the previous script, but note that we choose not to store anything in a dictionary. Rather, we just dump it all out to the user as we receive it. The attributes variable in the startElement method is an object representing all the attributes of that tag. We then access the attributes by name with the getValue method, saving the values in variables that we print out in the characters method. That’s all there is to parsing attributes.

What if, however, we do not know the names of the attributes? It isn’t too much of a problem, since we can loop through all the attributes and get their values:

In the above script, we use the getNames method to retrieve a list of attribute names. We loop through the list and print each attribute’s name (with the first letter capitalized) and value.

{mospagebreak title=The Document Object Model}

SAX is not the only way to process XML data in Python. The Document Object Model exists, and it gives us an object-oriented interface to XML data. Python contains a library called minidom that provides a simple interface to DOM without a whole lot of bells and whistles. Let’s recreate our music library parser, making use of DOM this time:

We end up with a very short script in the above example. We start by pointing Python to the file we wish to parse. Then we get all tags by the name of “track” that belong to the main tag. We loop through the list provided, printing out the information contained within. To access the text inside the element, we access the track’s list of childNodes. The text is stored in a node, and we print out the value of it. The attributes are stored in attributes, and we reference them by name, printing out nodeValue.

Again, though, what if we don’t know everything we are going to parse? Let’s recreate the second example of the previous section using DOM:

We loop through the names of the attributes returned in the keys method, printing the name of the attribute out with the first letter capitalized. We then print the value of the attribute out by referencing the attribute by its key and then accessing nodeValue.

Let’s say we have tags nested in each other. Consider our very first script that parsed a book collection. Let’s rebuild it with DOM:

We load the collection file and then get a list of books. Then, we loop through the list of books and print out what’s wrapped inside of each tag, which is in the form of a child node that we must get the value of. It’s all very simple.

DOM includes many more features, but all you need to simply read a document is contained within the minidom module.

Conclusion

XML is a useful tool for describing data. It can be used to describe just about anything –- from book collections and music libraries to user settings for an application. Python contains a few utilities that can be used to read and process XML data, namely SAX and DOM. SAX allows you to create handler classes that can process individual items within an XML document, and DOM abstracts the entire document, allowing you to easily navigate through a tree of XML data. Both tools are extremely simple to use and contribute to the phrase “batteries included” that is used to describe the Python language.