DOM Parsing With Python

This guide assumes you have basic knowledge of Python and have done at least some work with HTML, XHTML, and/or XML.

Background

DOM stands for Document Object Model. It is a convention used in HTML, XHTML, and XML for representing and interacting with the objects in a document. As the name suggests, a document such as an HTML page is made up of many elements, each related to other elements. For example, you may have a <span> element inside your <body> element. The <span> element’s parent is the <body>. The <span> may in turn have child elements and/or sibling elements. It works much like a family tree. The elements in an HTML document may have identifiers, specified by attributes like id="something", class="something", and/or name="something". You can use these identifiers to keep track of and find a specific element or list of elements. Once you have found the element(s) you are looking for, you can change them dynamically or read the information you need.
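To make the family analogy concrete, here is a minimal sketch that uses only the standard library's xml.dom.minidom module. The markup and the variable names are made up for this illustration.

```python
from xml.dom.minidom import parseString

# Hypothetical XHTML snippet, made up for this illustration.
markup = (
    '<html><body>'
    '<span id="first">Hello</span>'
    '<p class="note">World</p>'
    '</body></html>'
)

tree = parseString(markup)
body = tree.getElementsByTagName("body")[0]
span = tree.getElementsByTagName("span")[0]

# The <span> element's parent is the <body> element.
parent_name = span.parentNode.tagName

# The <p> element is the <span> element's next sibling.
sibling_name = span.nextSibling.tagName

# The <body> element has two child elements.
child_names = [child.tagName for child in body.childNodes]

# Attributes such as id act as identifiers for an element.
span_id = span.getAttribute("id")
```

Note that the two child elements are written without whitespace between them. Whitespace in the markup would show up as extra text nodes between the siblings.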

Let’s Try Some Beautiful Soup

A little while ago I needed to parse HTML documents, so I went in search of a module to do the job. I could have made my own class to handle it (as DOM parsing really isn’t that hard), but I don’t have nearly the time I would need to take on such a project. Instead, I found a module called ‘BeautifulSoup’. When I looked into it, it seemed well written and full-featured, and in practice I have found it quite easy to use.
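A small document to follow along with might look like the one below. The markup is a stand-in written for this walkthrough, so treat the title and the text content as assumptions. Only the id ‘someid’ and the class ‘someclass’ matter for the examples.

```python
# Stand-in sample page (hypothetical content). The two <body> children are
# written without whitespace between them so that they are direct siblings.
doc = (
    '<html><head><title>Sample page</title></head>'
    '<body>'
    '<span id="someid">First element</span>'
    '<div class="someclass">Second element</div>'
    '</body></html>'
)
```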

In this document we have two child elements under the <body> element. Now, let’s say we want to access the element with the id ‘someid’. First we need to parse the document like so (we will assume the variable ‘doc’ contains the HTML):

import BeautifulSoup
dom = BeautifulSoup.BeautifulSoup(doc)

Now we need to get the element. The Python object ‘dom’ contains the parsed document, and it provides methods for searching the document tree. Here are some examples:

# Find the first element with the id 'someid' (all have the same result)
elm1 = dom.find(None, {"id": "someid"})
elm1 = dom.find(None, id="someid")
elm1 = dom.find("span", {"id": "someid"})  # Only searches 'span' tags
elm1 = dom.find("span", id="someid")  # Same as above

# Find all elements with the id 'someid'
elms1 = dom.findAll(None, id="someid")

# Find the first element with the class 'someclass'
elm2 = dom.find(None, {"class": "someclass"})

# Find all elements with the class 'someclass'
elms2 = dom.findAll(None, {"class": "someclass"})

# You cannot specify 'class' as a keyword argument, since it is reserved in Python.
# That is why the find methods allow a dictionary that specifies what to look for.
# Also, you may specify any of 'class', 'id', and/or 'name' to look for.

elm1.nextSibling  # A reference to the next sibling element
elm2.previousSibling  # A reference to the previous sibling element
# The above two lines are references to each other.

# Now, as it is a document _tree_ (each element references others), you can daisy-chain.
# These will just lead back to the same element that elm1 referenced to begin with:
elm1.nextSibling.parent.find(None, id="someid")
elm1.parent.first()

# Now, of course you can do more than just walk the tree.
# Print all text contained in the element and all child elements:
print elm1.text

# Print all raw HTML contained in the element:
print elm1.renderContents()
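For a sense of what a search like find or findAll does behind the scenes, here is a rough sketch that walks a tree recursively and compares attributes. It uses the standard library's xml.dom.minidom, not BeautifulSoup. The helper names find_all, find, and get_text, as well as the sample markup, are hypothetical and written only for this illustration. It is not how BeautifulSoup is actually implemented.

```python
from xml.dom.minidom import parseString

# Hypothetical well-formed sample, assumed for this sketch.
sample = (
    '<html><body>'
    '<span id="someid">First <b>element</b></span>'
    '<div class="someclass">Second element</div>'
    '</body></html>'
)


def find_all(node, attrs):
    """Return every element below `node` whose attributes include `attrs`."""
    matches = []
    for child in node.childNodes:
        if child.nodeType != child.ELEMENT_NODE:
            continue  # Skip text nodes and comments.
        if all(child.getAttribute(key) == value for key, value in attrs.items()):
            matches.append(child)
        matches.extend(find_all(child, attrs))  # Search the subtree too.
    return matches


def find(node, attrs):
    """Return the first match in document order, or None if there is none."""
    matches = find_all(node, attrs)
    return matches[0] if matches else None


def get_text(node):
    """Concatenate the text of `node` and all of its descendants."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(get_text(child) for child in node.childNodes)


tree = parseString(sample)
first = find(tree, {"id": "someid"})
second = find(tree, {"class": "someclass"})

# Siblings point at each other, and both share the same parent.
same_sibling = first.nextSibling is second and second.previousSibling is first
parent_tag = first.parentNode.tagName

# Text of the element and its children, then its markup.
first_text = get_text(first)
first_markup = first.toxml()
```

Unlike renderContents(), which returns only what is inside the element, toxml() here also includes the element's own opening and closing tags.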

Of course, you can do more, such as manually inspecting the attributes of an element, but this gives a basic idea of how to use the module for your needs. To parse XML instead of HTML or XHTML, parse the document with `BeautifulSoup.BeautifulStoneSoup("…")`. I hope you found this helpful. Good luck with your own DOM parsing.