Python and HTML Processing

Various Web surfing tasks that I regularly perform could be made much
easier, and less tedious, if I could only use Python to fetch the HTML pages and to
process them, yielding the information I really need. In this document I
attempt to describe HTML processing in Python using readily available tools
and libraries.

NOTE: This document is not quite finished. I aim to
include sections on using mxTidy to deal with broken HTML as well as some
tips on cleaning up text retrieved from HTML resources.

Depending on the methods you wish to follow in this tutorial, you need the
following things:

For the "SGML parser" method, a recent release of Python is probably
enough. You can find one at the Python download page.

For the "XML parser" method, a recent release of Python is required, along
with a capable XML processing library. I recommend using libxml2dom, since it
can handle badly-formed HTML documents as well as well-formed XML or XHTML
documents. However, PyXML also provides support for such documents.

For fetching Web pages over secure connections, it is important that
SSL support is enabled either when building Python from source, or in any
packaged distribution of Python that you might acquire. Information about
this is given in the source distribution of Python, but you can download
replacement socket libraries with SSL support for older versions of Python for Windows from Robin Dunn's site.

Accessing sites, downloading content, and processing such content, either
to extract useful information for archiving or to use such content to
navigate further into the site, require combinations of the following
activities. Some activities can be chosen according to preference: whether
the SGML parser or the XML parser (or parsing framework) is used depends on
which style of programming seems nicer to a given developer (although one
parser may seem to work better in some situations). However, technical
restrictions usually dictate whether certain libraries are to be used instead
of others: when handling HTTP redirects, it appears that certain Python
modules are easier to use, or even more suited to handling such
situations.

Supplying Data

Sometimes, it is necessary to pass information to the Web server, such as
information which would come from an HTML form. Of course, you need to know
which fields are available in a form, but assuming that you already know
this, you can supply such data in the urlopen function call:
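A minimal sketch of such a call is given here. The field names find and findtype and the helper post_form are assumptions made for illustration; the real names depend on the form being submitted. The import fallback is only there so that the same sketch also runs on Python 3, where these functions live in urllib.parse and urllib.request.

```python
try:
    from urllib import urlencode, urlopen      # Python 2, as used in this document
except ImportError:
    from urllib.parse import urlencode         # Python 3 homes of the same functions
    from urllib.request import urlopen

# Hypothetical form fields; substitute the names used by the real form.
fields = [("find", "Python"), ("findtype", "t")]

# First, encode the data into the form expected by Web servers.
data = urlencode(fields)

def post_form(url, encoded):
    "Send the 'encoded' form data to 'url' as a POST and return the page text."
    # Supplying a second argument makes urlopen issue a POST rather than a GET.
    f = urlopen(url, encoded.encode("ascii"))
    try:
        return f.read()
    finally:
        f.close()

# Not run here, since it needs network access:
# s = post_form("http://www.vex.net/parnassus/apyllo.py", data)
```

The encoded data is an ordinary string, so it can be inspected before anything is sent to the server.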

The above example passed data to the server as an HTTP POST request.
Fortunately, the Vaults of
Parnassus is happy about such requests, but this is not always the case
with Web services. We can instead choose to use a different kind of request,
however:

# We have the encoded data. Now get the file-like object...
f = urllib.urlopen("http://www.vex.net/parnassus/apyllo.py?" + data)
# And the rest...

The only difference is the use of a ? (question mark) character and the
adding of data onto the end of the Vaults of Parnassus URL, but
this constitutes an HTTP GET request, where the query (our additional data)
is included in the URL itself.

Fetching Secure Web Pages

import urllib
# Get a file-like object for a site.
f = urllib.urlopen("https://www.somesecuresite.com")
# NOTE: At the interactive Python prompt, you may be prompted for a username
# NOTE: and password here.
# Read from the object, storing the page's contents in 's'.
s = f.read()
f.close()

Including data which forms the basis of a query, as illustrated above, is
also possible with URLs starting with https.

Handling Redirects

Many Web services use HTTP redirects for various straightforward or even
bizarre purposes. For example, a fairly common technique employed on "high
traffic" Web sites is the HTTP redirection load balancing strategy where the
initial request to the publicised Web site (for example, http://www.somesite.com) is
redirected to another server (for example, http://www1.somesite.com) where a user's
session is handled.

Fortunately, urlopen handles redirects, at least in Python
2.1, and therefore any such redirection should be handled transparently by
urlopen without your program needing to be aware that it is
happening. It is possible to write code to deal with redirection yourself,
and this can be done using the httplib module; however, the
interfaces provided by that module are more complicated than those provided
above, if somewhat more powerful.

Using the SGML Parser

Given a character string from a Web service, such as the value held by
s in the above examples, how can one understand the content
provided by the service in such a way that an "intelligent" response can be
made? One method is by using an SGML parser, since HTML is an application of
SGML, and HTML is probably the content type most likely to be encountered
when interacting with a Web service.

In the standard Python library, the sgmllib module contains
an appropriate parser class called SGMLParser. Unfortunately, it
is of limited use to us unless we customise its activities somehow.
Fortunately, Python's object-oriented features, combined with the design of
the SGMLParser class, provide a means of customising it fairly
easily.

Defining a Parser Class

First of all, let us define a new class inheriting from
SGMLParser with a convenience method that I find very useful
indeed:
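A minimal sketch of such a class follows, using the MyParser name that appears later in this document. Since sgmllib is only present in Python 2, the sketch falls back on a small stand-in built on html.parser when sgmllib cannot be imported; that stand-in is an assumption made so that the example runs on newer interpreters, and it only imitates the start_ and end_ method dispatching.

```python
try:
    from sgmllib import SGMLParser             # Python 2, as used in this document
except ImportError:
    # Assumed stand-in for Python 3, where sgmllib no longer exists: forward
    # each tag to a start_<tag> or end_<tag> method if one is defined.
    from html.parser import HTMLParser

    class SGMLParser(HTMLParser):
        def __init__(self, verbose=0):
            HTMLParser.__init__(self)

        def handle_starttag(self, tag, attrs):
            getattr(self, "start_" + tag, lambda attrs: None)(attrs)

        def handle_endtag(self, tag):
            getattr(self, "end_" + tag, lambda: None)()

class MyParser(SGMLParser):
    "A simple parser class."

    def parse(self, s):
        "Parse the given string 's'."
        self.feed(s)
        self.close()
```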

What the parse method does is provide an easy way of passing
some text (as a string) to the parser object. I find this nicer than having
to remember calling the feed method, and since I always tend to
have the entire document ready for parsing, I do not need to use
feed many times - passing many pieces of text which comprise an
entire document is an interesting feature of SGMLParser (and its
derivatives) which could be used in other situations.

Deciding What to Remember

Of course, implementing our own customised parser is only of interest if
we are looking to find things in a document. Therefore, we should aim to
declare these things before we start parsing. We can do this in the
__init__ method of our class:
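One way of writing this is sketched below; the verbose parameter is the one accepted by SGMLParser in Python 2. As before, the fallback class is an assumed stand-in for interpreters without sgmllib and is not part of the standard library.

```python
try:
    from sgmllib import SGMLParser             # Python 2, as used in this document
except ImportError:
    # Assumed stand-in for Python 3, where sgmllib no longer exists.
    from html.parser import HTMLParser

    class SGMLParser(HTMLParser):
        def __init__(self, verbose=0):
            HTMLParser.__init__(self)

        def handle_starttag(self, tag, attrs):
            getattr(self, "start_" + tag, lambda attrs: None)(attrs)

        def handle_endtag(self, tag):
            getattr(self, "end_" + tag, lambda: None)()

class MyParser(SGMLParser):
    "A simple parser class."

    def __init__(self, verbose=0):
        "Initialise an object, passing 'verbose' to the superclass."
        SGMLParser.__init__(self, verbose)
        # The hyperlinks found in the document will be collected here.
        self.hyperlinks = []

    def parse(self, s):
        "Parse the given string 's'."
        self.feed(s)
        self.close()
```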

Here, we initialise new objects by passing information to the
__init__ method of the superclass (SGMLParser);
this makes sure that the underlying parser is set up properly. We also
initialise an attribute called hyperlinks which will be used to
record the hyperlinks found in the document that any given object will
parse.

Care should be taken when choosing attribute names, since use of names
defined in the superclass could potentially cause problems when our parser
object is used, because a badly chosen name would cause one of our attributes
to override an attribute in the superclass and result in our attributes being
manipulated for internal parsing purposes by the superclass. We might hope
that the SGMLParser class uses attribute names with leading
double underscores (__) since this isolates such attributes from access by
subclasses such as our own MyParser class.

Remembering Document Details

We now need to define a way of extracting data from the document; fortunately,
SGMLParser provides a mechanism which notifies us when an
interesting part of the document has been read. SGML and HTML are textual
formats which are structured by the presence of so-called tags, and in HTML,
hyperlinks may be represented in the following way:

<a href="http://www.python.org">The Python Web site</a>

How SGMLParser Operates

An SGMLParser object which is parsing a document recognises
starting and ending tags for things such as hyperlinks, and it issues a
method call on itself based on the name of the tag found and whether the tag
is a starting or ending tag. So, as the above text is recognised by an
SGMLParser object (or an object derived from
SGMLParser, like MyParser), the following method
calls are made internally:
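These calls can be observed with a small tracing class, sketched below. TracingParser and its calls attribute are hypothetical names introduced for this illustration, and the fallback class is an assumed stand-in for interpreters without sgmllib.

```python
try:
    from sgmllib import SGMLParser             # Python 2, as used in this document
except ImportError:
    # Assumed stand-in for Python 3, where sgmllib no longer exists.
    from html.parser import HTMLParser

    class SGMLParser(HTMLParser):
        def __init__(self, verbose=0):
            HTMLParser.__init__(self)

        def handle_starttag(self, tag, attrs):
            getattr(self, "start_" + tag, lambda attrs: None)(attrs)

        def handle_endtag(self, tag):
            getattr(self, "end_" + tag, lambda: None)()

class TracingParser(SGMLParser):
    "Record each call that the parser issues on itself."

    def __init__(self, verbose=0):
        SGMLParser.__init__(self, verbose)
        self.calls = []

    def start_a(self, attributes):
        self.calls.append(("start_a", list(attributes)))

    def handle_data(self, data):
        self.calls.append(("handle_data", data))

    def end_a(self):
        self.calls.append(("end_a",))

parser = TracingParser()
parser.feed('<a href="http://www.python.org">The Python Web site</a>')
parser.close()

# Show the calls in the order in which they were made.
for call in parser.calls:
    print(call)
```

The recorded sequence is a start_a call carrying the attributes, a handle_data call carrying the text, and finally an end_a call with no arguments.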

Note that the text between the tags is considered as data, and that the
ending tag does not provide any information. The starting tag, however, does
provide information in the form of a sequence of attribute names and values,
where each name/value pair is placed in a 2-tuple:
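For the hyperlink shown earlier, the value received by the method handling the starting tag would look like the list below. The conversion to a dictionary is only a convenience suggested here, not something the parser does itself.

```python
# One (name, value) 2-tuple per attribute found in the starting tag.
attributes = [("href", "http://www.python.org")]

# Since each item is a pair, the sequence converts directly to a dictionary.
href = dict(attributes).get("href")
print(href)
```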

Why SGMLParser Works

Why does SGMLParser issue a method call on itself,
effectively telling itself that a tag has been encountered? The basic
SGMLParser class surely does not know what to do with such
information. Well, if another class inherits from SGMLParser,
then such calls are no longer confined to SGMLParser and instead
act on methods in the subclass, such as MyParser, where such
methods exist. Thus, a customised parser class (for example, MyParser)
once instantiated (made into an object) acts like a stack of components, with
the lowest level of the stack doing the hard parsing work and passing items
of interest to the upper layers - it is a bit like a factory with components
being made on the ground floor and inspection of those components taking
place in the laboratories in the upper floors!

Class         Activity

...           Listens to reports, records other interesting things
MyParser      Listens to reports, records interesting things
SGMLParser    Parses documents, issuing reports at each step

Introducing Our Customisations

Now, if we want to record the hyperlinks in the document, all we need to
do is to define a method called start_a which extracts the
hyperlink from the attributes which are provided in the starting a tag.
This can be defined as follows:

    # Continuing from above...
    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."
        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)
    # More to come...

All we need to do is traverse the attributes list, find
appropriately named attributes, and record the value of those attributes.

Retrieving the Details

A nice way of providing access to the retrieved details is to define a
method, although Python 2.2 provides additional features to make this more
convenient. We shall use the old approach:
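A sketch of the complete class with such a method, together with a short test program, might look like this. To keep the sketch independent of the network, s holds a small sample document (an assumption); in practice it would be the text read from urlopen("http://www.python.org") as in the earlier sections, and the resulting list would be much longer. The fallback class is an assumed stand-in for interpreters without sgmllib.

```python
try:
    from sgmllib import SGMLParser             # Python 2, as used in this document
except ImportError:
    # Assumed stand-in for Python 3, where sgmllib no longer exists.
    from html.parser import HTMLParser

    class SGMLParser(HTMLParser):
        def __init__(self, verbose=0):
            HTMLParser.__init__(self)

        def handle_starttag(self, tag, attrs):
            getattr(self, "start_" + tag, lambda attrs: None)(attrs)

        def handle_endtag(self, tag):
            getattr(self, "end_" + tag, lambda: None)()

class MyParser(SGMLParser):
    "A simple parser class."

    def __init__(self, verbose=0):
        "Initialise an object, passing 'verbose' to the superclass."
        SGMLParser.__init__(self, verbose)
        self.hyperlinks = []

    def parse(self, s):
        "Parse the given string 's'."
        self.feed(s)
        self.close()

    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."
        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)

    def get_hyperlinks(self):
        "Return the list of hyperlinks."
        return self.hyperlinks

# Sample text standing in for a fetched page.
s = ('<p><a href="http://www.python.org">Python</a>, '
     '<a name="top">a target</a> and '
     '<a href="http://www.w3.org">W3C</a></p>')

myparser = MyParser()
myparser.parse(s)
print(myparser.get_hyperlinks())
```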

The print statement should cause a list to be displayed,
containing various hyperlinks to locations on the Python home page and other
sites.

The Example File

The above example code can be downloaded
and executed to see the results.

Finding More Specific Content

Of course, if it is sufficient for you to extract information from a
document without worrying about where in the document it came from, then the
above level of complexity should suit you perfectly. However, one might want
to extract information which only appears in certain places or constructs - a
good example of this is the text between starting and ending tags of
hyperlinks which we saw above. If we just acquired every piece of text using
a handle_data method which recorded everything it saw, then we
would not know which piece of text described a hyperlink and which piece of
text appeared in any other place in a document.

    # An extension of the above class.
    # This is not very useful.
    def handle_data(self, data):
        "Handle the textual 'data'."
        self.descriptions.append(data)

Here, the descriptions attribute (which we would need to
initialise in the __init__ method) would be filled with lots of
meaningless textual data. So how can we be more specific? The best approach
is to remember not only the content that SGMLParser discovers,
but also to remember what kind of content we have seen already.

Remembering Our Position

Let us add some new attributes to the __init__ method.

        # At the end of the __init__ method...
        self.descriptions = []
        self.inside_a_element = 0

The descriptions attribute is defined as we anticipated, but
the inside_a_element attribute is used for something different:
it will indicate whether or not SGMLParser is currently
investigating the contents of an a element - that is, whether
SGMLParser is between the starting a tag and the ending a
tag.

Let us now add some "logic" to the start_a method, redefining
it as follows:

    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."
        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)
                # Only a elements with an href are treated as hyperlinks.
                self.inside_a_element = 1

Now, we should know when a starting a tag has been seen, but to avoid
confusion, we should also change the value of the new attribute when the
parser sees an ending a tag. We do this by defining a new method for this
case:

    def end_a(self):
        "Record the end of a hyperlink."
        self.inside_a_element = 0

Fortunately, it is not permitted to "nest" hyperlinks, so it is not
relevant to wonder what might happen if an ending tag were to be seen after
more than one starting tag had been seen in succession.

Recording Relevant Data

Now, given that we can be sure of our position in a document and whether
we should record the data that is being presented, we can define the "real"
handle_data method as follows:
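The sketch below gathers the pieces discussed so far into one class, with handle_data recording text only while a hyperlink is open. The get_descriptions method mirrors get_hyperlinks; the sample text is an assumption, and the fallback class is an assumed stand-in for interpreters without sgmllib.

```python
try:
    from sgmllib import SGMLParser             # Python 2, as used in this document
except ImportError:
    # Assumed stand-in for Python 3, where sgmllib no longer exists.
    from html.parser import HTMLParser

    class SGMLParser(HTMLParser):
        def __init__(self, verbose=0):
            HTMLParser.__init__(self)

        def handle_starttag(self, tag, attrs):
            getattr(self, "start_" + tag, lambda attrs: None)(attrs)

        def handle_endtag(self, tag):
            getattr(self, "end_" + tag, lambda: None)()

class MyParser(SGMLParser):
    "A parser recording hyperlinks and the text found inside them."

    def __init__(self, verbose=0):
        SGMLParser.__init__(self, verbose)
        self.hyperlinks = []
        self.descriptions = []
        self.inside_a_element = 0

    def parse(self, s):
        "Parse the given string 's'."
        self.feed(s)
        self.close()

    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."
        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)
                self.inside_a_element = 1

    def end_a(self):
        "Record the end of a hyperlink."
        self.inside_a_element = 0

    def handle_data(self, data):
        "Handle the textual 'data', keeping it only inside a hyperlink."
        if self.inside_a_element:
            self.descriptions.append(data)

    def get_descriptions(self):
        "Return a list of descriptions."
        return self.descriptions

myparser = MyParser()
myparser.parse('<p>See <a href="http://www.python.org">The Python Web site</a> '
               'for more.</p>')
print(myparser.get_descriptions())
```

The text outside the hyperlink ("See" and "for more.") is passed to handle_data as well, but it is ignored because inside_a_element is zero at that point.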

And we can add the following line to our test program in order to display
the descriptions:

print myparser.get_descriptions()

The Example File

The example code with these modifications can be downloaded and executed to see the results.

Problems with Text

Upon running the modified example, one thing is apparent: there are a few
descriptions which do not make sense. Moreover, the number of descriptions
does not match the number of hyperlinks. The reason for this is the way that
text is found and presented to us by the parser - we may be presented with
more than one fragment of text for a particular region of text, so that more
than one fragment of text may be signalled between a starting a tag and an
ending a tag, even though it is logically one block of text.

We may modify our example by adding another attribute to indicate whether
we are just beginning to process a region of text. If this new attribute is
set, then we add a description to the list; if not, then we add any text
found to the most recent description recorded.

The __init__ method is modified still further:

        # At the end of the __init__ method...
        self.starting_description = 0

Since we can only be sure that a description is being started immediately
after a starting a tag has been seen, we redefine the start_a
method as follows:
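A sketch of the redefined start_a method, together with a handle_data method which joins fragments of text, is given below. The sample text and the splitting of it across two feed calls are assumptions chosen to provoke more than one fragment, and the fallback class is an assumed stand-in for interpreters without sgmllib.

```python
try:
    from sgmllib import SGMLParser             # Python 2, as used in this document
except ImportError:
    # Assumed stand-in for Python 3, where sgmllib no longer exists.
    from html.parser import HTMLParser

    class SGMLParser(HTMLParser):
        def __init__(self, verbose=0):
            HTMLParser.__init__(self)

        def handle_starttag(self, tag, attrs):
            getattr(self, "start_" + tag, lambda attrs: None)(attrs)

        def handle_endtag(self, tag):
            getattr(self, "end_" + tag, lambda: None)()

class MyParser(SGMLParser):
    "A parser which joins the fragments of text inside each hyperlink."

    def __init__(self, verbose=0):
        SGMLParser.__init__(self, verbose)
        self.hyperlinks = []
        self.descriptions = []
        self.inside_a_element = 0
        self.starting_description = 0

    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."
        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)
                self.inside_a_element = 1
                self.starting_description = 1

    def end_a(self):
        "Record the end of a hyperlink."
        self.inside_a_element = 0

    def handle_data(self, data):
        "Handle the textual 'data'."
        if self.inside_a_element:
            if self.starting_description:
                # The first fragment starts a new description.
                self.descriptions.append(data)
                self.starting_description = 0
            else:
                # Later fragments extend the most recent description.
                self.descriptions[-1] += data

    def get_descriptions(self):
        "Return a list of descriptions."
        return self.descriptions

myparser = MyParser()
# Feeding the text in two pieces may cause the link text to be reported as
# more than one fragment.
myparser.feed('<a href="http://www.python.org">The Python ')
myparser.feed('Web site</a>')
myparser.close()
print(myparser.get_descriptions())
```

However many fragments the parser reports, there is now one description for each hyperlink that contains text.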

Clearly, the handle_data method becomes more complicated: it needs to detect
whether a description is being started and to act in the manner discussed above.

The Example File

The example code with these modifications can be downloaded and executed to see the results.

Conclusions

Although the final example file produces some reasonable results - there
are still some strange descriptions, however, and we have not taken images
used within hyperlinks into consideration - the modifications that were
required illustrate that as more attention is paid to the structure of the
document, the more effort is required to monitor the origins of information.
As a result, we need to maintain state information within the
MyParser object in a not-too-elegant way.

For application purposes, the SGMLParser class, its
derivatives, and related approaches (such as SAX) are useful for casual
access to information, but for certain kinds of querying, they can become
more complicated to use than one would initially believe. However, these
approaches can be used for another purpose: that of building structures which
can be accessed in a more methodical fashion, as we shall see below.

Using XML Parsers

Given a character string s, containing an HTML document which
may have been retrieved from a Web service (using an approach described in an
earlier section of this document), let us now consider an alternative method
of interpreting the contents of this document so that we do not have to
manage the complexity of remembering explicitly the structure of the document
that we have seen so far. One of the problems with SGMLParser
was that access to information in a document happened "serially" - that is,
information was presented to us in the order in which it was found - but it
may have been more appropriate to access the document information according
to the structure of the document, so that we could request all parts of the
document corresponding to the hyperlink elements present in that document,
before examining each document portion for the text within each hyperlink
element.

In the XML world, a standard called the Document Object Model
(DOM) has been devised to provide a means of access to document information
which permits us to navigate the structure of a document, requesting
different sections of that document, and giving us the ability to revisit
such sections at any time; the use of Python with XML and the DOM is
described in another document.

If all Web pages were well-formed XML - that is, they all complied with
the expectations and standards set out by the XML specifications - then
any XML parser would be sufficient to process any HTML document found
on the Web. Unfortunately, many Web pages use less formal variants
of HTML which are rejected by XML parsers. Thus, we need to employ
particular tools and additional techniques to convert such pages to DOM
representations.

Below, we describe how Web pages may be processed using either the PyXML
toolkit or the libxml2dom package to obtain a top-level document object.
Since both approaches yield an object which is broadly compatible with the
DOM standard, the subsequent description of how we then inspect such
documents applies regardless of which toolkit or package we have chosen.

Using PyXML

It is possible to use Python's XML framework with the kind of HTML found on
the Web by employing a special "reader" class which builds a DOM
representation from an HTML document, and the consequences of this are
described below.

Creating the Reader

An appropriate class for reading HTML documents is found deep in the
xml package, and we shall instantiate this class for subsequent
use:

from xml.dom.ext.reader import HtmlLib
reader = HtmlLib.Reader()

Of course, there are many different ways of accessing the
Reader class concerned, but I have chosen not to import
Reader into the common namespace. One good reason for deciding
this is that I may wish to import other Reader classes from
other packages or modules, and we clearly need a way to distinguish between
them. Therefore, I import the HtmlLib name and access the
Reader class from within that module.

Loading a Document

Unlike with SGMLParser, we do not need to customise any class
before we load a document. Therefore, we can "postpone" any consideration of
the contents of the document until after the document has been loaded,
although it is very likely that you will have some idea of the nature of the
contents in advance and will have written classes or functions to work on the
DOM representation once it is available. After all, real programs extracting
particular information from a certain kind of document do need to know
something about the structure of the documents they process, whether that
knowledge is put in a subclass of a parser (as in SGMLParser) or
whether it is "encoded" in classes and functions which manipulate the DOM
representation.

Anyway, let us load the document and obtain a Document
object:

doc = reader.fromString(s)

Note that the "top level" of a DOM representation is always a
Document node object, and this is what doc refers
to immediately after the document is loaded.

Using libxml2dom

Obtaining documents using libxml2dom is slightly more straightforward:

import libxml2dom
doc = libxml2dom.parseString(s, html=1)

If the document text is well-formed XML, we could omit the html
parameter or set it to have a false value. However, if we are not sure
whether the text is well-formed, no significant issues will arise
from setting the parameter in the above fashion.

Deciding What to Extract

Now, it is appropriate to decide which information is to be found and
retrieved from the document, and this is where some tasks appear easier than
with SGMLParser (and related frameworks). Let us consider the
task of extracting all the hyperlinks from the document; we can certainly
find all the hyperlink elements as follows:

a_elements = doc.getElementsByTagName("a")

Since hyperlink elements comprise the starting a tag, the ending a
tag, and all data between them, the value of the a_elements
variable should be a list of objects representing regions in the document
which would appear like this:

<a href="http://www.python.org">The Python Web site</a>

Querying Elements

To make the elements easier to deal with, each object in the list is not
the textual representation of the element as given above. Instead, an object
is created for each element which provides a more convenient level of access
to the details. We can therefore obtain a reference to such an object and
find out more about the element it represents:

# Get the first element in the list. We don't need to use a separate variable,
# but it makes it clearer.
first = a_elements[0]
# Now display the value of the "href" attribute.
print first.getAttribute("href")

What is happening here is that the first object (being the
first a element in the list of those found) is being asked to return the
value of the attribute whose name is href, and if such an attribute exists,
a string is returned containing the contents of the attribute: in the case of
the above example, this would be...

http://www.python.org

If the href attribute had not existed, such as in the following example
element, then a value of None would have been returned.

<a name="Example">This is not a hyperlink. It is a target.</a>

Namespaces

Previously, this document recommended the usage of namespaces and the getAttributeNS
method, rather than the getAttribute
method. Whilst XML processing may involve extensive use of namespaces,
some HTML parsers do not appear to expose them quite as one would
expect: for example, not associating the XHTML namespace with XHTML
elements in a document. Thus, it can be advisable to ignore namespaces
unless their usage is unavoidable in order to distinguish between
elements in mixed-content documents (XHTML combined with SVG, for
example).

Finding More Specific Content

We are already being fairly specific, in a sense, in the way that we have
chosen to access the a elements within the document, since we start from a
particular point in the document's structure and search for elements from
there. In the SGMLParser examples, we decided to look for
descriptions of hyperlinks in the text which is enclosed between the starting
and ending tags associated with hyperlinks, and we were largely successful
with that, although there were some issues that could have been handled
better. Here, we shall attempt to find everything that is
descriptive within hyperlink elements.

Elements, Nodes and Child Nodes

Each hyperlink element is represented by an object whose attributes can be
queried, as we did above in order to get the href attribute's value.
However, elements can also be queried about their contents, and such contents
take the form of objects which represent "nodes" within the document. (The
nature of XML documents is described in another introductory document which discusses the DOM.) In
this case, it is interesting for us to inspect the nodes which reside within
(or under) each hyperlink element, and since these nodes are known generally
as "child nodes", we access them through the childNodes
attribute on each so-called Node object.

# Get the child nodes of the first "a" element.
nodes = first.childNodes

Node Types

Nodes are the basis of any particular piece of information found in an XML
document, so any element found in a document is based on a node and can be
explicitly identified as an element by checking its "node type":

print first.nodeType
# A number is returned which corresponds to one of the special values listed in
# the xml.dom.Node class. Since elements inherit from that class, we can access
# these values on 'first' itself!
print first.nodeType == first.ELEMENT_NODE
# If first is an element (it should be) then display the value 1.

One might wonder how this is useful, since the list of hyperlink elements,
for example, is clearly a list of elements - that is, after all, what we
asked for. However, if we ask an element for a list of "child nodes", we
cannot immediately be sure which of these nodes are elements and which are,
for example, pieces of textual data. Let us therefore examine the "child
nodes" of first to see which of them are textual:
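A minimal sketch of such a loop follows. Since PyXML and libxml2dom may not be installed, the sketch obtains its document from xml.dom.minidom in the standard library; this substitution is an assumption made for illustration, and unlike the readers described above, minidom accepts only well-formed input. The variable text_values is a name introduced here.

```python
from xml.dom import minidom

# A well-formed sample containing the hyperlink shown earlier.
doc = minidom.parseString(
    '<p><a href="http://www.python.org">The Python Web site</a></p>')
first = doc.getElementsByTagName("a")[0]

text_values = []
for node in first.childNodes:
    # Text nodes carry their content in the nodeValue attribute.
    if node.nodeType == node.TEXT_NODE:
        text_values.append(node.nodeValue)
        print("Found a text node: " + node.nodeValue)
```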

Navigating the Document Structure

If we wanted only to get the descriptive text within each hyperlink
element, then we would need to visit all nodes within each element (the
"child nodes") and record the value of the textual nodes. However, this
would not quite be enough - consider the following document region:

<a href="http://www.python.org">A <em>really</em> important page.</a>

Within the a element, there are text nodes and an em element - the
text within that element is not directly available as a "child node" of the a element. If we did not consider textual child nodes of each child node,
then we would miss important information. Consequently, it becomes essential
to recursively descend inside the a element collecting child node values.
This is not as hard as it sounds, however:
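A sketch of a function performing this recursive descent is shown below; the name collect_text is chosen here for illustration. As the function relies only on childNodes, nodeType and nodeValue, it should apply to any broadly DOM-compatible document; the usage part obtains one from xml.dom.minidom, which is an assumption made so that the sketch runs without PyXML or libxml2dom.

```python
from xml.dom import minidom

def collect_text(node):
    "Collect the text found inside 'node', returning it as a single string."
    s = ""
    for child_node in node.childNodes:
        if child_node.nodeType == child_node.TEXT_NODE:
            s += child_node.nodeValue
        else:
            # Descend into elements such as em to reach their text.
            s += collect_text(child_node)
    return s

# Usage with the document region shown above.
doc = minidom.parseString(
    '<a href="http://www.python.org">A <em>really</em> important page.</a>')
a_element = doc.getElementsByTagName("a")[0]
description = collect_text(a_element)
print(description)
```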

To contrast this with the SGMLParser approach, we see that
much of the work done in that example to extract textual information is
distributed throughout the MyParser class, whereas the above
function gathers the necessary operations into a single place, which is
what makes it look complicated.

Getting Document Regions as Text

Interestingly, it is easier to retrieve whole sections of the original
document as text for each of the child nodes, thus collecting the complete
contents of the a element as text. For this, we just need to make use of a
function provided in the xml.dom.ext package:

from xml.dom.ext import PrettyPrint
from StringIO import StringIO
# In order to avoid getting the "a" starting and ending tags, prettyprint the
# child nodes. PrettyPrint writes to a stream, so collect its output in one.
out = StringIO()
for child_node in a_elements[0].childNodes:
    PrettyPrint(child_node, out)
s = out.getvalue()
# Display the region of the original document between the tags.
print s

Unfortunately, documents produced by libxml2dom do not work with PrettyPrint. However, we can use a method on each node object instead:

# In order to avoid getting the "a" starting and ending tags, prettyprint the
# child nodes.
s = ""
for child_node in a_elements[0].childNodes:
    s += child_node.toString(prettyprint=1)
# Display the region of the original document between the tags.
print s

It is envisaged that libxml2dom will eventually work better with such functions and tools.