Building Dictionaries With SAX

People have always liked to complain that Python is slow. People have always liked
to
complain that XML processing is slow. I've always thought both complaints are typically
misguided. Neither Python nor XML emerged out of the need for ultrasonic speed at
runtime.
Python emerged as an expressive language for programming applications, and XML as
an
expressive system for building data representation into applications. In both cases
the goal
for the resulting applications is ease of maintenance. This ease primarily comes from
the
expressiveness that Python and XML allow. One should expect that such gain in productivity
would be matched by a corresponding loss in some area, and sometimes the deficit comes
from
performance. Complaints about the downsides of this tradeoff often stem from premature
optimization, when developers insist on basing key decisions on meaningless benchmarks
rather than considering the entire profile under which an application operates.

All that having been said, there are certainly real world cases where one has chosen
Python
and XML tools with an open mind, and then decided that a performance boost is in order
upon
careful profiling. There are many ways to squeeze performance out of Python XML
applications, some of which take advantage of Python or XML's expressive power. Each
application will need its own mix of practical optimizations, but in this article
I shall
present a modest technique that might be handy to have in your war chest.

The dictionary express

Dictionaries are perhaps the core data structure most dear to the heart of Python.
The
basic object model of Python is largely constructed upon dictionaries. They have been
optimized to the utmost by some of the brightest minds in the computer industry.
Conventional wisdom suggests that if you need to optimize a bit of code in Python,
you
should take advantage of the excellent Python/C interface. A better bit of advice
is that
you might first want to check whether you are taking full advantage of dictionaries.
One
detail to keep in mind is that dictionaries are more suited to fast lookup operations
than
memory savings: again there is a tradeoff, space for speed in this case.

My pet description of XML's fundamental data model is "labeled strings in nested packages".
The labeling and nesting are what differentiate XML from good old comma or tab-delimited
value and tabular ("square") data models such as spreadsheets and classic SQL databases.
This same labeling and nesting makes for a natural accommodation of data from XML
in Python
dictionaries. Sometimes writing tortured SAX code for complex processing just doesn't
cut it
as an alternative to loading documents into heavyweight DOM or the like for XPath
access.
You may be better off writing SAX code that is just sophisticated enough to shape
the key
bits from XML into a structure of nested dictionaries.

Take for example the address labels format I've been using as an example in some earlier
installments, reproduced in listing 1.

Well, it's not an exact reproduction. In earlier articles I used the character
&#133;, mistakenly thinking that it was the ellipsis in Unicode. In fact,
it is NEL, an end of line character used most often in IBM mainframe data. I erroneously
looked up a table based on Windows code pages rather than Unicode, which illustrates
all too
well the danger I cautioned against earlier of confusing character sets. Starting
from this
article I'm using the correct character &#8230;. As an aside, there has
been a lot of controversy in the XML community because NEL, the character I happened
to
mistake for ellipsis, has been added as a standard white space character in a revision
that
will probably become XML 1.1. Some claim this breaks backward compatibility and is
not worth
the convenience of a (relatively) few mainframe users. The fact that &#133
is whitespace from the Unicode point of view explains what I thought was confusing
treatment
of the character by Python string processing APIs.

But back to the main topic of the article. Listing 2 contains some SAX code to create
a
handy dictionary of dictionaries from labels.xml. Python 2.x is the only requirement:
I used
Python 2.3.3.

Listing 2: a program (listing2.py) to create Python dictionaries from the labels XML
format

textnormalize is a SAX filter that I contributed to The Python
Cookbook. A SAX parser can report contiguous text using multiple
characters events. In other words, given the following XML document:

<spam>abc</spam>

The text "abc" could technically be reported as three characters events: one
for the "a", one for the "b" and a third for the "c". Such an extreme case is unlikely
in
real life, but in any case textnormalize ensures that all such broken-up events
are reported to downstream SAX handlers in the manner most developers would expect.
In the
above case, the filter would consolidate the three characters events into a
single one for the entire text node "abc". This frees downstream handlers from implementing
code to deal with broken-up text nodes, which can be rather cumbersome when mixed
into
typical SAX-style state machine logic. I've reproduced the filter class as listing
3 for
sake of convenience.

Working with the resulting structure

In listing 2 label_dict_handler is my special SAX handler for the labels
format. It's a somewhat rough case and does not go into the detail required to handle
some
edge cases, but it does accomplish the required task. It constructs a dictionary of
labels,
each of which is a dictionary of label details. The address details are given as yet
another
nested set of dictionaries. The main module code constructs the chain from the parser
through the textnormalize filter to the label_dict_handler
handler.
I ran this code and prepared to experiment with the resulting dictionary as follows:

$ python -i listing2.py labels.xml

Python's standard pprint module is a friendly way to inspect the contents of
dictionaries:

The Python dictionary query is certainly a bit more wordy, and it uses Python trickery
such as list comprehensions; but it's more efficient than most XPath implementations,
and if
you're using Python to a fair extent you're likely to be familiar with all the neat
idioms
anyway. Notice that I use the method iteritems to spool out the contents of the
dictionary rather than the more common items. iteritems requires
Python 2.2 and saves memory by creating the key/value pairs for each item in the dictionary
on demand rather than all at once. On older Python versions just replace with
items.

The dictionary data is just as easily manipulated although it would take some additional
work to save any such updates back into the XML format.

Limitations and all that

This article's fare is not a technique for all seasons. For one thing, even a trimmed
down
SAX handler that does nothing but create dictionaries can become rather complicated
and
tricky to debug. The code to generate equivalent data bindings and Python structures
in any
of the many tools I've examined in this column is almost trivial compared to the SAX
presented in this article. But using plain SAX to construct dictionaries is likely
to offer
more raw speed and less memory overhead. It also has the advantage of requiring no
third
party modules. One possible variation is to use SAX to construct specialized objects
rather
than dictionaries, which makes the query code a bit neater, but does reintroduce some
overhead.

It has been a slow couple of months in the Python/XML community. One interesting
development has been the announcement of Rx4Rdf. As its web page says:

Rx4RDF is a specification and reference implementation for querying, transforming
and updating W3C's RDF by specifying a deterministic mapping of the RDF model to the
XML
data model defined by XPath. Rx4RDF shields developers from the complexity of RDF
by
enabling you to use familiar XML technologies like XPath, XSLT and XUpdate. We call
their
RDF equivalents RxPath, RxSLT, and RxUpdate respectively.

Browsing through the code, Rx4RDF appears to start with various modules from 4Suite,
but
combines and enhances these in very interesting ways for an RDF-centric approach to
data
processing. I especially found intriguing the description of Racoon, which sounds like a variation on
the idea behind the popular Cocoon server, but using RDF and Python rather than XML/XSLT
and
Java.