Gnosis XML Utilities is a Python package with a variety of utility
classes for data management, especially utility classes for XML
processing. Its author, David Mertz, writes separate columns covering
Python (Charming Python) and XML (XML Matters) on IBM developerWorks.
The Gnosis tools are very handy and complementary to PyXML and
4Suite, which I have introduced in other recent articles in this
column.

The module gnosis.xml.objectify allows you to
convert arbitrary XML documents to Python objects. At its most
basic, it does ordinary marshaling and unmarshaling, but it's also a
sophisticated data binding tool. Let's begin our examination by
unmarshaling the sample document from the last article, reproduced
as listing 1.

There are two steps to creating the Python representation of the
XML document. gnosis.xml.objectify.XML_Objectify
sets up a preparatory object with a DOM tree from which the Python
structure is created. The make_instance method does
the actual work of generating the Python structure.

Considerations of memory usage, or any other performance measures,
are not part of this comparison; but as I mentioned in the last
article, it would be nice if Python data bindings were able to
minimize memory usage. I think that it's best to process XML in
small chunks, but I despair of convincing others of this. It
seems that since people are used to treating traditional database
instances as monolithic resources, they have a natural tendency to
want to do the same with XML, stuffing all their data into huge
documents that are very unwieldy to process. As a first step one
can at least make sure the DOM used by make_instance
is cleaned up right away by using the following variation:

py_obj = XML_Objectify('listing1.xml').make_instance()

As soon as the instance is created, and the interpreter leaves
that line, the temporary XML DOM is reclaimed. So the DOM is
temporarily in memory at the same time as the Python structure, but
this is par for Python data bindings and certainly not unreasonable.
If this is too heavyweight for you, gnosis.xml.objectify allows you
to generate the binding from the streaming pyexpat interface rather
than DOM. This approach is much faster and uses less memory,
although you do lose some features if you choose it.
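To make the streaming idea concrete, here is a minimal sketch of building a Python object tree directly from expat callbacks, with no intermediate DOM. It is written for Python 3 and uses only the standard library. It is not how gnosis.xml.objectify is implemented; the Node class, the bind_streaming function, and the sample document are all hypothetical names chosen for illustration.

```python
import xml.parsers.expat


class Node:
    """Generic container; child elements become list-valued attributes."""

    def __init__(self, name, attrs):
        self._name = name          # element name (hypothetical field)
        self._attrs = dict(attrs)  # XML attributes
        self.PCDATA = ""           # accumulated character data


def bind_streaming(xml_bytes):
    """Build a tree of Node objects from expat events, without a DOM."""
    roots = []
    stack = []

    def start(name, attrs):
        node = Node(name, attrs)
        if stack:
            # Accumulate same-named children into a list on the parent.
            # (A child named "PCDATA" would collide; ignored in this sketch.)
            stack[-1].__dict__.setdefault(name, []).append(node)
        else:
            roots.append(node)
        stack.append(node)

    def end(name):
        stack.pop()

    def chars(data):
        stack[-1].PCDATA += data

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_bytes, True)
    return roots[0]


# Hypothetical sample document, not the article's listing 1.
SAMPLE = b"<labels><label><name>Thomas Eliot</name></label></labels>"
py_obj = bind_streaming(SAMPLE)
print(py_obj.label[0].name[0].PCDATA)
```

Only the stack of currently open elements and the growing Python structure are held in memory, which is where the savings over a DOM-based binding would come from.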

The resulting Python data structure consists of a set of classes
that are defined on the fly based on the XML structure. The root
(document) element, labels, is represented by
py_obj itself:

Also notice that gnosis.xml.objectify does the right thing with
content: it represents it as Python Unicode objects. (I did not
check how it would handle elements that use Unicode -- or dashes for
that matter -- in GIs (generic identifiers, that is, element names),
given Python's identifier name limitations.)
This bodes well for the high character test; indeed, it handles the
ellipsis character just fine:

>>> print repr(py_obj.label[0].quote.PCDATA)
u' is its own season\x85\n '

The above quote element, however, is mixed content.
It appears that gnosis.xml.objectify keeps only the last chunk of
text in mixed content by default, discarding the rest. In particular,
the text before the emph element, even though it's
only white space, doesn't seem directly accessible. The
emph element is handled conventionally:
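For comparison, the following sketch shows what mixed content looks like when every chunk is kept in document order. It uses the standard library's minidom under Python 3 rather than gnosis.xml.objectify, and the quote element is a hypothetical stand-in with the same general shape as the one discussed (leading white space, an emph child, then trailing text). The mixed_chunks helper is an illustrative name, not part of any library.

```python
from xml.dom import minidom

# Hypothetical mixed-content element, not the article's listing 1.
SAMPLE = "<quote>\n  <emph>Midwinter Spring</emph> is its own season...\n</quote>"


def mixed_chunks(element):
    """Return direct children in document order as (kind, value) pairs."""
    chunks = []
    for child in element.childNodes:
        if child.nodeType == child.TEXT_NODE:
            chunks.append(("text", child.data))
        elif child.nodeType == child.ELEMENT_NODE:
            chunks.append(("element", child.tagName))
    return chunks


doc = minidom.parseString(SAMPLE)
chunks = mixed_chunks(doc.documentElement)

# A binding that stores a single text value per element can only
# keep one of these chunks; here the last one is picked out.
texts = [value for kind, value in chunks if kind == "text"]
last_text = texts[-1]

for kind, value in chunks:
    print(kind, repr(value))
```

The first text chunk is the white space before emph, which is exactly the piece that a single-valued PCDATA member has no room for.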

The quote element I'm exploring also has a comment,
and gnosis.xml.objectify seems to offer experimental support for
comments. I say "experimental" because digging into the relevant
structure demonstrates very odd results:

Of course, the documentation says that comments are ignored, so
I'd guess support is in development. The final thing to note
about this default behavior of gnosis.xml.objectify is that the
accumulation of various elements into Python lists means that the
actual order of child elements in an XML document is lost. For
example, the document in listing 2 would result in a root object
with a spam data member which is a list of two
elements and an eggs data member which is a list of
one element, with no record of the fact that eggs
occurred between the spam elements.

Listing 2: XML file demonstrating loss of ordering

<monty>
<spam/>
<eggs/>
<spam/>
</monty>
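The effect is easy to reproduce outside gnosis.xml.objectify. This Python 3 sketch uses the standard library's ElementTree to group the children of the listing 2 document into lists by element name, the same general strategy described above, and compares the result with the original document order. The variable names are illustrative only.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

LISTING_2 = "<monty><spam/><eggs/><spam/></monty>"

root = ET.fromstring(LISTING_2)

# Group children by element name, as a list-per-name binding would.
grouped = defaultdict(list)
for child in root:
    grouped[child.tag].append(child)

# The order the children actually had in the document.
document_order = [child.tag for child in root]

print({name: len(items) for name, items in grouped.items()})
print(document_order)
```

Nothing in the grouped structure records that eggs sat between the two spam elements; only the separate document_order list preserves that.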

gnosis.xml.objectify does have a very nice feature that allows
you to recover a lot of the elided information. It keeps around
the raw markup of any object with mixed content in a special data
member, _XML.
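As a rough illustration of the idea, the sketch below finds elements with mixed content and keeps their serialized child markup alongside the element name. It uses minidom under Python 3, not gnosis.xml.objectify, and exactly what the real _XML member holds is not specified here; storing the serialized children is an assumption of this sketch. The helper names and the sample document are hypothetical.

```python
from xml.dom import minidom

# Hypothetical sample, not the article's listing 1.
SAMPLE = (
    "<label>"
    "<quote><emph>Midwinter Spring</emph> is its own season</quote>"
    "<name>Thomas Eliot</name>"
    "</label>"
)


def has_mixed_content(element):
    """True if the element has both non-blank text and child elements."""
    kids = element.childNodes
    has_text = any(c.nodeType == c.TEXT_NODE and c.data.strip() for c in kids)
    has_elem = any(c.nodeType == c.ELEMENT_NODE for c in kids)
    return has_text and has_elem


def raw_markup_for_mixed(element, found=None):
    """Map element name -> serialized children, for mixed-content elements."""
    if found is None:
        found = {}
    if has_mixed_content(element):
        found[element.tagName] = "".join(c.toxml() for c in element.childNodes)
    for child in element.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            raw_markup_for_mixed(child, found)
    return found


doc = minidom.parseString(SAMPLE)
raw = raw_markup_for_mixed(doc.documentElement)
print(raw)
```

With the raw markup retained, the text chunks and child order that the list-based structure drops can be recovered by reparsing that string.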

You can also tune gnosis.xml.objectify to not maintain this raw
information or to maintain it for all elements. And there is much
more to the flexibility of the package than such simple tuning.

Customizing the binding

One of the key features of gnosis.xml.objectify is the ability to
customize data bindings by substituting your classes for the
autogenerated ones. For example, if I know that I will need the
ability to compute initials from names in label entries, I might
write a program such as listing 3:
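As a stand-in illustration of class substitution in general, here is a self-contained Python 3 sketch built on the standard library's ElementTree. It does not use gnosis.xml.objectify and should not be read as that package's API: the Bound and Label classes, the bind function, the registry dictionary, and the sample document are all hypothetical.

```python
import xml.etree.ElementTree as ET


class Bound:
    """Default binding class: children grouped into lists by element name."""

    def __init__(self, element, registry):
        self.PCDATA = element.text or ""
        for child in element:
            # Use a substituted class if one is registered for this name.
            cls = registry.get(child.tag, Bound)
            bound_child = cls(child, registry)
            self.__dict__.setdefault(child.tag, []).append(bound_child)


class Label(Bound):
    """Hypothetical custom class substituted for label elements."""

    def initials(self):
        words = self.name[0].PCDATA.split()
        return "".join(word[0].upper() for word in words)


def bind(xml_text, registry):
    """Bind a document, consulting registry (name -> class) for each element."""
    root = ET.fromstring(xml_text)
    cls = registry.get(root.tag, Bound)
    return cls(root, registry)


# Hypothetical sample document, not the article's listing 1.
SAMPLE = (
    "<labels>"
    "<label><name>Thomas Eliot</name></label>"
    "<label><name>Ezra Pound</name></label>"
    "</labels>"
)

labels = bind(SAMPLE, {"label": Label})
initials = [label.initials() for label in labels.label]
print(initials)
```

The binding code never needs to know about initials; it simply instantiates whichever class is registered for an element name, so application logic lives on the substituted class.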

Scratching my itch

So far I have looked at two Python data binding tools which
represent the current state of the art. I listed some other tools
in the last article, but I won't cover them just yet. In
particular, XBind and Skyron look interesting, but they use
specialized languages to drive the binding process. This is a
reasonable approach, one which offers some potential advantages,
including support for multiple programming languages. But I'm
focusing on systems that are completely built around Python's
dynamism.

Part of the reason why I still use DOM rather than Python
bindings is that I'm accustomed to a lot of the other XML-processing
tools that work closely with DOM right now: XPath, XPatterns, etc.
And a lot of my XML usage has to do with the document flavor of XML,
which doesn't really suit a lot of the current data bindings. I
have long incubated ideas for a Python data binding library that
would tend to suit my needs better. Setting the stage for this
library has been one of my motives for taking a close look at the
state of the art. In the next article I shall offer a preliminary
examination of my effort, as well as a general discussion of what
one might like in the ultimate Python data binding tool.

Since the last article, Mike Olson and I released version 0.6
of wsdl4py, our simplistic library for WSDL document
manipulation. The release is mainly based on Mark Bucciarelli's
patches to support recent DOM libraries.