Location, Location, Location

It is often useful to keep track of the location of some data in an
XML file being processed. If you are parsing a file in order to
perform sophisticated search and analysis tasks, you may want to know
in which element or other such node a specific pattern was found (or
even at what file location). XPath is the standard way to convey the
location of an XML node. In the case of DOM, you might like to be able
to compute an XPath expression selecting a specific node. In the case
of SAX, you might want an XPath location for the current event, or
you might want the parser to report the current file location. In
this article, I cover techniques for figuring out such location
information. Along the way, I provide some examples of marginally
documented corners of Python's SAX libraries.

Locating DOM Nodes

Alexandre Conrad asked the question on XML-SIG: "In XPath, is there a
way I can get the absolute path for a node?" You can compute a unique
absolute path for any node. Use a positional predicate in each step
of the path that identifies an element, a name test in the final step
to select an attribute, and similar tricks for other node types. I'll
illustrate using the sample file in Listing 1.

Have a look at the first line, a comment with the new standard
instruction to specify that the Python source file is not a plain
ASCII file. (I mention Florian Bösch's name in acknowledgements in a
comment.) If you're not familiar with this instruction, introduced
in Python 2.3, see PEP 263. In general, I coded the function for
Python 2.2 or later, and I tested it with Python 2.3. I think the
only construct that won't work in 2.0 and 2.1 is
node.nodeType in OTHER_NODES (rather than
node.nodeType in OTHER_NODES.keys()). The
code is straightforward, using simple recursion to build each step
of the absolute path while working up towards the root of the DOM
tree. The following session illustrates the use
of abs_path.
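
As a rough, independent illustration of the recursive approach just
described, here is a minimal sketch written for current Python 3 and
xml.dom.minidom. It is not the listing discussed above. The contents
of the OTHER_NODES mapping and the helper logic are assumptions made
for this example. It ignores namespaces (a prefixed name only works
if the XPath processor binds the same prefix), and it assumes that
adjacent DOM text nodes have already been merged, as the XPath data
model expects.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

# Assumed mapping from non-element node types to XPath node tests.
OTHER_NODES = {
    Node.TEXT_NODE: "text()",
    Node.COMMENT_NODE: "comment()",
    Node.PROCESSING_INSTRUCTION_NODE: "processing-instruction()",
}


def abs_path(node):
    """Return an absolute XPath expression intended to select only node."""
    if node.nodeType == Node.DOCUMENT_NODE:
        return "/"
    if node.nodeType == Node.ATTRIBUTE_NODE:
        # Attributes are unique by name on their element, so a name
        # test is enough and no positional predicate is needed.
        return abs_path(node.ownerElement) + "/@" + node.nodeName
    if node.nodeType == Node.ELEMENT_NODE:
        test = node.nodeName
        # Only earlier sibling elements with the same name count.
        same = lambda n: (n.nodeType == Node.ELEMENT_NODE
                          and n.nodeName == node.nodeName)
    elif node.nodeType in OTHER_NODES:
        test = OTHER_NODES[node.nodeType]
        # Any earlier sibling of the same node type counts.
        same = lambda n: n.nodeType == node.nodeType
    else:
        raise ValueError("unsupported node type: %r" % node.nodeType)

    # XPath positions are 1-based: count matching preceding siblings.
    position = 1
    sibling = node.previousSibling
    while sibling is not None:
        if same(sibling):
            position += 1
        sibling = sibling.previousSibling

    # Recurse towards the root to build the leading steps.
    parent_path = abs_path(node.parentNode)
    if parent_path == "/":
        parent_path = ""
    return "%s/%s[%d]" % (parent_path, test, position)


doc = parseString('<a><b/><!--note--><b id="x">hi</b></a>')
second_b = doc.getElementsByTagName("b")[1]
print(abs_path(second_b))
print(abs_path(second_b.getAttributeNode("id")))
print(abs_path(second_b.firstChild))
```

The three print calls show the path for an element, for one of its
attributes, and for its text child. Counting preceding siblings on
every call keeps the sketch short, at the cost of repeated work on
wide trees.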

File Locations from SAX

Another approach to getting information about where you are in a
document is tracking the location of a current event while using a
streaming interface such as SAX. SAX provides for this using
the Locator interface. As explained in the xml.sax
standard library documentation, "SAX parsers are strongly encouraged
(though not absolutely required) to supply a locator: if [a parser]
does so, it must supply the locator to the application by invoking
[the method setDocumentLocator] before invoking any of the other
methods in the DocumentHandler interface." Every SAX driver I know of
that comes with Python or PyXML supports locators. Listing 3 shows
how useful SAX locators can be. It is a simple, XML-aware
regular-expression search that is smart enough to search only actual
element content (as opposed to, say, tags, comments, or attribute
values), and it reports the position where each match was found.
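
To make the locator mechanism concrete, here is a minimal sketch of
that kind of search, written for current Python 3 and the standard
xml.sax module. It is not the listing referred to above. The names
TextSearchHandler and search are hypothetical, chosen for this
example. Two simplifications are worth stating: the recorded line and
column are whatever the locator reports for the characters event, not
the exact offset of the match, and a match that straddles two
characters events would be missed, because a parser may deliver text
in several chunks.

```python
import re
import xml.sax
from xml.sax.handler import ContentHandler


class TextSearchHandler(ContentHandler):
    """Collect regex matches found in character data only."""

    def __init__(self, pattern):
        ContentHandler.__init__(self)
        self.regex = re.compile(pattern)
        self.locator = None
        self.matches = []  # (matched text, line, column) tuples

    def setDocumentLocator(self, locator):
        # If the parser supplies a locator, it calls this before any
        # other handler method.
        self.locator = locator

    def characters(self, content):
        # Tags, comments and attribute values never reach this
        # method, so only element content is searched.
        for match in self.regex.finditer(content):
            line = column = None
            if self.locator is not None:
                line = self.locator.getLineNumber()
                column = self.locator.getColumnNumber()
            self.matches.append((match.group(), line, column))


def search(xml_bytes, pattern):
    """Parse xml_bytes and return the matches found in element content."""
    handler = TextSearchHandler(pattern)
    xml.sax.parseString(xml_bytes, handler)
    return handler.matches


sample = (b"<doc>\n"
          b"<p>spam</p>\n"
          b"<!-- spam in a comment -->\n"
          b'<q a="spam">eggs and spam</q>\n'
          b"</doc>")
for text, line, column in search(sample, r"spam"):
    print("%r at line %s, column %s" % (text, line, column))
```

Only the two occurrences inside element content are reported. The
ones in the comment and in the attribute value are skipped because
the parser never passes them to characters.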