At its core, SAX, the Simple API for XML,
is based on just two interfaces: the
XMLReader interface, which
represents the parser,
and the
ContentHandler interface, which
receives data from the parser. These two interfaces alone suffice
for 90% of what you need to do with SAX. This chapter shows the
basic operation of XMLReader
and discusses ContentHandler in
detail. The next chapter explores a variety of ways to customize
the parsing process through the more advanced features of the
XMLReader interface.
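
To make this division of labor concrete, here is a minimal sketch of the two
interfaces working together. It simply counts the elements in a document.
The class name ElementCounter and its countElements() method are my own
inventions for illustration and are not part of SAX. The handler extends
DefaultHandler, the convenience class in org.xml.sax.helpers that implements
ContentHandler with do-nothing methods, and I assume the XMLReader
is obtained through JAXP's SAXParserFactory, which is only one of
several ways to get one.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; the name is an assumption, not part of SAX.
class ElementCounter extends DefaultHandler {

    private int count = 0;

    // The parser calls this ContentHandler method once for each start-tag
    // (and once for each empty-element tag) it encounters.
    @Override
    public void startElement(String namespaceURI, String localName,
            String qualifiedName, Attributes atts) {
        count++;
    }

    // Parses the XML text and returns the number of elements it contains.
    // Checked exceptions are wrapped only to keep the sketch short.
    static int countElements(String xml) {
        try {
            // The XMLReader represents the parser.
            XMLReader parser = SAXParserFactory.newInstance()
                    .newSAXParser().getXMLReader();
            // The ContentHandler receives data from the parser.
            ElementCounter handler = new ElementCounter();
            parser.setContentHandler(handler);
            parser.parse(new InputSource(new StringReader(xml)));
            return handler.count;
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalStateException("Could not parse document", ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(countElements("<a><b/><b/></a>"));
    }
}
```

The sample document passed to countElements() in main() contains three
elements, so the program prints 3. Notice that the client code never
sees a concrete parser class; it talks only to the XMLReader
and ContentHandler interfaces.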

What is SAX?

The Simple API for XML, SAX, was invented in late 1997/early 1998
when Peter Murray-Rust and several
authors of XML parsers written in Java
decided there wasn’t
much point to maintaining multiple similar yet incompatible
APIs to do exactly the same thing.
Murray-Rust was the first to suggest what he called
“YAXPAPI”, Yet Another XML Parser API.
He wanted it because
he was thoroughly sick of supporting multiple,
incompatible XML parsers for his parser-client application JUMBO.
Instead, he wanted
a standard API everyone could agree on.
Parser authors Tim Bray
and David Megginson quickly signed on to the project,
and work began in public on the xml-dev
mailing list where many people participated.
Megginson wrote the initial draft of SAX.
After a short beta period, SAX 1.0 was released on May 11,
1998.

SAX was designed around abstract
interfaces rather than concrete classes so
it could be layered on top of parsers’ existing
native APIs.
SAX is not the most sophisticated XML API imaginable,
but that’s part of its beauty. The ease with which SAX could
be implemented by many parser vendors with very different
architectures contributed to its success and rapid
standardization.

SAX in other languages

SAX has been unofficially ported to several other
object-oriented languages,
including C++, Visual Basic, Python, and Perl. The general patterns and
names of most functions remain the same. However, the details of
implementation change quite a bit. For instance, C++ doesn’t
have interfaces, but does have multiple inheritance, so
ContentHandler, XMLReader
and the like become classes containing nothing but pure
virtual functions. The C++ string classes can’t handle
Unicode, so parsers must use pointers to arrays of custom types such as
XMLCh instead.
Unfortunately, there’s no standard C++ binding
for SAX, so the custom classes
vary from one parser to the next, and you can’t easily port
C++ SAX programs between different compilers and platforms
in either binary or source form.

Although supporting the “Desperate Perl Hacker”
was a goal of the original XML working group,
Perl has always lagged quite a bit behind other languages when it
comes to XML. The initial problem was the lack of support for
Unicode, a sine qua non for XML.
Modern Perls have decent Unicode support. To really handle XML
you need at least version 5.005_52 of Perl,
preferably 5.6.1 or later, and ideally 5.8.

There are several XML parsers available for Perl, though
far and away the most popular is
Larry Wall and Clark Cooper’s XML::Parser.
This is a wrapper around James Clark’s
expat
XML parser written in C.
However, although it’s used in a lot of legacy code,
this parser isn’t really SAX compatible.
New projects should use XML::SAX
instead.

However, even with this module,
in my opinion Perl is still not as ideal a language for processing
XML as you might expect. Perl’s strength is its ability
to work with the
implicit structure in text documents such as tab-delimited text files
and comma-separated values files. However, XML documents tend to
have very explicit structure that is easily addressed by a
language like Java.
Perl’s strengths don’t come into play, but you
still suffer the numerous well-known disadvantages of working with Perl.
The inevitable
obfuscation of Perl code seems to me too high a price to pay.

Python probably has the best support for SAX and XML of any of the
non-Java languages.
XML parsing, including a SAX port, has been a standard
part of Python since version 2.0. Furthermore, Python has a
standard Unicode string type. This is not quite the same as
Python’s regular string type, but Python’s dynamic typing means
this isn’t nearly as big an inconvenience as it is in C++.
However, the fact remains that SAX is designed in and for Java,
and Java is certainly the most convenient language with
which to write SAX programs.

Although SAX is very much a de facto standard, it has not gone through
any formal standardization process. Its development
was open to anyone interested. All you had to do was
join the xml-dev mailing list and participate in the
discussions. The end result was explicitly placed in the
public domain. It is free to be implemented or extended by
anyone for any purpose without permission from anybody.
It is not copyrighted or trademarked. As far as is known,
no parts of it are patented by anyone either.

In late 1999, work began on SAX2. This was a radical reformulation
of SAX that, while maintaining the same basic event-oriented
architecture, replaced almost every class in SAX1. The main
impetus for this radical shift was the need to make SAX
namespace aware. However, many other new capabilities were added in
SAX2, including filters and optional support for
lexical events and DTDs.
SAX2 was finished in May 2000, and
has proven even more successful than SAX1. Indeed, SAX2 is the
most complete XML API available anywhere. As of 2002, all major
parsers that support SAX at all support SAX2. There is no
reason to learn or concern yourself with the older classes
and interfaces from SAX1, and henceforth I will discuss SAX2
exclusively.
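
To give a rough idea of what namespace awareness means in practice, here
is a small sketch in which the parser hands the namespace URI and local
name of each element to the ContentHandler separately. The class
name NamespaceLister, its listElements() method, and the
http://example.com/ns namespace name are all hypothetical, chosen only for
this illustration. I also assume the parser comes from JAXP's
SAXParserFactory, where namespace processing has to be requested
explicitly with setNamespaceAware(true).

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; the name is an assumption, not part of SAX.
class NamespaceLister extends DefaultHandler {

    private final List<String> names = new ArrayList<>();

    // With namespace processing on, the first two arguments carry the
    // namespace URI and the local name of the element.
    @Override
    public void startElement(String namespaceURI, String localName,
            String qualifiedName, Attributes atts) {
        names.add("{" + namespaceURI + "}" + localName);
    }

    // Returns each element as {namespace URI}local name, in document order.
    // Checked exceptions are wrapped only to keep the sketch short.
    static List<String> listElements(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // JAXP factories are not namespace aware unless asked to be.
            factory.setNamespaceAware(true);
            XMLReader parser = factory.newSAXParser().getXMLReader();
            NamespaceLister handler = new NamespaceLister();
            parser.setContentHandler(handler);
            parser.parse(new InputSource(new StringReader(xml)));
            return handler.names;
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalStateException("Could not parse document", ex);
        }
    }

    public static void main(String[] args) {
        // The namespace name below is a made-up example.
        String xml = "<x:a xmlns:x='http://example.com/ns'><b/></x:a>";
        System.out.println(listElements(xml));
    }
}
```

For the sample document, the x:a element is reported with the namespace
URI http://example.com/ns and the local name a, while the unprefixed b
element, which is in no namespace, is reported with an empty string for
its namespace URI.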

For the first few years of
its life, the official SAX distribution and documentation
were maintained by David Megginson. However, he
recently passed the torch to David Brownell who has begun work
on SAX 2.1. At the time of this writing, SAX 2.1 seems unlikely
to be as radical a shift relative to SAX2
as SAX2 was relative to SAX1. Version 2.1 will add
a few bits of information
from the XML document that are not exposed by SAX2 such as
the encoding declaration. However, no SAX2 classes,
interfaces, or methods
will be deprecated in SAX 2.1, and only programmers with very
special needs will need to concern themselves with the new
functionality in SAX 2.1.