This chapter is from the book

Like the popular HTML, Extensible Markup Language (XML) consists of tagged,
human-readable text. Unlike HTML, the tags in an XML document follow one simple
rule: for every opening tag <tag> there is a closing tag </tag>. An XML
document in which every opening tag has a matching, properly nested closing
tag is said to be well formed.
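
As a rough illustration of the rule, the following sketch asks the JDK's built-in SAX parser whether a string is well formed. The class name WellFormedCheck and the sample documents are made up for this example, and the check relies on the parser reporting a mismatched tag as a fatal error.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a string parses as well-formed XML.
class WellFormedCheck {

    static boolean isWellFormed(String xml) {
        try {
            // DefaultHandler ignores all content events but rethrows fatal
            // errors, which is how the parser signals a broken document.
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // not well formed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("Parser could not be set up", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<note><to>Ann</to></note>")); // true
        System.out.println(isWellFormed("<note><to>Ann</note>"));      // false
    }
}
```

The second document fails because the <to> tag is never closed before </note> appears.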

As long as the XML document is well formed, you can invent the tags in any
way you want. An XML document is typically parsed by an XML parser, which
creates an in-memory logical data structure for navigating the document. There
are different types of XML parsers. The most common do not care what the tags
are as long as they are well formed. Some parsers can also validate an XML
document against a set of rules that limit the document to a certain subset of
tags. Such parsers are called validating parsers.

The two most popular mechanisms for parsing XML documents are to create a
Document Object Model (DOM) tree or to use the event-based Simple API
for XML (SAX) model. An XML document can be validated against a document type
definition (DTD), the set of rules that define the type and structure of the
XML tags, or against an XML schema.
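
To make the difference between well formed and valid concrete, here is a minimal Java sketch that switches on DTD validation in the JDK's SAX parser. The class name DtdValidationDemo, the note DTD, and both sample documents are invented for illustration. Validity problems are reported as recoverable errors, so the handler has to rethrow them to be noticed.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: validates a document against the DTD in its DOCTYPE.
class DtdValidationDemo {

    // Illustrative DTD: a note must contain exactly one <to> and one <body>.
    static final String DTD =
            "<!DOCTYPE note [\n"
            + "  <!ELEMENT note (to, body)>\n"
            + "  <!ELEMENT to (#PCDATA)>\n"
            + "  <!ELEMENT body (#PCDATA)>\n"
            + "]>\n";

    static final String VALID_NOTE = DTD + "<note><to>Ann</to><body>Hi</body></note>";
    // Well formed, but <from> is not allowed by the DTD.
    static final String INVALID_NOTE = DTD + "<note><to>Ann</to><from>Bob</from></note>";

    static boolean isValid(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // turn the parser into a validating parser
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                    new DefaultHandler() {
                        @Override
                        public void error(SAXParseException e) throws SAXException {
                            throw e; // treat validity errors as failures
                        }
                    });
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("Parser could not be set up", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(VALID_NOTE));   // true
        System.out.println(isValid(INVALID_NOTE)); // false
    }
}
```

Both sample documents are well formed; only the first one obeys the rules in the DTD.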

This chapter looks at C#'s API for DOM and SAX parsing of XML documents.
We look at validating an XML document against a DTD. We also look at other
utilities, such as XPath and Extensible Stylesheet Language Transformations
(XSLT), that are built into the .NET API.

20.1 XML Support in Java

For a long time, XML was not built into the Java API. Support for XML came
primarily from third-party libraries (such as Apache Xerces or JDOM).
Fortunately, that has changed, and now you can get the Java XML Pack, a toolset
for dealing with everything XML in Java. The XML Pack brings together several of
the key industry standards for XML, such as SAX, DOM, XSLT, SOAP, Universal
Description, Discovery & Integration (UDDI), Electronic Business using
Extensible Markup Language (ebXML), and Web Services Description Language
(WSDL). The two common programmatic XML APIs (SAX and DOM) are now built into
the core Java API (as of J2SE 1.4.0).

The SAX parser is an event-driven parser in which the parser fires off events
when it encounters XML elements. Users write content handlers, which they can
register with the parser. A content handler is like an event listener and
can take appropriate action upon encountering, say, a particular XML tag. The
SAX parser is based on a push model, wherein the parser pushes events to content
handlers.
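
The push model can be sketched with a small content handler. This example assumes the standard javax.xml.parsers and org.xml.sax packages that ship with the JDK. The class name SaxTitleCollector and the library/book/title vocabulary are invented for the illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical content handler: collects the text of every <title> element.
class SaxTitleCollector extends DefaultHandler {
    private final List<String> titles = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <title>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("title".equals(qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver text in several chunks, so accumulate it.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            titles.add(current.toString());
            current = null;
        }
    }

    static List<String> titlesIn(String xml) {
        SaxTitleCollector handler = new SaxTitleCollector();
        try {
            // Register the handler; the parser pushes events to it as it reads.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return handler.titles;
    }

    public static void main(String[] args) {
        String xml = "<library><book><title>Dune</title></book>"
                + "<book><title>Emma</title></book></library>";
        System.out.println(titlesIn(xml)); // [Dune, Emma]
    }
}
```

Nothing is kept in memory except what the handler chooses to store, which is why SAX suits large documents that only need one pass.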

The DOM parser parses the XML into an in-memory tree data structure (also
known as a DOM tree). The Document Object Model is an API for valid HTML
and well-formed XML documents. It defines the logical structure of documents and
the way a document is accessed and manipulated. In the DOM specification, the
term "document" is used in the broad sense; increasingly, XML is being
used as a way to represent many kinds of information that may be stored in
diverse systems. Much of this has traditionally been seen as data rather than as
documents. Nevertheless, XML presents this data as documents, and the DOM can be
used to manage this data.
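
For comparison with SAX, the sketch below parses a small document into a DOM tree and then walks it. It uses the org.w3c.dom interfaces bundled with the JDK. The class name DomNavigationDemo and the sample document are assumptions made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo: loads XML into a DOM tree and navigates it.
class DomNavigationDemo {

    static Document parse(String xml) {
        try {
            // The whole document is read into memory before parse() returns.
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<library><book id=\"b1\"><title>Dune</title></book>"
                + "<book id=\"b2\"><title>Emma</title></book></library>");

        Element root = doc.getDocumentElement();
        System.out.println(root.getTagName()); // library

        NodeList books = root.getElementsByTagName("book");
        for (int i = 0; i < books.getLength(); i++) {
            Element book = (Element) books.item(i);
            String title = book.getElementsByTagName("title").item(0).getTextContent();
            // Prints "b1: Dune" and then "b2: Emma"
            System.out.println(book.getAttribute("id") + ": " + title);
        }
    }
}
```

Because the tree stays in memory, the code can visit nodes in any order and as often as it likes, at the cost of holding the full document.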

With the Document Object Model, programmers can build documents, navigate
their structure, and add, modify, or delete elements and content. Anything found
in an HTML or XML document can be accessed, changed, deleted, or added using the
DOM. The DOM is a W3C specification
(http://www.w3.org/TR/2000/REC-DOM-Level-2-Core-20001113/).
The JDOM API (http://www.jdom.org) is one of the easier Java APIs for working
with XML document trees.
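
The build, modify, and delete operations described above might look like the following sketch. It uses the standard org.w3c.dom API from the JDK rather than JDOM, and the class name DomEditDemo, the helper methods, and the element names are all invented for illustration.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo: builds a DOM tree from scratch, then edits it.
class DomEditDemo {

    static Document buildAndEdit() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Parser could not be set up", e);
        }
        // Build: a root element with two children.
        Element library = doc.createElement("library");
        doc.appendChild(library);
        Element dune = addBook(doc, library, "b1", "Dune");
        Element emma = addBook(doc, library, "b2", "Emma");

        // Modify: replace the text content of the first book.
        dune.setTextContent("Dune (revised)");
        // Delete: remove the second book from its parent.
        library.removeChild(emma);
        return doc;
    }

    static Element addBook(Document doc, Element parent, String id, String title) {
        Element book = doc.createElement("book");
        book.setAttribute("id", id);
        book.appendChild(doc.createTextNode(title));
        parent.appendChild(book);
        return book;
    }

    // Writes the tree back out as text using an identity transformation.
    static String serialize(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(buildAndEdit()));
    }
}
```

After the edits the tree holds a single book element with the revised title, and serialize turns that tree back into XML text.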