Parsing documents with a DOM Parser

Unlike SAX, DOM does not have a class or interface that
represents the XML parser. Each parser vendor provides their own
unique class. In Xerces, this is
org.apache.xerces.parsers.DOMParser.
In Crimson it’s
org.apache.crimson.jaxp.DocumentBuilderImpl.
In Ælfred it’s an inner class,
gnu.xml.dom.JAXPFactory$JAXPBuilder.
In Oracle, it’s
oracle.xml.parser.v2.DOMParser.
In other implementations
it will be something else.

Furthermore, since these classes do not share a common
interface or superclass, the methods they use to parse
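documents vary too. For example, in Xerces, the two methods that
read XML documents are parse(InputSource source) and
parse(String systemID). Both return void and can throw a SAXException
or an IOException; you call getDocument() afterward to
retrieve the parsed document.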

Other parsers have slightly different methods still.
What all of these have in common is that they read
an XML document from a source of text,
most commonly a file or a stream, and provide an
org.w3c.dom.Document
object.
Once you have a reference to this Document
object you can work with it using only the standard methods
of the DOM interfaces. There’s no further need to use parser-specific classes.

JAXP DocumentBuilder and DocumentBuilderFactory

The lack of a standard means of parsing an XML document
is one of the holes that JAXP fills. If your parser
implements JAXP, then instead of using the parser-specific
classes, you can use the
javax.xml.parsers.DocumentBuilderFactory
and javax.xml.parsers.DocumentBuilder
classes to parse the documents. The basic approach is as follows:

Use the static
DocumentBuilderFactory.newInstance()
factory method to return a DocumentBuilderFactory
object.

Use the
newDocumentBuilder()
method of this DocumentBuilderFactory
object to return a parser-specific instance of the
abstract
DocumentBuilder class.

Use one of the five
parse()
methods of DocumentBuilder
to read the XML document and return an
org.w3c.dom.Document object.

Example 9.5 demonstrates with a simple program that
uses JAXP to check documents for well-formedness.

Example 9.5. A program that uses JAXP to check documents for well-formedness

import javax.xml.parsers.*;  // JAXP
import org.xml.sax.SAXException;
import java.io.IOException;

public class JAXPChecker {

  public static void main(String[] args) {

    if (args.length <= 0) {
      System.out.println("Usage: java JAXPChecker URL");
      return;
    }
    String document = args[0];

    try {
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      DocumentBuilder parser = factory.newDocumentBuilder();
      parser.parse(document);
      System.out.println(document + " is well-formed.");
    }
    catch (SAXException e) {
      System.out.println(document + " is not well-formed.");
    }
    catch (IOException e) {
      System.out.println(
       "Due to an IOException, the parser could not check "
       + document
      );
    }
    catch (FactoryConfigurationError e) {
      // JAXP suffers from excessive brain-damage caused by
      // intellectual in-breeding at Sun. (Basically the Sun
      // engineers spend way too much time talking to each other
      // and not nearly enough time talking to people outside
      // Sun.) Fortunately, you can happily ignore most of the
      // JAXP brain damage and not be any the poorer for it.
      // This, however, is one of the few problems you can't
      // avoid if you're going to use JAXP at all.
      // DocumentBuilderFactory.newInstance() should throw a
      // ClassNotFoundException if it can't locate the factory
      // class. However, what it does throw is an Error,
      // specifically a FactoryConfigurationError. Very few
      // programs are prepared to respond to errors as opposed
      // to exceptions. You should catch this error in your
      // JAXP programs as quickly as possible even though the
      // compiler won't require you to, and you should
      // never rethrow it or otherwise let it escape from the
      // method that produced it.
      System.out.println("Could not locate a factory class");
    }
    catch (ParserConfigurationException e) {
      System.out.println("Could not locate a JAXP parser");
    }

  }

}

For example, here’s the output from when I ran this program
across this chapter’s DocBook source code:

How JAXP Chooses Parsers

You may be wondering which parser this program actually uses.
JAXP, after all, is reasonably parser-independent.
The answer depends on which parsers are installed in your
class path and how certain system properties are set.
The default is to use the class named by the
javax.xml.parsers.DocumentBuilderFactory
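system property. For example, if you want to make sure that
Xerces is used to parse documents, then you would run
JAXPChecker with the option
-Djavax.xml.parsers.DocumentBuilderFactory=org.apache.xerces.jaxp.DocumentBuilderFactoryImpl
on the java command line.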

If the javax.xml.parsers.DocumentBuilderFactory
property is not set, then JAXP looks in
the lib/jaxp.properties properties file
in the JRE directory
to determine a default value for
the javax.xml.parsers.DocumentBuilderFactory
system property. If you want to consistently use a certain
DOM
parser, for instance gnu.xml.dom.JAXPFactory, place the following line in that
file:

javax.xml.parsers.DocumentBuilderFactory=gnu.xml.dom.JAXPFactory

If this fails to locate a parser, next JAXP looks for a
META-INF/services/javax.xml.parsers.DocumentBuilderFactory
file
in all JAR files available to the runtime to find the name of the
concrete DocumentBuilderFactory
subclass.

Finally, if that fails,
then DocumentBuilderFactory.newInstance()
returns a default class, generally the parser from the
vendor who also provided the JAXP classes.
For example, the JDK JAXP classes pick
org.apache.crimson.jaxp.DocumentBuilderFactoryImpl by default but
the Ælfred JAXP classes pick
gnu.xml.dom.JAXPFactory instead.
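
If you're not sure which factory this lookup actually produced on your system, you can simply ask the factory object for its class name. The following is a minimal sketch; the class name WhichFactory and the helper method chosenFactoryClass() are mine, invented for illustration. The name it prints depends entirely on your class path, your system properties, and your JDK.

```java
import javax.xml.parsers.DocumentBuilderFactory;

class WhichFactory {

    // Returns the name of the concrete factory class that
    // DocumentBuilderFactory.newInstance() selected.
    static String chosenFactoryClass() {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        return factory.getClass().getName();
    }

    public static void main(String[] args) {
        // This is null unless the property was set, for instance with -D
        String requested
         = System.getProperty("javax.xml.parsers.DocumentBuilderFactory");
        System.out.println("Requested by system property: " + requested);
        System.out.println("Factory actually in use: " + chosenFactoryClass());
    }
}
```

Running this once with and once without the system property set is a quick way to confirm that the property is being honored.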

Configuring DocumentBuilderFactory

The DocumentBuilderFactory
has a number of options that allow you to determine exactly
how the parsers it creates behave. Most of the setter
methods take a boolean that turns the feature on if true or
off if false. However, a couple of the features are defined
as confusing double negatives, so read carefully.

Coalescing

The setCoalescing() and isCoalescing() methods determine whether CDATA sections are
merged with text nodes or not. If the coalescing feature
is true, then the result tree will not contain any CDATA
section nodes, even if the parsed XML document does
contain CDATA sections.

The default is false, but in most situations you
should set this to true, especially if
you’re just reading the document and are not going to
write it back out again.
CDATA sections should not be treated differently from any
other text. Whether or not certain text is written in a
CDATA section should be purely a matter of syntactic sugar
for human convenience, not anything that has any effect
on the data model.
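
To see the difference for yourself, here is a minimal sketch that parses the same small in-memory document with coalescing off and then on, and counts the CDATA section nodes among the root element's children. The class name, the helper method, and the sample document are all made up for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class CoalescingDemo {

    // Counts the CDATA section nodes directly below the root element.
    static int countCdataChildren(boolean coalescing) {
        String xml = "<a>x<![CDATA[y]]>z</a>";
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setCoalescing(coalescing);
            DocumentBuilder parser = factory.newDocumentBuilder();
            Document doc = parser.parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            int count = 0;
            for (int i = 0; i < children.getLength(); i++) {
                if (children.item(i).getNodeType() == Node.CDATA_SECTION_NODE) {
                    count++;
                }
            }
            return count;
        }
        catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse sample document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("coalescing off: "
         + countCdataChildren(false) + " CDATA section node(s)");
        System.out.println("coalescing on:  "
         + countCdataChildren(true) + " CDATA section node(s)");
    }
}
```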

Expand Entity References

The setExpandEntityReferences() and isExpandEntityReferences()
methods determine whether the parsers
produced by this factory expand entity
references.

The default is true. If a parser is validating, then
it will expand entity references, even if this feature is set to false.
That is, the validation feature overrides the expand
entity references feature.

The five predefined references—
&amp;, &lt;,
&gt;, &quot;, and &apos;
—will always be expanded regardless of the
value of this property.
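
As a rough illustration, the following sketch parses a document that declares one internal entity and counts the entity reference nodes left beneath the root element. Everything named here (the class, the helper, the entity e) is hypothetical. I'm assuming a non-validating parser that honors the setting, as the parser bundled with the JDK does; other parsers may behave differently.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class EntityExpansionDemo {

    // Counts the entity reference nodes directly below the root element.
    static int countEntityReferences(boolean expand) {
        String xml = "<!DOCTYPE a [<!ENTITY e \"hello\">]><a>&e;</a>";
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setExpandEntityReferences(expand);
            DocumentBuilder parser = factory.newDocumentBuilder();
            Document doc = parser.parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            int count = 0;
            for (int i = 0; i < children.getLength(); i++) {
                if (children.item(i).getNodeType() == Node.ENTITY_REFERENCE_NODE) {
                    count++;
                }
            }
            return count;
        }
        catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse sample document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("expanding:     "
         + countEntityReferences(true) + " entity reference node(s)");
        System.out.println("not expanding: "
         + countEntityReferences(false) + " entity reference node(s)");
    }
}
```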

Ignore Comments

The setIgnoringComments() and isIgnoringComments() methods determine whether the parsers
produced by this factory will generate comment nodes for
comments seen in the input document. The default, false,
means that comment nodes will be produced.
(Watch out for the double negative here.
False means include comments, and true means don’t include comments.
This confused me
initially, and I was getting my poison pen all ready to
write about the brain damage of throwing away comments
although the spec required them to be included, when I
realized that the method was in fact behaving like it
should.)
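
If the double negative still makes your head spin, a quick experiment settles it. This is a minimal sketch with invented names that counts the comment nodes beneath the root element of a tiny in-memory document for each setting.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class IgnoreCommentsDemo {

    // Counts the comment nodes directly below the root element.
    static int countComments(boolean ignoringComments) {
        String xml = "<a><!-- a note -->text</a>";
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setIgnoringComments(ignoringComments);
            DocumentBuilder parser = factory.newDocumentBuilder();
            Document doc = parser.parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            int count = 0;
            for (int i = 0; i < children.getLength(); i++) {
                if (children.item(i).getNodeType() == Node.COMMENT_NODE) {
                    count++;
                }
            }
            return count;
        }
        catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse sample document", e);
        }
    }

    public static void main(String[] args) {
        // false means "do not ignore", so the comment node is kept
        System.out.println("ignoring=false: " + countComments(false) + " comment(s)");
        System.out.println("ignoring=true:  " + countComments(true) + " comment(s)");
    }
}
```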

Ignore Element Content Whitespace

The setIgnoringElementContentWhitespace() and
isIgnoringElementContentWhitespace() methods determine whether the parsers
produced by this factory will generate text nodes for
so-called “ignorable white space”; that is, white space
that occurs between tags where the DTD specifies that
parsed character data cannot appear.

The default is false; that is, include text nodes for
ignorable white space. Setting this to true might well be
useful in record-like documents. However, for this property
to make a difference, the documents must have a DTD and
should be valid or very nearly so. Otherwise the parser
can’t tell which white space is ignorable and which isn’t.
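
Here is a small sketch of the record-like case. The document and its DTD are invented for the example. Note that I also turn on validation: the JAXP API documentation says this setting relies on the parser being in validating mode, so I'm assuming you'd normally set the two together.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class IgnorableWhitespaceDemo {

    // A valid document whose root element has element-only content
    static final String XML =
        "<!DOCTYPE a [<!ELEMENT a (b)><!ELEMENT b (#PCDATA)>]>\n"
      + "<a>\n  <b>x</b>\n</a>";

    // Counts all child nodes of the root element.
    static int countRootChildren(boolean ignoreWhitespace) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            factory.setIgnoringElementContentWhitespace(ignoreWhitespace);
            DocumentBuilder parser = factory.newDocumentBuilder();
            Document doc = parser.parse(new InputSource(new StringReader(XML)));
            return doc.getDocumentElement().getChildNodes().getLength();
        }
        catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse sample document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("keeping white space:  "
         + countRootChildren(false) + " child node(s)");
        System.out.println("ignoring white space: "
         + countRootChildren(true) + " child node(s)");
    }
}
```

With the white space kept, the b element is surrounded by two text nodes that contain nothing but line breaks and indentation. With it ignored, b is the only child.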

Namespace Aware

The setNamespaceAware() and isNamespaceAware() methods determine whether the parsers
produced by this factory are
“namespace aware.”
A namespace aware parser will set the prefix and namespace
URI properties of element and attribute nodes that are in a namespace.
A non-namespace aware parser won’t. The default is false.
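
The following sketch shows the practical effect. It parses one prefixed element and reports the namespace URI of the root, first with a namespace aware parser and then without. The class name, the helper, and the http://example.com/ns URI are placeholders I made up.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class NamespaceAwareDemo {

    // Returns the namespace URI the parser attached to the root element,
    // or null if it did not attach one.
    static String rootNamespace(boolean namespaceAware) {
        String xml = "<p:a xmlns:p=\"http://example.com/ns\"/>";
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            DocumentBuilder parser = factory.newDocumentBuilder();
            Document doc = parser.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getNamespaceURI();
        }
        catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse sample document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("namespace aware:     " + rootNamespace(true));
        System.out.println("not namespace aware: " + rootNamespace(false));
    }
}
```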

Validating

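The setValidating() and isValidating() methods determine whether the
parsers produced by this factory validate documents.
The default is false, do not validate. If you want to
validate your documents, set this property to true.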
You’ll also need to register a SAX
ErrorHandler with the
DocumentBuilder using its
setErrorHandler()
method to receive notice
of validity errors. Example 9.6
demonstrates with a program that uses JAXP to validate a
document named on the command line.
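
The core of the technique can be sketched in a few lines. The class below is not a full command line program, just an illustration with invented names: it turns on validation, registers an ErrorHandler that counts validity errors, and parses a document held in a string. Well-formedness errors are rethrown from fatalError() and surface here as an unchecked exception.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class ValidationDemo {

    // Parses the document with validation on and returns the number of
    // validity errors reported to the ErrorHandler.
    static int countValidityErrors(String xml) {
        final int[] errors = {0};
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder parser = factory.newDocumentBuilder();
            parser.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) {
                    // warnings are not validity errors; ignore them
                }
                public void error(SAXParseException e) {
                    errors[0]++;  // validity errors arrive here
                }
                public void fatalError(SAXParseException e) throws SAXException {
                    throw e;      // the document is not well-formed
                }
            });
            parser.parse(new InputSource(new StringReader(xml)));
        }
        catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse document", e);
        }
        return errors[0];
    }

    public static void main(String[] args) {
        String valid = "<!DOCTYPE a [<!ELEMENT a (#PCDATA)>]><a>text</a>";
        String invalid = "<!DOCTYPE a [<!ELEMENT a EMPTY>]><a>text</a>";
        System.out.println("valid document:   "
         + countValidityErrors(valid) + " error(s)");
        System.out.println("invalid document: "
         + countValidityErrors(invalid) + " error(s)");
    }
}
```

Without the error handler, the parse would still succeed for an invalid but well-formed document, and your program would never learn about the validity errors.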

Parser-specific Attributes

Many JAXP aware parsers support various custom features.
For example, Xerces has an
http://apache.org/xml/features/dom/create-entity-ref-nodes
feature
that lets you choose whether or
not to include entity reference nodes in the DOM tree.
This is not the same as deciding whether or not to expand entity
references. That setting determines whether the entity reference nodes that
are placed in the tree
have children representing their replacement text or not.

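JAXP allows you to set and get these custom features
as objects of the appropriate type using the factory’s
setAttribute(String name, Object value) and
getAttribute(String name) methods. Both throw an
IllegalArgumentException if the underlying parser does not
recognize the attribute.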

For example, suppose you’re using Xerces and you don’t
want to include entity reference nodes.
They're included by default so you need to set
http://apache.org/xml/features/dom/create-entity-ref-nodes
to false.
You would use setAttribute() on
the DocumentBuilderFactory
like this:
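
A minimal sketch of that call follows. Because a parser that doesn't recognize the attribute responds with an IllegalArgumentException, I've wrapped the call in a small helper, trySetAttribute(), which is my own invention and not part of JAXP. Whether the attribute is accepted depends on which parser your factory turns out to be.

```java
import javax.xml.parsers.DocumentBuilderFactory;

class AttributeDemo {

    // The Xerces feature name discussed in the text
    static final String CREATE_ENTITY_REF_NODES
     = "http://apache.org/xml/features/dom/create-entity-ref-nodes";

    // Tries to set a parser-specific attribute. Returns false if the
    // underlying parser does not recognize the name or the value.
    static boolean trySetAttribute(DocumentBuilderFactory factory,
                                   String name, Object value) {
        try {
            factory.setAttribute(name, value);
            return true;
        }
        catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        boolean accepted
         = trySetAttribute(factory, CREATE_ENTITY_REF_NODES, Boolean.FALSE);
        System.out.println(factory.getClass().getName()
         + (accepted ? " accepted" : " rejected") + " the attribute");
    }
}
```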

The naming conventions for both attribute names and
values depend on the underlying parser.
Xerces uses URL strings like SAX feature names.
Other parsers may do something different.
JAXP 1.2 will add a couple of standard attributes related
to schema validation.

DOM3 Load and Save

JAXP only works for Java, and it is a Sun proprietary
standard. Consequently, the W3C DOM working group is
preparing an alternative cross-vendor means of parsing an XML
document with a DOM parser. This will be published as part of
DOM Level 3. DOM3 is not close to a finished recommendation
at the time of this writing
and is not yet implemented by any parsers, but I can show
you pretty much what the interface is likely to look like.

Parsing a document with DOM3 requires four steps:

Load a DOMImplementation object
by passing the feature string "LS-Load 3.0" to the
DOMImplementationRegistry.getDOMImplementation()
factory method. (This class is also new in DOM3.)

Cast this DOMImplementation object
to
DOMImplementationLS,
the sub-interface that provides the extra methods you need.

Call the implementation’s createDOMBuilder()
method to
create a new DOMBuilder object.
This is the new DOM3 class that represents the parser.
The first argument to createDOMBuilder()
specifies whether the document is parsed synchronously or asynchronously.
The second argument is a URL identifying the type of schema
to be used during the parse,
"http://www.w3.org/2001/XMLSchema" for W3C XML Schemas,
"http://www.w3.org/TR/REC-xml" for DTDs.
You can pass null to ignore all schemas.

Pass the document’s URL to the builder object’s
parseURI() method
to read the document and
return a Document object.
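
The four steps above can be sketched as follows. One loud caveat: this sketch is written against the Load and Save API as it was eventually finalized and bundled with the JDK in the org.w3c.dom.ls package, not against the draft described here. In the finished API the feature string is "LS", DOMBuilder is named LSParser, createDOMBuilder() is named createLSParser(), and the registry is obtained through DOMImplementationRegistry.newInstance(). The class name DOM3Checker and its helper methods are mine.

```java
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSException;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;

class DOM3Checker {

    // Steps 1 and 2: locate an implementation that supports Load and Save
    static DOMImplementationLS loadImplementation() {
        try {
            DOMImplementationRegistry registry
             = DOMImplementationRegistry.newInstance();
            return (DOMImplementationLS) registry.getDOMImplementation("LS");
        }
        catch (ReflectiveOperationException e) {
            throw new IllegalStateException("No DOM implementation found", e);
        }
    }

    // Checks a document held in a string rather than at a URL.
    static boolean isWellFormed(String xml) {
        DOMImplementationLS impl = loadImplementation();
        // Step 3: a synchronous parser; null means no schema type
        LSParser parser
         = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
        LSInput input = impl.createLSInput();
        input.setStringData(xml);
        try {
            Document doc = parser.parse(input);
            return doc != null;
        }
        catch (LSException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (args.length <= 0) {
            System.out.println("Usage: java DOM3Checker URL");
            return;
        }
        LSParser parser = loadImplementation()
         .createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
        try {
            // Step 4: read the document from its URL
            parser.parseURI(args[0]);
            System.out.println(args[0] + " is well-formed.");
        }
        catch (LSException e) {
            System.out.println(args[0] + " is not well-formed.");
        }
    }
}
```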

Example 9.7 demonstrates with a simple program that
uses DOM3 to check documents for well-formedness.

Example 9.7. A program that uses DOM3 to check documents for well-formedness

For the time being, JAXP’s DocumentBuilderFactory
is the obvious choice since it works today and is supported
by almost all DOM parsers written in Java.
Longer term, DOM3 will provide a number of important
capabilities JAXP does not, including parse progress
notification
and document filtering. However, since these APIs
are far from ready
for prime time just yet,
for the rest of this book, I’m mostly going to use JAXP
without further comment.