DOM Level 3

DOM Level 3 will finally add a standard Load and Save package
so that it will be possible to write completely
implementation-independent DOM programs. This
package, org.w3c.dom.ls,
is identified by the feature strings LS-Load and LS-Save.
The loading part includes the
DOMBuilder interface you've
already encountered.
The saving part is based on the
DOMWriter interface.
DOMWriter is more powerful
than XMLSerializer.
Whereas XMLSerializer is limited to
outputting documents, document fragments, and elements,
DOMWriter can output any kind
of node at all.
Furthermore, you can install a filter into
a DOMWriter that controls its
output.

Caution

This section is based on very early, bleeding-edge
technology and specifications, particularly the
July 25, 2002 Working Draft of the
Document
Object Model (DOM) Level 3 Abstract
Schemas and Load and Save Specification
and Xerces-J 2.0.2. Even with Xerces-J 2.1, most of the code in
this section won’t even compile, much less run.
Furthermore, it’s virtually guaranteed that the details in this
section will change before DOM3 becomes a final
recommendation.

As shown by the method signatures in
Example 13.3,
DOMWriter
can copy a Node object from
memory into serialized bytes or characters.
It has methods to write XML nodes onto
a Java OutputStream
or a String. The most common kind of
node you’ll write is a
Document,
but you can write all the other kinds of node as well
such as
Element,
Attr, and
Text.
This interface also has methods to control exactly how the
output is formatted and how errors are reported.

Note

DOMWriter is not a
java.io.Writer. In fact, it even
prefers OutputStreams to
Writers. The name is just a
coincidence.

The primary purpose of this interface is
to write nodes into strings or onto streams.
These nodes can be complete documents or parts thereof like
elements or text nodes. For example, this code fragment uses
the DOMWriter object
writer to
copy the Document object
doc onto System.out
and copy its root element into a
String:
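The draft DOMWriter interface described here is not part of any shipping JDK. In the final DOM Level 3 Load and Save Recommendation it became LSSerializer, with write() and writeToString() taking over the roles of writeNode() and writeToString(). The following is a minimal sketch of the same two operations using that final API as bundled with Java, not the draft code; the class name WriteNodes and the sample greeting document are hypothetical.

```java
import java.io.OutputStream;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

class WriteNodes {

    // Finds a DOM implementation that advertises the final "LS" feature.
    static DOMImplementation implementation() {
        try {
            DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
            DOMImplementation impl = registry.getDOMImplementation("LS");
            if (impl == null) {
                throw new IllegalStateException("No Load and Save support found");
            }
            return impl;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not load a DOM implementation", e);
        }
    }

    // Builds a small hypothetical document: <greeting>Hello</greeting>
    static Document sampleDocument() {
        Document doc = implementation().createDocument(null, "greeting", null);
        doc.getDocumentElement().appendChild(doc.createTextNode("Hello"));
        return doc;
    }

    // Serializes any node onto a byte stream.
    static void writeNode(Node node, OutputStream out) {
        DOMImplementationLS ls = (DOMImplementationLS) implementation();
        LSSerializer writer = ls.createLSSerializer();
        LSOutput target = ls.createLSOutput();
        target.setByteStream(out);
        writer.write(node, target);
    }

    // Serializes any node into a String.
    static String writeToString(Node node) {
        DOMImplementationLS ls = (DOMImplementationLS) implementation();
        return ls.createLSSerializer().writeToString(node);
    }

    public static void main(String[] args) {
        Document doc = sampleDocument();
        writeNode(doc, System.out);                                   // whole document
        System.out.println();
        System.out.println(writeToString(doc.getDocumentElement())); // root element only
    }
}
```

The byte stream is wrapped in an LSOutput rather than passed directly, which is how the final API separates the destination from the serializer.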

DOMWriter
also has several methods to configure the output.
The setNewLine() method can choose
the line separator used for output. The only legal values
are a carriage
return, a line feed, or both; that is, in Java parlance,
"\r", "\n", or "\r\n".
You can also set this to null to
indicate you want the platform’s default value.

The setEncoding() method
changes the character encoding used for the output. Which
encodings any given serializer supports varies from
implementation to implementation, but common values include
UTF-8, UTF-16, and ISO-8859-1.
UTF-8 is the default if a value is not supplied.
For example, this writer sets up the output for use on a
Macintosh:
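As a hedged illustration of these two settings, the sketch below uses the final DOM Level 3 API shipped with Java, where the draft DOMWriter is called LSSerializer and the encoding is set on the LSOutput rather than on the serializer. The bare carriage return assumes the classic Mac OS line-ending convention, and ISO-8859-1 is just one of the common encodings listed above, chosen for illustration. The class name OutputSettings and the note document are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

class OutputSettings {

    // Serializes a hypothetical <note> document with the given settings.
    static byte[] serialize(String newLine, String encoding) {
        DOMImplementation impl;
        try {
            impl = DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not load a DOM implementation", e);
        }
        if (impl == null) {
            throw new IllegalStateException("No Load and Save support found");
        }
        Document doc = impl.createDocument(null, "note", null);
        doc.getDocumentElement().appendChild(doc.createTextNode("Remember the milk"));

        DOMImplementationLS ls = (DOMImplementationLS) impl;
        LSSerializer writer = ls.createLSSerializer();
        writer.setNewLine(newLine);      // null asks for the platform default

        LSOutput target = ls.createLSOutput();
        target.setEncoding(encoding);    // final API: encoding belongs to the output
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        target.setByteStream(bytes);
        writer.write(doc, target);
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = serialize("\r", "ISO-8859-1");
        System.out.println(new String(data, StandardCharsets.ISO_8859_1));
    }
}
```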

More detailed control of the output can be achieved by getting
and setting features of the
DOMWriter,
as you’ll see shortly.

The setErrorHandler()
method can install an org.w3c.dom.DOMErrorHandler
object that receives notification of any problems that arise
when outputting a node, such as an element that uses the
same prefix for two different namespace URIs on two attributes.
This is a callback interface, similar to
org.xml.sax.ErrorHandler but even
simpler since it doesn’t use different methods for different
kinds of errors. Example 13.4 shows
this interface. The handleError() method
returns true if processing should continue after
the error, false if it shouldn’t.
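The callback can be sketched as follows. DOMErrorHandler and DOMError exist in org.w3c.dom in current Java releases, but in the final API the handler is installed through the serializer's "error-handler" configuration parameter rather than a setErrorHandler() method, so that part differs from the draft described here. The class name CollectingErrorHandler and the policy of continuing only after warnings are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.DOMError;
import org.w3c.dom.DOMErrorHandler;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;

class CollectingErrorHandler implements DOMErrorHandler {

    final List<String> messages = new ArrayList<>();

    // A single method receives every kind of problem. Returning true
    // asks the serializer to continue; returning false asks it to stop.
    @Override
    public boolean handleError(DOMError error) {
        messages.add(error.getMessage());
        System.err.println("Serialization problem: " + error.getMessage());
        return error.getSeverity() == DOMError.SEVERITY_WARNING;
    }

    // Creates a serializer with the given handler installed.
    static LSSerializer serializerWith(DOMErrorHandler handler) {
        DOMImplementation impl;
        try {
            impl = DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not load a DOM implementation", e);
        }
        if (impl == null) {
            throw new IllegalStateException("No Load and Save support found");
        }
        LSSerializer writer = ((DOMImplementationLS) impl).createLSSerializer();
        writer.getDomConfig().setParameter("error-handler", handler);
        return writer;
    }
}
```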

In Xerces-2, the
XMLSerializer class implements the
DOMWriter interface, so if you
prefer you can use these methods instead of the ones
discussed in the last section.
Example 13.5 demonstrates
a complete program that builds a simple
SVG document in memory and
writes it into the file
circle.svg
in the current working directory using
a \r\n line end and the UTF-16 encoding.
The error handler is set to
an anonymous inner class that prints error messages on
System.err and returns false to
indicate that processing should stop when an error is
detected.

Note

Xerces-J 2.1 currently puts the
DOMWriter
interface in the org.apache.xerces.dom3.ls package
instead of the org.w3c.dom.ls package.
The Xerces team is trying to keep the experimental DOM3
classes separate from the main API until DOM3 is more
stable.

Creating DOMWriters

Example 13.5 depends on Xerces-specific
classes. It won’t work with GNU-JAXP or Oracle or other
parsers, even after these parsers are upgraded to support
DOM3. However, you can write the code in a much more
parser-independent fashion by using
the DOMImplementationLS
interface, shown in Example 13.6,
to create concrete
implementations of DOMWriter,
rather than constructing the implementation classes directly.
DOMImplementationLS
is a sub-interface of DOMImplementation
that adds three methods to create new
DOMBuilders,
DOMWriters, and
DOMInputSources.

You retrieve a concrete instance of this factory interface
by using the DOM3
DOMImplementationRegistry
factory class
introduced in Chapter 10 to request a
DOMImplementation object that
supports the LS-Save feature. Then you cast that object to
DOMImplementationLS. For
example,
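A minimal sketch of this lookup follows. It uses the final DOM Level 3 API bundled with Java, in which the feature string is "LS" rather than the draft's LS-Save, and the factory creates LSSerializer objects rather than DOMWriters. The class name FindLoadSave is hypothetical.

```java
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;

class FindLoadSave {

    // Returns a Load and Save factory, or null if no registered
    // implementation advertises the requested feature.
    static DOMImplementationLS find(String features) {
        try {
            DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
            DOMImplementation impl = registry.getDOMImplementation(features);
            return (impl instanceof DOMImplementationLS) ? (DOMImplementationLS) impl : null;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not load a DOM implementation", e);
        }
    }

    public static void main(String[] args) {
        DOMImplementationLS ls = find("LS");
        if (ls == null) {
            System.err.println("No Load and Save support available");
            return;
        }
        LSSerializer writer = ls.createLSSerializer();
        System.out.println("Serializer class: " + writer.getClass().getName());
    }
}
```

Checking the result before casting keeps the program from failing with an exception when it runs against a parser that lacks Load and Save support.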

Using this technique, it’s straightforward to write a
completely implementation-independent program to generate and
serialize XML documents. Example 13.7 demonstrates.
It uses the DOMImplementationRegistry
class to load the
DOMImplementationLS
and the DOMWriter class to
output the final result. Otherwise, it just uses the standard
DOM2 classes that you've seen in previous chapters.

Example 13.7. An implementation-independent
DOM3 program to
build and serialize an XML document

This program has to test for both the LS-Load and LS-Save
features because it’s not absolutely guaranteed that an
implementation that has one will have the other, especially in
the early days of DOM3.

Serialization Features

The defaults used by the writeNode()
and writeToString() methods are
acceptable for
most uses. However, occasionally you want a little more
control over the serialized form. For instance, you might
want the output to be pretty printed with extra white space
added to indent the elements nicely.
Or perhaps you want the output to be in canonical form. All
of this and more can be controlled by setting features in
the writer before invoking
the write method.

Defined features include:

normalize-characters, optional, default true

If true, output text should be normalized according to
the W3C Character Model. For example, the word
café would be represented as the
four character string c a f é
rather than the five character string c a f e combining_acute_accent.
Implementations are only required to support a false value
for this feature.

split-cdata-sections, required, default true

If true, CDATA sections containing the CDATA section end delimiter
]]> are split into pieces and the
]]> included in a raw text node.
If false, such a CDATA section is not split.
Instead an error is reported and output stops.

whitespace-in-element-content, required, default true

If true, all white space is output.
If false, text nodes containing only white space
are deleted if the parent element’s declaration
from the DTD/schema
does not allow #PCDATA to appear
at that point.

discard-default-content, required, default true

If true, the
implementation
will attempt to leave out any nodes whose presence can be inferred from the DTD or schema;
e.g. default attribute values. If false, it will write them out
explicitly.

canonical-form, optional, default false

If true,
the document will be written according to the rules specified
by the Canonical XML specification. For instance, attributes will be lexically
ordered and CDATA sections will not be included. If false, then
the exact output is implementation dependent.

format-pretty-print, optional, default false

If true, white space will be adjusted to
“pretty print”
the XML. Exactly what this means, e.g. how many spaces elements
are indented or what maximum line length is used, is left up to
implementations.

validation, optional, default false

If true, then the document’s
schema is used to validate the document as it is being
output. Any
validation errors that are discovered are reported to
the registered error handler.
(Both validation and error handlers are other new features
in DOM3.)
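As one hedged example of setting such a feature, the sketch below turns on format-pretty-print. It uses the final DOM Level 3 API bundled with Java, where features are set as parameters on the DOMConfiguration returned by LSSerializer.getDomConfig() instead of through feature methods on the draft DOMWriter. The class name PrettyPrint and the list document are hypothetical, and the exact indentation is implementation dependent.

```java
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;

class PrettyPrint {

    // Serializes a hypothetical <list><item/></list> document,
    // optionally asking the serializer to indent it.
    static String serialize(boolean pretty) {
        DOMImplementation impl;
        try {
            impl = DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not load a DOM implementation", e);
        }
        if (impl == null) {
            throw new IllegalStateException("No Load and Save support found");
        }
        Document doc = impl.createDocument(null, "list", null);
        doc.getDocumentElement().appendChild(doc.createElement("item"));

        LSSerializer writer = ((DOMImplementationLS) impl).createLSSerializer();
        DOMConfiguration config = writer.getDomConfig();
        // The feature is optional, so ask before setting it.
        if (config.canSetParameter("format-pretty-print", pretty)) {
            config.setParameter("format-pretty-print", pretty);
        }
        return writer.writeToString(doc);
    }

    public static void main(String[] args) {
        System.out.println(serialize(false));
        System.out.println(serialize(true));
    }
}
```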

In addition, implementations may define custom
features. These names will generally begin with
vendor-specific prefixes like “apache:” or
“oracle:”.
For portability, you should check for the
existence of such a feature with
canSetFeature() before setting
it. Otherwise, you’re likely to encounter an unexpected
DOMException when the program is
run with a different
parser.

For example, this code fragment attempts to output
the Document object doc
onto
the OutputStream out
in canonical form.
However, if the implementation of
DOMWriter doesn’t support
Canonical XML, it just outputs the document in the normal
way:
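The check-then-fall-back pattern just described can be sketched as follows. This is not the draft code: it assumes the final DOM Level 3 API bundled with Java, where canSetParameter() on the serializer's DOMConfiguration plays the role of canSetFeature(). The class name CanonicalOrDefault and the root document are hypothetical.

```java
import java.io.OutputStream;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

class CanonicalOrDefault {

    static DOMImplementation implementation() {
        try {
            DOMImplementation impl =
                DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
            if (impl == null) {
                throw new IllegalStateException("No Load and Save support found");
            }
            return impl;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not load a DOM implementation", e);
        }
    }

    // Builds a hypothetical <root b="2" a="1"/> document.
    static Document sampleDocument() {
        Document doc = implementation().createDocument(null, "root", null);
        doc.getDocumentElement().setAttribute("b", "2");
        doc.getDocumentElement().setAttribute("a", "1");
        return doc;
    }

    // Writes doc in canonical form if supported, otherwise in the normal
    // way. Returns true if canonical form was actually requested.
    static boolean write(Document doc, OutputStream out) {
        DOMImplementationLS ls = (DOMImplementationLS) implementation();
        LSSerializer writer = ls.createLSSerializer();
        DOMConfiguration config = writer.getDomConfig();
        boolean canonical = config.canSetParameter("canonical-form", Boolean.TRUE);
        if (canonical) {
            config.setParameter("canonical-form", Boolean.TRUE);
        }
        LSOutput target = ls.createLSOutput();
        target.setByteStream(out);
        writer.write(doc, target);
        return canonical;
    }

    public static void main(String[] args) {
        boolean canonical = write(sampleDocument(), System.out);
        System.out.println();
        System.out.println("Canonical form used: " + canonical);
    }
}
```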

Filtering Output

One of the more original aspects of the DOMWriter API
is the ability to attach filters to a writer that remove
certain nodes from the output. A DOMWriterFilter
is a sub-interface of NodeFilter
from last chapter’s traversal API, and works almost
exactly like it. This shouldn’t be too surprising since
serializing a
document is just another tree-walking operation.

To perform output filtering you first implement the
DOMWriterFilter interface shown in
Example 13.8.
As with the NodeFilter
superinterface, the
acceptNode() method returns
one of the three named constants NodeFilter.FILTER_ACCEPT,
NodeFilter.FILTER_REJECT,
or NodeFilter.FILTER_SKIP
to indicate whether or not a particular node and its
descendants should be output. (This method isn’t
listed here because it’s inherited from the superinterface.)

The getWhatToShow()
method returns an int
constant indicating which kinds of
nodes are passed to this filter for processing. This is a
combination of the bit constants used by
NodeIterator
and TreeWalker in the last
chapter; that is, NodeFilter.SHOW_ELEMENT,
NodeFilter.SHOW_TEXT, NodeFilter.SHOW_COMMENT, etc.

Chapter 8 demonstrated a
SAX filter that removed
everything that wasn’t in the XHTML namespace from a document.
Example 13.9 is a
DOMWriterFilter that accomplishes the same task.

The one thing this doesn’t filter out is non-XHTML
attributes.
Those are written out with their elements. They
are not passed to acceptNode().
To filter out attributes from other namespaces would require
a custom DOMWriter.
You might be able to remove them from the element nodes
passed to acceptNode(), but this
would modify the in-memory tree as well as the streamed
output. Furthermore, although Java doesn’t support this, the
IDL code for DOMWriter indicates
that the Node passed to
acceptNode() is read-only.
The underlying implementation is probably not expecting
acceptNode() to modify its
argument.
Doing so is asking for corrupt data structures.

You can install a filter into a DOMWriter
using the setFilter() method.
Then any node the filter rejects will not be serialized.
Example 13.10 uses the above XHTMLFilter
to output pure XHTML from an input document that might contain
SVG, MathML, SMIL, or other non-XHTML elements.
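The filtering approach described above can be sketched with the final DOM Level 3 API bundled with Java, where the draft DOMWriterFilter became LSSerializerFilter and is installed with LSSerializer.setFilter(). This is an illustration under those assumptions, not the original XHTMLFilter; the class name XhtmlOnlyFilter and the small mixed-namespace document are hypothetical.

```java
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;
import org.w3c.dom.ls.LSSerializerFilter;
import org.w3c.dom.traversal.NodeFilter;

class XhtmlOnlyFilter implements LSSerializerFilter {

    static final String XHTML = "http://www.w3.org/1999/xhtml";
    static final String SVG = "http://www.w3.org/2000/svg";

    // Rejects every element that is not in the XHTML namespace.
    @Override
    public short acceptNode(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && !XHTML.equals(node.getNamespaceURI())) {
            return NodeFilter.FILTER_REJECT;
        }
        return NodeFilter.FILTER_ACCEPT;
    }

    // Only element nodes are passed to acceptNode().
    @Override
    public int getWhatToShow() {
        return NodeFilter.SHOW_ELEMENT;
    }

    // Serializes an XHTML document containing one SVG element
    // with the filter installed.
    static String demo() {
        DOMImplementation impl;
        try {
            impl = DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not load a DOM implementation", e);
        }
        if (impl == null) {
            throw new IllegalStateException("No Load and Save support found");
        }
        Document doc = impl.createDocument(XHTML, "html", null);
        Element root = doc.getDocumentElement();
        Element p = doc.createElementNS(XHTML, "p");
        p.appendChild(doc.createTextNode("Hello"));
        root.appendChild(p);
        root.appendChild(doc.createElementNS(SVG, "svg:circle"));

        LSSerializer writer = ((DOMImplementationLS) impl).createLSSerializer();
        writer.setFilter(new XhtmlOnlyFilter());
        return writer.writeToString(doc);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

As in the discussion above, attributes are not passed to acceptNode(), so a filter like this one cannot remove non-XHTML attributes from the elements it accepts.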