HOWTO Avoid Being Called a Bozo When Producing XML

“There’s just no nice way to say this: Anyone
who can’t make a syndication feed that’s well-formed XML
is an incompetent fool. … Maybe this is unkind and elitist
of me, but I think that anyone who either can’t or won’t
implement these measures is, as noted above, a bozo.” –
Tim
Bray, co-editor of the XML 1.0 specification

There seem to be developers who think that well-formedness is
awfully hard—if not impossible—to get right when
producing XML programmatically and developers who can get it right
and wonder why the others are so incompetent. I assume no one wants
to appear incompetent or to be called names. Therefore, I hope the
following list of dos and don’ts helps developers to move from the
former group to the latter.

Note about the scope of this document: This document focuses on
the Unicode layer, the XML 1.0 layer and the Namespaces in XML layer.
Getting higher layers like XHTML and Atom right is outside the scope
of this document. Also, anything served as text/html is
outside the scope of this document, although the methods described
here can be applied to producing HTML. In fact, doing so is even a good idea.

Don’t think of XML as a text format

Even people who have used compilers and seen the error and warning
messages seem to think that text formats can be written casually and
the piece of software on the other end will be able to fix small
errors like a human reader. This is not the case with XML. If the
document is not well-formed, it is not XML and an XML processor has
to cease normal processing upon finding a fatal error.

It helps if you think of XML as a binary format like PNG—only
with the added bonus that you can use text tools to see what is in
the file for debugging.

Don’t use text-based templates

Text-based Web templating systems (MovableType, WordPress, etc.)
and active page technologies that seem to allow you to embed program
code in a document skeleton (ASP, PHP, JSP, Lasso, Net.Data, etc.) are
designed for tag soup. They don’t guarantee well-formed XML output.
They don’t guarantee correct HTML output, either. They seem to work
with HTML, because text/html user agents are lenient and
try to cope with broken HTML. The most common mistakes involve not
escaping markup-significant characters or escaping them twice.

Don’t use these systems for producing XML. Making mistakes with
them is extremely easy and taking all cases into account is hard.
These systems have failed smart people who have actively tried to get
things right.

When your program grows and is modified, these things become
increasingly difficult to keep track of. It is very easy to overlook
something. Indeed, it is very likely that something goes wrong.

Use an isolated serializer

Still, producing the markup characters and writing them as bytes
into an output stream has to happen somewhere. Putting all the code
that writes to the output stream in a single class or compilation unit
makes it possible to debug the escaping-sensitive code in one place.
The serializer should have SAX-like methods such as
startElement(nsUri, localname, attributes),
endElement(nsUri, localname), characters(text),
processingInstruction(target, data), etc. The methods
always take unescaped strings and escape attribute values and
character data. With this approach, the notorious escaping problem
just vanishes!
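As a sketch of what such an isolated serializer can look like, here is a minimal Python version built on the standard library’s xml.sax.saxutils escaping helpers. The class and method names are illustrative, not any particular framework’s API:

```python
import io
from xml.sax.saxutils import escape, quoteattr

class Serializer:
    """Minimal streaming serializer: all escaping happens in this one class."""

    def __init__(self, out):
        self.out = out          # any object with a write(str) method

    def start_element(self, name, attributes=None):
        self.out.write('<' + name)
        for key, value in (attributes or {}).items():
            # quoteattr escapes the value and adds the surrounding quotes
            self.out.write(' %s=%s' % (key, quoteattr(value)))
        self.out.write('>')

    def end_element(self, name):
        self.out.write('</%s>' % name)

    def characters(self, text):
        # Callers always pass unescaped text; escape() handles <, > and &
        self.out.write(escape(text))

# Usage: the caller never thinks about escaping.
buf = io.StringIO()
ser = Serializer(buf)
ser.start_element('p', {'title': '2 > 1'})
ser.characters('2 > 1')
ser.end_element('p')
result = buf.getvalue()   # '<p title="2 &gt; 1">2 &gt; 1</p>'
```

Because escape() and quoteattr() are called in exactly one place, fixing an escaping bug there fixes it everywhere.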

Use a tree or a stack (or an XML parser)

Although the serializer API sketched above makes the escaping
problem disappear, the application could still call startElement()
and endElement() in a bad sequence and break well-formed
nesting.

Since an XML document parses into a tree, traversing an analogous
programmatically produced tree (e.g. DOM or XOM) induces the right
sequence of startElement() and endElement() calls.
It is worth noting that even though recursive tree traversal usually
gets all the attention in algorithm and data structure textbooks, a
tree with parent references can be traversed iteratively.

If you are serializing a tree data structure into an XML format
that closely mirrors the in-memory structure, you can use the
treeness of the data structure for ensuring well-formed nesting
instead of first building a DOM or XOM (or similar) tree.

A tree may be overkill, however. To ensure proper nesting, a
stack is sufficient. A stack can keep track of the open elements
without wasting space on parts of the document that have already been
handled or have not been handled yet. More importantly, the stack
does not need to be explicit: the runtime stack can be used. If
startElement is always called at the beginning of a
method and endElement is always called at the end, the
runtime stack guarantees the nesting.
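A minimal sketch of the explicit-stack approach in Python (the names are illustrative); a serializer can delegate to such a checker so that a bad start/end sequence raises immediately instead of producing an ill-formed document:

```python
class NestingChecker:
    """Keeps the open elements on an explicit stack and rejects bad sequences."""

    def __init__(self):
        self._open = []

    def start_element(self, name):
        self._open.append(name)

    def end_element(self, name):
        if not self._open or self._open[-1] != name:
            raise ValueError('unbalanced end tag: %s' % name)
        self._open.pop()

    def finish(self):
        if self._open:
            raise ValueError('unclosed elements: %s' % ', '.join(self._open))

# A well-formed sequence passes through silently.
checker = NestingChecker()
checker.start_element('html')
checker.start_element('body')
checker.end_element('body')
checker.end_element('html')
checker.finish()
```

The runtime-stack variant is the same idea without the explicit list: a method that opens an element on entry and closes it before returning cannot produce bad nesting.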

Finally, one way of producing SAX events in a proper sequence may
be obvious: a SAX parser emits SAX parse events in a proper sequence.
It may also be so obvious that it is easy to overlook.

The original way to get some SAX events is parsing an XML document
at runtime. But if you are producing XML dynamically, what good does
it do to parse a static document? Well, boilerplate markup can be put
in a static XML file and the interesting parts can be produced
programmatically. A SAX filter can look for interesting points in the
XML document (e.g. a particular processing instruction or element) and
inject additional SAX events into the pipeline before returning
control to the parser. The injection may involve parsing
another document and injecting events from it into the same pipeline.
If the static XML data is trusted, it is possible to even name
methods in processing instructions and use
reflection to call back into the application based on the XML
data.
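Here is a sketch of the filter idea using Python’s xml.sax; the marker processing instruction name insert-content is a made-up example:

```python
import io
from xml.sax import make_parser
from xml.sax.saxutils import XMLFilterBase, XMLGenerator

class InsertAtPI(XMLFilterBase):
    """Forwards all parse events except a marker processing instruction,
    which it replaces with dynamically generated events."""

    def processingInstruction(self, target, data):
        if target == 'insert-content':          # hypothetical marker name
            handler = self.getContentHandler()
            handler.startElement('p', {})
            handler.characters('generated at runtime')
            handler.endElement('p')
        else:
            XMLFilterBase.processingInstruction(self, target, data)

# Boilerplate comes from a static document; the filter injects the
# interesting part at the marker.
boilerplate = io.BytesIO(b'<doc><?insert-content ?></doc>')
out = io.StringIO()
xml_filter = InsertAtPI(make_parser())
xml_filter.setContentHandler(XMLGenerator(out))
xml_filter.parse(boilerplate)
result = out.getvalue()
```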

Another approach to boilerplate markup is code generation in such
a way that the parse events from an XML parser are recorded as
generated program code that can play back the events efficiently
without actually reading input at runtime. My SaxCompiler
takes this approach. Since the events are recorded from an XML
parser, they occur in a permissible sequence.

Don’t try to manage namespace declarations manually

Namespaces in XML makes it possible for XML element and attribute
names to be in a namespace. Being in a namespace means being
associated with an additional string symbol, which is required to be
a URI although it is compared code point for code point. The name of
the XHTML element for paragraphs is not just p. It is the
pair consisting of the XHTML namespace URI and p—that
is (http://www.w3.org/1999/xhtml, p) or in
James Clark’s notation {http://www.w3.org/1999/xhtml}p.

The URI is bound to the local name by using an intermediate
syntactic abstraction. The namespace can be declared as a default
that affects unprefixed element names (but not attribute names) or it
can be bound to a prefix. The crucial point is that the prefix string
itself can be chosen arbitrarily and carries no meaning. Also, the
declarations can appear earlier in the document tree and are scoped.

My aim in the above paragraphs is to convey that the namespace
mechanism is complex enough to be dangerous when left to the casual
programmer and to application code. Instead, the application
programmer should use the URI–local name pair and leave the
management of the namespace declarations and prefixes to a dedicated
piece of code that someone has already debugged. (Of course, it is OK
for the programmer to suggest prefixes to make the output
more readable.)
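For example, with Python’s xml.sax.saxutils.XMLGenerator the application deals only in (URI, local name) pairs and suggested prefix bindings; the generator writes the xmlns declarations itself:

```python
import io
from xml.sax.saxutils import XMLGenerator

XHTML = 'http://www.w3.org/1999/xhtml'

out = io.StringIO()
gen = XMLGenerator(out)
gen.startDocument()
# Suggest binding the default (unprefixed) namespace to the XHTML URI;
# the generator emits the xmlns declaration on the next start tag.
gen.startPrefixMapping(None, XHTML)
gen.startElementNS((XHTML, 'p'), None, {})
gen.characters('Hello')
gen.endElementNS((XHTML, 'p'), None)
gen.endPrefixMapping(None)
gen.endDocument()
result = out.getvalue()
```

The application code never concatenates a prefix onto a local name by hand; the output contains `<p xmlns="http://www.w3.org/1999/xhtml">Hello</p>`.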

For the GNU JAXP framework, gnu.xml.pipeline.NSFilter
is such a piece of code. GenX, on the other hand, does this within
the serializer component itself.

Use unescaped Unicode strings in memory

To keep the abstractions clear, the content strings in memory
should be in the unescaped parsed form. For example, if you have
content that says that two is greater than one, the string in memory
should be “2 > 1”. In particular, it should not be “2
&gt; 1”. “2 > 1” is what you mean. Only when
the string reaches the serializer is it the responsibility of the
serializer to write “2 &gt; 1” in the output.

Passing along a chunk of markup is done either by passing a tree
data structure (e.g. a DOM fragment) or by emitting multiple SAX events
in sequence.

Moreover, the chances for mistakes are minimized when in-memory
strings use the encoding of the built-in Unicode string type of the
programming language if your language (or framework) has one. For
example, in Java you’d use java.lang.String and char[]
and, therefore, UTF-16. Python has the complication that the Unicode
string type can be either UTF-16 (OS X, Jython) or UTF-32 (Debian)
depending on how the interpreter was compiled. With C it makes sense
to choose one UTF and stick to it.

Use UTF-8 (or UTF-16) for output

The XML 1.0 specification requires all XML processors to support
the UTF-8 and UTF-16 encodings. XML processors may support other
encodings, but they are not required to. It follows that using any
encoding other than UTF-8 or UTF-16 is unsafe, because the XML
processor used by the recipient might not support the encoding. If
you use an encoding other than UTF-8 or UTF-16 and communication
fails, it is your fault. Arguments about particular legacy encodings
being common in a particular locale (e.g. Shift_JIS in Japan or
ISO-8859-1 in Western Europe) are totally irrelevant here. (The
xml:lang attribute can be used for CJK disambiguation.
There is no need to use parochial encodings for that.)

From the XML point of view both UTF-8 and UTF-16 are equally
right. If your serializer only supports either one, just go with the
one the serializer already supports.

UTF-8 is more compact than UTF-16 (in terms of bytes) for
characters in the ASCII range. Even if your content does not contain
characters from the ASCII range frequently, the element and attribute
names in well-known vocabularies as well as the XML syntax itself
consist of characters from the ASCII range. UTF-8 data is also easier
to examine for debugging with byte/ASCII-oriented network sniffing
and file examination tools. UTF-16 is more compact than UTF-8 only
when the number of characters from the U+0800–U+FFFF range
exceeds the number of characters from the ASCII range—and
the latter includes markup whenever well-known XML vocabularies are
used.

It might be tempting to try to optimize the size of the document
by choosing the encoding depending on the content or the expected
content. However, doing so opens up more possibilities for bugs. Even
when the serializer offers a choice, it is safer to pick either UTF-8
or UTF-16 and stick to the choice regardless of content or deployment
locale. I am biased in favor of UTF-8.

Use NFC

In Unicode, common accented letters can be expressed in two
different ways: as a single character or as a base character followed
by a combining character. For example, ‘ä’ can be
represented as one character (LATIN SMALL LETTER A WITH DIAERESIS) or
as two characters (LATIN SMALL LETTER A followed by COMBINING
DIAERESIS). The former is known as the precomposed form and the
latter as the decomposed form. There are also presentation forms that
are considered compatibility equivalents of other characters. For
example, LATIN SMALL LIGATURE FI is a presentation form of LATIN
SMALL LETTER F and LATIN SMALL LETTER I.

There are a lot of transitional applications that treat Unicode as
wide ISO-8859-1—as though ISO-8859-1 were wide ASCII. These
applications are able to deal with precomposed accented characters
but not with the canonically equivalent NFD representations. Thus,
NFC is the safer choice if you want to maximize the probability that
your text renders nicely. Using NFC is not a well-formedness
requirement—just a robustness bonus.
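In Python, for example, the standard unicodedata module performs the normalization:

```python
import unicodedata

decomposed = 'a\u0308'   # LATIN SMALL LETTER A + COMBINING DIAERESIS (NFD)
precomposed = unicodedata.normalize('NFC', decomposed)
assert precomposed == '\u00e4'   # LATIN SMALL LETTER A WITH DIAERESIS

# NFC leaves compatibility equivalents such as the fi ligature alone;
# only NFKC would replace them.
assert unicodedata.normalize('NFC', '\ufb01') == '\ufb01'
```

Normalizing once, just before serialization, keeps the rest of the program free to use whatever form its inputs arrive in.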

Don’t expect software to look inside comments

According to
the XML spec, “an XML processor MAY, but need not, make it
possible for an application to retrieve the text of comments”.
Since the receiving application is not guaranteed to see the
comments, comments are not an appropriate place for data that you
want the recipient to process. That a particular DTD does not
allow embedded RDF metadata does not make comments an appropriate
place for metadata.

Don’t rely on external entities on the Web

It follows from the XML spec that external entities are inherently
unsafe for Web documents, because non-validating XML processors are
allowed not to process them and someone may be using a non-validating
XML processor to parse the content you serve on the Web. Therefore,
it makes sense not to rely on external entities. When you are not
relying on them, why have them around at all? Anyone processing them
would just waste time. The straightforward way is to produce
doctypeless XML.

But what about validation? It turns out there is a better
validation formalism than DTDs. It is more interesting to know the
answer to the question “Does this document conform to these
rules?” than to the question “Does this document conform
to the rules it declares itself?” RELAX
NG validation answers the first question. DTD validation
answers the second. RELAX NG allows you to validate a document
against a schema that is more expressive than a DTD without polluting
the document with schema-specific syntax.

Don’t bother with CDATA sections

XML provides two ways of escaping markup-significant characters:
predefined entities and CDATA sections. CDATA sections are only
syntactic sugar. The two alternative syntactic constructs have no
semantic difference.

CDATA sections are convenient when you are editing XML manually
and need to paste a large chunk of text that includes
markup-significant characters (e.g. code samples). However, when
producing XML using a serializer, the serializer takes care of
escaping automatically and trying to micromanage the choice of
escaping method only opens up possibilities for bugs.

Don’t bother with escaping non-ASCII

Since you are using UTF-8 (or UTF-16), the output encoding can
represent the whole of Unicode directly. There is no need to escape
non-ASCII characters in any way. Only <, >, & and (in
attribute values) " need escaping. That’s it. No entities
needed. No numeric character references needed.

If you insist on escaping non-ASCII, please make sure you handle
astral characters correctly.
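The usual astral bug is emitting one numeric character reference per UTF-16 surrogate instead of a single reference for the scalar value. A quick Python illustration:

```python
ch = '\U0001D49C'   # MATHEMATICAL SCRIPT CAPITAL A, an astral character

# Correct: a single reference for the scalar value.
correct = '&#x%X;' % ord(ch)
assert correct == '&#x1D49C;'

# Wrong: one reference per UTF-16 surrogate. Surrogate code points are
# not XML characters, so &#xD835;&#xDC9C; is ill-formed.
high = 0xD800 + ((ord(ch) - 0x10000) >> 10)
low = 0xDC00 + ((ord(ch) - 0x10000) & 0x3FF)
assert (high, low) == (0xD835, 0xDC9C)
```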

Avoid adding pretty-printing white space in character data

XML has a design problem that makes source formatting leak into
parsed content. Instead of reserving e.g. literal tabs and line feeds
exclusively for source formatting so that the parser could always
discard them, XML allows white space to be both significant content
and meaningless pretty-printing. The mess is left for higher layers
to sort out.

To avoid problems, it is prudent never to introduce
pretty-printing white space in character data. Personally, I don’t
pretty-print at all when I produce XML programmatically. The safe way
to pretty-print is to put the white space inside the tags themselves
instead of putting it between them.

That is, if you have <foo>bar</foo>, then instead of
pretty-printing it like this:

<foo>
  bar
</foo>

do this:

<foo
>bar</foo
>

Don’t use text/xml

The XML specification provides a means for XML documents to
declare their own character encoding. This way, the encoding
information travels with the document even in environments that can’t
store or communicate the encoding information externally.

Unfortunately, the XML specification allows external encoding
information to override the internal encoding information.
Considering Ruby’s
Postulate, it would probably be a better idea to count on the
internal information just like you trust a ZIP file itself when it
comes to figuring out which compression method has been used instead
of letting an external HTTP header say which decompression method you
should apply. According to RFC
3023, the text/xml content type never
allows you to use the internal information. Even in the absence of an
explicit charset parameter, the default
is US-ASCII, trumping the XML spec. (Of course, there’s a lot of
software that ignores the RFC, but that’s not a good basis to build
on.)

When the type application/xml is used without the
charset parameter, the XML spec governs the matter of
character encoding. For some vocabularies, there are types of the
form application/*+xml, which also don’t suffer from the
counter-intuitive encoding default of text/xml.

Use XML 1.0

XML 1.1 adds the ability to use some previously forbidden control
characters like the form feed while still forbidding U+0000, so you
still cannot zero-extend random binary data and smuggle it over XML
as text. XML 1.1 also allows you to use Khmer,
Amharic, Ge’ez, Thaana, Cherokee, and Burmese characters in
element and attribute names. Contrary to what XML 1.1
propaganda may lead people to believe, XML 1.0 already allows content
in those languages. Additionally, XML 1.1 changes the definition of
white space to accommodate IBM mainframe text conventions.

Test with astral characters

Unicode was originally supposed to be 16 bits wide. However, the
original 16 bits running up to U+FFFF turned out to be insufficient.
Thus, Unicode was extended to reach up to U+10FFFF. The range of
scalar values is considered to be partitioned into 17 planes with 16
bits worth of code points on each plane. The characters in the range
of the original 16 bits constitute the Basic Multilingual Plane (or
BMP or Plane 0). The range above U+FFFF consists of astral planes and
the characters above U+FFFF are called astral characters.

The original way of simply storing a character as an unsigned
16-bit integer was extended to cover the astral planes using
surrogate pairs, yielding the UTF-16 encoding. A range of values
that falls in the BMP is set aside to be used as surrogates. An astral
character is represented as a surrogate pair: a high surrogate (a
16-bit code unit) followed by a low surrogate (another 16-bit code
unit).

Some programs operating on 16-bit units may not pass surrogate
pairs through intact even though one might think the surrogate pairs
could be smuggled through legacy software as two adjacent
“characters”. Moreover, when UTF-16 data is converted
into UTF-8, the surrogate pair needs to be converted into the scalar
value of the code point which is then converted into a 4-byte UTF-8
byte sequence. Some broken converters may produce a 3-byte sequence
for each surrogate instead. (This kind of broken UTF-8 has been
formalized as CESU-8.)

Because of these issues, it is a good idea to test that astral
characters can travel through your system intact and that the output
produced is proper UTF-8 and not CESU-8.
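A simple Python check of the correct behavior, assuming astral text flows through your system as an ordinary str:

```python
astral = '\U00010000'   # the first astral code point

# Proper UTF-8 encodes the scalar value as one 4-byte sequence.
utf8 = astral.encode('utf-8')
assert utf8 == b'\xf0\x90\x80\x80'

# CESU-8 would instead encode each UTF-16 surrogate separately as a
# 3-byte sequence; this is what a broken converter produces.
cesu8 = b'\xed\xa0\x80\xed\xb0\x80'
assert utf8 != cesu8
```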

Test with forbidden control characters

XML semi-arbitrarily forbids some ASCII control characters and
Unicode values that are reserved to be used as sentinels (e.g. U+0000
and U+FFFF). These characters render the document ill-formed.
Therefore, it is important to make sure they cannot occur in the
output of your system.

It is a good idea to try to introduce these characters into the
system and make sure that they are either caught right upon input or
at least filtered out in the XML serializer.
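A sketch of such a check in Python (the function name is illustrative); the character class is the complement of the XML 1.0 Char production:

```python
import re

# XML 1.0: Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
#                 | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL = re.compile(
    '[^\t\n\r\u0020-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]')

def require_xml_chars(text):
    """Raise instead of silently emitting an ill-formed document."""
    match = _ILLEGAL.search(text)
    if match:
        raise ValueError('U+%04X is not a legal XML 1.0 character'
                         % ord(match.group()))
    return text
```

Catching the bad character at the system boundary gives a far better error message than a well-formedness error at the recipient’s end.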

Test with broken UTF-*

Whichever UTF you use in memory or for input, it is possible to
construct illegal code unit sequences. With UTF-32 the scalar value
may be outside the Unicode range. With UTF-16 there may be unpaired
surrogates. With UTF-8 there may be overlong byte sequences
(that is, sequences that are not the shortest form for a given
character) or sequences whose scalar value falls in the surrogate range.

You should try throwing broken code unit sequences at your system
and make sure that broken input can never silently translate into
broken output. Most importantly, if your input or memory UTF is the
same as the output UTF, you should not merely copy code units into
the output without checking them.

Usually checking is achieved as a side effect by using UTF-8 for
input and output and UTF-16 in memory, so broken data is caught in
the conversion.
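For example, Python’s UTF-8 codec uses strict error handling by default, so a decode step at the input boundary cannot silently pass broken sequences through:

```python
# An overlong two-byte encoding of '/': forbidden in UTF-8 because it
# is not the shortest form.
overlong = b'\xc0\xaf'
try:
    overlong.decode('utf-8')      # strict error handling is the default
    overlong_rejected = False
except UnicodeDecodeError:
    overlong_rejected = True

# Three bytes encoding a lone surrogate (the CESU-8 style) are
# rejected the same way.
try:
    b'\xed\xa0\x80'.decode('utf-8')
    surrogate_rejected = False
except UnicodeDecodeError:
    surrogate_rejected = True
```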