XML was designed to be “SGML for the Web”.
It was meant for the same
sorts of narrative documents SGML and HTML had been used for previously:
articles, books, short stories, poems, technical manuals, web pages, and
so forth. Much to its inventors’ surprise, it achieved its first great
successes not in the publishing and writing arenas it was intended for,
but rather in the much more prosaic world of data formats. XML was
enthusiastically adopted by programmers who needed a robust, extensible,
standard format for data. For the most part, this was not narrative data
like stories and articles, but record oriented data such as that found
in databases. Uses included object serialization, financial records,
vector graphics, remote procedure calls, and similar tasks. This chapter
explores some of the flaws in traditional formats for such data and
elucidates the features of XML that make it surprisingly well-suited for
such tasks.

Motivating XML

If you’re reading this book you’re a developer. (At
least I hope you are. Otherwise a lot of what I say isn’t going
to make any sense :-) ) Doubtless over the course of your career
you’ve written
numerous programs that read and write files. And every time you wrote a new
program you had to invent or learn a new file format.
File formats I’ve personally had to deal with over the years include
RTF, Word .doc files, tab delimited text, FITS,
PDF, PostScript, and many more. You’ve probably encountered
a few of these yourself. Doubtless, you’ve also seen
many other formats.

If you’re like me you’ve learned to dread encountering a new file
format. If it’s documented at all, the documentation is likely
incomplete or worse yet misleading. Important details like
byte order and line ending conventions are often left unspecified.
Different tools that all claim
to read and write the same format actually produce subtly different
variants that are often incompatible in practice.
When you think you’ve finally wrestled the last bug out of
your code, you discover a file
written by somebody else’s software that you can’t read;
and you realize you’ve made one too many assumptions
about the format, so you have to go back to the
drawing board.

Consequently, when designing new file formats,
developers have tended to gravitate toward the simplest formats
they can imagine, often tab delimited text or comma separated values.
Nonetheless, even these plain, undecorated formats
often present unexpected problems.
For example, should two tabs in a row be interpreted as the empty string,
null, or the same as one tab? In fact, all three variations are used in practice.
Java’s StringTokenizer class takes the last
interpretation: two consecutive tabs are treated the same as one tab.
This is the least common approach in actual data files,
a fact that has surprised many Java programmers and led to not a few
bugs in Java programs.[1]
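The difference is easy to demonstrate. This sketch contrasts StringTokenizer with String.split on a record containing an empty middle field; the record content is invented for illustration:

```java
import java.util.StringTokenizer;

public class TabDemo {
    public static void main(String[] args) {
        // An invented record with an empty middle field: two tabs in a row.
        String line = "Birdsong Clock\t\t21.95";

        // StringTokenizer collapses runs of delimiters, so the empty
        // field silently vanishes: only two tokens come back.
        StringTokenizer tokenizer = new StringTokenizer(line, "\t");
        System.out.println(tokenizer.countTokens()); // prints 2

        // String.split with a negative limit keeps the empty field,
        // yielding three fields: "Birdsong Clock", "", "21.95".
        System.out.println(line.split("\t", -1).length); // prints 3
    }
}
```

If your files use an empty field between two tabs, StringTokenizer will silently shift every subsequent field one position to the left.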

A Thought Experiment

With all that in mind, let’s do a thought experiment.
Imagine you’ve been tasked with writing a server side program that
accepts orders over the Internet for an e-commerce site.
The web server must send each completed order
to the internal system, one order at a time.
You’re responsible for writing the code on the server that sends the order
to the internal system and for writing the code on the internal system
that receives and processes the order. The only connection between the two systems
is a TCP/IP network; that is, you don’t have some sort of higher level API
like JDBC that lets you move data between the two systems.
You need to invent a data format you can generate on one end and
parse on the other end that’s flexible
enough to contain all the information in a typical order.
This includes the customer name, the product ordered, its price,
the manufacturer’s stock keeping unit (SKU) number,
the address to ship to, the tax, and the shipping and handling charges.
One possibility is to
place each piece of information on a separate line
as shown in Example 1.1:
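A sketch of what such a line-delimited order might look like, one field per line (customer, shipping address, product name, SKU, quantity, unit price, tax, shipping charge, total); all names and numbers are invented for illustration:

```
Chez Fred
135 Airline Highway, Narragansett, RI 02882
Birdsong Clock
244
12
21.95
18.44
8.95
290.79
```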

Would you rather write the code
to send and receive orders that are formatted as nice, simple
linefeed delimited files as shown in Example 1.1
or as complex, marked up XML documents such as Example 1.2?
Both documents contain the same information. Most uninitiated
developers prefer the first, simpler form. After all, each piece of
information is presented on a line by itself with no extraneous markup
characters getting in the way. It’s my goal to convince you that,
contrary to most developers’ first intuition, the second form is
more robust, more extensible, and much easier to work with.
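For concreteness, here is one way the same order might be marked up in XML; the element names, attributes, and values are invented for illustration:

```xml
<Order>
  <Customer>Chez Fred</Customer>
  <Product>
    <Name>Birdsong Clock</Name>
    <SKU>244</SKU>
    <Quantity>12</Quantity>
    <Price currency="USD">21.95</Price>
  </Product>
  <ShipTo>
    <Street>135 Airline Highway</Street>
    <City>Narragansett</City>
    <State>RI</State>
    <Zip>02882</Zip>
  </ShipTo>
  <Subtotal currency="USD">263.40</Subtotal>
  <Tax currency="USD">18.44</Tax>
  <Shipping currency="USD">8.95</Shipping>
  <Total currency="USD">290.79</Total>
</Order>
```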

Robustness

Let’s consider robustness first. Suppose your program receives the
order in Example 1.3:

Example 1.3. A document indicating an
order for 12 Birdsong Clocks, SKU 244?
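A sketch, with invented values, of such a mixed-up flat file; the quantity and SKU lines, and the shipping and total lines, have quietly traded places:

```
Chez Fred
135 Airline Highway, Narragansett, RI 02882
Birdsong Clock
12
244
21.95
18.44
290.79
8.95
```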

Looks the same as Example 1.1, doesn’t it?
However, if you compare it very
carefully with Example 1.1, you may notice that the 12 and the 244 have
changed places. What used to be an order for 12 bird clocks may now be
an order for 244 whoopee cushions. Maybe somebody will notice the
problem before the order is shipped and maybe they won’t. Worse yet, the
shipping charge and the total price got flipped around. This entire order
now costs eight dollars and ninety-five cents.
Again, maybe someone will notice the problem before it’s too late and maybe
not. These sorts of problems aren’t theoretical. More than one
e-commerce site has lost both revenue and customer goodwill by
mispricing items.

In the XML version, this simply would not be an issue because each datum
is marked up with what it means. You can freely reorder the quantity and
the SKU or the shipping cost and the total price without any confusion
about which is which. Example 1.4 demonstrates.
What can be devastating mistakes in a traditional
system are harmless in XML.
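A sketch of a reordered XML version (element names invented for illustration): the quantity now precedes the SKU and the total precedes the shipping charge, yet every value remains unambiguous because its tag travels with it:

```xml
<Order>
  <Customer>Chez Fred</Customer>
  <Product>
    <Name>Birdsong Clock</Name>
    <Quantity>12</Quantity>
    <SKU>244</SKU>
    <Price currency="USD">21.95</Price>
  </Product>
  <Total currency="USD">290.79</Total>
  <Shipping currency="USD">8.95</Shipping>
</Order>
```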

Some readers will be objecting at this point that you would never let a
mistake like that through your system. After all, you check every value
for sensibility. You look up the SKU in the company database to make
sure it matches the product name and price before completing an order.
You check every return value from a method call to see if it’s null and
you catch every exception. You write extensive tests to verify that
each method is doing what you think it’s doing. You use a source
code control system so you can always back out changes, and you never
check code in until it’s passed all the regression tests.
Every line of code is scrupulously documented.
In fact, you write more documentation than actual code.
And you’ve never, ever missed church on Sunday. In this case your name
is Donald Knuth. The rest of us need a little more help making
sure we don’t do something stupid.

Even if you are that conscientious, are you really willing to gamble
on everyone else who sends or receives data from you being equally
anal retentive? Wouldn’t it make more sense to use the most robust format possible
so that when the inevitable errors do creep in, they’ll do less damage?

Of course, XML has a lot to offer the anal developer as well. When
defining constraints such as “Every order must have a shipping address”,
“the currency must be one of the three-letter codes USD, CAD, or GBP”, or
“the total cost must be the sum of the unit price times the number of
items, the tax, and the shipping”, it’s easiest to use a declarative
language that specifies what the constraints are without elaborating the
actual code to check these constraints. When your data is XML, you can
use a declarative schema language to define and test such constraints.
Indeed, you have a choice of several schema languages. The simplest and
most broadly supported, the classic document type definition (DTD),
allows you to verify that all required elements are present in the
required order with any necessary attributes. The W3C XML schema
language goes further and lets you constrain the contents of particular
elements and attributes so that you can guarantee that the total price
is a decimal number greater than 1.00. Schematron, the most powerful
schema language of all, allows you to state multi-element constraints
such as “the actual price must be less than or equal to the suggested
retail price”. I’ll discuss all of these languages in more detail later
in this chapter and the rest of the book. For now what you need to know
is that you can list all the constraints on a document in a simple
fashion and check those constraints without writing a lot of extra code
to do so. You feed your documents through a validator before you act on
them. Validation becomes a separate, modular, and more maintainable part
of the process. You can
even change constraints or add new ones without recompiling your code.
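As a taste of what this looks like, here is a minimal DTD for a hypothetical order vocabulary (the element and attribute names are invented for illustration). It requires every order to contain a customer, at least one product, a shipping address, and a total, in that order:

```
<!ELEMENT Order    (Customer, Product+, ShipTo, Total)>
<!ELEMENT Customer (#PCDATA)>
<!ELEMENT Product  (Name, SKU, Quantity, Price)>
<!ELEMENT Name     (#PCDATA)>
<!ELEMENT SKU      (#PCDATA)>
<!ELEMENT Quantity (#PCDATA)>
<!ELEMENT Price    (#PCDATA)>
<!ATTLIST Price    currency (USD | CAD | GBP) #REQUIRED>
<!ELEMENT ShipTo   (Street, City, State, Zip)>
<!ELEMENT Street   (#PCDATA)>
<!ELEMENT City     (#PCDATA)>
<!ELEMENT State    (#PCDATA)>
<!ELEMENT Zip      (#PCDATA)>
<!ELEMENT Total    (#PCDATA)>
<!ATTLIST Total    currency (USD | CAD | GBP) #REQUIRED>
```

A validating parser rejects any document that omits a required element, scrambles the order, or supplies an unlisted currency code, before your own code ever sees it.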

Extensibility

Robustness isn’t the only advantage of the XML approach. The XML
solution is also far more extensible. For example, suppose you suddenly
discover a need to add a discount percentage to some products. The
change to the XML is straightforward. Just add an extra element:
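For instance, assuming a product element along these lines (names invented for illustration), the new element slots in without disturbing anything around it:

```xml
<Product>
  <Name>Birdsong Clock</Name>
  <SKU>244</SKU>
  <Quantity>12</Quantity>
  <Price currency="USD">21.95</Price>
  <Discount>10</Discount>
</Product>
```

Products without a discount simply omit the Discount element; nothing shifts position and nothing needs a placeholder.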

The change to the plain text file (or the equivalent binary
file) is much less obvious. You can certainly add an extra line of
data. However, then everything that follows it will be out of order. You
could put the new information at the end of the document,
but then it isn’t close to the item it
logically belongs with. And suppose not all orders have discounts. Will
there be blank lines for products that don’t have discounts? How will
your program know that an empty string should become a zero discount
rather than a NaN or an exception? This is
not an insurmountable problem, but the simple solution is becoming more
complex.

Now suppose someone wants to add a gift message field whose value can
contain line breaks. Now the data can contain the delimiter
character! You can probably escape the line breaks as \n or some such,
and then escape the backslash character as \\, but your nice simple
solution is becoming quite a bit more complex.
However, once again this is not a problem for XML as this solution
demonstrates:
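A sketch with an invented message: inside an element, line breaks are ordinary character data, so no escaping scheme is needed:

```xml
<GiftMessage>Happy birthday, Fred!
Many happy returns from all of us
at the Narragansett office.</GiftMessage>
```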

Throughout this example, I’ve assumed that each order is for exactly one
product. That’s probably not true. Some customers will order multiple
products at a time. Thus each order will contain between one and an
indefinite number of products. Different products may even be going to different
addresses. Do you break each individual item into a separate order
document and repeat the customer information? If so how do you calculate
the total shipping and total cost? Or do you allow multiple products in
a single order? If so how do you tell where one product ends and the
next begins? Again, none of these
problems are unsolvable, but the simple solution proves more and
more complex as the needs grow. The XML approach, by contrast, scales
very well to expanded functionality in a very obvious way.
Example 1.5
is an XML document that accomplishes all of the above. The boundaries
between the individual parts are obvious.

Example 1.5. An XML document indicating an
order for multiple products shipped to multiple addresses
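A sketch of such a document (all names and values invented for illustration), grouping each product with its own shipping address so the boundaries between the parts are unmistakable:

```xml
<Order>
  <Customer>Chez Fred</Customer>
  <Shipment>
    <ShipTo>
      <Street>135 Airline Highway</Street>
      <City>Narragansett</City>
      <State>RI</State>
      <Zip>02882</Zip>
    </ShipTo>
    <Product>
      <Name>Birdsong Clock</Name>
      <SKU>244</SKU>
      <Quantity>12</Quantity>
      <Price currency="USD">21.95</Price>
    </Product>
  </Shipment>
  <Shipment>
    <ShipTo>
      <Street>271 Old Homestead Way</Street>
      <City>Woonsocket</City>
      <State>RI</State>
      <Zip>02895</Zip>
    </ShipTo>
    <Product>
      <Name>Whoopee Cushion</Name>
      <SKU>133</SKU>
      <Quantity>1</Quantity>
      <Price currency="USD">2.95</Price>
    </Product>
  </Shipment>
  <Subtotal currency="USD">266.35</Subtotal>
  <Tax currency="USD">18.64</Tax>
  <Shipping currency="USD">11.90</Shipping>
  <Total currency="USD">296.89</Total>
</Order>
```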

This example still isn’t really complete. Many pieces are missing
including the credit card information, billing address, and more. Real
world examples are larger and more complex than can comfortably fit in a
book. Adding these other parts would only stretch the flat format
further and make the advantages of XML still more obvious. The more
complex your data is, the more important it is to use a hierarchical
format like XML rather than a flat format like tab or line-delimited
text.

Ease of Use

Now here’s the real kicker: not only is the XML document far more robust.
Not only is it much more extensible in the face of both expected and
unexpected changes. Not only does it more easily adapt to more complex
structures. It is also easier for your programs to read! Writing a program
to accept orders written in XML will be many times easier than writing a
program to accept orders delivered in simple line delimited files. “How can
that be?” you may be asking. After all, the program reading the XML
document has to hunt for less than signs and quotation marks rather than
just picking each piece of data off of a line. It has to make sure not to
confuse any less than signs and quotation marks that may appear in the data
itself with those in the markup. It has to deal with data that may extend
across multiple lines. And in fact, there are many more possibilities
not evident in this simple example that a real program has to handle.

Fortunately none of this matters to you as a developer because you don’t
have to do any of it. Instead of writing the code to process XML
documents directly, you let an XML parser do the hard work for you. A
parser is a software library that knows how to read XML documents and
handle all the markup it finds. The parser takes responsibility for
checking documents for well-formedness and validity. Your own code
reads the XML document only through the parser’s API. At this level, you
can simply ask the parser to tell you what it saw in any particular
element. Or you can ask the parser to tell you everything it sees as
soon as it sees it. In either case, the parser just gives you the data
after resolving all the markup. For instance, if you want to ask the
parser what the total price was, it can tell you 290.79 and that this
price has the currency USD. You don’t have to concern yourself
with stripping off the markup around the information you want. Nor
do you necessarily have to take the information in the order it appears
in the input document. If you want the total price before the customer
name, you can have it. If you just want to look at the price and ignore
the rest of the order completely, you can do that too. You take the
information in the form that’s convenient to you without worrying
excessively about low level serialization details.
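As a concrete sketch, here is how a Java program might pull the total price out of an order using the standard DOM API; the element and attribute names are assumptions carried over from this chapter’s hypothetical order format:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class TotalPriceReader {
    public static void main(String[] args) throws Exception {
        // A pared-down order; the element names are invented for illustration.
        String xml = "<Order><Customer>Chez Fred</Customer>"
                   + "<Total currency='USD'>290.79</Total></Order>";

        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        // Ask the parser for the Total element directly; no scanning for
        // angle brackets, no worrying about line endings or encodings.
        Element total = (Element) doc.getElementsByTagName("Total").item(0);
        System.out.println(total.getTextContent()
            + " " + total.getAttribute("currency"));
        // prints "290.79 USD"
    }
}
```

Note that the code never looks at the customer name at all, and it would be unaffected if the Total element moved to the front of the document.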

Note

One of the original ten goals for XML was that
“It shall be easy to write programs which process XML documents.”
Originally, this was interpreted as meaning that a
“Desperate Perl Hacker” could write an XML
parser in a weekend.
Later it became clear that XML was simply too complex, even in its
simplest form, for this goal to be met.
However, the understanding of this requirement changed to
mean that a typical programmer could use any of a number
of free tools and libraries to process XML.
Given this interpretation, the goal has most certainly been met.

The parser shields you from a lot of irrelevant details that you don’t
really care about. These include:

How text is encoded: in Unicode, ASCII, Latin-1, SJIS, or
something else

Whether carriage returns, line feeds, or both
separate lines

How reserved characters such as < are
escaped when used in the plain text parts of the document

Whether the byte order is big-endian or little-endian

None of these issues actually matter. None of them have any effect on
what the data means or what the format allows you to say. However, when
designing a data format, you must answer all these questions. As soon
as you’ve said, “The underlying format of the data is XML”, every one of
these questions is answered. Some are answered by simply choosing one
possible solution. (The less than sign is escaped as
&lt;.) Others are answered by allowing all
possibilities and letting the parser sort things out (line endings). In
all cases, the design problem is greatly simplified by picking XML as
the underlying format.

[1]
This interpretation makes sense once you realize that
java.util.StringTokenizer is designed for parsing Java source code,
not for reading tab delimited data files. Nonetheless, many programmers
do use it for reading tab delimited data.