XML : Java Glossary

The primary function of XML (extensible Markup Language) is to consume RAM (Random Access Memory) and data communication bandwidth. Presumably it was promoted to its current frenzy by companies who sell either RAM or bandwidth. Others promoting it have patents they hope to spring on the public once it is entrenched. XML is the biggest con game going in computers. As you probably guessed, I am known for my rabid dislike of XML.

The Basics

XML is a W3C (World Wide Web Consortium) recommendation. Like HTML (Hypertext Markup Language), XML is based on SGML (Standard Generalised Markup Language), an International Standard (ISO (International Organization for Standardization) 8879) for creating markup languages. However, while HTML is a single SGML document type, with a fixed set of element type names (aka tag names), XML is a simplified profile of SGML: you can use it to define many different document types, each of which uses its own element type names (instead of HTML’s html, body, h1, ol, etc.).

In XML, fields that there can be only zero or one of are usually specified as attributes, e.g. unit="box". Fields that there can be many of are enclosed in tags called elements, e.g. <item>…</item>. Just like HTML, comments begin with <!-- and end with -->. You can abbreviate <mytag myattrib="something"></mytag> as <mytag myattrib="something" />.
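As a minimal sketch of these basics, the following Java reads a small document with the JDK's built-in DOM parser. The order document, its tag names and the class name XmlBasicsDemo are all invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XmlBasicsDemo {
    // Hypothetical sample document: 'unit' occurs at most once, so it is an
    // attribute; 'item' may repeat, so each occurrence is an element.
    static final String ORDER =
          "<order unit=\"box\">"
        + "<!-- comments look just like HTML comments -->"
        + "<item>apples</item>"
        + "<item>pears</item>"
        + "<giftwrap colour=\"red\" />"
        + "</order>";

    /** Parses the text and returns the root element. Checked exceptions are wrapped to keep the sketch short. */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Element order = parseRoot(ORDER);
        System.out.println("unit = " + order.getAttribute("unit"));

        NodeList items = order.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            System.out.println("item = " + items.item(i).getTextContent());
        }

        // The abbreviated <giftwrap ... /> form parses to an element with no children.
        Element wrap = (Element) order.getElementsByTagName("giftwrap").item(0);
        System.out.println("giftwrap children = " + wrap.getChildNodes().getLength());
    }
}
```

Note that the attribute values are quoted in the sample; the parser rejects unquoted ones.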

XML was designed to make it easy to write a parser. I think this was an unfortunate decision. Only a handful of people in the world will ever write an XML parser, but hundreds of thousands have to compose XML. They should have designed it to be easy and terse to write. For example, its mandatory quotes around each attribute value are there solely for the convenience of the parser writer. The tag names in the closing </mytag> are redundant and should be optional. They are not needed at all in XML designed solely for machine consumption. Even in human-read XML, they add nothing on the innermost nest on a single line.

Naming

Pretty well any letter is legal in an element or attribute name. You can use upper or lower case, accented letters, digits and a few punctuation characters. _ is good for separating words. You may not use a space. It is considered poor style to use -, . and :. Names cannot start with a digit, with - or . or with the letters xml (in any case). Names are case-sensitive.

Encoding

UTF-8 is the default encoding, but unfortunately the encoding could be any ruddy
encoding ever invented. Using other encodings destroys XML
as an interchange format. Don’t do it!

Schemas

You describe your little XML subgrammar by writing a DTD (Document Type Definition) file. Optionally, you can include the DTD inline inside your XML file. There are other more elaborate schema languages including RELAX NG, Schematron and XSD (XML Schema Definition). I like XSD the best.

Validation

Each schema language has its corresponding technique for checking that an XML file conforms to it. If you use a DTD, here:
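One way to validate against a DTD in Java, offered as a minimal sketch rather than the only approach, is to switch on validation in the JDK's DocumentBuilderFactory and collect whatever the parser reports. The inline DTD, the square element and the class name DtdValidationDemo are assumptions made up for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class DtdValidationDemo {
    // Hypothetical inline DTD: a <square> element with a mandatory width attribute.
    static final String DTD =
          "<!DOCTYPE square [\n"
        + "<!ELEMENT square EMPTY>\n"
        + "<!ATTLIST square width CDATA #REQUIRED>\n"
        + "]>\n";

    /** Returns the list of problems found; an empty list means the document is valid. */
    static List<String> validate(String xml) {
        List<String> problems = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // ask the parser to check the document against its DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) {
                    problems.add("warning: " + e.getMessage());
                }
                @Override public void error(SAXParseException e) {
                    // Validity violations arrive here; parsing continues.
                    problems.add("error: " + e.getMessage());
                }
                @Override public void fatalError(SAXParseException e) throws SAXParseException {
                    throw e; // the document is not even well-formed
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            problems.add("fatal: " + e.getMessage());
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(validate(DTD + "<square width=\"100\"/>")); // valid document
        System.out.println(validate(DTD + "<square/>"));               // width is missing
    }
}
```

Without an ErrorHandler the parser only writes validity errors to the console and carries on, which is why the sketch installs one.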

Parsing

There are two popular parsing techniques: SAX (Simple API for XML), which hands you each field as it parses, and the W3C DOM (Document Object Model), which creates a complete parse tree you can prune and repeatedly scan.
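To make the difference concrete, here is a small sketch that counts the same elements both ways using the parsers that ship with the JDK. The sample document and the class name SaxVersusDomDemo are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxVersusDomDemo {
    // Hypothetical sample document.
    static final String XML = "<list><item>a</item><item>b</item><item>c</item></list>";

    /** SAX: the parser calls back once per start tag and keeps nothing in memory. */
    static int countWithSax(String xml) {
        int[] count = {0};
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes attributes) {
                        if (qName.equals("item")) {
                            count[0]++;
                        }
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException("SAX parse failed", e);
        }
        return count[0];
    }

    /** DOM: the whole tree is built first, then queried as often as you like. */
    static int countWithDom(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("item").getLength();
        } catch (Exception e) {
            throw new IllegalStateException("DOM parse failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("SAX count: " + countWithSax(XML));
        System.out.println("DOM count: " + countWithDom(XML));
    }
}
```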

I personally detest XML. However, it has caught on like a cocaine wave. It must have some redeeming features.

XML Benefits

XML
is the latest fad. Almost every program is learning to import and export data in
XML
format, which makes it a lot easier to glue programs created by different people
together.

It unifies the grammar of thousands of little files so that you don’t
have to learn the syntax quirks of each one.

It is relatively easy to whip up a DTD
to describe an XML
grammar for some little data file. That DTD
is all you need to generate a parser.

The XML
files can be viewed or composed by humans using a text editor.

XML
is about as simple a grammar as you can get.

XML
can work with almost any 8-bit or 16-bit character set.

XML
is good at handling hierarchical data.

You can have Pick OS-like data, with arbitrarily long fields and arbitrarily
repeated fields.

XML
is platform independent. It has no big-little endian problems.

It is possible to parse XML
without writing a DTD. This process presumes the
XML
file is perfectly formed.

XML search engines can take into account the tag context, e.g. Washington inside the tags <state>, <president>, <mountain> or <moviestar>. An XML search engine can show you what tags it found and let you choose the relevant ones.

XML
settles on Unicode character encoding to allow transmitting data in any language,
though it does require clumsy entity encoding/decoding.

A program does not need to understand the entire structure of a file. It can
just pick out the tags of interest. This means new tags can be easily added without
disturbing existing software that uses the file.

XML Drawbacks

XML
is incredibly fluffy and repetitive. It wastes bandwidth in transmission. You
must compress it. Happily, ZIP-style compression works very well
on XML.
Unfortunately, you have to fluff it back up to process it, wasting
RAM
with unprecedented abandon. In practice no one does compress it.
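A rough way to see both halves of that trade-off is to gzip a repetitive document with java.util.zip and then inflate it again. This is only a sketch: the customer records are made up, the class name XmlCompressionDemo is hypothetical, and InputStream.readAllBytes assumes Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class XmlCompressionDemo {
    /** Builds a hypothetical, deliberately repetitive document with the given number of records. */
    static String sampleXml(int rows) {
        StringBuilder sb = new StringBuilder("<customers>");
        for (int i = 0; i < rows; i++) {
            sb.append("<customer><id>").append(i)
              .append("</id><country>Canada</country></customer>");
        }
        return sb.append("</customers>").toString();
    }

    /** Compresses the bytes with gzip, the same deflate algorithm ZIP uses. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Fluffs the data back up to full size so that a parser can read it. */
    static byte[] gunzip(byte[] packed) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] raw = sampleXml(1000).getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(raw);
        System.out.println("raw bytes:    " + raw.length);
        System.out.println("packed bytes: " + packed.length);
        System.out.println("restored:     " + gunzip(packed).length);
    }
}
```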

It takes up huge amounts of RAM
and disk space to store it.

The DOM
parse tree considers every space significant, even spaces between
tags, even spaces for indenting, even trailing spaces on a line, even double spaces
embedded in data.
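A quick sketch shows the effect with the JDK's DOM parser when no DTD is supplied: indenting a document changes the number of child nodes, because each run of whitespace between tags becomes a text node. The class name DomWhitespaceDemo and the sample documents are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DomWhitespaceDemo {
    /** Returns how many child nodes (elements AND text nodes) the root element has. */
    static int rootChildCount(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Same data, different layout.
        System.out.println(rootChildCount("<list><item/></list>"));       // just the element
        System.out.println(rootChildCount("<list>\n  <item/>\n</list>")); // whitespace, element, whitespace
    }
}
```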

There is no mechanism to describe the types of the data. To XML, everything is a string. There is no way to specify a field must be numeric, that it needs two decimal places, that it must represent a date in some range, that it must not have accented letters, that it be restricted to certain punctuation, or be one of a certain set of legal values. There are scores of tack-ons trying to fix this and other shortcomings, turning the simple XML into a tower of Babel.

You can’t use the XML files directly; they need to be parsed first. Perhaps some day there will be pre-parsed, compact, computer-friendly versions of XML. I have heard a rumour that such a beast, called XMLC (extensible Markup Language Compiler), has been proposed.

It uses HTML’s fluffy system of entities such as &nbsp;.

There is a raft of recommendations surrounding XML, such as XPath, XPointer, XSL (extensible Stylesheet Language), CSS (Cascading Style Sheets), XLink and so forth. In the pipeline are XHTML (extensible Hypertext Markup Language), Metadata and Namespaces and a Schema system. XML is fast becoming very complicated because it is not really standalone. You need added extras to make it usable. Competing standards will have to fight it out. The #1 reason XML caught on was its raging-idiot simplicity. Now it has not even that advantage.

XML advocates say “Memory is cheap and bandwidth is cheap, so what the hell, let’s squander it.” However, this is not true with handhelds. Memory consumes battery power, the main limit today of handheld capabilities. Bandwidth consumes radio air time and battery time. We are running out of broadcast frequencies. You can’t manufacture more of them once the channels are filled, just use them more efficiently. Further, the delays caused by bloated XML packets consume precious people time and frustrate the heck out of users completely needlessly.

In an Applet or a handheld device, memory for data and code is at a premium. You normally carefully massage the data offline to be as predigested and as compact as possible, e.g. serialized objects. As well as being fat, XML needs considerable processing before it can be used. This consumes RAM for both data and code, and battery power to do the massaging.

There is no standard way to compress XML. You can use ZIP, which is very CPU (Central Processing Unit) and RAM heavy. You can use WBXML (Wireless Binary extensible Markup Language). The problem is that on receipt, it is fluffed back up to regular XML then parsed, so it has even more parsing overhead than regular XML. There are other compressed formats: ASN.1 (Abstract Syntax Notation 1) and WML (Website Meta Language). In practice most XML gets sent in its outrageously fluffy default form. People think XML files are always tiny little 1K configuration files and so why worry. The point is once a format gets established, it gets used for all sorts of things the originators would never have dreamed of, like 3 gig image files. ASN.1 schemas now can be used to validate XML files. XML files with XML schemas can be automatically converted to ASN.1. ASN.1 files can be decoded 100 times faster than XML. I think it is time to start thinking of using ASN.1 instead of XML for large files, or for when they must be transported over the wire.

There is a sort of mania to convert everything to XML, even things for which it is only marginally well-suited.


XML is an example of conspicuous waste, waste for waste’s sake. I find it morally repugnant. It reminds me of Roman Emperor Caligula who took a bite of a peach, tossed it away, then grabbed a fresh one. The authors went out of their way to create a bloated, ugly syntax.

Using XML
to transmit data is the analog of insisting that all code be passed around as triple
spaced Java source files, with added dummy comments, rather than as binary byte code.
There is no guarantee a source file is even syntactically correct. It is impossible
to create a syntactically incorrect byte code file. Byte code files can be processed
without time-consuming parsing. In byte code, repeating strings are naturally
specified only once. XML, as it stands, suffers from all those analogous
drawbacks and more.

What Should Replace XML?

The characteristics include:

It needs to be a binary format for compactness. Files have to both be
transmitted and stored. Size does matter. Smaller is better. People think in terms
of one page XML
files, but they potentially could be gigabytes long. If XML
becomes an established interchange format we will pay for the slop in
XML
trillions of times over. It is not good enough to say XML
files will always be stored in compressed form. In my experience in practice
XML
files are never compressed. Files should be both compact and quick to process.
XML
as it stands is neither.

It needs to be a binary format to ensure correctness. Human readable formats tempt people to manually compose documents that are almost syntactically correct, e.g. HTML. This is too sloppy for an interchange format. Consider how much better chance you have of getting a working program first time if someone sends you Java byte code rather than Java source that may not even compile.

It needs to be computer-friendly so that a program can rapidly find the data it wants without having to parse for delimiters of various flavours. If people want to examine the file detail for debugging, let them use a binary reader/editor. You could use counted strings rather than delimited strings and use integers to encode the field types so they can be used directly as table indexes. I would not go quite so far as to ask for a serialized tree of nodes, but push for a representation that can rapidly be turned into one.

For giant files, the representation should not have substantially more overhead
than the raw binary. There need to be ways of efficiently expressing repeating
patterns. For example, there is no need for delimiters for fixed length data. There
is no need for individual field identifiers for standard groupings of fields. You
want to push as much as possible of the file format description into the descriptor
file, out of the data file. The descriptor file need be transmitted only once. The
data file will typically be transmitted again and again. There is no need to make
the format simple, just compact and fast to process. All you need is a simple
programmer’s interface to it. Only a handful of programmers
ever need concern themselves with its inner structure.

XML currently only allows for hierarchical trees of data. There are one or two other types of data out there in the world (e.g. tables, relations, references, graphs). A universal interchange format should be a little more flexible. If it is worth doing, it is worth doing right. Obviously the format can’t be expected to handle every conceivable data structure and obsolete every specialised interchange format ever devised. However, XML is talking big about becoming universal and should deliver. It can’t even handle ordinary business data, which is typically relational, not strictly hierarchical.

The other thing it needs in the DTD is some information about the allowed data types. There need to be the usual bounded ints, IEEE (Institute of Electrical & Electronics Engineers) floats, IEEE doubles, 8-bit encoded strings in some reasonably small number of character sets, with maximum and minimum lengths, as well as a variety of business types, such as zip, zip+4, state, country, Canusan phone, international phone, date, time, credit card number, latitude, longitude, etc. When someone is handing you data you need to know how clean it is. You need to know ahead of time the minimum and maximum enforced limits on various field sizes.

Ideally the new binary format, or a variant of it would also handle the
function HTML
does now. This would, in a stroke, give four benefits:

Much more compact transmissions, which means much faster transmissions and
lighter loaded servers.

No more syntax errors. In the process of converting to binary format all
syntax would either have to be manually or automatically corrected. This means
the browser no longer has to deal with both the official standard and also all
the common variant errors that people type. This means pages would always
render properly. As it is, pages render properly only in the browser used by
the author which forgives his particular errors. The binary protocol
effectively blocks human HTML
coding errors from getting out on the net.

Faster rendering since the data would arrive already preparsed. The browser
would know for example how big tables are before it had finished reading the
entire file and so could start rendering the top part of the document
accurately immediately.

Consider the total dollars invested in equipment in the world to transmit
HTML,
including servers, satellite links, fibre optic links, cable
connections… In a stroke, you would double the capacity of that
equipment to deliver HTML, simply by switching to a binary delivery
format.

One possible candidate for the XML
replacement job is the Java serialized object format. It can handle just about any
data structure imaginable. It is platform independent. It has a simple
DTD
— Java source code for the corresponding class. Some claim it is Java-only. Not
so. It is no more difficult for C++ to parse than
any other similar newly concocted protocol. It is not tied to any hardware or
OS (Operating System). It is just that Java has a
head start implementing it. Java can implement it with no extra overhead.

There have been some efforts made to patch up the shortcomings of XML; in fact there are dozens of them. XML is no longer simple. It is a raggedy patchwork quilt. People were sucked in by the initial simplicity, then discovered that it was not really all that useful in its simple form. Schema was added to allow specifying types (but still only permitting strings). Yes, we need a standard interchange format, but XML was only a back of the envelope stab at it. XML was destined to fail since it totally ignored so many factors in coming up with a good design.

One such effort is VTD (Virtual Token Descriptor). A VTD
record is a 64-bit integer that encodes the starting
offset, length, type and nesting depth of a token in an XML
document. Because VTD
records don’t contain data fields, they work alongside of the original
XML
document, which is maintained intact in memory by the processing model.
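To illustrate the idea of squeezing those four values into one 64-bit integer, here is a sketch of packing and unpacking such a record. The field widths (30 bits of offset, 20 of length, 8 of depth, 4 of type) and their order are assumptions chosen for this illustration. They are not claimed to match the real VTD layout, and the class name VtdRecordSketch is hypothetical.

```java
class VtdRecordSketch {
    // Assumed field widths, for illustration only.
    static final int OFFSET_BITS = 30;
    static final int LENGTH_BITS = 20;
    static final int DEPTH_BITS = 8;
    static final int TYPE_BITS = 4;

    // Each field sits directly above the previous one.
    static final int LENGTH_SHIFT = OFFSET_BITS;
    static final int DEPTH_SHIFT = LENGTH_SHIFT + LENGTH_BITS;
    static final int TYPE_SHIFT = DEPTH_SHIFT + DEPTH_BITS;

    /** A mask with the lowest 'bits' bits set. */
    static long mask(int bits) {
        return (1L << bits) - 1;
    }

    /** Packs the four values into one long; values wider than their field are truncated. */
    static long pack(int offset, int length, int depth, int type) {
        return (offset & mask(OFFSET_BITS))
             | ((length & mask(LENGTH_BITS)) << LENGTH_SHIFT)
             | ((depth & mask(DEPTH_BITS)) << DEPTH_SHIFT)
             | ((type & mask(TYPE_BITS)) << TYPE_SHIFT);
    }

    static int offset(long record) { return (int) (record & mask(OFFSET_BITS)); }
    static int length(long record) { return (int) ((record >>> LENGTH_SHIFT) & mask(LENGTH_BITS)); }
    static int depth(long record)  { return (int) ((record >>> DEPTH_SHIFT) & mask(DEPTH_BITS)); }
    static int type(long record)   { return (int) ((record >>> TYPE_SHIFT) & mask(TYPE_BITS)); }

    public static void main(String[] args) {
        String xml = "<a><b>hello</b></a>";
        // Describe the token "hello": it starts at index 6, is 5 chars long, nesting depth 2.
        // The type code 5 is an arbitrary, made-up value standing for "character data".
        long record = pack(6, 5, 2, 5);
        // The record holds no text of its own; the text is sliced out of the original document.
        String token = xml.substring(offset(record), offset(record) + length(record));
        System.out.println(token + " at depth " + depth(record));
    }
}
```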

Due to the stupidity, duplicity and/or greed of those promoting
XML, we will
likely be stuck with some committee-patched variant of it forever — something
that will make even HTML
look clean. We need a common data interchange format, but not so inept.

DTD

You need to compose a DTD file that describes the format of the XML file. The <!ELEMENT statement is used to list the various tags you will use and which tags may be used inside which tags and how often and in which order. The <!ATTLIST statement is used to list the various attributes (mandatory and optional) of each tag. The <!ENTITY statement lets you make up your own abbreviations.

Here is a simple example:

DTD:

<!ELEMENT square EMPTY>
<!ATTLIST square width CDATA "0">

The CDATA means the value of the field is a string. The "0" is the default value used when the attribute is omitted.

XML:

<square width="100"></square>

Schema

A schema is a document that describes what constitutes a legitimate XML document. It might be very generic, describing all XML documents, or some particular class of XML documents, say ones describing an invoice for the XYZ company. The original XML schema language was the DTD, borrowed from SGML. It was clumsy and did not allow very tight specification. It basically just let you specify the names of the tags and attributes. Since then there have been several other flavours of schema: RELAX NG, Schematron and a new one from W3C called XML Schema. DTDs look nothing like XML itself. XML Schema is itself a flavour of XML. XML Schema is a major advance over DTD. It is described in three documents: Primer, Structures and Datatypes. It can define datatypes, ranges, enumerations, dates and complex datatypes to much more rigidly specify what constitutes a valid XML file.

In English, entity means a thing with a distinct independent existence. It is as meaningful as thingamajig. Had it been my choice, I would have called them stand-ins, locums or deputies.

Handling Awkward Characters, XML Entities

XML has a similar problem to HTML with reserved characters. What if < incidentally appears in your data? It would look like the beginning of some </end> tag. There is only one truly awkward character, namely <, and you deal with it the same way you do in HTML, by encoding it as an entity reference, namely &lt;. (They are not called entities in XML since that term is already taken to mean a group of data.)

HTML
has scores of entities whereas XML
has only five:

< ( &lt; ),
& ( &amp; ),
> ( &gt; ),
" ( &quot; ),
' ( &apos; ).

All of the entity references are optional except for &lt; and &amp;.

But what about awkward non-ASCII characters such as é and Ω and ⇔? There are five ways around the restriction that XML does not support the full set of HTML character entity references.

If you use UTF-8 encoding, you can use any
Unicode characters plain without entification.

If you use an 8-bit encoding such as ISO-8859-1, you can stick to just 256 characters defined in that
encoding.

You could use decimal NCEs (Numeric Character Entities), e.g. &#8364; for the euro sign €. Values of numeric character references are interpreted as Unicode characters, no matter what encoding you use for your document. To be perverse, you could use decimal numeric entity references in place of the basic entity references, i.e. < ( &#60; ), & ( &#38; ), > ( &#62; ), " ( &#34; ), ' ( &#39; ).

You could write a DTD to create the additional alphabetic character entity references you need, e.g. &euro;.

If you take a depraved pleasure in deformity, you could use the CDATA sandwich. Place pretty well whatever data you want, including raw (un-entified) <, > and &, within a bizarre sandwich of characters, namely: <![CDATA[ … ]]>

e.g. <caption><![CDATA[Rah! <><><> Rah! & all that.]]></caption>

Handling awkward characters is a concern if:

You compose XML by hand with a text editor.

You are developing code and read XML files directly.

You write code to generate XML directly without using any sort of XML package.

Otherwise, the XML package will transparently handle awkward characters for you both on writing and reading, so you can forget about them.
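If you do generate XML yourself without a package, the escaping has to be done by hand. A minimal sketch, assuming you are happy to encode all five reserved characters every time (only &lt; and &amp; are strictly required); the class name XmlEscapeDemo and the method name escape are invented for illustration:

```java
class XmlEscapeDemo {
    /** Replaces the five reserved characters with their entity references. */
    static String escape(String raw) {
        StringBuilder sb = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '<':  sb.append("&lt;");   break;
                case '&':  sb.append("&amp;");  break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);        // everything else passes through untouched
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("<caption>" + escape("Rah! <><><> Rah! & all that.") + "</caption>");
    }
}
```

Because each character is examined exactly once, the & in the generated &lt; is never escaped a second time.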

UTF-8 files using the basic five character-entity encodings, or ISO-8859-1 with the basic five character entities (possibly excluding &apos;) plus decimal NCEs, will create the files easiest to read and compose manually, XML’s saving grace.

Nearly all XML
documents now use UTF-8 encoding, so the usual way to
handle awkward characters is to code them with a UTF-8-aware text editor as ordinary
characters. That leaves you with only < > " and & to worry about.

Quoting

You must enclose attribute values in either " or '. If the attribute value itself contains "s, you can enclose it in '. If the attribute value itself contains 's, you can enclose it in ". What do you do if a string contains both " and '? You use the entity reference &quot; for each embedded " and surround the string in "s, e.g.:
<album title="Sergeant Pepper's Lonely Hearts Club Band">
<album title='The Wall'>
<album title="Peter's &quot;Weird Songs&quot;">

Writing

There are a number of ways of writing XML.

For simple files just use println.

Use the DOM method to build a tree in memory then transform it into a text stream with javax.xml.parsers.DocumentBuilder, javax.xml.parsers.DocumentBuilderFactory, javax.xml.transform.Transformer, javax.xml.transform.TransformerFactory and javax.xml.transform.dom.DOMSource.

Use XML serialization. You won’t have to write much code, but you won’t have any control over precisely what the stream looks like.

Use SAX to generate a stream of events, then turn them into a text stream with org.apache.xml.serialize and org.xml.sax.
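The DOM route can be sketched as follows, using only the JDK classes named in the list above plus StreamResult and OutputKeys. The album element and the class name DomWriterDemo are invented for the example, and readTitle is a hypothetical helper that parses the output again to show that the awkward characters survive the round trip.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DomWriterDemo {
    /** Builds a one-element tree in memory and serializes it to text. */
    static String albumXml(String title) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element album = doc.createElement("album");
            album.setAttribute("title", title); // raw text; the serializer does the escaping
            doc.appendChild(album);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not write XML", e);
        }
    }

    /** Hypothetical helper: parses the text back and returns the title attribute. */
    static String readTitle(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getAttribute("title");
        } catch (Exception e) {
            throw new IllegalStateException("could not read XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = albumXml("Peter's \"Weird Songs\"");
        System.out.println(xml);
        System.out.println(readTitle(xml));
    }
}
```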

XML Serialization

There is another form of serialization that produces XML instead of binary ObjectOutputStreams. It uses the java.beans.XMLEncoder class. It does not use the Serializable interface, but writes ordinary Objects that have JavaBean-style getter and setter methods and a no-arg constructor. It does not persist fields, but rather properties (in the Delphi sense, not System.setProperty), implemented with get/set. Basically it looks for all the getXXX methods and calls them and emits a stream of tags named after the properties. To reconstitute, XMLDecoder instantiates an Object of the class and calls the corresponding setXXX methods from the values in the XML stream. The source and target classes need not have matching code the way they do with true serialization. Most trouble using this feature comes from thinking it behaves like ordinary serialization. They have almost nothing in common.
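A minimal round trip with java.beans.XMLEncoder and XMLDecoder might look like the sketch below. The Album bean and the class name XmlEncoderDemo are invented for illustration; the bean is public with a public no-arg constructor because the encoder works through reflection.

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class XmlEncoderDemo {
    /** Hypothetical JavaBean: no Serializable, just get/set pairs and a no-arg constructor. */
    public static class Album {
        private String title;
        private int year;

        public Album() { }

        public String getTitle() { return title; }
        public void setTitle(String title) { this.title = title; }
        public int getYear() { return year; }
        public void setYear(int year) { this.year = year; }
    }

    /** Writes the bean's properties as XML. */
    static byte[] encode(Object bean) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (XMLEncoder encoder = new XMLEncoder(out)) {
            encoder.writeObject(bean);
        }
        return out.toByteArray();
    }

    /** Rebuilds the bean by instantiating it and calling its setters. */
    static Object decode(byte[] xml) {
        try (XMLDecoder decoder = new XMLDecoder(new ByteArrayInputStream(xml))) {
            return decoder.readObject();
        }
    }

    public static void main(String[] args) {
        Album album = new Album();
        album.setTitle("The Wall");
        album.setYear(1979);

        byte[] xml = encode(album);
        System.out.println(new String(xml, StandardCharsets.UTF_8));

        Album copy = (Album) decode(xml);
        System.out.println(copy.getTitle() + " (" + copy.getYear() + ")");
    }
}
```

One surprise worth knowing about: the encoder writes only the properties whose values differ from those of a freshly constructed instance.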

Tools

There are all kinds of tools for reading and writing
XML. I am
familiar with only a few of them. Please help me fill out this table.

XML Tool Comparison

Tool: Manual

Advantages:
A hand-written parser will run quickly.
Writing XML by hand is conceptually simpler and faster than doing it with a tool.
Writing XML by hand gives you complete control over layout, headers, encoding etc.

Disadvantages:
Not feasible for any but the simplest files.
Hard to maintain.

Tool: DOM

Advantages:
You can navigate the tree in any way you please in any order.

Disadvantages:
Will not work for large files since the whole tree must reside in RAM.
Slow parsing.

Tool: SAX/StAX

Advantages:
Fast parsing.
You can represent the data with a different structure from the XML structure of the file.