Friday, November 18, 2011

Who needs XML?

With the latest beta release, Candle introduces several new features to
its markup format, including a new object notation and a new, clean
namespace syntax. With this release, I believe Candle Markup to be one
of the best general data-exchange formats.

Before I talk specifically about Candle's Markup format, let's look at
the existing general data-exchange formats. The best-known formats are
XML and JSON. We've seen many heated debates over which one is better.
I'm not going to restart the holy war here. Instead, I'll try to get you
out of the tit-for-tat comparison of JSON against XML and let you look
at the problem from a more fundamental perspective. You can ask yourself
a few questions: Why do we need a general data-exchange format? What
purposes should it serve? What characteristics should it have? Once
you've got your answers, you can read on to see if yours agree with
mine.

I think most people would agree that the need for a general
data-exchange format arises because of the Internet. It connects people
from different cultures, computers running different OSs, and
applications written in different programming languages. They need a
general data-exchange format to facilitate communication. I don't need
to belabor the importance of such a format.

A general format simplifies application implementation, but more
importantly, it helps application users, who no longer have to remember
many different abstract syntaxes. Remember how UNIX newbies sighed about
having to learn all the different command-line syntaxes? Today, most
applications and programming languages have adopted XML as the format of
their configuration files. But why hasn't XML taken over the entire
world? Because there's a pitfall here: a general-purpose format suffers
from the problem of being "good for everything, great at nothing". A
carefully designed domain-specific format can be more convenient and
user-friendly in its own domain. For example, RDF data expressed in
Turtle or N3 notation is much terser and more readable than the
corresponding RDF/XML. So while there are still valid cases for
domain-specific data formats, we want a good general data format to
eliminate as many of them as possible.

What characteristics should it have, then? Well, it must enable data
exchange, of course. And I think there are three aspects we need to look
at: 1) the syntax; 2) the data types; 3) the data structure. XML
addresses the first and third aspects but does not touch on the second.
Syntax-wise, it is straightforward (if we ignore DTD declarations).
Structure-wise, XML's hierarchical structure has advantages over flatter
text formats like Windows INI files, Java properties files and HTML
name-value form data. And its built-in support for mixed content makes
it good for complex textual documents.
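To make "mixed content" concrete, here is a minimal sketch in Python
using the standard xml.etree.ElementTree module. The sample sentence and
the helper name flatten are my own inventions for illustration, and the
helper only walks one level of children. It shows text runs and a child
element interleaved inside one parent, which flat name-value formats
cannot express.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: text and a child element interleaved in one parent.
MIXED = "<p>Candle is a <b>markup</b> format.</p>"

def flatten(element):
    """Return the direct content of an element in document order."""
    parts = []
    if element.text:
        # Text that appears before the first child element.
        parts.append(("text", element.text))
    for child in element:
        # The child element itself, with all the text it contains.
        parts.append(("element", child.tag, "".join(child.itertext())))
        if child.tail:
            # Text that follows the child, still inside the parent.
            parts.append(("text", child.tail))
    return parts

if __name__ == "__main__":
    for part in flatten(ET.fromstring(MIXED)):
        print(part)
```

Running it lists a text run, then the b element, then another text run,
all in their original order.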

However, XML's silence on data types is its major weakness. The
inventors of XML may have wanted to make it very extensible, and thus
purposely left it to applications to define their own data types and the
detailed syntactic encoding of literal values. But when XML doesn't
touch on the data types and data model behind the format, they become
the burden of the applications. This makes XML less than ideal for
structured object data exchange. And we saw the rise of JSON to claim
this area.
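A small Python sketch illustrates the difference, using only the
standard json and xml.etree.ElementTree modules. The record and the
function names json_types and xml_types are made up for this example.
The same data arrives typed from JSON, while a plain XML parse with no
schema hands every value back as a string and leaves the conversion to
the application.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record, invented for this illustration.
JSON_TEXT = '{"count": 3, "active": true, "price": 9.5}'
XML_TEXT = (
    "<item><count>3</count><active>true</active>"
    "<price>9.5</price></item>"
)

def json_types(text):
    """Map each JSON member name to the Python type of its parsed value."""
    return {key: type(value).__name__
            for key, value in json.loads(text).items()}

def xml_types(text):
    """Map each child element name to the Python type of its text."""
    return {child.tag: type(child.text).__name__
            for child in ET.fromstring(text)}

if __name__ == "__main__":
    print(json_types(JSON_TEXT))  # number and boolean types survive
    print(xml_types(XML_TEXT))    # every value is a plain string
```

With XML, the application has to know on its own that count is an
integer and active is a boolean, or bring in a schema language to say
so.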

Yet there are XML gurus who still insist that XML should be just a
common syntactic format, and who resist the idea of common data types
and a common data model. It might look more extensible and versatile
that way, but that extensibility is just an illusion if people no longer
use XML. Just look at the Commonwealth. If an alliance brings only a
common Queen and a common language (English), it costs members too
little to break away from it. The alliance has to go into deeper
integration: for example, a common currency, as in the European Union;
or even further, a common constitution, as in the United States. Of
course, every step up the staircase of standardization is harder to
achieve, but it also brings more benefits. The entire Internet deserves
something better.

If XML insists on being just a common syntactic format, its fate is
going to be like the Commonwealth's, with more and more applications
breaking away from it (like JSON and HTML5). We have seen more and more
people inventing or adopting domain-specific formats: RDF Turtle, JavaFX
object literals, GroovyMarkup, the DOT language in Graphviz, and Lua
used for configuration files.

Is the XML ship sinking? I don't think so. But water is leaking in.

I'm glad that people are starting to rethink the design of XML. The
discussion started by James Clark on MicroXML is good food for thought.
But I think a simple XML subset like MicroXML is not going to save XML.
People might just as well use JSON. While a MicroXML format may have its
niche market, the more important direction for XML to evolve is to do
more rather than less.

And the biggest area to patch up is the data types and data model. All
the advanced processing on XML (schema and ontology, path selection,
query and update) has to be built on top of standardized data types and
a standardized data model. However, with all the mess created by the
various data-model-related XML standards, including XML Infoset, XML
Schema, the XQuery data model and the RDF data model, I doubt the W3C's
ability to clean it up.

If the XML designers don't keep asking themselves questions such as "Why
should we use a general format like XML?" and "What good does it buy
us?", and come up with solid answers, then the application designers
will.

In the next blog article, I'll talk about the unique features of Candle
Markup and how it compares to existing formats like XML, JSON, YAML and
JavaFX object literals.