Bad Pro-XML Arguments

Lots of people complain loudly about XML (because it sucks). One of the creators of XML, Tim Bray, apparently got sick of hearing it and wrote an article defending XML: "Why XML Doesn’t Suck". I felt that he didn’t address the real reasons people hate XML, so I don’t think he persuaded anyone worth persuading. (But it worked great as a mindless-sycophant detection device.)

He listed arguments in support of XML and then “debunked” the common anti-XML arguments. Most of what he says could apply to any standard format. His points never really focus on why he thinks XML in particular is a good standard format.

Pro-XML Arguments

Go read the article first (it’s pretty short and to-the-point). If you don’t, this section won’t make much sense.

XML has internationalization pretty well nailed.

I agree, but internationalization can easily be added to almost any format. It’s an orthogonal issue.

XML can pretty well represent anything.

The XML data model is crap. Sure, you will be able to find some way of encoding your data, but that’s true for a lot of other crap data models (the Windows registry, Lisp S-Expressions).

Even though there is some encoding for your data, I don’t think XML lets you encode it naturally. When you put your data into XML, you’ll probably end up having to subtly mutate it to compensate for XML’s shortcomings.

XML forces syntax-level interoperability.

Syntax-level interoperability is not worth much. You need interoperability at the level of the data model. As stated before, the XML data model is crap, making interoperability painful.

XML supports constructive finger-pointing.

His example is that you can easily tell when someone gives you XML that isn’t well-formed. Big deal. There’s nothing special about XML that makes it easier to create a strict validator for.
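To back that up, here is a rough Python sketch that puts a strict well-formedness check for XML (using the standard library parser) next to an equally strict check for a toy S-Expression syntax. The toy syntax (atoms and parentheses only, exactly one top-level expression) and the function names are my own assumptions for illustration, not anything from Tim Bray's article.

```python
import xml.etree.ElementTree as ET


def xml_is_well_formed(text: str) -> bool:
    """Return True if the standard-library XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def sexpr_is_well_formed(text: str) -> bool:
    """Strict check for a toy S-Expression syntax (assumed: atoms and parens only)."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    depth = 0
    top_level = 0  # number of complete top-level expressions seen so far
    for tok in tokens:
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
            if depth < 0:
                return False  # closing paren without a matching opener
            if depth == 0:
                top_level += 1
        elif depth == 0:
            top_level += 1  # a bare atom at the top level
    # Balanced, and exactly one top-level expression (like XML's single root).
    return depth == 0 and top_level == 1
```

Both checkers reject sloppy input outright, and the hand-written one is only a handful of lines. Strictness is a policy decision, not a property of angle brackets.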

XML confers longevity.

Any standard format does. As stated before, XML usually won’t let you use the most natural structure for your data, so you could be losing a little bit in the conversion.

"Debunked" Anti-XML Arguments

XML is verbose.

Tim Bray says that bandwidth isn’t really a problem nowadays and that XML compresses really well. Whether or not bandwidth is a problem, XML is redundant, and that is often a bad thing. (Long variable names are OK, unnecessarily repeated code is not.)

He also says not to use XML where bandwidth is critical. So are we supposed to use another format for certain applications? The more reasonable solution would be to find a more efficient encoding of XML for those purposes. I don’t know why he didn’t suggest that. Very efficient encodings can be created if the data is thoroughly typed.
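As a rough illustration of what typing buys you, the Python sketch below encodes the same list of integer points twice: once as XML and once as a fixed binary layout. The layout (a 32-bit count followed by pairs of 32-bit little-endian integers), the sample data, and the element names are all assumptions I made up for the example.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical sample data: a list of (x, y) integer points.
points = [(1, 2), (30, 40), (500, 600)]

# XML encoding: every value is wrapped in named open and close tags.
root = ET.Element("points")
for x, y in points:
    point = ET.SubElement(root, "point")
    ET.SubElement(point, "x").text = str(x)
    ET.SubElement(point, "y").text = str(y)
xml_bytes = ET.tostring(root)

# Typed encoding: both sides already know the shape of the data, so the
# bytes carry only a count (uint32) and the values (pairs of int32).
packed = struct.pack("<I", len(points)) + b"".join(
    struct.pack("<ii", x, y) for x, y in points
)


def unpack_points(data: bytes) -> list:
    """Decode the assumed layout back into a list of (x, y) tuples."""
    (count,) = struct.unpack_from("<I", data, 0)
    return [struct.unpack_from("<ii", data, 4 + 8 * i) for i in range(count)]


print(len(xml_bytes), len(packed))
```

The binary form round-trips the same data in a fraction of the bytes, precisely because the type information lives in one place instead of being repeated around every value.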

XML does what Lisp S-Expressions and CSV already could.

Here, he basically says that the Lisp S-Expression format is just as good as XML. I think both are bad, but there’s an important difference… S-Expressions are not expressive enough because of their very minimalist format. XML’s format is much more complicated but no more expressive than S-Expressions.

XML has both elements and attributes, why?

The real reason XML has attributes is that HTML has attributes. They’re useful in HTML for “out-of-band” data because HTML was originally just a way of “marking-up” a plain text document.

His “show-stopper” example of the usefulness of this feature confuses syntax with data model. Attributes are just syntactic sugar.

A parser could accept both attributes and sub-elements but then treat them uniformly. But, of course, XML attributes can only be simple strings, so they can’t represent nested structures like sub-elements can. Because of this deficiency, attributes are, in general, completely useless.
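Here is a minimal Python sketch of what “treat them uniformly” could look like on top of the standard library parser. The `fields` helper and the `person` documents are hypothetical names made up for the example.

```python
import xml.etree.ElementTree as ET


def fields(element):
    """Merge attributes and leaf sub-elements into one name -> string mapping."""
    result = dict(element.attrib)
    for child in element:
        result[child.tag] = child.text or ""
    return result


as_attributes = ET.fromstring('<person name="Ann" city="Oslo"/>')
as_elements = ET.fromstring("<person><name>Ann</name><city>Oslo</city></person>")

# Both spellings collapse to the same mapping.
print(fields(as_attributes) == fields(as_elements))
```

Once the two spellings collapse to the same mapping, the attribute form is just shorthand. It is also the weaker shorthand: only the sub-element form could hold a nested structure in place of the string.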

Mixed content sucks.

I, personally, don’t think this is a good anti-XML argument. Mixed content is a good syntactic optimization for document-style data. Still, XML’s data model itself is fundamentally tied to mixed content, making “normal” data harder to deal with.

It’s also important to realize that “mixed content” isn’t magic. It’s simply the sum of two syntactic optimizations: implicit text nodes and implicit content containers. Take the following fragment, for example:

That’s all it is. If you’ve done any XML programming, you might recognize the de-sugared structure.
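For a concrete look at the de-sugared form, the Python sketch below asks the standard library DOM parser what it actually builds for a small mixed-content fragment. The fragment and the `describe` helper are made up for this illustration.

```python
from xml.dom import minidom

# A made-up mixed-content fragment: text and an element side by side.
doc = minidom.parseString("<p>Hello <b>big</b> world</p>")


def describe(node):
    """Turn a DOM node into plain tuples so the structure is easy to see."""
    if node.nodeType == node.TEXT_NODE:
        return ("text", node.data)
    return ("element", node.tagName, [describe(c) for c in node.childNodes])


desugared = describe(doc.documentElement)
print(desugared)
```

The runs of text that look “loose” in the source show up as ordinary text nodes, sitting in the same child list as the `b` element. Nothing magic, just nodes the syntax let you leave unwritten.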

XML is both a tree and a sequence.

Tim says: “Well, I have news for you, data is often both a tree and a sequence. This may not fit neatly into the programming paradigm you’re currently practising, but it’s the way life is.”

He doesn’t provide any examples of data that doesn’t fit into the standard types that most programming languages use. Now, Tim Bray is well-respected (and deservedly so) but he still has a ways to go before he can start using Proof By Assertion!

To be fair, though, I think he just didn’t understand the criticism. It’s obvious that a data structure can be composed of both trees (records) and sequences but existing programming languages handle this just fine. XML’s problem is that the same entity is overloaded to represent sequences and records, which is inelegant.
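A small Python sketch of the overloading, using hypothetical documents: in a programming language a record and a sequence are different types, but in XML both are just “an element with child elements”, so a generic program has to guess.

```python
import xml.etree.ElementTree as ET

# In Python the two shapes are distinct types:
#   record   -> {"name": "Ann", "city": "Oslo"}
#   sequence -> ["Ann", "Bob"]
record = "<person><name>Ann</name><city>Oslo</city></person>"
sequence = "<names><name>Ann</name><name>Bob</name></names>"
one_item = "<names><name>Ann</name></names>"


def child_tags(xml_text):
    """List the tag names of an element's direct children, in order."""
    return [child.tag for child in ET.fromstring(xml_text)]


def guess_shape(xml_text):
    """Guess the intended shape from structure alone (a heuristic, by necessity)."""
    tags = child_tags(xml_text)
    if len(tags) != len(set(tags)):
        return "sequence"  # repeated names strongly suggest a list
    return "record-or-sequence"  # structure alone cannot settle it


for text in (record, sequence, one_item):
    print(guess_shape(text))
```

The one-item list is structurally indistinguishable from a record with a single field, so the answer has to come from knowledge kept outside the document.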

There are ugly complex standards built on XML.

Tim Bray says that he doesn’t like many of the standards built on top of XML but that the suckiness of those standards shouldn’t be blamed on XML. He says, in particular, that he doesn’t like XML Schema. (I think DTDs suck too; deep down, the two are basically the same thing).

He also says that people are doing all sorts of cool things with plain XML. So what? Any standard format would work just fine. There’s nothing about the XML format that “enables” certain applications.

The reason people created additional standards was probably that plain XML is insufficient for many tasks. However, those standards were doomed from the start because they were built upon a shaky foundation.

Makes Programming Harder

Tim acknowledges that the XML data model is difficult to deal with in current programming languages (he wrote an entire article about it). Instead of leaving it at that, he tries to make excuses:

“But the impedence mismatch, I suggest, is just a fact of life, and the benefits we get (i18n, interop, and so on) make it worthwhile.”

That’s a false dilemma along with a little more Proof By Assertion (actually, it looks like Proof By Suggestion, which is even weaker). A good data model doesn’t preclude internationalization, interop or any number of “so on”s. There’s nothing about the XML format or data model that aids interoperability. In fact, a better data model would make interoperability less painful.

Once again, he’s arguing that XML does do some things well (all of them peripheral) without defending against attacks on XML’s core problems.

Other Arguments

There are some other incorrect arguments that aren’t in Tim Bray’s article.

XML Is Easy For Programs To Parse

The standard rebuttal to this is simply that the statement is false. The XML standard is huge and complex and it’s not easy to write a fully conformant parser.

But who cares? How many people need to write parsers? You essentially need only one implementation. It might be nice to have a couple more to foster healthy competition, but the important thing is that you don’t really gain much by making it easier to parse. Java source code isn’t easy to parse (relative to XML) but it doesn’t matter because everyone just uses an existing parser (any Java compiler).

The goal of making something trivial for a program to parse often involves making the format harder for humans to deal with. Since people read and write XML more often than they develop XML parsers, why optimize for the uncommon case?

Bean-Counting

Some people in the pro-XML crowd seem to excrete terms like “training cost” or “industry standard” on a regular basis. They’ll try to justify using XML just because other people use it and because it would be more work to use a different format.

I’m not going to touch those arguments. I do not agree with them, but I don’t think these arguments can be settled qualitatively. You need solid real-world statistics (and that’s way too much work!). Even if it does make “business sense” to go with XML, it would only be because we need some standard. The XML format itself would continue to suck.

XML is Self-Describing

“Self-describing” implies much more than XML can deliver. All XML does is force you to annotate each node with its name. What are the benefits? Well, if you have binary or CSV data and you lose the description of how the data is organized, you’re in trouble. With XML, you might be able to recover some of that information (manually) from your data files by looking at the tag names and guessing. This is even less useful than a simple text file that describes your data format in prose.

How useful is that ability? I don’t know…how often do you lose your type definition? Keep that safe and everything is fine. Relational databases force you to keep the schema and the data together which, in a way, makes database data even more self-describing. Database schemas also contain more comprehensive type information and if you throw triggers into the mix, XML has no chance of being as self-describing as a relational database.

But assuming you’re competent enough to take care of your type definitions, there’s no benefit to being self-describing. It doesn’t make XML easier to program with because an XML document can’t really tell a program what kind of data it represents. The tag and attribute names look like random text to a computer. An XML file doesn’t even have all the information that a DTD would have. In the end, “self-describing” just means “partially redundant”.
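To see how little the names tell a program, here is a short Python sketch. The `order` document and the `ORDER_TYPES` table are hypothetical; the table stands in for a type definition that has to live outside the XML.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<order><id>0042</id><total>19.90</total><shipped>no</shipped></order>"
)

# What the document alone gives a program: names mapped to strings.
raw = {child.tag: child.text for child in doc}

# The type definition is kept outside the document (assumed for this example).
ORDER_TYPES = {
    "id": str,  # an identifier, so the leading zeros matter
    "total": float,
    "shipped": lambda s: s == "yes",
}
typed = {name: ORDER_TYPES[name](value) for name, value in raw.items()}

print(raw)
print(typed)
```

Nothing in the file says that `id` must keep its leading zeros, that `total` is a number, or that `shipped` is a boolean. That knowledge comes from the type definition, which is exactly the thing “self-describing” was supposed to make unnecessary.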

Then why is it so popular?

I don’t know…momentum, maybe?

The reason it’s gotten this far is that it doesn’t look so bad at first. Before I had used it, I thought it was a neat idea. After using it for a while it’s easy to see XML’s deficiencies, but you’ve already made an investment and might not feel like throwing it away.

The other problem is that “everybody” else seems to love it. Unfortunately, the perception that “everyone” is using it was probably a result of hype from technology columnists (who, with amazing consistency and precision, get most excited about the stupidest things). If all you do is write a 20-line example of how to read in an XML address book and print it out again, you’re not going to run into any real-world problems.