XHTML vs HTML

From Habari Project

This page has been created to house points relating to the XHTML vs HTML debate. In the context of this page, we're talking about REAL XHTML, served with the correct MIME type, and all the fun and problems that come with it.

Let's start it off with a lovely quote from a post to the Habari-dev mailing list by Mark Pilgrim:

"If you want to produce application/xhtml+xml, you are free to do so. If you get it right, no one will notice. If you get it wrong, no one will forgive you."

Faux XHTML: "XHTML" sent as text/html. It may look like XHTML, it may even validate (unlikely, but possible), but it's parsed as HTML by browsers. It conveys none of the advantages for which XHTML was invented. Since it's usually not well-formed, it cannot easily be converted to "real" XHTML.

Real XHTML: XHTML sent as application/xhtml+xml to compatible browsers (and, perhaps, as text/html to Internet Explorer). Parsed as XML, it must be well-formed, or the user receives a fatal parsing error. Its XML nature allows, among other things, extensions by other XML dialects, like SVG and MathML.
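As a rough illustration of "application/xhtml+xml to compatible browsers, text/html to the rest", here is a minimal sketch of choosing the MIME type from the request's Accept header. The function name choose_content_type is made up for this page, and the parsing is deliberately simplified: it only looks for an explicit application/xhtml+xml entry with a non-zero q value, and does not attempt full HTTP content negotiation.

```python
XHTML_MIME = "application/xhtml+xml"
HTML_MIME = "text/html"


def choose_content_type(accept_header):
    """Pick the MIME type to serve, given a raw HTTP Accept header.

    Hypothetical helper: serves real XHTML only when the client lists
    application/xhtml+xml explicitly with q > 0, otherwise text/html.
    """
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if parts[0].lower() != XHTML_MIME:
            continue
        q = 1.0  # a media range without a q parameter defaults to q=1
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable q as "not acceptable"
        if q > 0:
            return XHTML_MIME
    return HTML_MIME


if __name__ == "__main__":
    print(choose_content_type("text/html,application/xhtml+xml,*/*;q=0.8"))
    print(choose_content_type("text/html, */*"))
```

A client that only sends */* (as older versions of Internet Explorer did) falls through to text/html, which matches the fallback described above.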

To focus the discussion, let us look at some blogs which actually do use real XHTML.

a) Both have a use-case which justifies the extra effort of producing real XHTML.

b) Both assemble their pages by string concatenation.

c) Both go to extraordinary lengths to ensure well-formedness, in software.

If you are going to build your pages by concatenating strings, achieving well-formedness is difficult. And it's fragile. A misbehaving plugin, or a simple failure to properly sanitize user input and ... boom! ... visitors are staring at a Yellow Screen of Death.
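To make the fragility concrete, here is a small sketch in Python (not the language of any of the tools discussed here). The two render functions are hypothetical stand-ins for a template; the XML check uses the standard library's xml.etree.ElementTree parser, which is strict in roughly the way a browser's XML parser is.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def render_comment_unsafe(text):
    # Plain string concatenation: user input goes in verbatim.
    return '<div class="comment"><p>' + text + "</p></div>"


def render_comment_escaped(text):
    # Same concatenation, but the input is escaped first.
    # Forget this in one place and the whole page breaks.
    return '<div class="comment"><p>' + escape(text) + "</p></div>"


def is_well_formed(markup):
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    comment = "AT&T says 1 < 2 <br>"
    print(is_well_formed(render_comment_unsafe(comment)))   # False
    print(is_well_formed(render_comment_escaped(comment)))  # True
```

One bare ampersand in one comment is enough to make the unsafe version fail to parse, and in real XHTML that failure is what the visitor sees.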

If you are going to produce real XHTML in a tool usable by ordinary users, then you cannot do it by string concatenation. You need to assemble your content by serializing an XML DOM tree.
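A minimal sketch of what "serializing an XML tree" means, again in Python with the standard library's ElementTree. The build_page function and its arguments are invented for illustration; a real tool would have far more structure. The point is only that text is attached to nodes and the serializer does the escaping, so the output is well-formed by construction.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def build_page(title, paragraphs):
    """Build a tiny XHTML document as a tree and serialize it."""
    # Emit the XHTML namespace as the default namespace (no prefix).
    ET.register_namespace("", XHTML_NS)

    html = ET.Element("{%s}html" % XHTML_NS)
    head = ET.SubElement(html, "{%s}head" % XHTML_NS)
    ET.SubElement(head, "{%s}title" % XHTML_NS).text = title
    body = ET.SubElement(html, "{%s}body" % XHTML_NS)
    for text in paragraphs:
        # Text is stored as data on the node, never spliced into markup.
        ET.SubElement(body, "{%s}p" % XHTML_NS).text = text

    return ET.tostring(html, encoding="unicode")


if __name__ == "__main__":
    print(build_page("Tom & Jerry", ["1 < 2", "<script>"]))
```

Whatever the user types ends up as character data, so there is no way for it to unbalance a tag or introduce an undefined entity.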

If you want to allow plugins, then your plugin API cannot allow plugin authors to stick arbitrary strings in the output. Rather, they should be allowed to add nodes to the DOM tree, or to manipulate existing ones.
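Here is a sketch of what such a plugin interface might look like. This is not Habari's plugin API or anyone else's; build_post, footer_plugin and render are hypothetical names, and the XHTML namespace is left out to keep the example short. Plugins receive the tree and may only add or change nodes; the core does the serializing.

```python
import xml.etree.ElementTree as ET


def build_post(title, text):
    """Core code: build the node tree for a single post."""
    post = ET.Element("div", {"class": "post"})
    ET.SubElement(post, "h2").text = title
    ET.SubElement(post, "p").text = text
    return post


def footer_plugin(post):
    """Hypothetical plugin: appends a footer node to the post."""
    footer = ET.SubElement(post, "p", {"class": "footer"})
    # Even sloppy text cannot break well-formedness; it is just data.
    footer.text = "Filed under <misc> & more"


def render(post, plugins):
    """Let each plugin manipulate the tree, then serialize once."""
    for plugin in plugins:
        plugin(post)
    return ET.tostring(post, encoding="unicode")


if __name__ == "__main__":
    print(render(build_post("Hello", "First post."), [footer_plugin]))
```

A plugin written against an interface like this can still produce ugly or invalid markup, but it cannot produce markup that fails to parse.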

And so forth... It requires a programming discipline entirely different from that employed in the Habari project heretofore.

I don't have any examples of blog software constructed this way.

I can, however, point to some Wiki software that assembles its content that way. It's written in Ruby on Rails, and so it still uses templates (i.e., string concatenation) for its output. But the content is assembled by serializing an XML tree using REXML.

Again, there's a compelling use-case -- the ability to freely write equations in LaTeX, rendered to MathML -- for going to the extra trouble software-wise. But, more relevant, it's much less fragile, and was much easier to implement, than the previous examples, which used string concatenation.

It seems to me that, if you are going to produce XHTML, that's what you need to do. If you are going to produce faux XHTML (served as text/html) by crappy old string-concatenation techniques, then you might as well produce HTML4. Browsers are going to consume it as malformed HTML4 anyway.

Probably, there are too few potential users who are interested in MathML or SVG (or whatever) to make the extra effort required to produce real XHTML worth it. (Personally, I think there's a chicken-or-egg aspect to that question: the only way to find out how many people would like having SVG on their blog is to provide a blogging tool which allows them to do it.)