What HTML5’s parsing algorithm means for DITA

“Tag soup” is neither satisfying nor nutritious. This pejorative name describes the unstructured HTML that browsers have had to consume since day one of the World Wide Web. As the name implies, it is a mixture of markup with unclosed, misused, or badly nested elements. The expectation on the Web is that everything should just work, so browsers dutifully consume all content and try to produce a rendering without complaint. Until now, each browser has had its own approach to the problem, so some kinds of tag soup have rendered differently from one browser to the next. HTML5 effectively brings a certified chef back into the kitchen.

The “secret ingredient” in the HTML5 specification is not new markup at all. Rather, it is a chapter devoted to defining a standard way for all conforming browsers to resolve “tag soup” parsing issues. As in English grammar, “parsing” is simply the picking apart of the parts of a text: separating the markup from the readable content before handing the pieces off for rendering. In short, if you hand the same malformed document to each vendor’s conforming HTML5 browser, the parsed view of that document will be identical in each browser. This parsing algorithm is described in the HTML5 specification, in the chapter “Parsing HTML documents” (http://dev.w3.org/html5/spec/parsing.html).
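To make the idea of rule-based recovery concrete, here is a minimal sketch in Python using only the standard library. It is emphatically not the HTML5 tree-construction algorithm, which has many more rules. The class name ToyNormalizer, the function normalize, and the three recovery rules it applies are all assumptions made for illustration. The point is only that when the recovery rules are written down, the same soup always produces the same well-formed result.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# A few elements that never take an end tag in HTML (illustrative subset).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class ToyNormalizer(HTMLParser):
    """Toy tag-soup repair. NOT the HTML5 parsing algorithm."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_tags = []  # stack of currently open element names
        self.out = []        # well-formed output fragments

    def handle_starttag(self, tag, attrs):
        # Rule 1: a new <p> or <li> closes an open sibling of the same kind.
        if tag in ("p", "li") and self.open_tags and self.open_tags[-1] == tag:
            self._close_through(tag)
        attr_text = "".join(f" {k}={quoteattr(v or '')}" for k, v in attrs)
        if tag in VOID:
            self.out.append(f"<{tag}{attr_text}/>")
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Rule 2: an end tag closes everything opened inside its element;
        # a stray end tag with no matching open element is ignored.
        if tag in self.open_tags:
            self._close_through(tag)

    def handle_data(self, data):
        self.out.append(escape(data))

    def _close_through(self, tag):
        # Pop and close every open element up to and including `tag`.
        while self.open_tags:
            name = self.open_tags.pop()
            self.out.append(f"</{name}>")
            if name == tag:
                break

    def result(self):
        self.close()  # flush any buffered text
        # Rule 3: anything still open at end of input is closed.
        while self.open_tags:
            self.out.append(f"</{self.open_tags.pop()}>")
        return "".join(self.out)


def normalize(soup):
    parser = ToyNormalizer()
    parser.feed(soup)
    return parser.result()
```

With these rules, normalize("<p>One<p>Two <b>bold</p>") returns "<p>One</p><p>Two <b>bold</b></p>". A real HTML5 parser would resolve some inputs differently (for example, it reopens formatting elements such as b after a paragraph break), which is exactly why a shared, published algorithm matters.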

This applies to DITA in several key ways:

Because weird tag soup structures are decomposed more consistently, the resulting view of the document is more like well-formed XML. This algorithm effectively does much the same cleanup as the handy TIDY tool at SourceForge (http://tidy.sourceforge.net/). In theory, you could now apply an XSLT stylesheet to this parsed document tree and get more consistent migration from this formerly toxic mish-mash into a handy structure like DITA.

For in-browser editors, this inherent cleanup can simplify the usual HTML cleanup steps for content being written back out to the saved format. I have not tested exactly how the parsing algorithm affects content that is rendered and then saved out from contentEditable contexts, but it deserves a look from developers of in-browser editors that could reasonably require an HTML5-compliant browser as their host. Most users who have automatic updates enabled for their browsers already have this capability, since it is now in all major browsers, so it is worth putting on a feature road map.

The algorithm is described so that it can be implemented in standalone tools as well as by the parsers integrated into browsers. I expect that we will soon see “HTML5 well-formed mode” as a parsing option in popular XML parsers. In short, rather than sending an HTML document through the initial TIDY step to coax it into a well-formed state, the parser can take that input directly and do much the same, bypassing an admittedly clunky part of a normal HTML conversion pipeline.
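Once the input is well-formed, the migration step itself becomes routine tree rewriting. The following Python sketch stands in for the XSLT stylesheet mentioned earlier. It uses the standard library's ElementTree rather than XSLT, and the names TAG_MAP, copy_mapped, and html_to_dita_topic are hypothetical. It assumes the input has already been normalized into a single well-formed root element, and it does not validate the result against the DITA DTDs. Unmapped elements fall back to ph, which is a simplification.

```python
import xml.etree.ElementTree as ET

# Hypothetical, deliberately tiny mapping from HTML element names to DITA ones.
TAG_MAP = {
    "p": "p", "ul": "ul", "ol": "ol", "li": "li", "pre": "codeblock",
    "b": "b", "strong": "b", "i": "i", "em": "i", "code": "codeph",
}


def copy_mapped(src, dest_parent):
    """Recursively copy `src` under `dest_parent`, renaming via TAG_MAP.

    Elements with no mapping become <ph> (a simplification).
    Attributes are dropped; text and tail text are preserved.
    """
    dest = ET.SubElement(dest_parent, TAG_MAP.get(src.tag, "ph"))
    dest.text = src.text
    for child in src:
        copy_mapped(child, dest)
    dest.tail = src.tail
    return dest


def html_to_dita_topic(xhtml, topic_id):
    """Build a minimal DITA <topic> from a well-formed HTML fragment."""
    root = ET.fromstring(xhtml)
    topic = ET.Element("topic", id=topic_id)

    # Use a top-level <h1> as the topic title, if there is one.
    title = ET.SubElement(topic, "title")
    heading = root.find("h1")
    if heading is not None:
        title.text = "".join(heading.itertext()).strip()
    else:
        title.text = "Untitled"

    # Everything else at the top level goes into the topic body.
    body = ET.SubElement(topic, "body")
    for child in root:
        if child.tag != "h1":
            copy_mapped(child, body)
    return ET.tostring(topic, encoding="unicode")


if __name__ == "__main__":
    clean = "<div><h1>Soup</h1><p>One</p><p>Two <strong>bold</strong></p></div>"
    print(html_to_dita_topic(clean, "soup"))
```

A production conversion would need a far richer mapping (tables, links, images, nested sections), but the shape of the work is the same: once parsing is predictable, the conversion is a lookup table plus a tree walk.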

These outcomes may seem theoretical and might not tangibly benefit DITA writers or content owners right away. But I expect that the effect of this normalizing behavior on the world’s HTML content will create more efficiencies in the tools that normally handle HTML flows in and out of our DITA writing and production services. At the very least, wouldn’t it be nice if the task of converting “HTML tag soup” to “delectable DITA topic” were at the kitchen recipe level instead of the Iron Chef challenge it is today?