What HTML5’s outlining feature means for DITA

Before XML and SGML, IBM’s Generalized Markup Language (GML) represented document structure in a tag-like way, expressing the semantics of the content while separating that content from the underlying formatting controls. GML’s structural elements included explicit heading-level tags, :h1. through :h6., which informed HTML’s early use of those same names for its visual heading elements. Both markup languages lacked formal containers for the scope of content under a heading, but GML had a run-time capability that writers could use to check the organization of their content: a generated Table of Contents (or ToC) view.

I’ve seen several JavaScript utilities that attempt to represent HTML headings in the browser as presumed ToCs, but these tools cannot distinguish between headings used for figures and headings used for hierarchy. With no standard in the language for separating the presentational usage of headings from the structural usage, each implementation was different.

The designers of HTML5 recognized the need for consistent representation of structural hierarchy in HTML by introducing 1) some very DITA-like markup (<section> and <article> in particular), 2) a wrapper for intentional presentational usage (<hgroup>), and 3) like its GML inspiration, a formal ability for the browser to infer a Table of Contents from the indicated structural markup. This outlining algorithm is described in the "Headings and sections" part of the HTML5 specification, http://dev.w3.org/html5/spec/sections.html#headings-and-sections.

This so-called “HTML5 outlining algorithm” encourages markup usage that an HTML5-conforming browser can represent in a formal ToC right on the web page:
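To make the idea concrete, here is a minimal Python sketch of how a tool might derive such a ToC. It is a deliberate simplification, not the algorithm from the specification: it assumes that depth comes only from the nesting of sectioning elements and that every heading becomes one entry. The names OutlineSketch, outline, render_toc, and SAMPLE are made up for this illustration.

```python
from html.parser import HTMLParser

# Elements that open a new level in this simplified outline (an assumption
# of the sketch; the specification covers more cases than this).
SECTIONING = {"section", "article", "aside", "nav"}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class OutlineSketch(HTMLParser):
    """Collect (depth, title) pairs; depth is the sectioning nesting level,
    not the rank of the heading element itself."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # number of sectioning elements currently open
        self.buffer = None    # text chunks while inside a heading, else None
        self.entries = []

    def handle_starttag(self, tag, attrs):
        if tag in SECTIONING:
            self.depth += 1
        elif tag in HEADINGS:
            self.buffer = []

    def handle_data(self, data):
        if self.buffer is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag in SECTIONING:
            self.depth = max(0, self.depth - 1)
        elif tag in HEADINGS and self.buffer is not None:
            self.entries.append((self.depth, "".join(self.buffer).strip()))
            self.buffer = None


def outline(html):
    """Return the list of (depth, title) entries found in the markup."""
    parser = OutlineSketch()
    parser.feed(html)
    parser.close()
    return parser.entries


def render_toc(entries):
    """Render the entries as an indented plain-text ToC."""
    return "\n".join("  " * depth + title for depth, title in entries)


# Every heading is an <h1>; only the <section> nesting conveys hierarchy.
SAMPLE = """
<body>
  <h1>User guide</h1>
  <section>
    <h1>Installing</h1>
    <section>
      <h1>Prerequisites</h1>
    </section>
  </section>
  <section>
    <h1>Configuring</h1>
  </section>
</body>
"""

print(render_toc(outline(SAMPLE)))
```

Note that the sample uses <h1> everywhere, yet the rendered ToC still indents "Prerequisites" under "Installing": in this sketch the containers, not the heading numbers, carry the structure.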

Another common HTML usage pattern is for various heading combinations to represent formal titles with “kicker” subheadings (as in <h1>Blog title</h1><h4>Tagline</h4>). A kicker is not a formal part of the hierarchy, but it can be kept out of the structural outline by placing both elements within an <hgroup> container. The outlining algorithm then selects only the higher-level heading as the structural heading, and ignores the subheading as a misleading indicator of hierarchy:

<hgroup>
  <h1>Blog title</h1> <!-- structural in this context -->
  <h4>Tagline</h4> <!-- presentational in this context -->
</hgroup>
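The selection rule for <hgroup> can be sketched in a few lines of Python. This is an illustration under simplifying assumptions (flat input, well-formed tags), and the names HgroupAwareHeadings and structural_headings are hypothetical. Inside an <hgroup>, only the highest-ranked heading (the lowest number) is kept; headings outside any <hgroup> are kept as they are.

```python
from html.parser import HTMLParser

# Map heading tag names to their rank; h1 is the highest rank.
RANK = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}


class HgroupAwareHeadings(HTMLParser):
    """Collect only the headings that would count as structural."""

    def __init__(self):
        super().__init__()
        self.in_hgroup = False
        self.group = []       # (rank, text) pairs inside the current <hgroup>
        self.current = None   # tag name of the heading being read, or None
        self.buffer = []
        self.structural = []

    def handle_starttag(self, tag, attrs):
        if tag == "hgroup":
            self.in_hgroup = True
            self.group = []
        elif tag in RANK:
            self.current = tag
            self.buffer = []

    def handle_data(self, data):
        if self.current is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "hgroup":
            if self.group:
                # Keep the heading with the lowest rank number; on a tie,
                # min() returns the first one encountered.
                best = min(self.group, key=lambda pair: pair[0])
                self.structural.append(best[1])
            self.in_hgroup = False
        elif tag == self.current:
            text = "".join(self.buffer).strip()
            if self.in_hgroup:
                self.group.append((RANK[tag], text))
            else:
                self.structural.append(text)
            self.current = None


def structural_headings(html):
    """Return heading texts, with each <hgroup> reduced to one heading."""
    parser = HgroupAwareHeadings()
    parser.feed(html)
    parser.close()
    return parser.structural
```

Fed the example above followed by an ordinary <h2>, the function would return the blog title and the <h2> text, and leave the tagline out.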

These capabilities are exciting to see, but I expect it will take some time for the larger world of HTML content creators to reliably embrace these outlining conventions. I foresee these implications:

Migration tools for HTML-to-DITA can now mimic the same outlining algorithm in order to infer a DOM that is structured in the way the browser sees it. From there, the migration tool may apply more appropriate final output restructuring, but at least there is no guessing about the basic input model: it will be a reliable starting place across browsers. All migration tools should try to provide “HTML5 outlining” as one of the available modes for disambiguating the apparent structure in original HTML content.
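As a rough sketch of what such a migration step might look like, the following Python turns an already-computed outline into nested DITA topics. The input format (a list of (depth, title) pairs with the root at depth 0), the function names, and the naive id scheme are all assumptions made for this illustration; a real tool would also carry the body content across and guarantee unique ids.

```python
import re
import xml.etree.ElementTree as ET


def topic_id(title):
    """Hypothetical id scheme: a prefixed slug, so the id starts with a letter."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return "topic-" + (slug or "untitled")


def outline_to_dita(entries):
    """Build nested <topic> elements from (depth, title) pairs.

    The first entry becomes the root topic. Returns None for an empty outline.
    """
    root = None
    stack = []  # stack[d] is the most recently created topic at depth d
    for depth, title in entries:
        topic = ET.Element("topic", id=topic_id(title))
        ET.SubElement(topic, "title").text = title
        if root is None:
            root = topic
            stack = [topic]
            continue
        # Clamp the depth so an entry never skips a level or replaces the root.
        depth = max(1, min(depth, len(stack)))
        del stack[depth:]          # close topics at this depth or deeper
        stack[-1].append(topic)    # nest under the nearest shallower topic
        stack.append(topic)
    return root


outline = [
    (0, "User guide"),
    (1, "Installing"),
    (2, "Prerequisites"),
    (1, "Configuring"),
]
print(ET.tostring(outline_to_dita(outline), encoding="unicode"))
```

The point of the sketch is that once the outline is known, the mapping to DITA's nested topics is mechanical; the guesswork lies in computing the outline, which is what a shared algorithm removes.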

Schema-driven writing tools for HTML authors cannot directly enforce these structures at the point where they might be desired, as these containers are optional in the HTML5 content model. Tools will need to offer explicit assists to help writers create a properly nested structure for outlining. Of course, DITA authors get this structuring guidance by default, since nesting is such a strong design pattern in DITA’s topic orientation. The ability of DITA editors, even the “baby DITA editors,” to guide authors in creating properly nested structure is a marketable benefit for using DITA tools for creating structured web content.
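One form such an assist could take is a simple checker that flags markup where the nesting is left implicit. The Python sketch below is only an illustration: the two rules it applies (a container with more than one heading, a sectioning element with no heading) are my own choice of heuristics, not requirements from the specification, and the names NestingLint and lint_nesting are hypothetical. It assumes well-formed, properly closed tags.

```python
from html.parser import HTMLParser

SECTIONING = {"section", "article", "aside", "nav"}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class NestingLint(HTMLParser):
    """Warn where an author relied on heading ranks instead of containers."""

    def __init__(self):
        super().__init__()
        self.names = ["document"]  # open containers; index 0 is the page itself
        self.counts = [0]          # headings seen directly in each container
        self.hgroup_depth = 0
        self.warnings = []

    def _count_heading(self):
        self.counts[-1] += 1
        if self.counts[-1] == 2:   # report once per container
            self.warnings.append(
                f"<{self.names[-1]}> has more than one heading; "
                "consider wrapping each part in its own <section>"
            )

    def handle_starttag(self, tag, attrs):
        if tag in SECTIONING:
            self.names.append(tag)
            self.counts.append(0)
        elif tag == "hgroup":
            self.hgroup_depth += 1
            self._count_heading()  # a whole <hgroup> counts as one heading
        elif tag in HEADINGS and self.hgroup_depth == 0:
            self._count_heading()

    def handle_endtag(self, tag):
        if tag in SECTIONING and len(self.names) > 1:
            if self.counts[-1] == 0:
                self.warnings.append(f"<{tag}> has no heading")
            self.names.pop()
            self.counts.pop()
        elif tag == "hgroup" and self.hgroup_depth > 0:
            self.hgroup_depth -= 1


def lint_nesting(html):
    """Return a list of warning strings for the given markup."""
    checker = NestingLint()
    checker.feed(html)
    checker.close()
    return checker.warnings
```

An editor could run a check like this as the author types and offer to wrap the flagged headings in <section> elements, which is roughly the guidance a DITA editor gives for free through its content model.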

Can web content strategists expect brighter days ahead for well-structured content that can be reused somewhat more interoperably with their DITA counterparts? Could the HTML world recognize the value of creating such structures initially in DITA editors? Could DITA take over the world? I have no crystal ball clues for any of these, but I expect at least “an improvement in diplomatic relationships” between these two formerly estranged markup relatives.