So, it’s always a good idea when you design a language or data format to provide a way for instance documents to include something like a version attribute, probably near the beginning of the document, to indicate what version of the language is being used.

Or is it always a good idea?

What does a version identifier convey?

In fact, do we even agree on what it means to put something like a language version marker on a document? Let’s imagine a simple XML language designed for setting down recipes. In the first version of the language, the markup looks like this:

The allowed markup in version 1.0 of the recipe is just what’s shown above: an outer <recipe> containing <ingredients> and <steps>, etc.
Eventually it’s decided that it would be useful to provide optional pictures for ingredients or steps. So in version 2 of the language we can do things like:

What’s the best value to put in the version attribute? I know that version 2.0 is the latest version of the recipe language. In fact, that’s the only version of the specification I have next to me, so maybe I should use that?
There’s a problem, though. That version="2.0" marker might not work with software that’s written to version 1.0, and in fact, my document would otherwise be a fine 1.0 recipe document.
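To make the problem concrete, here is a minimal sketch in Python. It assumes a hypothetical recipe vocabulary (`recipe`, `ingredients`, `steps`, plus a `picture` attribute added in 2.0) and an imagined 1.0-era consumer that refuses any version label it doesn't recognize. None of these names come from a real specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical recipe document: labeled 2.0, but using only 1.0 features.
DOC = """<recipe version="2.0">
  <ingredients>
    <ingredient>2 eggs</ingredient>
  </ingredients>
  <steps>
    <step>Beat the eggs.</step>
  </steps>
</recipe>"""

# The only version label an assumed 1.0-era consumer knows about.
SUPPORTED_VERSIONS = {"1.0"}


def strict_v1_accepts(xml_text):
    """Return True only if the version label is one this consumer knows."""
    root = ET.fromstring(xml_text)
    return root.get("version") in SUPPORTED_VERSIONS


def uses_only_v1_features(xml_text):
    """Return True if no element carries the (assumed) 2.0 picture attribute."""
    root = ET.fromstring(xml_text)
    return all("picture" not in element.attrib for element in root.iter())
```

Under these assumptions the consumer turns the document away purely because of its label, even though every element in it would have been acceptable in 1.0.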

So, maybe I should label it 1.0? Unfortunately, that’s a bit hard for me. I don’t want to have to go through the specifications for every version of the recipe language that’s ever existed just to find the oldest that works. I really don’t want to do that if the language has been revised a lot! Also, these sample recipes are small, but if I were using software to write very long documents, then that software would either have to keep track of the latest features used, or else search the entire document before writing it to a file, in order to get that version identifier at the front.
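The "search the entire document" option might look something like the following sketch. The feature table is an assumption: only the `picture` attribute (2.0) comes from the recipe example, and a real language would need an entry for every feature added in every revision, which is exactly the bookkeeping burden described above.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from an attribute to the version that introduced it.
ATTRIBUTE_INTRODUCED_IN = {"picture": (2, 0)}
BASE_VERSION = (1, 0)


def minimum_version(xml_text):
    """Scan the whole document and return the oldest version that covers it."""
    needed = BASE_VERSION
    for element in ET.fromstring(xml_text).iter():
        for name in element.attrib:
            introduced = ATTRIBUTE_INTRODUCED_IN.get(name, BASE_VERSION)
            needed = max(needed, introduced)
    return "%d.%d" % needed
```

Note that the answer is only known after the last element has been visited, so a streaming writer cannot emit the version label at the front without buffering or a second pass.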

Indeed, just these complexities have proven troublesome for the deployment of languages like XML 1.1. XML 1.1 is similar to XML 1.0, but it enables the use of some new Unicode characters (just as recipe language V2 allows for use of new image tags).
The XML 1.1 Recommendation suggests that:

Programs which generate XML SHOULD generate XML 1.0, unless one of the specific features of XML 1.1 is required.

In fact, it has often proven difficult to write software that generates documents labeled as XML 1.1 only when necessary: it’s much easier for XML 1.1-compatible software to label all output with <?xml version="1.1"?>, resulting in documents that are unusable with widely deployed XML 1.0 software.
Perhaps for reasons like this, adoption of XML 1.1 has been slow.

Returning to the recipe example, maybe the version attribute should take a list of versions, and I should put in both 1.0 and 2.0? That could be helpful to consuming software, but it still means that I (or my software) must be familiar with all the previous versions of the specification.

So, we need to ask, is the version identifier used to convey:

The earliest version of the language with which the document is compatible (1.0 in the recipe example)?

The version of the specification I used as a guide when writing the document (2.0)?

A list of versions with which the document is compatible?

Something else?

The best answer is probably different depending on the language, how often it’s revised, whether revisions tend to maintain backwards compatibility, etc.

Is having some sort of version identifier always a good idea?

That Good Practice Note quoted above says “provide a version indicator”, but we’ve just shown that we’re not always quite clear on what that would do anyway. Is it still good advice to suggest that surely each language should provide for something in the instance? If so, should its use be required or optional?

As shown above, it’s common for the same instance document to be legal in many versions of a language.
As long as such documents are likely to have the same or sufficiently compatible meanings per the different versions, then it may be better to omit any indication of version in the instance, and leave it to the receiving software to decide whether the document can be processed. After all, with the second recipe above, the receiver will soon enough discover that it can or can’t process picture attributes, and if not, it either will or won’t know that they can be safely ignored. Version attributes can be helpful in giving early warning of incompatibilities, or as a crosscheck for catching errors, but they’re usually not essential to correct operation.
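As a rough illustration of a receiver that manages without any version marker, the sketch below assumes a 1.0-era consumer that treats unknown attributes on `step` as safe to skip (a "must ignore" policy). The element names and the policy itself are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# An assumed 1.0 consumer knows of no attributes on <step>.
KNOWN_STEP_ATTRIBUTES = set()


def read_steps(xml_text):
    """Return (steps, ignored): the step texts and any attribute names skipped."""
    root = ET.fromstring(xml_text)
    steps, ignored = [], []
    for step in root.iter("step"):
        steps.append((step.text or "").strip())
        # Unknown attributes are noted but do not stop processing.
        ignored.extend(sorted(set(step.attrib) - KNOWN_STEP_ATTRIBUTES))
    return steps, ignored
```

The consumer discovers the unfamiliar `picture` attribute only when it reaches it, and it still recovers every step. Whether skipping is actually safe is a property of the language design, not something this code can decide.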

One important exception is in the case where the language is likely to change in incompatible ways. If the same document means different things in different versions of a language, then it’s very important to indicate which version the author had in mind when creating the document. Putting that version indicator into the document itself is one good way to do it. So maybe the right advice is:

If a language or data format will change in incompatible ways, then indicate the language version used for each instance.
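Here is a sketch of why the marker matters in the incompatible case. The scenario is invented purely for illustration: suppose an `oven` element's `temp` attribute meant degrees Fahrenheit in 1.0 and degrees Celsius in 2.0. Neither the element nor the change of units is part of the recipe example above.

```python
import xml.etree.ElementTree as ET


def oven_celsius(xml_text):
    """Interpret the (hypothetical) oven temperature according to the version."""
    root = ET.fromstring(xml_text)
    version = root.get("version")
    temp = float(root.find("oven").get("temp"))
    if version == "1.0":
        # Assumed 1.0 semantics: Fahrenheit, so convert.
        return (temp - 32) * 5 / 9
    if version == "2.0":
        # Assumed 2.0 semantics: already Celsius.
        return temp
    # Without a known version the same markup is ambiguous.
    raise ValueError("cannot interpret temp without a known version: %r" % version)
```

Here the same markup yields different oven settings depending on the label, and an unlabeled document cannot be interpreted safely at all.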

Are namespaces a good way to identify language versions?

If version identifiers aren’t always a good bet, what about namespaces? Many modern languages allow the creation of globally unique names, identifiers, tags, etc. In XML this is done through use of Namespaces. In RDF, it’s done by using URIs as identifiers.

Sometimes it’s appropriate to use new identifiers for each version of a language, and mechanisms like namespaces can make that easier:

In this example, the element with expanded name {http://example.org/recipeLanguage2, step} allows a picture attribute, but {http://example.org/recipeLanguage1, step} does not.
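A consumer can tell the two `step` elements apart by their expanded names. The sketch below uses Python's ElementTree, which writes an expanded name as `{namespace}local`. The sample document, the `r1` and `r2` prefixes, and the choice to mix both namespaces in one recipe are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

NS1 = "http://example.org/recipeLanguage1"
NS2 = "http://example.org/recipeLanguage2"

# Hypothetical document mixing elements from both namespaces.
DOC = """<r1:recipe xmlns:r1="http://example.org/recipeLanguage1"
           xmlns:r2="http://example.org/recipeLanguage2">
  <r1:steps>
    <r1:step>Beat the eggs.</r1:step>
    <r2:step picture="fold.png">Fold in the flour.</r2:step>
  </r1:steps>
</r1:recipe>"""


def steps_with_namespace(xml_text):
    """Return (namespace, text, picture) for each step, in document order."""
    step_tags = ("{%s}step" % NS1, "{%s}step" % NS2)
    result = []
    for element in ET.fromstring(xml_text).iter():
        if element.tag in step_tags:
            # Strip the leading "{" and keep everything up to "}".
            namespace = element.tag[1:].split("}")[0]
            result.append((namespace, element.text, element.get("picture")))
    return result
```

Each element carries its own version information in its name, so no single document-level label is needed.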

A full discussion of the pros and cons of using namespaces this way is beyond the scope of this note. One important advantage of using namespaces is that they can be easily applied not just to the root element for the language as a whole, but to mixtures of compound document markup, in which each sublanguage evolves with its own namespaces. Also, because namespace names are URIs, you can use the Web itself to get information about them.

Namespaces do have drawbacks. Imagine if there were 50 different namespaces for a language just because 50 separate bugs had been fixed in different errata. Would you republish all the markup in 50 namespaces? Would each document have lots of namespaces, with each element named with the last namespace in which it had been revised? Namespaces can be very useful for designating language versions, but there’s no one idiom that’s right for all languages. We note that most widely deployed tag-based languages for the Web (HTML, XML Schema, XSLT) have chosen either to use the same namespace(s) across multiple versions, or in the case of some flavors of HTML, not to use namespaces at all.

Conclusions

So, the TAG is having second thoughts about the suggestion that all data formats SHOULD provide for version identification. Sometimes it’s a good thing to do, but sometimes not.
Perhaps the right advice will be what’s proposed in the revised Good Practice Note above.
In any case, the TAG has been working for several years on a finding that will explore in detail many issues relating to versioning, and version attributes are likely to be among the topics covered. In the meantime, we thought we’d take the opportunity to signal that we’re not so sure that the advice in the Architecture Document is as good as we thought.

By the way, TAG member David Orchard has covered some of the same topics as well as many others relating to versioning in his personal blog. Links to a few of his postings follow my signature below. Dave is also the principal author of the TAG’s draft finding on versioning. Working drafts covering Terminology, Strategies, and Versioning of XML Languages are available for review. New drafts come out every few months, and we’re hoping to have something more or less complete, well, real soon now.

Noah Mendelsohn

Note: unless otherwise indicated, opinions expressed in the TAG’s blog are those of the individual authors, and do not necessarily represent consensus of the TAG as a whole.

4 Responses to Version Identifiers Reconsidered

I think you are right: version numbers should be bumped only in the case of incompatible changes — but then, of course, you bump up against the notion of “incompatible change.” Since the semantics of most formats are defined basically by the programs that use them (with a few honorable exceptions like regular expressions), is a change to such a program an incompatible change?
As for XML 1.1, my motive in pushing it through the W3C was justice. Since I failed in doing it the right way, I’m now engaged in doing it the wrong way. Google for “XML 1.0 Fifth Edition”.

Paul Walsh asks:
> “Would it not be possible to
> recompile the incompatible
> software that doesn’t understand
> “version 2”?”
I think the answer depends on the circumstance. Many of the languages used on the Web are consumed by programs that have been deployed by many millions of users. Even in situations where recompiling or even rewriting the application is easy, it often takes years before even a majority of users upgrade their individual copies. For example, from the time a new browser release is made available by its authors, how long does it typically take until the majority of users of the earlier versions have upgraded? Usually, on the Web, we have to anticipate that mismatches between deployed versions of software will be common, and will persist for long periods of time. That’s one of the reasons that versioning issues are particularly important in the context of the Web.
Noah Mendelsohn

Hi,
I wonder if we shouldn’t actually have two mechanisms. A version number whose sole purpose would be to process a particular tag according to an old standard (for example when reusing in a new spec something that was previously deprecated).
Perhaps HTML could also include a mechanism to allow authors to be more explicit about what to do if a particular tag is not supported by the user agent. Something like allowing the user agent to use a replacement (perhaps based on a transform for maximum flexibility). It would allow authors to explicitly define a markup fallback when a particular tag is not provided by a user agent. The replacement won’t be used any more as soon as the user agent starts supporting the feature…