DSRL: A new standard that can remove the English-fluency tax on XML

I see from a mail out from the Japanese Secretariat of SC34 today that IS 19757-8, Information technology -- Document Schema Definition Languages (DSDL) -- Part 8: Document Semantics Renaming Language (DSRL) has been approved as a standard by national votes and is ready for publication. Congratulations to editor Martin Bryan, who also has an open-source XSLT-based implementation available.

DSRL is the kind of standard that really needs to be part of the furniture, because it provides a simple solution to a couple of important problems.

The first problem is not one that English-speakers face. We live in an age of standard schemas. It is supposed to be a good thing when the whole world can get behind a common standard. However, one of the selling points of XML is how beneficial it is for people to be able to have documents with comprehensible tags rather than obscure codes. A Spanish speaker can use element names with ñ and a Chinese speaker can make XML documents completely in Chinese, and they don't need to know any English or use any alphabetic characters.

Can you see the contradiction there? If you have an international standard using English, it won't be comprehensible to our Chinese speaker, or at best it will be comprehensible only as an obscure code again, depending on the familiarity of the words.

This is not an abstract issue. The Chinese office standard UOF uses Chinese characters for its element names. Murata-san gave a great presentation about three years ago at the XML conference in Amsterdam where he showed the RELAX NG schemas he had made for Japanese local government use: all Japanese element names. And one of the concerns raised by Malaysia and others about OOXML was (IIRC) that its use of contractions made it impossible to use standard dictionaries to figure out what tag names mean.

And this is where DSRL comes in. It is a simple mapping language to let you say "When you find element X in a document, it is the same as element Y in our end system". A typical use would be merely to use this DSRL schema to transform the document from the old names to the new names (and back again: it is reversible). Or an implementation could, in the fashion of an XSD equivalence group, merely use it to accept documents with the new names.
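
To make the idea concrete, here is a minimal sketch in Python of the kind of reversible renaming described above. It is not DSRL syntax and not Martin Bryan's implementation: the element names, the NAME_MAP table and the helper functions are all hypothetical, chosen only to show local-language names being mapped to the names an end system expects, and back again.

```python
import xml.etree.ElementTree as ET

# Hypothetical name map (an assumption for illustration, not DSRL syntax):
# local-language element names -> the names the end system expects.
NAME_MAP = {"factura": "invoice", "cliente": "customer", "año": "year"}


def rename_elements(xml_text, name_map):
    """Return xml_text with every element name found in name_map replaced."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Names that are not in the map are left untouched.
        elem.tag = name_map.get(elem.tag, elem.tag)
    return ET.tostring(root, encoding="unicode")


def invert(name_map):
    """Build the reverse map; this only works if the map is one-to-one."""
    inverse = {new: old for old, new in name_map.items()}
    if len(inverse) != len(name_map):
        raise ValueError("map is not one-to-one, so it cannot be reversed")
    return inverse


spanish = "<factura><cliente>Ana</cliente><año>2008</año></factura>"
english = rename_elements(spanish, NAME_MAP)
round_trip = rename_elements(english, invert(NAME_MAP))
```

The check in invert() is the point of the sketch: the renaming is only reversible while no two source names map to the same target name.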

Now the immediate comments that DSRL conjures up are "But who is going to make DSRL schemas for every standard language?" and "This is completely impractical" and "Doesn't this make it into an N to N problem?" (the ultimate put-down, it seems! Even worse than saying that a Taiwanese politician lacks warmth.)

Now of course the answer is that this should be available to everyone who needs it. And the people who need it are principally those with writing systems and languages far from the Latin alphabet and the English language, or without an English-fluent programmer community. So I'd say China and even Japan, because of the distance. Probably not India, because of the wide fluency in English, though perhaps a reliance on English fluency does perpetuate a technological divide even there.

Certainly it would be useful for education and training, and not only when there are differences in writing systems: I am sure that a course in XBRL would be clearer to Indonesians if the elements used were in Bahasa Indonesia, to use our neighbours as an example. Ditto with Xhosa, Portuguese, Greek, Arabic and so on!

Now the only way I see this being really practical is if it is built in. And the place to piggyback it is, it seems to me, the XML-in-ZIP container formats. In particular, to build it into the Open Packaging Convention system so that when a part is first read, the OPC libraries will by default perform the transformation. I'd like to see ODF's ZIP and manifest system more layered from ODF, and with DSRL built in too (actually, I'd like ODF to adopt OPC, or some converged packaging system, but that is another matter).

The other use of DSRL is for small-scale maintenance. Very often in a schema's life there is a desire to fix the naming somehow. For example, where an attribute formerly could have values 'black', 'white' or 'grey' it is decided to adopt the US spelling 'gray'. Or it turns out that an attribute's name is too general and you want to make it clear that it only has a more specific use. It is just plain fiddly to have to go through each script that uses or generates documents to fix everything up. So DSRL allows the approach of just transforming between the old and the new as is more convenient.
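
As a rough illustration of this maintenance case, the following Python sketch renames one attribute and one attribute value between an old and a new vocabulary. Again, this is not DSRL syntax: the attribute names "shade" and "style", the two tables and the migrate() function are hypothetical, picked to mirror the 'grey' to 'gray' example and the too-general attribute name.

```python
import xml.etree.ElementTree as ET

# Hypothetical tables (assumptions for illustration, not DSRL syntax).
# An attribute whose name was too general gets a more specific name.
ATTR_RENAMES = {"style": "line-style"}
# Value renames are keyed by (old attribute name, old value).
VALUE_RENAMES = {("shade", "grey"): "gray"}


def migrate(xml_text, attr_renames, value_renames):
    """Return xml_text with old attribute names and values replaced by new ones."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        updated = {}
        for name, value in elem.attrib.items():
            # Look up the value rename using the OLD attribute name.
            value = value_renames.get((name, value), value)
            updated[attr_renames.get(name, name)] = value
        elem.attrib.clear()
        elem.attrib.update(updated)
    return ET.tostring(root, encoding="unicode")


old_doc = '<box shade="grey" style="dashed"/>'
new_doc = migrate(old_doc, ATTR_RENAMES, VALUE_RENAMES)
new_attrs = ET.fromstring(new_doc).attrib
```

Keeping the renames in tables rather than in each script is what lets the old and new forms coexist: scripts that have not been updated can keep producing the old form and have it converted at the boundary.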

I am always hopeful, but I don't predict we will see a large uptake in DSRL soon. It is the kind of thing where it takes, for example, e-government policy makers to realize that Native Language Markup is an important issue, and that consequently it needs to be built in. I think the ideographic script countries are most likely to take a lead here, but any of the countries who see a need to reduce their economic dependency on English fluency (while probably at the same time encouraging English-language fluency) could see it as a win. Nations who want to encourage regional standards, for example because they don't feel they have the skills or entrances to participate effectively in vendor consortia or international standards bodies, might find it attractive.

It will be interesting to see whether DSRL eventually gets much buy-in. It is one of those standards which needs to be in place ready for use (it would have been great to add to OOXML, and perhaps still would be) if the call comes, but is not guaranteed of adoption.

I like it because it follows my strategy of neutralization, which is that many problems exist merely because the infrastructure forces an unnecessary choice at a point: in this case the choice between a supra-national standard schema and Native Language Markup.

I believe several of your comments were raised (and, indeed, some were raised on the 2007 draft review). I will ask WG1 to look at whether comments fell through the cracks. SNAFU perhaps.

ISO rules on keeping the last draft private (to prevent implementation confusion...a rather paper-dependent holdover isn't it?...the move to have standards development handled by consortia with JTC1 moving to an independent review mode of operation dilutes the necessity of agility to some extent I suppose) mean that the FDIS is protected on the committee site, but it should be published in a few weeks. (You could email Martin if you need a quick copy, which is kind of allowed before the publication, and the published version should be open and free as part of DSDL.)

I tend to be OK with mild underspecification for initial versions of blue-sky standards, because it gives implementers wiggle room and it lets experience dictate what information is really needed and which features have legs: completeness often is the excuse for cruft after all. A man's got to know his limitations! Tim Bray is very keen that standards should only follow implementation, but I think keeping blue-sky standards as minimal as necessary to get to the stage where the standard can be led by experience not expertise is not unreasonable.