Not My Type: Sizing Up W3C XML Schema Primitives

Continuing our occasional series of opinion pieces from members of the
XML community, Amy Lewis takes a hard look at W3C XML Schema datatypes.

Since the application of XML to data representation first gained public
visibility, there has been a movement to enhance its type system beyond that
originally provided by DTDs. Several attempts were made (SOX, XML Data and XML
Data Reduced, Datatypes for DTDs, and others) before the W3C handed the
problem to the XML Schema Working Group.

What is the goal of data type definitions for XML? For one thing, they
establish "strong typing" in XML in a fashion that corresponds with strong
typing in programming languages. Various commercial interests have been vocal
supporters of strong typing in XML because they see typed generic data
representation as their best hope for interoperability and increased
automation. With typing in schemas extended into the textual content of
simple types, and not just the structural content of complex types, businesses
can enforce contracts for data exchange. In other words, strong typing
enables electronic commerce.

To phrase it a little differently, the data types defined in DTDs were
considered inadequate to support the requirements of electronic commerce or,
more generally, of commercially reliable electronic information exchange.

The publication of W3C XML Schema (or WXS), in which one half of the
specification was devoted to the definition of a type library (part two),
seemed to resolve the problem. Certainly, with forty-four built-in data
types, nineteen of them primitive, it seemed at first glance to cover the
field. The increasing visibility of WXS and the efforts to layer additional
specifications on top of it -- XML Query, the PSVI, data types in XPath 2.0,
typing in web services -- have begun to raise serious questions about WXS part
two, even among proponents of strong types, including the author of this
article.

There are two fundamental problems with WXS datatyping. The first is its
design: it's not a type system -- there is no system -- and not even a type
collection. Rather, it's a collection of collections of types with no coherent
or consistent set of interrelations. The second problem is a single sentence
in the specification: "Primitive datatypes can only be added by revisions to
this specification". This sentence exists because of the design problem;
lacking a concept for what a primitive data type is, the only way to define
new types is by appeal to authority. The data type library is wholly
inextensible, internally inconsistent, and at once bloated and incomplete for
most application domains.

Not a type system

The data type library defined in WXS part two is not a type system. It's
not possible to examine the built-in types and determine the guiding
principles which dictated which types were to be defined and which were to be
defined as primitives.

Consider a contrasting example. The type system used by C and related
languages is clearly based on bit patterns and register sizes. The bit
pattern 10011001 fits into registers of a certain size, but has different
meaning based on its type: character, unsigned or signed byte. The type
assigned to a bit pattern determines certain behaviors. If the above pattern
is X, and Y is the bit pattern 00010001, then X > Y if both are unsigned
bytes, and X < Y if both are signed bytes. The same bit patterns may
represent characters (or strings of characters), integers of various sizes, and
floating point numbers (again with various constraints), but the fundamental
limitation is the number of bits that can be stuffed into a register. By
interpreting the identical bits in different fashions, the languages achieve
different effects.
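
The comparison above can be checked directly. The following is a minimal
sketch in Python rather than C, assuming that signed bytes use two's
complement (the usual case, though not stated above); the helper names
as_unsigned and as_signed are illustrative only.

```python
# The two 8-bit patterns from the text.
X_BITS = "10011001"
Y_BITS = "00010001"


def as_unsigned(bits: str) -> int:
    """Read the bit pattern as an unsigned byte."""
    return int(bits, 2)


def as_signed(bits: str) -> int:
    """Read the same bit pattern as a two's-complement signed byte."""
    value = int(bits, 2)
    # A set high bit means the value is negative in two's complement.
    return value - (1 << len(bits)) if bits[0] == "1" else value


print(as_unsigned(X_BITS), as_unsigned(Y_BITS))  # 153 17
print(as_signed(X_BITS), as_signed(Y_BITS))      # -103 17
```

The bits never change; only the type assigned to them does, and that alone
flips the result of the comparison.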

One mandate for WXS was that it should reproduce the limited type system of
the DTD plus the namespace extensions. It stands to reason that, given the
definition of QName and NCName in the namespaces
specification and Name in the original XML 1.0 specification, these
types would be found in some rational relationship to one another. In the WXS
definition, NCName is a subtype of Name, which is a subtype
of token, which is a subtype of normalizedString, which is a
subtype of string, which is a primitive type. However,
QName is also a primitive type, implying that it is not a string, not
a normalizedString, not a token, and not a name,
even though it is composed lexically of NCName + : + NCName (with the prefix
and colon optional).
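
The relationships just described are small enough to write down as a table.
This is a rough sketch, covering only the types named in this paragraph; the
BASE_TYPE mapping and both function names are hypothetical and belong to no
real schema API.

```python
# Base-type links as described in the text; None marks a primitive type.
BASE_TYPE = {
    "string": None,
    "normalizedString": "string",
    "token": "normalizedString",
    "Name": "token",
    "NCName": "Name",
    "QName": None,
}


def derivation_chain(type_name):
    """Return the type followed by each of its base types, most derived first."""
    chain = [type_name]
    while BASE_TYPE[chain[-1]] is not None:
        chain.append(BASE_TYPE[chain[-1]])
    return chain


def is_derived_from(type_name, ancestor):
    """True if ancestor appears strictly above type_name in its chain."""
    return ancestor in derivation_chain(type_name)[1:]


print(derivation_chain("NCName"))
print(derivation_chain("QName"))
print(is_derived_from("QName", "string"))
```

NCName climbs through four base types to reach string. QName, whose lexical
form is built from NCNames, has a chain of length one.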

WXS also represents numbers of various sorts. Given the requirement to
support decimal, integer, float, and double, which should be considered
primitive types, and which derived? What criterion should be used for
derivation? Your answer should allow for the further derivation of various
bounded-range integers, but needn't worry about number systems solely of
interest to fusty ivory-tower academics. Data typing isn't particularly
useful in science, of course.

Nine times too many

Why is anyURI a primitive type? Why are there nine separate and
unrelated primitive types all concerned with measurement of time? Even though
early drafts of WXS included three time instant measuring types (dateTime,
date, and time, which are not, despite lexical and conceptual
overlap, related to one another by derivation in WXS), in the last stages of
specification drafting one or more interested constituencies raised such a
fuss that five more time instant measuring types were added. Despite lexical
and conceptual overlap, all five were made primitive types, unrelated to one
another by derivation. Clearly, the committee was too exhausted to fight
about it any more, so gHorribleKludge (gYearMonth, gYear, gMonthDay,
gMonth, gDay...the "g" stands for "Gregorian," not "good") made it into
the specification.

At least three constituencies are easily identifiable with type
subcollections in WXS: the original XML/DTD collection (rooted at
string, and one of two derivation trees, plus unrelated primitives);
the strongly typed programming language collection (rooted at decimal, and the
other derivation tree, plus unrelated primitives); and the database collection
(mostly available in the strongly typed tree, plus the time instant
primitives, and assorted others). Why are the chosen primitives primitive?
Why aren't base64Binary and hexBinary related? Why aren't
float and double related to each other or to the rest of the numbers?
Certainly if derivation in the integer tree can proceed based on register size
(which it does), then one ought to be able to derive float from double. Isn't
anyURI a token? No? normalizedString? No? Not even a
string?

No? Really, all these date and time thingies don't have any relation to
one another at all? No. There's no method to this madness. There is no way to
guess whether a particular built-in type will be declared primitive or derived
from another type. Nor is there any apparent value to derivation of built-in
types, since validity according to the least-derived type does not guarantee
validity according to the most-derived type.
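
That last observation is easy to see in the integer tree mentioned earlier.
Here is a minimal sketch, assuming the 64, 32, 16, and 8-bit signed ranges
that WXS assigns to long, int, short, and byte, and a simplified lexical check
(an optional sign followed by ASCII digits, with no whitespace handling). The
INTEGER_TREE table and the valid_types function are illustrative names, not
part of any WXS implementation.

```python
import re

# Part of the WXS integer derivation tree, least derived first.
# Each entry is (type name, minimum, maximum); integer itself is unbounded.
INTEGER_TREE = [
    ("integer", None, None),
    ("long", -2**63, 2**63 - 1),
    ("int", -2**31, 2**31 - 1),
    ("short", -2**15, 2**15 - 1),
    ("byte", -2**7, 2**7 - 1),
]

# Simplified lexical form: optional sign, then one or more ASCII digits.
_INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")


def valid_types(text):
    """List the types in INTEGER_TREE for which text is a valid value."""
    if not _INTEGER_LEXICAL.fullmatch(text):
        return []
    value = int(text)
    return [
        name
        for name, low, high in INTEGER_TREE
        if low is None or low <= value <= high
    ]


print(valid_types("300"))  # ['integer', 'long', 'int', 'short']
print(valid_types("12"))   # ['integer', 'long', 'int', 'short', 'byte']
```

The value 300 is a perfectly good integer and a perfectly good short, but it
is not a byte. Knowing that a value is valid against a base type says nothing
about the types derived from it.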