Recently, I proposed a TAG issue under the (unfortunate) name "PSVI
Considered Harmful". This kicked off a lot of discussion on the www-tag
mailing list, and the TAG has subsequently accepted the issue. This
has led to further discussion over in the xml-schema-wg, and I think
we've made enough progress to try to lay out what the architectural
issues are, along with some proposed solutions. My understanding of the
issues has been helped immensely by thoughtful contributions from (among
others) Noah Mendelsohn, Dave Ezell, and Mary Holstege, which I don't cite
here because some of them are in member-only space; however, I do not
claim that any of these contributors agree with this note.
1. XML Schema Validation generates information
Validation takes as input an XML instance and one or more XML Schema
instances, and potentially produces a lot of output. This includes:
- whether the instance is valid
- whether each element and attribute is valid
- details about the validation process, e.g. this attribute is valid
because its type is a union, one of the member types is integer, and the
value qualifies as an integer
- schema types of elements and attributes
- elements and attributes that are defaulted, i.e. not actually present
in the instance
Currently, all of this stuff is lumped together and placed in the "PSVI"
(the Post-Schema-Validation Infoset).
2. Use of PSVI contents
The XML Schema WG is currently engaged in investigating which pieces of
the PSVI are of potential interest and assembling use cases. Presumably
if it emerges that there is wide interest in access to particular PSVI
items, someone will have to take on the work of publishing an API and
serialization for them.
3. The PSVI contents are heterogeneous
The PSVI's contents have the sole defining characteristic that they are
generated as a result of schema validation. It's hard to think of any
other meaningful shared characteristic. The way we talk about types is
different from the way we talk about validation outcomes, which in turn
is different from the way we talk about the elements and attributes
added by defaulting.
4. Do the PSVI contents belong in the infoset?
Clearly the element and attribute items produced by defaulting are
(logically) just like other elements and attributes, and the infoset is
pre-cooked to accept them, so it seems like the infoset is a good place
to put them.
On the other hand, it's not obvious that the infoset's framework of
"items" and "properties" is a good way to describe things like
validation outcomes and type information. Let's assume we decide that
some of this stuff needs to be made available to other parties - is it a
useful or necessary step to go through the infoset to get there? I'm
not being rhetorical here; this is just not obvious to me.
5. The PSVI type information is itself heterogeneous
This falls naturally out of the richness of the XML Schema type system.
As someone (I think Noah) pointed out, it's easy to imagine sharing
the semantics of built-in primitive types across a broad spectrum of
specifications and applications (e.g., "this is an integer"). It's
plausible but not as obvious to think about sharing restrictions of
primitive types (e.g. "this is an integer greater than 3"). The notion
of sharing complex and derived types starts to get pretty hairy pretty
fast - anything that did this would have to have the semantics of XML
Schema wired in pretty deeply.
I'll be interested to see if there are use cases for sharing the
semantics of complex types outside of the validation application.
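
To make the "integer greater than 3" case concrete, here is a minimal C
sketch of what checking such a restriction takes outside a schema
processor. The function names are made up for illustration, and strtol
only approximates the lexical rules of XML Schema's integer type (it is
limited to the range of a long and tolerates leading whitespace).

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Parse text as a base-10 integer.
 * Returns 1 and stores the value in *out on success; returns 0 if the
 * text is empty, has trailing junk, or does not fit in a long. */
int parse_integer(const char *text, long *out)
{
    char *end;
    long value;

    assert(text != NULL && out != NULL);
    errno = 0;
    value = strtol(text, &end, 10);
    if (end == text || *end != '\0' || errno == ERANGE)
        return 0;
    *out = value;
    return 1;
}

/* The restriction "an integer greater than bound": first the primitive
 * type check, then the extra facet on top of it. */
int is_integer_greater_than(const char *text, long bound)
{
    long value;
    return parse_integer(text, &value) && value > bound;
}
```

The primitive check is a few lines that any application could share;
the restriction is one more comparison. Nothing comparable is possible
for complex and derived types without the schema itself.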
6. Type naming is tricky
This falls naturally out of the previous point. XML Schema (correct me
if I'm wrong) allows its types to be identified by qname. But the
semantics that come with saying "this is an xs:int" are obviously
wildly different from some complex type that's been through several
levels of derivation. In particular, the former are widely shareable
without knowledge of XML Schema semantics.
7. Type information is useful outside of validation applications
There is an existence proof for this: XML Query. Queries can make use
both of type names for matching elements and attributes, and of
particular type semantics (ordering and equality) for matching character
data. It's not hard to imagine lots of other use-cases.
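
As a small illustration of why type semantics matter when matching
character data, the C sketch below compares the same two lexical values
first as plain strings and then as integers. The function names are
hypothetical, and the integer comparison assumes well-formed input that
fits in a long.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Compare two values as untyped character data (byte order). */
int compare_as_string(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b);
}

/* Compare the same lexical values under integer semantics.
 * Assumption: both strings are valid integers; this sketch does no
 * error checking. Returns a negative, zero, or positive value. */
int compare_as_integer(const char *a, const char *b)
{
    long x, y;

    assert(a != NULL && b != NULL);
    x = strtol(a, NULL, 10);
    y = strtol(b, NULL, 10);
    return (x > y) - (x < y);
}
```

Comparing "10" with "9" gives opposite answers under the two functions,
and "007" equals "7" only under integer semantics. A query processor has
to know which comparison applies, and that knowledge is the type.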
8. Why not standardize on XML Schema's primitive data types?
XQuery (and I suspect many other facilities) are going to find it
essential to hard-wire in the semantics of primitive types (numbers,
dates, URIs). W3C has invested a huge amount of effort in building a
primitive-type system as a part of XML Schema. I personally think it's
too big and some gHorribleKludge types got in, but they're done and
stable and I don't see any reason why they shouldn't serve as a basis
for XQuery and anyone else who needs this kind of thing.
Question: are the specs modularized well enough that it's easy to pull
in the basic types by normative reference?
Proposal: let's issue a TAG finding saying that if you need primitive
data types, use XML Schema's, don't invent your own.
Question: For things that are this widely shareable, I think it's
architecturally essential to have actual URIs, not just qnames; is this
hard to achieve?
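
One possible shape for such URIs, sketched in C: take the namespace name
and append the type's local name as a fragment identifier. The namespace
name below is XML Schema's; the helper names are assumptions for
illustration, and whether this construction works beyond the built-in
types is exactly the open question.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Namespace name of XML Schema. */
#define XSD_NS "http://www.w3.org/2001/XMLSchema"

/* Build "<namespace>#<local name>" in buf.
 * Returns 0 on success, -1 if buf is too small to hold the result. */
int type_uri(const char *ns, const char *local, char *buf, size_t buflen)
{
    int n;

    assert(ns != NULL && local != NULL && buf != NULL);
    n = snprintf(buf, buflen, "%s#%s", ns, local);
    return (n >= 0 && (size_t)n < buflen) ? 0 : -1;
}

/* With URIs as names, type identity is plain string comparison;
 * no prefix bindings need to be in scope. */
int same_type(const char *uri_a, const char *uri_b)
{
    assert(uri_a != NULL && uri_b != NULL);
    return strcmp(uri_a, uri_b) == 0;
}
```

The point of the second function is that a URI can be compared without
any context, whereas a qname only means something once its prefix has
been resolved against the namespace declarations in scope.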
9. Type names and type semantics exist independent of schemas
Let's consider an example: a system where large numbers of business
transactions are encoded in XML, interchanged, and stored in a
database, and need to be accessed by XQuery. In this particular case,
schema validation is not done at run-time; all parties do
application-specific validation and trust each other to encode numeric
types and dates correctly. Much of the markup is generated like this:
    fprintf(xmlStream, "<detail unitPrice='%.2f' quantity='%3d'/>",
            unitP, quant);
The XQuery processor that's accessing this database will know from some
sort of data dictionary implementation that a <detail> element has
unitPrice= and quantity= attributes, and the primitive data types of
each attribute. While it uses primitive type names from XML Schema, it
is possible in principle and plausible in practice that no schema has
ever been written, let alone applied.
Note that in this case the type information is found neither in the
schema (because there isn't one) nor in the instance. This doesn't in
the slightest get in the way of, for example, XQuery semantics.
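
A data dictionary of the kind described here could be as small as the
following C sketch. The struct, the table, and the lookup function are
all hypothetical, and the choice of "decimal" and "integer" for the two
attributes is my assumption; the only thing borrowed from XML Schema is
the type names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One dictionary entry: the primitive type of an attribute on an element. */
struct attr_type {
    const char *element;
    const char *attribute;
    const char *type_name;  /* local name of an XML Schema built-in type */
};

/* Hypothetical dictionary for the <detail> example; no schema involved. */
static const struct attr_type dictionary[] = {
    { "detail", "unitPrice", "decimal" },
    { "detail", "quantity",  "integer" },
};

/* Return the type name for element/@attribute, or NULL if unknown. */
const char *lookup_type(const char *element, const char *attribute)
{
    size_t i;

    assert(element != NULL && attribute != NULL);
    for (i = 0; i < sizeof dictionary / sizeof dictionary[0]; i++) {
        if (strcmp(dictionary[i].element, element) == 0 &&
            strcmp(dictionary[i].attribute, attribute) == 0)
            return dictionary[i].type_name;
    }
    return NULL;
}
```

A query processor could consult lookup_type("detail", "quantity") and
apply integer semantics to the attribute value, with neither a schema
nor any annotation in the instance.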
10. Coupling specs to PSVI as it exists today is architecturally unsound
The PSVI is a grab-bag of stuff that's defined as being the outcome of a
particular operation; any attempt to pretend that all its contents can
be talked about, addressed, or used in a uniform way is just misguided.
Also it needs to be crystal-clear that you can have types without
having a schema or doing validation.
Proposal: Let's issue a TAG finding that types ought to be addressable
by name, and work with some WG to write architectural principles for
naming them (qname or URI or both); getting this right is nontrivial,
see item 6 above. XQuery seems like a good example of a sensible way to
use these names.
==========
Conclusion
Where I'd like to end up is:
- we have a list of well-known base types with well-known names that
everybody uses consistently
- we have a generic naming system for types including complex types
- we have a well-defined API and serialization for those parts of the
output of the schema validation process that are used in non-validation
applications
- we have multiple different schema facilities aimed at different kinds
of applications, which however exhibit consistency in (a) their use of
base primitive types, (b) the way they name types, and (c) the way they
expose their output to the world.
This seems achievable and not all that ambitious.
-Tim