Colleagues:
With apologies for the delay, I transmit to you herewith the comments
of the XML Schema Working Group on the various RDF documents published
in Last Call recently. We congratulate you on the progress of your
work and hope our comments are useful to you. An HTML version of our
comments may be found at
http://www.w3.org/XML/Group/2003/03/xml-schema-rdf-notes.html
and I append an ASCII-only version for those who find plain text more
convenient.
-C. M. Sperberg-McQueen
Co-chair, W3C XML Schema Working Group
...........................................................
W3C XML Schema Working Group
Comments on RDF documents
ed. Charles Campbell, C. M. Sperberg-McQueen, Henry S. Thompson
10 March 2003
_________________________________________________________________
* 1. [1]Notes on RDF Primer
+ 1.1. [2]Design question, complexity (substantive)
+ 1.2. [3]Whitespace handling (schema-related)
* 2. [4]Notes on RDF Concepts and Abstract Syntax
+ 2.1. [5]Mapping from lexical forms to values (schema-related,
terminological)
+ 2.2. [6]Values without lexical forms (schema-related,
important)
+ 2.3. [7]Lexical forms, strings, and character sequences
(schema-related, editorial)
+ 2.4. [8]Strings for natural-language data (substantive)
+ 2.5. [9]Typos and minor editorial notes
* 3. [10]Notes on RDF Semantics
+ 3.1. [11]The "meaning" of literals (editorial)
+ 3.2. [12]Types as lexical mappings (schema-related)
+ 3.3. [13]Miscellaneous editorial notes
* 4. [14]Notes on RDF/XML Syntax Specification (Revised)
+ 4.1. [15]Manifest typing in the instance (policy)
+ 4.2. [16]QNames (Editorial, but important)
+ 4.3. [17]Miscellaneous editorial notes
+ 4.4. [18]Normative specification of XML grammar (policy,
substantive)
+ 4.5. [19]On the relation between RDF and off-the-shelf XML
tools (policy, substantive)
_________________________________________________________________
NOTE:
[These notes have been considered and approved by the W3C XML Schema
Working Group, and are transmitted to the RDF Core Working Group as
comments on the last-call drafts of various RDF-related documents.]
$Id: xml-schema-rdf-notes.html,v 1.11 2003/03/10 21:31:34 cmsmcq Exp $
The XML Schema Working Group congratulates the RDF Core Working Group
on progressing its several documents to Last Call; we apologize for
the late submission of these comments, and hope that they prove
helpful.
Our comments include some which bear directly on the use of XML
Schema's simple types by RDF, to which we believe you wished us to
give particular attention. In the text which follows, these are
labeled "schema-related". Some other comments, in contrast, relate to
important and difficult technical and policy questions relating to
language design and tool usage; these are labeled "policy". We hope
that you will give these comments very serious consideration, but we
do not pretend to any special standing in raising them, other than as
representative members of the XML community at large. Finally, there
are some other questions which are not directly related to XML Schema
or to XML in general, and for which we therefore pretend to no
particular expertise or standing, but which we happened to notice and
which we call to your attention, as any technically minded reader
might do, in the hopes that doing so may be useful to you; these are
labeled "substantive" or "editorial" as the case might be.
1. Notes on RDF Primer
RDF Primer, section 2.4 Typed literals
[20]http://www.w3.org/TR/rdf-primer/#typedliterals
[20] http://www.w3.org/TR/rdf-primer/#typedliterals
1.1. Design question, complexity (substantive)
The introduction of pairs consisting of a lexical form and a type (or,
strictly speaking, a lexical form and a type label) seems at first
glance to complicate the RDF model somewhat. We have had the
impression that in other parts of RDF, typing is handled by adding
further arcs and nodes. If the type of a resource is identified by
having an arc labeled rdf:type from it to (the URI of) its (RDF) type,
and if the type of an arc is similarly identified by an arc, then
surely a reason ought to be given for shifting to a different method
for typing literal strings. It seems like a dramatic shift in the
infrastructure of RDF, from "everything is a node, an arc, or a
literal value" to "everything is a node, an arc, or a typed literal
value". Perhaps not quite so dramatic, after all. But the question of
design consistency remains: why not "everything is a typed node, a
typed arc, or a typed literal"?
1.2. Whitespace handling (schema-related)
Some members of the XML Schema WG have expressed concern that XML
Schema's rules for whitespace handling may interfere with expected
behavior in other contexts. This may be the appropriate place to bring
this question up.
In brief, XML Schema's simple types each define a whitespace facet,
which governs the kind of whitespace pre-processing done by an XML
Schema processor before the lexical form is checked for type validity.
Since the point of whitespace normalization is to simplify subsequent
processing, the lexical spaces of XML Schema's simple types are (like
those in many programming languages) defined without reference to the
preceding whitespace normalization. Integers, for example, are
represented by sequences of decimal digits; sequences containing
blanks are not legal lexical forms for integers. Indeed, strictly
speaking it is only after the whitespace pre-processing is done that
the XML Schema processor can be said to be working with a lexical form
at all.
For example, the integer type has a value of collapse for the
whitespace facet, which means leading and trailing whitespace is
stripped, and internal whitespace sequences are reduced to a single
blank (x20) character. In an XML document in which the element
exterms:age is defined as having type xs:integer, the following
instances of exterms:age will all be type-valid:
<exterms:age>27</exterms:age>
<exterms:age>
27
</exterms:age>
<exterms:age> 27 </exterms:age>
<exterms:age> 2<!--* ha, ha, fooled your full-text indexer!
*-->7 </exterms:age>
The input information set, in each case, contains a character
information item for "2" followed by a character information item for
"7", with character information items for whitespace characters, and a
comment information item, present in some of the examples. In all
cases, the lexical form proper is the character sequence "27" (i.e.
the sequence of characters after white space handling, and ignoring
comments, processing instructions, entity boundaries, and other
distractions). This is a legal lexical form for an integer, so all the
examples are type valid.
Some members of the XML Schema WG have worried that it may not be
obvious that the whitespace processing is not part of the process of
checking lexical forms for type validity, but part of the process of
extracting the lexical forms from the XML information set presented to
the processor. If an RDF document contains
<exterms:age> 27 </exterms:age>
and a processor hands the contents of the element to a generic
type-checker for XML Schema's simple types, saying in effect "this
purports to be the lexical form of an integer; is that OK?", that type
checker will be required (if it conforms to the XML Schema spec's
definition of the simple types) to say "no, the character sequence
` 27 ' is not a legal lexical form for an integer."
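The distinction can be made concrete with a minimal sketch. The Python
below is illustrative only: the function names collapse_whitespace and
is_integer_lexical_form are hypothetical, and the regular expression is
an assumed paraphrase of the integer lexical space (an optional sign
followed by decimal digits), not a call into any XML Schema library.

```python
import re

# XML whitespace characters: space, tab, line feed, carriage return.
_XML_WHITESPACE_RUN = re.compile(r"[\x20\x09\x0A\x0D]+")

# Assumed paraphrase of the xsd:integer lexical space:
# an optional sign followed by one or more decimal digits.
_INTEGER_LEXICAL_FORM = re.compile(r"[+-]?[0-9]+")


def collapse_whitespace(raw: str) -> str:
    """Apply 'collapse' normalization: squeeze whitespace runs, then trim."""
    return _XML_WHITESPACE_RUN.sub(" ", raw).strip(" ")


def is_integer_lexical_form(candidate: str) -> bool:
    """Check a character sequence against the integer lexical space only."""
    return _INTEGER_LEXICAL_FORM.fullmatch(candidate) is not None


raw_content = " 27 "
# Unnormalized element content, handed directly to the checker.
print(is_integer_lexical_form(raw_content))
# The same content after whitespace pre-processing.
print(is_integer_lexical_form(collapse_whitespace(raw_content)))
```

In this sketch the first check fails and the second succeeds, which is
the behavior described above: the checker itself never tolerates the
surrounding blanks, so some earlier step must remove them.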
It's not clear whether RDF, being type-system neutral, can directly
address this concern (e.g. by specifying that an RDF processor should
do the appropriate whitespace pre-processing, or by warning users that
they should not include vagrant whitespace in typed literals), or
whether it suffices for developers of RDF software with built-in
support for XML Schema's simple types to deal with it, e.g. by
performing the whitespace pre-processing themselves before handing the
resulting lexical form to a type checker.
As noted, some members of our WG feel that you need to be alerted to
this as a possible source of confusion and unexpected results. Other
members of the WG feel that it verges on disrespect to assume that you
need instruction on this point. We compromised by agreeing to point
out the issue to you, and to leave you to draw your own conclusions.
2. Notes on RDF Concepts and Abstract Syntax
2.1. Mapping from lexical forms to values (schema-related, terminological)
In [21]http://www.w3.org/TR/rdf-concepts/#section-Datatypes:
[21] http://www.w3.org/TR/rdf-concepts/#section-Datatypes
A datatype mapping is a set of pairs whose first element belongs to
the lexical space of the datatype, and the second element belongs
to the value space of the datatype:
We agree that it is useful to define a term to denote such mappings;
in the interests of inter-specification consistency, we wonder whether
you would be willing to consider using the term lexical mapping, which
we are introducing in our forthcoming draft of XML Schema 1.1. The
term datatype mapping seems unlikely to be usable in the XML Schema
specification, where it would suggest to some readers a mapping from
one datatype to another, rather than as here a mapping from lexical
space to value space. (XML Schema 1.0 got by without a term for this
concept.)
2.2. Values without lexical forms (schema-related, important)
In [22]http://www.w3.org/TR/rdf-concepts/#section-Datatypes:
[22] http://www.w3.org/TR/rdf-concepts/#section-Datatypes
* Each member of the value space may be paired with any number
(including zero) of members of the lexical space (lexical
representations for that value).
The provision for values without corresponding lexical forms
contradicts an assumption to which the XML Schema spec appeals from
time to time. The lexical space of any simple datatype in XML Schema
is the domain of the type's lexical mapping; the value space is its
range. There are no meaningless lexical forms in the lexical space of
the type, nor are there ineffable values in the value space. By
eliminating values from the value space (e.g. by setting minimal and
maximal values), the type definer may indirectly also eliminate
lexical forms from the lexical space; conversely, by eliminating some
items from the lexical space (e.g. by setting a pattern), the type
definer may eliminate items from the value space.
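A toy model may help make this interdependence concrete. The Python
sketch below treats a lexical mapping as a finite dictionary from
lexical forms to values; the sample entries and the helper names
restrict_values and restrict_forms are hypothetical, and a real simple
type would of course have an infinite mapping.

```python
import re

# Toy, finite sample of a lexical mapping for an integer-like type.
# Keys are lexical forms; values are the values they denote.
LEXICAL_MAPPING = {"1": 1, "01": 1, "2": 2, "02": 2, "300": 300}


def lexical_space(mapping):
    """The lexical space is the set of forms the mapping is defined on."""
    return set(mapping)


def value_space(mapping):
    """The value space is the set of values reached by at least one form."""
    return set(mapping.values())


def restrict_values(mapping, keep):
    """Drop values (e.g. by a maximum); their lexical forms disappear too."""
    return {form: value for form, value in mapping.items() if keep(value)}


def restrict_forms(mapping, pattern):
    """Drop lexical forms (e.g. by a pattern); unreachable values go too."""
    regex = re.compile(pattern)
    return {form: value for form, value in mapping.items()
            if regex.fullmatch(form)}


at_most_two = restrict_values(LEXICAL_MAPPING, lambda value: value <= 2)
zero_prefixed = restrict_forms(LEXICAL_MAPPING, r"0[0-9]+")

print(sorted(lexical_space(at_most_two)))  # "300" is no longer a lexical form
print(sorted(value_space(zero_prefixed)))  # 300 is no longer a value
```

In this model every value has at least one lexical form by
construction, which is the assumption on which the XML Schema spec
relies.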
Are there crucial aspects of RDF which will break if the list item
quoted above is changed to read "paired with one or more members of
the lexical space"?
2.3. Lexical forms, strings, and character sequences (schema-related,
editorial)
In [23]http://www.w3.org/TR/rdf-concepts/#section-Datatypes:
[23] http://www.w3.org/TR/rdf-concepts/#section-Datatypes
With one exception, the datatypes used in RDF have a lexical space
consisting of a set of strings.
Since "string" is used as the local name for a particular simple type
in the XML Schema namespace, we believe it will be less confusing for
users, in the long run, if the lexical representations of
simple-datatype values are described not as "strings" but as
"character sequences".
This comment also applies to other uses of the term string to denote
the members of a lexical space.
2.4. Strings for natural-language data (substantive)
In [24]http://www.w3.org/TR/rdf-concepts/#section-Datatypes:
[24] http://www.w3.org/TR/rdf-concepts/#section-Datatypes
* A plain literal is a string combined with an optional language
identifier. This should be used for plain text in a natural
language. As recommended in the RDF formal semantics
[RDF-SEMANTICS], these plain literals are self-denoting.
We do not believe that simple strings are likely to be adequate for
the representation of arbitrary natural-language text. Even in
English, natural-language utterances (such as this document) may need
some degree of inline markup for clarity and adequate presentation; in
natural-language utterances requiring bidirectional display or ruby,
the best authorities (including the W3C I18n Working Group) recommend
the use of markup within the natural-language utterance. We thus
suggest that you may wish to moderate this recommendation that
natural-language material be represented by literals.
This is not an area in which we claim particular technical expertise;
we merely call it to your attention in the hopes that doing so may be
useful to you.
2.5. Typos and minor editorial notes
In [25]http://www.w3.org/TR/rdf-concepts/#section-Literal-Value, for
"the datatype mapping is applied to the pair form by the lexical form
and the language identifier" read "the datatype mapping is applied to
the pair formed by the lexical form and the language identifier".
In the same section, for "Such a case, while in error, is not
syntacticly ill-formed " read "Such a case, while in error, is not
syntactically ill-formed" (et passim).
In section [26]http://www.w3.org/TR/rdf-concepts/#section-XMLLiteral,
for "root element tag" read "root element".
In the same section, for "XML element content" read "XML data" (the
term element content is used in some markup-related specs as a
complement of mixed content to denote the content of elements which
can contain other elements but cannot contain parsed character data).
[25] http://www.w3.org/TR/rdf-concepts/#section-Literal-Value
[26] http://www.w3.org/TR/rdf-concepts/#section-XMLLiteral
3. Notes on RDF Semantics
3.1. The "meaning" of literals (editorial)
The meaning of a literal is principally determined by its character
string: it either refers to the value mapped from the string by the
associated datatype, or if no datatype is provided then it refers
to the literal itself, which is either a unicode character string
or a pair of a string with a language tag.
Some members of the XML Schema WG are made nervous by the appeal to
the notion of "meaning" here. [N.B. our task force read this section
out of context, and were not aware of any foregoing elucidation. So
this comment may be out of place.] There is also some concern about
the apparent conflation here of the notions of meaning and reference.
We wonder whether this discussion would be weakened by replacing
references to meaning and reference by references to denotation; we
are inclined to think it would be an improvement, but recognize that
the RDF Core WG's views may differ.
3.2. Types as lexical mappings (schema-related)
A datatype is an entity characterized by a set of character strings
called lexical forms and a mapping from that set to a set of
values.
We have a couple of reservations concerning this characterization.
* Elsewhere (e.g. in Concepts and Abstract Syntax, section 3.3,
[27]http://www.w3.org/TR/rdf-concepts/#section-Datatypes), the RDF
specs say that there may be values in a value space which are not
in the range of the lexical mapping; we have suggested that if
possible those statements should be changed, but if they are
retained, then a datatype cannot be characterized solely by the
lexical space and the lexical mapping, because such ineffable
values appear in neither of these.
* The statement describes (with the exception of the problem just
noted) simple datatypes, but not the class of complex datatypes
which can be defined by XML Schema, nor all the types (or
type-like constructs) definable in various other schema languages
for XML.
[27] http://www.w3.org/TR/rdf-concepts/#section-Datatypes
3.3. Miscellaneous editorial notes
In [28]http://www.w3.org/TR/rdf-mt/#dtype_interp, for "which we will
refer to as XSD and use the Qname prefix xsd:" read "which we will
refer to as XSD and denote using the Qname prefix xsd" (or something
similar).
In [29]http://www.w3.org/TR/rdf-mt/#dtype_interp:
[28] http://www.w3.org/TR/rdf-mt/#dtype_interp
[29] http://www.w3.org/TR/rdf-mt/#dtype_interp
For example, XML Schema requires that the value spaces of
xsd:string and xsd:decimal to be disjoint ...
This sentence is not exactly wrong, but it seems slightly unusual to
use the verb require here, instead of define or something similar. We
suggest recasting this as "For example, XML Schema defines the value
spaces of xsd:string and xsd:decimal as disjoint ..." (Note, for the
record, that the value spaces of all the primitive simple datatypes of
XML Schema 1.0 are pairwise disjoint.)
In ,
any literal of the form "sss"@ttt^^ddd, where ddd is not
rdf:XMLLiteral, treated as identical to the same literal without
the language tag, "sss"@ddd
is "sss"@ddd a typo for "sss"^^ddd?
In [30]http://www.w3.org/TR/rdf-mt/#dtype_entail, for "it is valid to
add any number of leading zeros to any numeral and still be a correct
lexical form for xsd:integer", perhaps read "it is possible to add any
number of leading zeros to any lexical form for xsd:integer without it
ceasing to be a correct lexical form for xsd:integer".
[30] http://www.w3.org/TR/rdf-mt/#dtype_entail
4. Notes on RDF/XML Syntax Specification (Revised)
RDF/XML Syntax, [31]http://www.w3.org/TR/rdf-syntax-grammar/
[31] http://www.w3.org/TR/rdf-syntax-grammar/
4.1. Manifest typing in the instance (policy)
RDF allows Typed Literals to be given as the object node of arcs.
These consist of a literal string (with optional language) and a
datatype RDF URI Reference. This is handled ... with an additional
rdf:datatype="datatypeURI" attribute on the property element.
We believe there are probably good reasons for using an rdf:datatype
attribute, instead of re-using the existing xsi:type attribute which
has (when the type is defined in a schema defined by XML Schema 1.0)
the same semantics. In particular, rdf:datatype does not assume or
assert the existence of the type named as a type in a schema defined
by XML Schema, so it would be problematic to use xsi:type.
We do fear, however, that users are likely to find this
near-duplication of the meaning and function of xsi:type confusing. It
is not clear to us what, if anything, can or should be done to
minimize this danger.
4.2. QNames (Editorial, but important)
We were unable, on a first reading, to determine whether the default
namespace declaration, and thus unprefixed names, were or were not
allowed in order to encode 'RDF URI References'. Indeed the
introductory prose about QNames (2nd para of
[32]http://www.w3.org/TR/rdf-syntax-grammar/#section-Syntax-intro)
does not seem to connect up with the relevant (?) production in
[33]http://www.w3.org/TR/rdf-syntax-grammar/#section-Infoset-Grammar,
which we take to be
[34]http://www.w3.org/TR/rdf-syntax-grammar/#URI-reference.
This can and should be cleared up.
[32] http://www.w3.org/TR/rdf-syntax-grammar/#section-Syntax-intro
[33] http://www.w3.org/TR/rdf-syntax-grammar/#section-Infoset-Grammar
[34] http://www.w3.org/TR/rdf-syntax-grammar/#URI-reference
4.3. Miscellaneous editorial notes
In
[35]http://www.w3.org/TR/rdf-syntax-grammar/#section-Syntax-empty-property-elements,
the sentence
[35] http://www.w3.org/TR/rdf-syntax-grammar/#section-Syntax-empty-property-elements
When an arc in an RDF Graph points to an object node which has no
further arcs, which appears in RDF/XML as an empty node element
sequence such as the pair <rdf:Description rdf:about="...">
</rdf:Description>, this form can be shortened.
seems less clear than it might be. Different readers prove to have
different views on what is meant by "the pair <rdf:Description
rdf:about="..."> </rdf:Description>"; perhaps it can be replaced by
something like "the empty element <rdf:Description rdf:about="..."/>"
without loss of precision? Perhaps the sentence could read
When an arc in an RDF Graph points to an object node which has no
further arcs, which appears in RDF/XML as an empty node element
such as <rdf:Description rdf:about="..."/>, this form can be
shortened.
4.4. Normative specification of XML grammar (policy, substantive)
We note with admiration the excellent tutorial introduction to the
striped syntax in Section 2
[36]http://www.w3.org/TR/rdf-syntax-grammar/#section-Syntax. We are
less happy with the nature of the syntax, and with the approach taken
to its normative statement
[37]http://www.w3.org/TR/rdf-syntax-grammar/#section-Infoset-Grammar.
As regards the syntax itself, we would much prefer to have seen a move
to a single canonical syntax with much less variability. With respect,
the current design suggests that the value of XML has been
misunderstood. The range of alternative forms of expression provided
for in the current design makes it very difficult to use the broad
range of generic XML tools (e.g. syntax-directed editors, XSLT) which
could give so much benefit to RDF users. (More on this below.) At the
very least we would encourage you to specify a single canonical form,
probably strictly striped, which could be defined by an XML Schema or
DTD. We would be happy to work with you to develop a schema for such a
subset.
As regards the approach taken to defining the syntax, in our view,
layering of specs has very high value, and so defining an XML document
type by way of what is very nearly a character-level BNF is at best a
missed opportunity and at worst a serious mistake. It obscures the
important aspects of the document type behind a welter of irrelevant
detail about e.g. whitespace and start-tag/end-tag matching. It makes
it very difficult for the reader to understand what is and isn't
allowed -- what an RDF/XML document actually looks like.
Not only does this confuse levels and thus readers, it also runs the
risk of inadvertently defining an XML subset. It also appears, on a
strict reading, to rule out XML documents not derived from the parsing
of character streams as possible RDF/XML (so that it would be
illegitimate to regard a data structure created using a DOM interface,
for example, as RDF/XML).
The use of event-triggered data-model construction actions to specify
the relationship between XML representation and corresponding data
objects is innovative and compelling, but surely it would be
straightforward to associate these events with a pre-order traversal
of an infoset independently constrained by a DTD, XML Schema schema or
other appropriate definition of the canonical document type. If
continued support for alternative forms is considered essential, then
a two-step approach where the semantics of any non-canonical form is
defined in terms of a canonical form to which it corresponds would
still be far simpler than the current approach.
[36] http://www.w3.org/TR/rdf-syntax-grammar/#section-Syntax
[37] http://www.w3.org/TR/rdf-syntax-grammar/#section-Infoset-Grammar
4.5. On the relation between RDF and off-the-shelf XML tools (policy,
substantive)
With some diffidence, we conclude by raising what may be a sensitive
issue.
It does not seem to us that the XML serialization of RDF shows RDF to
advantage. At the level of the underlying graph model, RDF information
has a simple and regular structure, which appears in the XML
serialization to be anything but simple and so irregular as to bring
the words "capricious" and "arbitrary" to the lips of unprejudiced
observers. Tastes in markup style differ, but we believe that the root
of the problem is the high degree of variability with which the same
underlying graph structures may be serialized, according to the rules
given in this document.
Owing in part to the variability itself, and in part to the specific
forms taken by that variability, it is not feasible to write an XML
Schema schema, or (if the comments in Appendix A.1 are accurate) a
Relax NG schema, or an XML 1.0 DTD, which defines the set of correct
serializations of correct RDF graphs. It is not convenient to run XSLT
processes over arbitrary RDF serializations, nor to query or process
arbitrary RDF data using XQuery. Arbitrary RDF data is similarly
inconvenient for other standard XML tools to process.
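To illustrate the difficulty, the following Python sketch runs one
fixed path query over two RDF/XML documents that encode the same single
statement, once with a property element and once with a property
attribute. The example resource URI, the choice of the Dublin Core
title property, and the function name titles_by_path are assumptions
made for illustration; they do not come from the documents under
review.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical statement: <http://example.org/doc> dc:title "Example".
AS_PROPERTY_ELEMENT = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/doc">
    <dc:title>Example</dc:title>
  </rdf:Description>
</rdf:RDF>
"""

# The same statement, written with a property attribute instead.
AS_PROPERTY_ATTRIBUTE = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/doc" dc:title="Example"/>
</rdf:RDF>
"""


def titles_by_path(rdf_xml: str) -> list:
    """Return titles found by a path query that assumes property elements."""
    root = ET.fromstring(rdf_xml)
    matches = root.findall("rdf:Description/dc:title", NAMESPACES)
    return [element.text for element in matches]


print(titles_by_path(AS_PROPERTY_ELEMENT))    # the query finds the title
print(titles_by_path(AS_PROPERTY_ATTRIBUTE))  # the same query finds nothing
```

A stylesheet or query written against one form silently misses the
other, so a generic tool must either enumerate every permitted variant
or rely on an RDF-aware parser to normalize the input first.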
There is, as a result, something of a cleft between the RDF community
and the set of RDF tools on the one hand, and on the other the
community of users and tools employing what some have called
colloquial XML. The parallel development of query languages, schema
languages, object models, APIs,
editors, display tools, and so on does offer relatively harmless ways
for a large number of people to employ their time, but it does not
seem to us to serve the larger Web community well.
The cleft between RDF and colloquial XML does not seem to us to be
required by the RDF data model. A graph in which nodes have certain
properties and arcs have certain properties is not, in itself, a
peculiarly difficult structure to render in XML or to process with
off-the-shelf XML tools. An XML vocabulary in which nodes may appear
as elements, or as attributes, or as attribute values, or as the
PCDATA content of elements, and in which property names may appear as
three of the same four constructs, on the other hand, seems a rather
less straightforward XML representation of the underlying graph
structure than most XML vocabularies for graphs have chosen.
The result is that not just arbitrary RDF data, but data encoded using
vocabularies defined in RDF terms (for which current W3C work provides
a number of examples), will be hard to process using off-the-shelf
tools. We believe this difficulty represents a lost opportunity, and
we believe the opportunity could readily be seized if the XML
serialization were modified to capture more of the regularity of the
RDF data model.
We are ready to work together with the Working Groups in the Semantic
Web Activity and with other interested parties to formulate an XML
serialization which captures the information in the RDF model and
which is more readily amenable to processing with off-the-shelf XML
tools.