Dear IRI and XML experts,
Here are some additional comments on the issues raised by the HRRI
draft. I ran into them while trying to write some definitions in
the new IRI draft for XML and friends to use.
The core of the issue is the following:
- The XML Core WG wants to concentrate the definitions of
IRI-like syntax in a single document, without having to
normatively change XML or the various related specs that
currently use an "any Unicode character goes" definition
for what's allowed in a resource identifier.
- The HRRI draft
(http://www.ietf.org/internet-drafts/draft-walsh-tobin-hrri-01.txt)
gives the following for the conversion procedure:
To convert a Human Readable Resource Identifier to an IRI reference,
the following characters MUST be percent encoded:
o the control characters #x0 to #x1F and #x7F to #x9F
o space #x20
o the delimiters "<" #x3C, ">" #x3E, and """ #x22
o the unwise characters "{" #x7B, "}" #x7D, "|" #x7C, "\" #x5C,
"^" #x5E, and "`" #x60
These characters are percent encoded by applying steps 2.1 to 2.3 of
Section 3.1 of RFC 3987[3] to them.
It also says: "A string is a legal Human Readable Resource Identifier
if and only if the string generated by applying the encoding rules
above is a legal IRI."
- The current XML spec gives the following procedure for converting
a system identifier to a URI (summarized):
Convert all the above characters, plus all characters above 0x7F,
to %HH-encoding via UTF-8.
- The IRI spec excludes private use characters from every component
except the query part.
(There are other, smaller differences, but for the moment this is enough.)
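
To make the two conversion procedures concrete, here is a minimal
Python sketch. It is only an illustration under assumptions: the
function names are hypothetical, the XML rule follows the summary
above rather than the exact wording of the XML spec, and plain
percent-encoding via UTF-8 stands in for steps 2.1 to 2.3 of
Section 3.1 of RFC 3987.

```python
# Code points that the HRRI draft says MUST be percent encoded.
HRRI_ESCAPED = (
    set(range(0x00, 0x20))            # control characters #x0 to #x1F
    | set(range(0x7F, 0xA0))          # control characters #x7F to #x9F
    | {0x20}                          # space
    | {ord(c) for c in '<>"'}         # delimiters
    | {ord(c) for c in '{}|\\^`'}     # unwise characters
)


def percent_encode_utf8(ch):
    """Encode one character as UTF-8 and write each octet as %HH."""
    return "".join("%{:02X}".format(octet) for octet in ch.encode("utf-8"))


def hrri_to_iri(s):
    """HRRI draft rule (sketch): escape only the listed characters."""
    return "".join(
        percent_encode_utf8(c) if ord(c) in HRRI_ESCAPED else c for c in s
    )


def xml_sysid_to_uri(s):
    """XML rule as summarized above (sketch): also escape everything
    above 0x7F."""
    return "".join(
        percent_encode_utf8(c)
        if ord(c) in HRRI_ESCAPED or ord(c) > 0x7F
        else c
        for c in s
    )
```

With these definitions, a private use character such as U+E000 is
left untouched by hrri_to_iri, while xml_sysid_to_uri turns it into
%EE%80%80.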
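As a consequence, the definition in the HRRI draft isn't backwards
compatible with the definition in the XML spec: a system identifier
with a private use character outside the query part is accepted by
the XML procedure, but is not a legal HRRI, because the converted
string is not a legal IRI. In other words, adopting the HRRI
definition results in a normative change.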
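
The mismatch can be checked against the ucschar and iprivate ranges
of RFC 3987. The sketch below is illustrative only: the helper names
are hypothetical, and it classifies single non-ASCII characters
rather than validating full IRI syntax.

```python
# ucschar: non-ASCII characters allowed in all IRI components (RFC 3987).
UCSCHAR = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    # Planes 1 to 13: %x10000-1FFFD up to %xD0000-DFFFD.
    + [(plane << 16, (plane << 16) + 0xFFFD) for plane in range(1, 14)]
    + [(0xE1000, 0xEFFFD)]
)

# iprivate: private use characters, allowed only in the query part.
IPRIVATE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def in_ranges(code_point, ranges):
    """True if code_point falls in any inclusive (low, high) range."""
    return any(low <= code_point <= high for low, high in ranges)


def is_ucschar(ch):
    return in_ranges(ord(ch), UCSCHAR)


def is_iprivate(ch):
    return in_ranges(ord(ch), IPRIVATE)
```

For example, U+E000 is iprivate but not ucschar, so it may appear
unescaped only in the query part of an IRI, whereas the XML procedure
would simply percent-encode it wherever it occurs.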
There are various ways to deal with this:
- Accept a normative change to XML. In that case, my guess would
be that at least general control characters should also be disallowed.
Neither general control characters nor private use characters
should be in use in the wild at all, at least not on purpose.
- Refine the definition of conversion to an IRI in the HRRI spec.
My guess is that this can be done, but will look ugly.
- Change the IRI spec, to allow private use characters in other places.
Any comments welcome.
Regards, Martin.
#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp