2008-11-17

What's allowed in a URI?

Java 1.4 introduced the java.net.URI class, which provides RFC 2396-compliant URI handling. I thought I should try to fix Jing and Trang to use this. So I've been looking through all the relevant specs to figure out to what extent I can leave things to java.net.URI.

It's convenient to begin with XLink. Section 5.4 requires the value of the href attribute to be a URI reference after certain characters that are disallowed by RFC 2396 are escaped. These are described as

all non-ASCII characters, plus the excluded characters listed in Section 2.4 of IETF RFC 2396, except for the number sign (#) and percent sign (%) and the square bracket characters re-allowed in IETF RFC 2732

If we look at 2.4.3 of RFC 2396 (why does XLink reference section 2.4 rather than 2.4.3?), we see the following sets of characters excluded:

XSLT 1.0 just references RFC 2396 and doesn't say anything about escaping (as regards xsl:include and xsl:import). That seems like a bug to me. Erratum E39 adds the following to the first paragraph of the spec:

For convenience, XML 1.0 and XML Names 1.0 references are usually used. Thus, URI references are also used though IRI may also be supported. In some cases, the XML 1.0 and XML 1.1 definitions may be exactly the same.

This seems to be intended to extend it to allow IRIs, though it seems like a bit of a hack: there's no reference to the IRI spec, and I don't see how it's "Thus, ". In any case, XSLT 2.0 gets it right: it references xs:anyURI.

RFC 2396 has been updated by RFC 3986. This no longer has a section describing excluded characters, but I believe I am right in saying that the set of Unicode characters that cannot occur anywhere in a URI as defined by RFC 3986 is precisely the union of my categories 1 through 5.

I can buy controls and noncharacters being excluded, but the other two seem like over-engineering to me. The arguments for excluding these could equally be applied to various other weird Unicode characters. You don't want to have to change the definition of an IRI whenever Unicode adds some new weird character.

LEIRIs seem like a very useful innovation. XML-related specs such as RELAX NG that referenced or incorporated the XLink wording will be able to simply reference RFC 3987bis and say that URI references MUST be LEIRIs and SHOULD be IRIs.

Finally we are ready to look at java.net.URI. This allows URIs to contain an additional set of "other" characters which consist of non-ASCII characters with the exception of:

C1 controls (#x80 - #x9F)

Characters with a category of Zs, Zl or Zp
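A quick way to see this behaviour is to feed a few strings to the single-argument java.net.URI constructor and check which ones it rejects. This is only a sketch: the class name UriOtherCharsDemo, the accepts helper and the example.com URIs are made up for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriOtherCharsDemo {
    // Returns true if java.net.URI parses the string without complaint.
    static boolean accepts(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // A few probes: a non-ASCII letter, a plain space, a no-break space
    // (category Zs) and a C1 control.
    static void report() {
        String base = "http://example.com/";
        System.out.println(accepts(base + "\u00e9"));           // non-ASCII letter
        System.out.println(accepts(base + "a b"));              // space (#x20)
        System.out.println(accepts(base + "a\u00a0b"));         // no-break space, Zs
        System.out.println(accepts(base + "a" + (char) 0x85 + "b")); // C1 control
    }
}
```

Only the first probe should be accepted; the other three fall into the exceptions listed above.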

This means that if you want to give an LEIRI such as an XML system identifier to java.net.URI you first need to percent encode any of the following:

the following ASCII graphic characters: <>"{}|\^`

C0 control characters (#x00 - #x1F); of these only #x9, #xA and #xD are allowed in XML documents

space (#x20)

delete (#x7F)

C1 controls (#x80 - #x9F)

Characters with a category of Zs, Zl or Zp

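All except the first can be tested with Character.isISOControl(c) || Character.isSpaceChar(c). (Character.isSpaceChar is the method that checks for categories Zs, Zl and Zp; the deprecated Character.isSpace only recognizes a handful of ASCII whitespace characters.)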

Note that you don't want to blindly percent encode all non-ASCII characters because that will unnecessarily make IRIs containing non-ASCII characters unintelligible to humans.
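The escaping step described above might be sketched roughly as follows. The class and method names (LeiriEscaper, needsEscape, escape) are hypothetical, and I am assuming UTF-8 as the encoding for the percent-escaped bytes, as is usual for IRIs. It only handles the characters listed above; it does not try to validate the result or deal with things like a stray % or square brackets outside a host.

```java
import java.nio.charset.StandardCharsets;

final class LeiriEscaper {
    // ASCII graphic characters that java.net.URI will not accept unescaped.
    private static final String EXCLUDED_ASCII = "<>\"{}|\\^`";

    // True for the characters that need percent encoding before the
    // string is handed to java.net.URI.
    static boolean needsEscape(int cp) {
        return EXCLUDED_ASCII.indexOf(cp) >= 0
            || Character.isISOControl(cp)   // C0 controls, delete, C1 controls
            || Character.isSpaceChar(cp);   // categories Zs, Zl, Zp (includes #x20)
    }

    // Percent encodes only the characters that need it, leaving other
    // non-ASCII characters (and any existing %XX escapes) untouched.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (needsEscape(cp)) {
                byte[] bytes = new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8);
                for (byte b : bytes) {
                    sb.append(String.format("%%%02X", b & 0xFF));
                }
            } else {
                sb.appendCodePoint(cp);
            }
        }
        return sb.toString();
    }
}
```

With this, something like java.net.URI.create(LeiriEscaper.escape(systemId)) should get past the parser for typical system identifiers, while a path segment such as "é" stays readable instead of turning into %C3%A9.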

I wonder if writing a compliant implementation of a spec has to be this hard. In particular, the areas related to XML Schema always seem horribly confusing.

Some specs are certainly better written than others. Still, it might be nice if someone could wrap up all the confusing cross-references every now and then... this would probably greatly increase the number of conforming implementations (but who has the time ...).