URI and IRI Templates, Oy

When I first started looking at URI templates
I was surprised no one had written a specification
for them yet. It seemed so simple, "just" add {name}
to the URI and then substitute with a value at a later
time. After bashing my head against the wall for
a couple weeks, here is a synopsis of the
character encoding issues involved in doing
URI and IRI Templates.

We have several open issues:

1. Deciding which characters to escape.

2. Reserving some character in template variable names for future use, à la ':' for XML namespaces.

While this is a long post, I will only cover the issues involved in #1.

My overarching goal for URI Templates, and I believe this is necessary to make them
a success, is to make URI Templates simple by being opinionated, as Sam described it.

Grounding

First let's dispel the notion that you can come up with
the perfect URI-Template to URI translation mechanism
that will always produce a valid URI regardless of the
scheme. That last part, "regardless of the scheme", is the
crux of the problem. While RFC 3986 defines what a
URI looks like, schemes may impose further restrictions. For
example, while

tel:bitworking.org

matches the ABNF in RFC 3986, it is not a valid tel: URI,
and it never will be.

We have two choices:

Define a mechanism that is only guaranteed to meet the URI
syntax (i.e. RFC 3986), and thus potentially generate
URIs that are invalid in some schemes.

Restrict ourselves to URIs of a particular scheme such
as http: or mailto:.

Serendipity

As an aside, it turns out that the regular expression given in
Appendix B of RFC 3986 is capable of
parsing up URI Templates, but only if the characters
allowed in template variable names are restricted, and
only if template variables are not allowed to span
components.

This is important because it makes it easy to parse up
a URI Template if we want to impose
different escaping requirements on different components.
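To make that concrete, here is a minimal Python sketch that runs the Appendix B regular expression over a template. The function name split_template and the example templates are mine, made up for illustration. The second call shows why the variable names have to be restricted: a '?' inside the braces gets treated as the start of the query.

```python
import re

# The regular expression from Appendix B of RFC 3986.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_template(template):
    """Split a URI Template into components (hypothetical helper)."""
    m = URI_RE.match(template)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

good = split_template("http://example.org/{dir}/list?q={term}#{section}")
print(good["path"])      # /{dir}/list
print(good["query"])     # q={term}
print(good["fragment"])  # {section}

# A '?' in the variable name splits the variable across two components.
bad = split_template("http://example.org/{a?b}")
print(bad["path"])       # /{a
print(bad["query"])      # b}
```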

What to %-encode

Certain characters are going to have to be %-encoded
to ensure that filling in a URI-Template
doesn't destroy the structure of the URI. For both
URIs and IRIs the 'reserved' set of characters are the
ones that are going to cause trouble and need to be
escaped.

The rule names used here are the IRI ones from RFC 3987.
The rules are the same for URIs, except drop
all the 'i's off the beginning of the names
(iunreserved becomes unreserved), and drop iprivate.

So let's begin with a simple approach, how about
escaping all the characters in 'reserved'?
If we do then you can't do this:

http://example.org?{fred}
fred="q=2"
http://example.org?q=2

That might seem too restrictive, so let's make
that example concrete.

http://www.google.com/search?q={term}
term="Ben&Jerrys"

If reserved characters are escaped then
the URI Template expands to:

http://www.google.com/search?q=Ben%26Jerrys

That search gives you the results you
would expect. If reserved characters are NOT escaped then you
get a very different search result:

http://www.google.com/search?q=Ben&Jerrys

And that does *not* give the expected results.
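You can see the difference without hitting a search engine. This is a small sketch using Python's urllib.parse, with parse_qs standing in for whatever the server does to the query string, which is an assumption on my part about how a typical server behaves.

```python
from urllib.parse import quote, parse_qs

term = "Ben&Jerrys"

# safe="" tells quote() to escape every reserved character.
escaped = "q=" + quote(term, safe="")
unescaped = "q=" + term

print(escaped)              # q=Ben%26Jerrys
print(parse_qs(escaped))    # {'q': ['Ben&Jerrys']}
print(parse_qs(unescaped))  # {'q': ['Ben']}
```

Without escaping, the '&' is read as a parameter separator and the search term is cut down to "Ben".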

So let's always escape, right? Not so fast. If
we always escape reserved characters we get

mailto:{address}
address="joe@bitworking.org"

expanding to

mailto:joe%40bitworking.org

which is not what you want to happen.
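The same urllib.parse call shows the problem. This is only a sketch; the second call, which leaves '@' alone, is there for contrast and is not a rule from any specification.

```python
from urllib.parse import quote

address = "joe@bitworking.org"

# Escaping every reserved character mangles the address.
escape_all = "mailto:" + quote(address, safe="")
# Leaving '@' unescaped keeps it intact.
keep_at = "mailto:" + quote(address, safe="@")

print(escape_all)  # mailto:joe%40bitworking.org
print(keep_at)     # mailto:joe@bitworking.org
```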

Like I said, we can't come up with something guaranteed to
generate only valid URIs unless we restrict ourselves to a particular
scheme, which isn't as useful as defining templates for all URIs.
So what if we pick a subset of 'reserved' that does not
get %-encoded? Can we pick a subset that produces
the least surprising results? Here is my suggestion: escape
all the characters in 'reserved' except the following three: '@', ':', and '/'.

Which is clearly an invalid URI. So do we
give special escaping rules for authority?
That at least makes the results match the
URI syntax, but for the HTTP scheme the string
a%2Fb.example.org isn't a valid domain name.
And don't even get me started on how badly this could
go if you allowed template variables in the
scheme:

{scheme}://bitworking.org
scheme="gopher"
gopher://bitworking.org

On the other hand, I could see useful
applications:

http{ssl}://bitworking.org
ssl="s"
https://bitworking.org

So we have a few possibilities:

1. Escape all 'reserved' characters except @, :, and / across every component, realizing
we may not end up with a valid URI.

2. Escape all 'reserved' characters except @ and :, realizing that our 'path' example
will then break since '/' will get escaped.

3. Escape all 'reserved' characters except @, :, and /, but only allow template variables in path, query and fragment components.
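Here is a rough Python sketch of how the three possibilities differ. The names expand, option1, option2 and option3 are hypothetical, the template with {path} is a made-up example, and urllib.parse.quote stands in for the escaping step. The third option leans on the Appendix B regular expression from RFC 3986 to find the scheme and authority.

```python
import re
from urllib.parse import quote

# The regular expression from Appendix B of RFC 3986.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")
VAR = re.compile(r"\{([^{}]+)\}")

def expand(template, values, safe):
    """Escape each value, leaving the characters in `safe` alone, then substitute."""
    return VAR.sub(lambda m: quote(values[m.group(1)], safe=safe), template)

def option1(template, values):
    # Keep '@', ':' and '/' in every component.
    return expand(template, values, "@:/")

def option2(template, values):
    # Keep only '@' and ':', so '/' gets escaped.
    return expand(template, values, "@:")

def option3(template, values):
    # Same escaping as option 1, but reject variables in scheme or authority.
    m = URI_RE.match(template)
    scheme, authority = m.group(2) or "", m.group(4) or ""
    if "{" in scheme or "{" in authority:
        raise ValueError("variables are only allowed in path, query and fragment")
    return expand(template, values, "@:/")

print(option1("http://example.org/{path}", {"path": "a/b"}))  # http://example.org/a/b
print(option2("http://example.org/{path}", {"path": "a/b"}))  # http://example.org/a%2Fb
print(option1("{scheme}://bitworking.org", {"scheme": "gopher"}))  # gopher://bitworking.org
```

Calling option3 on the {scheme} template raises ValueError instead of expanding it.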

IRIs

The Algorithm

Let's start with IRIs since those are actually simpler, and let's
also assume that we choose #1 of the options above:

Escape all 'reserved' characters except @, :, and / across every component, realizing
we may not end up with a valid URI.

Algorithm:

1. Start with an IRI Template (noting that URIs are also IRIs):

http://example.org/{blah}

2. Percent-encode every character in the values of the template variables that isn't in ( iprivate | iunreserved | '@' | ':' | '/' )

3. Substitute variables with their values, which produces an IRI.

Note that we could use the same algorithm for URI Templates
as long as we add a fourth step:

4. Convert the IRI to a URI following Section 3.1 of RFC 3987.
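The steps above can be sketched in Python roughly like this. Treat it as an illustration under some assumptions, not a reference implementation: the function names are mine, the ucschar and iprivate ranges are copied from the RFC 3987 ABNF, a missing variable simply raises KeyError, and the fourth step only percent-encodes non-ASCII characters as UTF-8, skipping the normalization and optional internationalized domain name handling that Section 3.1 also describes.

```python
import re

ASCII_UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
EXTRA_SAFE = set("@:/")

# ucschar and iprivate code point ranges from the RFC 3987 ABNF.
UCSCHAR = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
UCSCHAR += [(p << 16, (p << 16) | 0xFFFD) for p in range(1, 14)]
UCSCHAR.append((0xE1000, 0xEFFFD))
IPRIVATE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]

VAR = re.compile(r"\{([^{}]+)\}")

def _pct(ch):
    """Percent-encode one character as its UTF-8 bytes."""
    return "".join("%%%02X" % b for b in ch.encode("utf-8"))

def _is_safe(ch):
    """True if ch is in ( iprivate | iunreserved | '@' | ':' | '/' )."""
    if ch in ASCII_UNRESERVED or ch in EXTRA_SAFE:
        return True
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UCSCHAR + IPRIVATE)

def expand_iri_template(template, values):
    """Steps 2 and 3: escape each value, then substitute it into the template."""
    def fill(match):
        value = values[match.group(1)]
        return "".join(ch if _is_safe(ch) else _pct(ch) for ch in value)
    return VAR.sub(fill, template)

def iri_to_uri(iri):
    """Step 4, simplified: percent-encode every non-ASCII character."""
    return "".join(ch if ord(ch) < 0x80 else _pct(ch) for ch in iri)

iri = expand_iri_template("http://example.org/{blah}", {"blah": "a b/é?"})
print(iri)              # http://example.org/a%20b/é%3F
print(iri_to_uri(iri))  # http://example.org/a%20b/%C3%A9%3F
```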

Hopefully reading this has been as
helpful for you as writing it has been for me, and
some of the subtle issues in character handling
that need to be more strictly specified in the
next revision of the specification
are clearer.
I also posted this to the W3C URI mailing list
so feel free to follow up there with any comments.