(2) Thu May 31 2007 09:35 Acceptance:
I can't find the link [UPDATE: it was Paul Sandoz], but someone posted a weblog entry that mentioned something I say in RWS about choosing different URIs for different representations. The example I (and the unknown weblog entry writer) used was a press release which is available in two languages (English and Spanish) and two data formats (HTML and plain text). The URI to this resource is /releases/104, and you can choose a representation format by setting the Accept and Accept-Language headers. But I also recommend exposing a URI for each language and format:

/releases/104.en

/releases/104.txt

/releases/104.en.html

/releases/104.es.txt

etc.

When a client requests one of those URIs they can leave out one or both of the Accept- headers and still get what they want. The Content-Location response header is set to /releases/104 so that you have a URI to use when talking about the press release in general, rather than a specific version of it.
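Here's a minimal sketch of how a server might pull those levers out of the URI and fall back to the headers when an extension is missing. Everything in it is an assumption made for illustration: the function name `resolve`, the defaults of English and HTML, and the simplification that each header carries a single exact value (real Accept headers carry lists and q-values).

```python
# Hypothetical tables covering only the example in the text:
# two languages and two data formats.
LANGUAGES = {"en", "es"}
FORMATS = {"html": "text/html", "txt": "text/plain"}


def resolve(path, accept=None, accept_language=None):
    """Split a path like /releases/104.es.txt into
    (canonical URI, language, media type)."""
    base, *extensions = path.split(".")
    language = None
    media_type = None
    for ext in extensions:
        if ext in LANGUAGES:
            language = ext
        elif ext in FORMATS:
            media_type = FORMATS[ext]
        else:
            raise ValueError(f"unknown extension: {ext}")
    # No lever in the URI: fall back to the header, then to an assumed default.
    if language is None:
        language = accept_language if accept_language in LANGUAGES else "en"
    if media_type is None:
        media_type = accept if accept in FORMATS.values() else "text/html"
    return base, language, media_type
```

The first element of the result is the canonical URI, which is the value that would go into Content-Location in the scheme described above.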

The question in the weblog entry I can't find is more or less this: what about the other two Accept- request headers, Accept-Charset and Accept-Encoding? Why don't I recommend exposing URIs like /releases/104.es.txt.gzip.UTF8?

The reason I put any special levers in the URI is because we pass around URIs, not URIs plus headers. Lots of programs and services take a URI, perform a GET on it, and expect that they got what you told them to get. One example I give in the book is the W3C HTML validator. If the only URI I exposed was /releases/104, there'd be no way to validate the Spanish version separate from the English version. If the default representation format for /releases/104 was plain text (unlikely, but this is just a thought experiment), there'd be no way to validate the HTML formats at all.

To the extent that this reason applies to some piece of information, I argue for putting a lever for it in the URI. Obviously it applies to things like which press release you want. I think it also clearly applies to the language and the data format. I think it doesn't apply to compression or character encodings. Here's a quote from page 243, in the section "Compression" where I talk about Accept-Encoding:

You probably remember that I think different representations of a resource should have distinct URIs. Why do I recommend using HTTP headers to distinguish between compressed and uncompressed versions of a representation? Because I don't think the compressed and uncompressed versions are different representations. Compression, like encryption, is something that happens to a representation in transit, and must be undone before the client can use the representation. In an ideal world, HTTP clients and servers would compress and decompress representations automatically, and programmers should not have to even think about it.

We can argue over what counts as a different representation (as you can see I take a fairly high-level view), but even if you think the compressed and uncompressed data are different representations, this isn't a difference that needs to go into the URI. A client can be programmed to detect a compressed representation and decompress it automatically. The key is this ability. If it were possible to algorithmically translate any human language into another, or any data format into another, there'd be a much weaker case for extending the URI with levers for language and data format.
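To make "automatically" concrete, here's a small sketch of the client side, assuming gzip is the only compression in play and that the response headers arrive as a plain dict keyed by "Content-Encoding". The name `decode_body` is hypothetical.

```python
import gzip


def decode_body(headers, body):
    """Undo gzip compression if the response says it was applied.

    `headers` is assumed to be a plain dict; `body` is the raw bytes.
    """
    if headers.get("Content-Encoding", "").lower() == "gzip":
        return gzip.decompress(body)
    # Not compressed (or a coding this sketch doesn't handle): pass through.
    return body
```

Nothing about the URI is needed to make this decision; the response header alone tells the client what to undo.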

The same logic holds for character encodings, because a client can be programmed to convert any character encoding into Unicode automatically. The case is weaker because 1) the programming is difficult unless there happens to be a library for your language, and 2) you only get compression if you ask for it, but you get a character encoding whether you like it or not.
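In a language that does have such a library, the conversion is short. This sketch uses Python's standard `email.message.Message` to read the charset parameter from a Content-Type value; the function name `to_unicode` and the UTF-8 fallback are assumptions for illustration.

```python
from email.message import Message


def to_unicode(content_type, body, default="utf-8"):
    """Decode raw response bytes into a Unicode string, using the
    charset parameter of the Content-Type header when there is one."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    charset = msg.get_content_charset() or default
    return body.decode(charset)
```

For example, `to_unicode("text/plain; charset=ISO-8859-1", data)` decodes `data` as Latin-1, while a bare `"text/plain"` falls back to the assumed default.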

If you're exposing a resource in multiple encodings, and you have reason to believe that a URI-driven client might choke on your default encoding, then sure, put an encoding selection lever in the URI. But at the risk of sounding provincial (if being too cosmopolitan can be a kind of provinciality), your default encoding ought to be UTF-8 or UTF-16.

Paul, I got a weird "comment authentication failed!" error when I tried to post this to your weblog:

I've been thinking of ways to link the canonical URI to a more specific URI. My two ideas were using 300 ("Multiple Choices") when you requested the canonical URI with no conneg settings, and coming up with a header that has the opposite meaning of Content-Location, which the client would see whenever it requested the canonical URI with conneg settings. The advantage of the latter is that you wouldn't have to follow a redirect, but I like how the 300 idea doesn't make up anything new.
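A rough sketch of the 300 idea, assuming the specific URIs follow the /releases/104.en.html pattern from earlier. The function name `multiple_choices` and the shape of the return value (a status code plus a list of URIs) are made up for illustration; how the list would actually be carried in the response is left open.

```python
def multiple_choices(canonical, languages=("en", "es"), formats=("html", "txt")):
    """Return a 300 status and the specific URIs behind a canonical URI."""
    variants = [
        f"{canonical}.{lang}.{fmt}"
        for lang in languages
        for fmt in formats
    ]
    return 300, variants
```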