On Wed, 8 Jan 1997, Larry Masinter wrote:
> # For HTML language tagging
> # (the LANG attribute), we explicitly overruled this (see RFC2070).
> # For HTTP, using a similar overruling would make sense. This would
> # mean that a server would check for "en-us", and if not found, for "en".
>
> Please review section 14.4 of RFC 2068 (HTTP/1.1). I still haven't
> quite understood if anyone thinks this section is wrong or should be
> changed.
Thanks for this hint! This not only gives a (partial) answer to
the problem of language tag matching, but also some indications
in other areas that have been discussed in this thread.
I'll take these issues first. For Accept-Language, RFC2068 says
explicitly what q=0 means:
>>>>
The language quality factor assigned to a language-tag by the
Accept-Language field is the quality value of the longest language-
range in the field that matches the language-tag. If no language-
range in the field matches the tag, the language quality factor
assigned is 0. If no Accept-Language header is present in the
request, the server SHOULD assume that all languages are equally
acceptable. If an Accept-Language header is present, then all
languages which are assigned a quality factor greater than 0 are
acceptable.
<<<<
This clearly means that q=0 means NOT ACCEPTABLE. Whether this
has to be interpreted as being a special case for Accept-Language,
or an example of a general principle, is beyond my knowledge
of the RFC and its creation process.
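The quoted rule can be read as a small procedure. Here is a rough
Python sketch of it, not taken from the RFC or from any server: the
function names are made up for illustration, the header parsing is
deliberately simplified (it only understands a bare "q=" parameter),
and returning 1.0 when no header is present is just one arbitrary way
to express "all languages equally acceptable".

```python
from typing import Optional


def range_matches(language_range: str, tag: str) -> bool:
    """Exact match, or the range is a prefix of the tag followed by "-"."""
    r, t = language_range.lower(), tag.lower()
    return r == t or t.startswith(r + "-")


def parse_accept_language(header: str) -> dict:
    """Turn 'da, en-gb;q=0.8, en;q=0.7' into {'da': 1.0, 'en-gb': 0.8, 'en': 0.7}."""
    ranges = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        q = 1.0  # the quality value defaults to q=1
        params = params.strip()
        if params.startswith("q="):
            q = float(params[2:])
        ranges[name.strip().lower()] = q
    return ranges


def language_quality(header: Optional[str], tag: str) -> float:
    """Quality factor assigned to a document's language-tag."""
    if header is None:
        return 1.0  # assumption: any equal value for all languages would do
    ranges = parse_accept_language(header)
    matching = [r for r in ranges if r != "*" and range_matches(r, tag)]
    if matching:
        return ranges[max(matching, key=len)]  # longest matching range wins
    if "*" in ranges:
        return ranges["*"]  # "*" covers tags matched by no other range
    return 0.0  # no range matches: not acceptable
```

With the header "en;q=0, *;q=0.5", this sketch gives an en-us document
the factor 0 (not acceptable), while a document in any non-English
language gets 0.5.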
Another very revealing detail is the following:
>>>>
Note: As intelligibility is highly dependent on the individual
user, it is recommended that client applications make the choice of
linguistic preference available to the user. If the choice is not
made available, then the Accept-Language header field must not be
given in the request.
<<<<
So the browser that was mentioned in an earlier mail not only
has a design problem, it also ignores the recommendations
made here. I don't think it is useful to demand flexibility
in browser behaviour when the browsers ignore the specs.
Otherwise, we could just stop writing specs altogether.
Now for the question of prefix matching. The RFC indeed defines
prefix matching, very clearly and consistently. But this prefix
matching works only one way:
>>>>
The Accept-Language request-header field is similar to Accept, but
restricts the set of natural languages that are preferred as a
response to the request.
    Accept-Language = "Accept-Language" ":"
                      1#( language-range [ ";" "q" "=" qvalue ] )
    language-range  = ( ( 1*8ALPHA *( "-" 1*8ALPHA ) ) | "*" )
Each language-range MAY be given an associated quality value which
represents an estimate of the user's preference for the languages
specified by that range. The quality value defaults to "q=1". For
example,
    Accept-Language: da, en-gb;q=0.8, en;q=0.7
would mean: "I prefer Danish, but will accept British English and
other types of English." A language-range matches a language-tag if
it exactly equals the tag, or if it exactly equals a prefix of the
tag such that the first tag character following the prefix is "-".
The special range "*", if present in the Accept-Language field,
matches every tag not matched by any other range present in the
Accept-Language field.
<<<<
To give an example, we have the following situation:
  Accept-Language     Document         Match?
  language-range      language-tag
  en                  en               YES
  en-us               en-us            YES
  en                  en-us            YES
  en-us               en               NO?!
  en-us               en-uk            NO?!
The idea is that Accept-Language defines language-ranges,
whereas the documents will be tagged exactly. I don't know
exactly how the group arrived at this asymmetry, but I
guess the basic thought was that for documents, it would
be clear whether it was US or British English (and
likewise in other cases), whereas the user would in
general not care much about the difference. Prefixes
(ranges) would therefore be used in Accept-Language, but
not in document tags.
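The table above can be reproduced with a few lines of Python. This is
only a sketch of the one-way rule as quoted: the function name
rfc2068_matches is hypothetical, the comparison is done
case-insensitively, and the special range "*" is left out because its
meaning depends on the other ranges in the header.

```python
def rfc2068_matches(language_range: str, tag: str) -> bool:
    """One-way rule: the range equals the tag, or is a prefix of it followed by "-"."""
    r, t = language_range.lower(), tag.lower()
    return r == t or t.startswith(r + "-")


# (language-range from Accept-Language, language-tag of the document)
cases = [
    ("en", "en"),
    ("en-us", "en-us"),
    ("en", "en-us"),
    ("en-us", "en"),
    ("en-us", "en-uk"),
]

for language_range, tag in cases:
    verdict = "YES" if rfc2068_matches(language_range, tag) else "NO"
    print(f"{language_range:<10}{tag:<10}{verdict}")
```

Swapping the two arguments in the third and fourth cases flips the
result, which is exactly the asymmetry in question.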
Several points suggest that the situation is not (or should
not be) as asymmetric as described in the RFC:
- Documents are rarely prepared in both en-us and en-uk, so
  authors don't care about the distinction and just tag
  them with "en".
- In some cases, there may be no actual difference, and it
  would be strange to label a document as en-us if it
  is just as well en-uk.
- Tagging is in many cases done via file names. Something
  such as text.en.html and text.fr.html is preferred
  to text.en-us.html and text.fr-ch.html.
- In many cases, language selections on the browser side
  are connected to locales. Locales cover a lot of
  details where small differences matter, and are
  therefore fine-grained. I don't think Windows
  or the Mac have something like a "generic English"
  configuration.
So probably, a symmetric solution, with prefix matching on
both sides, is highly preferable. In this respect, the
HTML solution (RFC 2070) is not exactly clear, because it
only says that language tags are interpreted hierarchically,
and gives one way of prefix matching as an example. It is
not specified whether the other way of prefix matching is
also allowed, or not.
Apart from the small terminology problem that "language-range"
and "language-tag" no longer make sense as a way to distinguish
the Accept side from the document side, the following part:
    A language-range matches a language-tag if
    it exactly equals the tag, or if it exactly equals a prefix of the
    tag such that the first tag character following the prefix is "-".
probably has to be changed as follows:
    A language-range matches a language-tag if a prefix of the
    language-range matches a prefix of the language-tag, such that
    for both prefixes, the prefix is equal to the whole identifier
    or the first character following the prefix is "-".
There is the possibility that this goes too far. In the case of
matching en-us with en-uk, it makes sense. But if we consider
generic prefixes, such as "x-" for experimental, it wouldn't
make sense to return any kind of experimentally defined
language just because the user has specified one particular
tag of that kind. Matching of prefixes does not imply that
the denoted languages are mutually intelligible. So an alternative
would be:
    A language-range and a language-tag match if they are equal, or
    if a prefix of one of them exactly equals the other, such that
    the first character following the prefix is "-".
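To make the difference between the two proposed wordings concrete,
here is a Python sketch of both. The function names are invented for
this mail, and "x-klingon" and "x-elvish" are hypothetical experimental
tags used only as an illustration. The first wording amounts to
comparing the first components of the two tags, since any shared
prefix that ends at a "-" boundary must include the first component.

```python
def common_prefix_matches(a: str, b: str) -> bool:
    """First wording: some prefix ending at a "-" boundary (or the whole tag) is shared."""
    a, b = a.lower(), b.lower()
    # The shortest possible shared prefix is the first component of each tag.
    return a.split("-")[0] == b.split("-")[0]


def symmetric_matches(a: str, b: str) -> bool:
    """Second wording: equal, or one is a prefix of the other followed by "-"."""
    a, b = a.lower(), b.lower()
    return a == b or a.startswith(b + "-") or b.startswith(a + "-")


pairs = [
    ("en", "en-us"),
    ("en-us", "en"),
    ("en-us", "en-uk"),
    ("x-klingon", "x-elvish"),
]

for left, right in pairs:
    print(left, right, common_prefix_matches(left, right), symmetric_matches(left, right))
```

Both rules are symmetric, but only the first one matches en-us with
en-uk, and for the same reason it also matches the two unrelated "x-"
tags. The second rule rejects both of those pairs.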
Regards, Martin.