Not only CJK and Cyrillic, but also Hebrew and, I suppose, many other
non-Latin languages.
Jony
-----Original Message-----
From: www-international-request@w3.org
[mailto:www-international-request@w3.org] On Behalf Of Henri Sivonen
Sent: Monday, June 01, 2009 10:49 AM
To: M.T.Carrasco Benitez
Cc: Travis Leithead; Erik van der Poel; public-html@w3.org;
www-international@w3.org; Richard Ishida; Ian Hickson; Chris Wilson; Harley
Rosnow
Subject: Re: Auto-detect and encodings in HTML5
On May 31, 2009, at 11:18, M.T. Carrasco Benitez wrote:
> Near to Erik, but UTF-8 in the worst case:
>
> 1) Best: HTTP charset; unambiguous and "external"
> 2) Agree on ONE public detection algorithm
> 3) Mandatory declaration as near to the top as possible; if in META,
> the first in HEAD; within a certain range of bytes (e.g., 512)
We tried "first in HEAD" as a document conformance requirement, and it
produced far too many annoying validator messages when people updated
old sites. "Within a certain range of bytes" strikes the right balance
between performance and existing authoring practices.
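The "within a certain range of bytes" idea can be sketched roughly as follows. This is a simplified illustration in Python, not the actual prescan that browsers implement (the real one is a byte-wise state machine, not a pair of regexes), and the function name is mine:

```python
import re

def prescan_meta_charset(head_bytes):
    """Look for a charset declaration in the first 1024 bytes of a
    document. Simplified sketch only: regexes stand in for the real
    prescan's tokenizer-like byte scanning."""
    window = head_bytes[:1024]
    # HTML5-style <meta charset="..."> (quoted or unquoted)
    m = re.search(rb'<meta\s+charset=["\']?([a-zA-Z0-9_-]+)', window, re.I)
    if m:
        return m.group(1).decode('ascii')
    # Legacy <meta http-equiv="Content-Type" content="...; charset=...">
    m = re.search(rb'charset=([a-zA-Z0-9_-]+)', window, re.I)
    if m:
        return m.group(1).decode('ascii')
    return None  # nothing declared within the byte budget
```

Declarations past the 1024-byte window are simply not seen, which is why the declaration needs to sit as near the top as possible.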
> 4) Default UTF8 could be part of the algorithm; perhaps the last
> option
This is not feasible given our "support existing content" design
principle. If there's only a single last-resort default, it must be
Windows-1252, which gives the best worldwide coverage of existing
content. (Future non-Latin and Latin content can explicitly opt into
UTF-8.)
Unfortunately, having Windows-1252 as the only default, without a
sniffing algorithm that keeps CJK and Cyrillic content from reaching
that default, would be bad for market share in CJK and Cyrillic
locales. If we want to get rid of the locale-dependent variability of
the last-resort default, we need a single normative heuristic
detection algorithm good enough that CJK and Cyrillic encodings are
guessed correctly from the first 512 to 1024 bytes (i.e. mostly
<title>).
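No such normative algorithm exists, but the shape of the problem can be hinted at with a sketch. The function below handles only the easiest sub-case, recognizing well-formed UTF-8; a real detector good enough for CJK and Cyrillic would additionally need per-encoding byte-frequency statistics, which this deliberately omits. The function name and the fallback choice are illustrative assumptions:

```python
def looks_like_utf8(first_bytes):
    """Illustrative heuristic only: if the first chunk decodes as
    UTF-8 and contains at least one non-ASCII byte, UTF-8 is a
    plausible guess; otherwise a caller would fall back to a legacy
    default such as Windows-1252. UTF-8's strict multi-byte sequence
    rules make false positives on legacy-encoded text unlikely."""
    chunk = first_bytes[:1024]
    try:
        chunk.decode('utf-8')
    except UnicodeDecodeError:
        return False  # invalid UTF-8 sequence: legacy encoding likely
    # Pure ASCII decodes as anything, so it tells us nothing.
    return any(b >= 0x80 for b in chunk)
```

Distinguishing, say, Shift_JIS from EUC-JP or Windows-1251 from KOI8-R within the same budget is the genuinely hard part that this sketch does not attempt.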
And then we'd need a desktop browser vendor who is willing to be the
first one to remove the UI for setting the last-resort default
encoding--for all modes that the browser has for text/html.
UTF-8 never makes sense as the last resort for text/html, because
UTF-8 in text/html has always been opt-in for authors, so logically
there should be much less unlabeled existing content that assumes
UTF-8 than content that assumes a legacy encoding.
> 5) No BOM or similar
Compatibility with existing implementations requires the UTF-16 BOMs
and the UTF-8 BOM to be treated as encoding signatures whose
authority ranks higher than <meta>.
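The BOM check itself is trivial, which is part of why it can sit at the top of the precedence order. A minimal sketch (function name mine; a caller would consult this before any <meta> prescan, reflecting the ranking above):

```python
def sniff_bom(data):
    """Return the encoding named by a byte order mark at the start of
    the stream, or None. In the precedence order discussed here, a
    BOM hit overrides any <meta>-declared encoding."""
    if data[:3] == b'\xef\xbb\xbf':
        return 'utf-8'
    if data[:2] == b'\xff\xfe':
        return 'utf-16-le'
    if data[:2] == b'\xfe\xff':
        return 'utf-16-be'
    return None
```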
On May 31, 2009, at 20:37, Erik van der Poel wrote:
> I agree that it would be interesting if major HTML5 implementers and
> (the) HTML5 spec writer(s) would agree on a UTF-8 default charset.
>
> Just to make the HTML5 "version indicator" a bit more explicit, might
> this be something like the following HTTP response header?
>
> Content-Type: text/html; version=5; charset=gb2312
Content-Type: text/html; version=5 doesn't default to UTF-8 in
existing user agents, and if you can set headers, you can already
send Content-Type: text/html; charset=utf-8.
The HTML WG discussed versioning at length in March and April 2007. I
suggest not reopening that debate.
--
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/