On Apr 24, 2008, at 23:11 , Frank Ellermann wrote:
> Henri Sivonen wrote:
>
>> Considering the real Web content, it is better to pick Windows-1252
>> than a hypothetical generic encoding.
>
> A good strategy for browsers, not necessarily for validators
> IFF it could accept wild mixtures of Latin-1 and UTF-8 as
> "valid" windows-1252.
[...]
> Your proposal "just assume windows-1252" is an idea for the
> validation step,
That wasn't the proposal. The proposal was: Assume Windows-1252 but
treat the upper half (octets 0x80..0xFF) as errors.
> but it could have rather odd effects for the
> UTF-8 output of other errors, when the input contains any octet
> in the range 0x80..0x9F, or worse, if the input in fact was
> UTF-8, not windows-1252.
Would mere U+FFFD be better?
> Jukka's proposal avoids most surprises - all octets 0x80..0xFF
> are accepted as "unknown garbage".
I think a quality assurance tool should not *accept* unknown garbage
but emit an error on non-declared non-ASCII.
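
As a rough sketch of that behavior (Python; the function name
check_undeclared and the shape of the error list are made up for
illustration, not taken from any existing validator): every non-ASCII
octet is reported as an error, and Windows-1252 is used only to recover
a character for the output. Octets that Python's cp1252 codec leaves
unassigned come out as U+FFFD.

```python
def check_undeclared(data: bytes):
    """Check input that has no declared encoding.

    Returns (text, errors). Each entry in errors is (offset, octet)
    for an octet in the range 0x80..0xFF.
    """
    errors = []
    chars = []
    for offset, octet in enumerate(data):
        if octet < 0x80:
            # Plain ASCII needs no declaration.
            chars.append(chr(octet))
        else:
            # Non-declared non-ASCII: always an error.
            errors.append((offset, octet))
            # Recovery only: guess Windows-1252; octets the codec
            # leaves unassigned become U+FFFD.
            chars.append(bytes([octet]).decode("cp1252", errors="replace"))
    return "".join(chars), errors
```

Fed input that was in fact UTF-8, this reports one error per non-ASCII
octet and the recovered text is mojibake, which is the case where
emitting U+FFFD for every such octet might be the less surprising
output.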
--
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/