Anne van Kesteren, Wed, 21 Nov 2012 22:04:22 +0100:
> I saw http://www.w3.org/International/questions/new/qa-byte-order-mark-new
> in the minutes.
I have no objections to Anne’s comments. It is especially important
that the BOM overrides everything else. But instead of removing the
warnings, perhaps you could say that, as of today, not all HTML UAs
let the BOM override the HTTP charset yet. Also, of course, one should
not encourage anyone to make the BOM and HTTP disagree!
Here are some comments of my own:
(I.) While I often speak well of the BOM, I heard a good, critical
comment from Martin on the Unicode mailing list this summer: [1]
"The problem with the BOM in UTF-8 is that it can be quite
helpful (for quickly distinguishing between UTF-8 and
legacy-encoded files) and quite damaging (for programs that use
the Unix/Linux model of text processing), and that's why it
creates so much controversy."
This informative note would be a good statement to include, verbatim
or edited - e.g. where you start to describe the problems of the BOM.
(My hunch is, as well, that the "Unix/Linux model of text processing"
is ultimately one reason why PHP doesn't handle the BOM so well.)
(II.) Positivity! The page says much about the disadvantages of the BOM.
Could you please also describe some advantages of including the BOM?
For the UTF-8 BOM, those advantages are:
a) It is a UTF-8 _signature_ - thus it prevents the page from
defaulting to - well - the default encoding.
b) It has effect in both XML/XHTML and HTML.
c) It is small/short.
d) It is very safe: Per Anne's Encoding spec - and as implemented
in IE (I have not tested released IE10), Webkit and (as promised by
Henri) upcoming versions of Firefox (and since Anne wrote it, I must
assume in Opera too) - it is impossible, by accident or otherwise, to
override the encoding of pages that include the BOM. NOTE: Accidental
overriding can happen as a side effect of overriding the current page,
since HTML browsers - to varying degrees - remember a manual encoding
override also for other pages that you open in the same Tab/Window.
If you like, you could also add that these advantages are not as
important for XML documents, since those ultimately default to UTF-8
anyhow.
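To illustrate point a), here is a minimal Python sketch of what
"signature" means in practice: a check of the first bytes of the file.
The function name sniff_bom and the labels it returns are my own, for
illustration only; this is not the algorithm from the Encoding spec,
just the kind of first-bytes test it builds on.

```python
import codecs


def sniff_bom(data: bytes):
    """Return the encoding signalled by a leading BOM, or None."""
    # The UTF-8 BOM is the three bytes EF BB BF.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # The UTF-16 BOMs are two bytes: FE FF (big endian), FF FE (little endian).
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16le"
    # No signature: the caller has to fall back to HTTP, <meta> or a default.
    return None
```

A page without the signature returns None here, which is exactly the
case where the default encoding can kick in.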
(III.) Under the subheading 'Quirks mode in Internet Explorer' (beneath
'Potential issues with the UTF-8 BOM'[2]), please replace 'Internet
Explorer 6' with 'Internet Explorer 5.5'. (I verified this - again -
today, using the fine service at http://netrenderer.de.) (If one follows
the link to the article on 'Serving HTML & XHTML', you already make it
clear there that IE6 is _not_ affected:[3] "With Internet Explorer 6,
however, if anything other than a byte-order mark appears before the
DOCTYPE declaration the page is rendered in quirks mode." You should
bring the new BOM article into alignment with that.)
(IV.) Under the subheading 'Transcoding', it is said:
"If you change the encoding of a UTF-8 file from a Unicode encoding
to something else, you must ensure that the BOM is removed.
If you don't either the browser will continue to treat your
content as UTF-8, or you will see strange characters at the
beginning of the page."
Remarks. To say "If you change the encoding of a UTF-8 file from a
Unicode encoding to something else" sounds strange, for two reasons:
a) It is obvious that a 'UTF-8 file' uses a 'Unicode encoding'.
b) 'non-Unicode encoding' is better than 'something else'.
Suggested reformulation: "If you change the encoding of a
Unicode-encoded file to a non-Unicode encoding, then …".
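For what it is worth, the step the quoted text describes can be
sketched in a few lines of Python. The function name
transcode_from_utf8 and the target windows-1252 are my own assumptions,
picked only as an example of a non-Unicode encoding; the article does
not prescribe any tool.

```python
import codecs


def transcode_from_utf8(data: bytes, target: str = "windows-1252") -> bytes:
    """Convert UTF-8 bytes to a non-Unicode encoding, dropping any BOM."""
    # The "utf-8-sig" codec decodes UTF-8 and skips a leading BOM if present.
    text = data.decode("utf-8-sig")
    # Raises UnicodeEncodeError if the text has characters the target lacks.
    return text.encode(target)


source = codecs.BOM_UTF8 + "café".encode("utf-8")
converted = transcode_from_utf8(source)

# What the three BOM bytes look like if they are left in place and the
# file is then read as windows-1252: the "strange characters".
leftover = codecs.BOM_UTF8.decode("windows-1252")
```

Here leftover is the familiar "ï»¿" that shows up at the start of a
page when the BOM was not removed.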
(V.) Also, regarding the sentence that goes, quote: "You should also
be aware that, although ASCII is a subset of UTF-8, a file that starts
with a BOM is no longer ASCII-compatible." Here I would propose to
replace "a file that starts [etc]" with "an otherwise ASCII-encoded file
that starts with a BOM is no longer ASCII-compatible".
But it is tempting to add that it can also be an ADVANTAGE that the
BOM makes the page ASCII-incompatible in this way. Just imagine: A
simple BOM, and voila, we are in Unicode land rather than in ISO-8859-1
land. Because ASCII is interpreted as ISO-8859-1 - and friends - on the
Web. (Yes, if you declare the page to be ASCII, the browser still
interprets it as Latin-1.) Thus, an ASCII-encoded page on the Web is,
strictly speaking, not ASCII-compatible! From that angle, the page
would be more ASCII-compatible if you *added* the BOM. This matters
e.g. if the page accepts input from the user (via a form). Thus,
essentially, we are back at the ADVANTAGES of the BOM. Strictly
speaking, if the BOM creates a problem with regard to
ASCII-compatibility, then we are at the subject of *transcoding*, which
should be a rare and academic exercise! See below.
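A small Python sketch of what "no longer ASCII-compatible" means at the
byte level, in case it helps with the wording. The helper is_ascii and
the sample bytes are mine, purely for illustration.

```python
import codecs

plain = b"Hello"                    # pure 7-bit ASCII bytes
with_bom = codecs.BOM_UTF8 + plain  # the same text with a UTF-8 BOM in front


def is_ascii(data: bytes) -> bool:
    """True only if every byte is in the 7-bit ASCII range."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


# A BOM-aware UTF-8 reader sees identical text in both files.
same_text = with_bom.decode("utf-8-sig") == plain.decode("ascii")
```

The strict ASCII check fails only for the file with the BOM, while a
BOM-aware UTF-8 reader sees the same text in both.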
(VI.) Also, it seems that "Sometimes the encoding of a file is changed
('transcoded')" should be moved to directly under the subheading
'Transcoding'.
(VII.) And I think the Transcoding section would do well to
recommend against transcoding Unicode/UTF-8 encoded documents. And
thus, in that connection, you could add that the section on transcoding
relates to rare/academic situations.
(VIII.) Btw, the current text also seems to presume that the reader
knows that he/she must - in addition to removing the BOM - *also*
replace it with a (correct) <meta> charset declaration etc. I
think you should not presume that! You make too much fuss about the
problems of the BOM here, I feel …
[1] <http://www.unicode.org/mail-arch/unicode-ml/y2012-m07/0333.html>
[2] <http://www.w3.org/International/questions/new/qa-byte-order-mark-new.en.php#problems>
[3] <http://www.w3.org/International/articles/serving-xhtml/#declaration>
--
leif halvard silli