At 22:55 03/10/08 -0700, Johnny Stenback wrote:
>Francois Yergeau wrote:
>[...]
>>While this is sufficient for strict interoperability, it is not for
>>compatibility of code. If there is not at least one required encoding, it
>>is not possible to write a DOM program that will work over any DOM
>>implementation. We insist that at least UTF-8 be required. Furthermore,
>>since XML 1.0 did it back in 1998, it cannot be so onerous to require all 3.
>>Please reconsider.
>
>Agreed, the spec now requires that those 3 encodings must be supported
>when dealing with XML data.
This is very valuable progress. However, I wonder how the DOM is able
to make the distinction between little-endian and big-endian versions
of UTF-16.
Please note that simply using UTF-16LE and UTF-16BE does not solve this
issue, because neither UTF-16LE (as defined in RFC 2781) nor UTF-16BE
(as also defined in RFC 2781) take a BOM, but UTF-16 for XML requires
a BOM.
There are the following four variants for UTF-16?? and XML:
UTF-16, big-endian:
- requires BOM
- parsers required to parse
- if labeled (Content-Type header or 'encoding' pseudo-attribute),
label is "UTF-16"
UTF-16, little-endian:
- requires BOM
- parsers required to parse
- if labeled (Content-Type header or 'encoding' pseudo-attribute),
label is "UTF-16"
UTF-16BE:
- BOM prohibited
- parsers may or may not parse (like any legacy encoding)
- Needs to be labeled:
Content-Type: type/subtype;charset=UTF-16BE or
<?xml version='1.0' encoding='UTF-16BE'?>
UTF-16LE:
- BOM prohibited
- parsers may or may not parse (like any legacy encoding)
- Needs to be labeled:
Content-Type: type/subtype;charset=UTF-16LE or
<?xml version='1.0' encoding='UTF-16LE'?>
My recommendation is that the DOM requires that DOM implementations
are able to write out 'UTF-8' and 'UTF-16', and that the choice of
endianness for UTF-16 is left to the implementation (because
XML Processors can deal with both endiannesses, and there is
always a BOM).
The alternatives would be:
- To introduce an additional parameter to be able to specify
the endianness if the encoding choosen is UTF-16. Probably
too much work for what it gets you.
- To redefine 'UTF-16BE' and 'UTF-16LE' to mean 'UTF-16 big
endian' and 'UTF-16 little endian' only for the specific
case of DOM save. This would create a rather confusing special
case, would make it impossible to actually write out something
like 'UTF-16BE' (as defined in RFC 2781), would require
somebody who wants to write out UTF-16 to change from
'UTF-16' to 'UTF-16BE' (most people would probably forget to
do so), and would still require to define what happens on
output if the encoding is 'UTF-16' (anything from 'always
big endian' to 'implementation defined' to 'forbidden').
Regards, Martin.