Re: [xml-dev] Is it a well-formedness error to use a character not in the encoding specified by the XML declaration?

From: rjelliffe@allette.com.au

To: xml-dev@lists.xml.org

Date: Fri, 19 Mar 2010 20:21:54 +1100

> Rick,
> Unicode tech reports 22 and 36 both describe transcoding producing both 1A
> and FFFD characters as a result of character mismatches depending on
> context and direction. It appears to me that 1a can be introduced when
> transcoding either into or out of Unicode, but this is not my area of
> specialisation.
> Could you point me at where the XML standard says that transcoding
> problems that result in the introduction of substitution characters into
> transcoded text should "cause processing to report an error"? I had a look
> for exactly this earlier and must have missed it. The W3C document seems to
> leave transcoding issues to the Unicode standards. U+FFFD is apparently a
> valid XML character so there should be no issue with processing it.
There are two issues:
1) What should an XML processor do when faced with a bad byte sequence?
The answer is very clear: s4.3.3.
"It is a fatal error if an XML entity is determined (via default,
encoding declaration, or higher-level protocol) to be in a certain
encoding but contains byte sequences that are not legal in that encoding."
2) Is the character FFFD allowed in data?
Again, the answer is very clear: s2.2
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]
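
To make the production concrete, here is a minimal sketch in Python of a check against [2] Char. The function name is_xml_char is only an illustration, not part of any XML library. It shows that U+FFFD is a legal XML character, while U+001A (SUB) is not legal in XML 1.0.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if code point cp matches production [2] Char of XML 1.0.

    is_xml_char is a hypothetical helper name used for illustration only.
    """
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


print(is_xml_char(0xFFFD))  # True: U+FFFD is inside [#xE000-#xFFFD]
print(is_xml_char(0x1A))    # False: SUB is below #x20 and is not #x9, #xA or #xD
```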
So as I said, XML can have U+FFFD in data, but not put there by a transcoder.
So I don't think it is correct behaviour to fall back to any character,
including U+FFFD, especially silently. Silent failure undercuts XML's
approach.
(I will modify this: however, an implementation could choose to put in a
SUB or FFFD or any other signal anywhere it likes, as long as it is clear
that the DOM or stream or whatever is not WF XML and there has been a
fatal error. But this is not something "allowed" by XML or Unicode,
because by this stage you don't have XML.)
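
As a rough sketch of the difference, assuming Python's built-in codecs stand in for the transcoder: decode_entity and decode_entity_lossy are hypothetical names, not any parser's API. The first treats an illegal byte sequence as fatal. The second substitutes U+FFFD but also returns a flag, so the caller cannot mistake the result for well-formed XML.

```python
def decode_entity(raw: bytes, encoding: str) -> str:
    """Decode strictly: an illegal byte sequence is reported, never replaced."""
    try:
        return raw.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(
            f"fatal error: illegal byte sequence for {encoding} "
            f"at offset {exc.start}"
        ) from exc


def decode_entity_lossy(raw: bytes, encoding: str) -> tuple[str, bool]:
    """Return (text, fatal). U+FFFD is substituted only alongside fatal=True."""
    try:
        return raw.decode(encoding, errors="strict"), False
    except UnicodeDecodeError:
        # The replacement text is for diagnostics only; it is no longer XML.
        return raw.decode(encoding, errors="replace"), True


bad = b"caf\xe9!"  # 0xE9 is ISO-8859-1 "e acute", not a legal UTF-8 sequence here
text, fatal = decode_entity_lossy(bad, "utf-8")
print(fatal)       # True
```

The thing to avoid is calling the "replace" handler alone and passing the result on as if nothing had happened.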
On the issue of what to do if you are using some magical encoding that has
characters that are not in Unicode, it is a really specialist topic and
should not be confused with the general case. (There are a few CJK
dictionary character repertoires which have more characters than Unicode,
for example. However, these are not in any off-the-shelf transcoders so it
is not this case.)
Cheers
Rick Jelliffe