The Character Model Working Draft proposes early uniform normalization
of XML documents on the web [1]. This is intended mainly to solve the
problem of string identity matching [2].
We believe that the proposal relies on an unenforceable social contract,
and falls short of a complete solution.
These are our concerns:
1) The proposed model is the opposite of the successful model currently
in use on the web, which is that the consumer must be prepared to accept
inputs from producers of unknown quality. For HTML browsers, "bad" data
is subjected to whatever fixup is necessary to display something. For
XML, "bad" data is rejected by strict well-formedness and possibly
validity
checks. If we could trust all producers to supply well-formed XML and
valid HTML, well-formedness checks or structural fix-ups would be
unnecessary. Conversely, why should a consumer trust that producers
will universally supply correctly normalized data when the consumer
can't trust them to produce well-formed XML or valid HTML?
2) The model requires a level of trust in producers. Unnormalized
content may cause errors whenever that trust is violated. Because early
normalization is a social contract rather than a technical constraint,
it is impossible to enforce.
3) Without enforcement, the problems early uniform normalization
purports to solve are not in fact solved. The assumption is that there
is currently a small amount of unnormalized data that has the potential
to mess up applications. Early uniform normalization seems to be
designed to keep the amount of unnormalized data small, not to deal with
the consequences of such data. In fact it may hinder dealing with
unnormalized data, since it precludes certain strategies for coping with
it (namely, late normalization).
4) The performance penalty imposed upon producers is severe, both for
high-speed XML generators (speed) and for constrained environments
(memory).
Since the model is unenforceable and doesn't offer a complete solution,
there is little incentive for a particular product to incur this
penalty.
5) The character model should consider augmenting early uniform
normalization with enforcement by consumers. For instance, an XML
processor could reject unnormalized input in the same way it rejects
well-formedness violations. Not only does this protect applications
from errors to some extent, but it provides the most powerful type of
encouragement for producers to normalize early. Assuming that verifying
correct normalization is much cheaper than normalizing itself, this
seems like a reasonable compromise on the performance side as well. Did
the I18N group consider this? A single surgical change to XML 1.0,
rejecting unnormalized documents as not well-formed, would go a long
way toward encouraging proper behavior by producers (including HTML as
it evolves into XHTML).
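The consumer-side enforcement suggested in point 5 can be sketched as
follows. This is our own illustration, not part of any proposal: the
function name and error message are invented, and we use Python's
standard unicodedata module on the assumption that an NFC quick-check
pass is much cheaper than normalization itself.

```python
import unicodedata

# Hypothetical consumer-side check: reject unnormalized input the same
# way an XML processor rejects well-formedness violations. Verifying
# NFC is typically a single quick-check pass over the text, whereas
# normalizing may allocate and rewrite the whole string.
def check_normalized(text: str) -> None:
    if not unicodedata.is_normalized("NFC", text):
        raise ValueError("input is not NFC-normalized")

check_normalized("caf\u00e9")       # precomposed U+00E9: accepted

try:
    check_normalized("cafe\u0301")  # "e" + U+0301 (combining acute)
except ValueError:
    print("rejected unnormalized input")
```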
6) We note that using a text editor which outputs NFC-normalized text
does not guarantee that W3C-normalized XML or HTML is produced, because
of the need to rewrite entities as illustrated in [3]. Thus, the
responsibility for creating W3C-normalized text cannot be placed fully
upon text editors; it requires specific knowledge of normalization
constraints (in this case, on the part of the individual author).
Suggestions that most products already produce NFC-normalized output
therefore do not apply to W3C-normalized XML and HTML created with text
editors.
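The entity problem in point 6 can be shown with a tiny hypothetical
fragment (our own example, not the one from [3]). The source the
editor sees is pure ASCII, so the editor's NFC output guarantee is
trivially satisfied; yet once a processor expands the character
reference, the resulting text is no longer normalized.

```python
import unicodedata

# Editor-visible source: pure ASCII, hence trivially NFC-normalized,
# regardless of how careful the editor is.
raw = "cafe&#x301;"

# After the processor expands the reference to U+0301 (COMBINING ACUTE
# ACCENT), "e" followed by the combining accent is not NFC; NFC
# requires the precomposed U+00E9.
expanded = raw.replace("&#x301;", "\u0301")

print(unicodedata.is_normalized("NFC", raw))       # True
print(unicodedata.is_normalized("NFC", expanded))  # False
```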
- Jonathan Marsh
Microsoft
[1] http://www.w3.org/TR/charmod/#sec-Normalization
[2] http://www.w3.org/TR/WD-charreq#2.1
[3] http://www.w3.org/TR/charmod/#sec-TextNormalization