Reading the status of the document, one would believe that the erratum E9
change to the characters allowed in tag names and attribute names is the
only substantive change.
But E11 seems to increase the number of characters available in actual
content by moving from Unicode 3.x to Unicode 5.
Some have commented that they believed the sentence "XML processors MUST
accept the UTF-8 and UTF-16 encodings of Unicode 3.1" meant that encodings
for characters not in Unicode 3.1 were not allowed. I don't read it that
harshly, but I can see how they would claim that characters not in Unicode
3.1 should be avoided in content because XML processors are not required
to support them, so interop cannot be guaranteed.
Now the sentence is changing so that Unicode 3.1 is effectively being
replaced with Unicode 5. Wouldn't it be easier to nip this in the bud now
by converting a UTF-8 encoding into the corresponding 32-bit value,
regardless of whether or not it maps to something in Unicode K (where
K>=5)? Then, you could say which of those 32-bit values are illegal (e.g.
the permanently undefined Unicode characters), and which should be avoided
(e.g. the compatibility characters).
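
To make the suggestion concrete, here is a rough sketch in Python of what I
have in mind. It is only an illustration, not proposed spec text. The names
classify and classify_utf8 are made up for this example. I am assuming that
"permanently undefined" means the Unicode noncharacters (U+FDD0 to U+FDEF,
plus every value whose low 16 bits are FFFE or FFFF), and I am approximating
"compatibility character" as anything with a compatibility decomposition in
whatever Unicode database the Python runtime ships, which is a stand-in for a
list the spec would have to fix itself.

```python
import unicodedata


def classify(cp: int) -> str:
    """Classify a decoded value without asking which Unicode version assigned it."""
    # Assumption: "permanently undefined" = the Unicode noncharacters.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "illegal"
    # Assumption: a compatibility decomposition (tagged like "<compat>")
    # marks a compatibility character. This lookup depends on the runtime's
    # Unicode data; a spec would enumerate the values instead.
    if unicodedata.decomposition(chr(cp)).startswith("<"):
        return "avoid"
    # Everything else is accepted, including values no Unicode version
    # has assigned yet.
    return "ok"


def classify_utf8(data: bytes) -> list[tuple[int, str]]:
    """Decode UTF-8 to integer values first, then classify each value."""
    # Strict decoding rejects malformed sequences and surrogates.
    return [(ord(ch), classify(ord(ch))) for ch in data.decode("utf-8")]
```

With this shape, a value such as U+0378, which is unassigned at the time of
writing, still decodes and is classified as "ok", so a processor would not
need to change when Unicode K+1 assigns it.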
John M. Boyer, Ph.D.
Senior Technical Staff Member
Lotus Forms Architect and Researcher
Chair, W3C Forms Working Group
Workplace, Portal and Collaboration Software
IBM Victoria Software Lab
E-Mail: boyerj@ca.ibm.com
Blog: http://www.ibm.com/developerworks/blogs/page/JohnBoyer
Blog RSS feed:
http://www.ibm.com/developerworks/blogs/rss/JohnBoyer?flavor=rssdw