Before getting into the details of a schema for an XML syntax for declaring
character entities, I think we should step back and ask what the real
requirements are.
What XML did to SGML was preserve SGML's extensibility where it was really
needed (for elements and attributes) but remove it where people could get
by without it (e.g. delimiter syntax). Which category do character entity
names fall in? It is not obvious to me that there is a requirement that
character entities be user extensible to the same extent that elements and
attributes are. Consider the following points:
- in SGML days most people used the standard entity sets
- at any point in time the set of things that are being referenced by
character entities is closed (i.e. the set of Unicode characters) modulo
private use characters (which are typically deprecated on the Web),
although it may evolve over time; this is quite different from the
situation with elements and attributes
- Unicode provides a standard set of names for all Unicode characters
- I don't see the compelling user requirement for different users to be
able to use different names for the same character
- having the 5 builtin entities in XML has worked out pretty well; in
particular, there is no need to clutter the infoset or DOM with them; they
are just generated as needed on output
- if you have user-defined character entity names, then users will start
demanding the ability to preserve those names, which means that the
DOM/SAX/Infoset will need to record which entity name if any was used for a
character
So I'm wondering whether a more constrained approach to character entities
would work. Suppose for example there is a standard W3C-defined builtin
entity set; this would have a version number and would add new characters
from time to time (but never change existing entity names). There would be
a standard mapping from a version number to a URI where an XML specification
of the entity set would be available. However, parsers wouldn't have to
fetch and parse this, they could just recognize the version number and
refer to an appropriate compiled-in table. The XML declaration would
declare the version number of the builtin entity set that was being used;
if the XML declaration didn't specify a version number, only the 5 XML 1.0
builtin entities could be used. Just as now, the SAX/DOM/infoset wouldn't
record whether a particular character was entered literally or using a
builtin entity reference. Instead programs that serialize XML (like XSLT)
would have options saying when to use builtin entity references to
represent characters.
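To make this concrete, here is a rough sketch in Python of how a parser and
a serializer might behave under such a scheme. The version strings, the
table contents, and the function names resolve and escape are all made up
for illustration; nothing here is a proposed API.

```python
# The 5 entities built into XML 1.0; always available.
XML_1_0_BUILTINS = {"lt": "<", "gt": ">", "amp": "&", "apos": "'", "quot": '"'}

# Hypothetical compiled-in tables keyed by entity-set version.
# A later version only adds names; existing names never change.
ENTITY_SETS = {
    "1": {"nbsp": "\u00a0", "eacute": "\u00e9"},
    "2": {"nbsp": "\u00a0", "eacute": "\u00e9", "alpha": "\u03b1"},
}


def resolve(name, version=None):
    """Return the character for an entity name, given the declared version.

    With no declared version, only the XML 1.0 builtins are recognized.
    """
    if name in XML_1_0_BUILTINS:
        return XML_1_0_BUILTINS[name]
    if version is None:
        raise KeyError(name)
    if version not in ENTITY_SETS:
        raise ValueError("unknown entity set version: " + version)
    # Raises KeyError if this version does not define the name.
    return ENTITY_SETS[version][name]


def escape(text, version=None, name_non_markup=False):
    """Serialize text; optionally emit builtin entity names for characters.

    The parser side records nothing about how a character was entered;
    the choice of representation is made here, on output.
    """
    reverse = {}
    if version is not None and name_non_markup:
        reverse = {ch: name for name, ch in ENTITY_SETS[version].items()}
    markup = {"<": "&lt;", ">": "&gt;", "&": "&amp;"}
    out = []
    for ch in text:
        if ch in markup:
            out.append(markup[ch])
        elif ch in reverse:
            out.append("&" + reverse[ch] + ";")
        else:
            out.append(ch)
    return "".join(out)
```

With these made-up tables, resolve("alpha", "2") succeeds while
resolve("alpha", "1") and resolve("alpha") both fail, and the serializer
decides on its own whether to write a character literally or by name.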
For the first version of the standard builtin entity set we could start with
- HTML entities
- MathML entities
- maybe a set of entity names algorithmically generated from the standard
Unicode names in Unicode 3.2; for example, U+0E01, which has a Unicode name
of "THAI CHARACTER KO KAI", might be entered as &thai_character_ko_kai;.
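As a minimal sketch of the last item, assuming the rule is simply "lowercase
the Unicode name and replace spaces with underscores" (the exact rule would
need to be decided), the mapping could be computed in both directions with
Python's unicodedata module. Note that the module uses whatever Unicode
version the Python build ships, not necessarily 3.2, and the function names
are made up for illustration.

```python
import unicodedata


def entity_name_for(char):
    """Derive a hypothetical entity name from a character's Unicode name."""
    # unicodedata.name raises ValueError for characters that have no name.
    return unicodedata.name(char).lower().replace(" ", "_")


def char_for_entity_name(name):
    """Invert the mapping: entity name back to the character."""
    # Unicode names never contain underscores, so the mapping is reversible.
    # unicodedata.lookup raises KeyError if no character has this name.
    return unicodedata.lookup(name.replace("_", " ").upper())


print("&" + entity_name_for("\u0e01") + ";")  # prints &thai_character_ko_kai;
```

Hyphens in Unicode names (e.g. "HYPHEN-MINUS") are left alone under this
rule, since a hyphen is allowed inside an XML name.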
James