L2/01-194
Source: Eric Muller on
Date: 05/07/2001 09:20:17 AM
Proposal: Formalizing the Unicode Private Use Area
I'd like to add the following to the UTC meeting agenda: Formalizing
the Unicode Private Use Area. I have attached a draft proposal I wrote
a while back. Given some recent feedback, let me preface this document
a little bit:
I firmly believe in the idea of Unicode being a universal collection
of characters, that every character will ultimately be part of that
collection, and that the first order of business when encountering a
not-yet registered character is to give it a Unicode semantics (i.e.
values of properties) and submit it for inclusion in Unicode.
We also have to recognize that this process is not instantaneous, and
there is currently a void in the interim period. When the user
community for the characters is small, well-connected and uses a small
number of tools (e.g. a group of academics working on a new script),
an informal agreement on the meaning of the characters is usually
enough. However, there are cases where the informal process breaks
down. For example, the PRC is on track to mandate the support of GB
18030 for products sold in the PRC; this standard includes a (small)
number of characters that are not part of Unicode 3.1, and defines
mappings to the PUA. There are simply too many players for the
informal agreement to occur.
I believe it would be advantageous to have a description of the PUA
characters that gives them a complete Unicode semantics. If Unicode
had a mechanism similar to the one proposed here, it would have
encouraged the PRC to provide Unicode semantics for the
not-yet-in-Unicode characters. Products could have mechanisms to
extend their built-in Unicode database to those characters, helping in
the interim period. As importantly, the proposal for inclusion of the
characters in Unicode would be half done. And there would be something
to help the processing of legacy documents after the inclusion of the
characters.
Eric.
Table of Contents
1. Motivation
2. Terminology
3. Requirements
4. Overall Structure
5. Characters
6. Collections
7. Related work
1. Motivation
The Unicode standard is a constantly evolving character collection, and
there may be times when one needs a character that is not yet part of the
standard. Unicode recognizes this situation:
[p23] A contiguous area of codes has been set aside for private use.
Characters in this area will never be defined by the Unicode Standard.
These codes can be freely used for characters of any purpose, but
successful interchange requires an agreement between sender and receiver on
their interpretation.
Indeed, a document that uses PUA code points does not have a meaning by
itself, just like a document where the encoding is not specified has no
meaning by itself.
First and foremost, this note provides a means to build those agreements.
The idea is that a document could specify a semantics for the Private Use
Area characters it contains, at the same level as Unicode specifies a
semantics for the assigned characters (i.e. those that are part of the
Unicode repertoire). Just like Unicode, part of the semantics is formalized
and represented in a machine readable form, and part of it is informal.
2. Terminology
A gaiji character is a character that is not part of the Unicode repertoire
and is encoded in the PUA. In this document, there is no intention to
restrict gaiji characters to ideographs. Of course, this notion is relative
to a particular version of the Unicode standard.
3. Requirements
The design goals are:
Define a syntax to describe the formal part of the Unicode semantics of
characters. By describing gaiji characters in that way, they can become
full participants in Unicode processing. For example, one could indicate
that a new character COMBINING REVERSE SOLIDUS OVERLAY is in combining
class 1 and processors that deal with combinations would do the right
thing on this character. (By the way this is not an innocent example:
this character was accepted for inclusion in June 1999, but will be part
of Unicode only after version 3.0; so right now, it's a gaiji.)
Make that syntax extensible, so that additional properties can be
attached. For example, there could be indications for Input Method
Editors on how they should let the user input those characters.
Define a syntax to organize character descriptions in collections and to
combine collections. Consider the case where Alice's document uses one
collection of private characters, and Bob's document uses another one,
and Charles creates a document that combines Alice's and Bob's documents.
While this example may seem contrived, replace persons by machines and
it suddenly looks a lot more real.
Make that syntax extensible, so that additional properties can be
attached. For example, a collection could indicate where an appropriate
font could be found.
Make these descriptions human-legible and easy to process by programs.
Practically this means that descriptions can be built using simple tools
such as text editors, yet they can be incorporated in sophisticated
document processing systems.
Allow two collections to overlap (i.e. to assign different characters to
the same code point) to avoid central administration, and provide a
mechanism to reconcile them. What if Alice and Bob both used U+E732 in
their collections?
Support the naming and referencing of character collections, in
particular over the Internet. Clearly, there will be collections of
gaiji characters that will be used in a number of documents. Repeating
all the character descriptions in all the documents would be a
logistical nightmare. In addition, it would make it difficult to know if
the code value U+E732 represents the same character in two different
documents. At least, if both documents reference the same collection (or
more precisely, if the code value was assigned by the same
subcollection), this guarantee can be given.
Define mechanisms to incorporate or attach references to character
collections to documents.
4. Overall Structure
The goals dealing with extensibility, human readability, and machine
processing are easily satisfied by using XML. This document describes a DTD.
Open: Should we go directly to XML Schema instead?
Open: Usual questions about DTDs: what characters should we use in element
names (-, _, camelCase)? Elements or attributes?
Open: Use namespaces for the extensions?
5. Characters
The unicode-name element encloses the Unicode name of that character. It is
not applicable to gaiji characters.
The name element is used for non-Unicode characters.
Exactly one of unicode-name and name must be present.
The unicode-1.0-name element encloses the Unicode 1.0 name of the
character, if it exists.
The alternative-names element encloses a set of alternative-name elements,
which in turn enclose alternative names for this character.
The code element encloses the Unicode code value of the character, using
the U+xxxx syntax.
The char element contains a single character, which is the character
itself.
The cross-references element encloses a set of cross-ref elements. Each
cross-ref element contains a code element and a name element for the
character which is referenced. The cross-ref element has a role attribute
which can take the values inequal or other. The default value for that
attribute is other.
The compatibility-decomposition element contains a sequence of characters
into which the character being described can be compatibly decomposed.
The canonical-decomposition element contains the characters into which the
character being described is canonically decomposed.
case can have the values UPPERCASE, TitleCase or lowercase.
combining-class encloses the combining class (in its numeric form).
directionality encloses the directionality property.
jamo-short-name encloses the Jamo short name property. It can be present
only for Unicode conjoining Hangul jamo characters.
general-category encloses the general category property.
numeric-values is present if the character is a number. It encloses the
numeric value as recorded in section 4.6. In addition, the attribute value
is the numeric value represented as a decimal number, without ',' to
separate the character groups. The attribute decimal can take the values
yes or no.
mirrored is present for those characters that have the mirrored property.
mathematical is present for characters that have the mathematical property.
Open: Look at the other properties on the Unicode cdrom in proplist.txt.
decimal CDATA #IMPLIED>
The informative-note element contains an informative note.
Open: What should be the DTD in there? A fragment of docbook? The itsy
bitsy dtd?
Finally, these elements are assembled in a character element:
Here are some examples:
LATIN CAPITAL LETTER A   U+0041   A   LR
COMBINING REVERSE SOLIDUS OVERLAY   U+E000   1
DOLLAR SIGN   milreis   escudo   U+0024   $   LR   currency sign   0A4
Glyph may have one or two vertical bars.
other currency symbol characters: 20A0 ₠ - 20AF ₯
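The element descriptions in this section can be pictured as a simple record. The following is only a minimal Python sketch, not the DTD itself: the class name Character, the helper parse_code, and the choice to accept four to six hexadecimal digits after "U+" are assumptions made for illustration. It models a handful of the elements (code, unicode-name, name, combining-class) and enforces the rule that exactly one of unicode-name and name must be present.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumption: "U+" followed by 4 to 6 hexadecimal digits.
CODE_RE = re.compile(r"U\+([0-9A-Fa-f]{4,6})$")


def parse_code(text: str) -> int:
    """Convert a code written in the U+xxxx syntax to an integer."""
    match = CODE_RE.match(text)
    if match is None:
        raise ValueError(f"not in U+xxxx syntax: {text!r}")
    return int(match.group(1), 16)


@dataclass
class Character:
    """Hypothetical record holding a few of the elements described above."""
    code: str                           # content of the code element
    unicode_name: Optional[str] = None  # content of unicode-name
    name: Optional[str] = None          # content of name (gaiji characters)
    combining_class: Optional[int] = None

    def __post_init__(self) -> None:
        # Exactly one of unicode-name and name must be present.
        if (self.unicode_name is None) == (self.name is None):
            raise ValueError(
                "exactly one of unicode-name and name must be present")
        parse_code(self.code)  # reject a malformed code early

    @property
    def code_point(self) -> int:
        return parse_code(self.code)


latin_a = Character(code="U+0041", unicode_name="LATIN CAPITAL LETTER A")
gaiji = Character(code="U+E000",
                  name="COMBINING REVERSE SOLIDUS OVERLAY",
                  combining_class=1)
```

A processor that extends its built-in character database could load such records and look characters up by code_point.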
6. Collections
Collections are formed by grouping characters and by combining collections.
A collection is well-formed iff:
No two characters have the same name, where the name of a character is
defined as the value of the unicode-name element or the value of the name
element, whichever is present.
No two characters have the same code.
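The two well-formedness conditions can be checked in a single pass. The sketch below is illustrative only: the Char tuple and the function is_well_formed are hypothetical names, and a character is reduced to its name (from unicode-name or name) and its code point.

```python
from typing import Iterable, NamedTuple


class Char(NamedTuple):
    name: str  # value of unicode-name or name, whichever is present
    code: int  # code point as an integer


def is_well_formed(chars: Iterable[Char]) -> bool:
    """True iff no two characters share a name and no two share a code."""
    names, codes = set(), set()
    for ch in chars:
        if ch.name in names or ch.code in codes:
            return False
        names.add(ch.name)
        codes.add(ch.code)
    return True
```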
An enumerated-collection is just a set of character elements.
A ref-collection references an external collection (that is, external to
the resource in which this reference occurs). It must have a system
identifier, a URI, which may be used to retrieve the referenced collection.
Relative URIs are relative to the location of the resource within which the
ref-collection occurs. In addition, there may be a public identifier. A
processor attempting to retrieve the referenced collection may use the
public identifier to try to generate an alternative URI. If the processor
is unable to do so, it must use the URI specified in the system identifier.
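The retrieval rule for a ref-collection might look as follows. This is a sketch under assumptions: the function resolve_ref and the catalog mapping from public identifiers to alternative URIs are hypothetical, since this note does not say how a processor derives a URI from a public identifier. Relative system identifiers are resolved with the standard urljoin function.

```python
from typing import Mapping, Optional
from urllib.parse import urljoin


def resolve_ref(base_uri: str,
                system_id: str,
                public_id: Optional[str] = None,
                catalog: Optional[Mapping[str, str]] = None) -> str:
    """Return the URI from which to retrieve a referenced collection.

    base_uri is the location of the resource containing the ref-collection.
    catalog is a hypothetical table mapping public identifiers to URIs.
    """
    # The processor may use the public identifier to find an alternative.
    if public_id is not None and catalog and public_id in catalog:
        return catalog[public_id]
    # Otherwise it must use the system identifier, resolved against the
    # location of the enclosing resource when it is relative.
    return urljoin(base_uri, system_id)
```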
A union-collection groups the characters of multiple collections. If the
set-wise union of those collections is not well-formed, the conflicting
characters of the later collections are removed from the union.
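One way to read the union rule is sketched below. The names Char and union_collection are hypothetical, and the sketch assumes that "later" refers to the order in which the collections are listed: a character is dropped when its name or its code is already taken by a character from an earlier collection.

```python
from typing import Iterable, List, NamedTuple


class Char(NamedTuple):
    name: str
    code: int


def union_collection(*collections: Iterable[Char]) -> List[Char]:
    """Union of collections in which earlier collections win conflicts."""
    result: List[Char] = []
    names, codes = set(), set()
    for collection in collections:
        for ch in collection:
            # A clash on name or code would break well-formedness, so the
            # character from the later collection is left out.
            if ch.name in names or ch.code in codes:
                continue
            names.add(ch.name)
            codes.add(ch.code)
            result.append(ch)
    return result
```

By construction the result is always well-formed, whatever the inputs.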
A subsetted-collection removes some of the characters of a base collection.
The characters to remove are identified by their code value.
A remapped-collection reassigns new code points to the characters of a base
collection.
A simple-map just lists pairs of code points. Characters which are not
listed as the source of a pair are mapped to their original code point. No
two pairs should map from the same character. The map should not assign two
different characters to the same code point.
A shift-map adds an offset (positive or negative) to each code point. By
construction it preserves well-formedness.
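The two kinds of maps can be sketched as plain functions. The names Char, simple_map and shift_map below are hypothetical stand-ins for the corresponding elements. Representing a simple-map as a dictionary from source to target code point makes it impossible for two pairs to map from the same character; the check that two characters do not land on the same code point is done explicitly.

```python
from typing import Iterable, List, Mapping, NamedTuple


class Char(NamedTuple):
    name: str
    code: int


def simple_map(base: Iterable[Char], pairs: Mapping[int, int]) -> List[Char]:
    """Remap the listed code points; unlisted characters keep theirs."""
    remapped = [Char(ch.name, pairs.get(ch.code, ch.code)) for ch in base]
    codes = [ch.code for ch in remapped]
    if len(set(codes)) != len(codes):
        raise ValueError(
            "map assigns two different characters to the same code point")
    return remapped


def shift_map(base: Iterable[Char], offset: int) -> List[Char]:
    """Add a (positive or negative) offset to every code point."""
    # Distinct code points stay distinct after adding a constant, which
    # is why a shift cannot break well-formedness.
    return [Char(ch.name, ch.code + offset) for ch in base]
```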
These are the only maps:
And this completes the means of constructing collections:
Here are some examples.
COMBINING REVERSE SOLIDUS OVERLAYU+E0001
Here is another collection that uses the same PUA code point, but defines
it differently:
Adobe LogoU+E0001
Let's assume that our first collection is accessible via the URI
http://atm.corp.adobe.com/chc/eric.chc and the second is accessible via the
URI http://oranda.corp.adobe.com/chc/adobecorp.chc. Just forming the union
of those collections will drop one of the two PUA characters (the one in
the collection mentioned second). The following collection can be built
for documents that need both PUA characters:
In documents that use this collection, the code point U+E000 refers to the
Adobe Logo character, and the code point U+E001 refers to the COMBINING
REVERSE SOLIDUS OVERLAY character.
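The reconciliation just described can be replayed in a few lines. This is a sketch of one possible construction, assumed for illustration: the first collection is shifted by one code point and then united with the second. The variable and function names are hypothetical, and each collection is reduced to a mapping from code point to character name.

```python
from typing import Dict

Collection = Dict[int, str]  # code point -> character name

eric: Collection = {0xE000: "COMBINING REVERSE SOLIDUS OVERLAY"}
adobecorp: Collection = {0xE000: "Adobe Logo"}


def shift(collection: Collection, offset: int) -> Collection:
    """Remap every character by adding an offset to its code point."""
    return {code + offset: name for code, name in collection.items()}


def union(*collections: Collection) -> Collection:
    """Union in which characters of later collections lose conflicts."""
    result: Collection = {}
    for collection in collections:
        for code, name in collection.items():
            if code not in result and name not in result.values():
                result[code] = name
    return result


# A plain union drops the character of the collection mentioned second.
plain = union(eric, adobecorp)

# Shifting the first collection by one code point keeps both characters.
combined = union(adobecorp, shift(eric, 1))
for code in sorted(combined):
    print(f"U+{code:04X} {combined[code]}")
# prints:
# U+E000 Adobe Logo
# U+E001 COMBINING REVERSE SOLIDUS OVERLAY
```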
7. Related work
The first source of inspiration is the XML world. In an XML document, the
element names that are used have no particular meaning by themselves, just
like the PUA code points have no meaning. But in the XML world, this is the
norm rather than the exception and mechanisms have been designed to cope
with that. In fact, these were a major source of inspiration: DTDs and XML
schemas are similar to character collections, namespaces correspond to the
collection bases, and the collection naming and referencing is based on DTD
naming and referencing.
The W3C NOTE A Notation for Character Collections for the WWW by Martin
Dürst is an XML DTD to describe sets of character code values. The main
objective is to be able to answer the question "Is this character code in
this collection?". Particular attention is paid to support efficient
implementation when the set descriptions are resources on the network.
While this is useful when the sets are made of standard characters, it's
really not enough to deal with private use characters, as it does not
attach a meaning to them.
The ConScript Unicode Registry by John Cowan and Michael Everson is a
registry of Private Use Area uses. The goal of this effort is really to
have a centralized allocation of the private use area. It does not attempt
to record semantics of the characters.
Copyright (c) 2001 Adobe Systems Inc.