L2/08-206
Title: Code Point Labels -- Clarification of a Character Name Issue
Source: Ken Whistler
Date: May 2, 2008
**************************************************************
Background
During the run-up to the release of Unicode 5.1, an issue
regarding character names -- or more precisely, the values
of the Unicode character property Name -- came up and
saw some discussion amongst the editorial committee.
The issue came to a head precisely because Unicode 5.1 saw
the release of the UCD in XML, and the XML data files need
an unambiguous determination of the value to be attached
to the "na=??" attribute listings for code points, even
in those cases where it isn't so obvious whether or not
the code point even *has* a name -- namely for control
codes, unassigned code points, noncharacters, etc., which
we don't typically think of as having Unicode character
names.
The question then became just exactly what should be entered
in the XML data for the na attribute for a code point like U+0009.
Should it be na="" or na="<control>" or na="<control-0009>"
or even possibly something else?
The issue is a fraught one, because Name is formally an
immutable property and because it is one of the key values
that is maintained in synchrony with ISO/IEC 10646 and
is additionally subject to longstanding syntactic constraints
limiting the allowed characters in names -- which certainly
don't include "<", for example.
In this document, I summarize existing practice and then
propose adopting the definition of a "Code Point Label" -- as distinct
from a Unicode Character Name -- as a way to both encompass
current practice and needs for such labels and to avoid
confusion and destabilization for the character names per se.
**************************************************************
Existing Practice
As it stands now, both in the Unicode Standard and in
ISO/IEC 10646, only graphic and format characters officially
have character names.
If you refer to Table 2-3, Types of Code Points, in TUS 5.0
(p. 27), Controls (gc=Cc), Private-use (gc=Co), Surrogate
code points (gc=Cs), Noncharacters (gc=Cn, Noncharacter_Code_Point=T),
and Reserved unassigned (gc=Cn, Noncharacter_Code_Point=F) do not have
names.
This fact, however, has not prevented either standard from printing
*strings* at code point locations for such code points, in the
same slot that one would expect a character name to occur -- and
this has occasionally been taken as an indication that those strings
are in fact names for those code points.
In the printed versions of the Unicode names list, we have the
following conventions:
Controls (gc=Cc), print the string "<control>"
Noncharacters (gc=Cn, Noncharacter_Code_Point=T), print the string
"<not a character>"
Reserved (gc=Cn, Noncharacter_Code_Point=F), print the string "<reserved>"
Surrogate code points and Private-use code points are simply
never listed, so have no such conventions.
Traditionally, ISO/IEC 10646 has used the following conventions:
Reserved, print the string "(This position shall not be used)"
Noncharacters, print the string "(This position is permanently reserved)"
Controls, Surrogate code points, and Private-use code points
are never listed, so have no such conventions.
Nobody ever mistook the 10646 strings as "names", in part because
they spelled out complete sentences. But the Unicode names list
conventions appear more as name-like labels -- and there is a
further complication involved. The issue is this: the Unicode
names list, NamesList.txt, is itself a data file that both
drives the typesetting of the actual code charts for Unicode
and is itself derived from other data files -- most
importantly the core UCD data file, UnicodeData.txt.
UnicodeData.txt has some conventional use of fields to assist
in the derivation of NamesList.txt. To wit:
Field 1 is where the normative Unicode name appears for an
ordinary (Graphic or Format) character in UnicodeData.txt. So:
002C;COMMA;Po;0;CS;;;;;N;;;;;
     ^^^^^
However, in order to carry information about other types of
assigned code points, UnicodeData.txt also contains entries
for Controls, with values in field 1 corresponding to what
gets printed in the code charts. So:
0009;<control>;Cc;0;S;;;;;N;CHARACTER TABULATION;;;;
     ^^^^^^^^^
It also contains entries for Surrogate code points and
for Private-use code points, with values in field 1
corresponding to yet another set of conventions, including
strings which are *not* printed in the code charts. So:
DC00;<Low Surrogate, First>;Cs;0;L;;;;;N;;;;;
     ^^^^^^^^^^^^^^^^^^^^^^
E000;<Private Use, First>;Co;0;L;;;;;N;;;;;
     ^^^^^^^^^^^^^^^^^^^^
There are no entries whatsoever in UnicodeData.txt for
Noncharacters or Reserved unassigned code points.
In the derivation of NamesList.txt, the string "<not a character>"
for Noncharacters is inserted by the program that is used
to generate NamesList.txt, rather than being parsed from
UnicodeData.txt. The string "<reserved>" for Reserved
unassigned code points is also inserted by that program --
but only for the few code points that actually require explicit
listing because they need to be there as placeholders for
cross-reference annotations. Any other "<reserved>" strings
that appear printed in the actual code charts are inserted
by an entirely distinct program, unibook, which is used
for chart formatting, based on logic that handles the
formatting for ranges of unassigned code points within
printed blocks.
O.k., to this point what should be clear is that field 1
values in UnicodeData.txt cannot be taken verbatim as
being equivalent to the values of the Name property.
Values in field 1 using angle brackets are not actually
names, but serve as kinds of meta-labels for use in other
conventions.
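
The distinction drawn above can be sketched in a few lines of Python.
This is only an illustration, not the logic of any actual UCD tool:
the function name name_from_field1 is hypothetical, and the sample
lines are the ones quoted in this section. Note that the sketch
simply yields an empty value for every angle-bracketed entry; range
entries for characters whose names are derived by rule would need
separate handling.

```python
# Sample lines in UnicodeData.txt format (semicolon-separated fields).
SAMPLE = """\
002C;COMMA;Po;0;CS;;;;;N;;;;;
0009;<control>;Cc;0;S;;;;;N;CHARACTER TABULATION;;;;
DC00;<Low Surrogate, First>;Cs;0;L;;;;;N;;;;;
E000;<Private Use, First>;Co;0;L;;;;;N;;;;;
"""


def name_from_field1(line):
    """Return (code_point, name) for one UnicodeData.txt line.

    Hypothetical helper: field 1 is taken as a name only when it is
    not an angle-bracketed meta-label; otherwise the name is "".
    """
    fields = line.split(";")
    code_point = int(fields[0], 16)   # field 0: code point in hex
    field1 = fields[1]                # field 1: name or meta-label
    is_meta_label = field1.startswith("<") and field1.endswith(">")
    return code_point, "" if is_meta_label else field1


names = dict(name_from_field1(line) for line in SAMPLE.splitlines())

for cp, name in names.items():
    print(f"U+{cp:04X} name={name!r}")
```

Only U+002C ends up with a non-empty value here; the other three
entries carry meta-labels in field 1 and so are treated as nameless.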
**************************************************************
Unique Code Point Labels
Over the years, there actually has developed yet another
set of conventions for labelling code points. These are
the conventions used by Mark Davis' suite of tools that
generate derived property files and related data files.
The pattern Mark's tools follow is to use a unique label
for *every* code point, so that any listing of properties
can include identical format comments for listing
"names" for each code point or range of code points, whether
or not it involves an ordinary character with an ordinary
Unicode character name.
The conventions Mark uses are derived from the meta-labels long
used in the names list, namely "<control>", "<not a character>",
and "<reserved>", but he extended and modified them to
have labels for each different code point type, including
surrogates and private-use, and by adding the code point
as part of the label, to make them unique.
Here is a summary of the actual current usage:
Graphic & Format: Use Unicode character name
Control: <control-NNNN>
Reserved: <reserved-NNNN>
Noncharacters: <noncharacter-NNNN>
Private-Use: <private-use-NNNN>
Surrogates: <surrogate-NNNN>
Where the NNNN gets turned into the code point in hex, using
our usual 4 - 6 digit convention for code point values.
This style of labelling code points uniquely has proven useful,
and nobody has really objected to it. But as yet it has no
official status other than simply praxis in comment fields
in the UCD.
A problem arose, however, when it was asserted that these
unique code point labels should then be taken as the values
of the na (Name) attribute in the XML data files for the UCD.
**************************************************************
Summary
It seems clear that at least some implementations have seen
a need for having unique, identifier-like labels for *all*
Unicode code points, not merely assigned Unicode characters
with the familiar unique Unicode character names. I don't
see any reason not to accommodate this need, but feel it is
important not to have these labels confused with the
actual Unicode character names.
Accordingly, I'd like the UTC to nail these issues down,
formally distinguish between Unicode character names
and code point labels, and fully specify both. To this
end, the following consists of a proposal for discussion
and (I hope) adoption.
**************************************************************
Proposal
1. Unicode Character Name
The UCD Name property (short alias: "na") is a string property.
Its value for all Graphic and Format characters is the
Unicode character name as generally understood.
For Graphic and Format characters other than ideographs and Hangul
syllables, the name is as listed in field 1 of UnicodeData.txt.
For Hangul syllables, the name is derived by rule, as specified
in Section 3.12, under "Hangul Syllable Name Generation",
making use of the values of the Jamo_Short_Name property.
For ideographs, the name is derived by rule, by concatenating
the string "CJK UNIFIED IDEOGRAPH-" or "CJK COMPATIBILITY IDEOGRAPH-"
(or other as specified, e.g. "TANGUT IDEOGRAPH-") to the code
point, expressed in hexadecimal, with the usual 4 to 6 digit
convention.
For all *other* Unicode code points of all types, the
value of the UCD Name property is the null string. I.e., na="".
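
The two rule-based derivations above can be sketched as follows. This
is a rough illustration under stated assumptions, not a normative
implementation: the function names are hypothetical, the jamo short
names are transcribed by hand and should be checked against the
Jamo_Short_Name data, and the caller is assumed to supply the correct
prefix for an ideograph rather than having the code look up ranges.

```python
# Constants for the arithmetic decomposition of Hangul syllables.
S_BASE = 0xAC00
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT          # 588
S_COUNT = L_COUNT * N_COUNT          # 11172

# Jamo short names, hand-transcribed (an assumption to verify).
JAMO_L = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
          "SS", "", "J", "JJ", "C", "K", "T", "P", "H"]
JAMO_V = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA",
          "WAE", "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI",
          "I"]
JAMO_T = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM",
          "LB", "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS",
          "NG", "J", "C", "K", "T", "P", "H"]


def hangul_syllable_name(cp):
    """Derive the name of a precomposed Hangul syllable by rule."""
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError(f"U+{cp:04X} is not a Hangul syllable")
    l_index = s_index // N_COUNT
    v_index = (s_index % N_COUNT) // T_COUNT
    t_index = s_index % T_COUNT
    return ("HANGUL SYLLABLE "
            + JAMO_L[l_index] + JAMO_V[v_index] + JAMO_T[t_index])


def ideograph_name(prefix, cp):
    """Concatenate a prefix with the 4 to 6 digit hex code point."""
    return f"{prefix}{cp:04X}"


print(hangul_syllable_name(0xD4DB))
print(ideograph_name("CJK UNIFIED IDEOGRAPH-", 0x4E00))
```

The "{cp:04X}" format pads to four hex digits and grows to five or
six as needed, which matches the usual code point convention.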
Note that the Unicode Name property values are unique for
all non-null values, but not every Unicode code point has
a unique Unicode Name property value. Furthermore, the
Name property value uniqueness requirement interacts with
name assignment rules for formal aliases and for
named character sequences: Unicode character names, formal
aliases, and named character sequences constitute a single,
unique namespace.
As a corollary to this specification, it should be noted that
the value of field 1 (the string of characters between the
semicolon separators) is to be taken as the normative specification
of the UCD Name property only for Graphic and Format
characters other than ideographs and Hangul syllables. All
other values which occur in field 1 are to be understood
as meta-labels that serve other functions in the generation
of names lists and charts, or to label abbreviated ranges of
property definitions, but do *not* constitute values of the
UCD Name property.
2. Unicode Code Point Type Label
For each of the seven major types of Unicode code points,
there is a unique string label, as follows:
Graphic: graphic
Format: format
Control: control
Reserved: reserved
Noncharacters: noncharacter
Private-Use: private-use
Surrogates: surrogate
3. Unicode Code Point Label
The Unicode code point label is a unique label for *every*
Unicode code point in the entire range: U+0000..U+10FFFF.
The code point label is distinguished from the
expression of the code point per se (i.e. "U+0000"
or "U+0061"), which itself is also a unique identifier,
as described in Appendix A, Notational Conventions.
(Or see also Clause 6.5 Short identifiers for code positions
(UIDs) in ISO/IEC 10646.)
The Unicode code point label is a unique string value
defined as follows:
For any Unicode code point for which the value of the
UCD Name property is non-null, the code point label
is identical to the Unicode character name. This will
be the case for all Graphic and Format code points.
Otherwise, the code point label is constructed as follows:
Concatenate the code point type label for the code
point, "-", plus the 4 to 6 digit representation of
the code point.
More specifically, the code point labels are as follows:
Control: control-NNNN
Reserved: reserved-NNNN
Noncharacters: noncharacter-NNNN
Private-Use: private-use-NNNN
Surrogates: surrogate-NNNN
When displayed in mixed contexts with Unicode character
name values, to avoid any possible confusion with actual,
non-null Unicode Name values, constructed Unicode code point labels
are displayed between angle brackets: <control-0009>,
<noncharacter-FFFF>, etc.
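
The construction in items 2 and 3 can be sketched as below. This is a
minimal sketch under assumptions, not a reference implementation: the
function names are hypothetical, and the General_Category and Name
values come from Python's unicodedata module, whose data reflects
whatever Unicode version the Python build ships with (so which code
points count as reserved may differ from one installation to another).

```python
import unicodedata

# Code point type labels for the nameless types that can be read
# directly from General_Category. gc=Cn needs a further test.
TYPE_LABEL_BY_GC = {
    "Cc": "control",
    "Co": "private-use",
    "Cs": "surrogate",
}


def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def code_point_label(cp, bracketed=False):
    """Return the code point label for any cp in U+0000..U+10FFFF."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    ch = chr(cp)
    name = unicodedata.name(ch, "")
    if name:
        # Non-null Name: the label is identical to the character name.
        return name
    gc = unicodedata.category(ch)
    if gc == "Cn":
        type_label = "noncharacter" if is_noncharacter(cp) else "reserved"
    else:
        # A KeyError here would mean the library has no name for a
        # code point that is not of one of the nameless types.
        type_label = TYPE_LABEL_BY_GC[gc]
    label = f"{type_label}-{cp:04X}"
    # Optional angle brackets for display in mixed contexts.
    return f"<{label}>" if bracketed else label


for cp in (0x0009, 0x0041, 0xD800, 0xE000, 0xFFFF):
    print(f"U+{cp:04X}", code_point_label(cp, bracketed=True))
```

Only constructed labels are bracketed; a code point with a non-null
Name value is returned as its plain character name either way.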