At UTC #116, a decision was taken to prepare a Public Review Issue on the
topic of Code Point Labels.

The following material consists of the suggested edits to the Unicode
Standard to accomplish the formal introduction of Code Point Labels. This text
is taken from L2/08-263, with minor additions as suggested during discussion.

If approved, the text would then be remanded to the editorial committee for
the detailed editorial work for eventual insertion into the text of Unicode 5.2
(or Unicode 6.0). Additional clarifying text would be inserted into the text of
UAX #44, for which there is a separate PRI.

Code Point Labels are suggested as a means of clarifying what exactly is
meant by the normative Unicode Name property (the "na" attribute, as recorded in
the XML version of the UCD), as opposed to strings constructed to label code
points that don't actually have assigned Unicode characters. They would then
also formally define the conventions already widely used in the UCD (and
elsewhere) for referring to Unicode code points without assigned Unicode
characters.

The UTC is seeking feedback from the public regarding the general approach
here, as well as any detailed suggestions on the wording proposed.

Outstanding issue: The UTC will need to determine whether Code Point Labels,
as defined here, will be considered immutable. That is, would such labels
formally be considered a Unicode code point property and, if so, be unchangeable
once assigned? This would parallel the way Unicode character names, per se, are
handled. (Note that there would need to be an obvious exception for reserved
code points, which can get new characters assigned to them, and thus acquire an
actual Unicode character name.)

[[ As a 4th bullet under definition D4 Character Name in Chapter 3, insert ]]

The detailed specification of the Unicode character names, including rules
for derivation of some ranges of characters, is given in Section 4.8, "Name --
Normative". That section also describes the relationship between the normative
value of the Name property and the contents of the corresponding data field in UnicodeData.txt in the Unicode Character Database.

[[ Incorporate the following text in Section 4.8, "Name -- Normative", as a
subsection, with appropriate editorial adjustments to other existing text in
that section. ]]

Unicode Character Name

The Name property (short alias: "na") is a string property. Its value for all
Graphic and Format characters is the Unicode character name as generally
understood.

For Graphic and Format characters other than ideographs and Hangul syllables,
the name is as listed in field 1 of UnicodeData.txt.

For Hangul syllables, the name is derived by rule, as specified in Section
3.12, under "Hangul Syllable Name Generation", making use of the values of the
Jamo_Short_Name property.
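
As an illustration only, the rule-based derivation for Hangul syllables can be sketched in Python. The constants and short-name tables follow the arithmetic decomposition given in Section 3.12; the function name hangul_syllable_name is hypothetical and is not part of any specified API.

```python
# Sketch of Hangul syllable name derivation (assumed helper, not normative).
S_BASE = 0xAC00
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed syllables in total

# Jamo_Short_Name values for leading consonants, vowels, trailing consonants.
JAMO_L = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
          "SS", "", "J", "JJ", "C", "K", "T", "P", "H"]
JAMO_V = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA",
          "WAE", "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
JAMO_T = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG",
          "LM", "LB", "LS", "LT", "LP", "LH", "M", "B", "BS", "S",
          "SS", "NG", "J", "C", "K", "T", "P", "H"]


def hangul_syllable_name(cp: int) -> str:
    """Return the derived name for a precomposed Hangul syllable."""
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError(f"U+{cp:04X} is not a precomposed Hangul syllable")
    l_index = s_index // N_COUNT
    v_index = (s_index % N_COUNT) // T_COUNT
    t_index = s_index % T_COUNT
    return "HANGUL SYLLABLE " + JAMO_L[l_index] + JAMO_V[v_index] + JAMO_T[t_index]


print(hangul_syllable_name(0xAC00))  # HANGUL SYLLABLE GA
print(hangul_syllable_name(0xD4DB))  # HANGUL SYLLABLE PWILH
```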

For ideographs, the name is derived by rule, by concatenating the string "CJK
UNIFIED IDEOGRAPH-" or "CJK COMPATIBILITY IDEOGRAPH-" (or other as specified,
e.g. "TANGUT IDEOGRAPH-") to the code point, expressed in hexadecimal, with the
usual 4 to 6 digit convention. The exact ranges subject to these name
derivations are specified by a name range convention used in field 1 of
UnicodeData.txt.
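
The concatenation rule for ideographs can be sketched as follows. This is a minimal illustration, assuming the caller has already determined from the ranges in UnicodeData.txt which prefix applies; the function name derived_ideograph_name is hypothetical.

```python
def derived_ideograph_name(prefix: str, cp: int) -> str:
    """Append the code point in uppercase hexadecimal to a name prefix.

    The :04X format pads to at least 4 digits and grows to 5 or 6 digits
    for supplementary code points, matching the 4 to 6 digit convention.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    return f"{prefix}{cp:04X}"


print(derived_ideograph_name("CJK UNIFIED IDEOGRAPH-", 0x4E00))
# CJK UNIFIED IDEOGRAPH-4E00
print(derived_ideograph_name("CJK UNIFIED IDEOGRAPH-", 0x20000))
# CJK UNIFIED IDEOGRAPH-20000
print(derived_ideograph_name("CJK COMPATIBILITY IDEOGRAPH-", 0xF900))
# CJK COMPATIBILITY IDEOGRAPH-F900
```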

For all other Unicode code points of all types, the value of the UCD Name
property is the null string. In other words, na="".

Note that the Unicode Name property values are unique for all non-null
values, but not every Unicode code point has a unique Unicode Name property
value. Furthermore, the Name property value uniqueness requirement interacts
with name assignment rules for formal aliases and for named character sequences:
Unicode character names, formal aliases, and named character sequences
constitute a single, unique namespace.

As a corollary to this specification, it should be noted that the value of
field 1 (the string of characters between the semicolon separators) is to be
taken as the normative specification of the UCD Name property only for Graphic
and Format characters other than ideographs and Hangul syllables. All other
values which occur in field 1 are to be understood as meta-labels that serve
other functions in the generation of names lists and charts, or to label
abbreviated ranges of property definitions, but do not constitute values of
the UCD Name property per se.

[[ In TUS 5.0, on page 79, after the existing definition D10 Code Point,
insert the following new definitions. ]]

D10a Code Point Type: Any of the seven fundamental classes of code points in
the standard: Graphic, Format, Control, Private-Use, Surrogate, Noncharacter,
Reserved.

See Table 2-3, "Types of Code Points" for a summary of the meaning and
use of each class.

For Noncharacter, see also D14 Noncharacter.

For Reserved, see also D15 Reserved code point.

For Private-Use, see also D49 Private-use code point.

For Surrogate, see also D71 High-surrogate code point and D73 Low-surrogate
code point.

D10b Code Point Type Label: A unique label for each code point type.

Each code point type label is a lowercase string, defined according to
the following table.

Table 1: Code Point Type Labels

Type            Label
Graphic         graphic
Format          format
Control         control
Reserved        reserved
Noncharacter    noncharacter
Private-Use     private-use
Surrogate       surrogate

D10c Code Point Label: A unique label for each code point in the Unicode
codespace.

[[ Edit the following specification for the code point label to an
appropriate set of bullets and/or body text, to fill out the definition. ]]

The code point label is distinguished from the expression of the code point
per se (for example, "U+0000" or "U+0061"), which itself is also a unique
identifier, as described in Appendix A, Notational Conventions. (See also Clause
6.5 Short identifiers for code positions (UIDs) in ISO/IEC 10646.)

The Unicode code point label is a unique string value defined as follows:

For any Unicode code point for which the value of the UCD Name property is
non-null, the code point label is identical to the Unicode character name.
This will be the case for all Graphic and Format code points.

Otherwise, the code point label is constructed as follows:

Concatenate the code point type label for the code point, "-", and the 4 to
6 digit hexadecimal representation of the code point.

Table 2: Construction of Code Point Labels

Type            Label
Control         control-NNNN
Reserved        reserved-NNNN
Noncharacter    noncharacter-NNNN
Private-Use     private-use-NNNN
Surrogate       surrogate-NNNN

When displayed in mixed contexts with Unicode character name values, to avoid
any possible confusion with actual, non-null Unicode Name values, constructed
Unicode code point labels are displayed between angle brackets: <control-0009>,
<noncharacter-FFFF>, etc.
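
The construction rule and the angle-bracket display convention can be sketched as a small formatting helper. This sketch assumes the code point type label has already been determined; the function name construct_label and its bracketed parameter are hypothetical.

```python
# The five type labels for which a label is constructed rather than
# taken from the Name property.
CONSTRUCTED_TYPE_LABELS = {
    "control", "reserved", "noncharacter", "private-use", "surrogate",
}


def construct_label(type_label: str, cp: int, bracketed: bool = False) -> str:
    """Build a constructed code point label, optionally in angle brackets."""
    if type_label not in CONSTRUCTED_TYPE_LABELS:
        raise ValueError(f"no constructed label for type {type_label!r}")
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    # :04X gives at least 4 uppercase hex digits, up to 6 for U+10FFFF.
    label = f"{type_label}-{cp:04X}"
    return f"<{label}>" if bracketed else label


print(construct_label("control", 0x0009, bracketed=True))       # <control-0009>
print(construct_label("noncharacter", 0xFFFF, bracketed=True))  # <noncharacter-FFFF>
print(construct_label("private-use", 0x10FFFD))                 # private-use-10FFFD
```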

APIs which return the value of a Unicode "name" for a given code point might
vary somewhat in their behavior. An API which is defined as strictly returning
the value of the Unicode Name property (the "na" attribute) should return a
null string for any Unicode code point other than graphic or format characters,
as that is the actual value of the property for such code points. On the other
hand, an API which returns a name for Unicode code points, but which is expected
to provide useful, unique labels for unassigned, reserved code points and other
special code point types, should return the Unicode Code Point Label instead.
As defined above, this will be the same as the Unicode Name property value for
all graphic and format characters.
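
The two API behaviors can be contrasted with a short sketch built on Python's standard unicodedata module. The function names name_property and code_point_label are hypothetical, and the mapping from General_Category values to code point types is an assumption of this sketch. Which code points count as reserved depends on the Unicode version bundled with the Python runtime.

```python
import unicodedata

# General_Category values for code point types that have no character name.
_TYPE_BY_CATEGORY = {"Cc": "control", "Cs": "surrogate", "Co": "private-use"}


def name_property(cp: int) -> str:
    """Strict API: the Name property value, or "" when it is null."""
    return unicodedata.name(chr(cp), "")


def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF, plus the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def code_point_label(cp: int) -> str:
    """Label API: the character name if non-null, else a constructed label."""
    name = name_property(cp)
    if name:
        return name
    category = unicodedata.category(chr(cp))
    if category == "Cn":
        type_label = "noncharacter" if is_noncharacter(cp) else "reserved"
    elif category in _TYPE_BY_CATEGORY:
        type_label = _TYPE_BY_CATEGORY[category]
    else:
        # A graphic or format character whose name this library cannot supply.
        raise LookupError(f"no name available for U+{cp:04X}")
    return f"{type_label}-{cp:04X}"


print(repr(name_property(0x0009)))  # ''
print(code_point_label(0x0009))     # control-0009
print(code_point_label(0x0041))     # LATIN CAPITAL LETTER A
print(code_point_label(0xFFFF))     # noncharacter-FFFF
```

In this sketch the strict function never invents a value, while the label function falls back to a constructed label only when the Name property is null.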