ISO
INTERNATIONAL ORGANIZATION FOR STANDARDIZATION
ORGANISATION INTERNATIONALE DE NORMALISATION

ISO/IEC JTC 1/SC 2/WG 2

Universal Multiple-Octet Coded Character Set (UCS)

ISO/IEC JTC1/SC2/WG2 N2536
L2/02-440
Date: 2002-11-24

Title: Constraints on Character Names for Loose Matching

Source: US national body

Status: Submission

Action: Request for addition to Policies and Procedures

For property names, the Unicode Consortium recommends loose string
matching: only letters and digits should be taken into account when matching.
In particular, spaces and hyphens are disregarded in loose matching. The
Unicode Character Property and Property Value aliases are vetted to make sure
that this does not cause collisions: that the aliases will always remain
distinct even if only letters and digits are considered in matching.
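
As an illustration of the loose matching described above, the following is a minimal Python sketch, not a normative algorithm. The function names `loose_key` and `loose_equal` are hypothetical, and the case-folding step is an assumption of the sketch: the text above only states that letters and digits are taken into account.

```python
import unicodedata


def loose_key(alias: str) -> str:
    """Reduce an alias to the letters and decimal digits it contains."""
    kept = [
        ch for ch in alias
        # Letters are general category L*, decimal digits are Nd.
        if unicodedata.category(ch).startswith("L")
        or unicodedata.category(ch) == "Nd"
    ]
    # Assumption of this sketch: comparison is also case-insensitive.
    return "".join(kept).casefold()


def loose_equal(a: str, b: str) -> bool:
    """True if two aliases are the same under loose matching."""
    return loose_key(a) == loose_key(b)
```

Under this sketch, spellings such as "Line_Break", "line break" and "LINE-BREAK" all reduce to the same key, while "Line_Break" and "Word_Break" remain distinct.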

Such loose matching can be used in a variety of environments. It is
especially useful in Regular Expressions, where sets of characters are built up
using processes.

Loose matching is also very useful for Unicode character names in such
environments. There are currently only three cases where loose matching fails:

With such a limited number of exceptions, one can still match loosely, by
special-casing these three exceptions. As it turns out, the match can even be
slightly looser than with property aliases: one can also remove all instances
of the letter sequences "LETTER", "CHARACTER", and "DIGIT" and still not have
collisions; those are essentially "noise" words (in terms of loose matching).

The US National Body recommends that the UTC and WG2 adopt a constraint on
future character names, so that loose matching can be easily performed (with
the exception of the above three characters). The Unicode Technical Committee
has accepted this proposal, and also recommends the adoption by WG2 in the
policies and procedures.

The specific proposal is:

Whenever a character name is assigned to a new character, that name will be
distinct from all existing character names, even if the following
transformation were to be performed:

1. Remove all characters except for letters and decimal digits.
   Letters and decimal digits are those with general-category = L or Nd in
   the Unicode Character Database.

2. Remove all instances of the letter sequences "LETTER", "CHARACTER",
   "DIGIT".
   This is only applicable to the English normative character names, not to
   translated names.

3. Case-fold all characters.
   This is only applicable to translated names that may contain both
   uppercase and lowercase characters.
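
The transformation above, and the check that a candidate name stays distinct under it, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not part of the proposal: the names `name_key` and `find_collisions` are hypothetical, the noise sequences are removed in a single left-to-right pass after the first step, and case-folding is applied to every name (for the uppercase English names this does not affect which names collide).

```python
import re
import unicodedata
from collections import defaultdict

# The three "noise" letter sequences named in the proposal.
_NOISE = re.compile("LETTER|CHARACTER|DIGIT")


def keep_letters_and_digits(name: str) -> str:
    """Drop everything except general category L* and Nd."""
    return "".join(
        ch for ch in name
        if unicodedata.category(ch).startswith("L")
        or unicodedata.category(ch) == "Nd"
    )


def name_key(name: str, english: bool = True) -> str:
    """Apply the transformation to one character name."""
    key = keep_letters_and_digits(name)
    if english:
        # Only the English normative names have the noise sequences removed.
        key = _NOISE.sub("", key)
    # Assumption: case-fold unconditionally; it only matters for
    # translated names that mix uppercase and lowercase.
    return key.casefold()


def find_collisions(names):
    """Return groups of names that share a key after transformation."""
    groups = defaultdict(set)
    for name in names:
        groups[name_key(name)].add(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

With made-up names, `name_key("EXAMPLE LETTER KA")` and `name_key("EXAMPLE KA")` both yield `"exampleka"`, so `find_collisions` would report the pair, and under the proposed constraint only one of the two could be assigned.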

Note: clause 2 does not exclude the words
LETTER, CHARACTER and DIGIT from future names. Instead, it just ensures that
those words are not required in order to distinguish two character
names. That is, one couldn't have both of the following, although one could
have either one: