Known Anomalies in Unicode Character Names

Summary

This document provides information on many known anomalies in the
formal character names in the
Unicode Standard.

Status

This document is a Unicode Technical Note. Sole responsibility
for its contents rests with the author(s). Publication does not imply any
endorsement by the Unicode Consortium. This document is not subject to the
Unicode Patent
Policy.

Introduction

In this document we list all Unicode character
names known, at the time of its writing, to contain clerical errors in
their spelling. In addition, we have compiled
information on many misnamed characters, misleading character
names, and characters whose names have other known problems.

Because the Unicode Standard is a character encoding
standard and not the Universal Encyclopedia of Writing
Systems and Character Identity, the stability and uniqueness
of published character names are far more important than the correctness of
the names. The published character names are normative for the purposes of the
Unicode Standard and the large number of other IT standards that
reference it. These standards require stable identifiers, so character names must
be immutable: any change to a character name is almost
as disruptive to those standards as changing the code point of a
character would be. Accordingly, the Unicode Consortium has adopted the
Unicode Standard Stability Policy,
which prevents changes to character names. As a result, errors in character names
cannot be corrected. Instead, important character name anomalies are
documented with annotations in the
Unicode Character Code
Charts.

The requirement for a unique and stable character name that can be used as a
formal identifier does not mean that the Unicode Standard dictates to
anyone what the name of any given letter in their writing system
should properly be, whether in English or in any other language. The Unicode
Code Charts provide informative aliases for a large number of characters, the
names of which are not anomalous or defective. This is because different user
communities often use different names for the same character, even in English.

One reason the Unicode Standard publishes many
informative aliases in the Unicode names list is that there are often
much better, more communicative names for particular characters, even in English, than the normative names in the data file.
For example, U+002F SOLIDUS is more widely known among its American users as
slash. Informal aliases are useful in describing a character, but cannot be
used as identifiers, because they are not guaranteed to be unique or stable. Users are free
to use such aliases and other names, as long as they are not misrepresented as corrections
to the standard, but are instead used as alternative, more useful
names for characters in the standard.
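
The difference between a normative name and an informal alias can be seen with a small Python sketch. It assumes a Python 3 interpreter and its standard unicodedata module; the helper name resolve is hypothetical and serves only this illustration.

```python
import unicodedata

def resolve(name):
    """Return the character for a formal name or alias, or None if unknown."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return None

# The normative name of U+002F is SOLIDUS.
print(unicodedata.name("/"))        # SOLIDUS

# The formal name works as an identifier ...
print(resolve("SOLIDUS") == "/")    # True

# ... but the informal alias "slash" is not a formal identifier.
print(resolve("SLASH"))             # None
```

The informal name remains perfectly usable in prose and documentation; it simply cannot be relied on where a unique, stable identifier is required.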

For character names that were encoded with misspelled words as part of their
name, or that exhibit other serious errors,
the Unicode Standard has adopted normative character name aliases. These aliases can be used as an
alternative, normative identifier for the character without the need to preserve
the original spelling or other error in the character name. While this
means that some characters can have more than one identifier, each identifier
continues to refer uniquely to a single character. Formal aliases are documented
in the NameAliases.txt file in the Unicode
Character Database. Formal name aliases are also documented in the
Unicode Code Charts. We have not documented them here; instead, we merely
indicate for which characters formal aliases exist at the time of this writing.
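
As a rough sketch of how formal aliases might be read programmatically, the Python example below parses records in the layout used by recent versions of NameAliases.txt, where each record has three semicolon-separated fields: code point, alias, and alias type. The embedded sample is a short excerpt for illustration only, and the function name parse_name_aliases is hypothetical; a real program would read the file that ships with the version of the Unicode Character Database it targets.

```python
from collections import defaultdict

# Illustrative excerpt in NameAliases.txt layout: code point;alias;type
SAMPLE = """\
# Lao corrections
0E9D;LAO LETTER FO FON;correction
0E9F;LAO LETTER FO FAY;correction
0EA3;LAO LETTER RO;correction
0EA5;LAO LETTER LO;correction
"""

def parse_name_aliases(text):
    """Map each code point to a list of (alias, type) pairs."""
    aliases = defaultdict(list)
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        code, alias, kind = line.split(";")
        aliases[int(code, 16)].append((alias, kind))
    return dict(aliases)

table = parse_name_aliases(SAMPLE)
for cp, entries in sorted(table.items()):
    print(f"U+{cp:04X}", entries)
```

A list is kept per code point because a character may carry more than one alias, while each alias still identifies exactly one character.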

In some cases, annotations have been added to the names list in the Unicode
Standard to document various lesser problems, but to date there has been no full
listing of all known problems.

The authors therefore intend this Technical Note to serve as a convenient summary of the information about character
name anomalies in the Unicode Standard at the time of its writing. It will be updated from
time to time as additional anomalies become known. While the information in this
technical note is based on information published in the Unicode Standard,
the selection and manner of presentation in this document reflect choices
made by its authors; it does not in any way supersede the information in the
Unicode Standard.

List of Known Anomalies and Explanations

This section lists character names with known anomalies, including those for
which a formal alias has been defined. It
provides further information about some names that have been the objects of
discussion or inquiry. As issues are reported, additional entries may be
added at any time and without notice. While many of the explanations below
are based on annotations in the Unicode Code Charts, they have been edited or
re-stated by the authors.

U+0149 LATIN SMALL LETTER N PRECEDED BY APOSTROPHE

Even though this is encoded as a single character, it is not usually
considered a single letter.

U+01A2 LATIN CAPITAL LETTER OI
U+01A3 LATIN SMALL LETTER OI

These should have been called letter GHA. They are neither
pronounced 'oi' nor based on the letters 'o' and 'i'.

U+01BE LATIN LETTER INVERTED GLOTTAL STOP WITH STROKE

This is actually based on a ligation of "ts", not an inverted glottal stop.

U+02C7 CARON
U+030C COMBINING CARON

The "caron" should have been called hacek and
combining hacek. The term "caron"
is suspected by some to be an invention of some early standards
body, but it has also been claimed by others to have been in use at Linotype before the days of digital
typography. Its true origin may be lost in the mists of time.

U+034F COMBINING GRAPHEME JOINER

The name does not describe the function of this character. Despite its name, it does not join graphemes.
For more information, see Section 7.9, Combining Marks, of the
Unicode Standard.
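
For readers who want to inspect the character directly, the following Python sketch prints its formal name, general category, and canonical combining class. It assumes only the standard unicodedata module and is not a substitute for the discussion in the Unicode Standard.

```python
import unicodedata

cgj = "\u034f"
print(unicodedata.name(cgj))        # COMBINING GRAPHEME JOINER
print(unicodedata.category(cgj))    # Mn (nonspacing mark)
print(unicodedata.combining(cgj))   # 0 (canonical combining class)
```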

U+039B GREEK CAPITAL LETTER LAMDA
U+03BB GREEK SMALL LETTER LAMDA

The use of the spelling lamda derives from ISO 10646. This
does not mean that it
is more correct than lambda, merely that the spelling without the 'b' is
the one used in the formal character names.
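
A practical consequence is that software looking up these characters by name has to use the spelling without the 'b'. A minimal Python sketch, assuming the standard unicodedata module (the variable name found is ours):

```python
import unicodedata

# The formal names use the spelling LAMDA.
print(unicodedata.name("\u03bb"))   # GREEK SMALL LETTER LAMDA
print(unicodedata.lookup("GREEK CAPITAL LETTER LAMDA") == "\u039b")   # True

# The more familiar spelling is not a formal identifier.
try:
    unicodedata.lookup("GREEK SMALL LETTER LAMBDA")
    found = True
except KeyError:
    found = False
print(found)                        # False
```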

U+0964 DEVANAGARI DANDA
U+0965 DEVANAGARI DOUBLE DANDA

Despite the fact that these characters have "DEVANAGARI" in their
names, these punctuation marks are intended for common use across the
scripts of India.

U+0A01 GURMUKHI ADAK BINDI

The spelling of the word Adak with a single 'd' is inconsistent
with U+0A71 GURMUKHI ADDAK and should really have had two d's.

U+0B83 TAMIL SIGN VISARGA

This character is the aaytham.

U+0CDE KANNADA LETTER FA

There is no Kannada letter 'fa'; this character represents the
syllable 'llla'.

U+0E9D LAO LETTER FO TAM

The name for this character should have been fo sung, but
that name is already used for U+0E9F. A formal alias LAO LETTER FO FON
correcting this error has been defined.

U+0E9F LAO LETTER FO SUNG

The name for this character should have been fo tam, but that
name is already used for U+0E9D. A formal alias LAO LETTER FO FAY
correcting this error has been defined.

U+0EA3 LAO LETTER LO LING

The name for this character should have been lo loot, but
that name is already used for U+0EA5. A formal alias LAO LETTER RO
correcting this error has been defined.

U+0EA5 LAO LETTER LO LOOT

The name for this character should have been lo ling, but
that name is already used for U+0EA3. A formal alias LAO LETTER LO
correcting this error has been defined.
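
The four Lao entries above can be checked with a short Python sketch. It assumes Python 3.3 or later, where unicodedata.lookup also resolves formal name aliases; the table LAO simply restates the names and aliases given in this section.

```python
import unicodedata

# (code point, original formal name, formal alias), as listed above.
LAO = [
    (0x0E9D, "LAO LETTER FO TAM", "LAO LETTER FO FON"),
    (0x0E9F, "LAO LETTER FO SUNG", "LAO LETTER FO FAY"),
    (0x0EA3, "LAO LETTER LO LING", "LAO LETTER RO"),
    (0x0EA5, "LAO LETTER LO LOOT", "LAO LETTER LO"),
]

results = []
for cp, name, alias in LAO:
    ch = chr(cp)
    # The original name is still what name() reports, and both the
    # original name and the alias resolve to the same character.
    same = (unicodedata.name(ch) == name
            and unicodedata.lookup(name) == ch
            and unicodedata.lookup(alias) == ch)
    results.append(same)
    print(f"U+{cp:04X} {name} / {alias}: {same}")
```

Note that the original names remain valid identifiers: the aliases add a second, corrected identifier for each character rather than replacing the first.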

U+0F0A TIBETAN MARK BKA- SHOG YIG MGO

This character is used to indicate that a document is addressed to a superior (the "petition honorific"), but the Tibetan name actually indicates a superior addressing an inferior ("starting flourish for giving a command").

U+0F0B TIBETAN MARK INTERSYLLABIC TSHEG

The tsheg mark is not restricted to intersyllabic usage, and would have been better named
Tibetan mark tsheg.

U+0F0C TIBETAN MARK DELIMITER TSHEG BSTAR

This character is not a delimiter, but is a non-breaking version of the tsheg mark (U+0F0B) that is used exclusively between the letter NGA (U+0F44) and the shad mark (U+0F0D).

U+0FD0 TIBETAN MARK BSKA- SHOG GI MGO RGYAN

The syllable "BSKA-" does not occur naturally in Tibetan, and is a mistake for "BKA-" (cf. U+0F0A).
A formal alias correcting this error has been defined.

U+156F CANADIAN SYLLABICS TTH

There is no 'tth' syllable. A better name would have been
Canadian Syllabics asterisk.

U+178E KHMER LETTER NNO

As this character belongs to the first register, its correct
transliteration is nna, not NNO.

U+179E KHMER LETTER SSO

As this character belongs to the first register, its correct
transliteration is ssa, not SSO.

U+200B ZERO WIDTH SPACE

This isn't a "space". It is an invisible character
that can be used to provide
line break opportunities.
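
The character's properties reflect this. In the Python sketch below, which assumes the standard unicodedata module and the built-in string methods, U+200B is reported as a format character and is not treated as white space:

```python
import unicodedata

zwsp = "\u200b"
print(unicodedata.name(zwsp))        # ZERO WIDTH SPACE
print(unicodedata.category(zwsp))    # Cf (format), not Zs (space separator)
print(zwsp.isspace())                # False

# Whitespace splitting does not break at U+200B.
parts = "foo\u200bbar".split()
print(len(parts))                    # 1
```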

U+2113 SCRIPT SMALL L

Despite its character name, this symbol is derived from a special
italicized version of the small letter "L".

U+2118 SCRIPT CAPITAL P

Should have been called calligraphic small p or perhaps even
Weierstrass elliptic function symbol, which is what it is used for. It's not a capital "P" at all.

U+262B FARSI SYMBOL

This symbol is so named because, as a symbol of Iran, it
cannot be encoded in ISO standards.

U+0598 HEBREW ACCENT ZARQA
U+05AE HEBREW ACCENT ZINOR

There are two separate
cantillation systems in the Hebrew Bible. One is
used for Psalms, Proverbs and (most of) Job (the "poetic" books, hence
the "poetic system"), and the other is used everywhere else. The two systems
have structural similarities and share some graphemes, but not all.
In modern printing the accents have roughly the same shape; old manuscripts actually had them written slightly
differently. In the prose system there is an accent called ZARQA,
which is postposed (on or to the left of the last letter), and in the
poetic system there is one called TSINOR (also called zarqa, and vice versa;
each of these has many names) which has the same shape and placement and
even an analogous function in the structure of the cantillations. There
is another accent, only in the poetic system, called the TSINNORIT (a
diminutive of tsinor), which occurs directly above its letter, and is
(almost?) never on the last letter of its word. (More modern printing
tends to put the zarqa right on top of its letter too, but that's just a
printing preference). If you look closely at some old manuscripts, you
can tell that tsinnorit has a slightly different shape than zarqa/tsinor.

As encoded in Unicode, there are ZARQA (U+0598) and ZINOR (U+05AE) [sic]. By
the usual meanings of those names, those should properly be synonyms,
the same accent, but they're not. While the word "zinor" would be mnemonic of "tsinnorit," it's the wrong
way around in the character names: ZINOR has the combining class of
above-postposed, and ZARQA is encoded to go directly above the letter. So,
to encode a zarqa or a tsinor, you need to use ZINOR, and to encode a
tsinnorit, you need to use ZARQA.
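
The combining classes behind this observation can be inspected with the following Python sketch, which assumes the standard unicodedata module. Class 230 is the generic "above" class; class 228 is "above left", which corresponds to the postposed position in right-to-left Hebrew text.

```python
import unicodedata

for ch in ("\u0598", "\u05ae"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))

# U+0598 HEBREW ACCENT ZARQA 230  (directly above: use for a tsinnorit)
# U+05AE HEBREW ACCENT ZINOR 228  (above left: use for a zarqa or tsinor)
```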