Summary

This document describes specifications for four normalized forms of
Unicode text. With these forms, equivalent text (canonical or compatibility)
will have identical binary representations.

Status of this document

This document contains informative material and normative specifications
that have been considered and approved by the Unicode Technical Committee for
publication as a Technical Report and as part of the Unicode Standard, Version
3.0 (forthcoming). Any reference to version 3.0 of the Unicode Standard
automatically includes this technical report.

The content of all technical reports must be understood in the context of the
appropriate version of the Unicode Standard. References in this technical report
to sections of the Unicode Standard refer to the Unicode Standard, Version 3.0.
See http://www.unicode.org/unicode/standard/versions/
for more information.

This technical report may undergo further editorial work before the
release of the Unicode Standard, Version 3.0. Please mail corrigenda and other
comments to the authors.

§1 Introduction

The Unicode Standard, Version 3.0 describes several forms of
normalization in Section 5.7 (Section 5.9 in Version 2.0). Two of these
forms are precisely specified in Section 3.6. In particular, the standard
defines a canonical decomposition format, which can be used as a normalization
for interchanging text. This format allows for binary comparison while
maintaining canonical equivalence with the original unnormalized text.

The standard also defines a compatibility decomposition format, which allows
for binary comparison while maintaining compatibility equivalence with the
original unnormalized text. The latter can also be useful in many circumstances,
since it levels the differences between compatibility characters that are
inappropriately distinguished in those circumstances. For example, the half-width and full-width
katakana characters will have the same compatibility decomposition and
are thus compatibility equivalents; however, they are not canonical equivalents.

Both of these formats are normalizations to decomposed characters. While
Section 3.6 also discusses normalization to composite characters (also known as decomposable
or precomposed characters), it does not precisely specify a format.
Because of the nature of the precomposed forms in the Unicode Standard, there is
more than one possible specification for a normalized form with composite
characters. This document provides a unique specification for normalization, and
a label for each normalized form.

As with decomposition, there are two forms of normalization to composite
characters, Form C and Form KC. The difference between these
depends on whether the resulting text is to be a canonical equivalent to
the original unnormalized text, or is to be a compatibility equivalent to
the original unnormalized text. (In KC and KD, a K is used
to stand for compatibility to avoid confusion with the C standing
for canonical.) Both types of normalization can be useful in different
circumstances.

Text containing only
ASCII characters (U+0000 to U+007F) is left unaffected by all of the
Normalization forms. This is particularly important for programming
languages (see Annex 7:
Programming Language Identifiers).

Normalization Form C uses canonical composite characters where possible, and
maintains the distinction between characters that are compatibility equivalents.
Typical strings of composite accented Unicode characters are already in
Normalization Form C. Implementations of Unicode which restrict themselves to a
repertoire containing no combining marks (such as those that declare themselves
to be implementations at Level 1 as defined in ISO/IEC 10646-1) are already
typically using Normalization Form C. (Implementations of later versions of
10646 need to be aware of the versioning issues--see §3
Versioning.) The W3C Character Model for the World Wide Web (http://www.w3.org/TR/WD-charmod)
requires the use of Normalization Form C for XML and related standards (this
document is not yet final, but this requirement is not expected to change). See
also the W3C Requirements for String Identity Matching and String Indexing
(http://www.w3.org/TR/WD-charreq)
for more background.

Normalization Form KC additionally levels the differences between
compatibility characters which are inappropriately distinguished in many
circumstances. For example, the half-width and full-width katakana
characters will normalize to the same strings, as will Roman Numerals and their
letter equivalents. More complete examples are provided in Annex
1: Examples.

Normalization forms KC and KD must not be blindly applied to arbitrary text.
Since they erase many formatting distinctions, they will prevent round-trip
conversion to and from many legacy character sets, and unless supplanted by
formatting markup, may remove distinctions that are important to the semantics
of the text. The best way to think of these normalization forms is like
uppercase or lowercase mappings: useful in certain contexts for identifying core
meanings, but also performing modifications to the text that may not always be
appropriate.

To summarize the treatment of compatibility characters that were in the source
text:

Both forms D and C maintain compatibility characters.

Neither form KD nor form KC maintains compatibility characters.

None of the forms generate compatibility characters that were not in the
source text.

Normalization Form KC
does not attempt to map characters to compatibility composites. For
example, a compatibility composition of "office" does not
produce "o\uFB03ce", even though "\uFB03" is a
character that is the compatibility equivalent of the sequence of three
characters 'ffi'.

None of the
normalization forms are closed under string concatenation. For example,
suppose a first string ends with "â" (a-circumflex) and a second string
starts with "." (dot_below). Both strings are in Form C, but their
concatenation is not: the normalized form contains the precomposed
character a-circumflex-dot_below. Without limiting the repertoire, there
is no way to produce a normalized form that is closed under simple string
concatenation. If desired, however, a specialized function could be
constructed that produces a normalized concatenation. However, all of the
normalization forms are closed under substringing.
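Although it postdates this report, the java.text.Normalizer class in later versions of Java can be used to illustrate the concatenation point; a minimal sketch:

```java
import java.text.Normalizer;

public class ConcatDemo {
    public static void main(String[] args) {
        String first = "\u00E2";   // "â" (a-circumflex), already in Form C
        String second = "\u0323";  // combining dot below, trivially in Form C
        String joined = first + second;

        // Each piece is normalized, but the concatenation is not:
        System.out.println(Normalizer.isNormalized(first, Normalizer.Form.NFC));   // true
        System.out.println(Normalizer.isNormalized(joined, Normalizer.Form.NFC));  // false

        // Renormalizing produces the single precomposed character
        // U+1EAD (a with circumflex and dot below).
        System.out.println(Normalizer.normalize(joined, Normalizer.Form.NFC).equals("\u1EAD"));
    }
}
```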

All of the definitions in this document depend on the rules for equivalence
and decomposition found in Chapter 3 of The Unicode Standard and the
decomposition mappings in the Unicode Character Database.

Decomposition must be
done in accordance with these rules. In particular, the decomposition
mappings found in the Unicode Character Database must be applied
recursively, and then the string put into canonical order.

§2 Notation

We will use the following notation for brevity:

Unicode names are shortened, such as the following:

E-grave

=

LATIN CAPITAL LETTER E WITH GRAVE

ka

=

KATAKANA LETTER KA

hw_ka

=

HALFWIDTH KATAKANA LETTER KA

ten

=

COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK

hw_ten

=

HALFWIDTH KATAKANA VOICED SOUND MARK

The combining class of a character X may be written as CC(X)

A sequence of characters may be represented by using plus signs between
the character names, or by using string notation.

"...\uXXXX..." represents the Unicode character U+XXXX embedded
within a string.

A single character which is equivalent to the sequence of characters B + C
may be written as B-C.

The normalization forms for a string X can be abbreviated as D(X), KD(X),
C(X) and KC(X), respectively.

Conjoining jamo of various types (initial, medial, final) are represented
by subscripts, such as ki, am, and kf.

Spacing accents may be used to represent non-spacing accents, such as
"c¸" for c followed by a non-spacing cedilla.

§3 Versioning

Because additional composite characters may be added to future versions of
the Unicode Standard, composition is less stable than decomposition. So that
implementations can get the same result for normalization even if they upgrade
to a new version of Unicode, it is necessary to specify a fixed version
for the composition process, called the composition version.

Decomposition is only
unstable if an existing character's decomposition mapping changes. The
Unicode Technical Committee has the policy of carefully reviewing proposed
corrections in character decompositions, and only making changes where the
benefits very clearly outweigh the drawbacks.

The composition version is defined to be Version 3.0.0 of the
Unicode Character Database.

To see what difference the composition version makes, suppose that Unicode
4.0 adds the composite Q-caron. For an implementation that uses Unicode
4.0, strings in Normalization Forms C or KC will continue to contain the
sequence Q + caron, and not the new character Q-caron,
since a canonical composition for Q-caron was not defined in the
composition version. See §7 Composition
Exclusion Table for more information.

§5 Conformance

A process that produces Unicode text that purports to be in a
Normalization Form shall do so in accordance with the specifications in this
document.

A process that tests Unicode text to determine whether it is in a
Normalization Form shall do so in accordance with the specifications in this
document.

The specifications for
Normalization Forms are written in terms of a process for producing a
decomposition or composition from an arbitrary Unicode string. This is a logical
description--particular implementations can have more efficient mechanisms
as long as they produce the same result. Similarly, testing for a
particular Normalization Form does not require applying the process of
normalization, so long as the result of the test is equivalent to applying
normalization and then testing for binary identity.

§6 Specification

All combining character sequences start with a character of canonical class
zero. For simplicity, we define a term for such characters:

D1. A character S is a starter if it has a
canonical class of zero in the Unicode Character Database.

Because of the definition of canonical equivalence, the order of combining
characters with the same canonical class makes a difference. For example, a-macron-breve
is not the same as a-breve-macron. Characters cannot be composed if that
would change the canonical order of the combining characters.

D2. In any character sequence beginning with a
starter S, a character C is blocked from S if and only if there is some
character B between S and C, and either B is a starter or it has the same
canonical class as C.

When B blocks C,
changing the order of B and C would result in a character sequence that is
not canonically equivalent to the original. See Section 3.9
Canonical Ordering Behavior in the Unicode Standard.

If a combining character
sequence is in canonical order, then testing whether a character is
blocked only requires looking at the immediately preceding character.
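This observation can be sketched as follows; the combining classes are passed in as a plain array here, standing in for a lookup in the Unicode Character Database:

```java
public class Blocking {
    // For a string already in canonical order, a non-starter at index i is
    // blocked from the last starter exactly when the immediately preceding
    // character is a non-starter of the same canonical class.
    static boolean isBlocked(int[] cc, int i) {
        return cc[i - 1] != 0 && cc[i - 1] == cc[i];
    }

    public static void main(String[] args) {
        // <starter, diaeresis, acute>: classes 0, 230, 230 -- acute is blocked.
        System.out.println(isBlocked(new int[] { 0, 230, 230 }, 2));
        // <starter, cedilla, acute>: classes 0, 202, 230 -- acute is not blocked.
        System.out.println(isBlocked(new int[] { 0, 202, 230 }, 2));
    }
}
```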

The process of forming a composition in Normalization Form C or KC involves:

decomposing the string according to the canonical (or compatibility,
respectively) mappings of the Unicode Character Database that corresponds to
the latest version of Unicode supported by the implementation, then

composing the resulting string according to the canonical mappings
of the composition version of the Unicode Character Database, Version 3.0.0,
by successively composing each unblocked character with the last starter.

Figure 1 shows a sample of how this works. The dark green cubes represent
starters, and the light gray cubes represent non-starters. In the first step,
the string is fully decomposed, and reordered. In the second step, each
character is checked against the last starter, and combined with it if all the
conditions are met. Examples are provided in Annex 1:
Examples, and a code sample is provided in Annex 5:
Code Sample.

Figure 1. Composition Process

A precise notion is required for when an unblocked character can be composed
with a starter. This uses the following two definitions.

D3. A primary composite is a character that
has a canonical decomposition mapping in the Unicode Character Database (or has a
canonical Hangul decomposition) but is not in the Composition
Exclusion Table (§7).

D4. A character X can be primary combined with
a character Y if and only if there is a primary composite Z which is canonically
equivalent to the sequence <X, Y>.

Based upon these definitions, the following rules specify the Normalization
Forms C and KC.

R1. Normalization Form C

The Normalization Form C for a string S is obtained by applying the following
process, or any other process that leads to the same result:

Generate the canonical decomposition for the source string S
according to the decomposition mappings in the latest supported
version of the Unicode Character Database.

Iterate through each character C in that decomposition, from first to
last. If C is not blocked from the last starter L, and it can be primary
combined with L, then replace L by the composite L-C, and remove C.

The result of this process is a new string S' which is in Normalization Form
C.
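Although it postdates this report, the java.text.Normalizer API in later versions of Java implements this process and can be used to check rule R1 against the examples in Annex 1; a minimal sketch:

```java
import java.text.Normalizer;

public class FormCDemo {
    public static void main(String[] args) {
        // D + dot_above composes to D-dot_above (U+1E0A).
        System.out.println(Normalizer.normalize("D\u0307", Normalizer.Form.NFC));
        // The angstrom sign singleton-decomposes and recomposes to A-ring (U+00C5).
        System.out.println(Normalizer.normalize("\u212B", Normalizer.Form.NFC));
        // dot_below combines first; dot_above has no primary composite with
        // D-dot_below (U+1E0C), so it remains a combining mark.
        System.out.println(Normalizer.normalize("D\u0323\u0307", Normalizer.Form.NFC));
    }
}
```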

R2. Normalization Form KC

The Normalization Form KC for a string S is obtained by applying the
following process, or any other process that leads to the same result:

Generate the compatibility decomposition for the source
string S according to the decomposition mappings in the latest supported
version of the Unicode Character Database.

Iterate through each character C in that decomposition, from first to
last. If C is not blocked from the last starter L, and it can be primary
combined with L, then replace L by the composite L-C, and remove C.

The result of this process is a new string S' which is in Normalization Form
KC.

§7 Composition Exclusion Table

There are four classes of characters that are excluded from composition.

Script-specifics: precomposed characters that are generally not the
preferred form for particular scripts.

These cannot be computed from information in the Unicode
Character Database.

Post Composition Version: precomposed characters that are added to
Unicode after the composition version is fixed. This set is currently empty,
but will be updated with each subsequent version of Unicode. See §3
Versioning.

These cannot be computed from information in the Unicode
Character Database.

Singleton decompositions: characters having canonical decompositions that
consist of a single character (described below).

Non-starter decompositions: characters whose canonical decompositions
begin with a character that is not a starter.

These are computed from information in the Unicode
Character Database.

When two characters have the same canonical decomposition in the Unicode
Character Database, one of them is chosen for composition and the other one is
excluded. Here is an example of this:

Source:

212B ('Å' ANGSTROM SIGN)
00C5 ('Å' LATIN CAPITAL LETTER A WITH RING ABOVE)

Decomposition (shared by both):

0041 ('A' LATIN CAPITAL LETTER A) + 030A ('°' COMBINING RING ABOVE)

In such cases, the Unicode Character Database will first decompose one of the
characters to the other, and then decompose from there. That is, one of the
characters (in this case ANGSTROM SIGN) will have a singleton
decomposition. Characters with singleton decompositions are included in Unicode
essentially for compatibility with certain pre-existing standards. These
singleton decompositions are excluded from primary composition.

A machine-readable form of the Composition Exclusion Table
for Unicode 3.0.0 is found in ftp://ftp.unicode.org/Public/3.0-Update/.
All four classes of characters are included in this file, although the
singletons and non-starter decompositions are commented out. If your
implementation does not compute these latter classes directly from the Unicode
Character Database, then it can uncomment the appropriate lines.
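The exact layout of that file is not reproduced here; the sketch below assumes a simple layout in which each entry line begins with a hexadecimal code point, '#' begins a comment, and the computable classes appear as commented-out entries:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class ExclusionTable {
    // Illustrative parser only; the file layout described above is an
    // assumption, not a specification of the actual data file.
    static Set<Integer> parse(List<String> lines, boolean uncommentComputed) {
        Set<Integer> excluded = new TreeSet<>();
        for (String line : lines) {
            if (uncommentComputed && line.startsWith("#")) {
                line = line.substring(1);   // "uncomment" the entry
            }
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            try {
                excluded.add(Integer.parseInt(line.split("\\s+")[0], 16));
            } catch (NumberFormatException e) {
                // not an entry line (e.g. prose inside a comment); skip it
            }
        }
        return excluded;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "0958 # DEVANAGARI LETTER QA",
                "# 212B ANGSTROM SIGN (computable singleton)");
        System.out.println(parse(sample, false).contains(0x212B));  // false
        System.out.println(parse(sample, true).contains(0x212B));   // true
    }
}
```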

Annex 1: Examples

This annex provides some detailed examples of the results of applying each of
the normalization forms.

Common Examples

The following examples are cases where the Forms D and KD are identical, and
Forms C and KC are identical.

    Original                           Form D, KD                         Form C, KC
a   D-dot_above                        D + dot_above                      D-dot_above
b   D + dot_above                      D + dot_above                      D-dot_above
c   D-dot_below + dot_above            D + dot_below + dot_above          D-dot_below + dot_above
d   D-dot_above + dot_below            D + dot_below + dot_above          D-dot_below + dot_above
e   D + dot_above + dot_below          D + dot_below + dot_above          D-dot_below + dot_above
f   D + dot_above + horn + dot_below   D + horn + dot_below + dot_above   D-dot_below + horn + dot_above
g   E-macron-grave                     E + macron + grave                 E-macron-grave
h   E-macron + grave                   E + macron + grave                 E-macron-grave
i   E-grave + macron                   E + grave + macron                 E-grave + macron
j   angstrom_sign                      A + ring                           A-ring
k   A-ring                             A + ring                           A-ring

Notes:

a, b: Both decomposed and precomposed canonical sequences produce the same
result.

c-f: By the time we have gotten to dot_above, it cannot be combined with
the base character. There may be intervening combining marks (see f), so
long as the result of the combination is canonically equivalent.

g, h: Multiple combining characters are combined with the base character.

i: Characters will not be combined if they would not be canonical
equivalents because of their ordering.

j, k: Since Å (A-ring) is the preferred composite, it is the form produced
for both characters.

Normalization Forms D and C
Examples

The following are examples of Forms D and C that illustrate how they differ
from Forms KD and KC, respectively.

    Original            Form D              Form C
q   ka + ten            ka + ten            ga
r   hw_ka + hw_ten      hw_ka + hw_ten      hw_ka + hw_ten
s   ka + hw_ten         ka + hw_ten         ka + hw_ten
t   hw_ka + ten         hw_ka + ten         hw_ka + ten
u   kaks                ki + am + ksf       kaks

Notes:

q-t: Different compatibility equivalents of a single Japanese character
will not result in the same string in Normalization Form C.

u: Hangul syllables are maintained under normalization.

Normalization Forms KD and
KC Examples

The following are examples of Forms KD and KC that illustrate how they differ
from Forms D and C, respectively.

    Original            Form KD             Form KC
l'  "Äffin"             "A\u0308ffin"       "Äffin"
m'  "Ä\uFB03n"          "A\u0308ffin"       "Äffin"
n'  "Henry IV"          "Henry IV"          "Henry IV"
o'  "Henry \u2163"      "Henry IV"          "Henry IV"
p'  ga                  ka + ten            ga
q'  ka + ten            ka + ten            ga
r'  hw_ka + hw_ten      ka + ten            ga
s'  ka + hw_ten         ka + ten            ga
t'  hw_ka + ten         ka + ten            ga
u'  kaks                ki + am + ksf       kaks

Notes:

l', m': The ffi_ligature (U+FB03) is decomposed in Normalization Form KC
(where it is not in Normalization Form C).

n', o': Similarly, the resulting strings here are identical in
Normalization Form KC.

p'-t': Different compatibility equivalents of a single Japanese character
will result in the same string in Normalization Form KC.

u': Hangul syllables are maintained under normalization. (In earlier
versions of Unicode, jamo characters like ksf had compatibility mappings
to kf + sf. These mappings were removed in Unicode 2.1.9 to ensure that
Hangul syllables are maintained.)

Annex 2: Design Goals

The following are the design goals for the specification of the normalization
forms, and are presented here for reference.

Goal 1: Uniqueness

The first, and by far the most important, design goal for the normalization
forms is uniqueness: two equivalent strings will have precisely the same
normalized form. More explicitly,

If two strings x and y are canonical equivalents, then

C(x) = C(y)

D(x) = D(y)

If two strings are compatibility equivalents, then

KC(x) = KC(y)

KD(x) = KD(y)

Goal 2: Stability

The second major design goal for the normalization forms is stability of
characters that are not involved in the composition or decomposition process.

If X contains a character with a compatibility decomposition, then D(X)
and C(X) still contain that character.

As much as possible, if there are no combining characters in X, then C(X)
= X.

Irrelevant combining marks should not affect the results of composition.
See example f in Annex 1: Examples, where the
horn character does not affect the results of composition.

Goal 3: Efficiency

The third major design goal for the normalization forms is that they allow for
efficient implementations.

It is possible to implement efficient code for producing the Normalization
Forms. In particular, it should be possible to produce Normalization Form C
very quickly from strings that are already in Normalization Form C or are in
Normalization Form D.

Composition Forms do not have to produce the shortest possible results,
because that can be computationally expensive.

Annex 3: Implementation Notes

There are a number of optimizations that can be made in programs that produce
Normalization Form C. Rather than first decomposing the text fully, a quick
check can be made on each character. If it is already in the proper precomposed
form, then no work has to be done. Only if the current character is combining or
in the §7 Composition Exclusion Table
does a slower code path need to be invoked. (This code path will need to look at
previous characters, back to the last starter. See Annex
8: Trailing Characters for more information.)

Most of the cycles in doing composition are spent looking up the
appropriate data. The data lookup for Normalization Form C can be very
efficiently implemented, since it only has to look up pairs of characters, not
arbitrary strings. First a multi-stage table (as discussed in Chapter 5 of the
Unicode Standard) is used to map a character c to a small integer i
in a contiguous range from 0 to n. The code for doing this looks like:

i = data[index[c >> BLOCKSHIFT] + (c & BLOCKMASK)];

Then a pair of these small integers are simply mapped through a
two-dimensional array to get a resulting value. This yields much better
performance than a general-purpose string lookup in a hash table.
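The two-step lookup can be sketched as follows; the tiny tables built here are synthetic stand-ins for ones generated from the Unicode Character Database, and cover only the pair A + COMBINING RING ABOVE:

```java
public class PairLookup {
    static final int BLOCKSHIFT = 7;
    static final int BLOCKMASK = 0x7F;
    static final int[] index = new int[0x10000 >> BLOCKSHIFT];
    static final short[] data = new short[3 * (BLOCKMASK + 1)];
    static final char[][] pairs = new char[3][3];

    static {
        // Synthetic setup: A (U+0041) gets small integer 1 and COMBINING
        // RING ABOVE (U+030A) gets 2; everything else maps to 0, meaning
        // "does not participate in composition".
        index['\u0041' >> BLOCKSHIFT] = 1 * (BLOCKMASK + 1);
        index['\u030A' >> BLOCKSHIFT] = 2 * (BLOCKMASK + 1);
        data[1 * (BLOCKMASK + 1) + ('\u0041' & BLOCKMASK)] = 1;
        data[2 * (BLOCKMASK + 1) + ('\u030A' & BLOCKMASK)] = 2;
        pairs[1][2] = '\u00C5';   // A + ring above compose to A-ring
    }

    // Map a character to its small integer, as in the snippet above.
    static int smallInt(char c) {
        return data[index[c >> BLOCKSHIFT] + (c & BLOCKMASK)];
    }

    // Return the primary composite for the pair, or 0 if they do not compose.
    static char compose(char starter, char next) {
        return pairs[smallInt(starter)][smallInt(next)];
    }

    public static void main(String[] args) {
        System.out.println(compose('A', '\u030A') == '\u00C5');  // true
        System.out.println(compose('A', 'B') == 0);              // true
    }
}
```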

Since the Hangul compositions and decompositions are algorithmic, memory
storage can be significantly reduced if the corresponding operations are done in
code. See Annex 10: Hangul for more information.

Annex 4: Decomposition

For those accessing this document without access to the Unicode Standard, the
following summarizes the canonical decomposition process. For a complete
discussion, see Sections 3.6, 3.10 and 3.11 of the Unicode Standard.

Canonical decomposition is the process of taking a string, recursively
replacing composite characters using the Unicode canonical decomposition
mappings (including the algorithmic Hangul canonical decomposition mappings, see
Annex 10: Hangul), and putting the result in canonical
order.

Compatibility decomposition is the process of taking a string,
replacing composite characters using both the Unicode canonical
decomposition mappings and the Unicode compatibility decomposition
mappings, and putting the result in canonical order.

A string is put into canonical order by repeatedly replacing any
exchangeable pair by the pair in reversed order. When there are no remaining
exchangeable pairs, then the string is in canonical order. Note that the
replacements can be done in any order.

A sequence of two adjacent characters in a string is an exchangeable pair
if the combining class (from the Unicode Character Database) for the first
character is greater than the combining class for the second and the second is
not a starter; that is, if CC(first) > CC(second) > 0.

Examples of exchangeable pairs:

Sequence             Combining classes   Status
<acute, cedilla>     230, 202            exchangeable, since 230 > 202
<a, acute>           0, 230              not exchangeable, since 0 <= 230
<diaeresis, acute>   230, 230            not exchangeable, since 230 <= 230
<acute, a>           230, 0              not exchangeable, since the second class is zero

Example of decomposition:

Take the string with the characters "ác´¸" (a-acute, c,
acute, cedilla). Decomposing the a-acute yields a + acute + c + acute +
cedilla. Putting the string into canonical order then exchanges the final
pair, yielding a + acute + c + cedilla + acute.

This is because cedilla has a lower canonical ordering value
(202) than acute (230) does. The positions of 'a' and 'c' are not
affected, since they are starters.
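The reordering step can be sketched as a simple exchange loop; the combining classes are carried alongside the characters here, standing in for a lookup in the Unicode Character Database:

```java
public class CanonicalOrdering {
    // Repeatedly exchange any adjacent pair where cc(first) > cc(second) > 0,
    // carrying the characters along with their classes. This is a plain
    // bubble sort; the result is independent of the order of exchanges.
    static void order(char[] chars, int[] cc) {
        boolean swapped = true;
        while (swapped) {
            swapped = false;
            for (int i = 1; i < chars.length; i++) {
                if (cc[i - 1] > cc[i] && cc[i] > 0) {   // exchangeable pair
                    char tc = chars[i - 1]; chars[i - 1] = chars[i]; chars[i] = tc;
                    int ti = cc[i - 1]; cc[i - 1] = cc[i]; cc[i] = ti;
                    swapped = true;
                }
            }
        }
    }

    public static void main(String[] args) {
        // a, acute, c, acute, cedilla with classes 0, 230, 0, 230, 202
        char[] s = { 'a', '\u0301', 'c', '\u0301', '\u0327' };
        int[] cc = { 0, 230, 0, 230, 202 };
        order(s, cc);
        // The cedilla (202) now precedes the acute (230) after the c.
        System.out.println(new String(s).equals("a\u0301c\u0327\u0301"));
    }
}
```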

Annex 5: Code Sample

A code sample is available for the four different normalization forms. For
clarity, this sample is not optimized. The implementation transforms a string in
two passes: first decomposing, then recomposing that result by successively
composing each unblocked character with the last starter.

In some implementations, people may be working with streaming interfaces that
read and write small amounts at a time. In those implementations, the text back
to the last starter needs to be buffered. Whenever a second starter would be
added to that buffer, the buffer can be flushed.

Annex 6: Legacy Encodings

While the Normalization Forms are specified for Unicode text, they can also
be extended to non-Unicode (legacy) character encodings. This is based on
mapping the legacy character set strings to and from Unicode.

D4. An invertible transcoding T for a legacy
character set L is a one-to-one mapping from characters encoded in L to
characters in Unicode with an associated mapping T⁻¹ such that for
any string S in L, T⁻¹(T(S)) = S.

Typically there is a
single accepted invertible transcoding for a given legacy character set.
In a few cases there may be multiple invertible transcodings: for
example, Shift-JIS may have two different mappings used in different
circumstances: one to preserve the '/' semantics of 0x2F, and
one to preserve the '¥' semantics. If you implement transcoders from
legacy character sets, it is recommended that you ensure that the result
is in Normalization Form C where possible.

The character indexes in
the legacy character set string may be very different than character
indexes in the Unicode equivalent. For example, if a legacy string uses
visual encoding for Hebrew, then its first character might be the last
character in the Unicode string.

D5. Given a string S encoded in L and an invertible
transcoding T for L, the Normalization Form X of S under T is defined to
be the result of mapping to Unicode, normalizing to Unicode Normalization Form
X, and mapping back to the legacy character encoding; that is, T⁻¹(X(T(S))).
Where there is a single accepted invertible transcoding for that character set,
we can simply speak of the Normalization Form X of S.

Legacy character sets fall into three categories based on their normalization
behavior with accepted transcoders.

Prenormalized. Any string in the character set is already in
Normalization Form X.
For example, ISO 8859-1 is prenormalized in Form C.

Normalizable. Although the set is not prenormalized, any string in
the set can be normalized to Form X.
ISO 2022 (with a mixture of ISO 5426 and ISO 8859-1) is an example of this.

Unnormalizable. Some strings in the character set cannot be
normalized into Form X.
For example, ISO 5426 is unnormalizable in Form C under common transcoders,
since it contains combining marks but not composites.
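Definition D5 can be sketched with the later java.text.Normalizer and java.nio APIs, using ISO 8859-1 as the legacy set and Form C as Form X. Because ISO 8859-1 is prenormalized in Form C, the round trip leaves valid input unchanged:

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class LegacyNormalize {
    // T is the accepted invertible transcoding for the legacy set; the
    // ISO 8859-1 charset is used here purely for illustration.
    static byte[] normalizeLegacy(byte[] legacy) {
        String unicode = new String(legacy, StandardCharsets.ISO_8859_1);        // T(S)
        String nfc = Normalizer.normalize(unicode, Normalizer.Form.NFC);         // X(T(S))
        return nfc.getBytes(StandardCharsets.ISO_8859_1);                        // T⁻¹(X(T(S)))
    }

    public static void main(String[] args) {
        byte[] in = { (byte) 0xC4, 'f', 'f' };   // "Äff" in ISO 8859-1
        byte[] out = normalizeLegacy(in);
        // ISO 8859-1 is prenormalized in Form C, so the bytes are unchanged.
        System.out.println(java.util.Arrays.equals(in, out));  // true
    }
}
```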

Annex 7: Programming Language
Identifiers

This section discusses issues that must be taken into account when
considering normalization of identifiers in programming languages or scripting
languages. The Unicode Standard provides a recommended syntax for identifiers
for programming languages that allow the use of non-ASCII languages in code. It
is a natural extension of the identifier syntax used in C and other programming
languages:

That is, the first character of an identifier can be an uppercase letter,
lowercase letter, titlecase letter, modifier letter, other letter, or
letter number. The subsequent characters of an identifier can be any of
those, plus non-spacing marks, spacing combining marks, decimal numbers,
connector punctuation, and formatting codes (such as the right-to-left mark).
Normally the formatting codes should be filtered out before storing or comparing
identifiers.
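The filtering step can be sketched with Character.getType, which reports general category Cf (formatting codes) as Character.FORMAT:

```java
public class IdentifierFilter {
    // Remove formatting codes (general category Cf) before storing or
    // comparing identifiers, as recommended above.
    static String stripFormat(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.getType(c) != Character.FORMAT) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+200F RIGHT-TO-LEFT MARK is a formatting code.
        System.out.println(stripFormat("ab\u200Fc").equals("abc"));
    }
}
```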

Normalization as described in this report can be used
to avoid problems where apparently identical
identifiers are not treated equivalently. Such problems can appear both during
compilation and during linking, in particular also across different programming
languages. To avoid such problems, programming languages should normalize
identifiers before storing or comparing them, preferably in Normalization Form
KC — especially if the identifiers are caseless. While Normalization Form C
can also be used, Form KC eliminates variations that are probably not relevant
to the specification of programming language identifiers.

If programming languages are using form KC to level differences between
characters, then they need to use a slight modification of the identifier syntax
from the Unicode Standard to deal with the idiosyncrasies of a small number of
characters. These characters fall into three classes:

Middle Dot. Because most Catalan legacy data will be encoded in
Latin-1, U+00B7 MIDDLE DOT needs to be allowed in <identifier_extend>.
(If the programming language is using a dot as an operator, then U+2219
BULLET OPERATOR or U+22C5 DOT OPERATOR should be used
instead. However, care should be taken when dealing with U+00B7 MIDDLE
DOT, as many processes will assume its use as punctuation, rather
than as a letter extender.)

Combining-like characters. Certain characters are not formally
combining characters, although they behave in most respects as if they were.
Ideally, they should not be in <identifier_start>, but
rather in <identifier_extend>, along with combining
characters. In most cases, the mismatch does not cause a problem, but when
these characters have compatibility decompositions, they can cause
identifiers not to be closed under Normalization Form KC. In particular, the
following four characters should be in <identifier_extend>
and not <identifier_start>:

0E33 THAI CHARACTER SARA AM

0EB3 LAO VOWEL SIGN AM

FF9E HALFWIDTH KATAKANA VOICED SOUND MARK

FF9F HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK

Irregularly decomposing characters. U+037A GREEK
YPOGEGRAMMENI and certain Arabic presentation forms have irregular
compatibility decompositions, and need to be excluded from both <identifier_start>
and <identifier_extend>. It is recommended that all
Arabic presentation forms be excluded from identifiers in any event,
although only a few of them are required to be excluded for normalization to
guarantee identifier closure.

With these amendments to the identifier syntax, all identifiers are closed
under all four Normalization forms. This means that for any string S and any
normalization form F,

isIdentifier(S) == isIdentifier(normalize(F, S))

In addition, those programming languages with case-insensitive identifiers
should also use the case mappings described in Unicode
Technical Report #21, Case Mappings to produce a case-insensitive normalized
form. This means using both the data in UnicodeData.txt and in
SpecialCasing.txt. Identifiers are also closed under lowercasing, so that for
any string S,

isIdentifier(S) == isIdentifier(toLower(S))

In addition, identifiers are preserved by uppercasing; for any string S, if isIdentifier(S)
then isIdentifier(toUpper(S)). The reverse is also true, but only
if the character U+0345 COMBINING GREEK YPOGEGRAMMENI is not at the
start of S. This is because U+0345 is not in <identifier_start>,
but its uppercase is. In practice this is not a problem, because of the way
normalization is used with identifiers.

When leveling distinctions among programming language identifiers by using
compatibility normalization or case mapping, the source text should not be
leveled before parsing. Only once the identifiers are distinguished should they
alone be leveled. Otherwise literal strings and other program text may lose
necessary distinctions.

Sample code in Java that shows parsing for identifiers, including leveling
distinctions using Normalization and case conversion, is available via Normalizer.html.

Annex 8: Trailing Characters

The Trailing Characters table lists the characters in Unicode 3.0 that may
occur in a canonical decomposition of a character, but not as the first
character of that decomposition. The inclusion of this table here is
informative: the table can be generated from the Unicode Character Database.

If a string does not contain characters in the Trailing Characters table or
in the Composition Exclusion Table,
then none of its characters participate in compositions, so the only processing
required for Normalization Form C is to make sure that the characters are in
canonical order. The Other Non-Starters table contains all of the Unicode 3.0
non-starters that are neither in the Trailing Characters table nor in the
Composition Exclusion table. If a string contains no characters from any of
these three tables, then it is in Normalization Form C already.
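An implementation that uses these tables to skip work must still agree with a full normalizer on the result. Java's built-in java.text.Normalizer provides such a reference check; this is not the table-based test itself, only a way to confirm what it must report:

```java
import java.text.Normalizer;

public class QuickCheck {
    public static void main(String[] args) {
        String composed = "\u00E9";    // é, already in Normalization Form C
        String decomposed = "e\u0301"; // e + COMBINING ACUTE, composes under NFC
        // A table-based shortcut must match the full check on both cases.
        System.out.println(Normalizer.isNormalized(composed, Normalizer.Form.NFC));
        System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFC));
    }
}
```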

Annex 9: Conformance Testing

Implementations must be thoroughly tested for conformance to the
normalization specification, especially for Normalization Form C. The
following conditions should be tested in any implementation.

For every character X in Unicode, let the string Y be D(X) and the string Z
be C(D(X)). Check that the following conditions hold for these strings:

If X does not have a canonical decomposition mapping in the Unicode
Character Database, then X = Y = Z.

otherwise,

Y and Z must be in canonical order

X ≠ Y

No character in Y can have a canonical decomposition mapping in the
Unicode Character Database

To test for canonical order in a string S, check that for each character
index i in the string (except the first), if CC(S[i-1]) > CC(S[i]),
then CC(S[i]) = 0. If this condition fails, the string is not in canonical
order.
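The canonical-order test translates directly into code. In the sketch below the combining classes are assumed to have been looked up already in the Unicode Character Database (Java's standard library does not expose them); the class and method names are illustrative.

```java
public class CanonicalOrder {
    // Returns true if a sequence of combining classes is in canonical
    // order: whenever cc[i-1] > cc[i], cc[i] must be 0 (a starter).
    // The combining-class values are assumed to have been looked up
    // in the Unicode Character Database beforehand.
    static boolean isCanonicalOrder(int[] cc) {
        for (int i = 1; i < cc.length; i++) {
            if (cc[i - 1] > cc[i] && cc[i] != 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // e.g. <e, COMBINING CEDILLA (ccc 202), COMBINING GRAVE (ccc 230)>
        System.out.println(isCanonicalOrder(new int[] {0, 202, 230})); // true
        // Reversed marks are not canonical: 230 > 202 and 202 != 0.
        System.out.println(isCanonicalOrder(new int[] {0, 230, 202})); // false
    }
}
```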

Annex 10: Hangul

Since the Hangul compositions and decompositions are algorithmic, memory
storage can be significantly reduced if the corresponding operations are done in
code rather than by simply storing the data in the general purpose tables. Here
is sample code illustrating algorithmic Hangul canonical decomposition and
composition done according to the specification in Section 3.11 Combining
Jamo Behavior. Although coded in Java, the same structure can be used in
other programming languages.

Hangul Composition

Notice an important feature of Hangul composition: whenever the source string
is not in Normalization Form D, you cannot simply detect character sequences of
the form <L, V> and <L, V, T>. You must also catch sequences of
the form <LV, T>. To guarantee uniqueness, these sequences must also be
composed. This is illustrated in Step 2 below.
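The two composition steps can be sketched arithmetically as follows. The constant values come from Section 3.11, Combining Jamo Behavior; the class and method names are illustrative:

```java
public class HangulComposition {
    static final int SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
    static final int LCount = 19, VCount = 21, TCount = 28;
    static final int NCount = VCount * TCount;   // 588
    static final int SCount = LCount * NCount;   // 11172

    // Composes a pair of characters arithmetically, returning -1 if
    // they do not compose. Step 1 handles <L, V>; step 2 handles
    // <LV, T>, which must also be caught when the input is not in
    // Normalization Form D.
    static int composePair(int first, int second) {
        // Step 1: leading consonant + vowel -> LV syllable.
        int LIndex = first - LBase;
        int VIndex = second - VBase;
        if (0 <= LIndex && LIndex < LCount && 0 <= VIndex && VIndex < VCount) {
            return SBase + (LIndex * VCount + VIndex) * TCount;
        }
        // Step 2: LV syllable + trailing consonant -> LVT syllable.
        int SIndex = first - SBase;
        int TIndex = second - TBase;
        if (0 <= SIndex && SIndex < SCount && (SIndex % TCount) == 0
                && 0 < TIndex && TIndex < TCount) {
            return first + TIndex;
        }
        return -1;  // the pair does not compose
    }

    public static void main(String[] args) {
        int lv = composePair(0x1100, 0x1161);  // KIYEOK + A -> U+AC00
        System.out.println(Integer.toHexString(lv));
        System.out.println(Integer.toHexString(composePair(lv, 0x11A8)));
    }
}
```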

Additional transformations can be performed on sequences of Hangul jamo for
various purposes. For example, to regularize sequences of Hangul jamo into
standard syllables, the choseong and jungseong fillers can be
inserted, as described in Chapter 3. (In the text of the 2.0 version of the
Unicode Standard, these standard syllables were called canonical syllables,
but this has nothing to do with canonical composition or decomposition.) For
keyboard input, additional compositions may be performed. For example, the
trailing consonants kf + sf
may be combined into ksf. In addition, some
Hangul input methods do not require a distinction on input between initial and
final consonants, and change between them on the basis of context. For example,
in the keyboard sequence mi + em + ni + si
+ am, the consonant ni would be reinterpreted as nf,
since there is no possible syllable nsa. This results in the two
syllables men and sa.

However, none of these additional transformations is considered part of the
Unicode Normalization Forms.

Hangul Character Names

Hangul decomposition is also used to form the character names for the Hangul
syllables. While the sample code that illustrates this process is not directly
related to normalization, it is worth including because it is so similar to the
decomposition code.
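A minimal sketch of that name-formation process, using the standard jamo short names and the same arithmetic decomposition as above (the class and method names are illustrative):

```java
public class HangulName {
    static final int SBase = 0xAC00, VCount = 21, TCount = 28;
    static final int NCount = VCount * TCount, SCount = 19 * NCount;

    // Jamo short names, indexed by LIndex, VIndex, and TIndex.
    static final String[] JAMO_L = { "G","GG","N","D","DD","R","M","B","BB",
        "S","SS","","J","JJ","C","K","T","P","H" };
    static final String[] JAMO_V = { "A","AE","YA","YAE","EO","E","YEO","YE",
        "O","WA","WAE","OE","YO","U","WEO","WE","WI","YU","EU","YI","I" };
    static final String[] JAMO_T = { "","G","GG","GS","N","NJ","NH","D","L",
        "LG","LM","LB","LS","LT","LP","LH","M","B","BS","S","SS","NG","J",
        "C","K","T","P","H" };

    // Forms the character name of a precomposed Hangul syllable from
    // the jamo short names of its arithmetic decomposition.
    static String hangulName(int s) {
        int SIndex = s - SBase;
        if (SIndex < 0 || SIndex >= SCount) {
            throw new IllegalArgumentException("not a precomposed Hangul syllable");
        }
        return "HANGUL SYLLABLE "
            + JAMO_L[SIndex / NCount]
            + JAMO_V[(SIndex % NCount) / TCount]
            + JAMO_T[SIndex % TCount];
    }

    public static void main(String[] args) {
        System.out.println(hangulName(0xAC00)); // HANGUL SYLLABLE GA
        System.out.println(hangulName(0xD55C)); // HANGUL SYLLABLE HAN
    }
}
```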

Annex 11: Intellectual Property

Transcript of letter regarding disclosure of IBM
Technology
(Hard copy is on file with the Chair of UTC and the Chair of NCITS/L2)
Transcribed on 1999-03-10

February 26, 1999

The Chair, Unicode Technical Committee

Subject: Disclosure of IBM Technology - Unicode Normalization Forms

The attached document entitled "Unicode Normalization Forms" does
not require IBM technology, but may be implemented using IBM technology that
has been filed for US Patent. However, IBM believes that the technology could
be beneficial to the software community at large, especially with respect to
usage on the Internet, allowing the community to derive the enormous benefits
provided by Unicode.

This letter is to inform you that IBM is pleased to make the Unicode
normalization technology that has been filed for patent freely available to
anyone using them in implementing to the Unicode standard.

Sincerely,

W. J. Sullivan,
Acting Director of National Language Support
and Information Development

Copyright

The Unicode Consortium makes no expressed or implied warranty
of any kind, and assumes no liability for errors or omissions. No liability is
assumed for incidental and consequential damages in connection with or arising
out of the use of the information or programs contained or accompanying this
technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc.,
and are registered in some jurisdictions.