This section describes the design principles, architecture and coding conventions of the ICU Collation Service.

ICU Collator

To
use the ICU Collation Service, an ICU Collator must first be
instantiated. An ICU Collator is a data structure or object that
maintains all of the property and state information necessary to define
and support the specific collation behavior provided. Examples of
properties described in the ICU Collator are the locale, whether
normalization is to be performed, and how many levels of collation are
to be evaluated. Examples of the state information described in the ICU
Collator include the direction of a Collation Element Iterator (forward
or backward) and the status of the last API executed.

The ICU Collator is instantiated either by referencing a locale or by defining a custom set of rules (a tailoring).

The ICU Collation Service uses the paradigm:

Open an ICU Collator,

Use while necessary,

Close the ICU Collator.

ICU Collator instances cannot be shared among threads. You
should open them instead, and use a different collator for each
separate thread. The safe clone function is supported for cloning
collators in a thread-safe fashion.

The ICU Collation Service follows the ICU conventions for locale designation when opening collators:

NULL means the machine default locale.

The empty locale name ("") means the root locale.

The ICU Collation Service adheres to the ICU conventions described in the "ICU Architectural Design
" section of the users guide.

In particular:

The standard error code convention is usually followed.
(Functions that do not take an error code parameter do so for backward
compatibility.)

The string length convention is followed: when passing an UChar *,
the length is required in a separate argument. If -1 is passed for the
length, it is assumed that the string is zero terminated.

Collation locale and keyword handling

When
a collator is created from a locale, the collation service (like all
ICU services) must map the requested locale to the localized collation
data available to ICU at the time. It does so using the standard ICU
locale fallback mechanism. See the fallback section (§) of the locale chapter
for more details.

If
you pass a regular locale in, like "en_US", the collation service first
searches with fallback for "collations/default" key. The first such key
it finds will have an associated string value; this is the keyword name
for the collation that is default for this locale. If the search falls
all the way back to the root locale, the collation service will us the
"collations/default" key there, which has the value "standard".

If
there is a locale with a keyword, like "de@collation=phonebook", the
collation service searches with fallback for "collations/phonebook". If
the search is successful, the collation service uses the string value
it finds to instantiate a collator. If the search fails because no such
key is present in any of ICU's locale data (e.g.,
"de@collation=funky"), the service returns a collator implementing UCA
and the return UErrorCode is U_USING_DEFAULT_WARNING.

Input values for collation

Collation
deals with processing strings. ICU generally requires that all the
strings should be in UTF-16 format, and that all the required
conversion should done before ICU functions are used. In the case of
collation, there are APIs that can also take instances of character
iterators (UCharIterator). Theoretically, character iterators can iterate strings in
any encoding. ICU currently provides character iterator implementations
for UTF-8 and UTF-16BE (useful when processing data from a big endian
platform on an little endian machine). It should be noted, however,
that using iterators for collation APIs has a performance impact. It
should be used in situations when it is not desirable to convert whole
strings before the operation - such as when using string compare
function.

Collation Elements

As
discussed in the introduction, there are many possible orderings for
sorted text, depending on language and other factors. Ideally, there is
a way to describe each ordering as a set of rules for calculating
numeric values for each string of text. The collation process then
becomes one of simply comparing these numeric values.

This
essentially describes the way the ICU Collation Service works. To
implement a particular sort ordering, first the relationship between
each character or character sequence is derived. For example, a Spanish
ordering defines the letter sequence "CH" to be between the letters "C"
and "D". As also discussed in the introduction, to order strings
properly requires that comparison of base letters must be considered
separately from comparison of accents. Letter case must also be
considered separately from either base letters or accents. Any ordering
specification language must provide a way to define the relationships
between characters or character sequences on multiple levels. ICU
supports this by using "<" to describe a relationship at the primary
level, using "<<" to describe a relationship at the secondary
level, and using "<<<" to describe a relationship at the
tertiary level. Here are some example usages:

Symbol

Example

Description

<

c < ch

Make a primary (base letter) difference between "c" and the character sequence "ch"

<<

a << ä

Make a secondary (accent) difference between "a" and "ä"

<<<

a<<<A

Make a tertiary difference between "a" and "A"

A
more complete description of the ordering specification symbols and
their meanings is provided in the section on Collation Tailoring.

Once
a sort ordering is defined by specifying the desired relationships
between characters and character sequences, ICU can convert these
relationships to a series of numerical values (one for each level) that
satisfy these same relationships.

This series of numeric values,
representing the relative weighting of a character or character
sequence, is called a Collation Element (CE). A Collation Element is a
32-bit value, consisting of a 16-bit primary, 8-bit secondary, 6-bit
tertiary weight and 2 case bits.

16b primary weight

8b secondary weight

2b case bits

6b tertiary weight

The
sort weight of a string is represented by the collation elements of its
component characters and character sequences. For example, the sort
weight of the string "apple" would consist of its component Collation
Elements, as shown here:

"Apple"

"Apple" Collation Elements

a

[1900.05.05]

p

[3700.05.05]

p l e

[3700.05.05] [2F00.05.05] [2100.05.05]

In
this example, the letter "a" has a 16-bit primary weight of 1900 (hex),
an 8-bit secondary weight of 05 (hex), and a combined 8-bit
case-tertiary weight of 05 (hex).

String comparison is
performed by comparing the collation elements of each string. Each of
the primary weights are compared. If a difference is found, that
difference determines the relationship between the two strings. If no
differences are found, the secondary weights are compared and so forth.

With ICU it is possible to specify how many levels should be
compared. For some applications, it can be desirable to compare only
primary levels or to compare only primary and secondary levels.

Sort Keys

If
a string is to be compared repeatedly, it can be more efficient to use
sort keys. Sort keys are useful in situations where a large amount of
data is indexed and frequently searched. The sort key is generated once
and used in subsequent comparisons, rather than repeatedly generating
the string's Collation Elements. The key comparison is a very efficient
and simple binary compare of strings of unsigned bytes.

An
important property of ICU sort keys is that you can obtain the same
results by comparing 2 strings as you do by comparing the sort keys of
the 2 strings (provided that the same ordering and related collation
attributes are used).

ICU sort key is a pre-processed sequence of
bytes generated from a Unicode string. The weights for each comparison
level are concatenated, separated by a "0x01" byte. The entire sequence
is terminated with a 0x00 byte for convenience in C APIs.

The
diagram below represents an uncompressed sort key in ICU for ease of
understanding. ICU actually compresses the sort keys so that they take
the minimum storage in memory and in databases.

Sort key size

One
of the more important issues when considering using sort keys is the
sort key size. Unfortunately, it is very hard to give a fast exact
answer to the following question: "What is the maximum size for sort
keys generated for strings of size X". This problem is twofold:

the
maximum size of the sort key depends on the size of the collation
elements that are used to build it. Size of collation elements vary
greatly and depends both on the alphabet in question and on the locale
used.

compression is used in building sort keys. Most 'regular' sequences of characters produce very compact sort keys.

If one is to assume the worst case and use too big buffers, a
lot of space will be wasted. However, if you use too small buffers, you
will lose performance if generated sort keys are longer than supplied
buffers too often. A good strategy for this problem would be to
manually manage a large buffer for storing sortkeys and keep a list of
indices to sort keys in this buffer (see the "large buffers" (§) Collation Example
for more details).

Here
are some rules of a thumb, please do not rely on them. If you are
looking at the East Asian locales, you probably want to go with 5 bytes
per code unit. For Thai, 3 bytes per code unit should be sufficient.
For all the other locales (mostly Latin and Cyrillic), you should be
fine with 2 bytes per code unit. These values are based on average
lengths of sortkeys generated with tertiary strength - if you need
quaternary and identical strength (you should not), add 3 bytes per
code unit to each of these.

Partial sort keys

In
some cases, most notably when implementing radix sorting, it is useful
to produce only parts of sort keys at a time. ICU4C 2.6+ provides an
API that allows producing parts of sort keys (ucol_nextSortKeyPart API). These sort keys may or may not be compressed; that is, they may or may not be compatible with regular sort keys.

Merging sort keys

Sometimes,
it is useful to be able to merge sort keys. One example is having
separate sort keys for first and last names. If you need to perform an
operation that requires a sort key generated on the whole name, instead
of concatenating strings and regenerating sort keys, you should merge
the sort keys. The merging is done by merging the corresponding levels
while inserting a terminator between merged parts. The reserved sort key byte value for
the merge terminator is 0x02. For more details see UCA section 1.6, Merging Sort Keys.

C API: unicode/ucol.h ucol_mergeSortkeys()

Java API: com.ibm.icu.text.CollationKey merge(CollationKey source)

CLDR 1.9/ICU 4.6 and later map U+FFFE to a special collation element that is intended to allow concatenating strings like firstName+\uFFFE+lastName to yield the same results as merging their individual sort keys. As of ICU 50, this is not yet fully implemented in ICU.

Generating bounds for a sort key (prefix matching)

Having
sort keys for strings allows for easy creation of bounds - sort keys
that are guaranteed to be smaller or larger than any sort key from a
give range. For example, if bounds are produced for a sortkey of string
"smith", strings between upper and lower bounds with one level would
include "Smith", "SMITH", "sMiTh". Two kinds of upper bounds can be
generated - the first one will match only strings of equal length,
while the second one will match all the strings with the same initial
prefix.

CLDR 1.9/ICU 4.6 and later map U+FFFF to a collation element with the maximum primary weight, so that for example the string "smith\uFFFF" can be used as the upper bound rather than modifying the sort key for "smith".

Collation Element Iterator

The
collation element iterator is used for traversing Unicode string
collation elements one at a time. It can be used to implement
language-sensitive text search algorithms like Boyer-Moore.

For
most applications, the two API categories, compare and sort key, are
sufficient. Most people do not need to manipulate collation elements
directly.

Example:

Consider iterating over "apple" and "äpple". Here are sequences of collation elements:

String 1

String 1 Collation Elements

a

[1900.05.05]

p

[3700.05.05]

p l e

[3700.05.05] [2F00.05.05] [2100.05.05]

String 2

String 2 Collation Elements

a

[1900.05.05]

\u0308

[0000.9D.05]

p

[3700.05.05]

p l e

[3700.05.05] [2F00.05.05] [2100.05.05]

The
resulting CEs are typically masked according to the desired strength,
and zero CEs are discarded. In the above example, masking with
0xFFFF0000 produces the results of NULL secondary and tertiary
differences. The collator then ignores the NULL differences and
declares a match. For more details see the paper "Efficient text
searching in Java™: Finding the right string in any language" by Laura
Werner ( http://icu-project.org/docs/papers/efficient_text_searching_in_java.html
).

Collation Attributes

The
ICU Collation Service has a number of attributes whose values can be
changed during run time. These attributes affect both the functionality
and the performance of the ICU Collation Service. This section
describes these attributes and, where possible, their performance
impact. Performance indications are only approximate and timings may
vary significantly depending on the CPU, compiler, etc.

Although
string comparison by ICU and comparison of each string's sort key give
the same results, attribute settings can impact the execution time of
each method differently. To be precise in the discussion of
performance, this section refers to the API employed in the
measurement. The ucol_strcoll function is the API for string comparison. The ucol_getSortKey function is used to create sort keys.

There is a special attribute value, UCOL_DEFAULT, that can be used to set any attribute to its default value (which is inherited from the UCA and the tailoring).

Attribute Types

Strength level

Collation strength, or the maximum collation level used for comparison, is set by using the UCOL_STRENGTH attribute. Valid values are:

UCOL_NON_IGNORABLE (CLDR/ICU default): variable characters are treated as all the other characters

UCOL_SHIFTED (UCA default): all the
variable characters will be ignored at the primary, secondary and
tertiary levels and their primary strengths will be shifted to the
quaternary level.

Case Ordering

Some conventions require
uppercase letters to sort before lowercase ones, while others require
the opposite. This attribute is controlled by the value of the UCOL_CASE_FIRST.
The case difference in the UCA is contained in the tertiary weights
along with other appearance characteristics (like circling of letters).

The case-first attribute allows for emphasizing of the case
property of the letters by reordering the tertiary weights with either
upper-first, and/or lowercase-first. This difference gets the most
significant bit in the weight.

The case-first attribute does not affect the performance substantially.

Case level

When
this attribute is set, an additional level is formed between the
secondary and tertiary levels, known as the Case Level. The case level
is used to distinguish large and small Japanese Kana characters. Case
level could also be used in other situations. for example to
distinguish certain Pinyin characters.

Case level is controlled by UCOL_CASE_LEVEL attribute. Valid values for this attribute are

UCOL_OFF (default): no additional case level

UCOL_ON : adds a case level

Hiragana Quaternary

Hiragana Quaternary can be set to UCOL_ON, in which case Hiragana code points will sort before everything else on the quaternary level. If set to UCOL_OFF Hiragana letters are treated the same as all the other code points. This setting can be changed on run-time using the UCOL_HIRAGANA_QUATERNARY_MODE attribute. You probably won't need to use it.

Variable Top

Variable
Top is a boundary which decides whether the code points will be treated
as variable (shifted to quaternary level in the shifted mode)
or non-ignorable. Special APIs are used for setting of variable top. It
can basically be set either to a codepoint or a primary strength value.

Performance

ICU collation is designed to be fast, small and customizable. Several techniques are used to enhance the performance:

Providing optimized processing for Latin characters.

Comparing strings incrementally and stop at the first significant difference.

Using a single, shared copy of UCA in memory for the read-only
default sort order. Only small tailoring tables are kept in memory for
locale-specific customization.

Compressing sort keys efficiently.

Making the sort order to be data-driven.

In general, the best performance from the ICU Collation Service is expected by doing the following:

After
opening a collator, keep and reuse it until done. Do not open new
collators for the same sort order. (Note the restriction on
multi-threading.)

Follow the best practice guidelines for generating sort key. Do not call ucol_getSortKey twice to first size the key and then allocate the sort key buffer and repeat the call to the function to fill in the buffer.

Use ucol_strcol when
comparing two strings one time. If it is necessary to compare strings
more than once, create the sort key first and compare the sort keys
instead. Generating the sort keys of two strings is about 5-10 times
slower than just comparing them directly.

Performance and Storage Implications of Attributes

Most
people use the default attributes when comparing strings or when
creating sort keys. When they do want to customize the ordering, the
most common options are the following :

Used to change the ordering of upper vs. lower case letters (as well as small vs. large kana)

UCOL_CASE_LEVEL == UCOL_ON and UCOL_STRENGTH == UCOL_PRIMARY

Used to ignore only the accent differences.

UCOL_NORMALIZATION_MODE == UCOL_ON

Force to always check for normalization. This is used if the input text may not be in FCD form.

UCOL_FRENCH_COLLATION == UCOL_OFF

This is only useful for languages like French and Catalan that turn this attribute on by default.

In
String Comparison, most of these options have little or no effect on
performance. The only noticeable one is normalization, which can cost
10%-40% in performance.

For Sort Keys, most of these options
either leave the storage alone or reduce it. Shifting can reduce the
storage by about 10%-20%; case level + primary-only can decrease it
about 20% to 40%. Using no French accents can reduce the storage by
about 38% , but only for languages like French and Catalan that turn it
on by default. On the other hand, using Shift + Quad can increase the
storage by 10%-15%. (The Identical Level also increases the length, but
this option is not recommended).

All of the above numbers are based on tests run on a particular
machine, with a particular set of data. (The data for each language is
a large number of names in that language in the format
<first_name>, <last name>.) The performance and storage may
vary, depending on the particular computer, operating system, and data.

Versioning

Sort keys
are often stored on disk for later reuse. A common example is the use
of keys to build indexes in databases. When comparing keys, it is
important to know that both keys were generated by the same algorithms
and weightings. Otherwise, identical strings with keys generated on two
different dates, for example, might compare as unequal. Sort keys can
be affected by new versions of ICU or its data tables, new sort key
formats, or changes to the Collator. Starting with release 1.8.1, ICU
provides a versioning mechanism to identify the version information of
the following (but not limited to),

The run-time executable

The collation element content

The Unicode/UCA database

The tailoring table

The version information of Collator is a 32-bit integer. If a
new version of ICU has changes affecting the content of collation
elements, the version information will be changed. In that case, to use
the new version of ICU collator will require regenerating any saved or
stored sort keys. However, since ICU 1.8.1. it is possible to build
your program so that it uses more than one version of ICU. Therefore,
you could use the current version for the features you need and use the
older version for collation.

Programming Examples

See the Collation Examples chapter for an example of how to compare and create sort keys with default locale in C , C++ and Java.