Summary

This annex describes an XML representation of the Unicode
Character Database.

Status

This document has been reviewed by Unicode members and
other interested parties, and has been approved for publication
by the Unicode Consortium. This is a stable document and may be
used as reference material or cited as a normative reference by
other specifications.

A Unicode Standard Annex (UAX) forms an integral
part of the Unicode Standard, but is published online as a
separate document. The Unicode Standard may require
conformance to normative content in a Unicode Standard Annex,
if so specified in the Conformance chapter of that version of
the Unicode Standard. The version number of a UAX document
corresponds to the version of the Unicode Standard of which it
forms a part.

Please submit corrigenda and other comments with the
online reporting form [Feedback]. Related
information that is useful in understanding this annex is found
in Unicode Standard Annex #41, “Common
References for Unicode Standard Annexes.” For the
latest version of the Unicode Standard, see [Unicode]. For
a list of current Unicode Technical Reports, see [Reports]. For
more information about versions of the Unicode Standard, see
[Versions]. For any errata which may apply to this annex, see
[Errata].

In working on Unicode implementations, it is often useful to
access the full content of the Unicode Character Database
(UCD). For example, in establishing mappings from characters to
glyphs in fonts, it is convenient to see the character scalar
value, the character name, the character East Asian width, along
with the shape and metrics of the proposed glyph to map to;
looking at all this data simultaneously helps in evaluating the
mapping.

Directly accessing the data files that constitute the UCD is
sometimes a daunting proposition. The data is dispersed in a number
of files of various formats, and there are just enough
peculiarities (all justified by the processing power available at
the time the UCD representation was designed) to require a fairly
intimate knowledge of the data format itself, in addition to the
meaning of the data.

Many programming environments (for example, Java or ICU) do give
access to the UCD. However, those environments tend to lag behind
releases of the standard, or support only some of the UCD
content.

Unibook is a wonderful tool to explore the UCD and in many
cases is just the ticket; however, it is difficult to use when the
task at hand has not been built in, or when non-UCD data is to be
displayed as well.

This annex presents an alternative representation of the
UCD, which is meant to overcome these difficulties. We have chosen
an XML representation, because parsing becomes a non-issue: there
are a number of XML parsers freely available, and using them is
often fairly easy. In addition, there are freely available tools
that can perform powerful operations on XML data; for example,
XPATH and XQUERY engines can be thought of as a “grep”
for XML data and XSLT engines can be thought of as
“awk” for XML data.
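
As a rough illustration of the “grep” analogy, the following sketch uses the limited XPath subset of Python's standard xml.etree.ElementTree module to select code points by property value. The sample fragment is hypothetical: the namespace declared by real data files is omitted for brevity, and the helper name code_points_with is ours, not part of this annex.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; real files declare an XML namespace, omitted here.
SAMPLE = """
<ucd>
  <char cp="1740" age="3.2"/>
  <char cp="1741" age="3.2"/>
  <char cp="1820" age="3.0"/>
</ucd>
"""


def code_points_with(root, attribute, value):
    """Return the cp value of every char element whose attribute equals value."""
    # ElementTree understands [@attr='value'] predicates in its XPath subset.
    query = f".//char[@{attribute}='{value}']"
    return [element.get("cp") for element in root.findall(query)]


root = ET.fromstring(SAMPLE)
print(code_points_with(root, "age", "3.2"))
```

A full XPath or XQuery engine would allow richer queries; the standard library subset is enough for simple attribute filters such as this one.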

It is important to note that we are interested in exploring
the content of the UCD, rather than in using the UCD data to process
character streams. Thus, we are not concerned so much by the
speed of processing or the size of our representation.

Our representation supports the creation of documents that
represent only parts of the UCD, either by not representing all
the characters, or by not representing all the properties. This
can be useful when only some of the data is needed.

Our schema can be used to create and validate documents
which are intended to represent properties of Unicode code
points, blocks, named sequences, normalization corrections, and
standardized variants. A document may represent the values
actually assigned in a given version of the UCD, or it may
represent a draft version of the UCD, or a private agreement on
Private Use characters. The validity of an XML document with
respect to the schema defined in this annex does not assert
anything about the correctness of the values.

Valid documents may provide values for only some of the
code points, or some of the Unicode properties. Furthermore,
they may also incorporate non-Unicode properties.

Our schema is defined using English. However, a useful
subset of the validity constraints can be captured using a
schema language, thereby simplifying the task of validating
documents. We have chosen Relax NG [ISO
19757], in the compact syntax [ISO
19757 Amd1], as the schema language. It is important to
stress that the schema which is defined in English imposes
more constraints on the documents than can be validated
with the Relax NG schema.

An important characteristic
of Relax NG is that its schemas do not modify or augment the
infoset of the documents. Therefore, it is possible to process
our XML representation without using the schema. Also, the
schema is relatively straightforward and can be converted
mechanically to other schema languages.

While our XML representation is not intended to be used
during processing of characters and strings, it is still a
design principle for our schema to support the relatively
efficient representation of the UCD. This is achieved by an
inheritance mechanism, similar to property inheritance in CSS or
in XSL:FO (see section 4.3
Group).

Many invariants impose constraints on the values of the
different properties for a given code point. For example, if the
value of the Numeric Type property is None, then the value of
the Numeric Value property should be the empty string; and if
the value of the Other Alphabetic property is true, then the
value of the Alphabetic property should be true. Those
invariants are not captured in the schema.
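
Because such invariants fall outside the schema, a consumer that cares about them has to check them separately. The sketch below checks the two invariants just mentioned against a dictionary of attribute values for one code point. It rests on assumptions not stated in this section: that the attribute names are the abbreviated property names nt, nv, OAlpha and Alpha, and that boolean properties use the value "Y" for true. The function name is hypothetical.

```python
def invariant_violations(properties):
    """Return a list of messages for the two example invariants that fail.

    properties maps attribute names to values for a single code point.
    Assumed names: nt, nv, OAlpha, Alpha; assumed boolean true value: "Y".
    Properties that are absent are not checked, since a document may
    legitimately represent only some of the properties.
    """
    problems = []
    # Numeric Type None implies an empty Numeric Value.
    if properties.get("nt") == "None" and properties.get("nv", "") != "":
        problems.append("nt is None but nv is not empty")
    # Other Alphabetic true implies Alphabetic true.
    if properties.get("OAlpha") == "Y" and properties.get("Alpha", "Y") != "Y":
        problems.append("OAlpha is true but Alpha is not")
    return problems


print(invariant_violations({"nt": "None", "nv": "", "OAlpha": "Y", "Alpha": "Y"}))
print(invariant_violations({"nt": "None", "nv": "5", "OAlpha": "Y", "Alpha": "N"}))
```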

Characters are pervasive in the UCD, and will
need to be represented. Representing characters directly by
themselves would seem the most obvious choice; for example, we
could express that the decomposition of U+00E8 is
“&#x0065;&#x0300;”, that is, have exactly two
characters in (the infoset of) the XML document. However, the
current XML specification limits the set of characters that can
be part of a document. Another problem is that the various tools
(XML parser, XPATH engine, etc.) may equate U+00E8 with U+0065
U+0300, thus making it difficult to figure out which of the two
sequences is contained in the database (which is sometimes
important for our purposes). Therefore, we chose instead to
represent characters by their code points; we follow the usual
convention of four to six hexadecimal digits (uppercase) and
code points in a sequence separated by space; for example, the
decomposition of U+00E8 will be represented by the nine
characters “0065 0300” in the infoset.
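
The convention can be sketched with two small helpers that convert between a string and its code point representation. The function names are ours and purely illustrative.

```python
def to_code_points(text):
    """Render a string as uppercase hexadecimal code points separated by spaces.

    Each code point uses at least four digits; supplementary code points
    naturally use five or six.
    """
    return " ".join(format(ord(character), "04X") for character in text)


def from_code_points(value):
    """Rebuild the string designated by a space-separated code point list."""
    return "".join(chr(int(token, 16)) for token in value.split())


decomposition = to_code_points("e\u0300")
print(decomposition, len(decomposition))
```

Because the attribute value contains only ASCII digits, letters and spaces, no XML tool will normalize it or reject it, which is the point of the convention.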

The root element may have a description
child element, which in turn contains any string, which is meant
to describe what the XML document purports to describe.

It is recommended that if the document purports to
represent the UCD of some Unicode version, the
description be selected in accord with the
rules listed in [Versions]; and
conversely, that documents which do not purport to represent the
UCD be described as such.

It is often the case that successive code points have the
same property values, for a given set of properties. The
most striking example is that of an unallocated plane, where all
but the last two code points are reserved and have the same
property values. Another example is the URO (U+4E00
.. U+9FA5) where all the code points have the same property
values if we ignore their name and their Unihan
properties.

This observation suggests that it is profitable to
represent sets of code points which share the same properties,
rather than individual code points. To make the representation
of the sets simple, we restrict them to be segments in the code
point space, that is, a set is defined by the first and last code
point it contains. Those are captured by the attributes
first-cp and last-cp. The
attribute cp is a shorthand notation for the
case where the set has a single code point.
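
A consumer can treat both forms uniformly. The following sketch, with a hypothetical helper name, turns either form into a Python range of integer code points.

```python
import xml.etree.ElementTree as ET


def code_point_range(element):
    """Return the segment of code points an element stands for, as a range."""
    cp = element.get("cp")
    if cp is not None:
        # Shorthand form: a set with a single code point.
        first = last = int(cp, 16)
    else:
        first = int(element.get("first-cp"), 16)
        last = int(element.get("last-cp"), 16)
    return range(first, last + 1)  # last-cp is inclusive


single = ET.fromstring('<char cp="1740"/>')
segment = ET.fromstring('<char first-cp="4E00" last-cp="9FA5"/>')
print(len(code_point_range(single)), len(code_point_range(segment)))
```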

While we already recognized the situation where a set of
code points have exactly the same set of property values, another
common situation is that of code points which have almost all
the same property values.

For example, the characters U+1740 BUHID LETTER A .. U+1753
BUHID VOWEL SIGN U all have the age “3.2”, and all
have the script “Buhd”. On the one hand, it is
convenient to support data files in which those properties are
explicitly listed with every code point, as this makes answering
questions like “what is the age of U+1749?”
easier, because that data is expressed right there. On the other
hand, this leads to rather large data files, and it also tends
to obscure the differences between similar characters.

Our representation accounts for this situation with the
notion of groups. A group element is simply a
container of code points that also holds default values for the
properties. If a code point inside a group
does not explicitly list a property but the group
lists it, then the code point inherits that property from its
group. For example, the fragment with
explicit properties:

The element for U+1740 does not have the
age attribute, and it therefore inherits it
from its enclosing group element,
that is “3.2”. On the other hand, the element for
U+1820 does have this attribute, so the value is
“3.0”.
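
The inheritance rule can be sketched as follows. The fragment embedded in the code is a hypothetical one built to match the description above (a group carrying age “3.2”, one element that inherits it and one that overrides it); the helper name is ours.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment consistent with the description in the text.
FRAGMENT = """
<group age="3.2">
  <char cp="1740"/>
  <char cp="1820" age="3.0"/>
</group>
"""


def effective_properties(group):
    """Map each cp in a group to its attributes with group defaults applied."""
    resolved = {}
    for char in group.findall("char"):
        properties = dict(group.attrib)  # start from the group's defaults
        properties.update(char.attrib)   # explicit values on the element win
        resolved[char.get("cp")] = properties
    return resolved


resolved = effective_properties(ET.fromstring(FRAGMENT))
print(resolved["1740"]["age"], resolved["1820"]["age"])
```

Only one level of defaults has to be consulted: the element itself, then its enclosing group.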

As this example illustrates, the notion of
group does not necessarily align with the
notion of Unicode block. It is defined entirely by, and limited to,
our representation. In particular, the value of a property for a
code point can always be determined from the XML document alone,
assuming that this property and this code point are expressed at
all. Of course, one may create an XML representation where the
groups happen to coincide with the Unicode blocks.

Groups cannot be nested. The motivation for this
limitation is to make the life of consumers easier: either a
property is defined by the element for a code point, or it is
defined by the immediately enclosing group
element.

Each property, except for the Block and
Special_Case_Condition properties, is represented by an
attribute.

The name of the attribute is the abbreviated name of the
property as given in the file PropertyAliases.txt in version
5.2.0 of the UCD. For the Unihan properties, the name is that
given in the various versions of Unihan.txt (some properties are
no longer present in version 5.2.0).

For catalog and enumerated properties, the values are
those listed in the file PropertyValueAliases.txt in version
5.2.0 of the UCD; if there is an abbreviated name, it is used,
otherwise the long name is used.

The majority of the characters in Unicode have a name
which is of the form CJK UNIFIED IDEOGRAPH-<code
point>. It also happens that character names cannot contain
the character U+0023 # NUMBER SIGN, so we adopted the
following convention: if a code point has the attribute
na (either directly or by inheritance from an
enclosing group), then occurrences of the character # in the
name are to be interpreted as the value of the code point. For
example:

<char cp="3400" na="CJK UNIFIED IDEOGRAPH-3400"/>

and

<char cp="3400" na="CJK UNIFIED IDEOGRAPH-#"/>

are equivalent. The # can be in any position in the value
of the na attribute. The convention also applies
just as well to a set of multiple code points:
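
A minimal sketch of the expansion, applied to a hypothetical three code point segment, is given here. The helper names are ours, and the sketch assumes that the code point is substituted using the same uppercase hexadecimal form used elsewhere in the representation.

```python
import xml.etree.ElementTree as ET


def expand_name(template, code_point):
    """Replace every # in a name template with the code point in hexadecimal."""
    return template.replace("#", format(code_point, "04X"))


def names_for(element):
    """Return a mapping from code point to expanded name for one element."""
    # cp is the shorthand for a segment with identical first and last.
    first = int(element.get("first-cp", element.get("cp")), 16)
    last = int(element.get("last-cp", element.get("cp")), 16)
    template = element.get("na")
    return {cp: expand_name(template, cp) for cp in range(first, last + 1)}


# Hypothetical segment; a real document would cover a whole ideograph range.
segment = ET.fromstring(
    '<char first-cp="3400" last-cp="3402" na="CJK UNIFIED IDEOGRAPH-#"/>'
)
names = names_for(segment)
print(names[0x3401])
```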

Note that we do not express the “Best Fit”
element recorded in BidiMirroring.txt. For one thing, it is
not meant to be machine readable. More importantly, the idea
underlying the mirrored glyph is delicate to use, since it
makes assumptions about the design of the fonts, and the best
fit goes even farther.

The decomposition type and decomposition mapping
properties are represented by the dt and
dm attributes.

Most characters have a decomposition mapping to
themselves. This is very similar to the situation we
encountered with names, and we adopted a similar convention: if
the value of a decomposition mapping is the
character itself, we use the attribute value # (U+0023
# NUMBER SIGN) as a shorthand notation; this enables
those attributes to be captured in groups.

Most characters have case mapping and case folding
properties that simply map or fold to themselves. This is
very similar to the situation we encountered with names, and
we adopted a similar convention: if the value of a case
mapping or case folding property is the character itself, we
use the attribute value # (U+0023 # NUMBER SIGN) as a
shorthand notation; this enables those attributes to be
captured in groups.
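
The same resolution step serves for decomposition mappings and for case mappings and foldings. The sketch below uses a hypothetical helper that takes the raw attribute value and the code point of the element being examined.

```python
def resolve_mapping(value, code_point):
    """Return the list of code points designated by a mapping attribute value.

    A value of "#" stands for the code point itself; any other value is
    read as a space-separated list of hexadecimal code points.
    """
    if value == "#":
        return [code_point]
    return [int(token, 16) for token in value.split()]


# A group could carry dm="#" as a default for every code point it contains.
print(resolve_mapping("#", 0x1740))
# An explicit mapping on an element is read literally.
print(resolve_mapping("0065 0300", 0x00E8))
```

Since the shorthand value is identical for every code point that maps to itself, it can be stated once on a group rather than on each element.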

The simple case mappings are recorded in the
suc, slc, and
stc attributes.

The Case_Ignorable, Cased,
Changes_When_Casefolded, Changes_When_Casemapped,
Changes_When_Lowercased, Changes_When_NFKC_Casefolded,
Changes_When_Titlecased, Changes_When_Uppercased and
NFKC_Casefold properties are recorded in these
attributes:

The normalization-corrections child of
the ucd describes the normalization
corrections. It has one child
normalization-correction element per
correction, with attributes to describe the code point affected,
its old normalization, its new normalization and the version of
Unicode in which the correction was made.

The standardized-variants child of the
ucd describes the standardized variants. It has one
child element standardized-variant per variant. The
attributes on that last element capture the variation sequence,
the description of the desired appearance, and the shaping environment under
which the appearance is different.

The cjk-radicals child of the
ucd describes the CJK radicals. It has one
child element cjk-radical per radical. The
attributes on that last element capture the radical number, the
corresponding CJK radical character, and the corresponding CJK
unified ideograph.