In working on Unicode implementations, it is often useful to
access the full content of the Unicode character database
(UCD). For example, in establishing mappings from characters to
glyphs in fonts, it is convenient to see the character scalar
value, the character name, the character cross-references, the
character East Asian width, along with the shape and metrics of
the proposed glyph to map to; looking at all this data
simultaneously helps in evaluating the mapping.

Accessing the data files that constitute the UCD directly is
sometimes a daunting proposition. The data is dispersed in a number
of files of various formats, and there are just enough
peculiarities (all justified by the processing power available at
the time the UCD representation was designed) to require a fairly
intimate knowledge of the data format itself, in addition to the
meaning of the data.

Many programming environments (e.g. Java or ICU) do give
access to the UCD. However, those environments tend to lag behind
releases of the standard, or support only some of the UCD
content.

Unibook is a wonderful tool to explore the UCD and in many
cases is just the ticket; however, it is difficult to use when the
task at hand has not been built in, or when non-UCD data is to be
displayed alongside the UCD data.

This paper presents an alternative representation of the
UCD, which is meant to overcome these difficulties. We have chosen
an XML representation, because parsing becomes a non-issue: there
are a number of XML parsers freely available, and using them is
often fairly easy. In addition, there are freely available tools
that can perform powerful operations on XML data; for example,
XPATH and XQUERY engines can be thought of as “grep”
for XML data and XSLT engines can be thought of as
“awk” for XML data.
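
As a rough illustration of the “grep” analogy, the following Python sketch selects code points by property value with a single path expression. The fragment it queries is hypothetical: the element and attribute names (ucd, char, cp, age, sc) are assumptions made for this sketch, and the namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; element and attribute names are assumptions
# made for this sketch, and no namespace is used.
SAMPLE = """
<ucd>
  <char cp="1740" age="3.2" sc="Buhd"/>
  <char cp="1741" age="3.2" sc="Buhd"/>
  <char cp="0041" age="1.1" sc="Latn"/>
</ucd>
"""

root = ET.fromstring(SAMPLE)

# ElementTree implements only a subset of XPath, but attribute
# predicates such as [@sc='Buhd'] are part of that subset.
buhid = [c.get("cp") for c in root.findall(".//char[@sc='Buhd']")]
print(buhid)
```

A full XPATH or XQUERY engine would allow richer queries; the point here is only that no UCD-specific parsing code is needed.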

It is important to note that we are interested in exploring
the content of the UCD, rather than in using the UCD data to
process character streams. Thus, we are not much concerned with the
speed of processing or the size of our representation.

Our representation supports the creation of documents that
represent only parts of the UCD, either by not representing all
the characters, or by not representing all the properties. This
can be useful when only some of the data is needed.

Our schema defines a set of valid documents which are
intended to represent properties of Unicode code points and the
characters assigned to them. A document may represent the values
actually assigned in a given version of the UCD, or it may
represent a draft version of the UCD, or a private agreement on
Private Use Area characters. The validity of a document does not
assert anything about the correctness of the values.

Valid documents may provide values for only some of the
Unicode properties. Furthermore, they may also give non-Unicode
properties.

Our schema is defined in English prose. However, a useful
subset of the validity constraints can be captured using a
schema language, thereby simplifying the task of validating
documents. We have chosen Relax NG as the schema language. It is
important to stress that the Relax NG schema captures only part of
the constraints and does not, by itself, define valid documents.

A design principle for our schema is that it supports the
relatively efficient representation of the UCD. This is
achieved by an inheritance mechanism, similar to property
inheritance in CSS or in XSL-FO.

Characters are pervasive in the UCD, and
will need to be represented somehow. Representing characters directly by
themselves would seem the most obvious choice; for example, we
could express that the decomposition of U+00E8 is
“&#x0065;&#x0300;”, i.e. have exactly two
characters in (the infoset of) the XML document. However, the
current XML specification limits the set of characters that can
be part of a document. Another problem is that the various tools
(XML parser, XPATH engine, etc.) may equate U+00E8 with U+0065
U+0300, thus making it difficult to figure out which of the two
sequences is contained in the database (which is sometimes
important for our purposes). Therefore, we chose instead to
represent characters by their code points; we follow the usual
convention of four to six hexadecimal digits (uppercase) and
code points in a sequence separated by space; e.g., the
decomposition of U+00E8 will be represented by the nine
characters “0065 0300” in the infoset.
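
The convention can be sketched in a few lines of Python. The function names are hypothetical and chosen for this illustration only; the formatting rule (uppercase hexadecimal, at least four digits, single spaces between code points) is the one stated above.

```python
def to_code_points(text: str) -> str:
    """Format a string as space-separated code points, e.g. '0065 0300'."""
    # 04X pads to at least four uppercase hex digits; supplementary
    # code points naturally use five or six digits.
    return " ".join(f"{ord(ch):04X}" for ch in text)


def from_code_points(value: str) -> str:
    """Inverse operation: turn '0065 0300' back into a two-character string."""
    return "".join(chr(int(cp, 16)) for cp in value.split())


decomposition = to_code_points("e\u0300")
print(decomposition, len(decomposition))
```

Because the attribute value contains only ASCII digits, letters and spaces, no tool along the way has a reason to normalize it.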

In all our examples, we assume that this namespace is the
default one.

Non-Unicode properties can be represented by using elements
in another (possibly empty) namespace, and/or by attributes in a
non-empty namespace. Such elements and attributes are ignored for
the purpose of determining the validity of a document.
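
A consumer that cares only about the Unicode properties can filter out the foreign material mechanically. The sketch below shows one way to do so with Python's ElementTree; both namespace URIs are placeholders invented for this example (in particular, UCD_NS is not the actual namespace of our schema), as are the x:glyph attribute and the x:note element.

```python
import xml.etree.ElementTree as ET

UCD_NS = "http://example.org/ucd"      # placeholder, not the real namespace
EXTRA_NS = "http://example.org/extra"  # hypothetical private namespace

SAMPLE = f"""
<char xmlns="{UCD_NS}" xmlns:x="{EXTRA_NS}" cp="20AC" x:glyph="Euro">
  <x:note>checked against the font</x:note>
</char>
"""


def schema_view(elem):
    """Return the attributes and children that matter for validity."""
    # ElementTree spells namespaced names as '{uri}local'; attributes
    # without a namespace have plain names.
    attrs = {k: v for k, v in elem.attrib.items() if not k.startswith("{")}
    children = [c for c in elem if c.tag.startswith("{" + UCD_NS + "}")]
    return attrs, children


attrs, children = schema_view(ET.fromstring(SAMPLE))
print(attrs, children)
```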

To facilitate the identification of a collection, this
element may have an attribute desc, which is
any string. It is recommended that if the document purports to
represent the UCD of some Unicode version, the
desc be selected in accord with the rules
listed at http://www.unicode.org/versions/;
and conversely, that documents which do not purport to represent
the UCD be described as such.

A mandatory
type attribute indicates whether the code
point is reserved, designated as a
noncharacter or a surrogate,
or assigned to an abstract
character. A mandatory cp
attribute records the code point in question. Other attributes
and subelements are used to represent the properties of that
code point.

It is often the case that many code points share the same
values of some property or properties. For example, the
characters U+1740 BUHID LETTER A .. U+1753 BUHID VOWEL SIGN U all
have the age “3.2”, and all have the script
“Buhd”. On the one hand, it is convenient to
support data files in which those properties are explicitly
listed with every code point, as this makes answering questions
like “what is the age of U+1749” easier, since
no context is needed. On the other hand, this leads to rather
large data files, and it also tends to obscure the differences
between similar characters.

Our representation accommodates both scenarios by having
the notion of groups. A group is simply a
container of code points that also holds default values for the
properties. If a code point inside a group
does not list explicitly a property but the group
lists it, then the code point inherits that property from its
group. For example, the fragment with
explicit properties:

As this example illustrates, the notion of
group does not necessarily align with the
notion of Unicode block. It is entirely defined and limited to
our representation. In particular, the value of a property for a
code point can always be determined from the XML document alone,
assuming that this property and this code point are expressed at
all. Of course, one may create an XML representation where the
groups happen to coincide with the Unicode blocks.

Groups cannot be nested; this simplifies the discovery of
inherited attributes, as they are precisely in the parent
element.
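
Because groups cannot be nested, resolving a property requires looking at no more than two elements. The following Python sketch illustrates the lookup; the fragment is hypothetical, the attribute names age, sc and gc are assumptions made for the example, and the namespace is omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the group supplies defaults, one char overrides gc.
SAMPLE = """
<ucd>
  <group age="3.2" sc="Buhd" gc="Lo">
    <char cp="1740"/>
    <char cp="1752" gc="Mn"/>
  </group>
</ucd>
"""


def property_value(root, cp, prop):
    """Return the value of prop for code point cp, or None if not expressed."""
    for parent in root.iter():
        for child in parent:
            if child.tag == "char" and child.get("cp") == cp:
                if prop in child.attrib:
                    return child.get(prop)      # explicit value wins
                if parent.tag == "group":
                    return parent.get(prop)     # inherited from the group
                return None
    return None


root = ET.fromstring(SAMPLE)
print(property_value(root, "1740", "sc"), property_value(root, "1752", "gc"))
```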

The unified Han ideographs have a very special structure:
they share the same basic properties, and their names are all of
the form “CJK UNIFIED
IDEOGRAPH-cp”, where
cp is their code point. The grouping
mechanism established so far would take care of the basic
properties, but not of the names. To accommodate this, we have a
further convention: if the name property on a group contains the
character U+002A * ASTERISK, then the inherited value is
obtained by replacing this character by the code point. For
example:

We need a final piece of infrastructure to further reduce
the size of the data files: if a group contains a number of
char elements which only have a
cp attribute each, and the values of those
attributes form an interval (contiguous, with no gaps), then it can
equivalently be represented by placing the attributes
first-char and last-char
on the group element:

It is important to stress that the mechanism of groups and
the special syntax for names are entirely defined by our
representation and do not depend on anything in the Unicode
standard itself. It should be possible to build a program that
takes a document that uses groups and creates another equivalent
document that does not use them, and uses only the text of this
section to do so.
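
A minimal sketch of such a program follows, in Python. It applies the three rules of this section: inheritance of group attributes, substitution of the asterisk in names, and expansion of first-char and last-char. It assumes, for illustration only, that the name property is carried by an attribute called na, that groups contain only char elements, and that no namespace is in use; the sample fragment is hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

SAMPLE = """
<ucd>
  <group first-char="4E00" last-char="4E02" age="1.1" sc="Hani"
         na="CJK UNIFIED IDEOGRAPH-*"/>
</ucd>
"""


def expand_groups(root):
    """Return a new tree where each group is replaced by explicit chars."""
    out = ET.Element(root.tag, dict(root.attrib))
    for item in root:
        if item.tag != "group":
            out.append(copy.deepcopy(item))
            continue
        defaults = dict(item.attrib)
        first = defaults.pop("first-char", None)
        last = defaults.pop("last-char", None)
        chars = [copy.deepcopy(c) for c in item]
        if first is not None and last is not None:
            # The range stands for char elements carrying only a cp attribute.
            for cp in range(int(first, 16), int(last, 16) + 1):
                chars.append(ET.Element("char", {"cp": f"{cp:04X}"}))
        for char in chars:
            for prop, value in defaults.items():
                if prop not in char.attrib:
                    if prop == "na":  # "na" is an assumed attribute name
                        value = value.replace("*", char.get("cp"))
                    char.set(prop, value)
            out.append(char)
    return out


flat = expand_groups(ET.fromstring(SAMPLE))
print(ET.tostring(flat, encoding="unicode"))
```

Note that the sketch never consults the UCD or the Unicode standard: everything it needs is in the input document.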

The bidirectional category is represented by the
bc attribute. The possible values are
those listed in TUS 4.0, table 3.8.

The mirrored property is represented by the
Bidi_M attribute, which can take the values
“Y” or “N”.

If the mirrored property is true, then the
bmg attribute may be present. Its value
is the code point of a character whose glyph is typically a
mirrored image of a typical glyph for the current character.

Note that we do not express the “Best Fit”
element recorded in BidiMirroring.txt. For one thing, it is
not meant to be machine readable. More importantly, the idea
underlying the mirrored glyph is delicate to use, since it
makes assumptions about the design of the fonts, and the best
fit goes even further.
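
For readers who want to cross-check the bc and Bidi_M attributes, the sketch below derives them from Python's unicodedata module. This is only an illustration: the function name is hypothetical, the module reflects whichever Unicode version the Python installation ships with (not necessarily the version a given document represents), and it offers no equivalent of bmg.

```python
import unicodedata


def bidi_attributes(cp: str) -> dict:
    """Build the bc and Bidi_M attribute values for a code point like '0028'."""
    ch = chr(int(cp, 16))
    return {
        # bidirectional() returns the short category name, or an empty
        # string when the module has no value for the code point.
        "bc": unicodedata.bidirectional(ch),
        # mirrored() returns 1 or 0; the representation uses "Y" or "N".
        "Bidi_M": "Y" if unicodedata.mirrored(ch) else "N",
    }


print(bidi_attributes("0028"), bidi_attributes("0041"))
```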

The decomposition type is represented by the
dt attribute. The possible values are
can for characters with a canonical
decomposition, no for characters without
a decomposition (either canonical or compatibility) or the
tag of a compatibility decomposition (using the values
defined by PropertyAliases).

If the decomposition type is not no,
then the decomposition mapping, recorded by the
dm attribute, is meaningful. The value of
this attribute is the code point sequence into which this
character decomposes.
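
The relationship between dt and dm can be illustrated with Python's unicodedata module, whose decomposition() function returns the decomposition field in the UnicodeData.txt style (an optional tag in angle brackets followed by code points). The function name and the TAGS table below are assumptions of this sketch: the table maps the long tags to short values as we recall them from the property value aliases, and should be checked against the data files before being relied upon.

```python
import unicodedata

# Long compatibility tags to short values; written from memory of the
# value aliases, so treat this table as an assumption to be verified.
TAGS = {
    "circle": "enc", "compat": "com", "final": "fin", "font": "font",
    "fraction": "fra", "initial": "init", "isolated": "iso",
    "medial": "med", "narrow": "nar", "noBreak": "nb", "small": "sml",
    "square": "sqr", "sub": "sub", "super": "sup", "vertical": "vert",
    "wide": "wide",
}


def decomposition_attributes(cp: str) -> dict:
    """Build dt (and dm when meaningful) for a code point like '00E8'."""
    raw = unicodedata.decomposition(chr(int(cp, 16)))
    if not raw:
        return {"dt": "no"}                      # no decomposition at all
    parts = raw.split()
    if parts[0].startswith("<"):                 # compatibility decomposition
        tag = parts[0].strip("<>")
        return {"dt": TAGS.get(tag, tag), "dm": " ".join(parts[1:])}
    return {"dt": "can", "dm": raw}              # canonical decomposition


print(decomposition_attributes("00E8"))
```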

If a character is cased (that is, its general category
is Lu, Ll or Lt), then simple case mappings are recorded using
the suc, slc,
stc attributes. The values of these
attributes are character sequences.

Typically, aligning the groups with the Unicode blocks leads
to fairly compact data, as can be seen above. This also helps
spot the particularities of individual characters relative to
their group: the unusual line breaking of U+20A7 PESETA SIGN, the
unusual East Asian width of U+20AC EURO SIGN.

There are a few instances where a block has vastly different
characters and breaking it into multiple groups makes for a much
more readable XML representation. For example, isolating the
C0 and C1 controls in their own groups, or isolating the
noncharacters (especially those in the Arabic Presentation Forms-A
block) is beneficial.

When the Unihan properties are not included in the XML
representation, we get a fairly compact representation:

Another interesting example is the beginning of the group
for Hangul Syllables. Because the space concerns are not paramount
in our representation, we can avoid all the
“built-in” knowledge of those characters:

Yet, the resulting XML files are not wasteful of space. In fact,
an experimental version of the 4.0.1 UCD without the Unihan
properties is 1,923,716 bytes, while the corresponding UCD files
are 2,306,655 bytes. Similarly, an experimental version with the
Unihan properties is roughly equal in size to Unihan.txt
itself.