6.1 Introduction

Notation and symbols have proved very important for
mathematics. Mathematics has grown in part because of the
succinctness and suggestiveness of its evolving notation. There
have been many new signs evolved for use in mathematical
notation, and mathematicians have not held back from making use
of many symbols originally developed elsewhere. The result is
that mathematics makes use of a very large collection of
symbols. It is difficult to write mathematics fluently if these
characters are not available for use in coding. It is difficult
to read mathematics if corresponding glyphs are not available
for presentation on specific display devices.

This situation posed a problem for the first W3C Math Working Group
when it was brought into existence. Developing a specification that
enables mathematics to be used with HTML, and producing a DTD for it,
did not naturally require attention to more than the entities allowed
in the DTD. However, as experience has shown, a
long list of entities with no means to display them is of little use,
and a cause of frequent frustrations in trying to use a standard. On
the other hand, a large collection of glyphs and fonts representing
characters without a standard way to refer to them is not of much use
either.

The W3C Math Working Group therefore took on directly the task of
specifying part of the full mechanism needed to proceed from
notation to final presentation, and started collaboration with
organizations undertaking specification of the rest.

This chapter of the MathML Specification contains a listing of
character names for use in MathML, recommendations for their use, and
warnings to pay attention to the correct form of the corresponding
code points given in the UCS (Universal Character Set) as codified in
Unicode and ISO 10646 [see [Unicode] and the
Unicode Web site]. For simplicity
we shall refer to this character set by the short name Unicode.
Though Unicode changes from time to time so that it is specified
exactly by using version numbers, unless this brings clarity on some
point we shall not use them. This specification of MathML makes use
of some characters that are not part of Unicode 3.0 but which have
been proposed to the Unicode Technical Committee (UTC), and thus for
inclusion in ISO 10646. They are presently expected to be in the
revisions Unicode 3.1 and 3.2. (For more detail about this see
Section 6.4.4 [Status of Character Encodings].)

While the process of review and adoption by UTC and ISO/IEC of the
characters of special interest to mathematics and MathML is largely
complete (Unicode Work
in Progress) there remains the possibility of some further
modification of the lists of characters accepted, of the code
assignments for those adopted, or of the names given them by Unicode.
To make sure any possible corrections to relevant standards are taken
into account, and for the latest character tables and font information,
see the W3C Math Working Group
home page and the Unicode
site.

6.2 MathML Characters

A MathML token element (see Section 3.2 [Token Elements] and Section 4.4.1 [Token Elements]) takes as content a sequence of MathML
Characters. MathML Characters are defined to be either
Unicode characters legal in XML documents or mglyph elements. The latter are used to represent
characters that do not have a Unicode encoding, as described in
Section 3.2.9 [Adding new character glyphs to MathML
(mglyph)]. Because the Unicode UCS provides
approximately one thousand special alphabetic characters for the use
of mathematics (Unicode 3.1), and will provide over 900
special symbols in Unicode 3.2, the need for
mglyph should be rare.

6.2.1 Unicode Character Data

As always in XML, any character allowed by XML may be used in MathML
in an XML document. The legal characters have the hexadecimal code
numbers 09 (tab = U+0009), 0A (line feed = U+000A), 0D (carriage
return = U+000D), 20-D7FF (U+0020..U+D7FF), E000-FFFD
(U+E000..U+FFFD), and 10000-10FFFF (U+010000..U+10FFFF). The
parenthetical notation beginning with U+ is one recommended by Unicode
for referring to Unicode characters [see [Unicode], page
xxviii]. The exclusions above code number D7FF are of the blocks used
in surrogate pairs, and the two characters guaranteed not to be
Unicode characters at all. U+FFFE is excluded to allow determination
of byte order in certain encodings.
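
The ranges listed above can be checked mechanically. The following is a minimal sketch in Python; the names XML_LEGAL_RANGES, is_legal_xml_char and first_illegal_char are illustrative and are not part of MathML or XML.

```python
# Code point ranges that XML allows, copied from the paragraph above.
XML_LEGAL_RANGES = [
    (0x09, 0x09),         # tab
    (0x0A, 0x0A),         # line feed
    (0x0D, 0x0D),         # carriage return
    (0x20, 0xD7FF),       # everything below the surrogate blocks
    (0xE000, 0xFFFD),     # excludes U+FFFE and U+FFFF
    (0x10000, 0x10FFFF),  # Planes 1 through 16
]


def is_legal_xml_char(ch):
    """Return True if the single character ch may appear in an XML document."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in XML_LEGAL_RANGES)


def first_illegal_char(text):
    """Return (index, 'U+XXXX') for the first illegal character, or None."""
    for index, ch in enumerate(text):
        if not is_legal_xml_char(ch):
            return index, "U+{:04X}".format(ord(ch))
    return None
```

A generator of MathML could run such a check on token content before serializing it.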

There are essentially three different ways of encoding character data.

Using characters directly: For example, an A may be entered as `A'
from a keyboard (character U+0041). This option is only available
if the character encoding specified for the XML document includes
the character. Most commonly used encodings will have `A' in the
ASCII position. In many encodings, characters may need more than
one byte. Note that if the document is, for example, encoded in
Latin-1 (ISO-8859-1) then only the characters in that
encoding are available directly. Unfortunately, most mathematical
symbols may not be encoded as character data in this way.

Using Numeric XML character references: Using this notation, `A' may be
represented as &#65; (decimal) or &#x41; (hex).
Note that the numbers always refer to the Unicode encoding (and not to
the character encoding used in the XML file). By using Character
references it is always possible to access the entire Unicode range.
For a general XML vocabulary, there is a disadvantage to this approach:
character references may not be used in XML element or attribute
names. However, this is not an issue for MathML, as all element names in
MathML are restricted to ASCII characters.
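
As an illustration, the sketch below uses Python's standard xml.etree.ElementTree parser to show that the decimal and hexadecimal references for the letter A (code point 65, hexadecimal 41) yield the same character, and that a reference to a Plane 1 character resolves even when the file itself is declared as ISO-8859-1. The helper name token_text is hypothetical.

```python
import xml.etree.ElementTree as ET


def token_text(xml_source):
    """Parse a one-element document and return its character content."""
    return ET.fromstring(xml_source).text


# Decimal and hexadecimal references denote the same Unicode character.
decimal_form = token_text("<mi>&#65;</mi>")
hex_form = token_text("<mi>&#x41;</mi>")

# The number refers to Unicode, not to the encoding of the file: a Latin-1
# encoded document can still reach a character far outside Latin-1.
latin1_document = b'<?xml version="1.0" encoding="ISO-8859-1"?><mi>&#x1D504;</mi>'
plane1_char = token_text(latin1_document)
```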

Using entity references: The MathML DTD defines internal entities that
expand to character data. Thus for example the entity reference
&eacute; may be used rather than the character reference
&#xE9; or, if, for example, the document is encoded in
ISO-8859-1, the character é. An XML fragment that uses an entity
reference which is not defined in a DTD is not well formed; therefore
it will be rejected by an XML parser. For this reason
every fragment using entity references must
use a DOCTYPE declaration which specifies the MathML DTD, or a DTD
that at least declares any entity reference used in the MathML
instance. The need to use a DOCTYPE complicates inclusion of MathML in
some documents. For this reason entity references are perhaps not
optimal for use in generated MathML; however, they are very useful for
small illustrative examples, and are used in most examples in this
document.
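
The effect of a missing declaration can be observed with any conforming XML parser. The sketch below uses Python's standard xml.etree.ElementTree. As an assumption made for brevity, the entity is declared in an internal DTD subset rather than by referencing the MathML DTD itself; the names UNDECLARED, DECLARED and parse_text are illustrative.

```python
import xml.etree.ElementTree as ET

# No DOCTYPE: the entity reference is undefined, so the parser must reject it.
UNDECLARED = "<mtext>caf&eacute;</mtext>"

# An internal subset that declares just the one entity used (a stand-in for
# the full MathML DTD, which declares all of the MathML entity names).
DECLARED = (
    '<!DOCTYPE mtext [<!ENTITY eacute "&#xE9;">]>'
    "<mtext>caf&eacute;</mtext>"
)


def parse_text(source):
    """Return the element's text, or None if the parser rejects the input."""
    try:
        return ET.fromstring(source).text
    except ET.ParseError:
        return None
```

Here parse_text(UNDECLARED) fails while parse_text(DECLARED) yields the text with the entity expanded.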

6.2.2 Special Characters Not in Unicode

For special purposes, one may need to use a character which is not in
Unicode, even with the expected additions. In these cases
one may use the mglyph
element to access a glyph from some font directly and to create
a corresponding MathML character.
All MathML token elements that accept character data also accept an
mglyph in their content.

Beware, however, that the font chosen may not be available to all
MathML processors.

6.2.3 Mathematical Alphabetic Symbol Characters

A noticeable feature of mathematical and scientific writing is the use
of single letters to denote variables and constants in a given
context. The increasing complexity of science has led to the use of
certain common alphabet and font variations to provide enough special
symbols of this letter-like type. These denotations are in fact
not letters that may be used to make up words with
recognized meanings, but individual carriers of semantics themselves.
Writing a string of such symbols is usually interpreted in terms of
some composition law, for instance, multiplication. Many letter-like
symbols may be quickly interpreted by specialists in a given area as
of a certain mathematical type: for instance, bold symbols, whether
based on Latin or Greek letters, as vectors in physics or engineering,
or fraktur symbols as Lie algebras in part of pure mathematics. Again,
in given areas of science, some constants are recognizable letter
forms. When you look carefully at the range of letter-like
mathematical symbols in common use today, as the STIX project
supported by major scientific and technical publishers did, you come
up with perhaps surprisingly many. A proposal to facilitate
mathematical publishing by inclusion of mathematical alphabetic
symbols in the UCS was made, and has been favorably handled.

The new Mathematical Alphabetic characters expected in Unicode 3.1 have
provisional code points in Plane 1, that is, in the first
plane with Unicode values higher than 2^16. This plane of
characters is also known as the Supplementary Multilingual Plane (SMP),
in contrast to the Basic Multilingual Plane (BMP) which has been used
by Unicode so far. Support for Plane 1 characters in currently
deployed software is not always reliable, and in particular support
for these Mathematical Alphabetic characters is not likely to be
widespread until after final positions in Unicode 3.1 have been
confirmed in the standard ISO 10646.
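
One reason software support for Plane 1 lags is that these characters need two 16-bit code units (a surrogate pair) in UTF-16, whereas every BMP character needs one. The following Python sketch makes this visible; the function names are illustrative, and the sample code point assumes the provisional assignment of Mathematical Fraktur A is retained.

```python
def plane_of(ch):
    """Return the Unicode plane number (0 for the BMP, 1 for the SMP, ...)."""
    return ord(ch) >> 16


def utf16_code_units(ch):
    """Return the 16-bit code units used to encode ch in UTF-16."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


latin_a = "A"              # U+0041, Plane 0
fraktur_a = "\U0001D504"   # Mathematical Fraktur A, Plane 1
```

Software that assumes one code unit per character will miscount or split such characters.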

As discussed in Section 3.2.2 [Mathematics style attributes common to token
elements], MathML offers an
alternative mechanism to specify mathematical alphabetic characters,
which will help bridge the time of transition to Unicode revisions and
the associated deployment of implementing software and fonts therefore
required. Namely, one uses the mathvariant
attribute on the surrounding token element, which will most commonly
be mi. In this section we detail the
correspondence that a MathML processor should apply between certain
characters in Plane 0 (BMP) of Unicode, modified by the
mathvariant attribute, and the Plane 1
Mathematical Alphabetic Symbol characters.

The basic idea of the correspondence is fairly simple.
For example, a Mathematical Fraktur alphabet is being added, and
the code point for Mathematical Fraktur A is U1D504.
Thus using these proposed characters, a typical example might be

<mi>&#x1D504;</mi>

However, an alternative, equivalent markup would be to use
the standard A and modify the identifier using the
mathvariant attribute, as follows:

<mi mathvariant="fraktur">A</mi>

The exact correspondence between a mathematical alphabetic character
and an unstyled character is complicated by the fact that certain
characters that were already present in Unicode are not in the
`expected' sequence.
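
To make the complication concrete, here is a sketch in Python of the mapping for mathvariant="fraktur" on the basic Latin letters. It assumes the provisional Plane 1 assignments are retained, with five capitals taken from the existing Letterlike Symbols block; the character tables of this chapter remain the authority, and the names used in the code are hypothetical.

```python
# Start of the Plane 1 runs for fraktur capitals and small letters (assumed).
FRAKTUR_CAPITAL_A = 0x1D504
FRAKTUR_SMALL_A = 0x1D51E

# Letters that already existed in Plane 0 and so break the regular sequence.
FRAKTUR_EXCEPTIONS = {
    "C": "\u212D",
    "H": "\u210C",
    "I": "\u2111",
    "R": "\u211C",
    "Z": "\u2128",
}


def fraktur_equivalent(ch):
    """Map a basic Latin letter to its fraktur character; leave others alone."""
    if ch in FRAKTUR_EXCEPTIONS:
        return FRAKTUR_EXCEPTIONS[ch]
    if "A" <= ch <= "Z":
        return chr(FRAKTUR_CAPITAL_A + ord(ch) - ord("A"))
    if "a" <= ch <= "z":
        return chr(FRAKTUR_SMALL_A + ord(ch) - ord("a"))
    return ch
```

A simple offset calculation therefore works for most letters, but a processor needs an exception table for the rest.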

Mathematical Alphabetic Symbol characters should not be used for styled text.
For example, Mathematical Fraktur A must not be used to just select
a blackletter font for an uppercase A. Doing this sort of thing
would create problems for searching, restyling (e.g. for accessibility),
and many other kinds of processing.

6.2.4 Non-Marking Characters

Some characters, although important for the quality of print or
alternative rendering, do not have glyph marks that correspond
directly. They are called here non-marking characters. Below we have
a table of those adopted for the purposes of MathML. Their roles are
discussed in Chapter 3 [Presentation Markup] and Chapter 4 [Content Markup],
respectively. The values of the spaces given are
recommendations. Some of these characters are among those with new
Unicode values, and some are given as combinations of Unicode
characters employing the new special mathematics modifier character
(U0FE00). The correspondence between the spacing amounts mentioned
below and those in the Unicode descriptions is not exact, but the
matches are good.

In MathML 2 control of page composition, such as line-breaking, is
effected by the use of the proper attributes on the mspace element.

The last two characters below, with mnemonic entity names &InvisibleTimes; and &ApplyFunction;, are not simple spacers. They are
especially important new additions to the UCS because they provide
textual clues which can increase the quality of print rendering,
permit correct audio rendering, and allow the unique recovery of
mathematical semantics from text which is visually ambiguous.

6.3 Character Symbol Listings

The Universal Character Set (UCS) of Unicode and ISO 10646
continues to evolve (see Section 6.4.4 [Status of Character Encodings]). A small
number of the changes recently introduced, relative to those resulting
from the needs of Asian languages, are those designed exactly to
facilitate the use of Unicode by the `equation-writing' community.
This specification is written on the assumption that the code
assignments suggested to ISO/IEC JTC1/SC2/WG2 by the UTC will be
confirmed as they are in public draft forms of Unicode 3.1 and 3.2.
As before, we can only reiterate that for latest developments on
details of character standards as far as they influence mathematical
formalism the Home Page of the W3C Math WG should be consulted.

The characters are given with entity names as well as Unicode
numbers. To facilitate comprehension of a fairly large list of names,
which totals over 2000 in this case, we offer more than one way to find
a given character. A corresponding full set of entity declarations
is in the DTD in Appendix A [Parsing MathML]. For discussion of entity
declarations see that appendix.

The characters are listed by name, and sample glyphs provided for all
of them. Each character name is accompanied by a code for a character
grouping chosen from a list given below, a short verbal description,
and a Unicode hex code drawn from ISO 10646, now extended in
accordance with the proposal forwarded by the UTC to ISO/IEC WG2 in
March 2000.

The character listings by alphabetical and Unicode order in
Section 6.3.7 [MathML Character Names] are in harmony with the ISO
character sets given, in that if some part of a set is included then
the entire set is included.

6.3.1 Special Constants

To begin we list separately a few of the special characters which
MathML has introduced. These have
been accorded new Unicode values. Rather like the non-marking &InvisibleTimes; and &ApplyFunction; above, they provide very useful
capabilities in the context of machinable mathematics. It might be
imagined there could also be entries below for &true;, &false; and &NotANumber;, but these do not yet have Unicode
points assigned. They can be introduced by the character extension
mechanisms provided by the mglyph and csymbol elements.

Entity name              Unicode  Description

&CapitalDifferentialD;   02145    D for use in differentials, e.g. within integrals
&DifferentialD;          02146    d for use in differentials, e.g. within integrals
&ExponentialE;           02147    e for use for the exponential base of the natural logarithms
&ImaginaryI;             02148    i for use as a square root of -1
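
Since the table pairs each entity name with a code point, generated MathML can use numeric character references and so avoid the need for a DOCTYPE. The following Python sketch performs that substitution for the four names above; SPECIAL_CONSTANTS and to_numeric_refs are illustrative names, and any entity not in the table is left untouched.

```python
import re

# Entity names and code points taken from the table above.
SPECIAL_CONSTANTS = {
    "CapitalDifferentialD": 0x2145,
    "DifferentialD": 0x2146,
    "ExponentialE": 0x2147,
    "ImaginaryI": 0x2148,
}

_ENTITY_REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_refs(markup):
    """Replace known entity references with hexadecimal character references."""
    def replace(match):
        code_point = SPECIAL_CONSTANTS.get(match.group(1))
        if code_point is None:
            return match.group(0)  # unknown name: leave as written
        return "&#x{:04X};".format(code_point)
    return _ENTITY_REFERENCE.sub(replace, markup)
```

For example, to_numeric_refs("<mo>&DifferentialD;</mo><mi>x</mi>") produces markup that parses without any entity declarations.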

6.3.2 Character Tables (ASCII format)

The first table offered is a very large ASCII listing of characters
considered particularly relevant to Mathematics. This is given in
Unicode (or proposed Unicode)
order. Most, but not all, of these characters have MathML names
defined via entity declarations in the DTD. Those that do not are
usually symbols which seem mathematically peripheral, such as dingbats,
machine graphics or technical symbols.

A second table lists those characters that do have MathML entity
names, ordered alphabetically, with
a lower-case letter preceding its upper-case counterpart.

6.3.3 Tables arranged by Unicode block

The tables in this section detail Unicode code points (displayed with
256 code points per table) that have mathematically significant
characters. The sample glyph images link to the table of characters ordered by Unicode given
in the previous section. As shown in the key for each table, the
status of each character (for example in Unicode 3.0 or in the
proposed additions) is indicated by a CSS class on the table cell
(which by default is indicated by varying the background color). The
names of the blocks are those of the Unicode blocks included in the
numerical range given; bracketing indicates characters of that type
are not shown in these tables.

6.3.4 Negated Mathematical Characters

In addition to the Unicode Characters so far listed, one may use the
combining characters U0338 (/), U20D2 (|) and U20E5 (\) to produce
negated or canceled forms of characters. A combining character
should be placed immediately after its `base' character, with no
intervening markup or space, just as is the case for combining accents.

In principle, the negation characters may be applied to any Unicode
character, although fonts designed for mathematics typically have some
negated glyphs ready composed. A MathML renderer should be able to use
these pre-composed glyphs in these cases. A compound character code
either represents a UCS character that is already available, as in the
case of U0003D+00338 which amounts to U02260, or it does not, as is the
case for U02202+00338. The common cases of negations, of both types,
that have been identified are listed in the table.

Note that it is the policy of the W3C and of Unicode that if a single
character is already defined for what can be achieved with a combining
character, that character must be used instead of the decomposed form.
It is also intended that no new single characters representing what
can be done with existing compositions will be introduced.
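
One way to follow this policy in software is Unicode canonical composition (Normalization Form C), which replaces a base character plus combining character by the single character when one exists and leaves the sequence alone otherwise. This is offered as a sketch, not as a mechanism prescribed by MathML; it uses Python's standard unicodedata module, and compose_negation is a hypothetical helper name.

```python
import unicodedata

COMBINING_LONG_SOLIDUS_OVERLAY = "\u0338"


def compose_negation(base, overlay=COMBINING_LONG_SOLIDUS_OVERLAY):
    """Negate base, using the single precomposed character where one exists."""
    return unicodedata.normalize("NFC", base + overlay)


not_equal = compose_negation("=")            # a single character exists
not_partial = compose_negation("\u2202")     # no single character exists
```

The equals sign followed by U+0338 collapses to the one character U+2260, while the partial differential sign followed by U+0338 stays a two-character sequence.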

6.3.5 Variant Mathematical Characters

Unicode attempts to avoid having several character codes for simple
font variants. For a code point to be assigned there should be
more than a nuance in glyphs to be recorded. To record
variants worth noting there is a special character proposed for
Unicode 3.2, U+FE00 (VARIATION SELECTOR-1), which
acts as a postfix modifier. However the legally allowed
combinations with this variation selector are restricted to a
list recorded as part of Unicode. The VARIATION SELECTOR-1
character may only be applied to the characters listed here.
The resulting combination is not regarded by Unicode as a separate
character, but a variation on the base character. Unicode aware systems
may render the combination as the base if the available fonts do not
support the variant glyph shape.
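
The fallback behavior just described amounts to ignoring the selector. The Python sketch below illustrates this under a simplifying assumption: a single flag, font_supports_variants, stands in for whatever font query a real renderer would make. Both that flag and the function name are hypothetical.

```python
VARIATION_SELECTOR_1 = "\uFE00"


def text_to_render(text, font_supports_variants):
    """Drop VARIATION SELECTOR-1 when the variant glyphs are unavailable."""
    if font_supports_variants:
        return text
    # Fall back to the base characters: the selector itself has no glyph.
    return text.replace(VARIATION_SELECTOR_1, "")


intersection_variant = "\u2229" + VARIATION_SELECTOR_1
```

Because the combination is only a variation on the base character, dropping the selector changes the glyph shape but not the character's identity.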

6.3.6 Mathematical Alphabetic Characters

Here we list the special mathematical alphabets. Note that the names
for these alphabetic runs should be regarded as conventions resulting
from recent tradition in the typesetting of mathematical formulas,
rather than as fixing exactly and forever the styles which are to be
used. Of course, they do correspond to the styles presently most
common. But, for instance, there may be font variations in the glyphs
from double-struck, open-face or blackboard bold fonts, all of which
would naturally be used for the characters in the range here labelled
Double-struck. Similar considerations would apply to appellations
such as fraktur and gothic, or script and calligraphic.

As discussed above, the use of these characters is formally equivalent
to the use of characters in Plane 0, together with a suitable value
for the mathvariant attribute. The
correspondence is given in the character tables. Most of these
characters come from the proposed additions to Plane 1, however a few
characters (such as the double-struck letters N, P, Z, Q, R, C, H
representing common number sets) were already present in Unicode 3.0
and retain their original positions. These characters are highlighted
in the tables.

6.3.7 MathML Character Names

This section corresponds closely with the entity definitions in the DTD
described in Appendix A [Parsing MathML]. All of the entity sets except the
last correspond to entity sets defined by ISO 8879 or ISO 9573-13.

6.4 Differences from Characters in MathML 1

6.4.1 Coverage

We have excluded a very few other characters that may have appeared in
the corresponding lists in MathML 1. Those characters thus
lost will be found to be used very infrequently in the
experience of mathematical publishers, or simply to be completely
unacceptable for inclusion in Unicode. However MathML 2 does provide
the mglyph element to accommodate new
characters that authors may wish to introduce.

6.4.2 Fewer Non-marking Characters

MathML 1.0 listed a number of additional
non-marking character entities. These were concerned with
composition control, such as line-breaking. In MathML 2 such control
is effected by the use of the proper attributes on the mspace element.

6.4.3 ISO Tables

The character listings by alphabetical and Unicode order in Section 6.3.7 [MathML Character Names] have now been brought more into line with
the corresponding ISO character sets than was the case in MathML 1.0,
in that if some part of a set is included then the entire set is
included. In addition, the group ISOCHEM has been dropped as more
properly the concern of chemists. All the ISO mathematical alphabets
are listed, since there are now Unicode characters to point to,
in particular the bold Greek of ISOGRK3. These changes have also been
reflected in the entity declarations in the DTD in Appendix A [Parsing MathML].

6.4.4 Status of Character Encodings

A significant change since MathML 1.0 is the movement toward
adoption of more characters for mathematics in the UCS (Universal
Character Set) and availability of public fonts for mathematics. The
encoding of characters in the UCS (Universal Character Set) is done
jointly by the Unicode Technical Committee and by ISO/IEC
JTC1/SC2/WG2. The process of encoding takes quite some time from the
deliberation of first proposals to the final approval.
The characters mentioned in this chapter and listed in the associated
tables are at various stages of this approval process. This section
gives detailed information about the stages relevant to this
specification and gives an overview of the characters affected. The
lists, as well as other places that discuss characters, mention when
characters are not fully approved or show this graphically.
Updates on the status of the characters will be provided by updates
to this specification, by errata to this
specification, and by notices on the
W3C Math home page.
The final word on all Unicode matters is naturally to be found
at the Unicode Consortium.

The characters relevant for MathML fall at present
into three categories:
Fully accepted characters, characters in final (JTC1) ISO/IEC ballot,
and characters before the final ISO/IEC ballot.

Fully accepted characters include a large number of Latin, Greek, and
Cyrillic letters, a large number of Mathematical Operators and
symbols, including arrows, and so on. Fully accepted characters are
currently exactly those that are part of both [Unicode 3.0] and
[ISO/IEC 10646-1:2000], which are identical code point by code point.
Fully accepted characters are not specially marked or mentioned in
this specification; they do not pose any unusual implementation
problems other than possibly finding fonts to display them. Those of
obvious special interest to mathematics number over 1,500,
depending on how you count.

The characters presently in final ballot are the Mathematical
Alphanumeric Symbols, together with a large number of ideographs and other
characters not directly relevant for mathematics. There are just
about 1,000 of these. The due date of the ballot is early in 2001. If
accepted, the additions will still take some time to be formally
published. At this stage, there can be only acceptance or rejection of
the full proposal without technical changes. The additions are
expected to be published as ISO/IEC 10646-2, and to become part of
Unicode 3.1, which is tentatively scheduled for March 2001.
While acceptance of this ballot seems more likely than rejection,
implementers and users of MathML have to be aware that until the final
acceptance, they are using the code points of characters in final
ballot at their own risk. Entities (see Section 6.3.7 [MathML Character Names]) and the mathvariant attribute (see Section 3.2.2 [Mathematics style attributes common to token
elements]) can be used to avoid that risk.

Characters before final ballot relevant to MathML make up a long list
of operators and symbols, including some special constants and
non-marking characters (see Section 6.2.4 [Non-Marking Characters] and
Section 6.3.1 [Special Constants]). There are about 590 of these. The
proposal going to ballot is the result of repeated refinements by the
UTC; several, possibly final, changes (5) were made at a WG2 meeting
in Athens in September. This document reflects these changes. The
majority of these characters have proved completely uncontroversial.
ISO balloting processes, which involve a PDAM and an FPDAM during
which technical changes are possible, and an FDAM with no changes
allowed, may be expected to end in November 2001. The additions
accepted are expected to be published as an amendment to [ISO/IEC
10646-1], and to become part of Unicode 3.2.
It can therefore be expected that almost all of the characters in this
category will finally be accepted, and encoded at the current
code points. It is possible that a small number of characters may be
renamed, moved, or less likely, ultimately rejected. Until final
acceptance, implementers and users of MathML are using these
characters and code points at their own risk. Entities and the mathvariant attribute can be used to avoid that risk.