Glossary of Unicode Terms

Abjad. A writing system in which only
consonants are indicated. The term “abjad” is derived from the first
four letters of the traditional order of the Arabic script: alef,
beh, jeem, dal. (See
Section 6.1, Writing Systems.)

Abugida. A writing system in which
consonants are indicated by the base letters that have an inherent
vowel, and in which other vowels are indicated by additional
distinguishing marks of some kind modifying the base letter. The
term “abugida” is derived from the first four letters of the
Ethiopic script in the Semitic order: alf, bet, gaml, dant. (See
Section 6.1, Writing Systems.)

Accent Mark. A
mark placed above, below, or to the side of a character to alter its
phonetic value. (See also diacritic.)

Acrophonic. Denoting letters or numbers by the first letter of their
name. For example, the Greek acrophonic numerals are variant forms
of such initial letters.

Aksara. (1) In Sanskrit grammar, the term for “letter” in general,
as opposed to consonant (vyanjana) or vowel (svara). Derived from
the first and last letters of the traditional ordering of Sanskrit
letters—“a” and “ksha”. (2) More generally, in Indic writing
systems, aksara refers to a “syllable,” consisting of a consonant
plus vowel sequence, where the vowel may or may not be the inherent
vowel of the consonant letter. When multiple consonants are
involved, the aksara represents the entire orthographic syllable,
which can include two or more leading consonants that may be
visually presented in conjunct forms; in such cases, the aksara may
not be identical to the phonological syllable.

Algorithm. A term used in a broad sense in the Unicode Standard, to
mean the logical description of a process used to achieve a
specified result. This does not require the actual procedure
described in the algorithm to be followed; any implementation is
conformant as long as the results are the same.

Alphabet. A writing system in which both consonants and vowels are
indicated. The term “alphabet” is derived from the first two letters
of the Greek script: alpha, beta. (See
Section 6.1, Writing Systems.)

Annotation. The association of secondary textual content with a
point or range of the primary text. (The value of a particular
annotation is considered to be a part of the “content” of the text.
Typical examples include glossing, citations, exemplification,
Japanese yomi, and so on.)

ANSI. (1) The American National Standards Institute. (2) The
Microsoft collective name for all Windows code pages. Sometimes used
specifically for code page 1252, which is a superset of ISO/IEC
8859-1.

Apparatus Criticus. Collection of conventions used by editors to
annotate and comment on text.

Arabic Digits. The term "Arabic digits"
may mean either the digits in the Arabic script (see Arabic-Indic digits) or the
ordinary ASCII digits in contrast to Roman numerals (see European digits). When the term
"Arabic digits" is used in Unicode specifications, it means
Arabic-Indic digits.

Arabic-Indic Digits. Forms of decimal digits used in most parts of the
Arabic world (for instance, U+0660, U+0661, U+0662, U+0663). Although
European digits (1, 2, 3,…)
derive historically from these forms, they are visually distinct and
are coded separately. (Arabic-Indic digits are sometimes called Indic
numerals; however, this nomenclature leads to confusion with the
digits currently used with the scripts of India.) Variant forms of Arabic-Indic digits used
chiefly in Iran and Pakistan are referred to as Eastern Arabic-Indic
digits. (See
Section 9.2, Arabic.)
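
The separate coding of these digit sets can be seen in a short Python sketch. It assumes only the standard-library unicodedata module; the helper name digit_values is made up for illustration.

```python
import unicodedata

# Digits zero through three in each set, written as escapes for clarity.
ARABIC_INDIC = "\u0660\u0661\u0662\u0663"
EASTERN_ARABIC_INDIC = "\u06F0\u06F1\u06F2\u06F3"

def digit_values(text):
    """Return the decimal value of each digit character in text."""
    return [unicodedata.digit(ch) for ch in text]

print(digit_values(ARABIC_INDIC))              # [0, 1, 2, 3]
print(unicodedata.name(ARABIC_INDIC[0]))       # ARABIC-INDIC DIGIT ZERO
print(unicodedata.name(EASTERN_ARABIC_INDIC[0]))
```

Both sets carry the same numeric values as the European digits 0 to 3, even though all three sets occupy different code points.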

ASCII. (1) The American Standard Code for Information Interchange, a
7-bit coded character set for information interchange. It is the
U.S. national variant of ISO/IEC 646 and is formally the U.S.
standard ANSI X3.4. It was proposed by ANSI in 1963 and finalized in
1968. (2) The set of 128 Unicode characters from U+0000 to U+007F,
including control codes as well as graphic characters. (3) ASCII has
been incorrectly used to refer to various 8-bit character encodings
that include ASCII characters in the first 128 code points.

Base Character. Any graphic
character except for those with the General Category of Combining
Mark (M). (See definition D51 in
Section 3.6, Combination.) In a
combining character sequence, the base character is the initial
character, which the combining marks are applied to.

Block. A grouping of characters
within the Unicode encoding space used for organizing code charts. Each block is a uniquely named, continuous, non-overlapping range of code points, containing a multiple of 16 code points, and starting at a location that is a multiple of 16. A block may contain unassigned
code points, which are reserved.
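
The alignment and non-overlap rules can be checked mechanically. The sketch below hard-codes three block ranges as they appear in Blocks.txt of the Unicode Character Database; the function names are hypothetical, and a real implementation would load the full file.

```python
# Inclusive (start, end) ranges for a few blocks, copied from Blocks.txt.
SAMPLE_BLOCKS = {
    "Basic Latin": (0x0000, 0x007F),
    "Greek and Coptic": (0x0370, 0x03FF),
    "Hangul Syllables": (0xAC00, 0xD7AF),
}

def is_well_formed_block(start, end):
    """True if the range starts on a multiple of 16 and spans a multiple of 16."""
    size = end - start + 1
    return start % 16 == 0 and size % 16 == 0

def blocks_overlap(a, b):
    """True if two inclusive ranges share at least one code point."""
    return a[0] <= b[1] and b[0] <= a[1]

for name, (start, end) in SAMPLE_BLOCKS.items():
    print(name, is_well_formed_block(start, end))
```

The Hangul Syllables block ends at U+D7AF although its last assigned character is U+D7A3, which illustrates a block containing reserved code points.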

BOCU-1. Acronym for Binary Ordered
Compression for Unicode. A Unicode compression scheme that is
MIME-compatible (directly usable for e-mail) and preserves binary
order, which is useful for databases and sorted lists.

Bopomofo. An alphabetic script used
primarily in the Republic of China (Taiwan) to write the sounds of
Mandarin Chinese and some other dialects. Each symbol corresponds to
either the syllable-initial or syllable-final sounds; it is
therefore a subsyllabic script in its primary usage. The name is
derived from the names of its first four elements. More properly
known as zhuyin zimu or zhuyin fuhao in Mandarin
Chinese.

Boustrophedon. A pattern of
writing seen in some ancient manuscripts and inscriptions, where
alternate lines of text are laid out in opposite directions, and
where right-to-left lines generally use glyphs mirrored from their
left-to-right forms. Literally, “as the ox turns,” referring to the
plowing of a field.

Braille. A writing system using a
series of raised dots to be read with the fingers by people who are
blind or whose eyesight is not sufficient for reading printed
material. (See
Section 21.1, Braille.)

Byte. (1) The minimal unit of
addressable storage for a particular computer architecture. (2) An
octet. Note that many early computer architectures used bytes larger
than 8 bits in size, but the industry has now standardized almost
uniformly on 8-bit bytes. The Unicode Standard follows the current
industry practice in equating the term byte with octet
and using the more familiar term byte in all contexts. (See
octet.)

Canonical. (1) Conforming to the
general rules for encoding—that is, not compressed, compacted, or in
any other form specified by a higher protocol. (2) Characteristic of
a normative mapping and form of equivalence specified in
Chapter 3, Conformance.

Case. (1) Feature of certain
alphabets where the letters have two distinct forms. These variants,
which may differ markedly in shape and size, are called the
uppercase letter (also known as capital or majuscule) and the
lowercase letter (also known as small or minuscule). (2) Normative
property of characters, consisting of uppercase, lowercase, and titlecase (Lu, Ll, and Lt). (See
Section 4.2, Case.)
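
As an illustration of sense (2), the three case values can be read from a character's General Category. This is a minimal sketch using Python's unicodedata module; the mapping table and the function name case_of are assumptions made for the example.

```python
import unicodedata

# General Category values that correspond to the three case values.
CASE_CATEGORIES = {"Lu": "uppercase", "Ll": "lowercase", "Lt": "titlecase"}

def case_of(ch):
    """Return the case of a character, or None if it is not a cased letter
    according to its General Category alone."""
    return CASE_CATEGORIES.get(unicodedata.category(ch))

print(case_of("A"))        # uppercase
print(case_of("a"))        # lowercase
print(case_of("\u01C5"))   # titlecase (the digraph letter Dz with caron)
```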

Case-Ignorable. A character C is defined to be case-ignorable if C has the value MidLetter (ML), MidNumLet (MB), or Single_Quote (SQ) for the Word_Break property or its General_Category is one of Nonspacing_Mark (Mn), Enclosing_Mark (Me), Format (Cf), Modifier_Letter (Lm), or Modifier_Symbol (Sk). (See definition D136 in
Section 3.13, Default Case Algorithms.)

Cedilla. A mark originally placed
beneath the letter c in French, Portuguese, and Spanish to indicate
that the letter is to be pronounced as an s, as in façade.
Obsolete Spanish diminutive of ceda, the letter z.

Character. (1) The smallest
component of written language that has semantic value; refers to the
abstract meaning and/or shape, rather than a specific shape (see
also glyph), though in code tables some form of visual
representation is essential for the reader’s understanding. (2)
Synonym for abstract character. (3) The basic unit of
encoding for the Unicode character encoding. (4) The English name
for the ideographic written elements of Chinese origin. [See ideograph (2).]

Coded Character Set. A
character set in which each character is assigned a numeric code
point. Frequently abbreviated as character set, charset, or
code set; the acronym CCS is also used.

Code Page. A coded character set,
often referring to a coded character set used by a personal
computer—for example, PC code page 437, the default coded character
set used by the U.S. English version of the DOS operating system.

Code Point. (1) Any value in the
Unicode codespace; that is, the range
of integers from 0 to 10FFFF₁₆. (See definition D10 in
Section 3.4,
Characters and Encoding.) Not all code points are assigned to encoded characters. See code point type. (2) A value, or position, for a
character, in any coded character set.
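
Sense (1) amounts to a simple range test. The following sketch assumes nothing beyond the range stated above; the names MAX_CODE_POINT and is_code_point are illustrative only.

```python
import unicodedata

MAX_CODE_POINT = 0x10FFFF  # upper bound of the Unicode codespace

def is_code_point(value):
    """True if value lies in the Unicode codespace, 0 to 10FFFF hexadecimal."""
    return 0 <= value <= MAX_CODE_POINT

print(is_code_point(0x10FFFF))   # True
print(is_code_point(0x110000))   # False

# A code point need not be assigned to an encoded character:
# U+0378 is in the codespace but has General Category Cn (unassigned).
print(unicodedata.category(chr(0x0378)))
```

The codespace therefore holds 0x110000 (1,114,112) code points in total.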

Combining Character
Sequence. A maximal character sequence consisting of either
a base character followed by a sequence of one or more characters
where each is a combining character,
zero width joiner, or
zero width non-joiner;
or a sequence of one or more characters where each is a combining
character, zero width joiner,
or zero width non-joiner.
(See definition D56 in
Section 3.6, Combination.)
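
A rough sketch of how such sequences might be located in a string follows. It approximates "combining character" by General Category M and "base character" by the graphic categories L, N, P, S, and Zs; the function names are hypothetical, and this is not the segmentation algorithm of any particular library.

```python
import unicodedata

ZWNJ, ZWJ = "\u200C", "\u200D"

def is_extender(ch):
    """Combining character (General Category M), ZWJ, or ZWNJ."""
    return unicodedata.category(ch).startswith("M") or ch in (ZWNJ, ZWJ)

def is_base(ch):
    """Graphic character that is not a combining mark."""
    cat = unicodedata.category(ch)
    return cat[0] in "LNPS" or cat == "Zs"

def combining_character_sequences(text):
    """Return every maximal combining character sequence found in text."""
    sequences = []
    i, n = 0, len(text)
    while i < n:
        start = i
        if is_base(text[i]):
            i += 1                      # optional leading base character
        j = i
        while j < n and is_extender(text[j]):
            j += 1                      # one or more following extenders
        if j > i:
            sequences.append(text[start:j])
            i = j
        else:
            i = start + 1               # no extender here, so no sequence
    return sequences

print(combining_character_sequences("a\u0308bc\u0323\u0301"))
```

A string that begins with a combining mark yields a sequence with no base character, which corresponds to the second alternative in the definition.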

Compatibility. (1) Consistency with existing practice or preexisting
character encoding standards. (2) Characteristic of a normative
mapping and form of equivalence specified in
Section 3.7, Decomposition.

Consonant Cluster. A sequence of two or more consonantal sounds.
Depending on the writing system, a consonant cluster may be
represented by a single character or by a sequence of characters.
(Contrast digraph.)

Consonant Conjunct. A sequence of two or more adjacent consonantal
letterforms, consisting of a sequence of one or more dead consonants
followed by a normal, live consonant letter. A consonant conjunct
may be ligated into a single conjunct form, or it may be represented
by graphically separable parts, such as subscripted forms of the
consonant letters. Consonant conjuncts are associated with the
Brahmi family of Indic scripts. (See
Section 12.1, Devanagari.)

Contextual Variant. A text element can have a presentation form that
depends on the textual context in which it is rendered. This
presentation form is known as a contextual variant.

Control Codes. The 65 characters in the ranges U+0000..U+001F and
U+007F..U+009F. Also known as control characters.
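
The count of 65 follows directly from the two ranges (32 code points in the first, 33 in the second). A small sketch, with illustrative names, that also confirms each code point has General Category Cc:

```python
import unicodedata

# Inclusive ranges taken from the definition above.
CONTROL_RANGES = [(0x0000, 0x001F), (0x007F, 0x009F)]

def control_code_points():
    """List every code point in the two control code ranges."""
    return [cp for lo, hi in CONTROL_RANGES for cp in range(lo, hi + 1)]

codes = control_code_points()
print(len(codes))                                              # 65
print(all(unicodedata.category(chr(cp)) == "Cc" for cp in codes))
```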

Core Specification. The central part of the Unicode Standard: the portion which up until Version 5.0 was published as a separate book. Starting with Version 5.2, this part of the standard has been published online only, rather than as a book. The core specification consists of the general introduction and framework for the standard, the formal conformance requirements, many implementation guidelines, and extensive chapters providing information about all the encoded characters, organized by script or by significant classes of characters. Formally, a version of the Unicode Standard is defined by an edition of this core specification, together with the Code
Charts, Unicode
Standard Annexes and
the Unicode
Character Database.

Decomposition. (1) The process
of separating or analyzing a text element into component units.
These component units may not have any functional status, but may be
simply formal units—that is, abstract shapes. (2) A sequence of one
or more characters that is equivalent to a decomposable character.
(See definition D64 in
Section 3.7, Decomposition.)

Default Ignorable. Default ignorable code points are those that should be ignored by default in rendering unless explicitly supported. They have no visible glyph or advance width in and of themselves, although they may affect the display, positioning, or adornment of adjacent or surrounding characters.
(See
Section 5.21, Ignoring Characters in Processing.)

Demotic Script. (1) A script
or a form of a script used to write the vernacular or common speech
of some language community. (2) A simplified form of the ancient
Egyptian hieratic writing.

Dependent Vowel. A symbol or
sign that represents a vowel and that is attached or combined with
another symbol, usually one that represents a consonant. For
example, in writing systems based on Arabic, Hebrew, and Indic
scripts, vowels are normally represented as dependent vowel signs.

Deprecated. Of a coded character
or a character property, strongly discouraged from use. (Not the
same as obsolete.)

Designated Code Point.
Any code point that has either been assigned to an abstract
character (assigned characters) or that has otherwise been
given a normative function by the standard (surrogate code points
and noncharacters). This definition excludes reserved code points.
Also known as assigned code point. (See
Section 2.4, Code Points and Characters.)

Deterministic Comparison. A string comparison in which strings that do not have identical contents will compare as unequal. There are two main varieties, depending on the sense of "identical": (a) binary equality, or (b) canonical equivalence. This is a property of the comparison mechanism, and not of the sorting algorithm. Also known as stable (or semi-stable) comparison.
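
One way to picture the two varieties is to start from a loose comparison that ignores case and canonical differences, then break ties. The sketch below is an assumption-laden illustration, not a collation implementation: loose_compare stands in for a real collation, and all function names are invented here.

```python
import unicodedata

def _cmp(a, b):
    """Three-way comparison by code point order: -1, 0, or 1."""
    return (a > b) - (a < b)

def loose_compare(a, b):
    """Stand-in collation: ignores case and canonical differences."""
    ka = unicodedata.normalize("NFD", a).casefold()
    kb = unicodedata.normalize("NFD", b).casefold()
    return _cmp(ka, kb)

def deterministic_compare(a, b, canonical=False):
    """Break ties so that non-identical strings never compare as equal.

    canonical=False treats "identical" as binary equality;
    canonical=True treats it as canonical equivalence.
    """
    result = loose_compare(a, b)
    if result != 0:
        return result
    if canonical:
        a = unicodedata.normalize("NFD", a)
        b = unicodedata.normalize("NFD", b)
    return _cmp(a, b)

print(loose_compare("Resume", "resume"))             # 0: a tie
print(deterministic_compare("Resume", "resume"))     # nonzero: tie broken
```

With canonical=True, a precomposed letter and its decomposed equivalent still compare as equal, because they are identical in sense (b).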

Deterministic Sort. A sort algorithm which returns exactly the same output each time it is applied to the same input. This is a property of the sorting algorithm, and not of the comparison mechanism. For example, a randomized Quicksort (which picks a random element as the pivot element, for optimal performance) is not deterministic. Multiprocessor implementations of a sort algorithm may also not be deterministic.

Diacritic. (1) A mark applied or
attached to a symbol to create a new symbol that represents a
modified or new value. (2) A mark applied to a symbol irrespective
of whether it changes the value of that symbol. In the latter case,
the diacritic usually represents an independent value (for example,
an accent, tone, or some other linguistic information). Also called
diacritical mark or diacritical. (See also combining character and
nonspacing mark.)

Diaeresis. Two horizontal dots
over a letter, as in naïve. The diaeresis is not
distinguished from the umlaut in the Unicode character
encoding. (See umlaut.)

Digraph. A pair of signs or symbols
(two graphs), which together represent a single sound or a single
linguistic unit. The English writing system employs many digraphs
(for example, th, ch, sh, qu, and so on). The same two
symbols may not always be interpreted as a digraph (for example,
cathode versus cathouse). When three signs
are so combined, they are called a trigraph. More than three
are usually called an n-graph.

Diphthong. A pair of vowels that
are considered a single vowel for the purpose of phonemic
distinction. One of the two vowels is more prominent than the other.
In writing systems, diphthongs are sometimes written with one symbol
and sometimes with more than one symbol (for example, with a
digraph).

Double-Byte Character Set.
One of a number of character sets defined for representing Chinese,
Japanese, or Korean text (for example, JIS X 0208-1990). These
character sets are often encoded in such a way as to allow
double-byte character encodings to be mixed with single-byte
character encodings. Abbreviated DBCS. (See also
multibyte character set.)

Ductility. The ability of a cursive
font to stretch or compress the connective baseline to effect text
justification.

Dynamic Composition.
Creation of composite forms such as accented letters or Hangul
syllables from a sequence of characters.

EBCDIC. Acronym for Extended
Binary-Coded Decimal Interchange Code. A group of coded character
sets used on mainframes that consist of 8-bit coded characters.
EBCDIC coded character sets reserve the first 64 code points (x00
to x3F) for control codes, and reserve the range x41 to xFE for graphic characters. The English
alphabetic characters are in discontinuous segments with uppercase
at xC1 to xC9, xD1 to xD9, xE2 to xE9, and lowercase at x81 to x89,
x91 to x99, xA2 to xA9.
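
The discontinuous letter ranges can be observed with one concrete EBCDIC code page. The sketch below uses Python's built-in cp037 codec as a representative member of the EBCDIC family; that choice, and the helper names, are assumptions made for illustration.

```python
import string

# Letter ranges as listed in the definition above (inclusive).
UPPER_RANGES = [(0xC1, 0xC9), (0xD1, 0xD9), (0xE2, 0xE9)]
LOWER_RANGES = [(0x81, 0x89), (0x91, 0x99), (0xA2, 0xA9)]

def ebcdic_bytes(text):
    """Encode text with the cp037 EBCDIC code page and return byte values."""
    return list(text.encode("cp037"))

def in_ranges(value, ranges):
    """True if value falls inside any of the inclusive ranges."""
    return any(lo <= value <= hi for lo, hi in ranges)

print([hex(b) for b in ebcdic_bytes("AIJRSZ")])
print(all(in_ranges(b, UPPER_RANGES) for b in ebcdic_bytes(string.ascii_uppercase)))
print(all(in_ranges(b, LOWER_RANGES) for b in ebcdic_bytes(string.ascii_lowercase)))
```

The letters I and J are adjacent in the alphabet but eight code points apart in this encoding, which is why range tests such as "between A and Z" need care on EBCDIC data.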

Emoji. (1) The Japanese word for "pictograph." (2) Certain pictographic and other symbols encoded in the Unicode Standard that are commonly given a colorful or playful presentation when displayed on devices. Most of the emoji in Unicode were encoded for compatibility with Japanese telephone symbol sets. (3) Colorful or playful symbols which are not encoded as characters but which are widely implemented as graphics. (See pictograph.)

Emoticon. A symbol added to text to express emotional affect or reaction—for example, sadness, happiness, joking intent, sarcasm, and so forth. Emoticons are often expressed by a conventional kind of "ASCII art," using sequences
of punctuation and other symbols to portray likenesses of facial expressions. In Western contexts these are often turned sideways, as :-) to express a happy face; in East Asian contexts other conventions often portray a facial expression without turning, as ^-^. Rendering systems often recognize conventional emoticon sequences and display them as colorful or even animated glyphs in text. There is also a set of dedicated pictographic symbols—mostly representing different facial expressions—encoded as characters in the Unicode Standard. (See pictograph.)

Enclosing Mark. A nonspacing mark with the General Category of
Enclosing Mark (Me). (See definition D54 in
Section 3.6, Combination.) Enclosing marks are a subclass of nonspacing marks
that surround a base character, rather than merely being placed
over, under, or through it.

Encoded Character. An
association (or mapping) between an abstract character and a
code point. (See definition D11 in
Section 3.4,
Characters and Encoding.) By itself, an abstract character has no numerical
value, but the process of “encoding a character” associates a
particular code point with a particular abstract character, thereby
resulting in an “encoded character.”

Escape Sequence. A sequence
of bytes that is used for code extension. The first byte in the
sequence is escape (hex 1B).

EUDC. Acronym for end-user defined character. A character defined by
an end user, using a private-use code point, to represent a
character missing in a particular character encoding. These are
common in East Asian implementations.

European Digits. Forms of
decimal digits first used in Europe and now used worldwide.
Historically, these digits were derived from the Arabic digits; they
are sometimes called “Arabic numerals,” but this nomenclature leads
to confusion with the real Arabic digits. Also called "Western digits" and "Latin digits."

Extended Combining Character Sequence. A maximal character sequence consisting of either an extended base followed by a sequence of one or more characters where each is a combining character,
zero width joiner, or
zero width non-joiner; or a sequence of one or more characters where each is a combining character,
zero width joiner, or
zero width non-joiner. Abbreviated as
ECCS. (See definition D56a in
Section 3.6, Combination.)

Folding. An operation that maps similar characters to a common
target, such as uppercasing or lowercasing a string. Folding
operations are most often used to temporarily ignore certain
distinctions between characters.
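
Case folding is the most familiar folding operation. A minimal sketch using Python's built-in str.casefold follows; the wrapper names are hypothetical.

```python
def fold_case(text):
    """Map text to a common case-folded form for comparison purposes."""
    return text.casefold()

def equal_ignoring_case(a, b):
    """Compare two strings while temporarily ignoring case distinctions."""
    return fold_case(a) == fold_case(b)

print(equal_ignoring_case("Unicode", "UNICODE"))   # True
print(equal_ignoring_case("Straße", "STRASSE"))    # True
```

The folded form is used only for matching; the original strings are left unchanged, which reflects the "temporarily ignore" wording above.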

Font. A collection of glyphs used for
the visual depiction of character data. A font is often associated
with a set of parameters (for example, size, posture, weight, and serifness), which, when set
to particular values, generate a collection of imagable glyphs.

Format Character. A character that is inherently invisible but that
has an effect on the surrounding characters.

Fullwidth. Characters of East Asian
character sets whose glyph image extends across the entire character
display cell. In legacy character sets, fullwidth characters are normally encoded in two or
three bytes. The Japanese term for fullwidth characters is zenkaku.

Globalization. (1) The overall process for internationalization and localization of software products. (2) A synonym for internationalization. Also known by the abbreviation "g11n". Note that the meaning of "globalization" which is
relevant to software products should be distinguished from the more widespread use of "globalization" in the context of economics. (See internationalization, localization.)

Glyph. (1) An abstract form that
represents one or more glyph images. (2) A synonym for glyph image. In displaying Unicode
character data, one or more glyphs may be selected to depict a
particular character. These glyphs are selected by a rendering
engine during composition and layout processing. (See also
character.)

Glyph Code. A numeric code that
refers to a glyph. Usually, the glyphs contained in a font are
referenced by their glyph code. Glyph codes may be local to a
particular font; that is, a different font containing the same
glyphs may use different codes.

Glyph Identifier. Similar to
a glyph code, a glyph identifier is a label used to refer to a glyph
within a font. A font may employ both local and global glyph
identifiers.

Glyph Image. The actual, concrete
image of a glyph representation having been rasterized or otherwise
imaged onto some display surface.

Glyph Metrics. A collection of
properties that specify the relative size and positioning along with
other features of a glyph.

Grapheme. (1) A minimally
distinctive unit of writing in the context of a particular writing
system. For example, ‹b› and ‹d› are distinct graphemes in English
writing systems because there exist distinct words like big and dig.
Conversely, a lowercase italiform letter a and a
lowercase Roman letter a are not distinct graphemes because no word
is distinguished on the basis of these two different forms. (2) What
a user thinks of as a character.

Grapheme Extender. A
character with the property Grapheme_Extend. (See definition D59 in
Section 3.6, Combination.) Grapheme extender characters consist of
all nonspacing marks, zero
width joiner, zero
width non-joiner, and a small number of spacing marks.

Guillemet. Punctuation marks
resembling small less-than and greater-than signs, used as quotation
marks in French and other languages. (See “Language-Based Usage of
Quotation Marks” in
Section 6.2, General Punctuation.)

Halant. A preferred Hindi synonym
for a virama. It literally means killer, referring to
its function of killing the inherent vowel of a consonant
letter. (See virama.)

Half-Consonant Form. In
the Devanagari script and certain other scripts of the Brahmi family
of Indic scripts, a dead consonant may be depicted in the so-called
half-form. This form is composed of the distinctive part of a
consonant letter symbol without its vertical stem. It may be used to
create conjunct forms that follow a horizontal layout pattern. Also
known as half-form.

Halfwidth. Characters of East Asian
character sets whose glyph image occupies half of the character
display cell. In legacy character sets, halfwidth characters are
normally encoded in a single byte. The Japanese term for halfwidth characters is hankaku.

Hangul Syllable. (1) Any of
the 11,172 encoded characters of the Hangul Syllables character
block, U+AC00..U+D7A3. Also called a precomposed Hangul syllable
to clearly distinguish it from a Korean syllable block. (2) Loosely
speaking, a Korean syllable block.
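
The figure 11,172 and the range U+AC00..U+D7A3 follow from the arithmetic layout of the block. The sketch below assumes the counts of 19 leading consonants, 21 vowels, and 28 trailing positions (27 trailing consonants plus "none") from the Hangul syllable composition rules in Section 3.12 of the Unicode Standard; the function name is invented for illustration.

```python
import unicodedata

S_BASE = 0xAC00                       # first precomposed Hangul syllable
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
S_COUNT = L_COUNT * V_COUNT * T_COUNT

def compose_hangul(l_index, v_index, t_index=0):
    """Return the precomposed syllable for zero-based jamo indices."""
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

print(S_COUNT)                            # 11172
print(hex(S_BASE + S_COUNT - 1))          # 0xd7a3
print(compose_hangul(18, 0, 4) == "\uD55C")
```

The same syllable results from normalizing the jamo sequence U+1112 U+1161 U+11AB to NFC, which is an instance of dynamic composition.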

Hanja. The Korean name for Han
characters; derived from the Chinese word hànzì.

Higher-Level Protocol.
Any agreement on the interpretation of Unicode characters that
extends beyond the scope of this standard. Note that such an
agreement need not be formally announced in data; it may be implicit
in the context. (See definition D16 in
Section 3.4,
Characters and Encoding.)

Hiragana. One of two standard
syllabaries associated with the Japanese writing system. Hiragana
syllables are typically used in the representation of native
Japanese words and grammatical particles.

HTML. HyperText Markup Language. A text
description language related to SGML; it mixes text format markup
with plain text content to describe formatted text. HTML is
ubiquitous as the source language for Web pages on the Internet.
Starting with HTML 4.0, the Unicode Standard functions as the
reference character set for HTML content. (See also SGML.)

ICU. Acronym for International Components for Unicode, an Open
Source set of C/C++ and Java libraries for Unicode and software
internationalization support. For information, see
http://www.icu-project.org/

Ideograph (or ideogram). (1) Any symbol that primarily denotes an idea or concept in contrast to a sound or pronunciation—for example, ♻, which denotes the concept of recycling by a series of bent arrows. (2) A generic term for the unit of writing of a logosyllabic writing system. In this sense, ideograph (or ideogram) is not systematically distinguished from logograph (or logogram). (3) A term commonly used to refer specifically to Han characters, equivalent to the Chinese, Japanese, or Korean terms also sometimes used: hànzì, kanji, or hanja. (See logograph, pictograph, sinogram.)

IICore. A subset of common-use CJK unified ideographs, defined as
the fixed collection 370 IICore in ISO/IEC 10646. This subset
contains 9,810 ideographs and is intended for common use in East
Asian contexts, particularly for small devices that cannot support
the full range of CJK unified ideographs encoded in the Unicode
Standard.

In-Band. An in-band channel conveys
information about text by embedding that information within the text
itself, with special syntax to distinguish it. In-band information
is encoded in the same character set as the text, and is
interspersed with and carried along with the text data. Examples are
XML and HTML markup.

Independent Vowel. In Indic
scripts, certain vowels are depicted using independent letter
symbols that stand on their own. This is often true when a word
starts with a vowel or a word consists of only a vowel.

Informative. Information in this
standard that is not normative but that contributes to the correct
use and implementation of the standard.

Inherent Vowel. In writing
systems based on a script in the Brahmi
family of Indic scripts, a consonant letter symbol normally has an
inherent vowel, unless otherwise indicated. The phonetic value of
this vowel differs among the various languages written with these
writing systems. An inherent vowel is overridden either by
indicating another vowel with an explicit vowel sign or by using
virama to create a dead consonant.

Inner Caps. Mixed case format
where an uppercase letter is in a position other than first in the
word—for example, “G” in the name “McGowan.”

Internationalization. The process of designing and implementing a software product so that it can be easily localized, with few if any structural changes. Ideally, an internationalized software product can be localized simply by translating messages and other text displayed to a user, and by adapting icons and other visual elements. An "internationalized" software product is also known as a "localizable" product. Also known by the abbreviation "i18n" and the term "World-Readiness". (See localization, globalization.)

IPA. (1) The International Phonetic
Alphabet. (2) The International Phonetic Association, which defines
and maintains the International Phonetic Alphabet.

Kana. The name of a primarily syllabic script used by the Japanese
writing system. It comes in two forms, hiragana and katakana. The
former is used to write particles, grammatical affixes, and words
that have no kanji form; the latter is used primarily to write
foreign words.

Kanji. The Japanese name for Han characters; derived from the
Chinese word hànzì. Also romanized as
kanzi.

Katakana. One of two standard syllabaries associated with the
Japanese writing system. Katakana syllables are typically used in
representation of borrowed vocabulary (other than that of Chinese
origin), sound-symbolic interjections, or phonetic representation of
“difficult” kanji characters in Japanese.

Kerning. (1) Changing the space between certain pairs of letters to
improve the appearance of the text. (2) The process of mapping from
pairs of glyphs to a positioning offset used to change the space
between letters.

Korean Syllable Block.
A sequence of Korean jamos, consisting of one or more leading
consonants followed by one or more vowels followed by zero or more
trailing consonants, or any canonically equivalent sequence
including a precomposed Hangul syllable. In regular expression
notation: L L* V V* T*. Also called a standard
Korean syllable block. (See
Section 3.12, Conjoining Jamo Behavior.)
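
The pattern L L* V V* T* can be tried out with an ordinary regular expression. This sketch is simplified: it assumes only the jamo ranges of the Hangul Jamo block (U+1100..U+11FF), ignores the extended jamo blocks, and decomposes the input first so that precomposed syllables are covered. The names are hypothetical.

```python
import re
import unicodedata

# Character classes for the Hangul Jamo block only (a simplification).
L = "[\u1100-\u115F]"   # leading consonants
V = "[\u1160-\u11A7]"   # vowels
T = "[\u11A8-\u11FF]"   # trailing consonants

SYLLABLE_BLOCK = re.compile(f"{L}{L}*{V}{V}*{T}*")

def is_korean_syllable_block(text):
    """True if text, once canonically decomposed, matches L L* V V* T*."""
    decomposed = unicodedata.normalize("NFD", text)
    return SYLLABLE_BLOCK.fullmatch(decomposed) is not None

print(is_korean_syllable_block("\uD55C"))              # precomposed syllable
print(is_korean_syllable_block("\u1112\u1161\u11AB"))  # equivalent jamo sequence
print(is_korean_syllable_block("\u1161"))              # vowel alone: no match
```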

Letter. (1) An element of an alphabet. In a broad sense, it includes
elements of syllabaries and ideographs. (2) Informative property of
characters that are used to write words.

Ligature. A glyph representing a combination of two or more
characters. In the Latin script, there are only a few in modern use,
such as the ligatures between “f” and “i” or “f” and “l”. Other scripts make use of many ligatures, depending on the font
and style.

Localization. (1) The process of adapting a software product to use the languages and conventions suitable for a local market, such as adapting an English US software product to work in Spanish for Argentina. (2) The management of software product translation, which includes extraction of translatable text, management of translations, and generation of language resource modules. Also known by the abbreviation "L10n". Localization produces "localized" software products. (See internationalization, globalization.)

Logograph (or logogram). (1) Any symbol that primarily represents a word (or morpheme) in contrast to a sound or pronunciation. (2) A generic term for the unit of writing of a logosyllabic writing system. In this sense, logograph (or logogram) is not systematically distinguished from ideograph (or ideogram). (See ideograph, pictograph.)

Logosyllabary. A writing system in which the units are used
primarily to write words and/or morphemes of words, with some
subsidiary usage to represent just syllabic sounds. The best example
is the Han script.

Mathematical Property. Informative property of characters that are
used as operators in mathematical formulae.

Matra. A dependent vowel in an Indic script. It is the name for
vowel letters that follow consonant letters in logical order. A
matra often has a completely different letterform from that for the
same phonological vowel used as an independent letter.

MIME. Multipurpose Internet Mail Extensions. MIME is a standard that
allows the embedding of arbitrary documents and other binary data of
known types (images, sound, video, and so on) into e-mail handled by
ordinary Internet electronic mail interchange protocols.

Modifier Letter. A character
with the Lm General Category in the Unicode Character Database.
Modifier letters, which look like letters or punctuation, modify the
pronunciation of other letters (similar to diacritics). (See
Section 7.8, Modifier Letters.)

Mora. A phonological term: the unit of
sound which determines syllable weight in some languages. Some
syllabaries have characteristics which reflect moraic structure more
or less exactly. In particular, the Japanese kana syllabaries
actually write one character per mora, rather than one character per
syllable. The Vai syllabary also counts final nasals as distinct
moras, and writes moras instead of syllables.

Multibyte Character Set. A character set encoded with a variable
number of bytes per character, often abbreviated as MBCS. Many large
character sets have been defined as MBCS so as to keep strict
compatibility with the ASCII subset and/or ISO/IEC 2022.

Named Unicode Algorithm.
A Unicode algorithm that is specified in the Unicode Standard or in
other standards published by the Unicode Consortium and that is
given an explicit name for ease of reference. (See definition D18 in
Section 3.4,
Characters and Encoding. See also Table 3-1, “Named
Unicode Algorithms,” for a list of named Unicode algorithms.)

Namespace. (1) A set of names, no
two of which are identical. (2) A set of names together with name
matching rules, so that all names are distinct under the matching
rules. (See definition D6 in
Section 3.3, Semantics.) Character
names are distinct if they do not match under the name matching
rules in effect for the standard.

Nekudot. Marks that indicate vowels or other modifications of
consonantal letters in Hebrew.

Nonspacing Mark. A combining
character with the General Category of Nonspacing Mark (Mn) or
Enclosing Mark (Me). (See definition D53 in
Section 3.6, Combination.) The position of a nonspacing mark in presentation
depends on its base character. It generally does not consume space
along the visual baseline in and of itself. (See also
combining character.)
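
The category test in this entry can be sketched in Python as follows. The function name is_nonspacing_mark is hypothetical, and the standard unicodedata module is assumed to be an adequate stand-in for the Unicode Character Database.

```python
import unicodedata

def is_nonspacing_mark(ch: str) -> bool:
    """Return True if ch has General Category Mn or Me."""
    return unicodedata.category(ch) in ("Mn", "Me")

print(is_nonspacing_mark("\u0301"))  # COMBINING ACUTE ACCENT (Mn): True
print(is_nonspacing_mark("\u20dd"))  # COMBINING ENCLOSING CIRCLE (Me): True
print(is_nonspacing_mark("\u0903"))  # DEVANAGARI SIGN VISARGA (Mc, spacing): False
```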

Normalization. A process of
removing alternate representations of equivalent sequences from
textual data, to convert the data into a form that can be
binary-compared for equivalence. In the Unicode Standard,
normalization refers specifically to processing to ensure that
canonical-equivalent (and/or compatibility-equivalent) strings have
unique representations. For more information, see “Equivalent
Sequences” in
Section 2.2, Unicode Design Principles, and
Section 3.11, Normalization Forms.
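
A minimal Python sketch of why normalization matters for binary comparison is given here. It assumes the standard unicodedata module; the helper name canonically_equivalent is an invention for this example, not a term of the standard.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after converting both to the same normalization form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS
combining = "a\u0308"   # a followed by COMBINING DIAERESIS

# The raw code point sequences differ, so a plain comparison fails.
print(precomposed == combining)                        # False
# After normalization both strings have the same representation.
print(canonically_equivalent(precomposed, combining))  # True
```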

Normalization Form C (NFC).
A normalization form that erases any canonical differences, and
generally produces a composed result. For example, a + umlaut is
converted to ä in this form. This form most closely matches legacy
usage. The formal definition
is D120 in
Section 3.11, Normalization Forms.

Normalization Form D (NFD).
A normalization form that erases any canonical differences, and
produces a decomposed result. For example, ä is converted to a +
umlaut in this form. This form is most often used in internal
processing, such as in collation. The
formal definition is D118 in
Section 3.11, Normalization Forms.
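
The composed and decomposed results can be inspected side by side. The following Python sketch assumes the standard unicodedata module; code_points is a hypothetical helper used only to display the sequences.

```python
import unicodedata

def code_points(s: str) -> list:
    """Format each code point of s in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in s]

nfd = unicodedata.normalize("NFD", "\u00e4")   # start from the precomposed letter
nfc = unicodedata.normalize("NFC", "a\u0308")  # start from a + combining diaeresis

print(code_points(nfd))  # ['U+0061', 'U+0308']
print(code_points(nfc))  # ['U+00E4']
```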

Normalization Form KC (NFKC).
A normalization form that erases both canonical and compatibility
differences, and generally produces a composed result: for example,
the single ǆ character is converted to d + ž in this form. This form
is commonly used in matching. The
formal definition is D121 in
Section 3.11, Normalization Forms.
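
The difference between canonical and compatibility normalization can be seen with the example character from this entry. This is a sketch using Python's standard unicodedata module, not a statement about how any particular matching system is built.

```python
import unicodedata

dz = "\u01c6"  # LATIN SMALL LETTER DZ WITH CARON (a single code point)

nfc = unicodedata.normalize("NFC", dz)
nfkc = unicodedata.normalize("NFKC", dz)

# NFC leaves the character alone: it has no canonical decomposition.
print([f"U+{ord(c):04X}" for c in nfc])   # ['U+01C6']
# NFKC applies the compatibility mapping and recomposes z + caron.
print([f"U+{ord(c):04X}" for c in nfkc])  # ['U+0064', 'U+017E']
```

For matching, both the stored text and the search key would be passed through the same form before comparison.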

Obsolete. Applies to a character that is no longer in current use,
but that has been used historically. Whether a character is obsolete
depends on context: For example, the Cyrillic letter big yus is
obsolete for Russian, but is used in modern Bulgarian. (Not the same
as
deprecated.)

Octet. An ordered sequence of eight bits considered as a unit. The
Unicode Standard follows current industry practice in referring to
an octet as a byte. (See byte.)

Out-of-Band. An out-of-band channel conveys additional information
about text in such a way that the textual content, as encoded, is
completely untouched and unmodified. This is typically done by
separate data structures that point into the text.
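
One possible shape for such a separate data structure is sketched below in Python. The Annotation class, its field names, and the "lang=en" value are all assumptions made for illustration; the standard does not prescribe any particular representation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    start: int      # index of the first code point covered
    end: int        # index one past the last code point covered
    attribute: str  # information carried outside the text, e.g. a language tag

text = "Hello, world"
annotations = [Annotation(0, 5, "lang=en")]

def covered_text(text: str, ann: Annotation) -> str:
    """Return the part of the text that an annotation points at."""
    return text[ann.start:ann.end]

# The annotation refers to the text by position; the text itself is unchanged.
print(covered_text(text, annotations[0]))  # Hello
print(text)                                # Hello, world
```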

Overridable. A characteristic of a Unicode character property that
may be changed by a higher-level protocol to create desired
implementation effects.

Oxia. Greek term for acute accent, used in polytonic Greek character names.

Paragraph Direction. The default direction (left or
right) of the
text of a paragraph. This direction does not change the display
order of characters within an Arabic or English word. However, it
does change the display order of adjacent Arabic and English words,
and the display order of neutral characters, such as punctuation and
spaces. For more details, see
Unicode Standard Annex #9, “Unicode
Bidirectional Algorithm,” especially definitions BD2–BD5.

Paragraph Embedding Level.
The embedding level that determines the default bidirectional
orientation of the text in that paragraph.

Perispomeni. Greek term for circumflex accent, used in polytonic Greek character names.

Phoneme. A minimally distinct sound in the context of a particular
spoken language. For example, in American English, /p/ and /b/ are
distinct phonemes because pat and bat are distinct; however, the two
different sounds of /t/ in tick and stick are not distinct in
English, even though they are distinct in other languages such as
Thai.

Pictograph (or pictogram). Any symbol that denotes an object by means of a more or less conventional visual likeness—for example, ✈. (See emoji, ideograph, logograph.)

Pinyin. Standard system for the romanization of Chinese on the basis
of Mandarin pronunciation.

Pivot Conversion. The use of a third character encoding to serve as
an intermediate step in the conversion between two other character
encodings. The Unicode Standard is widely used to support pivot
conversion, as its character repertoire is a superset of most other
coded character sets.
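
A minimal sketch of pivot conversion in Python follows. It assumes Python's built-in codecs, where decoding always yields a Unicode string, so Unicode acts as the pivot. The function name pivot_convert and the choice of Latin-1 and code page 437 are illustrative only.

```python
def pivot_convert(data: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Convert bytes between two legacy encodings by way of Unicode."""
    text = data.decode(source_encoding)  # step 1: source encoding -> Unicode
    return text.encode(target_encoding)  # step 2: Unicode -> target encoding

# The letter e with acute is byte E9 in Latin-1 and byte 82 in code page 437.
print(pivot_convert(b"\xe9", "latin-1", "cp437"))  # b'\x82'
```

A character that is missing from the target encoding makes the second step raise UnicodeEncodeError, which is where a real converter would need a fallback policy.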

Plain Text. Computer-encoded text that consists
only of a sequence
of code points from a given standard, with no other formatting or
structural information. Plain text interchange is commonly used
between computer systems that do not share higher-level protocols.
(See also rich text.)

Plane. A range of 65,536 (10000₁₆)
contiguous Unicode code points, where the first code point is an
integer multiple of 65,536 (10000₁₆). Planes are numbered
from 0 to 16, with the number being the first code point of the
plane divided by 65,536. Thus Plane 0 is U+0000..U+FFFF, Plane 1 is
U+10000..U+1FFFF, ..., and Plane 16 (10₁₆)
is U+100000..U+10FFFF. (Note that ISO/IEC 10646 uses
hexadecimal notation for the plane numbers, for example, Plane B instead of
Plane 11.) (See Basic
Multilingual Plane and supplementary
planes.)
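
Because a plane holds exactly 65,536 code points, the plane number is the code point divided by 65,536, which is the same as a right shift by 16 bits. A Python sketch, with plane_of as a hypothetical helper name:

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) of a Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return code_point >> 16  # integer division by 65,536

print(plane_of(0xFFFF))    # 0
print(plane_of(0x10000))   # 1
print(plane_of(0x10FFFF))  # 16
```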

Points. (1) The nonspacing vowels and other signs of written Hebrew.
(2) A unit of measurement in typography.

Private Use. Refers to designated code points in the Unicode
Standard or other character encoding standards whose interpretations
are not specified in those standards and whose use may be determined
by private agreement among cooperating users.

Private Use Area (PUA). Any
one of the three blocks of private-use code points in the Unicode
Standard.
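
Private-use code points carry the General Category Co, so one way to recognize them is a category lookup. The sketch below assumes Python's standard unicodedata module; is_private_use is a hypothetical name.

```python
import unicodedata

def is_private_use(code_point: int) -> bool:
    """Return True if the code point has General Category Co (private use)."""
    return unicodedata.category(chr(code_point)) == "Co"

print(is_private_use(0xE000))    # True: in the BMP private-use block
print(is_private_use(0xF0000))   # True: in Plane 15
print(is_private_use(0x10FFFD))  # True: in Plane 16
print(is_private_use(0x0041))    # False: LATIN CAPITAL LETTER A
```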

Productive. Said of a feature or
rule that can be employed in novel combinations or circumstances,
rather than being restricted to a fixed list. In the Unicode
Standard, combining marks—particularly the accents—are productive.
In contrast, variation selectors are deliberately not productive.
Also known as generative.

Reorderable Pair. Two adjacent characters A and B in a coded character sequence
<A, B> are a Reorderable Pair if and only if ccc(A) > ccc(B) > 0. (Used in the definition of Unicode Normalization Forms.) (See
definition D108 in
Section 3.11, Normalization Forms.)
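
The condition can be checked directly, since Python's unicodedata.combining returns the canonical combining class (ccc) of a character. The function name is_reorderable_pair is an assumption for this sketch.

```python
import unicodedata

def is_reorderable_pair(a: str, b: str) -> bool:
    """Return True if ccc(a) > ccc(b) > 0 for the adjacent characters a, b."""
    return unicodedata.combining(a) > unicodedata.combining(b) > 0

acute = "\u0301"      # COMBINING ACUTE ACCENT, ccc = 230
dot_below = "\u0323"  # COMBINING DOT BELOW, ccc = 220

print(is_reorderable_pair(acute, dot_below))  # True: 230 > 220 > 0
print(is_reorderable_pair(dot_below, acute))  # False: already in order
print(is_reorderable_pair("a", acute))        # False: ccc("a") is 0

# Normalization swaps the reorderable pair into canonical order.
print(unicodedata.normalize("NFD", "a" + acute + dot_below) == "a" + dot_below + acute)  # True
```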

Rich Text. Also known as styled text. The result of adding
information to plain text. Examples of information that can be added
include font data, color, formatting information, phonetic
annotations, interlinear text, and so on. The Unicode Standard does
not address the representation of rich text. It is expected that
systems and applications will implement proprietary forms of rich
text. Some public forms of rich text are available (for example, ODA,
HTML, and SGML). When everything except primary content is removed
from rich text, only plain text should remain.

Row. A range of 256 contiguous Unicode code points, where the first
code point is an integer multiple of 256. Two code points are in the
same row if they share all but the last two hexadecimal digits. (See
plane.)
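
Dropping the last two hexadecimal digits is a right shift by 8 bits, so the row test can be sketched in Python as follows. The helper names row_of and same_row are hypothetical.

```python
def row_of(code_point: int) -> int:
    """Return the code point with its last two hexadecimal digits removed."""
    return code_point >> 8

def same_row(a: int, b: int) -> bool:
    """Return True if two code points fall in the same row of 256."""
    return row_of(a) == row_of(b)

print(same_row(0x0041, 0x00FF))  # True: both are in row 00
print(same_row(0x00FF, 0x0100))  # False: rows 00 and 01
```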

Script. A collection of letters and
other written signs used to represent textual
information in one or more writing systems. For example, Russian is
written with a subset of the Cyrillic script; Ukrainian is written
with a different subset. The Japanese writing system uses several
scripts.

SGML. Standard Generalized Markup Language. A standard framework,
defined in ISO 8879, for defining particular text markup languages.
The SGML framework allows for mixing structural tags that describe
format with the plain text content of documents, so that fancy text
can be fully described in a plain text stream of data. (See also HTML, XML,
and rich text.)

Shaping Characters. Characters that assume different glyphic forms
depending on the context.

Shift-JIS. A shifted encoding of the Japanese character encoding
standard, JIS X 0208, widely deployed in PCs.

Signature. An optional code
sequence at the beginning of a stream of coded characters that
identifies the character encoding scheme used for the following
text. (See Unicode signature.)
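
A sketch of signature detection in Python, using the byte order mark constants from the standard codecs module. The function name detect_signature and the returned labels (Python codec names) are choices made for this example.

```python
import codecs

# Longer signatures are tested first: the UTF-32LE signature FF FE 00 00
# begins with the same two bytes as the UTF-16LE signature FF FE.
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def detect_signature(data: bytes):
    """Return (encoding label, signature length) or None if no signature is found."""
    for signature, label in _SIGNATURES:
        if data.startswith(signature):
            return label, len(signature)
    return None

print(detect_signature(b"\xef\xbb\xbfabc"))  # ('utf-8', 3)
print(detect_signature(b"\xff\xfea\x00"))    # ('utf-16-le', 2)
print(detect_signature(b"abc"))              # None
```

Because the signature is optional, a result of None says nothing about the encoding, and the bytes FF FE 00 00 remain ambiguous between UTF-32LE and UTF-16LE text that starts with U+0000.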

Stable Sort. A sort in which two records with a field that compares as equal will retain their relative order if sorted according to that field. This is a property of the sorting algorithm, and not of the comparison mechanism. For example, a bubble sort is stable, whereas a Quicksort is not.
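
The property can be observed with Python's built-in sorted function, which is documented to be stable. The record values below are made up for illustration.

```python
# Each record is (name, count); the records are sorted by count only.
records = [("pear", 2), ("apple", 1), ("fig", 2), ("kiwi", 1)]

by_count = sorted(records, key=lambda record: record[1])

# Records with equal counts keep their original relative order:
# "apple" stays before "kiwi", and "pear" stays before "fig".
print(by_count)  # [('apple', 1), ('kiwi', 1), ('pear', 2), ('fig', 2)]
```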

Surrogate Character. A misnomer. It would be an encoded character
having a surrogate code point, which is impossible. Do not use this
term.

Surrogate Code Point. A
Unicode code point in the range U+D800..U+DFFF. Reserved for use by UTF-16, where a pair of
surrogate code units (a high surrogate followed by a low surrogate)
“stand in” for a supplementary code point.
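
How a surrogate pair stands in for a supplementary code point can be sketched in Python. The helper names is_surrogate and from_surrogate_pair are assumptions for this example; the arithmetic is the usual UTF-16 calculation.

```python
def is_surrogate(code_point: int) -> bool:
    """Return True for code points in the range U+D800..U+DFFF."""
    return 0xD800 <= code_point <= 0xDFFF

def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a high and a low surrogate code unit into a supplementary code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first unit is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second unit is not a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print(is_surrogate(0xD83D))                      # True
print(hex(from_surrogate_pair(0xD83D, 0xDE00)))  # 0x1f600
```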

Tagging. The association of attributes of text with a point or range
of the primary text. The value of a particular tag is not generally
considered to be a part of the “content” of the text. A typical
example of tagging is to mark the language or the font for a portion
of text.

Tailorable. A characteristic of an algorithm for which a
higher-level protocol may specify different results than those
specified in the algorithm. A tailorable algorithm without actual
tailoring is also known as a default algorithm, and the results of
an algorithm without tailoring are known as the default results.

TEX. Computer language designed for use in typesetting—in
particular, for typesetting math and other technical material.
(According to Knuth, TEX rhymes with the word blecchhh.)

Text Element. A minimum unit of text in relation to a particular
text process, in the context of a given writing system. In general,
the mapping between text elements and code points is many-to-many.
(See
Chapter 2, General Structure.)

Titlecase. Uppercased initial letter followed by lowercase letters
in words. A casing convention often used in titles, headers, and
entries, as exemplified in this glossary.

Tonal Sandhi. A phonological
process whereby the tone associated with one syllable in a tonal
language influences the realization of a tone associated with a
neighboring syllable.

Tone Mark. A
diacritic or
nonspacing mark that represents a phonemic
tone. Tone languages are common in Southeast Asia and Africa.
Because tones always accompany vowels (the syllabic nucleus), they
are most frequently written using functionally independent marks
attached to a vowel symbol. However, some writing systems such as
Thai place tone marks on consonant symbols; Chinese does not use
tone marks (except when it is written phonemically).

Tonemic. Refers to the underlying,
distinctive units of a tonal system in a language. Tones of a tonal
language are often referred to by numbers (“tone 1,” “tone 2,” and
so on), and each tone has an idealized, specific tone level or
contour that is considered to be its tonemic value. The term was
created by analogy with phonemic.

Tonetic. Refers to the surface,
actual pitch realization of tones in a tonal system. Tonetic values
are what can be directly measured by tracking pitch contours in
actual speech recordings. The term was created by analogy with
phonetic.

Tonos. The basic accent in modern Greek, having the form of an acute
accent.

Typographic Interaction.
Graphical application of one nonspacing mark in a position relative
to a grapheme base that is already occupied by another nonspacing
mark, so that some rendering adjustment must be done (such as
default stacking or side-by-side placement) to avoid illegible
overprinting or crashing of glyphs. (See definition D106 in
Section 3.11, Normalization Forms.)

Unicameral. A script that has no
case distinctions. Most often used
in the context of European alphabets.

Unicode. (1) The standard for digital representation of the characters used in writing all of the world's languages. Unicode provides a uniform means for storing, searching, and interchanging text in any language. It is used by all modern computers and is the foundation for processing text on the Internet. Unicode is developed and maintained by the Unicode Consortium:
http://www.unicode.org. (2) A label applied to software internationalization and localization standards developed and maintained by the Unicode Consortium.

Unicode Common
Locale Data Repository.
The repository of locale data in XML format maintained by the
Unicode Consortium (http://www.unicode.org/cldr/).
This repository provides information needed in the localization of
software products into a wide variety of languages, supplying (among
other things): date, time, number, and currency formats; sorting,
searching, and matching information; and translated names for
languages, territories, scripts, currencies, and time zones. (See
also Unicode Locale
Data Markup Language.)

Unicode Consortium. A standards development organization creating widely used specifications related to character encoding, as well as for software internationalization and localization. Major projects are the Unicode Standard and the Unicode Locales Project, which defines repositories of standardized data needed
to develop software for particular regions and cultures. The Consortium was founded in 1991, and is headquartered in Mountain View, California. Its current members include major software corporations, governments, and academic institutions. See
http://www.unicode.org.

UTF-8. A multibyte encoding for text that represents each Unicode
character with 1 to 4 bytes, and which is backward-compatible with
ASCII. UTF-8 is the predominant form of Unicode in web pages. More
technically: (1) The UTF-8 encoding form. (2) The UTF-8 encoding
scheme. (3) “UCS Transformation Format 8,” defined in Annex D of
ISO/IEC 10646:2003, technically equivalent to the definitions in the
Unicode Standard.
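
The 1 to 4 byte range and the ASCII compatibility can be checked with Python's built-in UTF-8 codec. The helper name utf8_length and the sample characters are choices made for this sketch.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))

print(utf8_length("A"))           # 1  (U+0041)
print(utf8_length("\u00e9"))      # 2  (U+00E9)
print(utf8_length("\u20ac"))      # 3  (U+20AC)
print(utf8_length("\U0001F600"))  # 4  (U+1F600)

# ASCII text has the same bytes in ASCII and in UTF-8.
print("plain".encode("utf-8") == "plain".encode("ascii"))  # True
```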

UTF-16. A multibyte encoding for text that
represents each Unicode character with 2 or 4 bytes; it is not
backward-compatible with ASCII. It is the internal form of
Unicode in
many programming languages, such as Java, C#, and JavaScript, and in
many operating systems. More technically: (1)
The UTF-16 encoding
form. (2) The
UTF-16 encoding scheme. (3) “Transformation format for
16 planes of Group 00,” defined in Annex C of ISO/IEC 10646:2003;
technically equivalent to the definitions in the Unicode Standard.
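
A short Python sketch of the 2-or-4-byte behavior, using the built-in "utf-16-le" codec so that no signature is prepended. The helper name utf16_length is hypothetical.

```python
def utf16_length(ch: str) -> int:
    """Return the number of bytes UTF-16 uses for a single character."""
    return len(ch.encode("utf-16-le"))

print(utf16_length("A"))           # 2: a BMP character
print(utf16_length("\U0001F600"))  # 4: a supplementary character

# Not ASCII-compatible: even "A" is encoded as two bytes.
print("A".encode("utf-16-le"))     # b'A\x00'
```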

UTF-16 Encoding Form.
The Unicode encoding form that assigns each Unicode scalar value in
the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned
16-bit code unit with the same numeric value as the Unicode scalar
value, and that assigns each Unicode scalar value in the range
U+10000..U+10FFFF to a surrogate pair, according to Table 3-5,
“UTF-16 Bit Distribution.” (See definition D91 in
Section 3.9, Unicode Encoding Forms.)
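
The mapping from scalar values to 16-bit code units can be sketched in Python as follows. The function names are assumptions for this example, and the result is cross-checked against Python's built-in "utf-16-be" codec rather than against the table cited above.

```python
import struct

def utf16_code_units(scalar: int) -> list:
    """Return the UTF-16 code units for one Unicode scalar value."""
    if not 0 <= scalar <= 0x10FFFF or 0xD800 <= scalar <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if scalar <= 0xFFFF:
        return [scalar]  # one code unit with the same numeric value
    offset = scalar - 0x10000  # 20 bits, split into two groups of 10
    return [0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF)]

def codec_code_units(scalar: int) -> list:
    """Reference result: code units produced by Python's own UTF-16 codec."""
    data = chr(scalar).encode("utf-16-be")
    return list(struct.unpack(f">{len(data) // 2}H", data))

print([hex(u) for u in utf16_code_units(0x0041)])   # ['0x41']
print([hex(u) for u in utf16_code_units(0x1F600)])  # ['0xd83d', '0xde00']
print(utf16_code_units(0x1F600) == codec_code_units(0x1F600))  # True
```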

Varia. Greek term for grave accent,
used in polytonic Greek character names.

Virama. From Sanskrit. The name of a sign used in many Indic
and other Brahmi-derived scripts to suppress the inherent vowel of
the consonant to which it is applied, thereby generating a dead
consonant. (See
Section 12.1, Devanagari.) The sign varies in shape
from script to script, and may be known by other names in various
languages. For example, in Hindi it is known as hal or halant, in
Bangla it is called hasant, and in Tamil it is called pulli.

Visual Ambiguity. A situation arising from two characters (or
sequences of characters) being rendered indistinguishably.

wchar_t. The ANSI C defined wide character type, usually implemented
as either 16 or 32 bits. ANSI specifies that wchar_t be an integral
type and that the C language source character set be mappable by
simple extension (zero- or sign-extension).

Writing Direction. The direction or orientation of writing
characters within lines of text in a writing system. Three
directions are common in modern writing systems: left to right,
right to left, and top to bottom.

Writing System. A set of rules for using one or more scripts to
write a particular language. Examples include the American English
writing system, the British English writing system, the French
writing system, and the Japanese writing system.

XML. eXtensible Markup Language. A subset of SGML constituting a
particular text markup language for interchange of structured data.
The Unicode Standard is the reference character set for XML content.
(See also SGML and rich text.) XML is a trademark of the World Wide
Web Consortium.

Ypogegrammeni. Greek term for subscript iota, used in polytonic Greek character names.