Abstract

This Architectural Specification provides authors of specifications,
software developers, and content developers with a common reference for
interoperable text manipulation on the World Wide Web, building on the Universal Character Set,
defined jointly by the Unicode Standard and
ISO/IEC 10646. Topics addressed include
use of the terms 'character', 'encoding' and 'string', a reference processing model, choice and identification of character encodings, character escaping, string indexing, and URI conventions.

For normalization and string identity matching, see the companion document Character Model for the World Wide Web 1.0: Normalization [CharNorm].

Status of this Document

This section describes the status of this document at the time
of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is a third Last Call Working Draft for review by W3C Members and
other interested parties. The Last Call period begins 25 February 2004 and
ends 19 March 2004. Since the last publication, the document has been split into two parts, called Fundamentals (this document) and Normalization ([CharNorm], not yet in last call), in order to advance the material in this draft while continuing to work on Normalization. The main goal of this third last call is to crosscheck that earlier comments have been addressed adequately (see the last call disposition in a public version and
Members only version) and no new problems have been introduced.

The I18N WG invites comments on this specification. Comments
should be submitted via
the Last Call Comment Form (http://www.w3.org/2002/05/charmod/LastCall). Comments may alternatively be submitted by email to www-i18n-comments@w3.org (public archive). In this case, please send one email per comment where possible, otherwise number comments clearly.

This document is published as part of the
W3C
Internationalization Activity by the W3C
Internationalization Working Group (I18N WG) (Members only),
with the help of the Internationalization Interest Group. The Working Group expects to advance this Working Draft to Recommendation. The
Internationalization Working Group will not allow early implementation to
constrain its ability to make changes to this specification prior to final
release. Publication as a Working Draft does not imply endorsement by the W3C
Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material
or to cite them as other than "work in progress".

1 Introduction

1.1 Goals and Scope

The goal of this document is to facilitate use of the Web by all
people, regardless of their language, script, writing system, and cultural
conventions, in accordance with the
W3C goal of universal
access. One basic prerequisite to achieve this goal is to be able to
transmit and process the characters used around the world in a well-defined and
well-understood way.

The main target audience of this document is W3C specification
developers. This document and parts of it can be referenced from other
W3C specifications. This document defines conformance criteria for W3C specifications as well as other specifications.

Other audiences of this document include software developers,
content developers, and authors of specifications outside the W3C. Software
developers and content developers implement and use W3C specifications. This
document defines some conformance criteria for implementations (software) and
content that implement and use W3C specifications. It also helps
software developers and content developers to understand the character-related
provisions in W3C specifications.

The character model described in this document provides authors of
specifications, software developers, and content developers with a common
reference for consistent, interoperable text manipulation on the World Wide
Web. Working together, these three groups can build a more international
Web.

Topics addressed in this part of the Character Model for the World Wide Web include
use of the terms 'character', 'encoding' and 'string', a reference processing model, choice and identification of character encodings, character escaping, string indexing, and Internationalized Resource Identifiers (IRI) conventions.

Another part of the Character Model [CharNorm] addresses normalization and string identity matching.

Topics not yet addressed, or only barely touched, include fuzzy matching and language tagging. Some of these topics may be addressed in a
future version of this specification.

At the core of the model is the Universal Character Set (UCS),
defined jointly by the Unicode Standard [Unicode] and ISO/IEC
10646 [ISO/IEC 10646]. In this document, Unicode is used
as a synonym for the Universal Character Set. The model will allow Web
documents authored in the world's scripts (and on different platforms) to be
exchanged, read, and searched by Web users around the world.

1.2 Background

This section provides some historical background on the topics
addressed in this document.

Starting with Internationalization of the Hypertext Markup
Language [RFC 2070], the Web community has recognized
the need for a character model for the World Wide Web. The first step towards
building this model was the adoption of Unicode as the document character set
for HTML.

The choice of Unicode was motivated by the fact that Unicode:

is the only universal character repertoire available,

provides a way of referencing characters independent of the
encoding of the text,

is being updated/completed carefully,

is widely accepted and implemented by industry.

W3C adopted Unicode as the document character set for HTML in
[HTML 4.0]. The same approach was later used for specifications
such as XML 1.0 [XML 1.0] and CSS2 [CSS2].
W3C specifications and applications now use Unicode as
the common reference character set.

While data transfer on the Web remained mostly unidirectional (from
server to browser), and the main purpose was to render documents, the use
of Unicode without specifying additional details was sufficient. However, the
Web has grown:

Data transfers among servers, proxies, and clients, in all
directions, have increased.

Non-ASCII characters [ISO/IEC 646] are being used in
more and more places.

In short, the Web may be seen as a single, very large application
(see [Nicol]), rather than as a collection of small independent
applications.

While these developments strengthen the requirement that Unicode be
the basis of a character model for the Web, they also create the need for
additional specifications on the application of Unicode to the Web. Some
aspects of Unicode that require additional specification for the Web include:

Choice of Unicode encoding forms (UTF-8, UTF-16, UTF-32).

Counting characters, measuring string length in the presence
of variable-length character encodings and combining characters.

Duplicate encodings of characters (e.g. precomposed vs decomposed).

Use of control codes for various purposes (e.g. bidirectionality
control, symmetric swapping, etc.).

It should be noted that such
aspects also exist in legacy encodings (where
legacy encoding is taken to mean any character encoding not based
on Unicode), and in many cases have been inherited by Unicode in one way or
another from such legacy encodings.
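
Two of the aspects listed above, string length and duplicate encodings, can be illustrated with a minimal Python sketch. The sample strings and variable names are illustrative assumptions, and the use of normalization form NFC is only one possible way to reconcile the two spellings; normalization itself is the subject of the companion document [CharNorm].

```python
import unicodedata

precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# Both are displayed as the same accented letter, yet they are
# different sequences of code points with different lengths.
same_code_points = precomposed == decomposed
lengths = (len(precomposed), len(decomposed))

# Normalizing both strings to one form (NFC here) makes them compare equal.
same_after_nfc = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)
```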

The remainder of this document presents additional specifications
and requirements to ensure an interoperable character model for the Web, taking
into account earlier work (from W3C, ISO and IETF).

The first few chapters of the Unicode Standard [Unicode] provide very useful background reading. The policies adopted by the IETF on the use of character sets on
the Internet are documented in [RFC 2277].

1.3 Terminology and Notation

For the purpose of this specification,
the producer of text data is the sender of the data in the case of
protocols, and the tool that produces the data in the case of formats. The
recipient of text data is the software module that receives the
data.

NOTE: A software module may be both a recipient and a producer.

Unicode code points are denoted as U+hhhh, where "hhhh" is a
sequence of at least four, and at most six hexadecimal digits.
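
As an illustration of this notation, the following Python sketch converts between integer code points and the U+hhhh form. The function names are hypothetical and are not defined by this specification.

```python
import string


def format_code_point(cp: int) -> str:
    """Return the U+hhhh notation for a Unicode code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp!r}")
    # At least four hexadecimal digits; U+10FFFF, the maximum, needs six.
    return f"U+{cp:04X}"


def parse_code_point(text: str) -> int:
    """Parse U+hhhh notation (four to six hexadecimal digits)."""
    digits = text[2:]
    if (
        not text.startswith("U+")
        or not 4 <= len(digits) <= 6
        or any(d not in string.hexdigits for d in digits)
    ):
        raise ValueError(f"not in U+hhhh form: {text!r}")
    cp = int(digits, 16)
    if cp > 0x10FFFF:
        raise ValueError(f"beyond the Unicode code space: {text!r}")
    return cp
```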

Text has been used for examples to allow them to be cut and pasted by the reader. Characters used will not appear as intended unless you have the appropriate font, but care has been taken to annotate the examples so that they remain understandable even if you do not. In some cases it is important to see the result of an example, so images have been used; by clicking on the image it is possible to link to the text for these examples in C Example text.

2 Conformance

This section explains what conditions specifications, software, and Web content have to fulfill to be able to claim conformance to this specification.

The key words "MUST", "MUST
NOT", "REQUIRED", "SHALL",
"SHALL NOT", "SHOULD", "SHOULD
NOT", "RECOMMENDED", "MAY" and
"OPTIONAL" in this document are to be interpreted as
described in RFC 2119 [RFC 2119].

NOTE: RFC 2119 makes it clear that requirements that use
SHOULD are not optional and must be complied with unless
there are specific reasons not to: "This word, or the adjective
"RECOMMENDED", mean that there may exist valid reasons in particular
circumstances to ignore a particular item, but the full implications must be
understood and carefully weighed before choosing a different
course."

This specification defines conformance criteria
for specifications, for software and for Web content. To aid the reader, all
conformance criteria are
preceded by '[X]' where 'X' is one of
'S' for specifications, 'I' for software
implementations, and 'C' for Web content. These markers indicate
the relevance of the conformance criteria and allow the
reader to quickly locate relevant conformance criteria by searching through this document.

A specification conforms to this document if it:

does not violate any conformance criteria preceded by [S],

documents the reason for any deviation from criteria where the imperative is SHOULD, SHOULD NOT, or RECOMMENDED,

if applicable, requires implementations conforming to the specification to conform to this document,

if applicable, requires content conforming to the specification to conform to this document.

An implementation (software) conforms to this document if it does not
violate any conformance criteria preceded by [I].

Content conforms to this document if it does not violate any conformance criteria preceded by [C].

NOTE: Requirements placed on specifications might indirectly cause requirements to be placed on implementations or content that claim to conform to those specifications. Likewise, requirements placed on content may affect implementations designed to produce such content, and so on.

Where this specification places requirements on processing, it is to be understood as a way to
specify the desired external behavior. Implementations can
use other means of achieving the same results, as
long as observable behavior is not affected.

3 Perceptions of Characters

3.1 Introduction

"Character. (1) The smallest component of written language
that has semantic values; refers to the abstract meaning and/or shape
..."

The word 'character' is used in many contexts, with
different meanings. Human cultures have radically differing writing systems,
leading to radically differing concepts of a character. Such wide variation in
end user experience can, and often does, result in misunderstanding. This
variation is sometimes mistakenly seen as the consequence of imperfect
technology. Instead, it derives from the great flexibility and creativity of
the human mind and the long tradition of writing as an important part of the
human cultural heritage. The alphabetic approach used by scripts such as Latin,
Cyrillic and Greek is only one of several possibilities.

EXAMPLE: A character in Japanese hiragana and katakana scripts corresponds
to a syllable (usually a combination of consonant plus vowel).

EXAMPLE: Korean Hangul combines symbols for
individual sounds of the language into square blocks, each of which represents a syllable. Depending on the
user and the application, either the individual symbols or the syllabic
clusters can be considered to be characters.

EXAMPLE: In Indic scripts
each consonant letter carries an inherent vowel that is
eliminated or replaced using semi-regular or irregular ways to combine
consonants and vowels into clusters. Depending on the user and the application,
either individual consonants or vowels, or the consonant or consonant-vowel
clusters can be perceived as characters.

EXAMPLE: In Arabic and Hebrew vowel sounds are typically not written at all.
When they are written they are indicated by the use of combining marks placed
above and below the consonantal letters.

The developers of specifications, and the developers of
software based on those specifications, are likely to be more familiar with
usages of the term 'character' they have experienced
and less familiar with the wide variety of usages in an international context.
Furthermore, within a computing context, characters are often confused with
related concepts, resulting in incomplete or inappropriate specifications and
software.

This section examines some of these contexts, meanings and
confusions.

3.2 Units of aural rendering

In some scripts, characters have a close relationship to phonemes
(a phoneme is a minimally distinct sound in the context of a
particular spoken language), while in others they are closely related to
meanings. Even when characters (loosely) correspond to phonemes, this
relationship may not be simple, and there is rarely a one-to-one correspondence
between character and phoneme.

EXAMPLE: In the English sentence,
"They were too close to the door to close it." the same character
's' is used to represent both /s/ and /z/ phonemes.

EXAMPLE: In the English language the phoneme /k/ of "cool" is like the phoneme /k/ of "keel", although the two words write it with different characters ('c' and 'k').

EXAMPLE: In many scripts a single character may represent a sequence of
phonemes, such as the syllabic characters of Japanese hiragana.

EXAMPLE: In many writing systems a sequence of characters may represent a
single phoneme, for example 'wr' and 'ng' in
"writing".

C001[S][I][C]
Specifications,
software and content MUST NOT assume that there is a one-to-one
correspondence between characters and the sounds of a
language.

3.3 Units of visual
rendering

Visual rendering introduces the notion of a
glyph. Glyphs are defined by ISO/IEC 9541-1
[ISO/IEC 9541-1] as "a recognizable abstract graphic symbol which
is independent of a specific design". There is not a
one-to-one correspondence between characters and glyphs:

A single character can be represented by multiple glyphs
(each glyph is then part of the representation of that character). These glyphs
may be physically separated from one another.

A single glyph may represent a sequence of characters (this
is the case with ligatures, among others).

A character may be rendered with very different glyphs
depending on the context.

A single glyph may represent different characters (e.g.
capital Latin A, capital Greek A and capital Cyrillic A).

Each glyph can be represented by a number of different glyph
images; a set of glyph images makes up a font. Glyphs can be
construed as the basic units of organization of the visual rendering of text,
just as characters are the basic unit of organization of encoded text.

C002[S][I][C]
Specifications,
software and content MUST NOT assume a one-to-one mapping between
characters and units of displayed text.
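
The last case in the list above (one glyph, several characters) can be checked with Python's standard unicodedata module. This is a small sketch; the variable names are illustrative assumptions.

```python
import unicodedata

# Three characters that are commonly displayed with the same glyph.
look_alikes = ["\u0041", "\u0391", "\u0410"]

# Despite the shared appearance, each is a distinct character
# with its own code point and name.
names = [unicodedata.name(ch) for ch in look_alikes]
distinct = len(set(look_alikes))
```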

Some scripts, in particular Arabic and Hebrew, are written from
right to left. Text including characters from these scripts can run in both
directions and is therefore called bidirectional text. The Unicode Standard
[Unicode] requires that characters be stored and interchanged in
logical order, i.e. roughly corresponding to the order in which text
is typed in via the keyboard (for a more detailed definition see
[Unicode 4.0], Section 2.2). Logical ordering is
important to ensure interoperability of data, and also benefits accessibility,
searching, and collation.

C003[S][I][C]
Protocols,
data formats and APIs MUST store, interchange or process
text data in logical order.
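
As a hedged illustration of logical order, the Python sketch below stores a short mixed-direction string in typing order and inspects the bidirectional class of each character. The sample string is an assumption chosen for the example; the reordering needed for display is left to the rendering layer and does not change the stored sequence.

```python
import unicodedata

# Stored in logical (typing) order: Latin "abc", a space,
# then the Hebrew letters alef, bet, gimel.
text = "abc \u05D0\u05D1\u05D2"

# Direction is a property of each character, not of the storage order.
bidi_classes = [unicodedata.bidirectional(ch) for ch in text]
```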

In the presence of bidirectional text, two possible selection
modes can be considered. The first is logical selection mode,
which selects all the characters logically located between the
end-points of the user's mouse gesture. Here the user selects from between the
first and second letters of the second word to the middle of the number.
Logical selection looks like this:

Figure (image not reproduced): Logical selection resulting in discontiguous visual ranges, shown as visual display and logical order.

It is a consequence of the bidirectionality of the text that a
single, continuous logical selection in memory results in a discontinuous
selection appearing on the screen. This discontinuity makes some users prefer a
visual selection mode, which selects all the characters
visually located between the end-points of the user's mouse
gesture. With the same mouse gesture as before, we now obtain:

Figure (image not reproduced): Visual selection resulting in discontiguous logical ranges, shown as visual display and logical order.

In visual selection mode, as seen in the example above, a single visual selection range may result in
two or more logical ranges, which may have to be accommodated by protocols,
APIs and implementations. Other, related aspects of a user interface for bidirectional text include caret movement, behavior of backspace/delete keys, and so on.

Currently, most implementations provide logical selection, while only very few provide visual selection.

C075[I]
Independent of whether some implementation uses logical selection or visual selection, characters selected MUST be kept in logical order in storage.

C004[S]
Specifications of protocols
and APIs that involve selection of ranges SHOULD provide for
discontiguous selections, at least to the extent necessary to support
implementation of visual selection on screen on top of those protocols and
APIs.
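
One way an API might represent a discontiguous selection is as a list of logical ranges. The Python sketch below assumes that a visual-to-logical index map has already been computed by the rendering layer; the function name, the map and the sample line are hypothetical, and the bidirectional algorithm itself is not implemented here.

```python
from typing import List, Tuple


def logical_ranges(
    visual_to_logical: List[int], start: int, end: int
) -> List[Tuple[int, int]]:
    """Map the visual selection [start, end) to half-open logical ranges."""
    selected = sorted(visual_to_logical[start:end])
    ranges: List[Tuple[int, int]] = []
    for index in selected:
        if ranges and ranges[-1][1] == index:
            # Extends the previous range by one character.
            ranges[-1] = (ranges[-1][0], index + 1)
        else:
            ranges.append((index, index + 1))
    return ranges


# Hypothetical line: two left-to-right letters (logical 0-1), three
# right-to-left letters (logical 2-4, displayed reversed), then two
# more left-to-right letters (logical 5-6).
visual_to_logical = [0, 1, 4, 3, 2, 5, 6]

# Selecting visual positions 1 and 2 crosses the direction boundary.
crossing = logical_ranges(visual_to_logical, 1, 3)
```

Whatever ranges are reported, the selected characters themselves stay in logical order in storage; only the description of the selection becomes discontiguous.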

3.4 Units of input

In keyboard input, it is not always the case that
keystrokes and input characters correspond one-to-one. A limited number of keys
can fit on a keyboard. Some keyboards will generate multiple characters from a
single keypress. In other cases ('dead keys') a key will generate
no characters, but affect the results of subsequent keypresses. Many writing
systems have far too many characters to fit on a keyboard and must rely on more
complex input methods, which transform keystroke sequences into
character sequences. Other languages may make it necessary to input some
characters with special modifier keys. See B Examples of Characters, Keystrokes
and Glyphs
for examples of non-trivial input.

C005[S][I]
Specifications
and software MUST NOT assume that a single keystroke results
in a single character, nor that a single character can be input with a single
keystroke (even with modifiers), nor that keyboards are the same all over the
world.
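
The following Python sketch shows why keystrokes and characters cannot be counted against each other. The key names and the two-entry layout are hypothetical; real input methods are considerably more complex.

```python
import unicodedata
from typing import Iterable

# Hypothetical layout tables (assumptions for illustration only).
DEAD_KEYS = {"dead_acute": "\u0301"}                # COMBINING ACUTE ACCENT
MULTI_CHAR_KEYS = {"key_lam_alef": "\u0644\u0627"}  # ARABIC LAM + ALEF


def keystrokes_to_text(keys: Iterable[str]) -> str:
    out = []
    pending = ""  # combining mark waiting for the next base character
    for key in keys:
        if key in DEAD_KEYS:
            pending = DEAD_KEYS[key]  # produces no character by itself
        elif key in MULTI_CHAR_KEYS:
            out.append(MULTI_CHAR_KEYS[key])  # one key, two characters
        else:
            # Combine the base character with any pending mark; NFC yields
            # a precomposed character where one exists.
            out.append(unicodedata.normalize("NFC", key + pending))
            pending = ""
    return "".join(out)
```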

3.5 Units of collation

String comparison as used in sorting and searching is based on
units which do not in general have a one-to-one relationship to encoded
characters. Such string comparison can aggregate a character sequence into a
single collation unit with its own position in the sorting order,
can separate a single character into multiple collation units, and can
distinguish various aspects of a character (case, presence of diacritics, etc.)
to be sorted separately (multi-level sorting).

In addition, a certain amount of pre-processing may also be
required, and in some languages (such as Japanese and Arabic) sort order may be
governed by higher order factors such as phonetics or word roots. Collation
methods may also vary by application.

EXAMPLE: In traditional Spanish sorting, the character sequences 'ch' and 'll' are treated as atomic collation units.
Although Spanish sorting, and to some extent Spanish everyday use, treat
'ch' as a single unit, current digital encodings treat it as two
characters, and keyboards do the same (the user types 'c', then
'h').

EXAMPLE: In some languages, the letter
'æ' is sorted as two consecutive collation units: 'a'
and 'e'.

EXAMPLE: The sorting of text written in a
bicameral script (i.e. a script which has distinct upper and lower case
letters) is usually required to ignore case differences in a first pass; case
is then used to break ties in a later pass.

EXAMPLE: Treatment of
accented letters in sorting is dependent on the script or language in question.
The letter 'ö' is treated as a modified 'o' in
French, but as a letter completely independent from 'o' (and
sorting after 'z') in Swedish. In German certain applications
treat the letter 'ö' as if it were the sequence
'oe'.

EXAMPLE: In Thai the sequence 'ไก' (U+0E44 U+0E01) must
be sorted as if it were written 'กไ' (U+0E01 U+0E44). Reordering is typically done
during an initial pre-processing stage.
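
A minimal Python sketch of such a pre-processing step follows. It assumes that only the Thai leading vowels U+0E40 to U+0E44 are reordered and that the consonants are U+0E01 to U+0E2E; the function name is hypothetical and the sketch is not a complete Thai collation implementation.

```python
import re

# A leading vowel (U+0E40..U+0E44) immediately followed by a consonant
# (U+0E01..U+0E2E).
_LEADING_VOWEL = re.compile("([\u0E40-\u0E44])([\u0E01-\u0E2E])")


def thai_sort_form(text: str) -> str:
    """Swap each leading vowel with the consonant that follows it."""
    return _LEADING_VOWEL.sub(r"\2\1", text)
```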

EXAMPLE: German dictionaries typically sort 'ä', 'ö' and 'ü' together with 'a', 'o' and 'u' respectively. On the other hand, German telephone books typically sort 'ä', 'ö' and 'ü' as if they were spelled 'ae', 'oe' and 'ue'. Here the application is affecting the collation algorithm used.
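
The two German conventions can be sketched as two different sort keys in Python. This is a deliberately simplified illustration: the key functions are hypothetical, handle only the three lowercase umlauts, and ignore tie-breaking levels that a full collation algorithm would apply.

```python
# \u00E4 = a-umlaut, \u00F6 = o-umlaut, \u00FC = u-umlaut
DICTIONARY = str.maketrans({"\u00E4": "a", "\u00F6": "o", "\u00FC": "u"})
PHONEBOOK = str.maketrans({"\u00E4": "ae", "\u00F6": "oe", "\u00FC": "ue"})


def dictionary_key(word: str) -> str:
    return word.lower().translate(DICTIONARY)


def phonebook_key(word: str) -> str:
    return word.lower().translate(PHONEBOOK)


words = ["Ofen", "\u00D6l", "Oder"]  # "\u00D6l" is O-umlaut followed by 'l'

dictionary_order = sorted(words, key=dictionary_key)
phonebook_order = sorted(words, key=phonebook_key)
```

The same three words come out in a different order under each key, which is the sense in which the application affects the collation algorithm.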

C006[S][I]
Software
that sorts or searches text for users
SHOULD do so on
the basis of appropriate collation units and ordering rules for the relevant
language and/or application.

C007[S][I]
Where searching or sorting is done dynamically,
particularly in a multilingual environment, the 'relevant language'
SHOULD be determined to be that of the current user, and may
thus differ from user to user.

C066[S][I]
Software that allows
users to sort or search text SHOULD allow the user to select
alternative rules for collation units and ordering.

C008[S][I]
Specifications and implementations of sorting and searching algorithms SHOULD accommodate all characters in Unicode.

ISO/IEC 14651 [ISO/IEC 14651] and Unicode Technical Report #10, the Unicode Collation
Algorithm [UTR #10], describe a model for collation that accommodates most
languages and provide a default collation order. They are appropriate
references for collation and provide implementation guidelines.
The default collation order can be used in conjunction with rules tailored for a particular locale
to ensure a predictable ordering and comparison of strings, whatever characters
they include.

3.6 Units of storage

Computer storage and communication rely on units of physical
storage and information interchange, such as bits and bytes (8-bit units, also called octets). A frequent error in specifications and implementations is
the equating of characters with units of physical storage. The mapping between
characters and such units of storage is actually quite complex, and is
discussed in the next section, 4.1 Character Encoding.

C009[S][I][C]
Specifications,
software and content MUST NOT assume a one-to-one relationship
between characters and units of physical storage.
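
The mismatch between characters and units of storage can be seen by measuring one short string in several ways. In this Python sketch the sample string and variable names are assumptions chosen for illustration.

```python
# 'a', e with acute, a CJK ideograph, and a character outside the BMP.
text = "a\u00E9\u4E2D\U0001F600"

code_points = len(text)                               # Unicode code points
utf8_bytes = len(text.encode("utf-8"))                # 1 + 2 + 3 + 4 bytes
utf16_code_units = len(text.encode("utf-16-be")) // 2 # 1 + 1 + 1 + 2 units
utf32_bytes = len(text.encode("utf-32-be"))           # 4 bytes per code point
```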

3.7 Summary

The term character is used differently in a variety
of contexts and often leads to confusion when used outside of these contexts.
In the context of the digital representations of text, a character can be
defined informally as a small logical unit of text. Text is then
defined as sequences of characters. While such an informal definition is
sufficient to create or capture a common understanding in many cases, it is
also sufficiently open to create misunderstandings as soon as details start to
matter. In order to write effective specifications, protocol implementations,
and software for end users, it is very important to understand that these
misunderstandings can occur.

This section, 3 Perceptions of Characters, has discussed terms for units that do not necessarily overlap with the term 'character', such as phoneme, glyph, and collation unit. The next section, 4.1 Character Encoding, lists terms that should be used rather than 'character' to precisely define units of encoding (code point, code unit, and byte).

C010[S]
When specifications use the
term 'character' the specifications MUST
define which meaning they intend.

C067[S]
Specifications SHOULD
avoid the use of the term 'character' if a more specific term is
available.

4 Digital Encoding of Characters

4.1 Character Encoding

To be of any use in computers, in computer communications and in
particular on the World Wide Web, characters must be encoded. In fact, much of
the information processed by computers over the last few decades has been
encoded text, exceptions being images, audio, video and numeric data. To
achieve text encoding, a large variety of character encodings have been devised. Character encodings can loosely be explained as mappings between the character sequences that
users manipulate and the sequences of bits that computers manipulate.

Given the complexity of text encoding and the large variety of
mechanisms for character encoding invented throughout the computer age, a more
formal description of the encoding process is useful. The process of defining a
text encoding can be described as follows (see Unicode Technical Report #17:
Character Encoding Model [UTR #17] for a more detailed
description):

A set of characters to be encoded is identified. The
characters are pragmatically chosen to express text and to efficiently allow
various text processes in one or more target languages. They may not correspond
precisely to what users perceive as letters and other characters. The set of
characters is called a repertoire.

Each character in the repertoire is then associated with a
(mathematical, abstract) non-negative integer, the code point
(also known as a character number or code position).
The result, a mapping from the repertoire to the set of non-negative integers,
is called a coded character set (CCS).

To enable use in computers, a suitable base datatype is
identified (such as a byte, a 16-bit unit of storage or other) and a
character encoding form (CEF) is used, which encodes the abstract
integers of a CCS into sequences
of the code units of the base datatype. The character encoding form can be
extremely simple (for instance, one which encodes the integers of the
CCS into the natural
representation of integers of the chosen datatype of the computing platform) or
arbitrarily complex (a variable number of code units, where the value of each
unit is a non-trivial function of the encoded integer).

To enable transmission or storage using byte-oriented devices,
a serialization scheme or character encoding scheme
(CES) is next used. A CES is a mapping of the code units
of a CEF into well-defined
sequences of bytes, taking into account the necessary specification of
byte-order for multi-byte base datatypes and including in some cases switching
schemes between the code units of multiple
CESes (an example is ISO
2022). A CES, together
with the CCSes it is used
with, is called a character encoding, and is identified by a unique identifier, such as an
IANA charset
identifier. Given a sequence of bytes representing text and a character encoding identified by a charset
identifier, one can in principle unambiguously recover the sequence of
characters of the text.
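
The layers described above can be traced for a single character. The Python sketch below uses UTF-16 as the character encoding form and its big-endian and little-endian serializations as character encoding schemes. The helper names are hypothetical, and the sketch assumes a valid, non-surrogate code point.

```python
import struct
from typing import List


def utf16_code_units(code_point: int) -> List[int]:
    """Character encoding form: code point -> 16-bit code units."""
    if code_point < 0x10000:
        return [code_point]
    offset = code_point - 0x10000
    # Supplementary code points become a high and a low surrogate.
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def serialize(code_units: List[int], big_endian: bool = True) -> bytes:
    """Character encoding scheme: code units -> bytes in a given byte order."""
    fmt = (">" if big_endian else "<") + "H" * len(code_units)
    return struct.pack(fmt, *code_units)


ch = "\U0001D11E"               # MUSICAL SYMBOL G CLEF
cp = ord(ch)                    # coded character set: the code point
units = utf16_code_units(cp)    # character encoding form
be_bytes = serialize(units, big_endian=True)
le_bytes = serialize(units, big_endian=False)
```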

NOTE: The term 'character encoding' is somewhat ambiguous,
as it is sometimes used to describe the actual process of encoding characters
and sometimes to denote a particular way to perform that process (as in
"this file is in the X character encoding"). Context normally
allows the distinction of those uses, once one is aware of the ambiguity.

NOTE: Given a sequence of characters, a given 'character encoding' may not always produce the same sequence of bytes. In particular for encodings based on ISO 2022, there may be choices available during the encoding process.

In very simple cases, the whole encoding process can be collapsed to
a single step, a trivial one-to-one mapping from characters to bytes; this is
the case, for instance, for US-ASCII [ISO/IEC 646] and ISO-8859-1.
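
The collapsed, one-to-one case can be verified directly with Python's built-in codecs. This is a small sketch; the variable names are illustrative.

```python
# In ISO-8859-1 every byte value maps to the code point with the same number.
latin1_is_identity = all(
    bytes([b]).decode("iso-8859-1") == chr(b) for b in range(256)
)

# US-ASCII does the same for the first 128 values only.
ascii_is_identity = all(
    bytes([b]).decode("ascii") == chr(b) for b in range(128)
)
```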

Text is said to be in a Unicode
encoding form if it is encoded in UTF-8, UTF-16 or UTF-32.

4.2 Transcoding

Transcoding is the process of
converting text from one character
encoding to another. Transcoders work only at
the level of character
encoding and do not parse the text; consequently, they do not deal with
character escapes such as numeric
character references (see 4.6 Character Escaping) and do not adjust
embedded character encoding information (for instance in an XML declaration or
in an HTML meta element).
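
The consequence of working only at the level of the character encoding can be seen in a short Python sketch. The sample document is an assumption made up for illustration; after transcoding, its encoding label is stale and its numeric character reference is untouched.

```python
source = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<p>caf\u00E9 &#x20AC;</p>"
)
latin1_bytes = source.encode("iso-8859-1")

# Transcoding: decode from the source encoding, encode in the target one.
utf8_bytes = latin1_bytes.decode("iso-8859-1").encode("utf-8")

label_unchanged = b'encoding="ISO-8859-1"' in utf8_bytes  # now incorrect
escape_unchanged = b"&#x20AC;" in utf8_bytes              # still an escape
e_acute_reencoded = b"caf\xc3\xa9" in utf8_bytes          # bytes did change
```

Adjusting the label is therefore a separate step that a transcoder alone does not perform.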

NOTE: Transcoding may involve one-to-one, many-to-one, one-to-many or
many-to-many mappings. In addition, the storage order of characters varies
between encodings: some, such as the Unicode encoding forms, prescribe
logical ordering, while others use visual ordering; among encodings that have
separate diacritics, some prescribe that they be placed before the base
character, some after. Because of these differences in sequencing characters,
transcoding may involve reordering: thus XYZ may map to yxz.

EXAMPLE: This first example shows the transcoding of the Russian word 'Русский' meaning 'Russian' (language),
from the UTF-16 encoding of Unicode to the ISO 8859-5 encoding:

UTF-16                               ISO 8859-5
Code unit  Char. name (abbreviated)  Code unit  Char. name (abbreviated)
0420       CAPITAL ER                C0         CAPITAL ER
0443       SMALL U                   E3         SMALL U
0441       SMALL ES                  E1         SMALL ES
0441       SMALL ES                  E1         SMALL ES
043A       SMALL KA                  DA         SMALL KA
0438       SMALL I                   D8         SMALL I
0439       SMALL SHORT I             D9         SMALL SHORT I
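
The first example can be reproduced with Python's built-in codecs. This is a sketch; big-endian byte order is assumed for the UTF-16 side, and the variable names are illustrative.

```python
# The Russian word from the example, written as code points.
word = "\u0420\u0443\u0441\u0441\u043A\u0438\u0439"

utf16_bytes = word.encode("utf-16-be")

# Transcode UTF-16 to ISO 8859-5 by way of the decoded characters.
iso_bytes = utf16_bytes.decode("utf-16-be").encode("iso8859_5")
```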

EXAMPLE: This second example shows a much more complex case, where the Arabic word 'السلام', meaning 'peace', is transcoded from the
visually-ordered, contextualized encoding IBM CP864 to the UTF-16 encoding of Unicode:

IBM CP864                            UTF-16
Code unit  Char. name (abbreviated)  Code unit  Char. name (abbreviated)
EF         FINAL MEEM                0627       ALEF
9E         MEDIAN LAM-ALEF           0644       LAM
D3         MEDIAN SEEN               0633       SEEN
E4         MEDIAN LAM                0644       LAM
C7         INITIAL ALEF              0627       ALEF
                                     0645       MEEM

Notice that the order of the characters has been reversed, that the single LAM-ALEF in CP864 has been converted to a LAM ALEF sequence in UTF-16, and that the contextual variants (initial, median or final) in the source encoding have been converted to generic characters in the target encoding.

4.3 Reference Processing Model

Many Internet protocols and data formats, most
notably the very important Web formats HTML, CSS and XML, are based on text. In
those formats, everything is text but the relevant specifications impose a
structure on the text, giving meaning to certain constructs so as to obtain
functionality in addition to that provided by plain text (text
where no markup or programming language applies). HTML and XML are markup
languages, defining
documents entirely composed of text but with
conventions allowing the separation of this text into markup and
character data. Citing from the XML 1.0 specification
[XML 1.0],
section
2.4:

"Text consists of intermingled character data and markup.
[...] All text that is not markup constitutes the character data of the
document."

For the purposes of this section, the important aspect is that
everything is text, that is, a sequence of characters.

A textual data object is a whole text protocol message or a whole text document, or a part of it that is treated separately for purposes of external storage and retrieval. Examples include external parsed entities in XML and textual MIME entities.

C013[S][C]
Textual data objects defined by
protocol or format specifications MUST be in a
single character encoding.
Note that this does not
imply that character set switching schemes such as ISO 2022 cannot be
used, since such schemes perform character set switching within a single
character encoding.

Since its early days, the Web has seen the
development of a Reference Processing Model, first described for
HTML in RFC 2070 [RFC 2070]. This model was later embraced by XML
and CSS. It is applicable to any data format or protocol that is text-based as
described above. The essence of the Reference Processing Model is the use of
Unicode as a common reference. Use of the Reference Processing Model by a
specification does not, however, require that implementations actually use
Unicode. The requirement is only that the implementations behave as if the
processing took place as described by the Model. Also, while this document uses the term Reference Processing Model and describes its properties in terms of processing, the model also applies to specifications that do not explicitly define a processing model.

Specifications MUST define text in terms of
Unicode characters, not bytes or glyphs.

For their textual data objects specifications MAY allow use of any
character encoding which can be transcoded to a Unicode encoding form.

Specifications MAY choose to disallow or
deprecate some character encodings and to make others mandatory. Independent of the
actual character encoding, the specified behavior MUST be the same
as if the processing happened as follows:

The character encoding of any textual data object received by the
application implementing the specification MUST be
determined and the data object MUST be interpreted as a
sequence of Unicode characters - this MUST be equivalent to
transcoding the data object to some
Unicode encoding form, adjusting
any character encoding label if necessary, and receiving it in that Unicode
encoding form.

All processing MUST take place on
this sequence of Unicode characters.

If text is output by the application, the sequence of
Unicode characters MUST be encoded using a character encoding chosen
among those allowed by the specification.
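The three steps above can be sketched in a few lines. This is a minimal illustration, not a normative algorithm: the function name `process_text_object`, its parameters, and the whitespace-trimming "processing" step are assumptions made for the example, and the encoding label is assumed to have been determined already.

```python
def process_text_object(data: bytes, declared_encoding: str,
                        output_encoding: str = "utf-8") -> bytes:
    """Illustrative decode / process / encode pipeline (hypothetical helper)."""
    # Step 1: interpret the received bytes as a sequence of Unicode characters.
    text = data.decode(declared_encoding)
    # Step 2: all processing operates on Unicode characters, never on bytes.
    # Trimming whitespace stands in for whatever the application really does.
    processed = text.strip()
    # Step 3: encode the result in an encoding allowed by the specification.
    return processed.encode(output_encoding)
```

An implementation is free to work differently internally, as long as its observable behavior matches what this pipeline would produce.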

If a specification is such that multiple textual data objects are
involved (such as an XML document referring to external parsed entities), it
MAY choose to allow these data objects to be in different
character encodings. In all cases, the Reference Processing Model MUST be applied to all textual data objects.

NOTE: All specifications which define applications of the XML 1.0 specification
[XML 1.0] automatically inherit this Reference Processing Model.
XML is entirely defined in terms of Unicode characters and mandates the UTF-8
and UTF-16 character encodings while allowing any other character encoding for parsed entities.

NOTE: When specifications choose to allow character encodings other than Unicode
encoding forms, implementers should be aware that the correspondence between the
characters of a legacy encoding and
Unicode characters may in practice depend on the software used for
transcoding. See the Japanese XML
Profile [XML Japanese Profile] for examples of such
inconsistencies.

C070[S]
Specifications SHOULD NOT arbitrarily exclude code points from the full range of Unicode code points from U+0000
to U+10FFFF inclusive. Specifications MUST NOT allow code points above U+10FFFF.

Excluding code points without good reason conflicts with the W3C goal of
universal accessibility. Excluding code points would prevent some scripts from
being used which may be important to a user community or communities. For
example, without strong reasons to do so, decisions to exclude code points above
the Basic Multilingual Plane or to limit code points to the ASCII or Latin-1
repertoire are inappropriate. Also, please note that the Unicode Standard requires software to not corrupt any
code points.

Unicode contains some code points for internal use (such as noncharacters) or
special functions (such as surrogate code points). To be consistent with the Unicode Standard, specifications should not
use these code points for interchange.
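As an illustration of the ranges involved, the following sketch classifies code points. The function names are hypothetical. The surrogate range (U+D800 to U+DFFF) and the noncharacters (U+FDD0 to U+FDEF, plus the last two code points of every plane) are taken from the Unicode Standard.

```python
MAX_CODE_POINT = 0x10FFFF

def is_valid_code_point(cp: int) -> bool:
    # The full Unicode code point range is U+0000 to U+10FFFF inclusive.
    return 0 <= cp <= MAX_CODE_POINT

def is_surrogate(cp: int) -> bool:
    # Surrogate code points are reserved for use by UTF-16.
    return 0xD800 <= cp <= 0xDFFF

def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in each of the 17 planes.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def suitable_for_interchange(cp: int) -> bool:
    return (is_valid_code_point(cp)
            and not is_surrogate(cp)
            and not is_noncharacter(cp))
```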

Other examples of legitimate and non-arbitrary reasons to exclude characters can
be seen in Unicode in XML and other Markup Languages [UXML], where the
use of certain characters is discouraged for reasons such as:

They are deprecated in the Unicode Standard.

They cannot be supported without additional data.

They are better handled by markup.

They conflict with equivalent markup.

4.4 Choice and Identification of Character
Encodings

Because encoded text cannot be interpreted and
processed without knowing the encoding, it is vitally important that the
character encoding (see 4.1 Character Encoding) is known at all times and
places where text is exchanged, stored or processed. In what follows we use
'character encoding' to mean either CEF or CES depending
on the context. When text is transmitted or stored as a byte stream, for
instance in a protocol or file system, specification of a CES is required to ensure proper
interpretation. In contexts such as an API, where the environment (typically
the processor architecture) specifies the byte order of multibyte quantities,
specification of a CEF suffices.
C015[S]
Specifications MUST
either specify a unique character encoding, or provide character encoding identification
mechanisms such that the encoding of text can be reliably
identified.
C016[S]
When
designing a new protocol, format or API, specifications
SHOULD mandate a unique character
encoding.
C017[S]
When basing
a protocol, format, or API on a protocol, format, or API that already
has rules for character encoding, specifications
SHOULD use rather than change these rules.

EXAMPLE: An XML-based format should use the existing XML rules for choosing and determining
the character encoding of external entities, rather than invent new ones.

4.4.1 Mandating a unique character
encoding

Mandating a unique character encoding is simple, efficient, and
robust. There is no need for specifying, producing, transmitting, and
interpreting encoding tags. At the receiver, the character encoding will always be
understood. There is also no ambiguity as to which character encoding to use if data is
transferred non-electronically and later has to be converted back to a digital
representation. Even when there is a need for compatibility with existing data,
systems, protocols and applications, multiple character encodings can often be dealt with
at the boundaries or outside a protocol, format, or API. The
DOM [DOM Level 1] is an
example of where this was done. The advantages of choosing a unique character encoding
become more important the smaller the pieces of text used are and the closer to
actual processing the specification is.

C018[S]
When a unique character encoding is
mandated, the character encoding MUST be UTF-8, UTF-16 or
UTF-32.
C019[S]
If a unique
character encoding is mandated and compatibility with US-ASCII is desired, UTF-8 (see
[RFC 3629]) is RECOMMENDED.
In
other situations, such as for APIs, UTF-16 or UTF-32 may be more appropriate.
Possible reasons for choosing one of these include efficiency of internal
processing and interoperability with other processes.
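The US-ASCII compatibility of UTF-8 can be checked directly. The sketch below encodes a sample ASCII-only string (chosen arbitrarily for the example) in several encodings and compares the results.

```python
text = "charset"  # arbitrary ASCII-only sample

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-be")   # two bytes per character here
utf32_bytes = text.encode("utf-32-be")   # four bytes per character

# For ASCII-only text, the UTF-8 byte sequence is identical to US-ASCII.
same_as_ascii = (utf8_bytes == ascii_bytes)
```

UTF-16 and UTF-32 produce different byte sequences for the same text, which is why they are not drop-in compatible with US-ASCII consumers.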

NOTE: The IETF Charset Policy [RFC 2277] specifies that
on the Internet "Protocols MUST be able to use the UTF-8
charset".

4.4.2 Character encoding
identification

The MIME Internet specification [MIME] provides a
good example of a mechanism for character encoding identification. The MIME
charset parameter definition is intended to supply sufficient
information to uniquely decode the sequence of bytes of the received data into
a sequence of characters. The values are drawn from the IANA charset registry
[IANA].

NOTE: Unfortunately, some charset identifiers do not represent a
single, unique character encoding. Instead, these identifiers denote a number of
small variations. Even though small, the differences
may be crucial and may vary over time. For these identifiers, recovery of the
character sequence from a byte sequence is ambiguous. For example, the
character encoded as 0x5C in Shift_JIS is ambiguous. This code point sometimes represents a YEN SIGN and sometimes
represents a REVERSE SOLIDUS. See the
[XML Japanese Profile] for more detail on this example and for
additional examples of such ambiguous charset identifiers.
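The effect of such an ambiguity can be sketched with two hypothetical transcoding tables for the same label. The tables and the helper `decode_7bit` are invented for this illustration and do not reproduce any real converter; they only model the single disputed byte 0x5C.

```python
# Two hypothetical converters that disagree on one byte value.
TABLE_A = {0x5C: "\u005C"}  # maps 0x5C to REVERSE SOLIDUS
TABLE_B = {0x5C: "\u00A5"}  # maps 0x5C to YEN SIGN

def decode_7bit(data: bytes, overrides: dict) -> str:
    """Decode 7-bit bytes as ASCII, except where the table overrides a byte."""
    chars = []
    for b in data:
        if b >= 0x80:
            raise ValueError("this sketch handles 7-bit bytes only")
        chars.append(overrides.get(b, chr(b)))
    return "".join(chars)

price = b"\x5c100"
as_backslash = decode_7bit(price, TABLE_A)
as_yen = decode_7bit(price, TABLE_B)
```

The same byte sequence yields two different character sequences, so round-tripping through converters that disagree can silently change the text.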

NOTE: The term charset derives from 'character
set', an expression with a long and tortured history (see
[Connolly] for a discussion).

C020[S]
Specifications
SHOULD avoid using the terms 'character set'
and 'charset' to refer to a character encoding, except when the
latter is used to refer to the MIME charset parameter or its
IANA-registered values. The term 'character encoding',
or in specific cases the terms 'character encoding form' or 'character encoding
scheme', are RECOMMENDED.

NOTE: In XML, the XML declaration or the text declaration contains the encoding
pseudo-attribute which identifies the character
encoding using the IANA charset.

The IANA charset registry is the official list of names and
aliases for character encoding schemes on the Internet.

C021[S]
If the unique encoding
approach is not taken, specifications SHOULD mandate the use
of the IANA charset registry names, and in particular the names identified in
the registry as 'MIME preferred names', to designate character
encodings in protocols, data formats and APIs.
C022[S][I][C]
Character
encodings
that are not in the IANA registry SHOULD NOT be
used, except by private agreement.
C023[S][I][C]
If
an unregistered character encoding is used, the convention of using
'x-' at the beginning of the name MUST be
followed.
C024[I][C]
Content and software
that label text data MUST use one of the names mandated by
the appropriate specification (e.g. the XML specification when editing XML
text) and SHOULD use the MIME preferred name of a character encoding
to label data in that character encoding.
C025[I][C]
An IANA-registered
charset name MUST NOT be used to label text data in
a character encoding other than the one identified in the IANA registration of that
name.

C026[S]
If the unique encoding
approach is not chosen, specifications MUST designate at
least one of the UTF-8 and UTF-16 encoding forms of Unicode as admissible
character encodings and SHOULD choose at least one of UTF-8 or UTF-16
as mandated encoding forms (encoding forms that MUST be
supported by implementations of the specification).
C027[S]
Specifications MAY
define either UTF-8 or UTF-16 as a default encoding form (or both if they
define suitable means of distinguishing them), but they MUST
NOT use any other character encoding as a default.
C028[S]
Specifications MUST NOT
propose the use of heuristics to determine the encoding of
data.

Examples of heuristics include the use of statistical analysis of byte
(pattern) frequencies or character (pattern) frequencies. Heuristics are bad
because they will not work consistently across different implementations.
Well-defined instructions of how to unambiguously determine a character encoding,
such as those given in XML 1.0 [XML 1.0],
Appendix F,
are not considered heuristics.

C029[I]
Receiving
software MUST determine the encoding of data from available
information according to appropriate specifications.
C030[I]
When an IANA-registered charset
name is recognized, receiving software MUST interpret the
received data according to the encoding associated with the name in the IANA
registry.
C031[I]
When no charset
is provided receiving software MUST adhere to the default
character encoding(s) specified in the specification.
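A label-driven lookup with a specified default, and no guessing, might look like the following sketch. The function name `resolve_charset` is hypothetical, and the default of UTF-8 is an assumption standing in for whatever the governing specification defines. Note also that Python's codec alias table is used here as a stand-in for the IANA registry; the two are not identical.

```python
import codecs
from typing import Optional

def resolve_charset(label: Optional[str], default: str = "utf-8") -> str:
    """Map a charset label to a codec name without using heuristics."""
    if label is None or not label.strip():
        # No charset provided: fall back to the specified default.
        return codecs.lookup(default).name
    try:
        # Names and aliases resolve to a single canonical codec name.
        return codecs.lookup(label.strip()).name
    except LookupError:
        # Unknown label: report it instead of guessing from the bytes.
        raise ValueError(f"unrecognized charset label: {label!r}")
```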

C032[I]
Receiving software
MAY recognize as many character encodings and as many charset names and aliases for them as
appropriate.
A field-upgradeable mechanism may be appropriate
for this purpose. Certain character encodings are more or less associated with certain
languages (e.g. Shift_JIS with Japanese). Trying to support a given language or
set of customers may mean that certain character encodings have to be supported. However, one cannot assume universal support for a favoured but non-mandated encoding. The
character encodings that need to be supported may change over time. This document does
not give any advice on which character encoding may be appropriate or necessary for the
support of any given language.

C033[I]
Software
MUST completely implement the mechanisms for character
encoding identification and SHOULD implement them in such a
way that they are easy to use (for instance in HTTP
servers).

C034[C]
Content
MUST make use of available facilities for character encoding
identification by always indicating character encoding; where the facilities
offered for character encoding identification include defaults (e.g. in XML 1.0
[XML 1.0]), relying on such defaults is sufficient to satisfy this
identification requirement.

Because of the layered Web architecture (e.g. formats used over
protocols), there may be multiple and at times conflicting information about
character encoding. C035[S]
Specifications
MUST define conflict-resolution mechanisms (e.g. priorities)
for cases where there is multiple or conflicting information about character
encoding.
C036[I][C]
Software and content
MUST carefully follow conflict-resolution mechanisms where
there is multiple or conflicting information about character
encoding.
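A priority-based conflict-resolution rule can be sketched as follows. The function name, the source names, and the particular priority order are assumptions for illustration only; the actual order is whatever the relevant specification defines.

```python
from typing import Optional, Sequence, Tuple

def choose_encoding(sources: Sequence[Tuple[str, Optional[str]]],
                    default: str) -> str:
    """Pick the label from the highest-priority source that supplies one.

    `sources` is ordered from highest to lowest priority; each item is
    (source name, charset label or None).
    """
    for _source_name, label in sources:
        if label:
            return label
    return default

# Hypothetical ordering: protocol-level information wins over the
# in-document declaration.
chosen = choose_encoding(
    [("protocol header", "iso-8859-1"),
     ("in-document declaration", "utf-8")],
    default="utf-8",
)
```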

4.5 Private use code points

Certain ranges of Unicode code points are designated for private use:
the Private Use Area (U+E000-F8FF) and planes 15 and 16 (U+F0000-FFFFD and
U+100000-10FFFD). These code points are guaranteed to never be allocated to
standard characters, and are available for use by private agreement between a
producer and a
recipient. However, private agreements do not
scale on the Web. Code points from different private agreements may collide. Also, a private agreement, and therefore the meaning of the code points, can
quickly become lost.

NOTE: A typical exception would be the use of the PUA to design
and test the encoding of not yet encoded (e.g. historic or rare)
scripts.
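The private use ranges listed above are easy to test for. The following sketch (function names are hypothetical) reports where private use code points occur in a string, which can help a tool warn about text whose meaning depends on a private agreement.

```python
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),       # Private Use Area in the BMP
    (0xF0000, 0xFFFFD),     # plane 15
    (0x100000, 0x10FFFD),   # plane 16
)

def is_private_use(cp: int) -> bool:
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)

def find_private_use(text: str):
    """Return (character index, code point label) for each private use character."""
    return [(i, f"U+{ord(ch):04X}")
            for i, ch in enumerate(text)
            if is_private_use(ord(ch))]
```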

C037[S]
Specifications MUST
NOT define any assignments of private use code
points.
C038[S]
Conformance to a
specification MUST NOT require the use of private use area
characters.
C039[S]
Specifications MUST
NOT require the use of mechanisms for agreement on the use of private
use code points.
C040[S][I]
Specifications and
implementations SHOULD NOT disallow the use of private use code points by private
agreement.
As an example, XML does not disallow the use of
private use code points.

C041[S]
Specifications
MAY define markup to
allow the transmission of symbols not in Unicode or to identify specific
variants of Unicode characters.

C068[S] Specifications SHOULD allow the inclusion of or reference to pictures and graphics where appropriate, to eliminate the need to (mis)use character-oriented mechanisms for pictures or graphics.

C069[C] Content
SHOULD NOT misuse character technology for pictures or graphics.

4.6 Character Escaping

Markup languages or programming languages often
designate certain characters as syntax-significant, giving them
specific functions within the language (e.g. '<' and
'&' serve as markup delimiters in HTML and XML). As a
consequence, these syntax-significant characters cannot be used to represent
themselves in text in the same way as all other characters do, creating the
need for a mechanism to "escape" their syntax-significance. There is also a need, often satisfied by the same or similar
mechanisms, to express characters not directly representable in the
character encoding chosen for a particular document or program (an
instance of the markup or programming language).

Formally, a character
escape is a syntactic device defined in a markup or programming language
that allows one or more of:

expressing syntax-significant characters while disregarding
their significance in the syntax of the language, or

expressing characters not representable in the character
encoding chosen for an instance of the language, or

expressing characters in general, without use of the
corresponding character codes.

Escaping a character means expressing it using such a construct,
appropriate to the format or protocol in which the character appears;
expanding a character escape (or unescaping) means
replacing it with the character that it represents.

EXAMPLE: HTML and XML define 'Numeric Character References' which allow
both the escaping of syntax-significance and the expression of arbitrary Unicode characters. Expressed
as &#x3C; or &#60; the character '<' will not be parsed as
a markup delimiter.

EXAMPLE: The programming language Java uses '"' to delimit strings.
To express '"' within a string, one may escape it as '\"'.

EXAMPLE: XML defines 'CDATA sections' which allow escaping the
syntax-significance of all characters between the CDATA section delimiters. CDATA sections
do not allow the expression of unrepresentable characters and in fact prevent their
expression using numeric character references.
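Escaping and expanding can be tried out with Python's standard library. This is only an illustration of the concepts above using `xml.sax.saxutils.escape` and `html.unescape`; the sample strings are arbitrary.

```python
import html
from xml.sax.saxutils import escape

# Escaping: syntax-significant characters are replaced by character escapes.
escaped = escape("a < b & c")

# Expanding: hexadecimal and decimal numeric character references
# both denote the same character.
from_hex = html.unescape("&#x3C;")
from_decimal = html.unescape("&#60;")

# A numeric character reference can also express a character that the
# document's encoding cannot represent directly.
not_equal = html.unescape("&#x2260;")
```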

The following guidelines apply to the way specifications define character
escapes.

C042[S]
Specifications
MUST NOT invent a new escaping mechanism if an appropriate
one already exists.

C043[S]
The number of different
ways to escape a character SHOULD be minimized (ideally to
one).
[A well-known counter-example is that for historical
reasons, both HTML and XML have redundant decimal (&#ddddd;) and
hexadecimal (&#xhhhh;) character escapes.]

C044[S]
Escape syntax
SHOULD either require explicit end delimiters or mandate a
fixed number of characters in each character escape. Escape syntaxes where the
end is determined by any character outside the set of characters admissible in
the character escape itself SHOULD be
avoided.
These character escapes are not clear visually, and
can cause an editor to insert spurious line-breaks when word-wrapping on
spaces. Forms like SPREAD's &UABCD; [SPREAD] or XML's
&#xhhhh;, where the character escape is explicitly terminated by a
semicolon, are much better.

C045[S]
Whenever specifications
define character escapes that allow the representation of characters using a
number, the number MUST represent the Unicode code point
of the character and SHOULD be in hexadecimal
notation.

C046[S]
Escaped characters
SHOULD be acceptable wherever their unescaped forms are; this does not preclude
that syntax-significant
characters, when escaped, lose their
significance in the syntax. In particular, if a character is
acceptable
in identifiers and comments, then its escaped form should also be
acceptable.

The following guidelines apply to content developers, as well as to
software that generates content:

C047[I][C]
Escapes
SHOULD be avoided when the characters to be expressed are
representable in the character encoding of the document.

C048[I][C]
Since
character set standards usually list character numbers as hexadecimal, content
SHOULD use the hexadecimal form of character escapes when
there is one.

C049[I][C]
The character encoding of a document
SHOULD be chosen so that it maximizes the opportunity to directly
represent characters and minimizes the need to represent characters by
markup means such as character
escapes.

NOTE: Due to Unicode's large repertoire and wide base of
support, a character encoding based on Unicode
is a good choice to encode a
document.
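The guidelines for content can be combined into one small routine: represent characters directly whenever the document encoding allows it, and fall back to a hexadecimal escape only otherwise. The sketch below assumes XML-style numeric character references and a stateless, ASCII-compatible target encoding without a byte order mark; the function name is hypothetical, and escaping of syntax-significant characters such as '<' is deliberately left out.

```python
def encode_with_hex_escapes(text: str, encoding: str) -> bytes:
    """Encode text, escaping only characters the encoding cannot represent."""
    out = bytearray()
    for ch in text:
        try:
            # Representable characters are written directly.
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            # Otherwise use the hexadecimal escape with the Unicode code point.
            out += f"&#x{ord(ch):X};".encode(encoding)
    return bytes(out)
```

With a Unicode-based encoding such as UTF-8 the fallback branch is never taken, which is the situation the NOTE above recommends.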

5 Compatibility and Formatting
Characters

This specification does not address the suitability of particular
characters for use in markup languages,
in particular formatting characters and compatibility equivalents. For detailed
recommendations about the use of compatibility and formatting characters, see
Unicode in XML and other Markup Languages [UXML].

C050[S]
Specifications
SHOULD exclude compatibility characters in the syntactic
elements (markup, delimiters, identifiers) of the formats they
define.

6 Strings

6.1 String concepts

Various specifications use the notion of a 'string',
sometimes without defining precisely what is meant and sometimes defining it
differently from other specifications. The reason for this variability is that
there are in fact multiple reasonable definitions for a string, depending on
one's intended use of the notion; the term 'string' is used for
all these different notions because these are actually just different views of
the same reality: a piece of text stored inside a computer.

Byte string: A string viewed as a
sequence of bytes representing characters in a particular character encoding. This
corresponds to a CES. As a definition for a
string, this definition is most often useless, except when the textual nature
is unimportant and the string is considered only as a piece of opaque data with
a length in bytes. C011[S]
Specifications in
general SHOULD NOT define a string as a 'byte
string'.

Code unit string: A string
viewed as a sequence of code units
representing characters in a particular character encoding. This corresponds to a
CEF. A definition of a code unit string needs to include the size of the code units (e.g. 16 bits) and the character encoding used (e.g. UTF-16). Code unit strings are useful in APIs that
expose a physical representation of string data. Example: For the DOM
[DOM Level 1], UTF-16 was chosen based on widespread implementation
practice.

Character string: A string
viewed as a sequence of characters, each represented by a
code point in Unicode [Unicode].
This is usually what programmers consider to be a string, although it may not
match exactly what most users perceive as characters. This is the highest layer
of abstraction that ensures interoperability with very low implementation
effort. C012[S]
The 'character
string' definition of a string is generally the most useful and
SHOULD be used by most specifications, following the
examples of Production [2] of XML 1.0 [XML 1.0], the SGML
declaration of HTML 4.0 [HTML 4.01], and the character model of RFC
2070 [RFC 2070].

EXAMPLE: Consider the string comprising the characters U+233B4 (a Chinese character meaning 'stump
of tree'), U+2260 NOT EQUAL TO, U+0071
LATIN SMALL LETTER Q and U+030C COMBINING CARON,
encoded in UTF-16 in big-endian byte order. The rows of the following table show the
string viewed as a character string, code unit string and byte string, respectively:

Glyphs: 𣎴 | ≠ | q̌
Character string: U+233B4 | U+2260 | U+0071 | U+030C
Code unit string: D84C DFB4 | 2260 | 0071 | 030C
Byte string: D8 4C DF B4 | 22 60 | 00 71 | 03 0C
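The three views in this example can be reproduced programmatically. The sketch below relies on the fact that a Python `str` is a sequence of Unicode code points; the variable names are chosen for the illustration.

```python
import struct

text = "\U000233B4\u2260q\u030C"

# Character string: one entry per Unicode code point.
character_string = [f"U+{ord(ch):04X}" for ch in text]

# Byte string: the UTF-16 encoding in big-endian byte order.
byte_string = text.encode("utf-16-be")

# Code unit string: the same data read as 16-bit big-endian units.
code_unit_string = struct.unpack(f">{len(byte_string) // 2}H", byte_string)
```

The same text therefore has a length of 4 characters, 5 code units or 10 bytes, depending on the view chosen.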

NOTE: It is also possible to view a string as a sequence of
grapheme clusters. Grapheme clusters divide the text into units that
correspond more closely than character strings to the user's perception of where character boundaries occur in a
visually rendered text. A discussion of grapheme clusters is given at the end of Section 2.10 of the Unicode Standard, Version 4
[Unicode 4.0]; a formal definition is given in Unicode Standard Annex #29 [UTR #29]. The Unicode Standard defines default grapheme clustering. Some languages require tailoring to this default. For example, a Slovak user might wish to treat the default pair of grapheme clusters "ch" as a single grapheme cluster. Note that the interaction between the language of string content and the end-user's preferences may be complex.
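To give a feel for the difference between characters and grapheme clusters, the following sketch attaches combining marks to the preceding character. It is a deliberate simplification and an assumption of this example: it does not implement the default grapheme cluster rules of Unicode Standard Annex #29 (it ignores, for instance, Hangul syllable sequences and CR LF pairs) and performs no tailoring.

```python
import unicodedata

def approximate_clusters(text: str):
    """Group each combining mark with the character before it (simplified)."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            # Non-zero combining class: extend the current cluster.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

clusters = approximate_clusters("\U000233B4\u2260q\u030C")
```

For the four-character string used in the example above, this yields three units, because U+0071 and U+030C are perceived together as one unit.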

6.2 String indexing

There are many situations where a software process needs to access a
substring or to point within a string and does so by the use of
indices, i.e. numeric "positions" within a string.
Where such indices are exchanged between components of the Web, there is a need
for an agreed-upon definition of string indexing in order to ensure consistent
behavior. The requirements for string indexing are discussed in
Requirements for String Identity Matching [CharReq],
section 4. The two
main questions that arise are: "What is the unit of counting?" and
"Do we start counting at 0 or 1?".

C052[S][I]
A
code unit string MAY be used as a basis for string indexing if this results
in a significant improvement in the efficiency of internal operations when
compared to the use of character
string.
(Example: the use of UTF-16 in
[DOM Level 1]).

C071[S][I]
Grapheme clusters MAY be used as a basis for string indexing in applications where user interaction is the primary concern.
See Unicode Standard Annex #29, Text Boundaries [UTR #29]. C074[S] Specifications that define indexing in terms of grapheme clusters MUST either: a) define grapheme clusters in terms of default grapheme clusters as defined in Unicode Standard Annex #29, Text Boundaries [UTR #29], or b) define specifically how tailoring is applied to the indexing operation.

It is noteworthy that there exist other, non-numeric ways of
identifying substrings which have favorable properties. For instance,
substrings based on string matching are quite robust against small edits;
substrings based on document structure (in structured formats such as XML) are
even more robust against edits and even against translation of a document from
one human language to another.
C053[S]
Specifications that need a way to identify
substrings or point within a string SHOULD provide ways
other than string indexing to perform this operation.
C054[I][C]
Users of
specifications (software developers, content developers)
SHOULD whenever possible prefer ways other than string
indexing to identify substrings or point within a string.

Experience shows that more general, flexible and robust specifications
result when individual characters are understood and processed as substrings,
identified by a position before and a position after the substring.
Understanding indices as boundary positions between the counting
units also makes it easier to relate the indices resulting from the different
string definitions. C055[S]
Specifications
SHOULD understand and process single characters as
substrings, and treat indices as boundary positions between
counting units, regardless of the choice of counting
units.

C056[S]
Specifications of APIs
SHOULD NOT specify single character or single encoding-unit
arguments.

EXAMPLE: uppercase('ß') cannot return the proper result (the two-character string
'SS') if the return type of the uppercase
function is defined to be a single character.
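A string-in, string-out signature avoids this problem. The wrapper below is a hypothetical illustration built on Python's `str.upper`, which performs full case mapping.

```python
def uppercase(text: str) -> str:
    """Accept and return strings, so one character may map to several."""
    return text.upper()

result = uppercase("\u00DF")  # U+00DF LATIN SMALL LETTER SHARP S
```

The result is a two-character string, which a single-character return type could not hold.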

The issue of index origin, i.e. whether we count from 0 or 1, actually
arises only after a decision has been made on whether it is the units
themselves that are counted or the positions between the units.
C057[S]
When the positions between the units are
counted for string indexing, starting with an index of 0 for the position at
the start of the string is the RECOMMENDED solution, with
the last index then being equal to the number of counting units in the
string.
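Treating indices as boundary positions can be sketched as follows. The helper names are hypothetical. Python slicing already follows this convention for character strings: position 0 is the start of the string and the last position equals the number of characters.

```python
def substring(text: str, start: int, end: int) -> str:
    """A single character is just the substring between two adjacent boundaries."""
    return text[start:end]

def code_unit_boundary(text: str, char_boundary: int) -> int:
    """Convert a boundary counted in characters to one counted in UTF-16 code units."""
    return len(text[:char_boundary].encode("utf-16-be")) // 2

text = "\U000233B4\u2260q\u030C"
second_character = substring(text, 1, 2)
last_boundary = len(text)
```

Because both index systems count boundaries rather than units, a position in one maps cleanly to a position in the other, even when a character occupies two code units.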

7 Character Encoding in URI References

According to the definition in RFC 2396 [RFC 2396], URI
references are restricted to a subset of US-ASCII, with an escaping mechanism
to encode arbitrary byte values, using the %HH convention. However, the %HH
convention by itself is of limited use because there is no definitive mapping
from characters to bytes. Also, non-ASCII characters cannot be used directly.
Internationalized Resource Identifiers (IRIs) [I-D IRI] solve both problems with a uniform approach that
conforms to the Reference Processing
Model.

NOTE: Document formats should allow IRIs to be used; handlers for protocols
that do not currently support IRIs can convert the IRI to a URI when
the IRI is dereferenced.

C060[S]
Specifications that define
new syntax for URIs, such as a new URI scheme or a new kind of fragment
identifier, MUST specify that characters outside the
US-ASCII repertoire are encoded using UTF-8 and %HH-escaping.
This is in accordance
with Guidelines for new URL Schemes [RFC 2718], Section 2.2.5.
C061[S]
Such specifications SHOULD also define the
normalization requirements for the syntax they introduce.
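The combination of UTF-8 and %HH-escaping can be illustrated with Python's `urllib.parse`, whose `quote` and `unquote` functions use UTF-8 by default. The sample string is arbitrary, and the sketch does not address the normalization requirements mentioned above.

```python
from urllib.parse import quote, unquote

segment = "r\u00E9sum\u00E9"   # contains U+00E9, outside US-ASCII

# Each non-ASCII character is encoded in UTF-8, then every resulting
# byte is written as %HH.
uri_segment = quote(segment)

# The receiver reverses both steps to recover the characters.
round_tripped = unquote(uri_segment)
```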

8 Referencing the Unicode Standard and
ISO/IEC 10646

Specifications often need to make references to
the Unicode Standard or
International Standard ISO/IEC 10646. Such references must be made with care,
especially when normative. The questions to be considered are:

Which standard should be referenced?

How to reference a particular version?

When to use versioned vs. unversioned references?

ISO/IEC 10646 is developed and published jointly by
ISO (the International
Organization for Standardization)
and
IEC (the International
Electrotechnical Commission). The Unicode Standard is developed and published
by the
Unicode Consortium, an
organization of major computer corporations, software producers, database
vendors, national governments, research institutions, international agencies,
various user groups, and interested individuals. The Unicode Standard is
comparable in standing to W3C Recommendations.

ISO/IEC 10646 and the Unicode Standard define exactly the same
CCS (same repertoire, same code
points) and encoding forms. They are actively maintained in synchrony
by liaisons and overlapping membership between the respective technical
committees. In addition to the jointly defined CCS and encoding forms, the Unicode Standard adds normative and informative lists of character properties,
normative character equivalence and normalization specifications, a normative
algorithm for bidirectional text and a large amount of useful implementation
information. In short, the Unicode Standard adds
semantics to the characters that ISO/IEC 10646 merely enumerates. Conformance
to the Unicode Standard implies conformance to ISO/IEC
10646, see [Unicode 4.0] Appendix C.

C062[S]
Since specifications in general
need both a definition for their characters and the semantics associated with
these characters, specifications SHOULD include a reference
to the Unicode Standard, whether or not they include a
reference to ISO/IEC 10646.
By providing a reference to the Unicode Standard implementers can benefit from the wealth of information
provided in the standard and on the Unicode Consortium Web site.

The fact that both ISO/IEC 10646 and the Unicode Standard are evolving (in
synchrony) raises the issue of versioning: should a specification refer to a
specific version of the standard, or should it make a generic reference, so
that the normative reference is to the version current at the time of
reading the specification? In general the answer is
both. C063[S]
A generic reference to
the Unicode Standard MUST be made if
it is desired that characters allocated after a specification is published are
usable with that specification. A specific reference to
the Unicode Standard MAY be included
to ensure that functionality depending on a particular version is available and
will not change over time.
An example would be the set of characters acceptable
as Name characters in XML 1.0 [XML 1.0], which is an enumerated
list that parsers must implement to validate names.

A generic reference can be formulated in two ways:

By explicitly including a generic entry in the
bibliography section of a specification and simply referring to that entry in
the body of the specification. Such a generic entry contains text such as
"... as it may from time to time be revised or amended".

By including a specific entry in the bibliography
and adding text such as "... as it may from time to time be revised or
amended" at the point of reference in the body of the specification.

It is an editorial matter, best left to each specification, which of
these two formulations is used. Examples of the first formulation can be found
in the bibliography of this specification (see the entries for
[ISO/IEC 10646] and [Unicode]). Examples of the latter,
as well as a discussion of the versioning issue with respect to MIME
charset parameters for UCS encodings, can be found in
[RFC 3629] and [RFC 2781].

C064[S]
All generic
references to the Unicode Standard [Unicode] MUST refer to the latest version of the Unicode Standard available at the date of publication of the containing specification.
C065[S]
Generic references to ISO/IEC 10646
[ISO/IEC 10646] MUST be written such that they make
allowance for the future publication of additional parts of the
standard. When referring to Part 1, they MUST refer to
ISO/IEC 10646-1:2000 [ISO/IEC 10646-1:2000] or later, including any
amendments.

The Unicode Consortium,
The Unicode Standard, Version 4, ISBN 0-321-18578-1, as
updated from time to time by the publication of new versions. (See
http://www.unicode.org/unicode/standard/versions
for the latest version and additional information on versions of the standard
and of the Unicode Character Database).

ISO/IEC 646:1991, Information technology -- ISO 7-bit coded character set for information interchange. This standard defines an International Reference Version (IRV) which corresponds exactly to what is widely known as ASCII or US-ASCII. ISO/IEC 646 was based on the earlier standard ECMA-6. ECMA has maintained its standard up to date with respect to ISO/IEC 646 and makes an electronic copy available at http://www.ecma-international.org/publications/standards/Ecma-006.htm

B Examples of Characters, Keystrokes and Glyphs (Non-Normative)

A few examples will help make sense of all this complexity
of text in computers (which is mostly a reflection of the complexity of human
writing systems). Let us start with a very simple example: a user, equipped
with a US-English keyboard, types "Foo", which the computer
encodes as 16-bit values (the UTF-16 encoding of Unicode) and displays on the
screen.

Keystrokes: Shift-f, o, o
Input characters: F, o, o
Encoded characters (byte values in hex): 0046, 006F, 006F
Display: Foo

Example: Basic Latin

The only complexity here is the use of a modifier (Shift) to input the
capital 'F'.
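As a minimal sketch of the encoding step in this example, the following Python snippet lists the UTF-16 code units for "Foo". The helper name utf16_code_units is an assumption introduced here for illustration; it is not defined by this specification.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as 4-digit uppercase hex strings."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    # Each code unit is two bytes; slice the byte string pairwise.
    return [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]


print(utf16_code_units("Foo"))  # ['0046', '006F', '006F']
```

Each of the three input characters maps to exactly one 16-bit code unit, matching the encoded values listed in the example.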

A slightly more complex example is a user typing 'çé' on
a traditional French-Canadian keyboard, which the computer again encodes in
UTF-16 and displays. We assume that this particular computer uses a fully
composed form of UTF-16.

Keystrokes: ¸, c, é
Input characters: ç, é
Encoded characters (byte values in hex): 00E7, 00E9
Display: çé

Example: Latin with diacritics

A few interesting things are happening here: when the user types the
cedilla ('¸'), nothing happens except for a change of state of the
keyboard driver; the cedilla is a dead key. When the driver gets
the c keystroke, it provides a complete 'ç' character to the
system, which represents it as a single 16-bit code
unit and displays a 'ç'
glyph. The user then presses the dedicated
'é' key, which results in, again, a character represented by two
bytes. Most systems will display this as one glyph, but it is also possible to
combine two glyphs (the base letter and the accent) to obtain the same
rendering.
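The difference between the fully composed form assumed in this example and a decomposed form (base letter plus combining accent) can be sketched with Python's standard unicodedata module. This is only an illustration; the variable names are assumptions introduced here.

```python
import unicodedata

composed = "\u00E7\u00E9"  # 'çé' as two precomposed characters
# NFD splits each precomposed letter into a base letter and a combining mark.
decomposed = unicodedata.normalize("NFD", composed)

print([f"{ord(c):04X}" for c in composed])    # ['00E7', '00E9']
print([f"{ord(c):04X}" for c in decomposed])  # ['0063', '0327', '0065', '0301']

# Recomposing with NFC gives back the original two characters.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Both sequences would normally be rendered identically, which is why a display may use either one glyph or a combination of two glyphs for each letter.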

On to a Japanese example: our user employs a romaji input
method to type '日本語' (U+65E5, U+672C, U+8A9E), which the computer encodes in UTF-16 and
displays.

Keystrokes: n i h o n g o <space> <return>
Input characters: 日, 本, 語
Encoded characters (byte values in hex): 65E5, 672C, 8A9E
Display: 日本語

Example: Japanese

The interesting aspect here is input: the user types Latin characters,
which are converted on the fly to kana (not shown here), and then to kanji when
the user requests conversion by pressing <space>; the kanji characters
are finally sent to the application when the user presses <return>. The
user has to type a total of nine keystrokes before the three characters are
produced, which are then encoded and displayed rather trivially.
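The input sequence described above can be imitated with a toy input method. The sketch below is deliberately simplified: the two lookup tables are hypothetical and cover only this one word, whereas a real input method relies on large dictionaries and offers several conversion candidates. All function and variable names are assumptions made for this illustration.

```python
# Hypothetical tables covering only the word used in this example.
ROMAJI_TO_KANA = {"ni": "に", "ho": "ほ", "n": "ん", "go": "ご"}
KANA_TO_KANJI = {"にほんご": "日本語"}


def romaji_to_kana(latin):
    """Greedily convert Latin letters to kana, trying the longest match first."""
    kana, i = "", 0
    while i < len(latin):
        for length in (2, 1):
            chunk = latin[i:i + length]
            if chunk in ROMAJI_TO_KANA:
                kana += ROMAJI_TO_KANA[chunk]
                i += length
                break
        else:
            raise ValueError(f"no kana for {latin[i:]!r}")
    return kana


def run_input_method(keystrokes):
    pending = ""    # Latin letters typed so far
    candidate = ""  # converted text awaiting confirmation
    committed = ""  # text delivered to the application
    for key in keystrokes:
        if key == "<space>":       # request kana-to-kanji conversion
            kana = romaji_to_kana(pending)
            candidate = KANA_TO_KANJI.get(kana, kana)
            pending = ""
        elif key == "<return>":    # confirm and send to the application
            committed += candidate
            candidate = ""
        else:
            pending += key
    return committed


keys = "n i h o n g o <space> <return>".split()
text = run_input_method(keys)
print(len(keys), text)                   # 9 日本語
print([f"{ord(c):04X}" for c in text])   # ['65E5', '672C', '8A9E']
```

Nothing reaches the application until the final keystroke, which is the point of the example: nine keystrokes yield three characters.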

An Arabic example will show different phenomena:

Keystrokes: (key images not reproduced here)
Input characters: ل ا ل ا غ غ
Encoded characters (byte values in hex): 0644, 0627, 0644, 0627, 0639, 0639
Display: (rendered glyphs not reproduced here)

Example: Arabic

Here the first two keystrokes each produce an input character and an
encoded character, but the pair is displayed as a single glyph
(a lam-alef ligature). The next
keystroke is a lam-alef, which some Arabic keyboards have; it produces the same
two characters which are displayed similarly, but this second lam-alef is
placed to the left of the first one when displayed. The last two
keystrokes produce two identical characters which are rendered by two different
glyphs (a medial form followed to its left by a final form). We thus have 5
keystrokes producing 6 characters and 4 glyphs laid out right-to-left.
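The keystroke and character counts in this example can be checked with a small sketch. The key names and the KEYMAP table are hypothetical, introduced only for illustration; the code points are the encoded values listed in the example. Glyph shaping and ligature formation happen at rendering time and are not modelled here.

```python
import unicodedata

# Hypothetical key names; each maps to the character(s) that key produces.
KEYMAP = {
    "lam": "\u0644",
    "alef": "\u0627",
    "lam-alef": "\u0644\u0627",  # a single key that yields two characters
    "u0639": "\u0639",           # the letter key pressed twice at the end
}

keystrokes = ["lam", "alef", "lam-alef", "u0639", "u0639"]
text = "".join(KEYMAP[key] for key in keystrokes)

print(len(keystrokes), len(text))        # 5 6
print([f"{ord(c):04X}" for c in text])
# ['0644', '0627', '0644', '0627', '0639', '0639']

# Every character has bidirectional class AL (Arabic letter), so the run
# is laid out right-to-left.
print({unicodedata.bidirectional(c) for c in text})  # {'AL'}
```

The characters are stored in the order typed (logical order); the right-to-left arrangement is applied only when the text is displayed.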

A final example in Tamil, typed with an ISCII
keyboard, will illustrate some additional phenomena:

Keystrokes: (key images not reproduced here)
Input characters: ட ா ங ் க ோ
Encoded characters (byte values in hex): 0B9F, 0BBE, 0B99, 0BCD, 0B95, 0BCB
Display: (rendered glyphs not reproduced here)

Example: Tamil

Here input is straightforward, but note that contrary to the preceding
accented Latin example, the virama diacritic ' ்' (U+0BCD) is entered
after the 'ங' (U+0B99) to which it applies. Rendering
is interesting for the last two characters. The last one ' ோ' (U+0BCB) clearly consists of two glyphs which surround
the glyph of the next to last character 'க' (U+0B95).
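The two observations above can be illustrated with Python's standard unicodedata module. This is a sketch only; the variable names are assumptions introduced here. It relies on the canonical decomposition that the Unicode Character Database records for U+0BCB.

```python
import unicodedata

text = "\u0B9F\u0BBE\u0B99\u0BCD\u0B95\u0BCB"  # the six encoded characters
virama = "\u0BCD"

# The virama is a nonspacing combining mark stored after its base letter.
print(unicodedata.category(virama), unicodedata.combining(virama))  # Mn 9
print(text.index(virama) - text.index("\u0B99"))                    # 1

# U+0BCB decomposes canonically into two vowel signs; fonts typically draw
# the first to the left and the second to the right of the consonant.
parts = unicodedata.normalize("NFD", "\u0BCB")
print([f"U+{ord(c):04X}" for c in parts])  # ['U+0BC7', 'U+0BBE']
```

The stored order is therefore consonant first, vowel sign second, even though part of the vowel sign appears before the consonant on screen.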

C Example text (Non-Normative)

The following are textual versions of strings or characters used in image-based examples in this document. They are provided here for the benefit of those who want to cut and paste the text for their own testing.

D Acknowledgements (Non-Normative)

Tim Berners-Lee and James Clark provided important details in the section on URIs.
Asmus Freytag, Addison Phillips, and in early stages Ian Jacobs, provided significant help in the authoring and editing process. The W3C I18N WG and IG, as well as others, provided many comments and
suggestions.