In this chapter, we discuss how HTML documents are represented on a
computer and over the Internet.

The section on the document character
set addresses the issue of what abstract characters may be
part of an HTML document. Characters include the Latin letter "A", the
Cyrillic letter "I", the Chinese character meaning "water", etc.

The section on character encodings
addresses the issue of how those characters may be
represented in a file or when transferred over the
Internet. As some character encodings cannot directly represent all
characters an author may want to include in a document, HTML offers
other mechanisms, called character references,
for referring to any character.

Since there are a great number of characters throughout human
languages, and a great variety of ways to represent those characters,
proper care must be taken so that documents may be understood
by user agents around the world.

To promote interoperability, SGML requires that each application
(including HTML) specify its document character
set. A document character set consists of:

A Repertoire: A set of
abstract characters, such as the Latin
letter "A", the Cyrillic letter "I", the Chinese character meaning
"water", etc.

Code positions: A set of integer references to
characters in the repertoire.

Each SGML document (including each HTML document) is a sequence of
characters from the repertoire. Computer systems identify each
character by its code position; for example, in the ASCII character
set, code positions 65, 66, and 67 refer to the characters 'A', 'B',
and 'C', respectively.
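
As a minimal sketch of the relationship between characters and code positions, the following Python fragment uses the built-in ord() and chr() functions. It assumes that Python's string type, which is indexed by Unicode code points, is an acceptable stand-in for the document character set; the variable names are illustrative only.

```python
# ord() returns the code position of a character; chr() does the reverse.
ascii_positions = [ord(ch) for ch in "ABC"]

# Code positions outside ASCII work the same way.
cyrillic_i = chr(1048)   # CYRILLIC CAPITAL LETTER I
water = chr(0x6C34)      # the Chinese character meaning "water"

for ch in ("A", cyrillic_i, water):
    # Print each character with its code position in decimal and hexadecimal.
    print(ch, ord(ch), hex(ord(ch)))
```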

The ASCII character set is not sufficient for a global information
system such as the Web, so HTML uses the much more complete character
set called the Universal Character Set (UCS), defined in
[ISO10646]. This standard defines a
repertoire of thousands of characters used by communities all over the
world.

The character set defined in [ISO10646]
is character-by-character equivalent
to Unicode 2.0 ([UNICODE]). Both of these standards are updated
from time to time with new characters, and the amendments should be
consulted at the respective Web sites. In the current specification,
references to ISO/IEC-10646 or Unicode imply the same document
character set. However, the HTML specification also refers to the
Unicode specification for other issues such as the bidirectional text
algorithm.

The document character set, however, does not suffice to allow user
agents to correctly interpret HTML documents as they are typically
exchanged -- encoded as a sequence of bytes in a file or during a
network transmission. User agents must also know the specific character encoding that was used to transform
the document character stream into a byte stream.

What this specification calls a character encoding is
known by different names in other specifications (which may cause some
confusion). However, the concept is largely the same across the
Internet. Also, protocol headers, attributes, and parameters referring
to character encodings share the same name -- "charset" -- and use the
same values from the [IANA] registry (see [CHARSETS]
for a complete list).

The "charset" parameter identifies a character encoding, which is a
method of converting a sequence of bytes into a sequence of
characters. This conversion fits naturally with the scheme of Web
activity: servers send HTML documents to user agents as a stream of
bytes; user agents interpret them as a sequence of characters. The
conversion method can range from simple one-to-one correspondence to
complex switching schemes or algorithms.

A simple one-byte-per-character encoding technique is not sufficient
for text strings over a character repertoire as large as [ISO10646].
There are several different encodings of parts of [ISO10646]
in addition to encodings of the entire character set (such as UCS-4).
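
The difference between a character and its byte representation can be sketched in Python as follows. This is an illustration only: the codec names are those used by Python's standard library, and "utf-32-be" is used here as an assumed stand-in for UCS-4.

```python
# One abstract character: LATIN SMALL LETTER A WITH RING ABOVE (code position 229).
text = "\u00e5"

latin1_bytes = text.encode("iso-8859-1")  # one byte per character
utf8_bytes = text.encode("utf-8")         # variable number of bytes per character
ucs4_bytes = text.encode("utf-32-be")     # four bytes per character

# Different byte sequences, same character once decoded with the right encoding.
decoded = {
    latin1_bytes.decode("iso-8859-1"),
    utf8_bytes.decode("utf-8"),
    ucs4_bytes.decode("utf-32-be"),
}
print(latin1_bytes, utf8_bytes, ucs4_bytes, decoded)
```

Decoding the same bytes with the wrong encoding yields different characters, which is why the encoding must be known to the user agent.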

Authoring tools (e.g., text editors) may encode HTML documents in
the character encoding of their choice, and the choice largely depends
on the conventions used by the system software. These tools may
employ any convenient encoding that covers most of the characters
contained in the document, provided the encoding is correctly labeled. Occasional
characters that fall outside this encoding may still be represented by
character references. These always refer to
the document character set, not the character encoding.
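
A possible way for an authoring tool to fall back on character references is sketched below. It relies on the "xmlcharrefreplace" error handler of Python's str.encode(), which emits decimal numeric character references for characters the target encoding cannot represent. The function name is hypothetical, and the sketch assumes the text is ordinary character data rather than markup.

```python
def encode_with_references(text: str, encoding: str) -> bytes:
    """Encode text, replacing unrepresentable characters with &#D; references."""
    return text.encode(encoding, errors="xmlcharrefreplace")


# "a with ring" exists in ISO-8859-1; the Cyrillic capital letter I does not.
encoded = encode_with_references("\u00e5 \u0418", "iso-8859-1")
print(encoded)
```

Note that the reference uses the code position from the document character set (1048), not a value specific to ISO-8859-1.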

Servers and proxies may change a character encoding (called
transcoding) on the fly to meet the requests of user agents
(see section 14.2 of [RFC2068],
the "Accept-Charset" HTTP request header). Servers and proxies do not
have to serve a document in a character encoding that covers the
entire document character set.
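
Transcoding can be sketched as a decode step followed by an encode step. The function below is a hypothetical illustration, not a description of how any particular server or proxy works; it assumes the input is character data and substitutes numeric character references for characters the target encoding lacks.

```python
def transcode(data: bytes, source_charset: str, target_charset: str) -> bytes:
    """Re-encode data from one character encoding to another."""
    text = data.decode(source_charset)
    # Characters missing from the target encoding become &#D; references.
    return text.encode(target_charset, errors="xmlcharrefreplace")


cyrillic_source = "\u0418".encode("iso-8859-5")
as_utf8 = transcode(cyrillic_source, "iso-8859-5", "utf-8")
as_latin1 = transcode(cyrillic_source, "iso-8859-5", "iso-8859-1")
print(as_utf8, as_latin1)
```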

Commonly used character encodings on the Web include
ISO-8859-1 (also referred to as "Latin-1"; usable for most Western
European languages), ISO-8859-5 (which supports Cyrillic), SHIFT_JIS
(a Japanese encoding), EUC-JP (another Japanese encoding), and UTF-8
(an encoding of ISO 10646 using a different number of bytes for
different characters). Names for character encodings are
case-insensitive, so that for example "SHIFT_JIS", "Shift_JIS", and
"shift_jis" are equivalent.

This specification does not mandate which character encodings
a user agent must support.

Conforming user agents must
correctly map to Unicode all characters in any character encodings
that they recognize (or they must behave as if they did).
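
The following Python sketch illustrates two points about encoding names and Unicode mapping: names compare case-insensitively, and text in different recognized encodings maps to the same Unicode characters. It uses codecs.lookup() from the standard library; which encodings are available is a property of the Python installation, not of this specification.

```python
import codecs

# The three spellings resolve to the same codec.
names = [codecs.lookup(n).name for n in ("SHIFT_JIS", "Shift_JIS", "shift_jis")]

# The same character has different byte representations in two Japanese encodings...
water = "\u6c34"
sjis_bytes = water.encode("shift_jis")
eucjp_bytes = water.encode("euc_jp")

# ...but both map back to the same Unicode character.
from_sjis = sjis_bytes.decode("Shift_JIS")
from_eucjp = eucjp_bytes.decode("EUC-JP")
print(names, sjis_bytes, eucjp_bytes, from_sjis == from_eucjp)
```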

Notes on specific encodings

When HTML text is transmitted in UTF-16
(charset=UTF-16), text data should be transmitted in network byte
order ("big-endian", high-order byte first) in accordance with [ISO10646], Section 6.3 and [UNICODE],
clause C3, page 3-1.

Furthermore, to maximize chances of proper interpretation, it is
recommended that documents transmitted as UTF-16 always begin with a
ZERO-WIDTH NON-BREAKING SPACE character (hexadecimal FEFF, also called
Byte Order Mark (BOM)) which, when byte-reversed, becomes hexadecimal
FFFE, a character guaranteed never to be assigned. Thus, a user-agent
receiving a hexadecimal FFFE as the first bytes of a text would know
that bytes have to be reversed for the remainder of the text.
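
The byte order check described above might be implemented along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, and text without a Byte Order Mark is assumed to be in network byte order, as recommended above.

```python
import codecs
from typing import Optional


def detect_utf16_byte_order(data: bytes) -> Optional[str]:
    """Return the byte order signalled by a leading BOM, or None if absent."""
    if data[:2] == codecs.BOM_UTF16_BE:   # bytes FE FF
        return "big-endian"
    if data[:2] == codecs.BOM_UTF16_LE:   # bytes FF FE: the BOM arrived reversed
        return "little-endian"
    return None


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 text, honouring a BOM and defaulting to big-endian."""
    order = detect_utf16_byte_order(data)
    if order == "little-endian":
        return data[2:].decode("utf-16-le")
    if order == "big-endian":
        return data[2:].decode("utf-16-be")
    return data.decode("utf-16-be")


print(decode_utf16(b"\xfe\xff\x00A"), decode_utf16(b"\xff\xfeA\x00"))
```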

How does a server determine which character encoding applies for a
document it serves? Some servers examine the first few bytes of the
document, or check against a database of known files and
encodings. Many modern servers give Web masters more control over
charset configuration than old servers do. Web masters should use
these mechanisms to send out a "charset" parameter whenever possible,
but should take care not to identify a document with the wrong
"charset" parameter value.

How does a user agent know which character encoding has been used?
The server should provide this information. The most straightforward
way for a server to inform the user agent about the character encoding
of the document is to use the "charset" parameter of the "Content-Type" header field of the HTTP protocol ([RFC2068], sections 3.4 and 14.18). For example,
the following HTTP header announces that the character encoding is
EUC-JP:

Content-Type: text/html; charset=EUC-JP

The HTTP protocol ([RFC2068],
section 3.7.1) mentions ISO-8859-1 as a default character encoding
when the "charset" parameter is absent from the "Content-Type" header
field. In practice, this recommendation has proved useless because
some servers don't allow a "charset" parameter to be sent, and others
may not be configured to send the parameter. Therefore, user agents
must not assume any default value for the "charset" parameter.
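
A user agent could extract the "charset" parameter from a "Content-Type" header value roughly as follows. The sketch uses the email.message.Message class from Python's standard library to parse the header; the function name is hypothetical. In line with the paragraph above, it returns None rather than a default when the parameter is absent.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the value and returns None if absent.
    return msg.get_content_charset()


declared = charset_from_content_type("text/html; charset=EUC-JP")
undeclared = charset_from_content_type("text/html")
print(declared, undeclared)
```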

To address server or configuration limitations, HTML documents may
include explicit information about the document's character encoding;
the META element can be used to provide
user agents with this information.

For example, to specify that the character encoding of the current
document is "EUC-JP", a document should include the following META declaration:

<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">

The META declaration must only be used
when the character encoding is organized such that ASCII characters
stand for themselves (at least until the META element is parsed). META declarations should appear as early as
possible in the HEAD element.
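
Because the META declaration is only usable when ASCII characters stand for themselves, a user agent can search the raw bytes for it before the encoding is known. The following is a deliberately simplified sketch, not a conforming parser: the regular expression, the function name, and the 1024-byte scan limit are all assumptions made for illustration.

```python
import re
from typing import Optional

# Matches "charset=VALUE" inside a META tag, directly on undecoded bytes.
_META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def charset_from_meta(document: bytes, limit: int = 1024) -> Optional[str]:
    """Look for a META charset declaration near the start of the document."""
    match = _META_CHARSET.search(document[:limit])
    return match.group(1).decode("ascii") if match else None


sample = (
    b'<HTML><HEAD><META http-equiv="Content-Type" '
    b'content="text/html; charset=EUC-JP"></HEAD></HTML>'
)
print(charset_from_meta(sample))
```

The scan limit reflects the advice that META declarations appear as early as possible in the HEAD element.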

For cases where neither the HTTP protocol nor the META element provides information about the
character encoding of a document, HTML also provides the charset attribute on several elements. By
combining these mechanisms, an author can greatly improve the chances
that, when the user retrieves a resource, the user agent will
recognize the character encoding.

To sum up, conforming user agents must observe the following priorities when determining a document's
character
encoding (from highest priority to lowest):

An HTTP "charset" parameter in a "Content-Type" field.

A META declaration with "http-equiv"
set to "Content-Type" and a value set for "charset".

The charset attribute set
on an element that designates an external resource.
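
The priority order above can be expressed as a simple first-match selection. In this sketch the function and parameter names are hypothetical; each argument is the value obtained from the corresponding source, or None when that source gave no information.

```python
from typing import Optional


def determine_charset(
    http_charset: Optional[str],
    meta_charset: Optional[str],
    attribute_charset: Optional[str],
) -> Optional[str]:
    """Pick the character encoding from the highest-priority source available."""
    for candidate in (http_charset, meta_charset, attribute_charset):
        if candidate:
            return candidate
    # No declared encoding: the caller may fall back on heuristics or user settings.
    return None


print(determine_charset(None, "EUC-JP", "ISO-8859-1"))
```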

In addition to this list of priorities, the user agent may use
heuristics and user settings. For example, many user agents use a
heuristic to distinguish the various encodings used for Japanese
text. Also, user agents typically have a user-definable, local default
character encoding which they apply in the absence of other
indicators.

User agents may provide a mechanism that allows users to override
incorrect "charset" information. However, if a user agent offers such
a mechanism, it should only offer it for browsing and not for editing,
to avoid the creation of Web pages marked with an incorrect "charset"
parameter.

Note.
If, for a specific application, it becomes necessary to refer to
characters outside [ISO10646], characters
should be assigned to a private zone to avoid conflicts with present
or future versions of the standard. This is highly discouraged,
however, for reasons of portability.

A given character encoding may not be able to express all
characters of the document character set. For such encodings, or when
hardware or software configurations do not allow users to input some
document characters directly, authors may use SGML character
references. Character references are a character
encoding-independent mechanism for entering any character from the
document character set.

Character references in HTML may appear in two forms:

Numeric character references (either decimal or hexadecimal).

Character entity references.

Character references within comments have no
special meaning; they are comment data only.

Note.
HTML provides other ways to present character data, in particular
inline images.

Note. In SGML, it is
possible to eliminate the final ";" after a character reference in
some cases (e.g., at a line break or immediately before a tag). In other
circumstances it may not be eliminated (e.g., in the middle of a
word). We strongly suggest using the ";" in all cases to avoid
problems with user agents that require this character to be
present.

The syntax "&#D;", where D is a decimal
number, refers to the Unicode decimal character number D.

The syntax "&#xH;" or "&#XH;", where
H is a hexadecimal number, refers to the Unicode hexadecimal
character number H. Hexadecimal numbers in numeric character
references are case-insensitive.

Here are some examples of numeric character references:

&#229; (in decimal) represents the letter "a" with a small
circle above it (used, for example, in Norwegian).

&#xE5; (in hexadecimal) represents the same character.

&#Xe5; (in hexadecimal) represents the same character as well.

&#1048; (in decimal) represents the Cyrillic capital letter "I".

&#x6C34; (in hexadecimal) represents the Chinese character
for water.
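
The examples above can be checked with the html.unescape() function from Python's standard library. This is offered as an illustration only: that function follows the parsing rules of a later HTML version, which agree with this specification for the references shown here.

```python
import html

references = ["&#229;", "&#xE5;", "&#Xe5;", "&#1048;", "&#x6C34;"]
decoded = [html.unescape(ref) for ref in references]

for ref, ch in zip(references, decoded):
    # Show each reference next to the code position it resolves to.
    print(ref, ch, ord(ch))
```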

Note.
Although the hexadecimal representation is not
defined in [ISO8879], it is expected to be in the revision,
as described in [WEBSGML]. This convention is particularly useful
since character standards generally use hexadecimal
representations.

In order to give authors a more intuitive way of referring to
characters in the document character set, HTML offers a set of character
entity references. Character entity references use
symbolic names so that authors need not remember code positions. For example, the character
entity reference &aring; refers to the lowercase "a" character
topped with a ring; "&aring;" is easier to remember than
&#229;.

HTML 4.0 does not define a character entity reference for every
character in the document character set. For instance, there is no
character entity reference for the Cyrillic capital letter "I".
Please consult the full list of
character references defined in HTML 4.0.
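
As a sketch, the mapping from entity names to code positions can be inspected with Python's html.entities.name2codepoint table, which is assumed here to mirror the HTML 4 entity set. Entity names in that table are case-sensitive.

```python
import html
from html.entities import name2codepoint

aring_lower = name2codepoint["aring"]   # lowercase a with ring
aring_upper = name2codepoint["Aring"]   # uppercase A with ring

# The named and numeric forms refer to the same character.
same_character = html.unescape("&aring;") == html.unescape("&#229;")

# No entity name maps to the Cyrillic capital letter I (code position 1048).
has_cyrillic_i = 1048 in name2codepoint.values()

print(aring_lower, aring_upper, same_character, has_cyrillic_i)
```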

Character entity references are case-sensitive. Thus,
&Aring; refers to a different character (uppercase A, ring) than
&aring; (lowercase a, ring).

Four character entity references deserve special mention since they are
frequently used to escape special characters:

"&lt;" represents the < sign.

"&gt;" represents the > sign.

"&amp;" represents the & sign.

"&quot;" represents the " mark.

Authors wishing to put the "<" character in text should use
"&lt;" (ASCII decimal 60) to avoid possible confusion with the
beginning of a tag (start tag open delimiter). Similarly, authors
should use "&gt;" (ASCII decimal 62) in text instead
of ">" to avoid problems with older user agents that
incorrectly perceive this as the end of a tag (tag close delimiter)
when it appears in quoted attribute values.

Authors should use "&amp;" (ASCII decimal 38) instead of "&"
to avoid confusion with the beginning of a character
reference (entity reference open delimiter). Authors should also use
"&amp;" in attribute values since character references are allowed
within CDATA attribute values.

Some authors use the character entity reference "&quot;" to
encode instances of the double quote mark (") since that character may
be used to delimit attribute values.
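
Authoring software can apply these four replacements mechanically. The sketch below uses html.escape() from Python's standard library, which with its default settings also escapes the single quote mark; that extra behavior is a property of the library, not a requirement stated here.

```python
import html

raw_text = 'a < b & "c" > d'
escaped = html.escape(raw_text)
print(escaped)

# Unescaping restores the original text.
round_trip = html.unescape(escaped)
```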

A user agent may not be able to
render all characters in a document meaningfully, for instance,
because the user agent lacks a suitable font, a character has a value
that may not be expressed in the user agent's internal character
encoding, etc.

Because there are many different things that may be done in such
cases, this document does not prescribe any specific behavior.
Depending on the implementation, undisplayable
characters may also be handled by the underlying display system
and not the application itself. In the absence of more sophisticated
behavior, for example tailored to the needs of a particular script or
language, we recommend the following behavior for user agents:

Adopt a clearly visible, but unobtrusive mechanism to alert the
user of missing resources.

If missing characters are presented using their numeric
representation, use the hexadecimal (not decimal) form since this
is the form used in character set standards.
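
One possible reading of the second recommendation is sketched below. The function name is hypothetical, the set of displayable characters stands in for whatever font or display information a real user agent would consult, and the use of the "&#xH;" notation for the visible replacement is an assumption rather than something this document prescribes.

```python
from typing import Set


def render_with_fallback(text: str, displayable: Set[str]) -> str:
    """Replace characters that cannot be displayed with a hexadecimal form."""
    return "".join(
        ch if ch in displayable else f"&#x{ord(ch):X};"
        for ch in text
    )


# Pretend only basic Latin letters can be displayed.
shown = render_with_fallback("H\u6c34", set("ABCDEFGH"))
print(shown)
```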