Abstract

While encodings have been defined to some extent, implementations have not always implemented them in the same way, have not always used the same labels, and often differ in dealing with undefined and former proprietary areas of encodings. This specification attempts to fill those gaps so that new implementations do not have to reverse engineer encoding implementations of the market leaders and existing implementations can converge.

Status of This Document

This section describes the status of this document at the time of its publication. Other
documents may supersede this document. A list of current W3C publications and the latest revision
of this technical report can be found in the W3C technical reports
index at http://www.w3.org/TR/.

Publication as a First Public Working Draft does not imply endorsement by the W3C Membership.
This is a draft document and may be updated, replaced or obsoleted by other documents at
any time. It is inappropriate to cite this document as other than work in progress.

This is a snapshot of the editor's document, as of the date shown on the title page, published after discussion with the WHATWG editors. No changes have been made in the body of the W3C draft other than to align with W3C house styles. The primary reason that W3C is publishing this document is so that HTML5 and other specifications may normatively refer to a stable W3C Recommendation.

1 Preface

While encodings have been defined to some extent, implementations have
not always implemented them in the same way, have not always used the same
labels, and often differ in dealing with undefined and former proprietary
areas of encodings. This specification attempts to fill those gaps so that
new implementations do not have to reverse engineer encoding implementations
of the market leaders and existing implementations can converge.

This specification is primarily intended for dealing with
legacy content; it requires new content and formats to use the
utf-8 encoding exclusively.

2 Conformance

All diagrams, examples, and notes in this specification are
non-normative, as are all sections explicitly marked non-normative.
Everything else in this specification is normative.

The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in the normative parts of this document are to be
interpreted as described in RFC2119. For readability, these words do
not appear in all uppercase letters in this specification.
[RFC2119]

Conformance requirements phrased as algorithms or specific steps
may be implemented in any manner, so long as the end result is
equivalent. (In particular, the algorithms defined in this
specification are intended to be easy to follow, and not intended to
be performant.)

User agents may impose
implementation-specific limits on otherwise unconstrained inputs,
e.g. to prevent denial of service attacks, to guard against running
out of memory, or to work around platform-specific limitations.

3 Terminology

Hexadecimal numbers are prefixed with "0x".

In equations, all numbers are integers, addition is represented by "+",
subtraction by "−", multiplication by "×", division by "/",
calculating the remainder of a division (also known as modulo) by "%",
exponentiation by "bⁿ",
arithmetic left shifts by "<<", arithmetic right shifts by ">>",
bitwise AND by "&", and bitwise OR by "|".

A byte is a sequence of eight bits, represented as a
double-digit hexadecimal number in the range 0x00 to 0xFF.

A code point is a Unicode code point and is represented as a
four-to-six digit hexadecimal number, typically prefixed with "U+".
In equations and indexes code points are prefixed
with "0x". [UNICODE]

Comparing two strings in an ASCII case-insensitive manner
means comparing them exactly, code point for code point, except that the
characters in the range U+0041 to U+005A (i.e. LATIN CAPITAL LETTER A to
LATIN CAPITAL LETTER Z) and the corresponding characters in the range
U+0061 to U+007A (i.e. LATIN SMALL LETTER A to LATIN SMALL LETTER Z) are
considered to also match.
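
As an illustration, the comparison can be sketched as follows. The function names are hypothetical and are not defined by this specification. Only the 26 ASCII letters are folded, so characters such as U+212A KELVIN SIGN do not match "k".

```javascript
// Hypothetical helper: lowercase only U+0041..U+005A, leaving all other
// code points untouched (unlike String.prototype.toLowerCase).
function asciiLowercase(s) {
  return s.replace(/[A-Z]/g, (c) => String.fromCharCode(c.charCodeAt(0) + 0x20));
}

// Hypothetical helper: ASCII case-insensitive comparison of two strings.
function asciiCaseInsensitiveMatch(a, b) {
  return asciiLowercase(a) === asciiLowercase(b);
}

console.log(asciiCaseInsensitiveMatch("UTF-8", "utf-8")); // true
console.log(asciiCaseInsensitiveMatch("\u212A", "k"));    // false
```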

4 Encodings

An encoding defines a mapping from a code point sequence to a
byte sequence (and vice versa). Each encoding has a
name, and one or more labels.

A decoder algorithm takes a byte stream and emits a
code point stream. The byte pointer is initially zero,
pointing to the first byte in the stream. It cannot be negative. It can be
increased and decreased to point to other bytes in the stream. The
EOF byte is a conceptual byte representing the end of the
stream. The byte pointer cannot point beyond the
EOF byte. The EOF code point is a conceptual code
point that is emitted once the byte stream is handled in its entirety.
A decoder must be invoked again when the word continue is used or
when one or more code points are emitted of which none is the
EOF code point.

An encoder algorithm takes a code point stream and emits a
byte stream. It fails when a code point is passed for which it does not have
a corresponding byte (sequence). Analogously to a decoder, it
has a code point pointer.
An encoder must be invoked again when the word continue is
used or when one or more bytes are emitted of which none is the
EOF byte.

5 Indexes

Most legacy encodings make use of
an index. An index is an ordered list of
pointers and corresponding code points. Within an index
pointers are unique and code points can be duplicated.

To find the pointers and their corresponding code points in an index,
let lines be the result of splitting the resource's contents on U+000A.
Then remove each item in lines that is the empty string or starts with U+0023.
Then the pointers and their corresponding code points are found by splitting each item in lines on U+0009.
The first subitem is the pointer (as a decimal number) and the second is the corresponding code point (as a hexadecimal number).
Other subitems are not relevant.
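
The parsing steps above can be sketched as follows. The function name and the sample resource text are hypothetical, and the sketch assumes that code points in the resource are written as hexadecimal numbers with an optional "0x" prefix.

```javascript
// Hypothetical helper: turn the text of an index resource into an
// ordered list of [pointer, codePoint] pairs.
function parseIndex(text) {
  return text
    .split("\n")                                             // split on U+000A
    .filter((line) => line !== "" && !line.startsWith("#"))  // drop empty lines and U+0023 comments
    .map((line) => {
      const [pointer, codePoint] = line.split("\t");         // split on U+0009
      // parseInt with radix 16 accepts an optional "0x" prefix.
      return [parseInt(pointer, 10), parseInt(codePoint, 16)];
    });
}

// Made-up sample resource, for illustration only.
const sampleIndexText = "# comment line\n\n0\t0x20AC\tEURO SIGN\n1\t0x0081\t\n";
console.log(parseIndex(sampleIndexText)); // [ [ 0, 8364 ], [ 1, 129 ] ]
```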

The index code point for pointer in
index is the code point corresponding to
pointer in index, or null if
pointer is not in index.

The index pointer for code point in
index is the first pointer corresponding to
code point in index, or null if
code point is not in index.
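
A minimal sketch of the two lookups, assuming an index is held in memory as an ordered array of [pointer, code point] pairs. That representation and the sample data are assumptions made for illustration.

```javascript
// Returns the code point for a pointer, or null if the pointer is absent.
function indexCodePoint(pointer, index) {
  const entry = index.find(([p]) => p === pointer);
  return entry === undefined ? null : entry[1];
}

// Returns the first pointer for a code point, or null if the code point
// is absent. Code points can be duplicated, so order matters here.
function indexPointer(codePoint, index) {
  const entry = index.find(([, cp]) => cp === codePoint);
  return entry === undefined ? null : entry[0];
}

// Made-up index in which U+4E00 appears twice.
const sampleIndex = [[0, 0x4E00], [1, 0x4E8C], [5, 0x4E00]];
console.log(indexCodePoint(5, sampleIndex));    // 19968 (0x4E00)
console.log(indexPointer(0x4E00, sampleIndex)); // 0, the first matching pointer
console.log(indexCodePoint(2, sampleIndex));    // null
```

A production implementation would more likely build lookup tables once rather than scan the list on every call; the linear scan simply mirrors the definitions.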

This index works differently from all others. Listing all
code points would result in over a million items whereas they can be
represented neatly in 207 ranges combined with trivial limit checks. It
therefore only superficially matches the GB18030 standard for code points
encoded as four bytes. See also index gb18030 ranges code point and
index gb18030 ranges pointer below.

To decode a byte stream stream using
fallback encoding encoding, run these steps:

Let offset be 0.

For each of the rows in the following table, starting with the first
one and going down, if the first bytes of stream match
all the bytes given in the first column (ergo stream
contains at least two or three bytes), then set encoding
to the encoding given in the cell in the second column of
that row, and set offset to the offset given in the cell
in the third column of that row.
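
The byte order mark check can be sketched as follows. The table rows in this sketch are an assumption based on the well-known byte order marks for utf-8, utf-16be, and utf-16le; the function name and return shape are hypothetical.

```javascript
// Assumed table rows: byte order mark bytes, resulting encoding, offset.
const BOM_TABLE = [
  { bytes: [0xFE, 0xFF], encoding: "utf-16be", offset: 2 },
  { bytes: [0xFF, 0xFE], encoding: "utf-16le", offset: 2 },
  { bytes: [0xEF, 0xBB, 0xBF], encoding: "utf-8", offset: 3 },
];

// Hypothetical helper: stream is a Uint8Array. Returns the encoding to
// use and the number of leading bytes to skip before decoding.
function sniffByteOrderMark(stream, fallbackEncoding) {
  let encoding = fallbackEncoding;
  let offset = 0;
  for (const row of BOM_TABLE) {
    // Reading past the end of a Uint8Array yields undefined, so a
    // stream that is too short never matches.
    if (row.bytes.every((b, i) => stream[i] === b)) {
      encoding = row.encoding;
      offset = row.offset;
      break;
    }
  }
  return { encoding, offset };
}

console.log(sniffByteOrderMark(Uint8Array.of(0xEF, 0xBB, 0xBF, 0x41), "windows-1252"));
// { encoding: 'utf-8', offset: 3 }
```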

7 API

This section uses terminology from the DOM, Typed Arrays, and Web IDL.
Non-browser implementations are not required to implement this API.
[DOM][TYPEDARRAY][WEBIDL]

The following example uses the TextEncoder object to encode
an array of strings into an
ArrayBuffer. The result is a
Uint8Array containing the number
of strings (as a Uint32Array),
followed by the length of the first string (as a
Uint32Array), the
utf-8 encoded string data, the length of the second string (as
a Uint32Array), the string data,
and so on.
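
A sketch of such an example is given here, written against the TextEncoder API as it ships in current browsers and Node.js, which may differ in detail from the interface defined in this draft. The function name is hypothetical, and the lengths are written in big-endian order because that is the DataView default, which is an assumption of this sketch.

```javascript
// Hypothetical helper: pack an array of strings into one ArrayBuffer as
// [count][length 0][utf-8 bytes 0][length 1][utf-8 bytes 1]...
function encodeArrayOfStrings(strings) {
  const encoder = new TextEncoder(); // encodes to utf-8
  const encoded = strings.map((s) => encoder.encode(s));
  const WORD = Uint32Array.BYTES_PER_ELEMENT; // 4 bytes per length field

  // One word for the count, then one word plus the data for each string.
  let total = WORD;
  for (const chunk of encoded) total += WORD + chunk.byteLength;

  const bytes = new Uint8Array(total);
  const view = new DataView(bytes.buffer);
  let offset = 0;

  view.setUint32(offset, strings.length); // big-endian by default
  offset += WORD;
  for (const chunk of encoded) {
    view.setUint32(offset, chunk.byteLength);
    offset += WORD;
    bytes.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return bytes.buffer;
}

const packed = new Uint8Array(encodeArrayOfStrings(["a", "\u00E9"]));
console.log(packed.length); // 15
```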

If the streaming flag is unset, set the encoding state
to the default values of the encoding's decoder's
associated variables, unset the BOM seen flag, and empty the stream.

If options's stream is true,
set the streaming flag, and unset the streaming flag
otherwise.

If input is given, then given
input's buffer,
byteOffset, and byteLength, append
byteLength bytes from buffer,
starting at byteOffset, to the stream.

If the BOM seen flag is unset, and the stream either
holds at least two bytes, or at least three bytes if the encoding
is utf-8, then set the BOM seen flag, and for each of
the rows in the following table, starting with the first one and going
down, if the first bytes of the stream match all the bytes given in
the first column, and the encoding matches the
encoding given in the cell in the second column of that row,
then remove those bytes at the start of the stream.

Return the output of running encoding's decoder, with its
error handling mode set to fatal if the fatal flag is
set, on the stream. If encoding's decoder terminates with
failure, throw an
"EncodingError".

In addition to the reason given above with respect to the
byte order mark, this also does not use the encode algorithm
as it assumes a continuous stream rather than one delivered in fragments.
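
The effect of the streaming flag can be illustrated with the TextDecoder API as it ships in current browsers and Node.js. The helper name is hypothetical. Shipping implementations may report a fatal decoding failure with a different exception type than the one named in this draft, so the sketch catches any exception.

```javascript
// Hypothetical helper: decode utf-8 delivered as a sequence of chunks.
function decodeChunks(chunks, fatal = false) {
  const decoder = new TextDecoder("utf-8", { fatal });
  let text = "";
  for (const chunk of chunks) {
    // stream: true keeps partial byte sequences buffered between calls.
    text += decoder.decode(chunk, { stream: true });
  }
  // A final call without the stream option flushes the decoder state.
  return text + decoder.decode();
}

// U+20AC EURO SIGN is 0xE2 0x82 0xAC in utf-8; split it across two chunks.
const euroChunks = [Uint8Array.of(0xE2, 0x82), Uint8Array.of(0xAC, 0x21)];
console.log(decodeChunks(euroChunks)); // "€!"

try {
  decodeChunks([Uint8Array.of(0xFF)], true); // 0xFF is never valid in utf-8
} catch (e) {
  console.log("fatal decoding error");
}
```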

The constraints in the utf-8 decoder above match
“Best Practices for Using U+FFFD” from the Unicode standard. No other
behavior is permitted per the Encoding Standard (other algorithms that
achieve the same result are obviously fine, even encouraged).
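
As an illustration of the resulting behavior, the following sketch decodes malformed input with the TextDecoder API available in current browsers and Node.js, in its default non-fatal mode. The helper name is hypothetical.

```javascript
// Hypothetical helper: decode an array of byte values as utf-8,
// replacing malformed sequences with U+FFFD.
function decodeLossy(byteValues) {
  return new TextDecoder("utf-8").decode(Uint8Array.from(byteValues));
}

// A byte that can never appear in utf-8 becomes a single U+FFFD.
console.log(decodeLossy([0x61, 0xFF, 0x62]) === "a\uFFFDb"); // true

// A truncated three-byte sequence (0xE2 0x82 without its final byte) is
// replaced by one U+FFFD, and the following byte is decoded normally.
console.log(decodeLossy([0xE2, 0x82, 0x41]) === "\uFFFDA"); // true
```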

In violation of the Unicode standard, which does not allow for handling a
byte order mark in its definition of utf-16be and utf-16le, checking and using a
byte order mark happens before an encoding to decode a byte stream is chosen, as seen in
the decode algorithm.

The utf-16 lead byte and utf-16 lead surrogate
are initially null and the utf-16be flag is initially unset.

Acknowledgments

Many people have helped make encodings more
interoperable over the years and thereby furthered the goals of this
standard. Likewise, many people have helped make this standard what it is
today.

Ideally they are all listed here so please contact the editor with any
omissions.