I had started writing this to document the format used for the mapping tables
on the website. These formats are somewhat ad-hoc, but I tried to follow what
existed on the website where possible. I did introduce a few new conventions
that would allow the data to be fleshed out in a standard way in the future.

Unicode Character Mapping Formats

Draft, 10-15 MED

The Unicode website provides character mappings to and from Unicode for a
number of different code pages for use in character code conversion (also called
charset or code page transcoding). Some of these mappings are supplied by the
Unicode Consortium; others are supplied directly by the vendors. This document
describes the format for those files.

In all cases, comments start with '#', and continue to the end of the line.
For historical reasons, a number of comment lines may actually have significant
data; these are described below. Line ends may be CR, LF, or CRLF. Whitespace
between items consists of one or more spaces or tabs in arbitrary combination.
Whitespace between '#' and the following comment is optional.
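
The lexical rules above can be sketched in a few lines of Python. This is only an illustration, not a reference parser: the function names split_lines and parse_line are hypothetical, and the comment text is returned rather than discarded because some comment lines carry significant data.

```python
import re
from typing import List, Optional, Tuple


def split_lines(text: str) -> List[str]:
    """Split on CR, LF, or CRLF line ends."""
    return text.replace("\r\n", "\n").replace("\r", "\n").split("\n")


def parse_line(line: str) -> Tuple[List[str], Optional[str]]:
    """Return (items, comment) for one line.

    Items are separated by runs of spaces and/or tabs. The comment is
    the text after the first '#', or None if the line has no comment.
    """
    data, marker, comment = line.partition("#")
    data = data.strip(" \t")
    items = re.split(r"[ \t]+", data) if data else []
    return items, (comment.strip(" \t") if marker else None)
```

For instance, parse_line("0x22 \t0x0022 #QUOTATION MARK") yields the two items "0x22" and "0x0022" plus the comment text "QUOTATION MARK"; the sample data line is itself made up for illustration.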

This file describes the recommended format for the mapping files, and gives
directions for creating new files. Most of the files on the Unicode site are
supplied by vendors, and may not yet have been updated to this format.

Before discussing the format, it is important to review some necessary
concepts.

Client software may need to distinguish the different types of mismatches
that can occur when transcoding data to and from Unicode. These fall into
the following categories:

The sequence is unassigned (aka undefined).
For example,

0xA3BF is unassigned in CP950

0x0EDD is unassigned in Unicode, V3.0

The sequence is incomplete.
For example,

0xA3 is incomplete in CP950.

Unless followed by another byte of the right form, it is illegal.

0xD800 is incomplete in Unicode.

Unless followed by another value of the right form, it is illegal.

0xDC00 is incomplete in Unicode.

Unless preceded by another value of the right form, it is illegal.

The sequence is illegal.
For example,

0xFF is illegal in CP950
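
As a rough sketch of how these categories might be told apart for a double-byte code page, the Python below classifies a one- or two-byte sequence. The lead and trail byte ranges are assumptions about a CP950-like structure, and TABLE is a hypothetical two-entry excerpt standing in for a real mapping file; neither is taken from this document.

```python
from enum import Enum


class Status(Enum):
    ASSIGNED = "assigned"
    UNASSIGNED = "unassigned"
    INCOMPLETE = "incomplete"
    ILLEGAL = "illegal"


# Assumed CP950-like byte structure (illustrative only).
def is_lead(b: int) -> bool:
    return 0x81 <= b <= 0xFE


def is_trail(b: int) -> bool:
    return 0x40 <= b <= 0x7E or 0xA1 <= b <= 0xFE


# Hypothetical excerpt of a mapping table: byte sequence -> code point.
TABLE = {b"\x41": 0x0041, b"\xA4\x40": 0x4E00}


def classify(seq: bytes, table=TABLE) -> Status:
    """Classify a byte sequence of length 1 or 2."""
    if len(seq) not in (1, 2):
        return Status.ILLEGAL
    first = seq[0]
    if len(seq) == 1:
        if is_lead(first):
            return Status.INCOMPLETE      # needs a trail byte
        if first == 0xFF:
            return Status.ILLEGAL         # never valid in this structure
    elif not (is_lead(first) and is_trail(seq[1])):
        return Status.ILLEGAL             # malformed lead/trail pair
    # Well-formed: assigned only if the table mentions it.
    return Status.ASSIGNED if seq in table else Status.UNASSIGNED
```

Under these assumptions, classify(b"\xA3\xBF") is UNASSIGNED, classify(b"\xA3") is INCOMPLETE, and classify(b"\xFF") is ILLEGAL, matching the examples above.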

Each unassigned sequence is treated as a single code point: for example, 0xA3BF
is treated as a single code point when mapping into Unicode from CP950. The
actual conversion routines will typically handle an unassigned value in one of
several ways (depending on the parameters passed in), such as:

stop or throw an exception

in particular, higher-level character encodings, such as ISO 2022
conversions, commonly use this to know when to stop converting into one
set and pick another to convert to.

map it to a substitution character

such as the Unicode U+FFFD REPLACEMENT CHARACTER

represent it by a hex escape sequence

for example, when mapping from U+1234 to other code pages, it can be
represented by "%12%34" in URLs, "&#x1234;" in
XML or HTML, "\u1234" in Java or C++, or "\x{1234}"
in Perl.
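
The three options above could be sketched as follows. This is a minimal illustration, assuming a caller that passes in a policy name; the names handle_unassigned, hex_escape, and UnassignedError, and the policy and style strings, are all hypothetical. The escape forms reproduce the ones listed above and, as written, only cover 16-bit values.

```python
class UnassignedError(ValueError):
    """Raised when the 'stop' policy meets an unassigned value."""


def hex_escape(cp: int, style: str) -> str:
    """Format a 16-bit code point in one of the escape styles above."""
    if style == "url":
        # One %XX per byte of the 16-bit value, high byte first.
        return "".join("%{:02X}".format(b) for b in cp.to_bytes(2, "big"))
    if style == "xml":
        return "&#x{:04X};".format(cp)
    if style == "java":
        return "\\u{:04X}".format(cp)
    if style == "perl":
        return "\\x{{{:04X}}}".format(cp)
    raise ValueError("unknown escape style: " + style)


def handle_unassigned(cp: int, policy: str,
                      substitute: str = "\uFFFD", style: str = "xml") -> str:
    """Apply one of the three policies to an unassigned code point."""
    if policy == "stop":
        raise UnassignedError("no mapping for U+{:04X}".format(cp))
    if policy == "substitute":
        return substitute
    if policy == "escape":
        return hex_escape(cp, style)
    raise ValueError("unknown policy: " + policy)
```

A caller implementing a higher-level scheme could catch UnassignedError to decide when to switch to a different target set.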

Note that there is an important difference between a sequence that
represents a real REPLACEMENT CHARACTER in a legacy set and a sequence that
is merely unassigned and is therefore sometimes mapped to REPLACEMENT
CHARACTER.

Illegal values represent some corruption of the data stream. Conversion
routines may be directed to handle this in a different way than by replacement
characters. For example, a routine might map unassigned characters to a
substitution character, but throw an exception on illegal values.
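
A minimal sketch of that last behavior, for a toy single-byte code page: bytes missing from the table are treated as unassigned and substituted, while bytes known to be illegal raise an exception. TABLE, ILLEGAL_BYTES, and IllegalSequenceError are hypothetical names, and the table contents are made up for illustration.

```python
class IllegalSequenceError(ValueError):
    """Raised when the input contains an illegal byte."""


# Hypothetical toy code page: byte -> Unicode code point.
TABLE = {0x41: 0x0041, 0x42: 0x0042}
# Bytes that can never be valid in this toy code page.
ILLEGAL_BYTES = {0xFF}


def to_unicode(data: bytes) -> str:
    out = []
    for b in data:
        if b in ILLEGAL_BYTES:
            # Corrupt data stream: do not paper over it.
            raise IllegalSequenceError("illegal byte 0x{:02X}".format(b))
        # Any byte not mentioned in the table is unassigned: substitute.
        out.append(chr(TABLE.get(b, 0xFFFD)))
    return "".join(out)
```

Here to_unicode(b"A\x80B") substitutes U+FFFD for the unassigned byte 0x80, whereas to_unicode(b"\xFF") raises.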

It is important that a mapping file be a complete description. From the data
in the file, it should be possible to tell for any sequence of bytes whether
that sequence is assigned, unassigned, incomplete, or illegal.

Unless otherwise indicated in the data file, any sequences of bytes that
are not mentioned are assumed to be unassigned.

All control values (C0, C1) should be explicitly mapped.

All private use (e.g. user defined) characters should be explicitly
mapped, either to the private use zone in Unicode, or to the correct
characters outside of that zone.

Only a real replacement character should be mapped to REPLACEMENT CHARACTER;
unassigned characters should not be mapped to it. Similarly, when mapping
back from Unicode, only REPLACEMENT CHARACTER should map to SUB or another
legacy equivalent.

Unfortunately, code page names are not unique: different sources differ on
the precise mappings for Shift-JIS, for example. The description should
identify the code page in enough detail to distinguish it from other code
pages, for example "Macintosh variant of Japanese Shift-JIS".

When multiple values are given, such as for maintainers, they are
comma-separated (with optional whitespace).

In some cases, users have the option of using fallback characters, where a
character that is not represented in the target code page is given a
"best fit" mapping. For example, an encoding might not have curly
quotes; the generic quotes can be used as a fallback. Any fallback mappings
should be provided before the main mapping, because any later data line
overrides any earlier one. For readability, the fallback section should be
marked with comments, as below.

For example, the following indicates that 0x22 should be mapped to U+0022.
When mapping back from Unicode, U+0022 is mapped to 0x22; in addition, if the
fallback option is on, then U+201D and U+201C are also mapped to 0x22.
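
One way to read the override rule is sketched below in Python. It assumes the data lines have already been parsed into (byte, code point) pairs in file order; that pair representation and the names build_tables and encode_char are assumptions for illustration, not part of the file format. A pair whose byte is later remapped to a different code point is treated as a fallback.

```python
def build_tables(pairs):
    """Build lookup tables from (byte, code_point) pairs in file order."""
    to_unicode = {}
    for byte, cp in pairs:
        to_unicode[byte] = cp          # a later line overrides an earlier one
    exact, fallback = {}, {}
    for byte, cp in pairs:
        if to_unicode[byte] == cp:
            exact[cp] = byte           # round-trip (main) mapping
        else:
            fallback[cp] = byte        # overridden line: fallback only
    return to_unicode, exact, fallback


def encode_char(cp, exact, fallback, use_fallback=False):
    """Map a code point to a byte, or None if there is no mapping."""
    if cp in exact:
        return exact[cp]
    return fallback.get(cp) if use_fallback else None


# Fallback lines first, then the main mapping.
pairs = [(0x22, 0x201C), (0x22, 0x201D), (0x22, 0x0022)]
to_unicode, exact, fallback = build_tables(pairs)
```

With these pairs, 0x22 maps to U+0022, U+0022 always maps back to 0x22, and U+201C and U+201D map to 0x22 only when use_fallback is true.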

In some cases, a mapping is only a minor variation of another mapping. When
this is the case, it can be indicated without copying the entire contents of
the file, by supplying just the source mapping and the changed lines. As it
turns out, the overriding can be described very simply using an IMPORT-style
mechanism and the same overriding used for fallbacks, using the following
format.