HTML Unleashed: Internationalizing HTML

Character Encoding Standards

It so happened that the computer industry has been flourishing in
the country whose language uses one of the most compact alphabets in
the world. However, not long after the first computers had learned to
spell English, a need arose to encode characters from other languages.
In fact, even the minimum set of Latin letters and basic symbols has
been for some time the subject of controversy between two competing
standards, ASCII and (now almost extinct) EBCDIC; no wonder that for
other languages' alphabets, a similar muddle has been around for much
longer (in fact, it's still far from over).

As explained in Chapter 3, "SGML and the HTML
DTD," a character encoding (often called character set
or, more precisely, coded character set) is defined---first,
by the numerical range of codes; second, by the repertoire of
characters; and third, by a mapping between these two sets.
You see that the term "character set" is a bit misleading because it
actually implies two sets and a relation between them.
Probably the most precise definition of a character encoding in
mathematical terms is given by Dan Connolly in his paper "Character
Set Considered Harmful": "A function whose domain is a subset of
integers, and whose range is a set of characters."
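Connolly's definition can be sketched in a few lines of Python (an illustrative toy only; the mapping below is invented and belongs to no real standard):

```python
# A coded character set modeled as a function (here, a dict) whose
# domain is a subset of integers and whose range is a set of characters.
tiny_charset = {
    65: "A",    # same code position as in ASCII
    66: "B",
    192: "À",   # an accented letter placed in the upper half
}

def decode(code):
    """Map a code position to its character, if the position is assigned."""
    return tiny_charset.get(code)   # unassigned positions yield None

print(decode(65))    # A
print(decode(1))     # None -- a code position with no character assigned
```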

The range of codes is limited by the length of the sequence of bits
(called bit combination) used to encode one character. For
instance, a combination of 8 bits is sufficient to encode a total of
256 characters (although not all of these code positions may be
actually used). The smaller the bit combination size, the more compact
the encoding (that is, the less storage space is required for a piece
of text), but at the same time, the fewer total characters you can
encode.

It is quite logical to codify characters using bit combinations of
the size most convenient for computers. Because modern computer
architecture is based on bytes (also called octets) of 8 bits,
all contemporary encoding standards use bit combinations of 8, 16, or
32 bits in length. The next sections survey the most important of
these standards to see the roles they play in today's Internet.

7-Bit ASCII

The so-called 7-bit ASCII, or US ASCII, encoding is equivalent to
the international standard named ISO 646 established by the
International Organization for Standardization (ISO). This
encoding actually uses octets of 8 bits per character but leaves
the first (most significant) bit in each octet unused (it must
always be zero). The 7 useful bits of ISO 646 can encode a total
of 128 characters.

This is the most ubiquitous encoding standard used on the
overwhelming majority of computers worldwide (either by itself or as a
part of other encodings, as you'll see shortly). ISO 646 may be called
international in the sense that there are precious few computers in
the world that use other encodings for the same basic repertoire of
characters. It is also used exclusively for keywords and syntax in all
programming and markup languages (including SGML and HTML), as well as
for all sorts of data that is human-editable but of essentially
computer nature, such as configuration files or scripts.

However, with regard to the wealth of natural languages spoken
around the world, ISO 646 is very restrictive. In fact, only the
English, Latin, and Swahili languages can use plain 7-bit ASCII with no
additional characters. Most languages whose alphabets (also called
scripts or writing systems) are based on the Latin
alphabet use various accented letters and ligatures.

The first 32 codes of ISO 646 are reserved for control
characters, which means that they invoke some functions or
features in the device that reads the text rather than produce a
visible shape (often called glyph) of a character for human
readers. As a rule, character set standards are reluctant to exactly
define the functions of control characters, as these functions may
vary considerably depending on the nature of text processing software.

For example, of the 32 control characters of ISO 646, only a few
(carriage return, line feed, tabulation) have more or less
established meanings. For use in texts, most of these codes
are simply useless. The code space thus wasted is a hangover
from the old days when control characters played the role of
today's document formats and communication
protocols.
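For instance (a Python sketch; the code positions are those of ISO 646):

```python
# The handful of control characters with established meanings all sit
# within the first 32 code positions of ISO 646:
established = {0x09: "tabulation", 0x0A: "line feed", 0x0D: "carriage return"}
for code, name in established.items():
    assert 0 <= code < 32            # every one lies in the control range
    print(f"{code:#04x}  {name}")
```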

8-Bit Encodings

The first natural step to accommodate languages that are more
letter-hungry than English is to make use of the 8th bit in every
byte. This provides an additional 128 codes, sufficient to
encode an alphabet of several dozen letters (for example, Cyrillic or
Greek) or a set of additional Latin letters with diacritical marks and
ligatures used in many European languages (such as ç in French or ß
in German).

Unfortunately, there exist many more 8-bit encodings in the world
than are really necessary. Nearly every computer platform or operating
system making its way onto a national market without a strong computer
industry of its own brought along a new encoding standard. For
example, as many as three encodings for the Cyrillic alphabet are now
widely used in Russia, one being left over from the days of MS-DOS,
the second native to Microsoft Windows, and the third being popular in
the UNIX community and on the Internet. A similar situation can be
observed in many other national user communities.

ISO, being an authoritative international institution, has done its
best to normalize the mess of 8-bit encodings. The ISO 8859
series of standards covers almost all extensions of the Latin
alphabet as well as the Cyrillic (ISO 8859-5), Arabic (ISO 8859-6),
Greek (ISO 8859-7), and Hebrew (ISO 8859-8) alphabets. All of
these encodings are backwards compatible with ISO 646; that is, the
first 128 characters in each ISO 8859 code table are identical to
7-bit ASCII, while the national characters are always located in the
upper 128 code positions.
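This arrangement is easy to observe with Python's codecs (a sketch; the codec names are Python's spellings of the ISO 8859 parts):

```python
# The lower half is plain ASCII in every part of ISO 8859:
assert b"A".decode("iso8859-5") == "A"

# The same upper-half octet names a different national character per part:
byte = b"\xd0"
print(byte.decode("iso8859-1"))   # Ð  (Latin capital letter Eth)
print(byte.decode("iso8859-5"))   # а  (Cyrillic small letter a)
print(byte.decode("iso8859-7"))   # Π  (Greek capital letter Pi)
```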

Again, the first 32 code positions (128 to 159 decimal, inclusive)
of the upper half in ISO 8859 are reserved for control characters and
should not be used in texts. This time, however, many software
manufacturers chose to disregard the taboo; for example, the majority
of TrueType fonts for Windows conform to ISO 8859-1 in code positions
from 160 upwards, but use the range 128-159 for various additional
characters (notably the em dash and the trademark sign). This leads to
the endless confusion about whether one may access these 32 characters
in HTML (the DTD, following ISO 8859, declares this character range
UNUSED). HTML internationalization extensions resolve this controversy
by making it possible to address these characters via their Unicode
codes.
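A Python sketch of the conflict (cp1252 is Python's name for the Windows code page that extends Latin-1):

```python
byte = b"\x97"
# In the Windows extension, this position is the em dash:
assert byte.decode("cp1252") == "\u2014"
# Under ISO 8859-1 proper, the same position is only a control character:
assert byte.decode("iso8859-1") == "\u0097"
# HTML i18n sidesteps the ambiguity by addressing the character through
# its Unicode code instead, e.g. the reference &#8212; (8212 = 0x2014):
assert int("2014", 16) == 8212
```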

The authority of ISO was not, however, sufficient to position all
of the 8859 series as a strong alternative to the ad hoc national
encodings supported by popular operating systems and platforms. For
example, ISO 8859-5 is hardly ever used to encode Russian texts except
on a small number of computers.

On the other hand, the first standard in the 8859 series, ISO
8859-1 (often called ISO Latin-1), which contains the most
widespread Latin alphabet extensions serving many European
languages, has been widely recognized as the 8-bit ASCII
extension. Whenever a need arises for an 8-bit encoding
standard that is as international as possible, you're likely to see
ISO 8859-1 playing the role. For instance, ISO 8859-1 served
as a basis for the document character set in HTML versions up to 3.2
(in 4.0, this role was taken over by Unicode; see below).

16-Bit Encodings

Not all languages in the world use small alphabets. Some writing
systems (for example, Japanese and Chinese) use ideographs, or
hieroglyphs, instead of letters, each corresponding not to a sound of
speech but to an entire concept or word. As there are many more words
and conceivable ideas than there are sounds in a language,
such writing systems usually contain many thousands of ideographs. An
encoding for such a system needs at least 16 bits (2 octets) per
character, which allows it to accommodate a total of
2^16 = 65,536 characters.

Ideally, such a 16-bit encoding should be backwards compatible with
the existing 8-bit (and especially 7-bit ASCII) encodings. This means
that an ASCII-only device reading a stream of data in this encoding
should be able to correctly interpret at least ASCII characters if
they're present. This is implemented using code switching, or
escaping techniques: Special sequences of control characters
are used to switch back and forth between ASCII mode with the 1 octet
per character and 2-octet modes (also called code pages).
Encodings based on this principle are now widely used for Far East
languages.
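Python's iso2022_jp codec demonstrates one such escaping scheme, the ISO-2022-JP encoding used for Japanese (a sketch):

```python
text = "Tokyo \u6771\u4eac"           # ASCII, then the two ideographs for Tokyo
data = text.encode("iso2022_jp")
# The ASCII prefix passes through unchanged, one octet per character...
assert data.startswith(b"Tokyo ")
# ...then an escape sequence switches into the 2-octet mode...
assert b"\x1b$B" in data              # ESC $ B : switch to JIS X 0208
# ...and another switches back, so an ASCII-only reader stays in sync:
assert data.endswith(b"\x1b(B")       # ESC ( B : back to ASCII
```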

Code switching works all right, but it leaves ambiguous what should be
considered the coded representation of a character: is it just the
2-octet code, or the code preceded by the switching sequence? It is
obvious that the
"extended" national symbols and ASCII characters are not treated
equally in such systems, which may be practically justifiable but is
likely to pose problems in the future.

In the late 1980s, the need for a truly international 16-bit coding
standard became apparent. The Unicode Consortium, formed in 1991,
undertook to create such a standard called Unicode. In
Unicode, every character from the world's major writing systems is
assigned a unique 2-octet code. Following tradition, the first 128
codes of Unicode are identical to 7-bit ASCII, and the first 256 codes
to ISO 8859-1. However, strictly speaking,
this standard is not backwards compatible with 8-bit encodings; for
instance, the Unicode code for the Latin letter A is 0041 (hex), while
the ASCII code for the same letter is simply 41.
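The incompatibility is easy to demonstrate (a Python sketch; the utf-16-be codec emits plain 2-octet Unicode codes, high octet first):

```python
# One letter, two coded representations:
assert "A".encode("ascii") == b"\x41"            # ISO 646: 1 octet
assert "A".encode("utf-16-be") == b"\x00\x41"    # Unicode: 2 octets
# An 8-bit reader would see the leading zero octet as a NUL control
# character, so the two streams are not interchangeable.
```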

The Unicode standard deserves a separate book to describe it fully
(in fact, its official specification is available in book form from
the Unicode Consortium). Its many blocks and zones cover all literal
and syllabic alphabets that are now in use, alphabets of many dead
languages, lots of special symbols and combined characters (such as
letters with all imaginable diacritical marks, circled digits, and so
on).

Also, Unicode provides space for more than 20,000 unified ideographs
used in Far East languages. Unlike other alphabets, ideographic
systems were treated on a language-independent basis. This
means that an ideograph that has similar meanings and appearance
across the Far East languages is represented by a single code despite
the fact that it corresponds to quite different words in each
of the languages and that most such ideographs have country-specific
glyph variants.

The resulting ideographic system implemented in Unicode is often
abbreviated CJK (Chinese, Japanese, Korean) after the names of the
major languages covered by this system. CJK unification reduced the
set of ideographs to be encoded to a manageable (and codeable) number,
but the undesirable side effect is that it is impossible to create a
single Unicode font suitable for everyone; a Chinese text should be
displayed using slightly different visual shapes of ideographs than a
Japanese text even if they use the same Unicode-encoded ideographs.

The work on Unicode is far from complete, as about 34 percent of
the total coding space remains unassigned. Working groups in both the
Unicode Consortium and ISO are working on selection and codification
of the most deserving candidates to colonize Unicode's as-yet-empty
wastelands. A good sign is that the process of Unicode acceptance
throughout the computer industry is taking off; for example, Unicode
is used for internal character coding in the Java programming language
and for font layout in the Windows 95 and Windows NT operating systems.

ISO 10646

Although Unicode is still not widely used, ISO published in 1993 a
new, 32-bit encoding standard named ISO/IEC 10646-1, or Universal
Multiple-Octet Coded Character Set (abbreviated UCS).
Just as 7-bit ASCII does, though, this standard leaves the most
significant bit in the most significant octet unused, which makes it
essentially a 31-bit encoding.

Still, the code space of ISO 10646 spans the tremendous amount of
2^31 = 2,147,483,648 code positions, which is much, much more than could
be used by all languages and writing systems that ever existed on
Earth. What, then, is the rationale behind such a huge "Unicode of
Unicodes?"

The main reason for developing a 4-octet encoding standard is that
Unicode actually cannot accommodate all the characters for which it
would be useful to provide encoding. Although a significant share of
Unicode codes are still vacant, the proposals for new character and
ideograph groups that are now under consideration require in total
several times more code positions than are available in 16-bit
Unicode.

Extending Unicode thus seems inevitable, and it makes little sense
to extend it by one octet because computers will have trouble dealing
with 3-octet (24-bit) sequences; 32-bit encoding, on the other hand,
is particularly convenient for modern computers, most of which process
information in 32-bit chunks.

Just as Unicode extends ISO 8859-1, the new ISO 10646 is a proper
extension of Unicode. In terms of ISO 10646, a chunk of 256 sequential
code positions is called a row, 256 rows constitute a
plane, and 256 planes make up a group. The whole code
space is thus divided into 128 groups. In such terms, Unicode is
simply plane 00 of group 00, the special plane that in ISO 10646
standard is called the Basic Multilingual Plane (BMP). For
example, the Latin letter A (Unicode 0041) is in ISO 10646 fully coded
00000041. As of now, ISO 10646 BMP is absolutely identical to Unicode,
and it is unlikely that these two standards will ever diverge.
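The row/plane/group arithmetic can be sketched as follows (a hypothetical helper for illustration, not part of any standard API):

```python
def ucs4_parts(code):
    """Split a canonical 4-octet UCS code into (group, plane, row, cell)."""
    group = (code >> 24) & 0x7F       # the most significant bit is always zero
    plane = (code >> 16) & 0xFF
    row = (code >> 8) & 0xFF
    cell = code & 0xFF
    return group, plane, row, cell

# The Latin letter A, Unicode 0041, is fully coded 00000041:
assert ucs4_parts(0x00000041) == (0, 0, 0, 0x41)  # group 00, plane 00: the BMP
```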

ISO 10646 specifies a number of intermediate formats that do not
require using the codes in the canonical form of 4 octets per
character. For example, the UCS-2 (Universal Character Set, 2-octet
format) is indistinguishable from Unicode as it uses 16-bit codes from
the BMP. The UTF-8 format (UCS Transformation Format, 8 bits) can be
used to incorporate, with a sort of code switching technique, 32-bit
codes into a stream consisting of mostly 7-bit ASCII codes. Finally,
the UTF-16 method was developed to access more than a million 4-octet
codes from within a Unicode/BMP 2-octet data stream without making it
incompatible with current Unicode implementations.
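Python's codecs can illustrate two of these formats (a sketch; UCS-2 is represented here by the utf-16-be codec, which coincides with it inside the BMP):

```python
# UTF-8 leaves 7-bit ASCII octets untouched...
text = "A\u0416"                      # an ASCII letter plus Cyrillic Zhe
utf8 = text.encode("utf-8")
assert utf8[0] == 0x41                # the ASCII octet survives as-is
assert all(b >= 0x80 for b in utf8[1:])   # non-ASCII octets are marked
# ...while UTF-16 reaches past the BMP with pairs of reserved 2-octet codes:
beyond_bmp = "\U00010000".encode("utf-16-be")
assert len(beyond_bmp) == 4           # one character, one surrogate pair
```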

Most probably, ISO 10646 will be rarely used in its canonical 4-octet
form. For most texts and text-processing applications, wasting
32 bits per character is beyond the acceptable level of redundancy.
However, ISO 10646 is an important standard in that it establishes a
single authority on the vast lands lying beyond Unicode, thus
preventing the problem of incompatible multioctet encodings even
before this problem could possibly emerge.