Unicode: The Universal Character Set

Part 1: The Computer and Language

"Today's computers have only rudimentary capabilities for understanding and manipulating language…. For the most part they serve only as mute storage bins for the streams of characters we interpret as language." [1]

Computers have become our writing machines, to the point that we can and do forget that they are not language-based machines at all, but mere calculators working on a base of ones and zeroes. The story of how these machines evolved from super-fast adding machines to their current function as the modern world's universal writer is a fascinating one. It is also a story in which libraries, for the first time in the history of computing, were instrumental in the development of what is now a universal standard.

Bits, Bytes, and Seven of Eight

Although computers function internally at a level of ones and zeroes, the need for humans to communicate instructions to computers meant that very early in the computer's existence it had to interface with people by using Arabic numbers and at least some mathematical formulas that used letters and symbols. The very first computer programs were written in machine language, codified in the ones and zeroes that computers could understand. Early computer scientists, and in particular Grace Hopper, [2] began to see that the power of the computer could be harnessed to make the job of programming easier. If programmers could write their instructions in a more human language, then the job of giving instructions to the computer would be simplified, and perhaps even made more efficient. The first compilers, in which programmers wrote human-readable code which was then "compiled" by another program into the computer's machine language, were developed in the early 1950's. [3] For the next decade, computers were still glorified counting machines, and it wasn't until 1963 that a standard was developed for a character set that represented letters and numbers: this was the American Standard Code for Information Interchange, or ASCII, designated ANSI X3.4. [4]

At the time that ASCII was developed, computers had two primary concepts already in place: the bit, which would be either 1 or 0, and the byte, which was made up of eight bits. Because some computers used one of the bits in the byte for a special function, there were seven bits left that could be used to represent the world of numbers and letters. Seven bits meant that the total number of characters that could be represented was 128. ASCII reserved the first 32 characters as control characters, such as one for the "start of text" and one for the "end of text." Many of these control characters were developed from codes used in telegraphy, and some are now anachronistic, such as the control character for the "bell" that would cause a bell to ring to let an operator of a telegraph station know that a message was incoming.
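The arithmetic in this paragraph can be checked in a few lines of Python. This is only an illustrative sketch; the variable names are my own, and the code positions shown for the three control characters are the standard ASCII assignments rather than anything quoted above.

```python
# Seven bits can take 2**7 distinct values, numbered 0 through 127.
SEVEN_BIT_VALUES = 2 ** 7

# The first 32 code positions (0-31) are the control characters.
control_codes = list(range(32))

# Three of the controls mentioned in the text, by ASCII code position.
START_OF_TEXT = 0x02  # STX
END_OF_TEXT = 0x03    # ETX
BELL = 0x07           # BEL

print(SEVEN_BIT_VALUES)      # 128
print(len(control_codes))    # 32
print(chr(BELL) == "\a")     # True: "\a" is the bell character
```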

The first version of ASCII in 1963 contained only the upper case alphabetic letters from the English alphabet, A-Z, the numbers 0-9, and a set of punctuation characters that looked very much like the characters on the upper row of a typewriter keyboard. In 1967 the lower case letters were added to the ASCII standard. This change probably marks the shift from the use of the alphabet solely for programming instructions, which continued to use only the upper case letters well into the 1990's, to the use of letters to represent actual language expressions. As long, of course, as that language was English, or could be represented by the English alphabet.

ASCII is still the key character set standard in use in computing today. In total, ASCII has 33 control characters, 33 punctuation marks and symbols, upper and lower case a-z, and 0-9, to make a total of 128. Because of its limitation to only seven of the eight bits in the byte, there is no room in ASCII to expand beyond the characters needed for the English language.
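The tally can be reproduced with Python's standard string module. A minimal sketch, assuming that the 33 control characters are positions 0-31 plus the delete character at 127, and that the 33 punctuation marks and symbols include the space; both assumptions are my reading of how the counts are reached.

```python
import string

# Control characters: positions 0-31 plus DEL at position 127.
controls = [code for code in range(128) if code < 32 or code == 127]

# Printable symbols: the space plus the 32 marks in string.punctuation.
symbols = " " + string.punctuation

letters = string.ascii_uppercase + string.ascii_lowercase
digits = string.digits

counts = {
    "control characters": len(controls),
    "punctuation and symbols": len(symbols),
    "letters": len(letters),
    "digits": len(digits),
}
total = sum(counts.values())

print(counts)
print(total)  # 128
```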

Internationalization Begins

As long as computers were solely used by computer professionals, the ASCII character set was sufficient for their purposes. Computer programmers around the world were writing programs in what was essentially "programmer English." An assembly language programmer, regardless of native tongue, wrote "MOV" to move data from one place in the computer's memory to another. Fortran programmers often used the universal language of mathematics ("2 X 3 = "), but if their program needed to branch they all wrote "GOTO." As soon as computers were being used for actual texts, however, the limitation of computers to English language characters proved woefully inadequate. Even the simplest use of language, such as a transcription of names and addresses of customers, required more than ASCII could provide if your language included any form of accent marks or diacritics.

To respond to the needs of its international customer base, IBM created what it called "code pages" for its early personal computers. These were definitions of characters that went beyond the standard English ASCII. Code pages made use of the entire byte, all eight bits, since the earlier use of the eighth bit for control purposes was no longer needed in their operating system. By using all eight bits they could define up to 256 different characters. Since even this number of characters is not sufficient to define the characters needed for all of the Western languages that IBM wished to represent, multiple code pages were defined. Each code page covered a script (e.g. Cyrillic, or Greek) or a family of languages (e.g. Slavic, or Western European) whose alphabetic characters could be assigned a code without exceeding the limit of 256. A program would invoke a code page for a language and the computer's display programs would use the characters defined in that code page to print or display characters like ñ, ç, or ü. Code pages are generally enabled at the operating system level, so different makers of operating systems, such as Microsoft, Apple, and Sun, all provide code page support with their products. These code pages are still in use today, and because the programs that use them will likely exist for many years to come, they will need to be supported in computer systems for many years into the future.
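To see what invoking a code page means in practice, here is a small Python sketch. The three code pages are my own choice of examples (the original IBM PC page, a Windows Western European page, and a DOS Cyrillic page), not ones named above; Python happens to ship codecs for all three.

```python
# One byte value, read under three different 8-bit code pages.
raw = bytes([0xE7])
code_pages = ("cp437", "cp1252", "cp866")

decoded = {page: raw.decode(page) for page in code_pages}
for page, char in decoded.items():
    print(page, char)

# The same letter can also sit at different positions in different pages.
c_cedilla_ibm = "ç".encode("cp437")       # original IBM PC code page
c_cedilla_windows = "ç".encode("cp1252")  # Windows Western European page
print(c_cedilla_ibm, c_cedilla_windows)
```

The byte itself carries no indication of which code page it belongs to, so a program has to be told, or has to guess, which one applies.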

The code pages found in computer operating systems serve the function of de facto standards in the computer world, but they weren't themselves formal standards and the various proprietary code pages were not interchangeable. In the mid-1980's, the International Organization for Standardization (ISO) developed ten 8-bit character set standards that defined encoding for languages similar to the proprietary code pages. These formal standards provided a way to represent a subset of the world's languages that was not dependent on any one computer manufacturer. Many programs that you interact with today make use of ISO 8859-1, the first of these standards, also known as "Latin-1" or "Extended Latin." This standard covers the original ASCII character set, and special characters that cover most of what is needed to express 22 different languages, including French, German, Icelandic, and Afrikaans. Other ISO 8859 standards are used for Cyrillic, Arabic, Greek, Hebrew, and for some language groups like the East European group. [5] Because of the use of the ASCII-defined characters both in operating systems functions and in programming, each of the ISO 8859 standards also includes the basic ASCII characters.
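The shared ASCII base of the ISO 8859 family can be checked directly. A minimal sketch, using four parts of the standard that I picked for illustration (Latin-1, Latin-2, Cyrillic, and Greek) under the codec names Python gives them:

```python
ascii_bytes = bytes(range(128))
parts = ("iso8859_1", "iso8859_2", "iso8859_5", "iso8859_7")

# The lower half (positions 0-127) decodes identically in every part.
shared_ascii = all(
    ascii_bytes.decode(part) == ascii_bytes.decode("ascii") for part in parts
)

# The upper half is where the parts differ: one byte, several meanings.
upper = {part: bytes([0xD0]).decode(part) for part in parts}

print(shared_ascii)  # True
for part, char in upper.items():
    print(part, char, f"U+{ord(char):04X}")
```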

Neither the code pages nor the ISO 8859 standards covered ideographic languages, like Chinese or Japanese. The number of character places available in an 8-bit system was only 256, and therefore far from the tens of thousands of characters needed to express these languages in computer code. The Japanese had already defined two simplified writing systems, hiragana and katakana, that use only 50 characters each. These could be defined within the confines of the byte, and even without disturbing the definition of the ASCII characters, which most computers required for programming and operating system functions like the naming of files. The traditional Japanese character set, kanji, uses over six thousand characters, so this one was more of a challenge. The Japanese Industrial Standard (JIS) defined a two-byte character encoding for the traditional Japanese characters. It used only bytes that had not been defined in standard ASCII, so rather than having the ability to express 256 * 256 codes (65,536 possible values) it used two bytes of 94 values each, giving a total of 8,836 possible unique values. Like ISO 8859 this became a family of codes, each with some variations.
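The 94 by 94 arithmetic can be illustrated with one member of the JIS family. The sketch below uses EUC-JP, which is my choice of example rather than an encoding named above; in it, each byte of a two-byte character falls in the range 0xA1 to 0xFE, which is exactly 94 values, while ASCII characters keep their single-byte form.

```python
# 94 usable values in each of two bytes.
positions = 94 * 94

kanji = "漢"
encoded = kanji.encode("euc_jp")

# Each byte of the two-byte character lies in 0xA1-0xFE (94 values).
values_per_byte = 0xFE - 0xA1 + 1
in_range = all(0xA1 <= byte <= 0xFE for byte in encoded)

# ASCII characters are untouched and still take one byte.
ascii_encoded = "A".encode("euc_jp")

print(positions)                  # 8836
print(len(encoded), in_range)     # 2 True
print(ascii_encoded)              # b'A'
```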

Mainland China also had a simplified character set already in use when computers came on the scene. That character set, called GuoBiao (GB), used an encoding method similar to the two-byte JIS encodings. Taiwan, however, used the traditional Chinese character set. To encode the over 13,000 characters of this character set it was necessary to develop escape sequences, similar to those used in the encoding of Chinese characters in library systems, to extend the meaning of the series of bytes. This character set was called Big5, reportedly because it was an effort to standardize the character set over the five major computer companies operating in Taiwan at the time. Even this set was not enough, however; when Hong Kong wished to make use of the Big5 standard, there were nearly 5,000 characters used in Hong Kong that were not available in the Taiwan-based character set. Like other attempts to standardize complex character sets, variations formed.

From International to Universal

Although some of the major languages were well served with the existing code pages, other languages did not have computer equivalents at all. This meant that if speakers of those languages wished to send e-mail, they had to use English or another of the languages represented by code pages. In the early days of computing, when the users of computers throughout the world were in the scientific fields, English was already established in those environments as the international language of communication. We had, in a sense, recreated a state not unlike that of the medieval and post-medieval times when all scientific and scholarly communication took place in Latin. True, many learned people today around the world do speak one or more of the European languages, and increasingly English is the second language of the educated members of many societies, [6] but as soon as computers came to be used for the creation and communication of cultural expression, it became essential for computer users to be able to carry out that communication in their native language.

Even for those languages where code pages were available, the use of that technique was awkward, especially if you wanted to create multilingual documents whose characters were found on different code pages. In addition, the code page solutions for some languages did not result in a complete encoding of the character repertoire needed for a full expression of the language. The methods then in use for the non-Latin-based character sets in particular contained numerous exceptions, such as characters or ideograms that required more than one two-byte sequence. In the mid-1980's, some computer scientists began to feel a need to solve the character set problem once and for all. Researchers at Xerox PARC and Apple Computer gathered together to share their ideas on character encoding. Early in this discussion the computer scientists met with some others who were also working to bridge the character set gap: specifically, Alan Tucker and Karen Smith-Yoshimura of the Research Libraries Group. Libraries, and particularly research libraries, had a great interest in being able to represent their collections in computer catalogs, and that meant having the ability to input and display any human script in written documents in those libraries. From the beginning, the participation of language specialists from the library field contributed to the development of what soon came to be known as "Unicode."

Unicode - Unique, Universal and Uniform

Work on the Unicode character set focused on a series of seemingly intractable problems: first, there was a need for a large number of characters to be defined if they were going to be able to include every written language, either modern or ancient; next, there was the huge problem of changing the very basis for all Western text computing, the one byte character. The computer scientists were worried about doubling the size of every program and every text file by moving from one byte to two bytes for each character. Nor was it clear that two bytes would be sufficient. Two bytes would mean a total of 65,536 characters, although control characters, punctuation, and various graphical characters would have to be included along with those that represented actual units of written language. The real difficulty, though, was the inclusion of Chinese, Japanese, and Korean (CJK). These languages alone would occupy many tens of thousands of two-byte values, especially when older classic texts were included in the scope of the character set. To reduce the total number of characters that had to be encoded, the group working on the universal character set (UCS) took advantage of the large number of characters that these languages shared due to their common origins. The RLG staff had already developed one list of unified characters which it was using for its CJK cataloging capabilities. The exploitation of this area of overlap, called "unified Han," allowed the developers to create a universal character set in only two bytes. In late 1987, the term "Unicode" was first used to describe this marvelous new way to encode languages for computer manipulation. [7]
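The effect of unifying shared characters can be seen with a single ideograph. The sketch below is an illustration under my own assumptions: the four legacy encodings are ones I chose to stand in for the national standards (simplified Chinese, traditional Chinese, Japanese, and Korean), and the character is simply one that all four contain.

```python
char = "中"
legacy_encodings = ("gb2312", "big5", "euc_jp", "euc_kr")

# Each national standard stores the character under its own byte values...
legacy_bytes = {enc: char.encode(enc) for enc in legacy_encodings}

# ...but decoding any of them arrives at one shared Unicode code point.
code_points = {
    enc: ord(data.decode(enc)) for enc, data in legacy_bytes.items()
}

# That code point fits in the two bytes the designers were aiming for.
two_byte_form = char.encode("utf-16-be")

for enc in legacy_encodings:
    print(enc, legacy_bytes[enc].hex(), f"U+{code_points[enc]:04X}")
print(two_byte_form.hex())  # 4e2d
```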

Unicode and ISO 10646

Similar to the development of code pages in the computer industry and the subsequent development of a parallel international standard for language expression, the private Unicode effort preceded the creation of a formal international standard. The International Organization for Standardization and the recently formed non-profit Unicode Consortium combined their efforts to create a universal character set (UCS) standard, ISO 10646. Although the Unicode and ISO 10646 character sets are coordinated, there are differences in approach between the efforts of ISO and the continuing work of the Unicode Consortium. In particular, the latter provides support for implementations through technical publications that are available on its web site. Unicode has defined key practices such as normalization of Unicode text, standard abbreviated codes for scripts, and algorithms for the display of characters that flow from right to left. Also, ISO has extended its view of the world's languages to four bytes so that some of the limitations in Unicode's two-byte code can be overcome.
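Two of these practices, normalization and directionality, are exposed by Python's standard unicodedata module. A brief sketch; the sample characters are my own, and NFC is only one of the normalization forms that Unicode defines.

```python
import unicodedata

# "é" as one precomposed code point, and as "e" plus a combining accent.
precomposed = "\u00e9"
decomposed = "e\u0301"

# The two look the same on screen but compare as different strings...
same_before = precomposed == decomposed

# ...until both are brought to the same normalization form.
same_after = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

# Each character carries a directionality property used for display.
latin_direction = unicodedata.bidirectional("a")        # left to right
hebrew_direction = unicodedata.bidirectional("\u05d0")  # Hebrew alef

print(same_before, same_after)            # False True
print(latin_direction, hebrew_direction)  # L R
```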

The number of characters defined in the Universal Character Set continues to grow. The current encoding can allow over one million characters to be defined, although in practice many computer systems that are Unicode compliant are limited to the set of 65,536 characters that can be defined in two bytes. Specific language research communities are working to make the character set as complete as possible, adding either new forms to already defined groups of characters, or adding entire character repertoires that were not formerly represented. Unicode covers not only currently used languages but also ancient languages that are studied by historians and scholars. The work of the Unicode Consortium is not just a matter of creating computer encodings for characters; it has itself become a way to focus what we know about written language. Because language is one of the ways that human communities define themselves, the entry of some characters or scripts can be delayed because of cultural or political debates.
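The figure of "over one million" follows from the layout of the code space as 17 blocks of 65,536 positions each, a detail not spelled out above. The sketch below also uses the UTF-16 encoding form to show what happens to a character outside the first 65,536 positions; both the sample characters and the choice of UTF-16 are my own illustration.

```python
import sys

# 17 blocks ("planes") of 65,536 code points each.
code_space = 17 * 65_536
highest = sys.maxunicode  # the highest code point Python will accept

inside_first_block = "\u4e2d"       # a CJK ideograph
outside_first_block = "\U00010348"  # a letter from the ancient Gothic script

# In UTF-16 the first takes two bytes; the second needs a pair of
# two-byte units, four bytes in all.
inside_len = len(inside_first_block.encode("utf-16-be"))
outside_len = len(outside_first_block.encode("utf-16-be"))

print(code_space)               # 1114112
print(hex(highest))             # 0x10ffff
print(inside_len, outside_len)  # 2 4
```

A system that assumes every character is exactly two bytes will mishandle the second character, which is the practical limit mentioned above.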

Character Sets and Fonts

A glance through the lists of characters on the Unicode Consortium web pages or in the published version of the Unicode standard is enough to make anyone marvel at the beauty of the writing of so many of these languages. Note, however, that the UCS defines characters; it does not define the display forms (called "glyphs") and it does not define fonts. So the UCS-defined character "a" is the character "a" no matter how it is rendered for display; font forms such as bold and italic do not change the meaning of the character, and do not change the computer encoding of the character itself.

There are font packages that can display a wide variety of Unicode characters, and these come installed on newer computers today. On a Windows system, Arial Unicode MS and Lucida Sans Unicode are commonly installed with word processing software. The Macintosh operating systems since OS 8.5 include some Unicode-compliant fonts. Most versions of the Unix and Linux operating systems come with Unicode capability already installed. Your web browser, if it is a current version of one of the major browsers, may be able to detect whether an incoming web page uses Unicode or one of the common code pages that are installed on your computer. You may not be aware that you are already using Unicode in your everyday life.

Next: Unicode and Libraries

The advantages of a single character set for all languages are particularly evident in libraries where works in different scripts sit side-by-side on the virtual shelves of the online catalog. In my next column, I will cover the use of Unicode in MARC21, and particular issues that libraries face in using Unicode in their catalogs.

[1] Lucky, Robert W. Silicon dreams: information, man, and machine. New York, St. Martin's Press, 1989. p. 97
[2] Hopper worked on the Mark I calculator for the Navy, and is credited with inventing the concept of the compiler around 1949. See: http://www.cs.yale.edu/homes/tap/Files/hopper-story.html (Accessed August 8, 2005)
[3] Campbell-Kelly, Martin, and William Aspray. Computer: a history of the information machine. Basic Books, 1996. p. 187
[4] ANSI is the American National Standards Institute.
[5] There are numerous online resources that explain both the ISO standard character sets and the proprietary code pages most commonly in use. A good summary of the character sets and the issues surrounding them can be found in the Wikipedia entry for ISO 8859, http://en.wikipedia.org/wiki/ISO_8859 (Accessed August 7, 2005), and the articles on each of the ISO 8859 subsets that are linked from that page.
[6] There is even some speculation that the future "world language" will be an internationalized form of English. See: Fischer, Steven Roger. The History of Language. London, Reaktion Books Ltd., 1999. p. 217
[7] A short history of the early days of Unicode is available on the Unicode site at: http://www.unicode.org/history/tenyears.html (Accessed August 7, 2005)

The copyright in this article is NOT held by the author. For copyright-related permissions, contact Elsevier Inc.