Universal character set

The Universal Character Set (UCS) is a character encoding that is defined by the international standard ISO/IEC 10646. It maps hundreds of thousands of abstract characters, each identified by an unambiguous name, to numeric code points.

Since 1991, the Unicode Consortium has been working with ISO to develop the Unicode Standard and ISO/IEC 10646 in tandem. The repertoire, character names, and code points of Version 2.0 of the Unicode Standard are identical to those of ISO/IEC 10646-1:1993 with its first seven published amendments. After Unicode 3.0 was published in February 2000, the new and updated characters were brought into the UCS via ISO/IEC 10646-1:2000.

The UCS has over 1.1 million code points (1,114,112, arranged in 17 planes of 65,536 each), but only the first 65,536 (the Basic Multilingual Plane, or BMP) are commonly used; the remainder are reserved for purposes such as representing ancient Egyptian hieroglyphs or rare Chinese characters. Many code points, even in the BMP, are deliberately left unassigned, to allow for future expansion or to minimize conflicts with other encoding forms.
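The split between the BMP and the remaining planes comes down to simple arithmetic on the code point. The following is a minimal sketch, assuming the code space ends at hexadecimal 10FFFF and is divided into planes of 65,536 code points; the names `plane_of` and `in_bmp` are illustrative and are not taken from the standard.

```python
MAX_CODE_POINT = 0x10FFFF  # assumed upper limit of the UCS code space
PLANE_SIZE = 0x10000       # 65,536 code points per plane


def plane_of(code_point: int) -> int:
    """Return the number of the plane containing code_point (0 is the BMP)."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a UCS code point: {code_point:#x}")
    return code_point // PLANE_SIZE


def in_bmp(code_point: int) -> bool:
    """Return True if code_point lies in the Basic Multilingual Plane."""
    return plane_of(code_point) == 0


if __name__ == "__main__":
    # Latin capital A, an Egyptian hieroglyph, and a rare CJK ideograph.
    for cp in (0x0041, 0x13000, 0x20000):
        print(f"U+{cp:04X}: plane {plane_of(cp)}, in BMP: {in_bmp(cp)}")
```

Only the first of the three sample code points falls in plane 0; the other two lie in supplementary planes.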

There are several character encoding forms defined by ISO/IEC 10646 for the Universal Character Set. The simplest is UCS-2, which uses a single code value between 0 and 65535 for each character, allowing that value to be represented as exactly two bytes (one 16-bit word). UCS-2 thereby permits a binary representation of every code point in the BMP, as long as the code point represents a character. Code points outside the BMP can be represented by pairs of special code values from what is called the S (Special) Zone of the BMP, each pair consisting of what is called an RC-element from the high-half zone and an RC-element from the low-half zone.

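In Unicode terminology these code values are called high surrogates and low surrogates respectively, and the encoding form that uses such pairs is called UTF-16. Strictly speaking, UCS-2 is the fixed-width form limited to the BMP, while UTF-16 extends it with surrogate pairs; for BMP characters the two produce identical code values.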
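The pairing of a high and a low code value can be illustrated with the arithmetic used by UTF-16. The sketch below assumes the usual surrogate ranges (high values starting at hexadecimal D800, low values starting at DC00); the function names `to_surrogates` and `from_surrogates` are hypothetical and chosen only for this example.

```python
HIGH_START = 0xD800  # first high (leading) surrogate value
LOW_START = 0xDC00   # first low (trailing) surrogate value


def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a code point outside the BMP into a (high, low) 16-bit pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a pair")
    offset = code_point - 0x10000          # 20-bit value
    high = HIGH_START + (offset >> 10)     # top 10 bits
    low = LOW_START + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a (high, low) pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


if __name__ == "__main__":
    high, low = to_surrogates(0x13000)  # an Egyptian hieroglyph
    print(f"U+13000 -> {high:04X} {low:04X}")
    print(f"round trip: U+{from_surrogates(high, low):04X}")
```

Because each half carries 10 bits, the pairs cover exactly the 16 planes above the BMP, which is why the code space ends at 10FFFF.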

Another encoding is UCS-4, which uses a single code value between 0 and, theoretically, hexadecimal 7FFFFFFF for each character (although the UCS stops at 10FFFF and ISO/IEC 10646 has stated that all future assignments of characters will also be in that range), allowing that value to be represented as exactly four bytes (one 32-bit word). UCS-4 thereby permits a binary representation of every code point in the UCS, including those outside the BMP. As in UCS-2, every encoded character has a fixed length in bytes, which makes it simple to manipulate, but it requires twice as much storage as UCS-2.
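The storage trade-off between the two fixed-width forms can be seen by encoding the same text both ways. This is a rough sketch that assumes big-endian byte order; `encode_ucs2_be` and `encode_ucs4_be` are illustrative helper names, not functions of any library.

```python
import struct


def encode_ucs2_be(text: str) -> bytes:
    """Encode text as big-endian UCS-2: one 16-bit word per BMP character."""
    words = []
    for ch in text:
        cp = ord(ch)
        # Reject code points outside the BMP and the surrogate values,
        # since neither is a character that plain UCS-2 can carry.
        if cp > 0xFFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"U+{cp:04X} cannot be encoded in UCS-2")
        words.append(struct.pack(">H", cp))
    return b"".join(words)


def encode_ucs4_be(text: str) -> bytes:
    """Encode text as big-endian UCS-4: one 32-bit word per character."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)


if __name__ == "__main__":
    sample = "UCS"
    print(len(encode_ucs2_be(sample)), "bytes in UCS-2")
    print(len(encode_ucs4_be(sample)), "bytes in UCS-4")
    # A character outside the BMP still fits in a single UCS-4 word.
    print(encode_ucs4_be("\U00013000").hex())
```

For text within the current code space, the UCS-4 output matches what Python's built-in `utf-32-be` codec produces, which offers a convenient cross-check.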

Occasionally, articles about Unicode will mistakenly refer to UCS-2 as "UCS-16". There is no UCS-16; the authors who make this error usually intended to refer to UCS-2 or UTF-16.
