UNICODE

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

In all, the Unicode Standard, Version 5.2 provides codes for 107,361 characters from the world's alphabets, ideograph sets, and symbol collections.

The majority of common-use characters fit into the first 64K code points, an area of the codespace that is called the basic multilingual plane, or BMP for short.

Sixteen supplementary planes are available for encoding other characters; together they currently have over eight hundred thousand unused code points.
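The division of the codespace into planes can be sketched in a few lines of Python. This is an illustration only; the helper name plane_of is hypothetical and not part of any standard library. It relies on the fact that each plane holds 65,536 (0x10000) code points, so the plane number is the code point shifted right by 16 bits.

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    return code_point >> 16  # each plane holds 0x10000 code points


for ch in ("A", "\u20ac", "\U0001F600"):  # A, euro sign, grinning face emoji
    cp = ord(ch)
    kind = "BMP" if plane_of(cp) == 0 else "supplementary"
    print(f"U+{cp:04X} is in plane {plane_of(cp)} ({kind})")
# U+0041 is in plane 0 (BMP)
# U+20AC is in plane 0 (BMP)
# U+1F600 is in plane 1 (supplementary)
```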

The Unicode Standard defines three encoding forms that allow the same data to be transmitted in a byte-, word-, or double-word-oriented format (i.e. in 8, 16, or 32 bits per code unit).
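As a rough illustration of the three encoding forms, the following Python sketch counts how many code units the same text needs in each form. The helper name code_units is hypothetical; the explicit big-endian codecs are used only so that Python does not prepend a byte order mark.

```python
def code_units(text: str) -> dict:
    """Count the code units needed for text in each encoding form."""
    return {
        "UTF-8": len(text.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(text.encode("utf-16-be")) // 2,  # 16-bit code units
        "UTF-32": len(text.encode("utf-32-be")) // 4,  # 32-bit code units
    }


print(code_units("A"))           # {'UTF-8': 1, 'UTF-16': 1, 'UTF-32': 1}
print(code_units("\u20ac"))      # euro sign: {'UTF-8': 3, 'UTF-16': 1, 'UTF-32': 1}
print(code_units("\U0001F600"))  # emoji: {'UTF-8': 4, 'UTF-16': 2, 'UTF-32': 1}
```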

UTF-8

Popular for HTML and similar protocols.

A way of transforming all Unicode characters into a variable-length encoding of bytes.

It has the advantage that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII.

Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.
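The ASCII compatibility described above can be checked directly. In this minimal Python sketch (the helper name utf8_bytes is hypothetical), pure ASCII text produces identical bytes under both encodings, and a non-ASCII character becomes a multi-byte sequence whose bytes all have the high bit set, so they are never confused with ASCII.

```python
def utf8_bytes(text: str) -> bytes:
    """Encode text as UTF-8."""
    return text.encode("utf-8")


plain = "Hello"
print(utf8_bytes(plain) == plain.encode("ascii"))  # True

accented = "caf\u00e9"  # "cafe" with an acute accent on the e
encoded = utf8_bytes(accented)
print(encoded)  # b'caf\xc3\xa9'

# Keep only the bytes below 0x80: exactly the ASCII characters survive.
ascii_part = bytes(b for b in encoded if b < 0x80)
print(ascii_part)  # b'caf'
```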

UTF-16

Popular in many environments that need to balance efficient access to characters with economical use of storage.

It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.
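The "pairs of 16-bit code units" are known as surrogate pairs. The arithmetic behind them can be sketched as follows, assuming the standard UTF-16 rule: subtract 0x10000, then split the remaining 20 bits between a high unit starting at 0xD800 and a low unit starting at 0xDC00. The function name to_utf16_units is hypothetical.

```python
def to_utf16_units(code_point: int) -> list:
    """Return the 16-bit code units that represent one code point in UTF-16."""
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x10000:
        return [code_point]               # BMP: a single code unit
    offset = code_point - 0x10000         # 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return [high, low]


print([hex(u) for u in to_utf16_units(0x20AC)])   # ['0x20ac']
print([hex(u) for u in to_utf16_units(0x1F600)])  # ['0xd83d', '0xde00']
```

The result agrees with Python's built-in codec: "\U0001F600".encode("utf-16-be") yields the bytes D8 3D DE 00.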

UTF-32

Useful where memory space is no concern, but fixed-width, single-code-unit access to characters is desired.

Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.
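Because every character occupies exactly one 32-bit code unit, the character at position n always starts at byte offset 4 * n. The Python sketch below illustrates that direct indexing; the helper name nth_code_point_utf32 is hypothetical, and big-endian byte order is assumed.

```python
import struct


def nth_code_point_utf32(data: bytes, index: int) -> int:
    """Read the code point at a character index from big-endian UTF-32 bytes."""
    (value,) = struct.unpack_from(">I", data, index * 4)
    return value


text = "a\u20ac\U0001F600"  # three characters: a, euro sign, emoji
data = text.encode("utf-32-be")
print(len(data))                            # 12
print(hex(nth_code_point_utf32(data, 2)))   # 0x1f600
```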

The character identified by a Unicode code point is an abstract entity, such as "LATIN CAPITAL LETTER S".

The mark made on screen or paper -- called a glyph -- is a visual representation of the character.

The Unicode Standard does not define glyph images.

The standard defines how characters are interpreted, not how glyphs are rendered. The Unicode Standard does not specify the size, shape, or style of on-screen characters.

The Unicode Standard directly addresses only the encoding and semantics of text.
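The abstract identity of a character (its code point and name, independent of any glyph) can be inspected with Python's standard unicodedata module. This is a small illustration; the helper name describe is hypothetical.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and character name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe("S"))  # U+0053 LATIN CAPITAL LETTER S

# The name identifies the character regardless of font, size, or style.
print(unicodedata.lookup("LATIN CAPITAL LETTER S") == "S")  # True
```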

The Unicode Standard primarily encodes scripts rather than languages.

Where more than one language shares a set of symbols that have a historically related derivation, the union of the set of symbols of each such language is unified into a single collection identified as a single script.

Some scripts like Latin and Devanagari can support many languages.

Some languages may also make use of more than one script; for example, Japanese traditionally makes use of the Han (Kanji), Hiragana, and Katakana scripts, and modern Japanese usage commonly mixes in the Latin script as well.
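The mixing of scripts in Japanese text can be illustrated with a rough Python sketch. Python's standard unicodedata module does not expose the Unicode Script property, so this example guesses a label from the first word of each character name. That is a heuristic chosen for illustration, not the official script assignment, and the helper names rough_script and scripts_in are hypothetical.

```python
import unicodedata


def rough_script(ch: str) -> str:
    """Guess a script label from the first word of the character name."""
    name = unicodedata.name(ch, "")
    if not name:
        return "UNKNOWN"
    first = name.split(" ")[0]
    return "HAN" if first == "CJK" else first  # CJK ideographs belong to Han


def scripts_in(text: str) -> list:
    """List the distinct labels in text, in order of first appearance."""
    return list(dict.fromkeys(rough_script(ch) for ch in text))


# Two kanji, one hiragana, three katakana, then Latin letters.
sample = "\u65e5\u672c\u306e\u30ab\u30e1\u30e9ABC"
print(scripts_in(sample))  # ['HAN', 'HIRAGANA', 'KATAKANA', 'LATIN']
```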

The primary scripts currently supported by Unicode 5.2.0 are:

Arabic

Aramaic, Imperial

Armenian

Avestan

Balinese

Bamum

Bengali

Bopomofo

Buginese

Buhid

Canadian Syllabics

Carian

Cham

Cherokee

Coptic

Cypriot

Cyrillic

Deseret

Devanagari

Egyptian Hieroglyphs

Ethiopic

Georgian

Glagolitic

Gothic

Greek

Gujarati

Gurmukhi

Han

Hangul

Hanunóo

Hebrew

Hiragana

Javanese

Kaithi

Kannada

Katakana

Kayah Li

Kharoshthi

Khmer

Lao

Latin

Lepcha (Rong)

Limbu

Linear B

Lisu

Lycian

Lydian

Malayalam

Meetei Mayek

Mongolian

Myanmar

New Tai Lue

N'Ko

Ogham

Ol Chiki

Old Italic (Etruscan)

Old Persian Cuneiform

Old South Arabian

Old Turkic

Osmanya

Oriya

Pahlavi, Inscriptional

Parthian, Inscriptional

Phags-pa

Phoenician

Rejang

Runic

Saurashtra

Samaritan

Shavian

Sinhala

Sumero-Akkadian Cuneiform

Sundanese

Syloti Nagri

Syriac

Tagalog

Tagbanwa

Tai Le

Tai Tham

Tai Viet

Tamil

Telugu

Thaana

Thai

Tibetan

Tifinagh (Berber)

Ugaritic

Vai

Yi

Unicode also encodes a number of other collections of symbols. These other collections are as follows:

Numbers

General Diacritics

General Punctuation

General Symbols

Mathematical Symbols

Musical Symbols (Western, Byzantine, and Ancient Greek)

Technical Symbols

Dingbats

Arrows, Blocks, Box Drawing Forms, and Geometric Shapes

Game Symbols

Miscellaneous Symbols

Presentation Forms

Braille Patterns

Kangxi Radicals

Some of the members of the Unicode Consortium (Source: Unicode.org)

Adobe

Apple

Microsoft

Google

IBM

Oracle

SAP

Sybase

Yahoo

Government of India

Columbia University

SAS

Verisign

Sony Ericsson

Nokia

HP

Some technologies using Unicode

XML

Java

ECMAScript (Official standard defining JavaScript)

Some important terms:

Unicode transformation format (UTF)

A UTF is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence.

UTF-8 is most common on the web.

UTF-16 is used by Java and Windows.

UTF-32 is used by various Unix systems.

The conversions between all of them are algorithmically based, fast and lossless.

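UTF-16 uses 2 bytes for each character in the BMP and 4 bytes (a pair of 16-bit code units) for every other character.

UTF-32 uses 4 bytes for every character.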
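The lossless conversion between the UTFs, and the exclusion of surrogate code points, can be checked with Python's built-in codecs. This is a minimal sketch; the helper names round_trips and lone_surrogate_is_rejected are hypothetical.

```python
def round_trips(text: str) -> bool:
    """Convert text through UTF-8, UTF-16 and UTF-32 and compare the result."""
    utf8 = text.encode("utf-8")
    utf16 = utf8.decode("utf-8").encode("utf-16-be")
    utf32 = utf16.decode("utf-16-be").encode("utf-32-be")
    return utf32.decode("utf-32-be") == text


def lone_surrogate_is_rejected() -> bool:
    """Check that a surrogate code point on its own cannot be encoded."""
    try:
        "\ud800".encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


print(round_trips("A\u20ac\U0001F600"))  # True
print(lone_surrogate_is_rejected())      # True
```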

UTF-16 is available in 3 forms.

UTF-16 (Unmarked): Uses big-endian byte serialization by default, but may include a byte order mark at the beginning to indicate the actual byte serialization used.