Summary

Status

This document is a Unicode Technical Note. Sole responsibility for
its contents rests with the author(s). Publication does not imply any endorsement by the
Unicode Consortium. This document is not subject to the
Unicode Patent Policy.

The encoding of the Latin, Greek, and Cyrillic scripts

There are a number of very good reasons why the Latin, Greek,
and Cyrillic scripts have been separately encoded, rather than
being encoded as a single script.

1. Traditional graphology has always treated them as distinct
scripts, while acknowledging that they are, of course, historically
related. Mere historic relatedness is insufficient reason to
unify scripts, however, as Latin, Greek, and Cyrillic can ultimately
all trace their roots back to Phoenician, and Phoenician itself
is then related to Aramaic and all its descendants, from Hebrew
and Arabic to farflung outliers like Sogdian, Uighur, and even
Mongolian.

2. In the case of Latin and Greek, the distinction has existed
since classical times. Cyrillic is more closely related to
Greek, and in medieval manuscripts there is a fair amount
of overlap in Greek and early Cyrillic writing conventions,
but by the time of the development of modern typography,
Greek script and Cyrillic script are clearly distinct, and their
current manifestations in print usage are very different.

3. Literate users of Latin, Greek, and Cyrillic alphabets do
not have cultural conventions of treating each other's
alphabets and letters as part of their own writing systems.
"It's all Greek to me," is not just a saying, but accurately
reflects the common perception by users of any one of those
scripts when presented with textual material in one of the
other scripts. It's not just that the words are unfamiliar,
but that the writing itself is considered alien. Most people
will be able to pick out the letters which share common shapes
(A, B, etc.), but the high proportion of odd-looking letters
that carry no significance to users of the other scripts
results in the text as a whole being treated as simply
illegible. This, by the way, is one of the operative means
by which distinctions in script can be identified, although
it is far from being a simple, objective method that works
in all instances.

4. Even more significantly, from the point of view of the
problem of character encoding for digital textual representation
in information technology, the preexisting identification
of Latin, Greek, and Cyrillic as distinct scripts was carried
over into character encoding, from the very earliest instances
of such encodings. Once ASCII and EBCDIC were expanded to
start incorporating Greek or Cyrillic letters, all significant
instances of such encodings included a basic Latin (ASCII or
otherwise) set and a full set of letters for Greek or a
full set of letters for Cyrillic. Precedent for the
purposes of character encoding was clearly established by
those early 8-bit charsets.

5. Following on from point #4, any universal character encoding
must distinguish Latin, Greek, and Cyrillic as scripts. If
it does not do so, it would have insurmountable interoperability
problems dealing with any of the huge amount of legacy data
which already distinguished the scripts. Note that multiscript
(partially) universal character encodings predating the Unicode
Standard all did this. That includes IBM's registry of
glyph identifiers, DEC's and Hewlett-Packard's listings of
characters and glyphs, Xerox's XCCS character standard,
WordPerfect's proprietary character sets, and
Microsoft's and Apple's internal system of character identifications.
The library community maintains the same script distinctions in its own
data formats: MARC 21 (published by the Library of Congress) and UNIMARC
(published by IFLA).
Even the East Asian character encodings, as they developed,
also distinguished Latin, Greek, and Cyrillic. See, for
instance, JIS X 0208 itself, which separately encodes Greek
and Cyrillic alphabets from ASCII Latin.

6. The few character encodings that actually attempt to do
a unification of Latin and Greek or Cyrillic are very special-purpose
and limited in usage, and cannot interoperate well with the
vast majority of text processing infrastructure. A good example
of this is ETSI's GSM 03.38, which attempts to address the
problem of displaying uppercase Greek on a Latin device with
a 7-bit character set by unifying all the uppercase Greek letters
with their Latin lookalikes and by dispensing with any support
for lowercase Greek. Such schemes to unify Greek (or Cyrillic)
with Latin have never spread beyond their original, limited-purpose
contexts, simply because they can't handle the requirements for
more general-purpose processing.

7. In terms of implementation issues, any attempt at a unification
of Latin, Greek, and Cyrillic would wreak havoc with certain
required text processes. In particular, a unified encoding of
Latin, Greek, and Cyrillic would make casing operations an unholy
mess, in effect making all casing operations context sensitive in
a way that is now restricted to a few problematical edge cases
(Turkish i's, Greek sigmas).

The encoding of the Han script

Now by way of contrast, consider the issue for the Han script.

1. Graphologically, the Han script ("Chinese characters") has
long been considered a single script, adapted for use by
neighboring cultures, but not separated into distinct scripts
by such usage. Historically very early versions of the Chinese
character usages (e.g., the Great Seal script) probably rightly
qualify as distinct scripts, but such distinctions are
irrelevant to the status of Han synchronically.

2. This identity of the Han script has been perpetuated by the
historically more or less continuous cultural preeminence of
China in East Asia over the course of millennia, and by the
political usage which successive Chinese empires have put
Chinese writing to -- using the single written form of Chinese
as a way of covering many, many distinct Chinese languages
in a single Han cultural identity. The imperial spread of Han
writing through East Asia mirrors, in many ways, the imperial
spread of Latin script in the Western world, where spread
of the Latin alphabet from language to language and ethnic
group to ethnic group, first by the Roman Empire and much later
by the Western European empires, did not result in fractionating
the script itself, but rather the widespread usage of the single
script, and its elaboration by the addition of new ideographs
(for Han) and new letters (for Latin) as new demands were
placed upon it. (A similar pattern can be seen in the spread
of the Arabic script around the world.)

3. The important non-Han peoples who adapted Chinese writing
in their own culture (most notably Korea, Japan, and Vietnam)
continued to view the Han characters as Chinese writing,
as demonstrated even by the name of the script in each of
these countries, being literally "Chinese character". And
rather than simply adopting the script at one point and then
evolving it off in some independent direction, the typical
pattern for each of these cultures was over the centuries to
keep adding to the store of Han ideographs they used by
continued borrowing of large new sets of them directly from
China.

4. The major exception in this developmental pattern, in Japan,
actually speaks, to the contrary, to the continued unitary
nature of the Han script itself. In Japan, a highly cursive
style of writing Chinese characters for Japanese sounds,
as opposed to borrowed Chinese vocabulary, a style called
manyooshuu, was simplified down into a set of conventional
syllabic symbols for Japanese alone. That clearly was the
development of a new script, which came to be known as
Hiragana, from the Han script. But separately and simultaneously,
in Japan, the Han characters themselves ("kanji" in Japanese)
continued to be written in the traditional Chinese way.

5. Unlike the case for Latin, Greek, and Cyrillic, there is
a longstanding cultural tradition in Japan, China, and Vietnam,
of viewing "Chinese characters" as being of shared identity
throughout the region. Japanese can't "read" Chinese -- after
all, it is a completely different and utterly alien language
to Japanese speakers -- any more than English speakers can
read Tagalog written in the Latin alphabet. But they do
recognize that the Chinese characters themselves are shared
and in fact can recognize much of the shared vocabulary that
was originally borrowed into Japanese from Chinese, in the
same way that English speakers recognize much French vocabulary.

6. There is a lot of confusion that results among those not
intimately familiar with East Asian languages and writing systems
because of the fact that the writing systems for Japan, Korea,
China, and Vietnam are completely different, at the same
time that they all share, as parts of those writing systems,
a shared single Han script. There is no question that the
Japanese writing system as a whole is very, very different from
the Chinese writing system. But Japanese kanji as a portion
of the Japanese writing system constitute the same script
as Chinese hanzi as the main part of the Chinese writing system.

7. Another issue that leads to controversy over "Han unification"
in East Asia tends to result from considerations of font styles
and character variants. The style issue results from largely from the fact
that Japan has traditionally
been an extremely conservative country, going through a long,
deliberately isolationist period before the Meiji reformations.
As a result, Japan tended to preserve, in its Buddhist and other
literary traditions, forms of Chinese that dated all the way
back to Tang, Sung, and Ming dynasty material. In the meantime, China
itself was busy undergoing vast revolutions and upheavals and
changing rulers from one ethnicity to entirely different ones
(Mongols ruled China in one dynasty, Manchus in another). During
this time, Chinese writing kept innovating, while the forms
carried to Japan tended to stay more conservative.

Notwithstanding the systematic variation sometimes seen between
more conservative character forms in Japan and typographically
distinct forms seen in China, the typical range of variation
among glyphs across all the CJK user communities is well within
the bounds of typical variation seen within other scripts. This
fact, which is noted in the JIS X 0208 standard itself, forms
the basis for the principles upon which characters are identified
as being the "same character" in Japanese, Chinese, and Korean
sources.

8. In the 20th century, we encounter the most extreme form of stylistic
innovation in China, when as a
result of educational policy after the Communist revolution in
the PRC, a deliberate and very widespread process of orthographic
simplification and reform was imposed all over China. Those
changes were not adopted outside of China in Japan, nor even
in Taiwan and Hong Kong. That resulted
in a sharp split in Han script usage ("simplified" versus
"traditional"). But even that cannot be considered enough to
have created a new, distinct Han script. The reason is that
even in the PRC, the new, simplified forms were always treated
as alternative forms of the traditional characters, often
printed alongside them in reference works. There has been a
continuous adjustment of writing in China, as more characters get simplified,
but some simplifications are abandoned in favor of more traditional
forms, and so on. Many Chinese end up, for one reason or
another, simply having to learn both the traditional and
the simplified forms of characters, and read them as alternative
glyphs for the same character -- implicitly within the same
overall Han script.

9. When it comes to character encoding decisions made in
East Asia, it is also clear that Han characters have in almost
all cases been considered to constitute a single script, rather
than distinct scripts per East Asian country. Japanese standards
were early on devoted to encoding those Chinese characters needed
for Japanese. And Chinese standards focussed on that subset
of Chinese characters needed for Chinese. But later on, the
standards on both sides expanded as the Japanese standards added
characters from China and the Chinese standards added characters
from Japan. In neither case did these additions follow the
pattern seen when Greek or Cyrillic were added to early
Latin character encodings. Instead, in both instances, it was
just a matter of adding X more thousand Han characters into the
big tables that already consisted of thousands of Han characters.

10. The effort to "unify" the encoding of Han characters in
10646 and the Unicode Standard has been misunderstood by
some as an attempt to intermingle an essentially different
Japanese writing system and a Chinese writing system, as if
there was some kind of enforced miscegenation underway.
But the correct way to interpret what went on was rather simply
the avoidance of duplicate encoding of the same Han characters
represented in several different East Asian standards. This
process was well-understood by the actual national standards
participants from Japan, Korea, China, and other countries,
who all along have been doing the major work involved in
minimizing the amount of duplicate encoding of what all the
committee members fully agree is the same character.

The analogy to bring to bear when considering "Han unification"
is not a picture of trying to unify a Latin encoding, and
Greek encoding, and a Cyrillic encoding based on character
shape alone, but instead, unifying an ASCII (Latin) encoding,
an EBCDIC (Latin) encoding, the Latin portion of the
JIS X 0208 Japanese standard, and the Latin portion of the
GB 2312 Chinese standard. There would be no point in encoding
the same Latin character 4 times in Unicode simply because
it appeared in ASCII, EBCDIC Code Page 300, JIS X 0208,
and GB 2312. The exact same logic was applied to the Han characters
in the various East Asian standards when encoding decisions
were taken about encoding the Han script.

11. In terms of implementation issues, an encoding approach towards
Han characters that did not unify the same characters from
Japanese, Chinese, Korean (and other) source standards would
need to carry expensive (both in memory cost and in maintenance
cost) equivalence tables around merely to make the unification
happen on the fly, before text searching or nearly any other
text process of interest could be done.

12. For more information about how Han characters from different
East Asian legacy source standards were identified as being
the same character for encoding purposes in the Unicode
Standard, see the detailed discussion in
Section 12.1, "Han".

Ideographic Rapporteur Group

Information about the Ideographic Rapporteur Group, which has
primary responsibility for development of the repertoire of
Han ideographic characters for encoding in 10646 (and Unicode):

The Ideographic Rapporteur Group (IRG) is a group reporting to
ISO/IECJTC1/SC2/WG2.
It focuses on the development of ideographic characters (Han
characters used in China, Japan, Korea and other parts of Asia) in the ISO/IEC
10646 standard. Its mission is to submit ideographic characters for
inclusion in the ISO/IEC 10646 standard. The IRG has developed the CJK Unified
Ideographs Block and CJK Unified Ideographs Extensions A through D.
IRG members include China, Hong Kong SAR, Macao SAR, Taipei Computer
Association, Singapore, Japan, South Korea, North Korea, Vietnam and the USA.
Representatives from the Unicode Consortium also
attend IRG meetings for coordinating the synchronization between the ISO/IEC
10646 standard and the Unicode Standard.