The internal
representation of text in Indian languages may be viewed as the problem
of assigning codes to the aksharas of the languages. The complexities of
the syllabic writing systems in use have made it difficult to standardize
internal representations. TeX was an inspiration in the late 1980s, but
TeX is better suited to typesetting than to text processing per se.
In the absence of appropriate fonts, interactive applications could not
be attempted. When fonts did become available, applications simply used
the glyph positions as the codes, and the number of glyphs was restricted
by the eight bit fonts.

The following representations
are still in use, since many applications have been written around one or the other.
It must be remembered that these representations primarily address the
issue of internal representation for rendering text.

Use of Roman letters with diacritic marks

ISCII codes

Unicode for Indian Scripts

ISFOC standard from CDAC

Of the
above, the first has been discussed in the section on Transliteration
principles. The ISFOC standard applies more to the standardization of fonts
for different scripts and cannot really be thought of as an encoding standard.
We confine our discussion in this section to ISCII and Unicode. A brief
note on ISFOC will be found in a separate page.

Indian Script Code for Information Interchange (ISCII)

ISCII was proposed in the eighties and a suitable standard had evolved
by 1991. Here are the salient aspects of the ISCII representation.

It is a single representation
for all the Indian Scripts.

Codes have been assigned in
the upper ASCII region (160 - 255) for the
aksharas of the language.

The scheme also assigns codes
for the Matras (vowel extensions).

Special characters have been
included to specify how a consonant in a syllable should be rendered. Rendering of Devanagari has
been kept in mind.

A special Attribute character
has been included to identify the script to be used in rendering specific
sections of the text.

Shown below is the basic assignment
in the form of a table. There is also a version of this table known
as PC-ISCII, where there are no characters defined in the range 176-223.
In PC-ISCII, the first three columns of the ISCII-91 table have been shifted
to the starting location of 128. PC-ISCII has been used in many applications
based on the GIST Card, a hardware adapter which supported Indian language
applications on an IBM PC. In the table, some code values have not been
assigned. Six columns of 16 assignments each start at the hexadecimal value
of A0, which is equivalent to decimal 160.
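
The table layout described above reduces to simple arithmetic. The following is a minimal Python sketch that locates a code value in the six-column table. It assumes only what is stated here (columns of 16 entries starting at hexadecimal A0); the function name `iscii_table_position` is ours and is not part of the ISCII standard.

```python
ISCII_BASE = 0xA0      # first code of the table (decimal 160)
ROWS_PER_COLUMN = 16   # each column holds 16 assignments
COLUMNS = 6            # six columns cover 0xA0..0xFF

def iscii_table_position(code):
    """Return (column, row) of an ISCII code value, both counted from 0."""
    if not ISCII_BASE <= code < ISCII_BASE + COLUMNS * ROWS_PER_COLUMN:
        raise ValueError("code %d lies outside the ISCII table" % code)
    # Integer division gives the column, the remainder gives the row.
    return divmod(code - ISCII_BASE, ROWS_PER_COLUMN)

print(iscii_table_position(160))   # (0, 0): first entry of the first column
print(iscii_table_position(255))   # (5, 15): last entry of the last column
```

The sketch says nothing about which akshara sits at a given position; for that, the table in the standard remains the reference.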

The following observations
are made.

1. The
ISCII code is reasonably well suited for representing the syllables of
Indian languages, though one must remember that a multiple byte representation
is inevitable, which could vary from one byte to as many as 10 bytes for
a syllable.

2. The
ISCII code has effected a compromise in grouping the consonants of the
languages into a common set that does not preserve the true sorting order
of the aksharas across the languages. Specifically, some aksharas of Tamil,
Malayalam and Telugu are out of place in the assignment of codes.

3. The
ISCII code provides for some tricks to be used in representing some aksharas,
specifically the case of Devanagari aksharas representing Persian letters.
ISCII uses a concept known as the Nukta Character to indicate the required
akshara.

4. When
forming conjuncts, ISCII specifications require that the halanth character
be used once or twice depending on whether the halanth form of the consonant
or half form of the consonant is present. This results in more than one
internal representation for the same syllable. Also, ISCII provides for
the concept of the soft halanth as well as
an invisible consonant to handle representations
of special letters. Parsing a text string made up of ISCII codes is a fairly
complex problem requiring a state machine which is also language dependent.
This is a consequence of the observation that languages like Tamil
do not support conjuncts made up of three or more differing consonants.
In fact it is stated that Tamil has no conjunct aksharas. What is probably
implied here is that a syllable in Tamil is always split into its basic
consonants and the Matra. Several decades ago Tamil writing on palm leaves
did show geminated consonants in a special form.

Though representation at the level of a syllable is possible in ISCII,
processing a syllable can become quite complex, i.e., linguistic
processing may pose specific difficulties due to the variable length codes
for syllables.

5. The
code assignments, though language independent, do not admit of clean and
error free transliteration across languages especially into Tamil from
Devanagari.

6. It
is difficult to perform a check on an ISCII string to see if arbitrary
syllables are present. Though theoretically many syllables are possible,
in practice the set is limited to about 600 - 800 basic syllables which
can also combine with all the vowels. The standard provides for arbitrary
syllables to handle cases where new words may be introduced in the language
or syllables from other languages are to be handled.

It must
be stated here that ISCII represents the very first attempt at syllable
level coding of Indian Language aksharas. Unfortunately, outside of CDAC
which promoted ISCII through their GIST technology, very few seem
to use ISCII.

ISCII codes have nothing to do with fonts and a given text in ISCII may
be displayed using many different fonts for the same script. This will
require specific rendering software which can map the ISCII codes to the
glyphs in a matching font for the script. Multibyte syllables will have
to be mapped into multiple glyphs in a font dependent and language
dependent manner. It is primarily this complexity that has rendered ISCII
less popular. Details of ISCII are covered in the Bureau of Indian Standards
document No. IS:13194-1991.

Shown below are some examples
of strings in Devanagari and other scripts along with their ISCII representations.

Unicode was
the first attempt at producing a standard for multilingual documents. Unicode
owes its origin to the concept of the ASCII code extended to accommodate
International Languages and scripts.

Short
character codes (7 bits or 8 bits) are adequate to represent the letters
of the alphabets of many languages of the world. The fundamental idea behind
Unicode is that a superset of characters from all the different languages/scripts
of the world be formed so that a single coding scheme could effectively
handle almost all the alphabets of all the languages. What this implies
is that the different scripts used in the writing systems followed by different
languages be accommodated in the coding scheme. In Unicode more than 65000
different characters can be referenced. This large set includes not only
the letters of the alphabet from many different languages of the world
but also punctuation, special shapes such as mathematical symbols, Currency
symbols etc. The term Code Space is often used to refer to the full set
of codes and in Unicode, the Code space is divided into consecutive regions
spanning typically 128 code values. Essentially this assignment retains
the ordering of the characters within the assigned group and is therefore
very similar to the ASCII assignments which were in vogue earlier.

Unicode assignments
may be viewed geometrically as a stack of planes, each plane having one
or possibly several chunks of 128 consecutive code values. Logically
related characters or symbols have been grouped together in Unicode to
span one or more regions of 128 code values. We may view these regions
(the Unicode standard itself calls them blocks) as different planes in the
Code Space as illustrated in the figure below.
Data processing software using Unicode will be able to identify the script
of the text for each character by identifying the plane the character is
located in and use an appropriate font to display the same or invoke some
meaningful linguistic processing.
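
For the Indian scripts, finding the region a character belongs to is a matter of integer division. The sketch below assumes the nine Indian script blocks of 128 code values each, which Unicode lays out consecutively from hexadecimal 0900. The function name `indian_script_of` is illustrative only.

```python
# Indian script regions in Unicode: nine consecutive blocks of 128 code
# values starting at U+0900 (block names as used in the Unicode charts).
INDIC_BLOCKS = [
    "Devanagari", "Bengali", "Gurmukhi", "Gujarati", "Oriya",
    "Tamil", "Telugu", "Kannada", "Malayalam",
]
FIRST_INDIC_CODE = 0x0900
BLOCK_SIZE = 128

def indian_script_of(ch):
    """Return the block name for one character, or None outside the nine blocks."""
    index = (ord(ch) - FIRST_INDIC_CODE) // BLOCK_SIZE
    if 0 <= index < len(INDIC_BLOCKS):
        return INDIC_BLOCKS[index]
    return None

print(indian_script_of("\u0915"))  # Devanagari (letter KA)
print(indian_script_of("\u0b95"))  # Tamil (letter KA)
print(indian_script_of("A"))       # None (a Latin letter)
```

Note that the lookup yields a script, not a language: Hindi, Marathi and Sanskrit text would all be reported as Devanagari.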

Technically, Unicode can handle many more languages than the supported
scripts if these languages use the same script in their writing systems.
By consolidating a complete set of symbols used in the writing systems
across a family of languages, one can get a script that caters to all of
them. The Latin script with its supplementary characters and extended symbols
has about 550 different characters and this is quite adequate to handle
almost anything that has appeared in print in respect of the Latin script.
Hence in the geometrical view above, some planes may be larger (wider)
than others and more than one script could have characters from logically
similar groups specified in a plane.

The fact that several
languages/scripts of the world require many more than 128 codes has necessitated
assignments of more than one basic plane (i.e., multiples of 128 code values)
for them. Languages such as Greek, Arabic or Chinese have larger
planes assigned to them. In particular, Unicode has allowed nearly 20000
characters of Chinese, Japanese and Korean scripts to be included in a
contiguous region of the Code Space. Currently fewer than a hundred different
groups of symbols or specific scripts are included in Unicode.

Even though
it was designed as a sixteen bit code and can therefore handle more than 65000 code
values, Unicode should not be viewed as a scheme which allows several thousand
characters for each and every language. That it provides fewer
than 128 characters for many scripts is to be expected, since many
languages do not require more than 128 characters to display text.

In respect of Indian
languages which use syllabic writing systems, one might think that Unicode
would have provided several thousands of codes for the syllables similar
to the nearly 11000 Hangul syllables already included. On the contrary,
Unicode has pretty much accepted the concept behind ISCII and has provided
only for the most basic units of the writing systems which include the
vowels, consonants and the vowel modifiers.
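
As an illustration, the single Devanagari syllable kShi (क्षि) is stored as four basic units rather than as one code. The short Python sketch below lists them using the standard `unicodedata` module; the choice of syllable is ours.

```python
import unicodedata

# The single syllable "kShi", written with escapes so the code points are explicit.
syllable = "\u0915\u094d\u0937\u093f"

# Print each code point with its official Unicode character name.
for ch in syllable:
    print("U+%04X  %s" % (ord(ch), unicodedata.name(ch)))

print(len(syllable), "code points")
print(len(syllable.encode("utf-16-be")), "bytes in UTF-16")
print(len(syllable.encode("utf-8")), "bytes in UTF-8")
```

The loop prints a consonant, the virama sign, a second consonant and a vowel sign, so one displayed shape corresponds to four codes (8 bytes in UTF-16, 12 bytes in UTF-8).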

Unlike ISCII, which
has a uniform coding scheme for all the languages, Unicode has provided
individual planes for the nine major scripts of India. Within these planes
of 128 code values each, assignments are language specific though the ISCII
base has been more or less retained. Consequently, Unicode suffers from
the same limitations that ISCII runs into. There are some questionable
assignments in Unicode in respect of Matras. A Matra is not a character
by itself. It is a representation of a combination of a vowel and consonant,
in other words the representation of a medial vowel. A vowel and NOT its
Matra is the basic linguistic unit. Consequently linguistic processing
of Indian languages will be difficult with Unicode, just as with ISCII.

Here is the
Unicode assignment for Sanskrit (Devanagari). The codes for Sanskrit
(Devanagari) begin with 09 (hex) and span the range 0901 to 097F (hexadecimal
values). In this chart, the characters of Devanagari with a dot beneath
are grouped in the range 0958 to 095F. These are the characters used in
Hindi which are derived from Persian and seen in Urdu as well. Likewise,
the letters in locations 0929, 0931 and 0934 are dotted. The codes are
similar to ISCII in ordering but Unicode includes characters not specified
in ISCII. Also, the assignments for each language more or less adhere to
the same relative locations for the basic vowels and consonants as in ISCII
but include many language dependent codes. The code positions in Unicode
will not exactly match the corresponding ISCII assignments.
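
The relation between the dotted letters and their undotted bases can be inspected with Python's standard `unicodedata` module. The sketch below applies canonical decomposition (NFD) to the code points named above; each one splits into a base consonant followed by the nukta sign, U+093C.

```python
import unicodedata

NUKTA = "\u093c"  # DEVANAGARI SIGN NUKTA

# Dotted letters named in the text: 0929, 0931, 0934 and the range 0958-095F.
dotted = [0x0929, 0x0931, 0x0934] + list(range(0x0958, 0x0960))

for cp in dotted:
    # NFD splits a precomposed letter into its base letter and combining marks.
    parts = unicodedata.normalize("NFD", chr(cp))
    base = parts[0]
    print("U+%04X = U+%04X + U+%04X  (%s)" % (
        cp, ord(base), ord(parts[1]), unicodedata.name(base)))
```

This also shows that the same dotted letter may arrive either as one code or as two, which processing software has to treat as equivalent.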

Shown below are the
Unicode representations for some strings in different scripts. These are
the same strings shown earlier under ISCII.

From the
discussion above, it will be seen that ISCII and Unicode provide multibyte
representations for syllables. This is not unlike the case for English
and other European languages where syllables are shown only with the basic
letters of the Alphabet. However, in all the writing systems used in India,
each syllable is individually identifiable through a unique shape and one
has to provide for thousands of shapes while rendering text.

While these thousands
of shapes may be composed from a much smaller set of basic shapes for the
vowels, consonants and vowel modifiers, one must admit that several hundreds
of syllables have unique shapes which cannot be derived by putting together
the basic shapes. It is estimated that in practice, more than 600 different
glyphs would be required to adequately represent all the different syllables
in most of the scripts. The main problem of dealing with Unicode for Indian
languages/scripts has to do with the mapping between a multibyte code for
a syllable and its displayed shape. This is a very complex issue requiring
further understanding of rendering rules. As such a full discussion of
this would require that the viewer understand the intricacies of the writing
systems of India. We cover this in a separate page.

It must be observed, in the light of the above discussion, that displaying
a Unicode string in an Indian language requires a complex piece of processing
software to identify the syllables and get the corresponding glyphs from
an appropriate font for the script. The multibyte nature of Unicode (for
a syllable) makes a table driven approach to this quite difficult. Even
though it is possible to write such modules which can go from Unicode to
the display of text using some font, one faces a formidable problem in
respect of data entry, where formation of syllables from multiple key sequences
is truly overwhelming. With the limited number of keys available on standard
keyboards, it is often not possible to accommodate all the symbols one
would require to produce meaningful printouts in each script consistent
with quality typesetting systems.
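
To give a feel for the first step, identifying syllables, here is a minimal Python sketch that splits a Devanagari string into syllable-sized units with a regular expression. It is a simplification under stated assumptions: it covers only the core consonants, vowel signs and modifiers of the Devanagari range, ignores joiner characters and script-specific exceptions, and the name `split_syllables` is ours. A production renderer needs considerably more rules.

```python
import re

CONSONANT = "[\u0915-\u0939\u0958-\u095f]\u093c?"  # consonant, optional nukta
VIRAMA = "\u094d"                                  # halant
MATRA = "[\u093e-\u094c]"                          # dependent vowel signs
MODIFIER = "[\u0901-\u0903]"                       # chandrabindu, anusvara, visarga
VOWEL = "[\u0904-\u0914]"                          # independent vowels

SYLLABLE = re.compile(
    # zero or more (consonant + virama), a final consonant,
    # then an optional virama or vowel sign, then any modifiers
    "(?:%s%s)*%s(?:%s|%s)?%s*" % (
        CONSONANT, VIRAMA, CONSONANT, VIRAMA, MATRA, MODIFIER)
    # or an independent vowel with modifiers
    + "|%s%s*" % (VOWEL, MODIFIER)
    # anything else (spaces, digits, punctuation) stands alone
    + "|.",
    re.DOTALL,
)

def split_syllables(text):
    """Return the list of syllable-sized substrings of text."""
    return SYLLABLE.findall(text)

word = "\u0938\u0902\u0938\u094d\u0915\u0943\u0924"  # the word "Sanskrit" in Devanagari
print([len(s) for s in split_syllables(word)])       # [2, 4, 1]
```

Seven codes become three syllables of two, four and one codes respectively, which is the variable-length grouping a renderer must perform before it can pick glyphs.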

Unicode based applications
employ the concept of "Locales" to permit data entry of multilingual text.
Each Locale is associated with its own keyboard mapping and application
software can switch Locales to permit data entry of multilingual text.
It will be seen that for Indian scripts, the Locales themselves have limitations
since they do not permit a full complement of letters and special
characters to be typed in, much less the standard punctuation that has
become part of Indian scripts today.

While
it is possible to write special keyboard driver programs which implement
a state machine to handle key sequences to produce syllables, the approach
is not universal enough to be included in the Operating Systems, certainly
not when a single driver should cater to all the Indian scripts.
There is little point in having a Hindi version of an OS with its own data entry
convention which differs substantially from that of a Tamil or Telugu version.
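
A state machine of this kind can be sketched in a few lines of Python. The example below is purely illustrative: the Roman-letter key layout, the handful of keys covered and the name `compose` are hypothetical, and only Devanagari is handled. It tracks a single piece of state, namely whether the last consonant is still waiting for its vowel.

```python
VIRAMA = "\u094d"

# Hypothetical Roman-key layout for a few Devanagari consonants.
CONSONANT_KEYS = {"k": "\u0915", "t": "\u0924", "n": "\u0928",
                  "m": "\u092e", "r": "\u0930", "s": "\u0938"}
# Each vowel key maps to (independent vowel, matra); the inherent "a" has no matra.
VOWEL_KEYS = {"a": ("\u0905", ""), "i": ("\u0907", "\u093f"),
              "u": ("\u0909", "\u0941"), "e": ("\u090f", "\u0947"),
              "o": ("\u0913", "\u094b")}

def compose(keys):
    """Turn a sequence of key presses into a string of Unicode codes."""
    out = []
    open_consonant = False  # True while a consonant awaits its vowel
    for key in keys:
        if key in CONSONANT_KEYS:
            if open_consonant:
                out.append(VIRAMA)      # join with the previous consonant
            out.append(CONSONANT_KEYS[key])
            open_consonant = True
        elif key in VOWEL_KEYS:
            independent, matra = VOWEL_KEYS[key]
            out.append(matra if open_consonant else independent)
            open_consonant = False
        else:
            raise ValueError("unmapped key: %r" % key)
    if open_consonant:
        out.append(VIRAMA)              # a trailing bare consonant gets a virama
    return "".join(out)

print(compose("namaste"))
```

Even this toy version bakes in script-specific decisions (which keys exist, how a trailing consonant is treated), which is the reason a single driver does not carry over unchanged from one script to another.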

Here is a summary
of the issues that confront us when dealing with Unicode for Indian scripts.

Rendering text in a manner that
is uniform across applications is quite difficult. Windowing applications
with cut, copy and paste features suffer due to problems in correctly identifying
the width of each syllable on the screen. Also, applications have to worry
about specific rendering issues when modifier codes are present. A separate
page illustrates, with examples, how applications run into difficulties in
rendering even simple strings.

Interpreting the syllabic content
involves context dependent processing, that too with a variable number
of codes for each syllable.

A complete set of symbols used
in standard printed text has not been included in Unicode for almost all
the Indian scripts.

Displaying text in scripts other
than those Unicode supports is not possible. For instance, many of the scripts
used in the past such as the Grantha Script, Modi, Sharada etc., cannot
be used to display Sanskrit text. This will be a fairly serious limitation
in practice when thousands of manuscripts written over the centuries have
to be preserved and interpreted.

Transliteration across Indian
scripts will not be easy to implement since appropriate symbols currently
recommended for transliteration are not part of the Unicode set. In the
Indian context, transliteration is very much a requirement.

The Unicode assignments bear
little resemblance to the linguistic base on which the aksharas of Indian
scripts are founded. While this is not a critical issue,
it is desirable to have codes whose values are based on some linguistic
properties assigned to the vowels and consonants, as has been the practice
in India.

Details
of Unicode for Indian scripts have been published in the standard available
from the Unicode consortium. The Unicode
web site does have useful information but one will have to resort to
the printed text to get the real details. These are also available in PDF
format from the above web site.

The answer
is certainly No. The main purpose of Unicode is to transport information
across computer systems. As of today, Unicode is reasonably adequate to
do this job since it does provide for representing text at the syllable
level, though not in fixed size units (bytes).

Applications
dealing with Indian Languages will have to include a special layer which
transforms Unicode text into a more meaningful representation for linguistic or
text processing purposes. The point to keep in mind is that the seven bit
ASCII based representation for most world languages serves both purposes
well, i.e., not only are text strings transferable across systems, but linguistic
processing is consistent with the seven bit representation. It so happens
that the phonetic nature of our Indian Languages has necessitated a different
representation for linguistic analysis.

With the majority of the languages
of the world, which use a relatively small set of symbols to represent
the letters of their alphabet, 8 bit (or even 7 bit) character codes are
adequate to represent the letters.

Please refer to the FAQ
provided at the Unicode web site which provides answers to some of
the questions raised here. The real issue to understand
is whether Unicode is adequate from the point of view of efficient text
processing of Syllables so that one may attempt meaningful processing of
text in Indian languages, consistent with the syllabic writing system.
