Understanding characters, keystrokes, codepoints and glyphs

Encoding and working with multilingual text

Peter Constable, 2003-05-28

Software systems that are used for working with multilingual data are evolving, and it is increasingly important for users and support personnel to have an understanding of how these systems work. This chapter explains some of the most basic concepts involved in working with multilingual text: characters, keystrokes, codepoints, and glyphs. Each notion is explained, as is the way they relate to one another and interact within a computer system.

1 Introduction

Computer users working with multilingual text data face many challenges, especially when minority languages or non-Roman scripts are involved. Commercial operating system and application software is evolving and becoming more sophisticated in handling multilingual data. For many, this has resulted in workable solutions. For others working with minority languages, problems often still remain. Indeed, for some, current software has only seemed to make it harder to work with their minority language data.

Part of the problem is that the computer industry is still evolving in areas related to proper handling of multilingual data. The largest vendors have begun to implement good, multilingual-capable systems, but many smaller vendors are still creating products designed for only one language or for a very limited set of languages. A bigger part of the problem is that, for years, there were not adequate industry standards for dealing with multilingual data, and users became accustomed to cobbling together anything that would get their information onto a printed page. This usually involved creating “hacked” solutions that broke (whether knowingly or unknowingly) whatever standards may have existed.

While industry standards for multilingual data have advanced significantly in recent years, many users are still using “hacked” solutions. For many minority languages, industry standards are still not fully adequate. In those situations, users have no option but to find whatever customized solution will get the job done. These users face the biggest obstacles, as they try to do things that software developers never expected their software to do.

The current situation is, therefore, quite varied: some users have their multilingual data needs adequately met by recent software using up-to-date multilingual standards and technologies. Others could be using such software, but are still using older systems based on non-standard solutions, sometimes because they don’t realise they have an alternative, or because they need to maintain existing data. Some are caught in between, having to share data between users in the previous two scenarios—and often facing particular difficulties in doing so. A few are trying to push the envelope of current technologies and standards, trying to work with languages that are just beyond the limit of what the current software, technologies or standards were designed to handle. Yet others continue to use non-standard implementations because they simply have no alternative: the current standards and technologies have not yet advanced far enough to include the needs of the language they are working with. Those that provide computer support are in the middle of this confusion, having to help users in very different situations.

For all of these people, it can be very helpful to have a better understanding of some of the basic issues involved in working with multilingual data. Sometimes, people can have a pretty good understanding of the issues but have difficulty communicating with others because of confusion in terminology. Often, though, the issues are simply not well understood. This document attempts to help by explaining the most fundamental of these issues: characters, and their relationships to keystrokes, codepoints and glyphs.

2 Characters

There are, in fact, different senses of the word character that are important for us. In common usage, though, the distinctions are not typically recognised. These differences must be understood in working with multilingual software technologies.

2.1 Orthographies, characters and graphemes

The first and most common sense of the term character has to do with orthographies and writing systems: languages are written using orthographies,1 and a character in this first sense, an orthographic character, is an element within an orthography. For example, the lower case letter “a” used to write English, the letter “” used for Tai Lue, and the IPA symbol for a voiced, inter-dental fricative, ð, are characters.

It is easy to provide clear examples of characters in this sense of the word. Providing a formal definition is not so easy, though. To see why, let’s consider some of the things that can occur in an orthography.

Some orthographies contain elements that are complex, using multiple components to write a single sound. For example, in Spanish as well as in Slovak, “ch” functions as a single unit. This is an example of what is sometimes called a digraph. Some languages may have orthographies with even more complex elements. For instance, the orthographies of some languages of Africa have elements composed of three letters, such as “ngb”. Such combinations of two or more letters or written symbols that are used together within an orthography to represent a single sound are sometimes referred to as multigraphs or polygraphs.

Also, many languages use dependent symbols known as accents or diacritics. These are common in European languages; for example, in Danish “ë” and “å”, or French “é” and “ô”.

So, are multigraphs one character or several characters? And are the diacritics by themselves considered characters? There are not always clear answers to these kinds of questions. For a given written symbol, different individuals or cultures may have different perceptions of that symbol based on their own use of it. Speakers of English would not recognise the dot in “i” as a character, but they also would not hesitate to acknowledge the components of the digraph “th” as characters since “t” and “h” function independently in English orthography. The case of “th” might not be as clear to users of another language if, say, that language does not make independent use of “h”. Likewise, English speakers would probably not be as confident in commenting about the ring in “å”.

We might avoid this uncertainty by using a distinct term, grapheme: a grapheme is anything that functions as a distinct unit within an orthography. By this definition, the status of multigraphs is clear: multigraphs, such as Spanish “ch” and “ngb” in the orthography of some Bantu language, are all graphemes.2

Do base plus diacritic combinations constitute graphemes? Well, this depends upon the particular orthography—recall that the definition given above for grapheme is orthography-dependent. In Northern Sámi, for example, the combination “á” functions as an independent unit: it is enumerated separately in the alphabet and has its own place in the sort order. Thus, it is considered a grapheme in the orthography of that language. In contrast, “á” is considered to be a variant of “a” in Danish, and so is not a separate grapheme in that language.

What about the diacritics by themselves, without the base: are they graphemes? Again, this depends upon whether they function as distinct units within an orthography. For some tonal languages of Africa, acute and grave diacritics may function this way, having distinct identities and purposes, and functioning in distinct ways in terms of operations such as sorting. In these cases, they might be considered distinct graphemes.

The notion of grapheme is important for us. Obviously, though, it would still be helpful to be able to talk about things like the “h” in “th” or the ring diacritic in general terms, even if they don’t correspond to a grapheme in a given orthography. The best we can do for the moment is to have an approximate, informal definition: when speaking in terms of writing systems and orthographies, a character (or orthographic character) is a written symbol that is conventionally perceived as a distinct unit of writing in some writing system. This would include clear cases, such as the letters that make up the English alphabet (regardless of whether a given occurrence is a complete grapheme or part of a multigraph). There are also clear cases of written marks that would be excluded, such as the dot in “i”. In the less clear cases (certain diacritics, perhaps), our informal definition is inconclusive (but gives us the freedom to call something a character if it is helpful to do so).

While we will not try to formalise this general sense of character from the domain of writing (though we might use it informally in this way), this basic notion can be formalised within the domain of information systems and computers in a very useful and important way. It is this sense of character that we will discuss next.

2.2 Characters as elements of textual information

There is an important sense of the term character that is applicable within the domain of information systems and computers: a minimal unit of textual information that is used within an information system. In any given case, this definition may or may not correspond exactly with either our informal sense of the term character (i.e. orthographic character) or with the term grapheme. This will be made clearer as we consider some examples.

Note that this definition for character is dependent upon a given system. Just as the definition we gave for grapheme was dependent upon a given orthography, such that something might be a grapheme in one orthography but not another, so also something may exist as a character in one information system but not another.

For example, a computer system may represent the French word “hôtel” by storing a string consisting of six elements with meanings suggested by the sequence <h, o, ^, t, e, l>. Each of those six component elements, which are directly encoded in the system as minimal units, is a character within that system.

Note that a different system could have represented the same French word differently by using a sequence of five elements, <h, ô, t, e, l>. In this system, the O-CIRCUMFLEX3 is a single encoded element, and hence is a character in that system. This is different from the first system, in which O and CIRCUMFLEX were separate characters.

Given this definition of character as a minimal unit of textual information, how do diacritics and multigraphs fare? We have seen in the case of the CIRCUMFLEX that the treatment of diacritics depends upon a given system, and the same is true for any other diacritic. It also applies to multigraphs: a system could represent “ngb” as a single unit of textual information—i.e. a single character. It could equally represent it as a sequence of three characters, <N, G, B>, however. The two systems are different, but both are possible and, all other things being equal, neither is necessarily preferred over the other.
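The two representations of “hôtel” described above can be observed directly in Unicode, which happens to allow both. The following is a minimal sketch in Python using the standard unicodedata module; the variable and function names are illustrative assumptions, and the two normalization forms (NFC and NFD) merely stand in for the two hypothetical systems discussed in the text.

```python
import unicodedata

# "hôtel" written with a single precomposed O-CIRCUMFLEX character (U+00F4).
word = "h\u00f4tel"

# NFC prefers precomposed characters; NFD splits them into base + combining mark.
composed = unicodedata.normalize("NFC", word)
decomposed = unicodedata.normalize("NFD", word)


def character_names(text):
    """Return the Unicode name of every character in the string."""
    return [unicodedata.name(ch) for ch in text]


print(len(composed), character_names(composed))      # five characters
print(len(decomposed), character_names(decomposed))  # six characters

# The two strings are different character sequences for the same word.
print(composed == decomposed)
```

Both strings display identically when rendered correctly, yet a naive comparison treats them as different data, which is one reason the choice of abstract characters matters to processes other than display.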

Up to now the characters we have considered are all visible, orthographic objects (or are direct representations of such graphical objects within an information system). In using computers to work with text, we also need to define other characters of a more abstract nature that may not be visible objects. The SPACE character may or may not be considered to be visible, depending upon your perspective, but the following are not likely to come immediately to mind when one thinks of characters:

horizontal tab

carriage return

line feed

These are just a few of several abstract “control” characters that may be used behind the scenes when working with text within a given computer system. Abstractness is an important aspect of characters, in this sense of the term, as we explain in the next section.
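As a small illustration, the control characters listed above can be inspected like any other character, even though they have no visible shape. This sketch uses Python and its standard unicodedata module; the dictionary name is an assumption made for the example.

```python
import unicodedata

# The three control characters named in the text, plus SPACE for comparison.
samples = {
    "horizontal tab": "\t",
    "carriage return": "\r",
    "line feed": "\n",
    "space": " ",
}

# For each one, record its stored number and its Unicode general category.
# "Cc" means "control"; "Zs" means "space separator".
control_info = {
    label: (ord(ch), unicodedata.category(ch)) for label, ch in samples.items()
}

for label, (number, category) in control_info.items():
    print(f"{label}: number {number}, category {category}")
```

Each of these is a full character as far as the system is concerned: it occupies a position in the stored text and has its own number, whether or not anything is drawn for it.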

In technical discussions related to information systems, in talking about multilingual software, for example, it is the sense of the term character discussed in this section that is usually assumed. From here on, we will adopt that usage, referring to (abstract) characters as meaning units of textual information in a computer system, and using the term grapheme when talking about units within orthographies. Thus, we might say something like, “The Dutch grapheme ‘ij’ is represented in most systems as a character sequence, <i, j>, but in this system as a single character, <ij>.” Where we wish to speak of (orthographic) characters in the informal sense discussed above, we will state that explicitly.

2.3 The relationship between graphemes and abstract characters for textual representation

Graphemes and orthographic characters are fairly concrete objects, in the sense that they are familiar to common users—non-experts, who are typically taught to work in terms of them from the time they first learn their “ABCs” (or equivalent from their writing system, of course).

In the domain of information systems, however, we have a different sense of character: abstract characters which are minimal units of textual representation within a given system. These are, indeed, abstract in two important senses: first, some of these abstract characters may not correspond to anything concrete in an orthography, as we saw above in the case of HORIZONTAL TAB. Secondly, the concrete objects of writing (graphemes and orthographic characters) can be represented by abstract characters in more than one way, and not necessarily in a one-to-one manner, as we saw above in the case of “ô” being represented by a sequence <O, CIRCUMFLEX>.

In developing a system for working with multilingual text, it is important to understand the distinction between abstract characters and graphemes. We implement systems to serve the needs of users, and users think in terms of the concrete objects with which they are familiar: the graphemes and orthographic characters that make up orthographies. They do not need to be aware of the internal workings of the system, provided it behaves as they expect.

In other words, it does not matter what abstract characters are used internally to represent text, just so long as users get the behaviour and results they expect. When possible, it certainly makes sense to use abstract characters that correspond closely to the orthographic elements that users expect to behave as units. Potentially, though, the characters may be somewhat different from the elements of the orthography, just so long as the system can be implemented to give the right behaviour. To explain how this might be possible, though, we first need to understand the key components that are used in a computer system for working with text. This will be done by examining keystrokes, codepoints, and glyphs.

3 Codepoints and glyphs

Within a computer system, we work with text and writing systems primarily in terms of three main components:

input methods: how we create the data, typically using keyboards

encodings: how the data are stored, and

fonts & rendering: how the data are displayed.

We have talked about graphemes and abstract characters as individual units. We need to look at their counterparts in these three components, and understand how they interact. These counterparts are keystrokes, codepoints, and glyphs. In this section, we will introduce codepoints and glyphs, and look at how they interact within a system.

3.1 Codepoints

Computers store information internally as numbers. A codepoint is merely the number that is used to store an abstract character in the computer.4 When working with text, each abstract character of the text (including control characters) is stored as a number, with a unique number assigned to each different character. For example, whenever you enter “SHIFT + F” on an English keyboard, the computer will (on most systems) insert the number 70 at the current position in the data. This is the codepoint that is used on those systems for storing the character CAPITAL LETTER F.

An encoding (or character encoding) is a system of numbers used to represent a set of characters within a digital information system, such as a computer. There is, in principle, nothing special about the numbers that are used. For instance, in the example above, there is no a priori reason that the number 42 could not have been used to represent CAPITAL LETTER F. The actual numbers that are used are specified by the encoding designer. There are only two necessary restrictions:

Each abstract character that is represented in the encoding system must have exactly one numerical representation—one codepoint.

In order for users to exchange data, their computer systems must agree on what the meaning is for a given number.

To achieve the latter end, encoding standards are devised, either by individual vendors, or across the entire industry. Two important examples of encoding standards are ASCII and Unicode. Every DOS, Windows and Macintosh computer understands the ASCII encoding standard and would know, for example, that the codepoint 104 corresponds to the character SMALL LETTER H (“h”).
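The relationship between characters and codepoints, and the need for agreement on an encoding, can be sketched in a few lines of Python. The built-in functions ord and chr expose the standard codepoints; the toy_encoding table below is purely hypothetical and exists only to show what happens when two systems disagree.

```python
# Standard codepoints, as used by ASCII (and by Unicode, which extends it).
codepoint_of_f = ord("F")       # the number stored for CAPITAL LETTER F
character_104 = chr(104)        # the character stored as the number 104
stored_bytes = list("F".encode("ascii"))

print(codepoint_of_f, character_104, stored_bytes)

# A hypothetical private encoding in which the designer chose 42 for "F".
# Each character has exactly one number, satisfying the first restriction.
toy_encoding = {"F": 42, "h": 7}
toy_decoding = {number: char for char, number in toy_encoding.items()}

data = [toy_encoding["F"]]

# A system that knows the toy encoding recovers the intended character.
read_with_toy_table = "".join(toy_decoding[n] for n in data)

# A system that assumes ASCII interprets the same number differently.
read_as_ascii = "".join(chr(n) for n in data)

print(read_with_toy_table, read_as_ascii)
```

The second restriction is the one that fails here: the number 42 is perfectly usable for CAPITAL LETTER F, but only among systems that share that convention.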

The numerical value of codepoints can be expressed in different ways. Most computer users are aware that computers store numbers in binary rather than decimal. Thus, “70” to us would be “01000110” to a computer. Programmers often use a system known as hexadecimal, or hex. Thus, “70” to the man on the street would also be “x46” to a programmer. Any advanced computer user ought to be at least familiar with hex notation, and anyone who expects to be doing a lot of work at the level of codepoints and encodings needs to be able to work with it. For example, in any discussion of Unicode, codepoint values are almost always expressed using hex.
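For readers who want to check the notations for themselves, here is a short Python sketch that converts the codepoint for CAPITAL LETTER F between decimal, binary and hex. The variable names are illustrative; the "U+" form shown last is the conventional way of writing Unicode codepoints in hex.

```python
value = ord("F")  # decimal 70

binary_text = format(value, "08b")   # eight binary digits
hex_text = format(value, "X")        # hexadecimal digits, upper case

# Unicode codepoints are conventionally written as "U+" plus at least four hex digits.
unicode_notation = f"U+{value:04X}"

print(value, binary_text, hex_text, unicode_notation)

# Converting back from the textual notations gives the same number.
print(int(binary_text, 2), int(hex_text, 16))
```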

3.2 Glyphs

Glyphs are the graphical elements used to visually represent characters. Because of their graphical nature, a user is likely to associate them closely with the (relatively) concrete objects in the domain of writing and orthographies. For our purposes, the notion of glyph has an additional, specific characteristic: glyphs are graphic objects that are stored within a font. Basically, they are the shapes for displaying characters that you see on a screen or a printer. In a simple sense, then, a font is simply a collection of glyphs, usually with common design characteristics.5 Since glyphs are contained within fonts, which are part of a computer system, glyphs are therefore a component within the domain of information systems, like abstract characters.

So, at the basic level, a glyph is different from a grapheme in that one is a graphic object located in a font within an information system, while the other is an element within an orthography. But there are other important differences reflected in the fact that graphemes and glyphs do not correspond to each other in terms of one-to-one relationships. This is easily seen in the case of a multigraph: for instance, we would expect a grapheme “ngb” to be displayed using a sequence of three glyphs. Of course, this is reminiscent of our discussion about the relationships between graphemes and abstract characters, or graphemes and the informally-defined notion of orthographic character from the domain of writing (as discussed in Section 2). Thus, it may be more interesting to consider how glyphs relate to abstract or orthographic characters, and whether or not glyphs and characters are merely different ways of conceiving of a single notion. In this section, we will consider glyphs in relation to orthographic characters as objects from the domain of writing systems and orthographies that users recognise as distinct elements. We will make the discussion more formal in the next section, in which we consider the relationship between glyphs and abstract characters of textual representation.

It is not difficult to show that the notions of character and glyph are different. For example, the English character “a” can be displayed using any of a number of different glyphs:

Figure 1. Different fonts: one character, different glyphs

You might wonder if we couldn’t replace glyph with another notion that was font-independent and end up with something equivalent to character (in whatever sense of that term). The answer is “no”. The reason is that, in some scripts, characters can have more than one shape due to certain behaviours of the script. This has nothing to do with changing fonts. For example, in Greek script, the sigma has two different shapes, according to its position within a word.

Figure 2. Greek sigma: one abstract character, two glyphs

There is another way in which a single character may correspond to more than one glyph. In the Greek example, sigma can be displayed by more than one glyph, but in each instance only one glyph is used. There can be situations in which a single character is displayed by multiple glyphs in every instance of its use. For example, Indic scripts of South and Southeast Asia are well known for having vowels that are written using multiple shapes that are distributed around the shape for the initial consonant. So, for example, in the following Bengali-script example, the two highlighted shapes represent a single instance of one character (one grapheme), the vowel o:

Figure 3. Bengali vowel o: one character displayed using two discontiguous glyphs

We have seen that one character can have many glyphs. The opposite is also possible: one glyph for multiple characters. In Lanna script, when the character is followed by the character , they may be written as . In this case, we have two characters that are presented by a single shape, forming what is known as a ligature.

Figure 4. Lanna ligature: two characters, one glyph

These examples suggest that the number of glyphs is determined by the character elements in an orthography and by their behaviours. That is largely true, but not necessarily so. The glyphs used in a font are determined by the font designer, and a font designer may choose to implement behaviours in different ways. For example, for Greek, a font designer may choose to present “ά” using a single, composite glyph, or by using two glyphs, one for alpha and another for the over-striking diacritic:

single, composite glyph

alpha glyph + overstriking oxia

Figure 5. Alternate glyph implementations for Greek alpha with oxia

Some font implementations may even use glyphs that only represent a portion of the written symbols in the orthography:

Figure 6. Glyphs for portions of a Gujarati character

These examples raise some important questions: Does this mean that, within a computer system, there can be a mismatch between the characters that are stored and the glyphs that are needed to display them? If so, how is this handled? This brings us to the general issue of how glyphs relate to characters within an information system, which we will explore in the next section.

3.3 From codepoints to glyphs

In this section, we will explain what happens within a computer with regard to characters and glyphs. We are now discussing implementation of information systems and computers, and so character here is once again used in the second sense that was described, that of an abstract unit of textual information.

Textual information is stored within a computer as codepoints that directly correspond to abstract characters. In a process known as rendering, software will take a sequence of codepoints and transform that into a sequence of glyphs displayed on some presentation device (a monitor or a printer).

Let’s consider a simple example: the English word “picture”. As I created this document on my system, that word was stored on my computer as a string of seven characters, <p, i, c, t, u, r, e>, and was displayed on my monitor using seven glyphs selected from a particular font and arranged horizontally (using information also found in the font to control relative positioning). In this case, there was a simple one-to-one mapping between the codepoints in my data and the glyphs in the font.

That much was fairly obvious. What is more interesting is what happens in the more complicated situations described above, in which there is not a one-to-one relationship between “characters” and glyphs. In general, the answer is that it depends upon the given system. But to see what might possibly happen, let’s consider the same English example again, yet with a twist.

Suppose I am a font designer, and I want to create a font that closely mimics my handwriting. Of course, I will write English letters in many different ways, and I can’t capture every subtle variation. If I am willing to stay within some limits, though, perhaps I can have my font show each letter the way I might typically write it within a certain set of combinations. So, for example, I might write “c” with different types of connection on the left: in some instances with no connection (at the beginning of words, say), or with a connection near the bottom (after letters that terminate near the bottom, such as “a”), or in other instances with a connection near the top (for instance, after “o”). As I work through all the details, I might actually decide that the only way to really get the word “picture” to display the way I want is to create it as a single, complex glyph for the entire word. (This may seem unusual, but such things are possible in fonts, and some fonts have even done things like this.) So, I have a single glyph for “picture”, but this is stored in data as a sequence of seven characters. What I need, then, is for some process to intervene between the stored data and the glyphs that will recognise this particular sequence of characters and select the single, combined glyph, rather than seven separate glyphs.

This is precisely the kind of processing that happens in modern systems that are designed to support complex scripts. These systems are sometimes referred to as “smart font” or “smart rendering” systems. Examples include Apple’s TrueType GX, which has more recently been renamed as Apple Advanced Typography (AAT); the OpenType font standard, developed by Microsoft and Adobe; and SIL’s Graphite rendering system. It would go beyond the scope of this discussion to examine how these systems work in any detail. The main point to grasp is that they mediate between characters that are stored and the glyphs used to display them, and allow there to be complex processes that give many-to-many mappings between sequences of characters and sequences of positioned glyphs.
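The handwriting-font example can be sketched as a toy substitution step that sits between stored characters and glyphs. This is only an illustration in Python under stated assumptions: the function render_glyphs, the substitution table and the glyph name "picture.whole" are all hypothetical, and real systems such as AAT, OpenType and Graphite use their own, far richer mechanisms.

```python
# Hypothetical table mapping character sequences to a single glyph name.
WHOLE_WORD_GLYPHS = {"picture": "picture.whole"}


def render_glyphs(text, substitutions):
    """Turn a character string into a list of glyph names.

    Sequences listed in `substitutions` become one glyph; every other
    character falls back to a one-to-one mapping (glyph named after itself).
    """
    glyphs = []
    # Prefer the longest matching sequence at each position.
    sequences = sorted(substitutions, key=len, reverse=True)
    position = 0
    while position < len(text):
        for sequence in sequences:
            if text.startswith(sequence, position):
                glyphs.append(substitutions[sequence])
                position += len(sequence)
                break
        else:
            glyphs.append(text[position])
            position += 1
    return glyphs


# A "dumb" renderer: seven characters, seven glyphs.
print(render_glyphs("picture", {}))

# With the substitution table: seven characters, one glyph.
print(render_glyphs("picture", WHOLE_WORD_GLYPHS))
print(render_glyphs("pictures", WHOLE_WORD_GLYPHS))
```

The stored data is the same seven characters in both cases; only the mapping applied at display time differs.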

So let’s revisit the realistic examples presented above, and consider how the rendering process might apply in those cases. First, we saw that the Greek sigma is displayed using different shapes according to word position. Within a system, the single grapheme sigma can be represented as a single character, SIGMA, and a rendering process will determine when it does or does not occur at the end of a word and select glyphs on that basis.

In the case of Bengali, a similar process may occur: the system may store a sequence of two characters, <LETTER KA, VOWEL O>, and the rendering process will somehow transform that into the appropriate sequence of glyphs. The actual number of glyphs would be dependent upon a particular font implementation. It could be one composite glyph for the entire combination of characters. More likely, though, it would be rendered using three glyphs. Note, though, that the ordering of the three glyphs does not correspond to the ordering of the stored characters.

stored characters:

displayed glyphs:

Figure 8. Complex rendering process for Bengali <LETTER KA, VOWEL O>

Similar processing could occur in rendering the Lanna ligature. In that case, an implementation will likely involve two stored characters displayed using a single glyph.
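The Greek and Bengali cases above can be sketched together as a toy rendering function. This is a simplified illustration in Python, not a description of any real rendering engine: the function name, the glyph names and the rule for detecting the end of a word are all assumptions made for the example, and the three-glyph Bengali output follows the "more likely" case described in the text.

```python
SIGMA = "\u03c3"    # the single stored SIGMA character in this sketch
KA = "\u0995"       # Bengali LETTER KA
VOWEL_O = "\u09cb"  # Bengali VOWEL O, stored after the consonant


def render_complex(text):
    """Map stored characters to a list of hypothetical glyph names."""
    glyphs = []
    for index, ch in enumerate(text):
        if ch == SIGMA:
            # Context-based selection: one character, two possible glyphs.
            at_word_end = index + 1 == len(text) or not text[index + 1].isalpha()
            glyphs.append("sigma.final" if at_word_end else "sigma.medial")
        elif ch == VOWEL_O and glyphs:
            # One character, two glyphs placed on either side of the consonant.
            consonant = glyphs.pop()
            glyphs += ["vowel_o.left", consonant, "vowel_o.right"]
        elif ch == KA:
            glyphs.append("ka")
        else:
            # Simple one-to-one fallback: glyph named after the character.
            glyphs.append(ch)
    return glyphs


# Greek word stored with the same SIGMA character in every position.
greek_word = "\u03c3\u03b5\u03b9\u03c3\u03bc\u03cc\u03c3"
print(render_complex(greek_word))

# Bengali <LETTER KA, VOWEL O>: glyph order differs from stored order.
print(render_complex(KA + VOWEL_O))
```

In both cases the stored characters stay close to what the user thinks of as units of the orthography, and the complexity is confined to the rendering step.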

In these examples, we have described one way in which support for each case can be implemented. But, as has been mentioned, the actual glyphs and number of glyphs can vary from one implementation to another. Recall, too, from Section 2.2 that different systems might represent the same orthographic content in different ways by using different abstract characters; for example, “ô” being represented either as a single character O-CIRCUMFLEX or as a sequence of characters, <O, CIRCUMFLEX>. The same is true for the examples discussed in this section.

In the case of Bengali, for instance, we have seen that the grapheme for the /o/ vowel (“ো”) can be represented in terms of a single character, VOWEL O. But another system could perhaps implement support for this grapheme using a pair of characters, <ে, া>. This might make particular sense if each of these corresponded to other graphemes in the orthography or script being implemented. For Bengali script, it turns out that these do have separate identities. Thus, many systems would represent “ো” using a character sequence of <VOWEL E, VOWEL AA>. Of course, such a difference would have an effect on how the rendering process needs to operate in order to generate the correct sequence of glyphs. The glyphs used to display this character sequence are still at the discretion of the font designer, however. It is still possible, for instance, to choose between a sequence of three glyphs or a single, composite glyph, and to have the rendering system handle the transformation from characters to glyphs.
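Unicode is one encoding in which both representations of the Bengali /o/ vowel exist side by side, so the alternative can be demonstrated with Python's standard unicodedata module. This is a minimal sketch; the variable names are assumptions, and normalization is used here only as a convenient way to move between the one-character and two-character forms.

```python
import unicodedata

KA = "\u0995"       # BENGALI LETTER KA
VOWEL_O = "\u09cb"  # BENGALI VOWEL SIGN O as a single character

# Canonical decomposition splits the vowel into two separately encoded characters.
vowel_parts = unicodedata.normalize("NFD", VOWEL_O)
part_names = [unicodedata.name(ch) for ch in vowel_parts]
print(len(vowel_parts), part_names)

# The same syllable stored in the two different ways.
one_character_form = KA + VOWEL_O
two_character_form = KA + vowel_parts
print(len(one_character_form), len(two_character_form))

# Normalizing to a common form lets the two be compared as equal.
same_after_nfc = (
    unicodedata.normalize("NFC", two_character_form) == one_character_form
)
print(same_after_nfc)
```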

The point here is that a “smart” rendering system that can support many-to-many mappings between characters and glyphs makes it possible to have different implementations for a given writing system. This flexibility can provide alternatives for a developer, or can also be utilised to provide special functionality for particular purposes.

For some, a “smart” rendering capability that can handle many-to-many transformations from characters to glyphs may be unfamiliar. Indeed, such systems are only beginning to appear. Up to now, most systems have used rendering systems that support only one-to-one relationships between characters and glyphs. Such systems are sometimes known as “basic”, or “dumb” rendering systems. For a writing system like that of English, for which the standard behaviours are very simple, a “dumb” rendering system is adequate for most uses.6 For complex scripts, however, this limitation presents a problem.

For instance, if a Greek SIGMA requires context-based glyph selection but the system is limited to only one glyph per character, then the only possible solution is to have more than one SIGMA character: one character for each of the two glyphs. Since the mapping from characters to glyphs is a simple, one-to-one mapping, the rendering process becomes essentially transparent:

Figure 9. Greek sigma: presentation-form encoding and rendering

This approach to implementation is important for us to understand. It is important not because it reflects good practice or good technology—it is neither. Rather, it is important because it has been used for many years in a large number of implementations to support writing systems that involved complex behaviours, ranging from Arabic to IPA to Thai. This way of implementing a writing system imposes requirements not on the glyphs, but on the abstract characters: since there is a one-to-one mapping from characters to glyphs, one abstract character is required for every glyph that is needed.7 (Note that, as a result, the relationship between graphemes and characters will usually become significantly less direct.) For this reason, encoding systems that are designed to work in this way are often referred to as “glyph encodings”, “display encodings”, or “presentation-form encodings”. In general, an encoding should be devised to accommodate the needs of all processes that need to be performed on the text: rendering, input, sorting, searching, etc. In the case of a presentation-form encoding, however, the encoding is designed to accommodate the needs of rendering alone. If any other processes can still be performed without additional processing, that is coincidental. In most situations, however, other processes are made significantly more difficult, or are considered expendable.
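The cost that a presentation-form encoding imposes on other processes can be seen with the Greek sigma itself, since Unicode happens to encode the final form as a separate character (U+03C2) alongside the ordinary small sigma (U+03C3). The following Python sketch shows a search failing on such data; the helper fold_sigma is a hypothetical example of the "additional processing" mentioned above, not a standard function.

```python
SIGMA = "\u03c3"        # GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03c2"  # GREEK SMALL LETTER FINAL SIGMA

# A word stored with the final form as its own character: lambda, omicron
# with tonos, gamma, omicron, final sigma.
word = "\u03bb\u03cc\u03b3\u03bf\u03c2"

# A naive search for "sigma" misses the final form entirely.
naive_count = word.count(SIGMA)


def fold_sigma(text):
    """Hypothetical helper: treat both sigma characters as the same letter."""
    return text.replace(FINAL_SIGMA, SIGMA)


# Searching works only after the extra folding step.
folded_count = fold_sigma(word).count(SIGMA)

print(naive_count, folded_count)
```

With a single stored SIGMA and a renderer that selects the glyph by context, no such folding would be needed for searching or sorting.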

In summary, then, the relationship between abstract characters and glyphs can involve a complex, many-to-many transformation process, provided there is a “smart” rendering component within the system. If there is not, then a presentation-form encoding must be used. Exactly what happens within a system will vary from one implementation to another. In general, both the glyphs used in the font to display text and the abstract characters used to represent the text as data can vary between implementations. Where character encoding standards exist, the characters are less likely to differ between implementations, but many systems have been developed using custom encodings, and a large proportion of these have involved presentation-form encodings. Modern systems that use “smart” rendering capabilities present opportunities to do away with presentation-form encodings, along with their limitations.
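The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration: the glyph names and the tiny glyph table are hypothetical, and a real rendering system would be driven by font data rather than by hard-coded rules.

```python
# Hypothetical glyph names; a real font defines its own glyph inventory.
GLYPHS = {
    "\u03c3": "sigma.nonfinal",  # GREEK SMALL LETTER SIGMA
    "\u03c2": "sigma.final",     # GREEK SMALL LETTER FINAL SIGMA
    "\u03bf": "omicron",
    "\u03c6": "phi",
}


def render_dumb(text):
    """One character in, one glyph out; no context is consulted."""
    return [GLYPHS[ch] for ch in text]


def render_smart(text):
    """Pick the sigma glyph from context, so a single SIGMA character suffices."""
    glyphs = []
    for i, ch in enumerate(text):
        if ch == "\u03c3":
            followed_by_letter = i + 1 < len(text) and text[i + 1].isalpha()
            glyphs.append("sigma.nonfinal" if followed_by_letter else "sigma.final")
        else:
            glyphs.append(GLYPHS[ch])
    return glyphs


# Presentation-form encoding: the data itself must contain two kinds of sigma.
presentation_form_text = "\u03c3\u03bf\u03c6\u03bf\u03c2"
# Encoding with a single sigma character: the renderer supplies the final form.
single_sigma_text = "\u03c3\u03bf\u03c6\u03bf\u03c3"

print(render_dumb(presentation_form_text))
print(render_smart(single_sigma_text))
```

Both calls produce the same glyph sequence. The difference lies in where the knowledge about word-final sigma is kept: in the encoded data, or in the rendering component.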

3.4 From grapheme to codepoint to glyph

In the previous section, we showed how glyphs in general relate to abstract characters and the codepoints used to represent them. We also reviewed the fact that characters can be used to represent a grapheme in more than one way. Notice, then, that we don’t make any direct connection between graphemes and glyphs. As we have defined these two notions here, there is no direct connection between them. They can only be related indirectly through the abstract characters. This is a key point to grasp: the abstract characters are the element in common through which the others relate.

Graphemes are the units in terms of which users are usually accustomed to thinking. Within the computer, however, processes are done in terms of characters. The challenge is for the implementer to provide behaviours to the user that conform to their expectations using whatever means are available to process the characters and glyphs. We will discuss this further, but first we need to complete our picture by including the input component.

4 Keystrokes and codepoints

Input methods represent the third of three components for working with text data on a computer that were introduced at the beginning of Section 3. In general, input methods can include things like voice- or handwriting-recognition. Keyboards are the most common form of input, however, and also the only one that is easily extended or modified. This discussion will therefore focus on keyboard input. After considering the nature of keyboard input methods, and how keystrokes relate to codepoints, we will revisit the model in figure 10 to show how keystrokes relate to graphemes and the other items in the model.

4.1 From keystrokes to codepoints

Just as codepoints and glyphs are the counterparts to characters in the encoding and rendering components, keystrokes are the counterpart to characters in the keyboarding component. Whereas characters (or codepoints) get transformed into glyphs in the rendering process, keystrokes are transformed into codepoints in the input process.

All computer operating systems include software to handle keyboard input, and many provide more than one keyboard layout; that is, more than one set of mappings from keys to characters. Many keyboard input methods use strictly one-to-one mappings from keystrokes to characters: for each keystroke, there is one particular character that is generated.8 But some keyboards provide alternative mappings based on a previously-entered “dead key” (a key that does not enter a character, but rather changes the character entered by the following key). For example, typing “`” followed by “a” produces “à”, while “`” followed by “o” produces “ò”.
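A dead key can be modelled as a small piece of state that the input method carries from one keystroke to the next. The following Python sketch shows the idea for the grave-accent example. The table and the fallback behaviour (emitting both keys literally when no combination is defined) are assumptions made for illustration, not a description of any particular operating system.

```python
# Dead key -> {following key -> character produced}. Illustrative table only.
DEAD_KEYS = {
    "`": {"a": "\u00e0", "o": "\u00f2"},  # à, ò
}


def type_keys(keystrokes):
    """Turn a sequence of keystrokes into text, honouring dead keys."""
    output = []
    pending = None  # dead key waiting for the key that follows it, if any
    for key in keystrokes:
        if pending is not None:
            combinations = DEAD_KEYS[pending]
            # Assumed fallback: with no defined combination, emit both keys as typed.
            output.append(combinations.get(key, pending + key))
            pending = None
        elif key in DEAD_KEYS:
            pending = key  # nothing is entered yet
        else:
            output.append(key)
    if pending is not None:
        output.append(pending)  # a trailing dead key is emitted as itself
    return "".join(output)


print(type_keys("`a"))    # à
print(type_keys("c`o"))   # cò
```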

Just as the mapping from characters to glyphs might involve complex, many-to-many mappings, the same is potentially true for keyboard input. For example, it would be possible to have a single keystroke that generated a sequence of several characters, such as <n, g, b>, or <SMALL ALPHA, COMBINING ROUGH BREATHING MARK, COMBINING ACUTE, COMBINING IOTA SUBSCRIPT>. Similarly, it would be possible for a sequence of keystrokes to generate a single character, perhaps with each keystroke in the sequence changing the previous output.
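Both directions can be sketched together. In the Python fragment below, the key names "F1" and "F2" and the revision rules are hypothetical. The Greek sequence uses the Unicode characters U+0314 and U+0345, which we take here to stand for the rough breathing and the iota subscript.

```python
# One keystroke -> several characters (hypothetical key assignments).
ONE_TO_MANY = {
    "F1": "ngb",
    # alpha + rough breathing (U+0314) + acute (U+0301) + iota subscript (U+0345)
    "F2": "\u03b1\u0314\u0301\u0345",
}

# Several keystrokes -> one character: each rule rewrites the previous output.
REVISIONS = {
    ("a", "^"): "\u00e2",        # a, then ^  -> â
    ("\u00e2", "'"): "\u1ea5",   # â, then '  -> ấ
}


def type_keys(keys):
    """Map a list of keystrokes to text using the two tables above."""
    out = []
    for key in keys:
        if out and (out[-1], key) in REVISIONS:
            out[-1] = REVISIONS[(out[-1], key)]  # change what was already entered
        elif key in ONE_TO_MANY:
            out.extend(ONE_TO_MANY[key])         # one key, many characters
        else:
            out.append(key)
    return "".join(out)


print(len(type_keys(["F2"])))       # four characters from one keystroke
print(type_keys(["a", "^", "'"]))   # one character from three keystrokes
```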

Different input methods might generate exactly the same characters, though in different ways. For example, one may use a single keystroke to generate a given character, while another uses two or more keystrokes (the first, perhaps, being a “dead key”) to generate that character.

Figure 12. Two input methods: same character, different number of keystrokes

Input methods for Far-Eastern languages can use extremely complex mapping processes. Because the Han ideographs include tens of thousands of characters, entering these from a keyboard with a hundred or so keys presents an interesting challenge. Special input methods are used, known as input method editors (IMEs). These typically function by using a phonemic keying sequence to successively narrow down possibilities until a unique character is determined or, at least, until the number of possibilities is reduced to a small number. IMEs can involve what may almost appear to be a separate application program, with special editing and candidate-selection windows:
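The narrowing process described above can be illustrated with a toy candidate table. The readings and characters below are a tiny hand-picked sample chosen for this sketch. A real IME relies on large dictionaries, frequency data and a candidate-selection interface, none of which is modelled here.

```python
# Toy dictionary: romanized reading -> candidate characters (illustrative sample).
CANDIDATES = {
    "ma": ["妈", "马", "吗"],
    "man": ["满", "慢"],
    "mao": ["猫", "毛"],
    "mi": ["米"],
}


def candidates(typed):
    """Return every character whose reading starts with what has been typed so far."""
    result = []
    for reading, chars in CANDIDATES.items():
        if reading.startswith(typed):
            result.extend(chars)
    return result


# Each additional keystroke reduces the set the user has to choose from.
for typed in ["m", "ma", "mao"]:
    print(typed, candidates(typed))
```

Once the list is short enough, the user picks the intended character from it, which is the role played by the candidate-selection window.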

Languages that are written using Han characters represent an exception in terms of the number of characters that are required. There are other writing systems that require a relatively large number of characters (compared to most) without the extremes of Chinese writing. For example, the Ethiopic and Yi scripts are syllabaries with characters numbering in the hundreds (approximately 450 for Ethiopic, 1200 for Yi). These writing systems do not necessarily require the full power of an input method editor, with a separate editing or candidate-selection window, yet they still require input methods that can use easily-remembered keystroke sequences to enter each syllable.

Systems that use presentation-form encodings typically involve a one-to-many mapping between graphemes and characters—graphemes that have alternate appearances will have multiple character representations corresponding to the various visual forms of that grapheme. These systems are typically implemented using a complex input method to select the appropriate variant form of a grapheme according to the context. (The input method is, in effect, performing the processing that would otherwise be handled by a “smart” rendering system.)

For example, in the case of Greek sigma, an input method may use a single key, such as the “s” key, to enter all forms of sigma, with the input method generating either a FINAL SIGMA or NON-FINAL SIGMA according to the context:

Note that, when the “s” key is pressed, the sigma is at that point word-final. It is the next character that is entered that determines whether or not the sigma will remain word-final. If another word-forming character is entered, then the FINAL SIGMA is changed to NON-FINAL SIGMA.9
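This behaviour amounts to an input method that revises its own earlier output. A minimal Python sketch follows, assuming a hypothetical four-key layout and treating any alphabetic character as "word-forming":

```python
FINAL_SIGMA = "\u03c2"      # ς
NONFINAL_SIGMA = "\u03c3"   # σ

# Hypothetical key assignments for a few Greek letters.
KEY_TO_CHAR = {"s": FINAL_SIGMA, "o": "\u03bf", "f": "\u03c6", " ": " "}


def type_greek(keys):
    """Enter text so that each sigma starts out final and is revised if needed."""
    out = []
    for key in keys:
        ch = KEY_TO_CHAR[key]
        # A word-forming character after a sigma shows that the sigma
        # was not word-final after all, so the earlier output is changed.
        if out and out[-1] == FINAL_SIGMA and ch.isalpha():
            out[-1] = NONFINAL_SIGMA
        out.append(ch)
    return "".join(out)


print(type_greek("sofos"))     # σοφος
print(type_greek("sos sos"))   # σος σος
```

Here the input method does the contextual work that a “smart” rendering system would otherwise do, and the result is stored in the data as two distinct sigma characters.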

The point to see in this discussion of input methods is that the mapping of keystrokes into characters is potentially a complex process involving many-to-many mappings, just as in the rendering process. Users familiar with systems designed to support English would certainly be familiar with keyboards that use one-to-one mappings only. But keyboard processing need not be limited in this manner, and many keyboard implementations are not.

4.2 From keystroke to codepoint to grapheme

As was mentioned earlier, users are accustomed to working with text in terms of the particular elements of the orthography with which they are familiar, such as graphemes. Because users make decisions about which keys to press, keystrokes and sequences of keystrokes need to bear some close correlations to the units of the orthography as perceived by the target users.

This is somewhat different from the situation with glyphs. In the rendering process, users will not necessarily be aware of the individual glyphs used to display output; their interaction with glyphs is only in the combinations and contexts that are presented to them. So, for example, when a user sees “ô” on the display, they have no way to know whether this text was presented to them with one glyph or two (or, perhaps, even more). But users do know exactly what keystrokes they are entering.

Note that it would be possible to have an input method with keystrokes that do not closely match the units of the given orthography as perceived by users, but instead are radically different. In particular situations, this may be appropriate. For example, an IME for traditional Chinese might use keystrokes that are phonetically based rather than relating to the Chinese characters and their shapes. Given the nature of Chinese script, this is not an unreasonable way to handle input given the physical constraints of a keyboard. In most situations, though, a keyboard that does not bear similarities to the orthography would hinder usability.

The mapping from keystroke to codepoints (or abstract characters), on the other hand, could happen in any number of ways and need not involve direct, one-to-one correspondences. Likewise, we have seen that the relationship between codepoints and graphemes need not involve one-to-one correspondences. Clearly, then, these two mappings inter-relate: if both keystrokes and codepoints are in close correspondence to the elements of the orthography, then they will have to be in close correspondence with one another. If, however, the keystrokes closely match the graphemes but the characters do not, then the relationship between keystrokes and characters will not be a very direct one. So, while the keystrokes themselves will typically correspond closely to the orthography, the mapping from keystrokes to codepoints is not constrained in this way.

5 Orthographies and their computer implementations

We have examined the elements of orthographies (graphemes) and how they are implemented in a computer system in terms of three components: input methods (keystrokes), encoding (abstract characters/codepoints), rendering (glyphs). We now want to consider how all these components work together to form a complete system, and look deeper at the role of the encoding within the system.

As was mentioned in Section 2.3, abstract characters used to represent text do not necessarily have to bear any close resemblance to the graphemes being represented, just so long as the system can provide the behaviour the user expects. We also saw that the same was true of glyphs: they do not have to match the elements of the orthography as long as they give the visual feedback that users expect. The internal workings of the system can be very abstract, with significant differences from the concrete orthographic objects being represented. Overall, then, there is considerable flexibility in the way a system can be implemented.

People often fail to grasp the potentially abstract nature of these systems, however. I have often encountered two areas of misunderstanding. First, people often do not fully appreciate that input, encoding and rendering are distinct components that can relate from one to another by complex, many-to-many mappings. They will talk of keystrokes or glyphs as though they are identical to codepoints. Usually, these same people assume that data must be encoded in terms of presentation forms. Clearly, this comes from years of working with “dumb” rendering systems in which presentation-form encodings are the only possible solution for working with complex scripts. Secondly, people often do not appreciate that the characters used to represent text can be somewhat abstract, not matching the graphemes in a one-to-one manner. This appears to come from not fully understanding the various components of a system and the potential that each has to process information in different ways.

While character encodings are used to represent orthographies, their design does not have to be constrained by those orthographies. The primary concern in designing an encoding is that the characters used meet the needs of the various types of processing that will be performed on the text. Input and rendering are two such processes, and are constrained by the orthography in ways that we have seen, yet there can be complex mappings between these components and the encoding. The encoding must serve other processes as well; for example, searching or sorting. The user interacts with most or all of these processes, providing input and viewing results, and they think in terms of the elements of the orthography. Internally, however, all of these processes are performed in relation to the given encoding, which may be somewhat abstract.

Thus, we arrive at a complete model for orthographies and their implementation on a computer, incorporating the input, encoding and rendering components, along with a component representing other processes that may be performed on the text:

Figure 15. Orthographies and components to implement them on a computer

In this model, the encoding is the central component in the computer system to which the other components relate. The encoding also relates to the orthography being implemented, since it is the encoding that directly represents that orthography, though it may not do so in a one-to-one manner. From the user’s perspective, however, the orthography is perceived as being related to the keystrokes that they enter and the resulting glyphs that they see on the display.

In designing a system to implement a given orthography, it is important to understand all of these components, how they relate to one another, and the potential processing power that can be brought to bear in any of the data processing that might occur, including input and rendering as well as other processes such as sorting. (Of course, that processing power may depend on the software with which the implementation will be used.) To understand this better, let’s consider a few applications of the ideas we have discussed.

5.1 Possibilities for implementing Latin diacritics

We saw earlier that a Roman base + diacritic combination such as “á” might be encoded either as a single character SMALL A WITH ACUTE or as a sequence of characters, <SMALL A, COMBINING ACUTE>. We also saw that “á” might be rendered using a single composite glyph, or using separate glyphs for “a” and for overstriking acute. In addition, we saw that, in some orthographies, the combination “á” would be considered a grapheme, while in other orthographies both “a” and the acute accent would each be considered graphemes. The parallelism suggests that these should correspond in any given implementation: if an orthography has “á” as a grapheme, then it should be encoded as a single character and rendered with a single glyph, but if an orthography has separate graphemes “a” and acute, then these should be encoded as separate characters and rendered with separate glyphs.

By now we should see, however, that these limitations on implementation are not necessary. The behaviours that users expect when “á” is a single grapheme can be implemented by encoding using either a single character or a sequence of two characters. Where two characters are used, processes such as sorting may need to be configured to recognise the sequence as a single unit for processing purposes. The input method may provide a single keystroke to enter the two-character sequence, though users may prefer a keystroke sequence, even though this is a single grapheme. With regard to rendering, either one or two glyphs may be used, regardless of whether one character or two are used for encoding. (Indeed, in the case of either encoding, some fonts may use a single glyph while others use two.) Also, the system can be implemented to prevent either the base or the diacritic from being selected independently, though like keystrokes, the users may still prefer to be able to edit in terms of the individual pieces; for example, to delete just the acute to change the grapheme “á” into the distinct grapheme “a”.10
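Unicode happens to allow both encodings of “á”, and Python's standard unicodedata module can convert between them, which makes it a convenient way to illustrate the point. The helper functions below are our own illustrative names. The editing behaviour shown (deleting only the diacritic) is one of the options discussed above, not a required one.

```python
import unicodedata

single = "\u00e1"      # one character: LATIN SMALL LETTER A WITH ACUTE
sequence = "a\u0301"   # two characters: <a, COMBINING ACUTE ACCENT>


def same_text(a, b):
    """Compare two strings as users would: ignoring which encoding was used."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def delete_final_mark(text):
    """Delete a trailing diacritic, whichever way the text was encoded."""
    decomposed = unicodedata.normalize("NFD", text)
    if decomposed and unicodedata.combining(decomposed[-1]):
        decomposed = decomposed[:-1]
    return unicodedata.normalize("NFC", decomposed)


print(single == sequence)          # False: the encoded data differs
print(same_text(single, sequence)) # True: the text is the same to the user
print(delete_final_mark(single))   # a
```

The same user-visible behaviour is available for either encoding, because the processing is done by the software and not dictated by the number of characters.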

Likewise, if users understand “a” and acute to be separate graphemes, the appropriate behaviours can be implemented, regardless of how the text is encoded or what glyphs are used for rendering.

Of course, implementation will be easier when the number of codepoints, glyphs and keystrokes match the number of graphemes, since one-to-one mappings can be used. It may not always be practical to do this, however. This may be true, for example, when implementing a system using Unicode as the encoding, since the encoding component is, in that case, pre-defined and not open to redesign by the implementer. Where complex rendering and input are supported, though, this is not an obstacle.

5.2 Implementing multigraphs

I mention this example because there have been many occasions on which people have requested an addition to the Unicode standard to include a digraph character from the orthography of their language. For example, there have been requests to add a digraph “ch” for Slovak. The usual arguments are along these lines: “Unicode needs a character ‘ch’ because this is a distinct character in our alphabet: we perceive it as a distinct character, and it has its own place in the sort order.”

It should be noted that these arguments are based upon orthographic notions: orthographic character and grapheme. What is not considered is whether the necessary behaviour can be implemented using existing abstract characters that are already defined in the encoding.

In the case of “ch”, this can potentially be encoded as a sequence of characters, <c, h>. Of course, it may be necessary to sort data for a given language so that “cha” sorts after “cu”. It might even be necessary to implement a system so that “ch” always appears to the users to be treated like an indivisible unit. For example, users may prefer behaviour in which it is never possible to place an insertion point between the “c” and the “h”. But there is more than one way to achieve this result in software; creating a new abstract character for “ch” is only one way to do this. The sorting behaviour, for instance, can be achieved by configuring the sorting process to treat the sequence <c, h> as a unit for sorting purposes. This kind of approach to sorting has been used for many years in many systems.
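Configuring a sort to treat <c, h> as a unit is straightforward to sketch. The Python example below assumes a deliberately simplified alphabet (lower-case a to z with “ch” placed after “h”) and ignores the accented letters that a real collation for a language such as Slovak would also need.

```python
import string

# Simplified alphabet for illustration: a-z with the digraph "ch" after "h".
ALPHABET = list(string.ascii_lowercase)
ALPHABET.insert(ALPHABET.index("h") + 1, "ch")
RANK = {unit: position for position, unit in enumerate(ALPHABET)}


def sort_key(word):
    """Split a lower-case word into sorting units, treating <c, h> as one unit."""
    key = []
    i = 0
    while i < len(word):
        if word.startswith("ch", i):
            key.append(RANK["ch"])
            i += 2
        else:
            key.append(RANK[word[i]])
            i += 1
    return key


words = ["cha", "cu", "ha", "ca", "ia"]
print(sorted(words))                 # plain codepoint order
print(sorted(words, key=sort_key))   # "cha" now sorts after "cu" and after "ha"
```

The data remains encoded as the two characters <c, h>; only the sorting process has been told about the digraph.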

As mentioned earlier, the requirements on the abstract characters in an encoding are that they are adequate for performing whatever processes will be applied to the data and to give the desired results and behaviours. Evaluating the adequacy of representing a digraph such as “ch” using a sequence <c, h>, therefore, needs to be done in terms of text processing issues rather than directly in terms of what the graphemes are in the given orthography. The kind of question that needs to be asked is whether there are any situations in which a process needs to distinguish the grapheme “ch” from the sequence of graphemes “c” followed by “h”, and whether encoding the digraph “ch” as a character sequence <c, h> will not provide the necessary distinction. If such situations would occur, then the next questions might be whether there are any other existing mechanisms provided by the encoding system that can solve that problem, and what the implications might be for using some other mechanism versus adding a new digraph character.11

The point here is to see that, in determining what is needed to implement support for a given writing system, it is necessary to understand the distinct components that make up the text processing system. It is also important to understand the potential solutions that can be offered, and to think in terms of what it takes to make the processes within the system work in a way that displays the behaviours that users expect.

5.3 “Multi-script” implementations

Some languages are written using several writing systems. For various reasons, different portions of the language community end up writing the language using orthographies based on different scripts. This is common, for example, with language communities that cross national borders, where the national languages of the different countries use different scripts.

In these situations, it may be helpful for certain purposes to create a “multi-script” implementation in which a single encoding is used which can support input or rendering in terms of the different writing systems. (This can be useful, for example, in projects to compile a large corpus of information that will have to be published in each of the various writing systems. Having a single data corpus allows for easier data management when numerous revisions and additions are being made over a long period of time.)12

For example, suppose a situation in which a language is written with Roman, Thai, and Lanna scripts. Of course, since these scripts are different, we expect that the representation of phonemes may be done in somewhat different ways in each script. For instance, the vowel /ə/ might be written as a single character in Roman script, but in Thai it is written as a discontiguous digraph, and in Lanna as a discontiguous trigraph. Thus, /sə/ might be written as “sə” in Roman, but would be written in Thai script as “” and in Lanna script as “”.

The aim in the encoding is to have a single representation for the text, regardless of the writing system. In this example, an encoding might be devised such that the phoneme /ə/ is an abstract character in the encoded representation. So, for instance, the sequence of codepoints d152 d208 could be used to encode a character sequence that corresponds to /sə/, and this sequence of codepoints would be rendered in Roman script as “sə”, in Thai script as “” and in Lanna script as “”:

for Roman, Thai and Lanna
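The idea of one encoded sequence with script-specific rendering can be sketched as follows. Everything beyond the two codepoint values is an assumption made for illustration: which codepoint stands for the consonant and which for the vowel, the glyph names, and the positions of the vowel pieces around the consonant are all hypothetical.

```python
# Assumed assignment of the two codepoints from the example.
S, SCHWA = 152, 208

# Consonant glyph for each script (hypothetical glyph names).
CONSONANTS = {
    "roman": {S: "roman.s"},
    "thai": {S: "thai.s"},
    "lanna": {S: "lanna.s"},
}

# Vowel glyphs as (pieces before the consonant, pieces after the consonant).
VOWELS = {
    "roman": {SCHWA: ([], ["roman.schwa"])},
    "thai": {SCHWA: (["thai.schwa.pre"], ["thai.schwa.post"])},
    "lanna": {SCHWA: (["lanna.schwa.pre"], ["lanna.schwa.post1", "lanna.schwa.post2"])},
}


def render(codepoints, script):
    """Render one encoded sequence as a glyph sequence for the chosen script."""
    consonants, vowels = CONSONANTS[script], VOWELS[script]
    glyphs = []
    i = 0
    while i < len(codepoints):
        cp = codepoints[i]
        following = codepoints[i + 1] if i + 1 < len(codepoints) else None
        if cp in consonants and following in vowels:
            before, after = vowels[following]
            glyphs += before + [consonants[cp]] + after  # vowel wraps the consonant
            i += 2
        elif cp in consonants:
            glyphs.append(consonants[cp])
            i += 1
        else:
            before, after = vowels[cp]
            glyphs += before + after
            i += 1
    return glyphs


text = [S, SCHWA]  # one encoded representation for all three scripts
for script in ("roman", "thai", "lanna"):
    print(script, render(text, script))
```

The encoded data never changes; only the rendering table selected for the writing system does.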

Likewise, separate input methods could be provided with behaviours suited to each writing system. For instance, a Roman keyboard might use a single keystroke for “ə”, whereas the Thai keyboard would use separate keystrokes for “” and “”, while the Lanna keyboard uses separate keystrokes for each of “”, “” and “”.13

In such an encoding, the codepoints might not have a one-to-one relationship with either the elements of any of the orthographies or with the glyphs. Rather, portions of the data would be represented in terms of more abstract units of information that are independent of the individual writing systems but from which the orthographic characters in any of the writing systems can be recovered. They might correspond, for example, to phonemes, as in the example of /ə/.

In order to understand how a “multi-script” implementation might work, it is necessary to understand that input, encoding and rendering are distinct, and that the encoding is also distinct from the orthographic system(s) being represented. It is also important to understand that the transformations between keystrokes, codepoints and glyphs can be complex, with many-to-many mappings.

6 Conclusions

We have taken an in-depth look at orthographies and the elements that make up orthographies, and at the various components of an information system that are used to implement support for orthographies. It has been important to see that there are distinct components (input, encoding and rendering), and that the interactions between them can involve complex, many-to-many mappings.

We have seen that users will typically think in terms of the elements of orthography, which is what they are most familiar with. In contrast, the processing done within the system is done in terms of abstract characters and the numeric codepoints used to encode them. Furthermore, because of the processing power available in the input and rendering components, the abstract characters need not exactly match the graphemes that make up the orthography.

We have also seen that the design of an encoding is not determined directly by the elements of the orthography being represented, but rather by the processes that will be performed on the data. It is only indirectly, as users interact with those processes to input text or view the results, that the encoding design is influenced by the orthography. Ultimately, we want our system to provide users with the behaviours they expect, but there can be many ways in which to accomplish this. Successfully implementing an orthography on a computer requires a good understanding not only of the writing system, but also of the pieces that make up the implementation and how they work together.

This detailed understanding of characters, keystrokes, codepoints, and glyphs should be helpful in understanding multilingual software. It should also be helpful in understanding emerging standards and technologies, such as Unicode, OpenType, AAT or Graphite, and in understanding how these technologies can be useful for working with multilingual or minority-language data.

7 Further reading

There are many related topics that we have not been able to go into. The following resources may provide useful information to go deeper into particular topics:

To learn more about the Unicode encoding standard, there is a lot of useful information on the Unicode web site: http://www.unicode.org/. In particular, consult the “What is Unicode” page, the FAQ page, and the “Technical Intro


Note: the opinions expressed in submitted contributions below do not necessarily reflect the opinions of our website.

I want to send some mathematical lines by e-mail. The Greek letter sigma is used for summation and I have yet to learn the keystrokes to put it in an e-mail. Your article has been helpful. I will have to read it over again. Thank you.

I have just read your article at https://scripts.sil.org/IWS-Chapter02 and would like to ask whether you recognise that the value of multiscript encodings in some instances could be a valid ground for implementing a code block in Unicode for a multiscript language? From its website, the unicode organisation appears to strongly oppose this, at least in general.

Nevertheless I believe there are numerous very strong grounds for implementing Pali as a unique code block in Unicode. Implementing Pali as a language attached to dozens of scripts scattered all over the unicode area has so many problems, both in terms of operating system/software implementation - most or all do not recognise Pali as a language (for example when typing Pali in major software packages it is not possible to specify the language, nor that for spelling purposes it is not x-language because there is no valid alternative to select), and the intrinsic difficulties of maintaining, editing, exchanging, publishing, comparing and analysing a very large corpus of text which has many variations in different editions and different traditions. As the intrinsic medium of instruction and study to Theravada Buddhism, Pali is of importance to numerous hundreds of millions of people around the world; it is also the fastest growing religion in the world (especially amongst traditionally non-Buddhist countries in the West), and there is a need to write Pali in dozens of different languages/scripts. It is also not _entirely_ true to say that Pali is a dead language - there are new and original publications written entirely in Pali language every year (just one example: Dhātudīpikā by Sayadaw U Nandamala, published both on Roman and Myanmar editions, without a single non-Pali word); there are also some villages in Myanmar where Pali is spoken as an everyday language; and Pali language has continually evolved over the centuries and continues to evolve. In Myanmar alone there numerous distinct orthographies for Pali; in Laos there is a completely separate script just for Pali, in addition to the Lao language script. Also the historical variations are a factor, in transcribing ancient texts. 
It would be wonderful to be able to copy an ancient manuscript in its original script using (for example) a Pali-Roman keyboard, and simultaneously (while typing) see the output in both Roman and original script onscreen without any re-encoding, or to be able to display an original text in any script without re-encoding.

I have not tried to directly specify in this comment the problems of using existing script-based unicode blocks for representing Pali as that is a huge topic, but your comments in the article suggest you are already aware of such problems. I would be very interested in your comments on this question.

The familiar term orthography is used here in place of the more correct and more specialized but less well-known term writing system. Writing systems include not only conventional systems of graphic objects used for written linguistic communication—commonly known as orthographies, but also systems of written notation used to describe or transcribe language and linguistic utterances, such as IPA or shorthand.

Note that graphemes are not necessarily related to phonemes. For example, the English phoneme /θ/ is written as “th”, but “th” does not function as a unit in terms of the behaviours of English orthography.

From this point, we will use small caps notation in named references to specific characters, as understood in the second sense of that term: as units of textual information within an information system. We will also continue to use angle-brackets to represent sequences of characters, though the individual characters might be referenced using representative shapes, such as “h”, rather than by names.

There is more that can be said about codepoints and encoding schemes than we have space to discuss here. In particular, there are some possible complications, such as multi-byte encodings, that we are ignoring since the aim of this discussion is to focus on other issues. For our purposes here, it is sufficient to think of codepoints merely as the numbers used to encode abstract characters, and to ignore the details of how this might be done. The explanation and examples provided in this section adequately convey the essential concepts that are important given the scope of this discussion. For more information regarding encoding schemes, see The Unicode Standard, Version 3.0 [7], and especially Unicode Technical Report #17 [8]. For information on multi-byte encodings that are used for East Asian character sets, see Lunde (1999) [3].

There is a little bit more to fonts than that, but this simplification is sufficient for the moment. One additional point worth mentioning, though: most font formats are not limited to 256 glyphs. Any TrueType font, for example, may potentially contain more than 65,000 glyphs.

Most implementations using presentation-form encodings have been created for use on systems that use single, 8-bit values for codepoints. Since 8-bit codepoints give an upper limit of 256 characters (in actual practice, the limit is usually around 221), this has imposed an additional limitation of using at most 256 glyphs for an implementation. As mentioned in footnote 5, however, most font formats support far more than 256 glyphs.

We will use the term keystroke to refer to the pressing of any basic (non-modifier) key in combination with zero or more modifier keys. By modifier keys, we mean keys such as alt, control, shift, alt-graph, option, and command.

Not all rendering systems would necessarily allow drawing selections that correspond to portions of a glyph, such as selecting only the acute in a composite glyph “á”. There are other subtle issues that may need to be considered. For example, if “á” is encoded as a single character, the system would not know what underlying text corresponded to a selection that includes only the acute.

In the case of Unicode, it turns out that there are some significant implications that would have a detrimental impact on many existing implementations of the standard. To deal with problematic situations such as this in Unicode, a decision has been made to provide another mechanism, rather than adding new multigraphs to the standard. The details of this are beyond the scope of this discussion, however.