Mapping codepoints to Unicode encoding forms

In Section 4 of “Understanding Unicode™”, we examined each of the three character encoding forms defined within Unicode. This appendix describes in detail the mappings from Unicode codepoints to the code unit sequences used in each encoding form.

In this description, the mapping will be expressed in alternate forms, one of which is a mapping of bits between the binary representation of a Unicode scalar value and the binary representation of a code unit. Even though a coded character set encodes characters in terms of numerical values that have no specific computer representation or data type associated with them, for purposes of describing this mapping, we are considering codepoints in the Unicode codespace to have a width of 21 bits. This is the number of bits required for binary representation of the entire numerical range of Unicode scalar values, 0x0 to 0x10FFFF.

1 UTF-32

The UTF-32 encoding form was formally incorporated into Unicode as part of TUS 3.1. The definitions for UTF-32 are specified in TUS 3.1 and in UAX#19 (Davis 2001). The mapping for UTF-32 is, essentially, the identity mapping: the 32-bit code unit used to encode a codepoint has the same integer value as the codepoint itself. Thus if U represents the Unicode scalar value for a character and C represents the value of the 32-bit code unit then:

U = C

The mapping can also be expressed in terms of the relationships between bits in the binary representations of the Unicode scalar values and the 32-bit code units, as shown in Table 1.

Codepoint range                    Unicode scalar value (binary)   Code units (binary)
U+0000..U+D7FF, U+E000..U+10FFFF   xxxxxxxxxxxxxxxxxxxxx           00000000000xxxxxxxxxxxxxxxxxxxxx

Table 1 UTF-32 USV to code unit mapping
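
Because the mapping is the identity, it can be sketched in a few lines of Python (utf32_encode is an illustrative name, not a standard API):

```python
import struct

# UTF-32 is the identity mapping: the 32-bit code unit holds the scalar
# value itself, zero-padded to 32 bits (big-endian shown for illustration).
def utf32_encode(u):
    return struct.pack('>I', u)

# The result matches Python's own UTF-32BE codec.
assert utf32_encode(0x10FFFF) == '\U0010FFFF'.encode('utf-32-be')
assert utf32_encode(0x0041) == b'\x00\x00\x00\x41'
```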

2 UTF-16

The UTF-16 encoding form was formally incorporated into Unicode as part of TUS 2.0. The current definitions for UTF-16 are specified in TUS 3.0; no revisions to UTF-16 were made in TUS 3.1. The calculation for converting from the code values for a surrogate pair code unit sequence to the Unicode scalar value of the character being represented is reasonably simple: if CH and CL represent the values of the high and low surrogate code units in a well-formed surrogate pair, then the corresponding Unicode scalar value U is calculated as follows:

U = (CH – 0xD800) * 0x400 + (CL – 0xDC00) + 0x10000

Likewise, determining the high and low surrogate values for a given Unicode scalar value is fairly straightforward. Assuming the variables CH, CL and U as above, and that U is in the range U+10000..U+10FFFF,

CH = (U – 0x10000) ÷ 0x400 + 0xD800

CL = (U – 0x10000) mod 0x400 + 0xDC00

where “÷” represents integer division and “mod” represents the modulo operator.

Expressing the mapping in terms of a mapping of bits between the binary representations of scalar values and code units, the UTF-16 mapping is as shown in Table 2:

Codepoint range                  Unicode scalar value (binary)   Code units (binary)
U+0000..U+D7FF, U+E000..U+FFFF   00000xxxxxxxxxxxxxxxx           xxxxxxxxxxxxxxxx
U+10000..U+10FFFF                uuuuuxxxxxxyyyyyyyyyy           110110wwwwxxxxxx 110111yyyyyyyyyy
                                                                 (where uuuuu = wwww + 1)

Table 2 UTF-16 USV to code unit mapping
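
The surrogate arithmetic above can be sketched in Python (utf16_encode and utf16_decode are illustrative names, not a standard API):

```python
# A sketch of the surrogate-pair arithmetic for UTF-16.
def utf16_encode(u):
    if u <= 0xFFFF:                  # BMP character: a single code unit
        return [u]
    v = u - 0x10000                  # 20-bit value to split across the pair
    return [0xD800 + (v >> 10),      # high surrogate: top 10 bits
            0xDC00 + (v & 0x3FF)]    # low surrogate: bottom 10 bits

def utf16_decode(ch, cl):
    # Inverse calculation from a well-formed surrogate pair.
    return (ch - 0xD800) * 0x400 + (cl - 0xDC00) + 0x10000

assert utf16_encode(0x1D11E) == [0xD834, 0xDD1E]   # U+1D11E MUSICAL SYMBOL G CLEF
assert utf16_decode(0xD834, 0xDD1E) == 0x1D11E
```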

3 UTF-8

The UTF-8 encoding form was formally incorporated into Unicode as part of TUS 2.0. The current definitions for UTF-8 are specified in TUS 3.1. As with the other encoding forms, calculating a Unicode scalar value from the 8-bit code units in a UTF-8 sequence is a matter of simple arithmetic. In this case, however, the calculation depends upon the number of bytes in the sequence. Similarly, the calculation of code units from a scalar value must be expressed differently for different ranges of scalar values.

Let us consider first the relationship between bits in the binary representation of codepoints and code units. This is shown for UTF-8 in Table 3:

Codepoint range                  Scalar value (binary)   Byte 1     Byte 2     Byte 3     Byte 4
U+0000..U+007F                   00000000000000xxxxxxx   0xxxxxxx
U+0080..U+07FF                   0000000000yyyyyxxxxxx   110yyyyy   10xxxxxx
U+0800..U+D7FF, U+E000..U+FFFF   00000zzzzyyyyyyxxxxxx   1110zzzz   10yyyyyy   10xxxxxx
U+10000..U+10FFFF                uuuzzzzzzyyyyyyxxxxxx   11110uuu   10zzzzzz   10yyyyyy   10xxxxxx

Table 3 UTF-8 USV to code unit mapping

Note

There is a slight difference between Unicode and ISO/IEC 10646 in how they define UTF-8 since Unicode limits it to the roughly one million characters possible in Unicode’s codespace, while for the ISO/IEC standard, it can access the entire 31-bit codespace. For all practical purposes, this difference is irrelevant since the ISO/IEC codespace is effectively limited to match that of Unicode, but you may encounter differing descriptions on occasion.

As mentioned in Section 4.2 of “Understanding Unicode™”, UTF-8 byte sequences have certain interesting properties. These can be seen from the table above. Firstly, note the high-order bits in non-initial bytes as opposed to sequence-initial bytes. By looking at the first two bits, you can immediately determine whether a code unit is an initial byte in a sequence or is a following byte. Secondly, by looking at the number of non-zero high-order bits of the first byte in the sequence, you can immediately tell how long the sequence is: if no high-order bits are set to one, then the sequence contains exactly one byte. Otherwise, the number of non-zero high-order bits is equal to the total number of bytes in the sequence.
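
This self-describing property can be sketched as a small classifier; utf8_info is an illustrative name, and the byte classes follow the bit patterns in Table 3:

```python
def utf8_info(byte):
    """Classify a UTF-8 code unit by its high-order bits."""
    if byte >> 7 == 0b0:
        return 'single-byte sequence'
    if byte >> 6 == 0b10:
        return 'trailing byte'
    if byte >> 5 == 0b110:
        return 'leads a 2-byte sequence'
    if byte >> 4 == 0b1110:
        return 'leads a 3-byte sequence'
    if byte >> 3 == 0b11110:
        return 'leads a 4-byte sequence'
    return 'not valid in UTF-8'

assert utf8_info(0x41) == 'single-byte sequence'
assert utf8_info(0x82) == 'trailing byte'
assert utf8_info(0xE2) == 'leads a 3-byte sequence'
```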

Table 3 also reveals the other interesting characteristic of UTF-8 that was described in Section 4.2 of “Understanding Unicode™”. Note that characters in the range U+0000..U+007F are represented using a single byte. The characters in this range match ASCII codepoint for codepoint. Thus, any data encoded in ASCII is automatically also encoded in UTF-8.
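
This ASCII compatibility is easy to verify directly; the example below compares the byte sequences produced by the two encodings:

```python
# ASCII text is already valid UTF-8, byte for byte.
text = 'Hello, world'
ascii_bytes = text.encode('ascii')
utf8_bytes = text.encode('utf-8')
assert ascii_bytes == utf8_bytes
```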

Having seen how the bits compare, let us consider how code units can be calculated from scalar values, and vice versa. If U represents the value of a Unicode scalar value and C1, C2, C3 and C4 represent bytes in a UTF-8 byte sequence (in order), then the value of a Unicode scalar value U can be calculated as follows:

If the sequence has one byte, then

U = C1

Else if the sequence has two bytes, then

U = (C1 – 192) * 64 + C2 – 128

Else if the sequence has three bytes, then

U = (C1 – 224) * 4,096 + (C2 – 128) * 64 + C3 – 128

Else (the sequence has four bytes)

U = (C1 – 240) * 262,144 + (C2 – 128) * 4,096 + (C3 – 128) * 64 + C4 – 128

End if

Going the other way, given a Unicode scalar value U, then the UTF-8 byte sequence can be calculated as follows:

If U <= U+007F, then

C1 = U

Else if U+0080 <= U <= U+07FF, then

C1 = U ÷ 64 + 192

C2 = U mod 64 + 128

Else if U+0800 <= U <= U+D7FF, or if U+E000 <= U <= U+FFFF, then

C1 = U ÷ 4,096 + 224

C2 = (U mod 4,096) ÷ 64 + 128

C3 = U mod 64 + 128

Else

C1 = U ÷ 262,144 + 240

C2 = (U mod 262,144) ÷ 4,096 + 128

C3 = (U mod 4,096) ÷ 64 + 128

C4 = U mod 64 + 128

End if

where “÷” represents integer division (returns only the integer portion, rounded down), and “mod” represents the modulo operator.
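
Both directions of this arithmetic can be sketched in Python, where // is integer division and % is the modulo operator (utf8_encode and utf8_decode are illustrative names):

```python
# Scalar value to UTF-8 byte values, following the branches in the text.
def utf8_encode(u):
    if u <= 0x7F:
        return [u]
    if u <= 0x7FF:
        return [u // 64 + 192, u % 64 + 128]
    if u <= 0xFFFF:
        return [u // 4096 + 224, (u % 4096) // 64 + 128, u % 64 + 128]
    return [u // 262144 + 240, (u % 262144) // 4096 + 128,
            (u % 4096) // 64 + 128, u % 64 + 128]

# UTF-8 byte values back to the scalar value; the calculation depends
# on the length of the sequence, signalled by the first byte.
def utf8_decode(seq):
    c1 = seq[0]
    if c1 < 0x80:
        return c1
    if c1 < 0xE0:
        return (c1 - 192) * 64 + seq[1] - 128
    if c1 < 0xF0:
        return (c1 - 224) * 4096 + (seq[1] - 128) * 64 + seq[2] - 128
    return ((c1 - 240) * 262144 + (seq[1] - 128) * 4096 +
            (seq[2] - 128) * 64 + seq[3] - 128)

assert utf8_encode(0x20AC) == [0xE2, 0x82, 0xAC]       # U+20AC EURO SIGN
assert utf8_decode([0xF0, 0x90, 0x80, 0x80]) == 0x10000
```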
If you examine the mapping in Table 3 carefully, you may notice that by ignoring the range constraints in the left-hand column, certain codepoints can potentially be represented in more than one way. For example, substituting U+0041 LATIN CAPITAL LETTER A into the table gives the following possibilities:

Codepoint               Pattern                 Byte 1     Byte 2     Byte 3     Byte 4
000000000000001000001   00000000000000xxxxxxx   01000001
000000000000001000001   0000000000yyyyyxxxxxx   11000001   10000001
000000000000001000001   00000zzzzyyyyyyxxxxxx   11100000   10000001   10000001
000000000000001000001   uuuzzzzzzyyyyyyxxxxxx   11110000   10000000   10000001   10000001

Table 4 “UTF-8” non-shortest sequences for U+0041

Obviously, having these alternate encoded representations for the same character is not desirable. Accordingly, the UTF-8 specification stipulates that the shortest possible representation must be used. In TUS 3.1, this was made explicit by specifying exactly which UTF-8 byte sequences are and are not legal. Thus, in the example above, each of the sequences other than the first is an illegal code unit sequence.
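
Modern decoders enforce the shortest-form rule. For instance, Python's built-in UTF-8 decoder rejects the two-byte overlong form of U+0041 from Table 4:

```python
# 0xC1 0x81 is a non-shortest ("overlong") encoding of U+0041
# and must be rejected by a conformant UTF-8 decoder.
def decodes_cleanly(data):
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

assert decodes_cleanly(b'\x41') is True        # the legal one-byte form
assert decodes_cleanly(b'\xc1\x81') is False   # the overlong form: rejected
```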

Similarly, a supplementary-plane character can be encoded directly into a four-byte UTF-8 sequence, but someone might (possibly from misunderstanding) choose to map the codepoint into a UTF-16 surrogate pair, and then apply the UTF-8 mapping to each of the surrogate code units to get a pair of three-byte sequences. To illustrate, consider U+10000: it maps to the surrogate pair D800 DC00, and applying the UTF-8 mapping to each surrogate code unit gives the six-byte sequence ED A0 80 ED B0 80 rather than the correct four-byte sequence F0 90 80 80.

Again, the Unicode Standard expects the shortest representation to be used for UTF-8. For reasons explained below, non-shortest representations of supplementary-plane characters are referred to as irregular code unit sequences rather than illegal code unit sequences. The distinction here is subtle: software that conforms to the Unicode Standard is allowed to interpret these irregular sequences as the corresponding supplementary-plane characters, but is not allowed to generate these irregular sequences. In certain situations, though, software will want to reject such irregular UTF-8 sequences (for instance, where these might otherwise be used to circumvent security systems), and in these cases the Standard allows conformant software to ignore or reject these sequences, or remove them from a data stream.

The main motivation for making the distinction, and for considering these 6-byte sequences irregular rather than illegal, is this: suppose a process is re-encoding a data stream from UTF-16 to UTF-8, and suppose that the source data stream has been interrupted so that it ends with the first half of a surrogate pair. This segment of the data may later be reunited with the remainder of the data, which has also been re-encoded in UTF-8. So, we are assuming that there are two segments of data: one ending with an unpaired high surrogate, and one beginning with an unpaired low surrogate.

Now, as each segment of the data is being trans-coded from UTF-16 to UTF-8, the question arises as to what should be done with the unpaired surrogate code units. If they are ignored, then the result after the data is reassembled will be that a character has been lost. A more graceful way to deal with the data would be for the trans-coding process to translate the unpaired surrogate into a corresponding 3-byte UTF-8 sequence, and then leave it up to a later receiving process to decide what to do with it. Then, if the receiving process gets the data segments assembled again, that character will still be part of the information content of the data. The only problem is that now it is in a 6-byte pseudo-UTF-8 sequence. Defining these as irregular rather than illegal is intended to allow that character to be retained over the course of this overall process in a form that conformant software is allowed to interpret, even if it would not be allowed to generate it that way.

