What everyone should know about Unicode

Written on
February
27th,
2017
by Kishu Agarwal

I will try to explain in this article what is Unicode and why it is important for you to know about it.

But first, some background?

Computers don’t understand characters. They understand numbers. So, all the characters are basically numbers to computers. Now the question arises, Which number is assigned to which character. This is where encoding schemes comes into picture.

One of the most popular encoding schemes that was developed in around 1960’s and is still widely used is ASCII. ASCII is a 7 bit encoding scheme which maps characters to 7 bit code. For example, capital A in English is encoded as 65 in ASCII.

Similarly, other characters in english, control characters, punctuations characters were mapped to some 7 bit numbers.

Things were fine if english is the only language in the whole world, but as we all know that is not the case. ASCII is a 7 bit encoding scheme, which means it can encodes only 128 unique characters. But what about characters in Russian, Chinese and other languages. If someone in Russia has to write a document in Russian, he won’t be able to use the ASCII since, ASCII doesn’t have the mapping for the Russian characters.

Russian people started using KOI encoding which supported the Russian characters. But these documents when sent to people who use ASCII as their encoding scheme, would see gibberish. The reason being ASCII would try to read 65 as an A, but actually 65 is supposed to mean some other character in KOI encoding scheme.

Problems doesn’t end here only. Chinese language has more than 128 characters, which can’t be represented in 7 bit code, and thus complicating the issue further.

Unicode to the rescue

Unicode Standard was developed to resolve this issue arising from different encodings and there incompatibility with each other.

Unicode is nothing but a simple mapping from characters to numbers. Unicode maps all of the characters in every language known to human beings, even Klingon and emojis symbols.(Really!)

But here is one very crucial point about Unicode. Unicode doesn’t just map characters to numbers which will be actually used while storing or sending files. Unicode maps characters to numbers which are more of abstract kind of a mapping. These numbers are called Code Points.

Code Points are useful because they are kind of a identification of a character without taking into account any visual distinction or encoding. This essentially means that the english character A is assigned some number and that number is not linked to any particular glyph for that character or any particular encoding scheme.

To actually store or send a document comprising of Unicode characters, the document has to be first encoded using one of the Unicode encoding schemes. Over the years many Unicode encoding schemes have came. I have described two of the most popular ones.

UTF-8

UTF-8 is by far, the most commonly used Unicode encoding scheme. 8 in the UTF-8 denotes the fact the this encoding uses 8-bit bytes as the basis for encoding. Unlike the above encoding schemes which uses fixed number of bytes to encode characters, UTF-8 is a variable multi-byte encoding scheme.

U+10FFFF is the last codepoint defined as of now. If you are having trouble in understanding the above table, then let me explain this with a quick example. Suppose you have to encode the unicode character Ω with code point U+03A9. (U+ is commonly used to denote the unicode code points)

This unicode character lies in the second row of the table and thus 2 bytes would be used for this character. Out of the 16 bits, 3 bits in the first byte and 2 bits in the second byte are already filled by the encoding. And the remaining 11 bits are filled by using the 11 bits of the code point. Thus giving the final encoding of CE A9.

You may be wondering why there are some reserved bits in the table above. Why just don’t use the all the available bits?. Well, there is a reason for that.

The number of 1 bits in the first byte denotes the number of bytes used for this character. Extra 0 bit serves the purpose of delimiting. In the example above, since the number of bytes were 2, so the first byte high-order bits were set to 110.

Each continuation byte is denoted by a 10 in the high order bits.

Having these extra bits, gives UTF-8 some useful properties-

It is very easy to see just from a single byte, whether it is a single byte (starting from 0), leading byte of a multibyte sequence (starting from 11), or a continuation byte in a multibyte sequence (starting from 10)

It gives clear indication of the number of bytes for a multibyte sequence. Simply lookup the number of 1’s in the leading byte.

Unicode uses the same value for its code points as the ASCII encoding for the first 128 characters. UTF-8 exploits this fact to store these characters as it is, in a 1 byte, and thus ensuring that all ASCII documents are valid UTF-8 documents without any extra modification. And this also means that all the old C functions and programs and other parsers which assume ASCII as the encoding scheme would continue to run just fine if they are passed the UTF-8 encoded text, assuming of course that the characters are just from the range 0 to 127.

UTF-16

Now let’s move to one more popular encoding scheme, UTF-16. Similary to UTF-8, 16 here denotes that fact that it uses 16-bit (2 bytes) as the basic unit of encoding. UTF-16 is also a multibyte encoding scheme like UTF-8, but it either uses 2 bytes or 4 bytes depending upon the character.

The way UTF-16 encodes characters is as follows-

For all the unicode codepoints from U+0000 to U+D7FF and from U+E000 to U+FFFF are encoded as it is in a 2 byte. You may be wondering what about the codepoint from U+D800 to U+DFFF? Otherwise no one would be able to use the characters that these codepoints are representing ? Well, you are right. And here is one more interesting thing for you to know about Unicode. Unicode has permanently reserved these specific range of codepoints for the UTF-16 encoding of the low and high surrogates, and thus will never be assigned a character. So, it doesn’t matter if these range of codepoints aren’t assigned any encoding. Okay, but what are these surrogates, you ask? To know what they are, follow along.

For the unicode codepoints from U+010000 to U+10FFFF, it is slightly difficult to understand. But I have prepared a simple illustration to make it easier for you to follow along. Firstly 0x010000 is subtracted from the code point, giving us a 20-bit number in the range 0x000000 to 0x0FFFFF. This is done to get the offset of the code point from the first code point in this range. You remember the unused code points from above? Let’s bring them back.

UTF-16 divides these range into two buckets 0xD800...0xDBFF and 0xDC00...0xDFFF (let’s call them Aand B ) where each bucket has 10 free bits and 6 fixed bits(shown in grey in the image). The 20-bit number that we got above after subracting, is now divided into two parts of 10-bit each. The first 10-bits are used to the fill the 10 free bits of A while the remaining 10-bits are used to fill the 10 free bits of B.
These pairs A and B are called surrogate pairs of each other. A is the high surrogate and B is the low surrogate. This AB is the final encoding.

And this how UTF-16 encoding works. There is one issue with UTF-16 that you should know about. And that is of byte order. UTF-8 is a byte-oriented scheme so it doesn’t matter whether the machine is big-endian or little-endian. But it matters in the case of UTF-16 since it is word (2-byte) oriented scheme.

Suppose that we have to store the UTF-16 encoding for the unicode character Ω(U+03A9) which is 2 byte 0x03A9 (See the first point above to check how we got this). If the machine on which we are working on is Big-Endian, the bytes would be stored in the same order, 0x03A9, but if we are working on Little-Endian machine, the bytes would be stored in the reverse order, 0xA903. As long as you keep working on the same machine, everything is fine. It doesn’t matter which type machine is. But as soon as you want to send the file to someone else, who has different type of endian-ness on his machine, things start to break down.

If you send the same character Ω from a big-endian machine to a little-endian machine, the little-endian machine would get the bytes as 0x03A9 and would intrepret them as bytes in reverse order 0xA903, which stands for the unicode character ꤃.

To solve this byte-order issue, UTF-16 uses something which is called Byte-Order Mark(BOM). While saving a UTF-16 document, an additional unicode codepoint with the value
U+FEFF is stored before the actual first codepoint. This codepoint gives the hint to parser of the document, whether the document is stored in the big-endian or little-endian format.

Now if this codepoint is also sent with the character Ω from a big-endian machine to a little-endian machine, the little-endian machine would receive the bytes in the order, 0xFEFF03A9 and would intrepret them in the reverse order as 0xFFFEA903. Now the parser on the little-endian machine seeing the character U+FFFE (which is also reserved by the Unicode solely for this purpose), knows that it has to reverse the byte-order of the rest of the characters and thus correctly reads the next pair of bytes as 0x03A9.

Conclusion

Knowing about Unicode and different types of encoding like ASCII, UTF-8, UTF-16 will really help you a lot in solving problems arising due to incorrect usage of them. It will also be a step toward making your programs truly accessible to global audience who use different langauge than you use.

Thank you for reading my article. Let me know if you liked my article or any other suggestions for me, in the comments section below. And please, feel free to share :)