Some gotchas about Unicode that EVERY programmer should know

Written by: Danny Spencer

February 26, 2015

Unicode?

Yes, Unicode.

Unicode is a text-encoding standard used in virtually every web and desktop application in the world. It is responsible for encoding text written in just about every language and character set. In fact, you are reading Unicode right now.

I believe that all developers who deal with text have a responsibility to thoroughly understand the concepts of Unicode.
Unicode knowledge should be required of all web developers, database designers, back-end developers… well, everyone.
Nearly every program has to process text, so every programmer needs to know how to do it correctly.

Honestly, I’m surprised by how little Unicode is emphasized in schools and
elsewhere. At my college, we were basically taught to assume that foreign
languages don’t exist (i.e. ASCII only). I can’t fathom why.
I don’t know if it’s because they were unaware of Unicode, or if they wanted to
“simplify” the material (which is wrong to do).
Everyone else just seems to forget that Unicode exists.

Those of us who know what Unicode is know that it’s not simple.
There’s a lot to know, and it’s still incredibly easy to make mistakes.
I hope this list of gotchas will help somebody.

Gotchas

Characters that appear the same might not test equal

"Å" == "Å" => true and false?

You can try this yourself in your favorite Unicode-supporting programming language, or in your browser’s JavaScript console (F12 -> Console on most browsers).
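As a minimal sketch of the comparison, the snippet below builds "Å" in two ways using explicit escapes, so the difference survives copy and paste. The variable names are only illustrative.

```javascript
// Two strings that render identically as "Å" but are encoded differently.
const composed = "\u00C5";    // NFC: the single code point U+00C5
const decomposed = "A\u030A"; // NFD: "A" followed by U+030A COMBINING RING ABOVE

// A plain comparison looks at the underlying code units, not the rendered glyph.
console.log(composed === decomposed);             // false
console.log(composed.length, decomposed.length);  // 1 2

// After normalizing both sides to the same form, they test equal.
const equalAsNfc = composed.normalize("NFC") === decomposed.normalize("NFC");
const equalAsNfd = composed.normalize("NFD") === decomposed.normalize("NFD");
console.log(equalAsNfc, equalAsNfd);              // true true
```

Either form works for comparison, as long as both sides use the same one.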

The same issue reproduces in SQLite, MySQL, Oracle SQL, and likely most other SQL databases.

Unfortunately, no SQL databases I know of have the option to auto-normalize inputted strings. If strings must meet unique constraints, then you’ll likely need to normalize said strings at the application layer.
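One way this application-layer normalization might look is sketched below. The `toStorageForm` and `register` helpers are hypothetical, and a `Set` stands in for a column with a unique constraint; the choice of NFC is an assumption, not a requirement.

```javascript
// Hypothetical helper: bring a string into one agreed form before it is
// stored or compared. NFC is assumed here; any single form would do.
function toStorageForm(input) {
  return input.normalize("NFC");
}

// A Set stands in for a database column with a UNIQUE constraint.
const usernames = new Set();

// Returns true if the name was newly registered, false if it already exists.
function register(name) {
  const key = toStorageForm(name);
  if (usernames.has(key)) {
    return false;
  }
  usernames.add(key);
  return true;
}

const first = register("J\u00F6rg");   // "Jörg" with a precomposed ö
const second = register("Jo\u0308rg"); // "Jörg" with o + combining diaeresis
console.log(first, second);            // true false
```

Without the call to `normalize`, both registrations would succeed and the store would hold two names that look identical.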

Note: Usernames on most web apps are traditionally restricted to alphanumeric characters for reasons such as this.

There’s no such thing as a “universal sort” for strings

“Ö” comes before and after “U”?

Your favorite programming language probably has a list.sort() method. You’ve used it to sort a list of integers, and it works just fine. But what about strings?

Recall the SQL table from the previous section. Let’s try ordering some data in it:

In the wacky world of Unicode, the letter “ö” (depending on its form) comes both before and after the letter “u”. This is because most programming languages simply sort strings by the code points of their characters, where a code point is the number assigned to a Unicode character.

In the NFC form, “ö” is U+00F6.
In the NFD form, “ö” is U+006F, U+0308.
In both NFC and NFD forms, “u” is U+0075.

Because 0x6F < 0x75, the NFD “ö” comes before “u”, just like “o” does.
Because 0xF6 > 0x75, the NFC “ö” comes after “u”.
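The arithmetic above can be checked with a default sort. This is a small sketch with illustrative variable names; note that JavaScript's default sort compares UTF-16 code units, which for these characters gives the same order as comparing code points.

```javascript
const nfcO = "\u00F6";  // "ö" as one code point (NFC)
const nfdO = "o\u0308"; // "ö" as "o" + U+0308 COMBINING DIAERESIS (NFD)

// Default sort: plain comparison of UTF-16 code units.
const mixed = [nfcO, "u", nfdO, "o"].sort();

// Show each entry as its code points so the two forms can be told apart.
const asCodePoints = mixed.map((s) =>
  [...s]
    .map((ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"))
    .join(" ")
);
console.log(asCodePoints);
// [ 'U+006F', 'U+006F U+0308', 'U+0075', 'U+00F6' ]

// Normalizing first at least makes both forms of "ö" land in the same place.
const normalized = [nfcO, "u", nfdO, "o"].map((s) => s.normalize("NFC")).sort();
console.log(normalized[2] === normalized[3]); // true
```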

Let’s now assume that normalization is not an issue (all strings have been
normalized to NFC).
Vanilla sorts still may not be enough, depending on your use case.
For locale-specific sorts, we need to talk about collations!

A collation specifies the natural ordering of characters in a language.
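In JavaScript, one way to get a collation-aware sort is the built-in `Intl.Collator`. The sketch below compares it with the default sort; the word list is made up for illustration, and the example assumes a runtime with the usual internationalization support.

```javascript
// "Öl" is written with an escape so the precomposed form (U+00D6) is certain.
const germanWords = ["Zebra", "\u00D6l", "Apfel"];

// Default sort: "Ö" (U+00D6) has a higher code unit than "Z" (U+005A),
// so "Öl" ends up last.
const byCodeUnit = [...germanWords].sort();
console.log(byCodeUnit); // [ 'Apfel', 'Zebra', 'Öl' ]

// Collation-aware sort using German rules: "Ö" is ordered alongside "O".
const germanCollator = new Intl.Collator("de");
const byCollation = [...germanWords].sort(germanCollator.compare);
console.log(byCollation); // [ 'Apfel', 'Öl', 'Zebra' ]
```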

In German, the letter “ö” implicitly becomes “oe” when sorting names (the phone book convention). That means a name starting with “Ö” is ordered as if it were spelled “Oe”.
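To make the “ö becomes oe” rule concrete, here is a deliberately simplified sketch that expands umlauts into a sort key by hand. The `phonebookKey` and `sortPhonebook` helpers and the surnames are invented for illustration, and this is not a full collation: it ignores tie-breaking, other accents, and everything else a real collation covers. Where the runtime supports it, a collator such as `new Intl.Collator("de-u-co-phonebk")` is likely the better choice.

```javascript
// Hypothetical, simplified sort key: lowercase the name and expand
// German umlauts and ß the way the phone book convention does.
function phonebookKey(name) {
  return name
    .normalize("NFC")
    .toLowerCase()
    .replace(/\u00E4/g, "ae") // ä
    .replace(/\u00F6/g, "oe") // ö
    .replace(/\u00FC/g, "ue") // ü
    .replace(/\u00DF/g, "ss"); // ß
}

// Sort a copy of the list by comparing the expanded keys.
function sortPhonebook(names) {
  return [...names].sort((a, b) => {
    const keyA = phonebookKey(a);
    const keyB = phonebookKey(b);
    if (keyA < keyB) return -1;
    if (keyA > keyB) return 1;
    return 0;
  });
}

// Illustrative surnames; "Öttinger" uses the precomposed Ö (U+00D6).
const surnames = ["Ofner", "Ost", "\u00D6ttinger"];
const phonebookOrder = sortPhonebook(surnames);
console.log(phonebookOrder); // [ 'Öttinger', 'Ofner', 'Ost' ]
```

“Öttinger” sorts first because its key is “oettinger”, and “oe” comes before “of” and “os”.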

Web frameworks do not normalize strings!

For developers hearing about this for the first time, this could open a whole new can of worms.
Many of us, being the privileged Anglocentric scum that we are, assume that the world only operates on English characters. As such, we may not realize that our assumptions might be compromising the security of web applications.

If normalization is done haphazardly, someone could log in as both “Jörg” and “Jörg”.
There’s certainly potential here for targeted phishing or backdoor attacks.
In fact, Spotify has had this exact issue.

Unicode has multiple representations of English characters

There is not a single ASCII English letter in that tweet.
Seriously, copy it into your favorite code editor. It’s legit!

The way to recover the plain underlying characters from a string like this is with NFKC or NFKD normalization.
Unlike NFC and NFD, the NFKC and NFKD forms also replace compatibility characters: characters that look different from, but are semantically the same as, a plainer equivalent.
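As a small sketch of the difference, the snippet below uses fullwidth Latin letters and a ligature, both written as escapes so they cannot be confused with ASCII. These sample strings are illustrative and are not taken from the tweet.

```javascript
// "Ｈｅｌｌｏ" written with fullwidth letters (U+FF28, U+FF45, U+FF4C, U+FF4C, U+FF4F).
const fullwidth = "\uFF28\uFF45\uFF4C\uFF4C\uFF4F";

// None of these characters are ASCII.
const isAscii = /^[\x00-\x7F]*$/.test(fullwidth);
console.log(isAscii); // false

// NFC leaves compatibility characters alone.
const viaNfc = fullwidth.normalize("NFC");
console.log(viaNfc === fullwidth); // true

// NFKC replaces them with their plain equivalents.
const viaNfkc = fullwidth.normalize("NFKC");
console.log(viaNfkc); // Hello

// The same applies to ligatures: U+FB01 (the "fi" ligature) becomes "fi".
const ligature = "\uFB01".normalize("NFKC");
console.log(ligature); // fi
```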