Archive for the ‘Unicode’ Category

Note: This article may not display properly if your browser has no font installed with some special Unicode characters.

Unicode is a beautiful thing. Where in the past, we had to be content with text sequences of merely 128 different characters, we can now use over 110,000 characters from almost every script in use today or throughout history. But there’s a problem: despite the fact that Unicode has been around and relatively unchanged for the past 16 years, most modern programming languages do not give programmers a proper abstraction over Unicode strings, which means that we must carefully deal with a range of encoding issues or suffer a plethora of bugs. And there are still languages coming out today that fall into this trap. This blog post is a call to language designers: your fundamental string data type should abstract over Unicode encoding schemes, such that your programmer sees only a sequence of Unicode characters, thus preventing most Unicode bugs.

Over the past few years, I have encountered a surprising amount of resistance to this idea, with counter-arguments including performance reasons, compatibility reasons and issues of low-level control. In this post, I will argue that having a consistent, abstract Unicode string type is more important, and that programmers can be given more control in the special cases that require it.

To give a quick example of why this is a problem, consider the Hello World program on the front page of the Go programming language website. This demo celebrates Go’s Unicode support by using the Chinese word for “world”, “世界”:

fmt.Println("Hello, 世界")

But while Go ostensibly has Unicode strings, they are actually just byte strings, with a strong mandate that they be treated as UTF-8-encoded Unicode text. The difference is apparent if we wrap the string in the built-in len function:

fmt.Println(len("Hello, 世界"))

This code, perhaps surprisingly, prints 13, where one might have expected it to print 9. This is because while there are 9 characters in the string, there are 13 bytes if the string is encoded as UTF-8: in Go, the string literal “Hello, 世界” is equivalent to “Hello, \xe4\xb8\x96\xe7\x95\x8c”. Similarly, indexing into the string uses byte offsets, and retrieves single bytes instead of whole characters. This is a dangerous leaky abstraction: it appears to be a simple sequence of characters until certain operations are performed. Worse, English-speaking programmers have a nasty habit of only testing with ASCII input, which means that any bugs where the programmer has presumed to be dealing with characters (but is actually dealing with bytes) will go undetected.

In a language that properly abstracts over Unicode strings, the above string should have a length of 9. Programmers should only be exposed to encoding details when they explicitly ask for it.

The folly of leaky abstraction

The fundamental problem lies in the difference between characters and code units. Characters are the abstract entities that correspond to code points. Code units are the pieces that represent a character in the underlying encoding, and they vary depending on the encoding. For example, consider the character OSMANYA LETTER BA or ‘𐒁’ (with code point U+10481 — I go into details on the difference between characters and code points in this blog post). In UTF-8, it is represented by the four byte sequence F0 90 92 81; code units are bytes, so we would say this character is represented by four code units in UTF-8. In UTF-16, the same character is represented by two code units: D801 DC81; code units are 16-bit numbers. The problem with many programming languages is that their string length, indexing and, often, iteration operations are in code units rather than in characters (or code points).

(An aside: I find it interesting that the Unicode Standard itself defines a “Unicode string” as “an ordered sequence of code units.” That’s code units, not code points, suggesting that the standard itself disagrees with my thesis that a string should be thought of as merely a sequence of code points. This section of the standard seems concerned with how the strings should be represented, not the programming interface for accessing such strings, so I feel that it doesn’t directly contradict this post. This post is about the string interface, not the underlying representation.)

There are three main types of culprit languages that exhibit this issue:

Languages with byte-oriented string operations and no specified encoding. These leave Unicode support up to programmers and third-party library authors, resulting in a general mish-mash of encoding across different code bases. These include C, Python 2, Ruby, PHP, Perl, Lua, and almost all pre-Unicode (1992) programming languages.

Languages with code-unit-oriented string operations and no specified encoding. These are probably the worst offenders, as the language manages the encoding, but leaks abstraction details that the programmer has no control over (for example, different builds of the compiler may have different underlying encoding schemes, affecting the behaviour of programs). These include certain builds of Python 3.2 and earlier, and Mercury.

Languages with code-unit-oriented string operations, but a specific encoding scheme. These languages at least have well-defined behaviour, but still expose the programmer to implementation details, which can result in bugs. Most modern languages fit into this category, including Go (UTF-8), JVM-based languages such as Java and Scala (UTF-16), .NET-based languages such as C# (UTF-16) and JavaScript (UTF-16).

Edit: A number of comments indicate that some of the other languages have optional Unicode support. In Perl 5, you can write “use utf8″ to turn on proper Unicode strings. In Ruby 1.9, you can attach encodings to strings to make them behave better. In Python 2, as I’ll get to later, you have a separate Unicode string type.

Above, I talked about the leaky abstraction of UTF-8 in Go, but it is equally important to recognise the leaky abstraction of UTF-16 in many modern language frameworks; in particular, the Java platform (1996) and the .NET platform (2001). Both treat strings as a sequence of UTF-16 code units (and the corresponding “char” data type as a single UTF-16 code unit). This is disappointingly close to the ideal. In Java, for instance, you can essentially treat a string to be a sequence of characters. ‘e’ (U+0065) is a character, and ‘汉’ (U+6C49) is a character — it feels like the language is abstracting over Unicode characters, until you realise that it isn’t. ‘𐒁’ (U+10481) is not a valid Java character. It is an astral character (one that cannot be represented in a single UTF-16 code unit) — it needs to be represented by the two-code-unit sequence D801 DC81. So it is possible to store this character in a Java string, but not without understanding the underlying encoding details. For example, the Java string “𐒁” has length 2, and worse, if you ask for “charAt(0)” you will get the character U+D801; “charAt(1)” gives U+DC81. Any code which directly manipulates the characters of the string is likely to fail in the presence of astral characters. Java has the excuse of history: Java was first released in January 1996; Unicode 2.0 was released in July 1996, and before that, there were no astral characters (so for six months, Java had it right!). The .NET languages, such as C#, behave exactly the same, and don’t have the excuse of time. JavaScript is the same again. This is perhaps worse than the UTF-8 languages, because even non-English-speaking programmers are likely to forget about astral characters: there are no writing systems in use today that use astral characters. But astral characters include historical scripts of interest to historians, emoticons which may be used by software, and large numbers of mathematical symbols which are commonly used by mathematicians, including myself, so it is important that software gets them right.

I must stress how harmful the second class of languages are in terms of string encoding. In 2011, the Mercury language switched from a type 1 language (strings are just byte sequences) to a type 2 language — the language now specifies all of the basic string operations in terms of code units, but does not specify the encoding scheme. Worse, Mercury has several back-ends — if compiling to C or Erlang, it uses UTF-8, whereas compiling to Java or C# results in UTF-16-encoded strings. Now programmers must contend with the fact that string.length(“Hello, 世界”) might be 13 or 9 depending on the back-end. (Fortunately, character-oriented alternatives, such as count_codepoints, are provided.) This is a major blow to code portability, which I would recommend language designers avoid at all costs.

Python too has traditionally had this problem: the interpreter can be compiled in “UCS-2″ or “UCS-4″ modes, which represent strings as UTF-16 or UTF-32 respectively, and expose those details to the programmer (the UCS-2 build behaves much like the other UTF-16 languages, while the UCS-4 build behaves exactly as I want, with one character per code unit). Fortunately, Python is about to correct this little wart entirely with the introduction of PEP 393 in upcoming version 3.3. This version will remove UCS-2/UCS-4 build, so all future versions will behave as the UCS-4 build did, but with a nice optimisation: all strings with only Latin-1 characters are encoded in Latin-1; all strings with only basic multilingual plane (BMP) characters are encoded in UCS-2 (UTF-16); all strings with astral characters are encoded in UTF-32. (It’s more complicated than that, but that’s the gist of it.) This ensures the correct semantics, but allows for compact string representation in the overwhelmingly common case of BMP-only strings.

Somewhat surprisingly, Bash (1989) appears to support character-oriented Unicode strings. The only other languages I have found which properly abstract over Unicode strings are from the functional world: Haskell 98 (1998) and Scheme R6RS (2007). Haskell 98 specifies that “the character type Char is an enumeration whose values represent Unicode characters” (with a link to the Unicode 5.0 specification). Scheme R6RS was the first version of Scheme to mention Unicode, and got it right on the first go. It specifies that “characters are objects that represent Unicode scalar values,” and goes into details explicitly stating that a character value is in the range [0, D7FF16] ∪ [E00016, 10FFFF16], then defines strings as sequences of characters.

With only four languages that I know of properly providing character-oriented string operations by default (are there any more?), this is a pretty poor track record. Sadly, many modern languages are being built on top of either the JVM or .NET framework, and so naturally absorb the poor character handling of those platforms. Still, it would be nice to see some more languages that behave correctly.

Some real-world problems

So far, the discussion has been fairly academic. What are some of the actual problems that have come about as a result of the leaky abstraction?

Edit: I mention a heap of technologies here, in a way that could be interpreted to be derisive. I don’t mean any offense towards the creators of these technologies, but rather, I intend to point out how difficult it can be to work with Unicode on a language that doesn’t provide the right abstractions.

A memorable example for me was a program I worked on which sent streams of text from a Python 2 server (UTF-8 byte strings) to a JavaScript client (UTF-16 strings). We had chosen the arbitrary message size of 512 bytes, and were chopping up text arbitrarily on those boundaries, and sending them off encoded in UTF-8 to the client, where they were concatenated back together. This almost worked, but we discovered a strange problem: some non-ASCII characters were being messed up some of the time. It occurred if you happened to have a multi-byte character split across a 512-byte boundary. For example, “ü” (U+00FC) is encoded as hex C3 BC. If byte 511 of a message was C3 and byte 512 was BC, the first packet would be sent ending in the byte C3 (which JavaScript wouldn’t know how to convert to UTF-16), and the second packet would be sent beginning with the byte BC (which JavaScript also wouldn’t know how to convert to UTF-16). So the result is “��”. In general, if a language exposes its underlying code units, then strings may not be split, converted to a different encoding, and re-concatenated.

An obvious example of an operation that goes wrong in these situations is the length operator. If you use UTF-8, Chinese characters are going to each report a length of 3, while in UTF-16, astral characters will each report a length of 2. Think this doesn’t matter? What about Twitter? On Twitter, you have to type messages into 140 characters. That’s 140 characters, not 140 bytes. Imagine if Twitter ate up three “characters” every time you hit a key (if you spoke Chinese, for example). Fortunately, Twitter does it correctly, even for astral characters. I just tweeted the following:

Notice that the text is unusually bold — these are no ordinary letters. They are MATHEMATICAL BOLD letters, which are astral characters. For example, the ‘𝐓’ is U+1D413 MATHEMATICAL BOLD CAPITAL T. The tweet contains exactly 140 characters — 28 ASCII, 3 in the basic multilingual plane, and 109 astral. In UTF-16, it comprises 249 code units, while in UTF-8, it comprises 473 bytes. Yet it was accepted in Twitter’s 140 character limit, because they count characters. I would guess that Twitter’s counting routines are written in both Scala and JavaScript, which means that in both versions, the engineers would need to have specially considered surrogate pairs as a single character. Had Twitter been written in a more Unicode-aware language, they could just have called the length method and not given it another thought.

Let’s quickly look at the way Tomboy (a note taking app written in C# on the Mono/.NET platform) loads notes from XML files. I discovered a bug in Tomboy where the entire document would be messed up if you type an astral character, save, quit, and load. It turns out that one function is keeping an int “offset” which counts the number of “characters” (actually code units) read, and passes the offset to another function, which “returns the location at a particular character offset.” This one really means character. Needless to say, one astral character means that the character offset is too high, resulting in a complete mess. Why is it the programmer’s responsibility to deal with these concepts?

Other software I’ve found that doesn’t handle astral characters correctly includes Bugzilla (ironically, I discovered this when it failed to save my report of the above Tomboy bug) and FileFormat.info (an otherwise-fantastic resource for Unicode character information). The latter shows what can happen when a UTF-16 string is encoded to UTF-8 without a great deal of care, presumably due to an underlying language exposing a UTF-16 string representation.

A common theme here is programmer confusion. While these problems can be solved with enough effort, the leaky abstraction forces the programmer to expend effort that could be spent elsewhere. Importantly, it makes documentation more difficult. Any language with byte- or code-unit-oriented strings forces programmers to clarify what they mean every time they say “length” or “n characters”. Only when the language completely hides the encoding details are programmers free to use the phrase “n characters” unambiguously.

What about performance?

The most common objection to character-oriented strings is that it impacts the performance of strings. If you have an abstract Unicode string that uses UTF-8 or UTF-16, your indexing operation (charAt) goes from O(1) to O(n), as you must iterate over all of the variable-length code points. It is easier to let the programmer suffer an O(1) charAt that returns code units rather than characters. Alternatively, if you have an abstract Unicode string that uses UTF-32, you’ll be fine from a performance standpoint, but you’ve made strings take up two or four times as much space.

But let’s think about what this really means. The performance argument really says, “it is better to be fast than correct.” That isn’t something I’ve read in any software engineering book. Conventional wisdom is the opposite: 1. Be correct — in all cases, not just the common ones, 2. Be fast (here it is OK to optimise for the common cases). While performance is an important consideration, is unbelievable to me that designers of high-level languages would prioritise speed over correctness. (Again, you could argue that a code unit interface is not “incorrect,” merely “low level,” but hopefully this post shows that it frequently results in incorrect code.)

I don’t know exactly what the best implementation is, and I’m not here to tell you how to implement it. But for what it’s worth, I don’t consider UTF-32 to be a tremendous burden in this day and age (in moving to 64-bit platforms, we doubled the size of all pointers — why can’t we also double the size of strings?) Alternatively, Python’s PEP 393 shows that we can have correct semantics and optimise for common cases as well. In any case, we are talking about software correctness here — can we get the semantics right first, then worry about optimisation?

But it isn’t supposed to be abstract — that’s what libraries are for

The second most common objection to character-oriented strings is “but the language just provides the basic primitives, and the programmer can build whatever they like on top of that.” The problem with this argument is that it ignores the ecosystem of the language. If the language provides, for example, a low-level byte-string data type, and programmers can choose to use it as a UTF-8 string (or any other encoding of their choice), then there is going to be a problem at the seams between code written by different programmers. If you see a library function that takes a byte-string, what should you pass? A UTF-8 string? A Latin-1 string? Perhaps the library hasn’t been written with Unicode awareness at all, and any non-ASCII string will break. At best, the programmer will have written a comment explaining how you should encode Unicode characters (and this is what you need to do in C if you plan to support Unicode), but most programmers will forget most of the time. It is far better to have everybody on the same page by having the main string type be Unicode-aware right out of the box.

The ecosystem argument can be seen very clearly if we compare Python 2 and 3. Python 3 is well known for better Unicode support than its predecessor, but fundamentally, it is a rather simple difference: Python 2’s byte-string type is called “str”, while its Unicode string type is called “unicode”. By contrast, Python 3’s byte-string type is called “bytes”, while its Unicode string type is called “str”. They behave pretty much the same, just with different names. But in practice, the difference is immense. Whenever you call the str() function, it produces a Unicode string. Quoted literals are Unicode strings. All Python 3 functions, regardless of who wrote them, deal with Unicode, unless the programmer had a good reason to deal with bytes, whereas in Python 2, most functions dealt with bytes and often broke when given a Unicode string. The lesson of Python 3 is: give programmers a Unicode string type, make it the default, and encoding issues will mostly go away.

What about binary strings and different encodings?

Of course, programming languages, even high-level ones, still need to manipulate data that is actually a sequence of 8-bit bytes, that doesn’t represent text at all. I’m not denying that — I just see it as a totally different data structure. There is no reason a language can’t have a separate byte array type. UTF-16 languages generally do. Java has separate types “byte” and “char”, with separate semantics (for example, byte is a numeric type, whereas char is not). In Java, if you want a byte array, just say “byte[]”. Python 3, similarly, has separate “bytes” and “str” types. JavaScript recently added a Uint8Array type for this purpose (since JavaScript previously only had UTF-16 strings). Having a byte array type is not an excuse for not having a separate abstract string type — in fact, it is better to have both.

Similarly, a common argument is that text might come into the program in all manner of different encodings, and so abstracting away the string representation means programmers can’t handle the wide variety of encodings out there. That’s nonsense — of course even high-level languages need encoding functions. Python 3 handles this ideally — the string (str) data type has an ‘encode’ method that can encode an abstract Unicode string into a byte string using an encoding of your choice, while the byte string (bytes) data type has a ‘decode’ method that can interpret a byte string using a specified encoding, returning an abstract Unicode string. If you need to deal with input in, say, Latin-1, you should not allow the text to be stored in Latin-1 inside your program — then it will be incompatible with any other non-Latin-1 strings. The correct approach is to take input as a byte string, then immediately decode it into an abstract Unicode string, so that it can be mixed with other strings in your program (which may have non-Latin characters).

What about non-Unicode characters?

There’s one last thorny issue: it’s all well and good to say “strings should just be Unicode,” but what about characters that cannot be represented in Unicode? “Like what,” you may ask. “Surely all of the characters from other encodings have made their way into Unicode by now.” Well, unfortunately, there is a controversial topic called Han unification, which I won’t go into details about here. Essentially, some languages (notably Japanese) have borrowed characters from Chinese and changed their appearance over thousands of years. The Unicode consortium officially considers these borrowed characters as being the same as the original Chinese, but with a different appearance (much like a serif versus a sans-serif font). But many Japanese speakers consider them to be separate to the Chinese characters, and Japanese character encodings reflect this. As a result, Japanese speakers may want to use a different character set than Unicode.

This is unfortunate, because it means that a programming language designed for Japanese users needs to support multiple encodings and expose them to the programmer. As Ruby has a large Japanese community, this explains why Ruby 1.9 added explicit encodings to strings, instead of going to all-Unicode as Python did. I personally find that the advantages of a single high-level character sequence interface outweigh the need to change character sets, but I wanted to mention this issue anyway, by way of justifying Ruby’s choice of string representation.

What about combining characters?

I’ve faced the argument that my proposed solution doesn’t go far enough in abstracting over a sequence of characters. I am suggesting that strings be an abstract sequence of characters, but the argument has been put to me that characters (as Unicode defines them) are not the most logical unit of text — combining character sequences (CCSes) are.

For example, consider the string “𐒁̸”, which consists of two characters: ‘𐒁’ (U+10481 OSMANYA LETTER BA) followed by ‘◌̸’ (U+0338 COMBINING LONG SOLIDUS OVERLAY). What is the length of this string?

Is it 6? That’s the number of UTF-8 bytes in the string (4 + 2).

Is it 3? That’s the number of UTF-16 code units in the string (2 + 1).

Is it 2? That’s the number of characters (code points) in the string.

Is it 1? That’s the number of combining character sequences in the string.

My argument is that the first two answers are unacceptable, because they expose implementation details to the programmer (who is then likely to pass them on to the user) — I am in favour of the third answer (2). The argument put to me is that the third answer is also unacceptable, because most people are going to think of “𐒁̸” as a single “character”, regardless of what the Unicode consortium thinks. I think that either the third or fourth answers are acceptable, but the third is a better compromise, because we can count characters in constant time (just use UTF-32), whereas it is not possible to count CCSes in constant time.

I know, I know, I said above that performance arguments don’t count. But I see a much less compelling case for counting CCSes over characters than for counting characters over code units.

Firstly, users are already exposed to combining characters, but they aren’t exposed to code units. We typically expect users to type the combining characters separately on a keyboard. For instance, to type the CCS “e̸”, the user would type ‘e’ and then type the code for ‘◌̸’. Conversely, we never expect users to type individual UTF-8 or UTF-16 code units — the representation of characters is entirely abstract from the user. Therefore, the user can understand when Twitter, for example, counts “e̸” as two characters. Most users would not understand if Twitter were to count “汉” as three characters.

Secondly, dealing with characters is sufficient to abstract over the encoding implementation details. If a language deals with code units, a program written for a UTF-8 build may behave differently if compiled with a UTF-16 string representation. A language that deals with characters is independent of the encoding. Dealing with CCSes doesn’t add any further implementation independence. Similarly, dealing with characters is enough to allow strings to be spliced, re-encoded, and re-combined on arbitrary boundaries; dealing with CCSes isn’t required for this either.

Thirdly, since Unicode allows arbitrarily long combining character sequences, dealing with CCSes could introduce subtle exploits into a lot of code. If Twitter counted CCSes, a user could write a multi-megabyte tweet by adding hundreds of combining characters onto a single base character.

In any case, I’m happy to have a further debate about whether we should be counting characters or CCSes — my goal here is to convince you that we should not be counting code units. Lastly, I’ll point out that Python 3 (in UCS-4 build, or from version 3.3 onwards) and Haskell both consider the above string to have a length of 2, as does Twitter. I don’t know of any programming languages that count CCSes. Twitter’s page on counting characters goes into some detail about this.

Conclusion

My point is this: the Unicode Consortium has provided us with a wonderful standard for representing and communicating characters from all of the world’s scripts, but most modern languages needlessly expose the details of how the characters are encoded. This means that all programmers need to become Unicode experts to make high-quality internationalised software, which means it is far less likely that programs will work when non-ASCII characters, or even worse, astral characters, are used. Since most English programmers don’t test with non-ASCII characters, and almost nobody tests with astral characters, this makes it pretty unlikely that software will ever see such inputs before it is released.

Programming language designers need to become Unicode experts, so that regular programmers don’t have to. The next generation of languages should provide character-oriented string operations only (except where the programmer explicitly asks for text to be encoded). Then the rest of us can get back to programming, instead of worrying about encoding issues.

Note: This article may not display properly if your browser has no font installed with some special Unicode characters.

I’m writing another rather lengthy Unicode post, but before I finish it, I want to clear up a terminology problem I ran into, as I seem not to be the only one. This post is about clearing up the definition of the word character, with respect to the Unicode standard.

Let me briefly define the problem. It’s pretty obvious that ‘a’ (U+0061 LATIN SMALL LETTER A) is a character, as is ‘∞’ (U+221E INFINITY). But things get more complicated when we consider combining characters. When I type ‘∞’ (U+221E INFINITY) followed by ‘◌̸’ (U+0338 COMBINING LONG SOLIDUS OVERLAY), I get the single thing “∞̸”. Is this thing a character? If so, what do we call the two things that we combined to make this character?

This problem becomes quite important when we need to count the characters in a string (or index into the string by character). How many characters are in the string “∞̸”, one or two? It depends on whether you use the word character to refer to the ‘∞’ and the ‘◌̸’, or to refer to the combined “∞̸”. There’s some ambiguity here. How does the Unicode standard resolve it?

I have commonly seen the following terminology used to resolve this conflict: code point refers to the individual things being combined (the ‘∞’ and the ‘◌̸’), while character refers to the combined thing (the “∞̸”). So the above string has two code points, but only one character. This interpretation is wrong.

Firstly, this is quite unsatisfactory terminology because it misuses the term code point. Even if we ignore combining characters, code point is still not a synonym for character. A code point is just an integer, usually expressed in U+xxxx notation. Consider plain old ASCII. It is often said that “a character is a byte,” but this is not strictly true, even in ASCII land: ‘a’ is a character; 0x61 is a byte. Just because there is a one-to-one correspondence between characters and (7-bit) bytes does not mean that character and byte are synonyms. ASCII describes the encoding between characters and bytes. Similarly, in Unicode land, character and code point are not synonyms: ‘∞’ is a character; U+221E is a code point. Unicode describes the encoding between characters and code points. So please do not say that ‘∞’ is a code point. This leaves us with a gap in our terminology: if “∞̸” is a character, then what term, if not code point, do we use to describe the ‘∞’ and ‘◌̸’ that combined to produce it?

The correct answer (as I read the spec — it’s not exactly clear) is that “∞̸” is not a character. Even though in colloquial speech we may call it that, in Unicode, a character is something that has a corresponding code point. Every character in Unicode has a code point: ‘∞’ is a character, having code point U+221E; ‘◌̸’ is a (combining) character, having code point U+0338. Since “∞̸” doesn’t have a code point of its own, it isn’t a character. So what is it?

It’s a combining character sequence. The Unicode glossary defines a combining character sequence as:

A maximal character sequence consisting of either a base character followed by a sequence of one or more characters where each is a combining character…

So please do not call “∞̸” a character. If it consists of more than one code point, then it is more than a single character.

Note that, formally, we should use the term abstract character for what I call a character above. The term “character” is just short-hand for “abstract character” — the “abstract” prefix does not imply a combining character sequence; it is just used to distinguish the word “character” from its various other colloquial meanings. I should also point out that when I say an abstract character is something that has a code point, it doesn’t have to be a unique code point. In rare cases, an abstract character can have several equivalent encodings. For example, the abstract character ‘Å’ can be encoded as U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE, or as U+212B ANGSTROM SIGN (for historical reasons). The important point is that it can be encoded with a single code point.

This is all the more confusing because it is possible to represent certain abstract characters using combining characters. For example, the characters ‘A’ (U+0041 LATIN CAPITAL LETTER A) and ‘◌̊’ (U+030A COMBINING RING ABOVE) together form the combining character sequence “Å”, which is canonically equivalent to the abstract character ‘Å’ (even though it might display a little bit differently in your browser). In these cases, the abstract character ‘Å’ can be said to be encoded as the code point sequence U+0041 U+030A. But this doesn’t mean in general that you can call a combining sequence a “character” — rather, it means that in certain cases, there is a character that is equivalent to a combining sequence. (This particular example is given in Figure 2-8 of Chapter 2 of The Unicode Standard 6.0 [PDF].) The reason I have used a less common example, “∞̸”, is that there is no equivalent single character (to my knowledge), and therefore “∞̸” can only be called a combining character sequence, and not a character.

In summary:

A character is something that can be encoded with a single code point (integer). When using combining characters, each of the individual combining elements is a separate character.

A code point is an integer, which identifies a character.

A combining character sequence is the result of combining a base (normal) character with one or more combining characters.

This question comes up quite a lot: “I have a URI. How do I percent-encode it?” In this post, I want to explore this question fully, because it is a hard question, and one that, on the face of it, is nonsense. The short answer to this question is: “You can’t. If you haven’t percent-encoded it yet, you’re too late.” But there are some specific reasons why you might want to do this, so read on.

Let’s get at the meaning of this question. I’ll use the term URI (Uniform Resource Identifier), but I’m mostly talking about URLs (a URI is just a bit more general). A URI is a complete identifier of a resource, such as:

http://example.com/admin/login?name=Helen&gender=f

A URI component is any of the individual atomic pieces of the URI, such as “example.com”, “admin”, “login”, “name”, “Helen”, “gender” or “f”. The components are the parts that the URI syntax has no interest in parsing; they represent plain text strings. Now the problem comes when we encounter a URI such as this:

http://example.com/admin/login?name=Helen Ødegård&gender=f

This isn’t a legal URI because the last query argument “Helen Ødegård” is not percent-encoded — the space (U+0020, meaning that this is the Unicode character with hexadecimal value 20), as well as the non-ASCII characters ‘Ø’ (U+00D8) and ‘å’ (U+00E5) are forbidden in any URI. So the answer to “can’t we fix this?” is “yes” — we can retroactively percent-encode the URI so it appears like this:

A digression: note that the Unicode characters were first encoded to bytes with UTF-8: ‘Ø’ (U+00D8) encodes to the byte sequence hex C3 98, which then percent-encodes to %C3%98. This is not actually part of the standard: none of the URI standards (the most recent being RFC 3986) specify how a non-ASCII character is to be converted into bytes. I could also have encoded them using Latin-1: “Helen%20%D8deg%E5rd,” but then I couldn’t support non-European scripts. This is a mess, but it isn’t the subject of this article, and the world mostly gets along fine by using UTF-8, which I’ll assume we’re using for the rest of this article.

Okay, so that’s solved, but will it work in all cases? How about this URI:

But how did we know to encode the “#”? What if whoever typed this URI genuinely meant for there to be a query of “redirect=http://example.com/news” and a fragment of “funny&name=Helen&gender=f”. It is wrong for us to meddle with the given URI, assuming that the “#” was intended to be a literal character and not a delimiter. The answer to “can we fix it?” is “no“. Fixing the above URI would only introduce bugs. The answer is “if you wanted that ‘#’ to be interpreted literally, you should have encoded it before you stuck it in the URI.”

The idea that you can:

Take a bunch of URI components (as bare strings),

Concatenate them together using URI delimiters (such as “?”, “&” and “#”),

Percent-encode the URI.

is nonsense, because once you have done step #2, you cannot possibly know (in general) which characters were part of the original URI components, and which are delimiters. Instead, error-free software must:

Take a bunch of URI components (as bare strings),

Percent-encode each individual URI component,

Concatenate them together using URI delimiters (such as “?”, “&” and “#”).

This is why I previously recommendednever using JavaScript’s encodeURI function, and instead to use encodeURIComponent. The encodeURI function “assumes that the URI is a complete URI” — it is designed to perform step #3 in the bad algorithm above, which by definition, is meaningless. The encodeURI function will not encode the “#” character, because it might be a delimiter, so it would fail to interpret the above example in its intended meaning.

The encodeURIComponent function, on the other hand, is designed to be called on the individual components of the URI before they are concatenated together — step #2 of the correct algorithm above. Calling that function on just the component “http://example.com/news#funny&#8221; would produce:

http%3A%2F%2Fexample.com%2Fnews%23funny

which is a bit of overkill (the “:” and “/” characters do not strictly need to be encoded in a query parameter), but perfectly valid — when the data reaches the other end it will be decoded back into the original string.

So having said all of that, is there any legitimate need to break the rule and percent-encode a complete URI?

URI cleaning

Well, yes there is. (I have been bullish in the past that there isn’t, such as in my answer to this question on Stack Overflow, so this post is me reconsidering that position a bit.) It happens all the time: in your browser’s address bar. If you type this URL into the address bar:

http://example.com/admin/login?name=Helen Ødegård&gender=f

it is not typically an error. Most browsers will automatically “clean up” the URL and send an HTTP request to the server with the line:

GET /admin/login?name=Helen%20%C3%98deg%C3%A5rd&gender=f HTTP/1.1

(Unfortunately, IANA will redirect this immediately, but if you inspect the packets or try it on a server you control, then check the logs, you will see this is true.) Most browsers don’t show you that they’re cleaning up the URIs — they attempt to display them as nicely as possible (in fact, if you type in the escaped version, Firefox will automatically convert it so you can see the space and Unicode characters in the address bar).

Does this mean that we can relax and encode our URIs after composing them? No. This is an application of Postel’s Law (“Be liberal in what you accept, and conservative in what you send.”) The browser’s address bar, being a human interface mechanism, is helpfully attempting to take an invalid URI and make it valid, with no guarantee of success. It wouldn’t help on my second example (“redirect=http://example.com/news#funny”). I think this is a great idea, because it lets users type spaces and Unicode characters into the address bar, and it isn’t necessarily a bad idea for other software to do it too, particularly where user interfaces are concerned. As long as the software is not relying on it.

In other words, software should not use this technique to construct URIs internally. It should only ever use this technique to attempt to “clean up” URIs that have been supplied from an external source.

So that is the point of JavaScript’s encodeURI function. I don’t like to call this “encoding” because that implies it is taking something unencoded and converting it into an encoded form. I prefer to call this “URI cleaning”. That name is suggestive of the actual process: taking a complete URI and cleaning it up a bit.

Unfortunately (as pointed out by Tim Cuthbertson), encodeURI is not quite good for this purpose — it encodes the ‘%’ character, meaning it will double-escape any URI that already has percent-escaped content. More on that later.

We can formalise this process by describing a new type of object called a “super URI.” A super URI is a sequence of Unicode characters with the following properties:

Any character that is in any way valid in a URI is interpreted as normal for a URI,

Any other character is interpreted as a URI would interpret the sequence of characters resulting from percent-encoding the UTF-8 encoding of the character (or some other character encoding scheme).

Now it becomes clear what we are doing: URI cleaning is simply the process of transforming a super URI into a normal URI. In this light, rather than saying that the string:

http://example.com/admin/login?name=Helen Ødegård&gender=f

is “some kind of malformed URI,” we can say it is a super URI, which is equivalent to the normal URI:

Note that super URIs have nothing to do with not percent-encoding of delimiter characters — delimiters such as “#” must still be percent-escaped in the super URI. They are only about not percent-encoding invalid characters. We can consider super URIs to be human-readable syntax, while proper URIs are required for data transmission. This means that we can also take a proper URI and convert it into a more human-readable super URI for display purposes (as web browsers do). That is the purpose of JavaScript’s decodeURI function. Note that, again, I don’t consider this to be “decoding,” rather, “pretty printing.” It doesn’t promise not to show you percent-encoded characters. It only decodes characters that are illegal in normal URIs.

It is probably a good idea for most applications that want to “pretty print” a URI to not decode control characters (U+0000 — U+001F and U+007F), to avoid printing garbage and newlines. Note that decodeURI does decode these characters, so it is probably unwise to use it for display purposes without some post-processing.

Update: My “super URI” concept is similar to the formally specified IRI (Internationalized Resource Identifier) — basically, a URI that can have non-ASCII characters. However, my “super URIs” also allow other ASCII characters that are illegal in URLs.

Which characters?

Okay, so exactly which characters should be escaped for this URI cleaning operation? I thought I’d take the opportunity to break down the different sets of characters described by the URI specification. I will address two versions of the specification: RFC 2396 (published in 1998) and RFC 3986 (published in 2005). 2396 is obsoleted by 3986, but since a lot of encoding functions (including JavaScript’s) were invented before 2005, it gives us a good historical explanation for their behaviour.

RFC 2396

This specification defines two sets of characters: reserved and unreserved.

The reserved characters are: $&+,/:;=?@

The unreserved characters are: ALPHA and NUM and !'()*-._~

Where ALPHA and NUM are the ASCII alphabetic and numeric characters, respectively. (They do not include non-ASCII characters.)

There is a semantic difference between reserved and unreserved characters. Reserved characters may have a syntactic meaning in the URI syntax, and so if one of them is to appear as a literal character in a URI component, it may need to be escaped. (This will depend upon context — a literal ‘?’ in a path component will need to be escaped, whereas a ‘?’ in a query does not need to be escaped.) Unreserved characters do not have a syntactic meaning in the URI syntax, and never need to be escaped. A corollary to this is that the escaping or unescaping an unreserved character does not change its meaning (“Z” means the same as “%5A”; “~” means the same as “%7E”), but escaping or unescaping a reserved character might change its meaning (“?” may have a different meaning to “%3F”).

The URI component encoding process should percent-encode all characters that are not unreserved. It is safe to escape unreserved characters as well, but not necessary and generally not preferable.

Together, these two sets comprise the valid URI characters, along with two other characters: ‘%’, used for encoding, and ‘#’, used to delimit the fragment (the ‘#’ and fragment were not considered to be part of the URI). I would suggest that both ‘%’ and ‘#’ be treated as reserved characters. All other characters are illegal. The complete set of illegal characters, under this specification, follows:

The ASCII control characters (U+0000 — U+001F and U+007F)

The space character

The characters: “<>[\]^`{|}

Non-ASCII characters (U+0080 — U+10FFFD)

The URI cleaning process should percent-encode precisely this set of characters: no more and no less.

RFC 3986

The updated URI specification from 2005 makes a number of changes, both to the way characters are grouped, and to the sets themselves. The reserved and unreserved sets are now as follows:

The reserved characters are: !#$&'()*+,/:;=?@[]

The unreserved characters are: ALPHA and NUM and -._~

This version features ‘#’ as a reserved character, because fragments are now considered part of the URI proper. There are two more important additions to the restricted set. Firstly, the characters “!'()*” have been moved from unreserved to reserved, because they are “typically unsafe to decode.” This means that, while these characters are still technically legal in a URI, their encoded form may be interpreted differently to their bare form, so encoding a URI component should encode these characters. Note that this is different than banning them from URIs altogether (for example, a “javascript:” URI is allowed to contain bare parentheses, and that scheme simply chooses not to distinguish between “(” and “%28″). Secondly, the characters ‘[‘ and ‘]’ have been moved from illegal to reserved. As of 2005, URIs are allowed to contain square brackets. This unfortunate change was made to allow IPv6 addresses in the host part of a URI. However, note that they are only allowed in the host, and not anywhere else in the URI.

The reserved characters were also split into two sets, gen-delims and sub-delims:

The gen-delims are: #/:?@[]

The sub-delims are: !$&'()*+,;=

The sub-delims are allowed to appear anywhere in a URI (although, as reserved characters, their meaning may be interpreted differently if they are unescaped). The gen-delims are the important top-level syntactic markers used to delimit the fields of the URI. The gen-delims are assigned meaning by the URI syntax, while the sub-delims are assigned meaning by the scheme. This means that, depending on the scheme, sub-delims may be considered unreserved. For example, a program that encodes a JavaScript program into a “javascript:” URI does not need to encode the sub-delims, because JavaScript will interpret them the same whether they are encoded or not (such a program would need to encode illegal characters such as space, and gen-delims such as ‘?’, but not sub-delims). The gen-delims may also be considered unreserved in certain contexts — for example, in the query part of a URI, the ‘?’ is allowed to appear bare and will generally mean the same thing as “%3F”. However, it is not guaranteed to compare equal: under the Percent-Encoding Normalization rule, encoded and bare versions of unreserved characters must be considered equivalent, but this is not the case for reserved characters.

Taking the square brackets out of the illegal set leaves us with the following illegal characters:

The ASCII control characters (U+0000 — U+001F and U+007F)

The space character

The characters: “<>\^`{|}

Non-ASCII characters (U+0080 — U+10FFFD)

A modern URI cleaning function must encode only the above characters. This means that any URI cleaning function written before 2005 (hint: encodeURI) will encode square brackets! That’s bad, because it means that a URI with an IPv6 address:

which refers to the domain name “[2001:db8:85a3:8d3:1319:8a2e:370:7348]” (not the IPv6 address). Mozilla’s reference on encodeURI contains a work-around that ensures that square brackets are not encoded. (Note that this still double-escapes ‘%’ characters, so it isn’t good for URI cleaning.)

So what exactly should I do?

If you are building a URI programmatically, you must encode each component individually before composing them.

Escape the following characters: space and !”#$%&'()*+,/:;<=>?@[\]^`{|} and U+0000 — U+001F and U+007F and greater.

Do not escape ASCII alphanumeric characters or -._~ (although it doesn’t matter if you do).

If you have specific knowledge about how the component will be used, you can relax the encoding of certain characters (for example, in a query, you may leave ‘?’ bare; in a “javascript:” URI, you may leave all sub-delims bare). Bear in mind that this could impact the equivalence of URIs.

If you are parsing a URI component, you should unescape any percent-encoded sequence (this is safe, as ‘%’ characters are not allowed to appear bare in a URI).

If you are “cleaning up” a URI that someone has given you:

Escape the following characters: space and “<>\^`{|} and U+0000 — U+001F and U+007F and greater.

You may (but shouldn’t) escape ASCII alphanumeric characters or -._~ (if you really want to; it will do no harm).

You must not escape the following characters: !#$%&'()*+,/:;=?@[]

For an advanced URI cleaning, you may also fix any other syntax errors in an appropriate way (for example, a ‘[‘ in the path segment may be encoded, as may a ‘%’ in an invalid percent sequence).

An advanced URI cleaner may be able to escape some reserved characters in certain contexts. Bear in mind that this could impact the equivalence of URIs.

If you are “pretty printing” a URI and want to display escaped characters as bare, where possible:

Unescape the following characters: space and “-.<>\^_`{|}~ and ASCII alphanumeric characters and U+0080 and greater.

It is probably not wise to unescape U+0000 — U+001F and U+007F, as they are control characters that could cause display problems (and there may be other Unicode characters with similar problems.)

You must not unescape the following characters: !#$%&'()*+,/:;=?@[]

An advanced URI printer may be able to unescape some reserved characters in certain contexts. Bear in mind that this could impact the equivalence of URIs.

Some implementations

JavaScript

As I stated earlier, never use escape. First, it is not properly specified. In Firefox and Chrome, it encodes all characters other than the following: *+-./@_. This makes it unsuitable for URI construction and cleaning. It encodes the unreserved character ‘~’ (which is harmless, but unnecessary), and it leaves the reserved characters ‘*’, ‘+’, ‘/’ and ‘@’ bare, which can be problematic. Worse, it encodes Latin-1 characters with Latin-1 (instead of UTF-8) — not technically a violation of the spec, but likely to be misinterpreted, and even worse, it encodes characters above U+00FF with the malformed syntax “%uxxxx”. Avoid.

JavaScript’s “fixed” URI encoding functions behave according to RFC 2396, and assuming Unicode characters are to be encoded with UTF-8. This means that they are lacking the 2005 changes:

encodeURIComponent does not escape the previously-unreserved characters ‘!’, “‘”, “(“, “)” and “*”. Mozilla’s reference includes a work-around for this.

encodeURI erroneously escapes the previously-illegal characters ‘[‘ and ‘]’. Mozilla’s reference includes a work-around for this.

decodeURI erroneously unescapes ‘[‘ and ‘]’ (although there doesn’t seem to be a practical case where this is a problem).

Edit: Unfortunately, encodeURI and decodeURI have a single, critical flaw: they escape and unescape (respectively) percent signs (‘%’), which means they can’t be used to clean a URI. (Thanks to Tim Cuthbertson for pointing this out.) For example, assume we wanted to clean the URI:

Note that these do not work properly at all on Unicode strings — you should first encode the string using UTF-8 before passing it to urllib.quote.

In Python 3, the quote and unquote functions have been moved into the urllib.parse module, and upgraded to work on Unicode strings (by me — yay!). By default, these will encode and decode strings as UTF-8, but this can be changed with the encoding and errors parameters (see urllib.parse.quote and urllib.parse.unquote).

I don’t know of any Python built-in functions for doing URI cleaning, but urllib.quote can easily be used for this purpose by passing safe=”!#$%&'()*+,/:;=?@[]~” (the set of reserved characters, as well as ‘%’ and ‘~'; note that alphanumeric characters, and ‘-‘, ‘.’ and ‘_’ are always safe in Python).

Mozilla Firefox

Firefox 10’s URL bar performs URL cleaning, allowing the user to type in URLs with illegal characters, and automatically converting them to correct URLs. It escapes the following characters:

space, “‘<>` and U+0000 — U+0001F and U+007F and greater. (Note that this includes the double and single quote.)

Note that the control characters for NUL, tab, newline and carriage return don’t actually transmit.

I would say this is erroneous: on a minor note, it should not be escaping the single quote, as that is a reserved character. It also fails to escape the following illegal characters: \^{|}, sending them to the server bare.

Firefox also “prettifies” any URI, decoding most of the percent-escape sequences for the characters that it knows how to encode.

Google Chrome

Chrome 16’s URL bar also performs URL cleaning. It is rather similar to Firefox, but encoding the following characters:

space, “<> and U+0000 — U+0001F and U+007F and greater. (Note that this includes only the double quote.)

So Chrome also fails to escape the illegal characters \^`{|} (including the backtick, which Firefox escapes correctly), but unlike Firefox, it does not erroneously escape the single quote.

Something strange I just noticed, when playing around with bidirectional Unicode text. If you haven’t seen bidirectional (“bidi”) text, it’s weird. This is the part of Unicode that deals with right-to-left scripts, like Hebrew, and their insanely complicated interactions with left-to-right text.

Consider the five characters stored in sequence:

“ר” then ” ” then “∈” then ” ” then “מ”

I separated them with English words so your browser’s bidi doesn’t touch them yet. These five characters might appear in a Hebrew maths paper (?), to mean “ר is a member of the set מ”, much like “x ∈ s” means “x is a member of the set s”.

Let’s render this out:

ר ∈ מ

If your browser is doing what mine is (Firefox 3), you’ll see that as Hebrew dictates, the five characters are written from right to left (the spaces and “∈” are neutrals, so they take the directionality of the surrounding text, so count as right-to-left characters in this instance). What I’m amazed by is that on my display, the “∈” (U+2208) has actually been rendered flipped horizontally, so as to correctly read that “ר is a member of the set מ” even though “ר” is on the right of the operator. It’s been rendered as a “∋” (U+220B). I’m not sure if it’s specifically rendering U+220B instead of U+2208, or if it’s actually flipping the U+2208 glyph horizontally. I can’t find any mention of horizontal flipping in the bidi spec.

Can anyone explain what’s going on?

(Note: I doubt Hebrew-speaking mathematicians use Hebrew characters as variable names; it was just an example.)