Down with Unicode! Why 16 bits per character is a right pain in the ASCII

We were sold a lie. It's time to go back to 8-bit

Stob

I recently experienced a Damascene conversion and, like many such converts, I am now set on a course of indiscriminate and aggressive proselytising.

Ladies and gentlemen, place your ears in the amenable-to-bended position, and stand by to be swept along by the next great one-and-only true movement.

The beginning

In the beginning - well, not in the very beginning, obviously, because that would require a proper discussion of issues such as parity and error correction and Hamming distances; and the famous quarrel between the brothers ASCII, ISCII, VISCII and YUSCII; and how in the 1980s if you tried to send a £ sign to a strange printer that you had not previously befriended (for example, by buying it a lovely new ribbon) your chances of success were negligible; and, and...

But you are a busy and important person.

So in the beginning that began in the limited world of late MS-DOS and early Windows programming, O best beloved, there were these things called "code pages".

To the idle anglophone Windows programmer (ie: me) code pages were something horrible and fussy that one hoped to get away with ignoring. I was dimly aware that, to process strings in some of the squigglier foreign languages, it was necessary to switch code page and sometimes, blimey, use two bytes per character instead of just one. It was bad enough that They couldn't decide how many characters it took to mark the end of a line.

I emphatically wanted none of it, and I was not alone.

So we put our heads down and kept to our code page - our set of 8-bit characters - which was laid out according to the English Imperialism algorithm first identified by the renowned 1960s songster Michael Flanders.

(Flanders had noticed that there were only two kinds of postage stamps: English stamps, in sets, at the beginning of the album; and foreign stamps, all mixed up, at the end of the album. This philatelistic observation remains the key organisational idea behind ASCII-derived systems to this day. Proper letters and characters are represented by the codes 32 for space through 126 for tilde; foreign stuff, with its attendant hooks and loops and stray bits of fly dirt, appears somewhere above.)

Unicode Then

As far as I know, there isn't a creation myth associated with the unification of the world's character sets.

I like to imagine one Mr Unicode, who helped out at the United Nations building, being tasked with creating the new Fire Practice instruction card, which included the same three or four sentences rendered in 117 languages. When he tried to print it out, the Epsom Salts NoisyMatrix 800000 wrote a single smiley face at the top of the page, and switched on its Out Of Paper LED, and crashed.

That night, Mr Unicode sat down at his kitchen table with Mrs U and their two little glyphs (who were allowed to stay up until after their bed time for the purpose) and counted all the characters in all the languages in all the world. And when they had finished adding them up, it turned out there were just 60,000 of them, give or take.

Even if this account of the initial assessment is not quite right, the real life Mr U homed in on the discovery that the total number could be accommodated in two bytes. He claimed that "the idea for expanding the basis for character encoding from 8 to 16 bits is so sensible, indeed so obvious, that the mind initially recoils from it" [see Unicode 88 Section 2.1, PDF].

Mr Unicode did admit that there were actually a few more than 65,536, the 2^16 limit, but only if you included "unreasonable characters". So there.

This idea was a hit. In the early 1990s, Unicode was hailed as the cure for character set problems, and incorporated into technologies of the day.

We techies of that era were dead impressed. Nobody minded much that half the bytes in a string were zero. All those extra holes made it easier to air cool in-memory databases.
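For anyone who wants to count the holes, here is a minimal sketch in Python. The sample string is my own invention, and "utf-16-le" is picked only to keep the byte order mark out of the tally.

```python
# Encode plain English text as 16-bit code units
# (little-endian, no byte order mark).
text = "Hello, world"
encoded = text.encode("utf-16-le")

# Every ASCII character gets a zero byte as its high half.
zero_bytes = encoded.count(0)

print(len(text), "characters become", len(encoded), "bytes")
print(zero_bytes, "of those bytes are zero")
print(encoded.hex(" "))
```

Text made up entirely of ASCII comes out exactly half zeros; anything squigglier will fill in some of the gaps.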

Besides, a few extra bytes, and some hideous constructs such as Visual C++'s TCHAR, seemed like a fair price to pay in exchange for the glorious simplification that all characters were the same length. The future seemed full of cheerful, well-fed people of all creeds, colours and, most of all, tongues happily sharing each other's laptops and pointing at things and laughing, as seen in Microsoft marketing photographs.

Unicode Now

It turned out that there were uses for unreasonable characters after all. Wikipedia says that many of the omitted Chinese characters were part of personal and place names. One can imagine the sentiment of a person with a name that was the Chinese equivalent of 'Higginbottom' discovering that, using the original Unicode character set, his or her name must be transmuted to the Chinese equivalent of 'Figgingarse'.

The Figgingarses of this world - aka "the government of the People's Republic of China" - were understandably not best pleased. In 1996, Unicode High Command conceded the issue and published a revised standard to accommodate unreasonable outliers. It currently contains around 110,000 characters.

You will have noticed that this is considerably above the original two-byte limit. And, yes, it forced the dread return of variable length characters.

This is a key point. The standard whose principal benefit was that all characters were encoded to the same length lost that benefit in 1996.

It didn't even survive long enough to see the Dome.
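The loss of fixed width is easy to see for yourself. A rough sketch in Python follows. The helper name utf16_units is hypothetical, and the sample characters are my own choice; the last is a CJK character that lives outside the original 16-bit range.

```python
def utf16_units(ch: str) -> int:
    """Return how many 16-bit code units UTF-16 needs for one character."""
    return len(ch.encode("utf-16-le")) // 2

# Three characters inside the original 16-bit range, one outside it.
samples = ["A", "\u00a3", "\u4e2d", "\U00020BB7"]

for ch in samples:
    print(f"U+{ord(ch):04X} needs {utf16_units(ch)} code unit(s)")
```

The first three fit in a single 16-bit unit. The fourth needs two (a so-called surrogate pair), which is precisely the variable-length business that 16 bits were supposed to abolish.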

In Joel Spolsky's famous essay of ten years ago, he royally patronises programmers who believe that "Unicode is simply a 16-bit code where each character takes 16 bits... It is the single most common myth about Unicode, so if you thought that, don't feel bad".

Myth? Myth? Oh, do sod off, Joel.

I mean, it's not as if the original Unicode's miserably borked condition was widely advertised, is it? If it were called, say, "Unicode, the mighty sword of Babel, that was broken and has been hastily re-glued" then we would know where we were.

Terminological intermission

By the way, Joel uses the name 'UCS-2' rather than 'Unicode'. This is probably more correct, but I refuse to follow him because:

It confuses the issue with extra jargon - I am writing a rant here, not a bloody technical manual.

The whole of Microsoft doesn't bother, so I don't see why I should.

It has something of the connotation of dignifying an unpopular "poll tax" with the more reasonable-and-official-sounding "community charge", and

I believe it should be 'UTF-16' anyway, if we were really going to go down this accurate terminology route.

While we are here, note that despite your objections I am also refusing properly to introduce/define the following terms: 'UTF-1', 'UTF-7', 'grapheme cluster', 'code point', 'UCS-4' and 'ISO 10646'. As soon as one starts along this way, one becomes hopelessly entangled in tedious explanations that prevent one reaching the point. For example, the leading 'U' in all these abbreviations stands for 'Unicode', which in itself demands about three paras of dull, thumb-twiddling explanation.

But I have no problem with you looking them all up, or arguing about them in the comments, if you want to. In your own time.

The alternative

By the early 2000s a plausible alternative had taken root: UTF-8. This is a byte-oriented encoding which retains compatibility with original 7-bit ASCII, but (like post-1996 Unicode) suffers the curse of variable length characters to cope with Your Foreign. Individual characters are represented by one to four bytes.
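A short Python sketch of the one-to-four-bytes claim and the ASCII compatibility, for the sceptical. The sample characters and variable names are mine, chosen to hit each length once.

```python
# One character for each possible UTF-8 length.
samples = ["A", "\u00a3", "\u20ac", "\U00020BB7"]

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, utf8_lengths):
    print(f"U+{ord(ch):04X} takes {n} byte(s): {ch.encode('utf-8').hex(' ')}")

# Pure 7-bit ASCII text is byte-for-byte identical in both encodings.
plain = "plain old ASCII"
same_bytes = plain.encode("utf-8") == plain.encode("ascii")
print("ASCII survives unchanged:", same_bytes)
```

Proper letters stay at one byte each; Your Foreign pays two, three or four.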

However, UTF-8's squiggles are encoded using an elegant scheme devised by a proper grown-up, i.e. Ken Himself. Thompson's scheme has a "self-synchronising" feature, meaning you can discover the character boundaries at any point in a string without needing to go back to the beginning. It's not as nice to process as a string of uniform characters, but it feels like the absolute best of a bad job.
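The self-synchronising trick rests on one fact: every continuation byte in UTF-8 has the bit pattern 10xxxxxx, and no first byte does. A minimal sketch in Python, assuming the input is valid UTF-8; the function name char_start and the sample text are hypothetical, made up for illustration.

```python
def char_start(data: bytes, i: int) -> int:
    """Step back from byte offset i to the first byte of its character.

    Continuation bytes always look like 10xxxxxx, so the first byte that
    does not match that pattern starts the character. Assumes data is
    valid UTF-8 and 0 <= i < len(data).
    """
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i

# "naive", a pound sign and "cafe", with the accents written as escapes.
data = "na\u00efve \u00a35 caf\u00e9".encode("utf-8")

# Offset 3 lands in the middle of the two-byte i-diaeresis.
start = char_start(data, 3)
print("offset 3 belongs to the character starting at byte", start)
```

Because a character is at most four bytes long, the loop never steps back more than three times on valid input, however long the string.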

I had been aware of UTF-8 for many years, and had clocked that it was the favoured system among the GNUdal tendency. Sure, Linux used UTF-8 rather than submit to the horror of 16-bit characters; but I supposed this was due to Linux users who preferred to code in C over impertinent upstart C++, and who regarded GUIs in general as a barely satisfactory system for marshalling their half dozen terminal sessions.

The manifesto

I do urge you to read the manifesto for yourself, but in brutal summary it argues cogently what I have argued frivolously:

That 16-bit Unicode is hopelessly broken.

That UTF-8 is intrinsically superior for everything except very specialist tasks.

That, where possible, all new code should avoid using the former and prefer the latter.

Obviously that last point is going to be a bit tricky.

For Windows C++ programmers, the manifesto identifies specific techniques to make one's core code UTF-8 based, including a proto-Boost library designed for the purpose. (Ironically, the first thing you have to do is turn the Unicode switch in the Visual C++ compiler to 'on'.)

For users of other tools, it is an invitation to review your position. For example, my fellow Delphi users should notice that Embarcadero has dropped support for the UTF8String type from its fancy new LLVM-based compilers. Hum.

As the manifesto says, "UTF-16 [...] exists for historical reasons, adds a lot of confusion and will hopefully die out".

Amen to that. Next weekend I will be scraping all my Unicode files off my hard disk, taking them to the bottom of the garden, and burning them. As good citizens of the digital world, I urge you all to do the same.