Putting The ‘You’ Back Into Unicode

Original Author: Garett Bass

Today I will be sharing a public domain C++ implementation I developed for iterating, measuring, comparing, transcoding, and validating UTF-8, UTF-16, and UTF-32 strings. But first, a very brief overview of what I’m working toward, and a bit about Unicode. Then a few implementation details, and finally the obligatory download link.

Inspired by projects I’ve worked on professionally, I’ve developed a plan for a fairly general purpose gameplay data editor I’d like to build as a hobby project. Though I’ve done all of this before in C#, I’m very keen to craft this tool in C++. To some that may sound masochistic, but for all it’s dark, dusty corners, I’m still rather fond of C++. I’m looking forward to taking back the reigns of explicit memory management, and trying some out some DOD on a project where I already have a good grasp on the requirements.

UTF? More Like Double-U.T.F!

The good old ASCII character set was developed in the 1960′s based on telegraphic codes, and was only large enough to represent symbols commonly found in Roman-based languages. A single 7-bit ASCII character could represent 128 unique alphanumeric symbols, which severely limited its practical application to international text. This becomes less surprising once you discover that ASCII is short for the American Standard Code for Information Interchange (With Other Americans Only, No Doubt). I’m not sure why they didn’t just call it USCII. Of course it wasn’t long before the international community began to take advantage of that 8th underutilized bit, developing the various permutations of THEMSCII.

In the late 80′s, Joe Becker of Xerox, and Mark Davis of Apple proposed the creation of a universal character set to facilitate multilingual computing. Initially a 16-bit character system, Hamlet for reference.

U-Encode

Twenty-one bits makes for an awfully wide character, or “codepoint” in the Unicode parlance. A specific codepoint is denoted by a hexadecimal number prefixed with U+. For example, U+0058 is the codepoint for “LATIN CAPITAL LETTER X”, and U+1F46F is the codepoint for “WOMAN WITH BUNNY EARS”. Clearly the Unicode Consortium is a very serious bunch.

Fortunately, we rarely need all 21 bits. The majority of codepoints in common use are in the range from U+0000 to U+FFFF, also known as the Basic Multilingual Plane (or BMP). These fit comfortably into the 16-bit character system originally envisioned by Becker and Davis. Still, if most of your data is in the ASCII range, it seems terribly wasteful to use two bytes where you could be using only one.

In order to accomodate this broad range of values, a variety of encodings have been developed to map a particular codepoint to a sequence of bytes. The three most common of these encodings are UTF-32, UTF-16, and UTF-8, where UTF stands for Unicode Transformation Format, and the numeric designation is the bit width of the individual “code units”. Each code unit is a fragment of a complete codepoint, and each encoding specifies how a sequence of code units must be combined to represent a specific codepoint.

The simplest encoding is UTF-32, which uses a single 32-bit code unit to represent each codepoint. Due to this one-to-one mapping, UTF-32 is ideal for comparing the relative values of two codepoints. Unfortunately, this format is very space-inefficient, wasting at least 11 bits per 21-bit codepoint, and more than 16 bits in the most common cases. Since each code unit is a contiguous four-byte block, UTF-32 also complicates communication between machines of differing endianness.

The middle road is UTF-16, which uses either one or two 16-bit code units. In many cases, this encoding is ideal for runtime representation of text. In most common cases, where the majority of codepoints are within the BMP, UTF-16 is much more space efficient than UTF-32. Still, the two-byte code units of UTF-16 also introduce endianness issues if used as a data-exchange format.

According to Wikipedia, UTF-8 is “the de facto standard encoding for interchange of Unicode text”. This is likely a result of two key advantages. First, since each code unit is a single byte, UTF-8 is an endian-agnostic format. Second, the first 128 Unicode characters fit in a single UTF-8 byte, and map directly to ASCII characters in the same range. This means that any valid ASCII file is also a valid UTF-8 file, providing some legacy support. However, UTF-8 is a variable width encoding, requiring between one and four bytes to represent a unique codepoint. As a result, it is the most complex of the three formats, requiring more computation to reconstruct a full 21-bit codepoint.

Signed vs Unsigned Character Types

So, let’s take a look a some code to implement the kinds of transformations illustrated above.

Here I take a pointer to an input array of uchar32, and an output array of uchar8. I then increment each pointer as I consume enough bytes and produce enough bytes to transcode a single codepoint. All of the core algorithms are like this, performing an operation on a single codepoint, which makes it easy to write the template Transcode() function that processes an entire string.

However, the astute observer will notice that this algorithm requires unsigned character types, which could be rather inconvenient on some of our favorite platforms. Sadly, the C standard doesn’t specify whether char is a signed or unsigned type. But, with a little template magic, and the help of some good inlining, the compiler can generate all of the relevant algorithms for us, so that we can happily make use of native char and wchar_t literals.

All of the template arguments should be char8, char16, or char32, all of which I’ve defined for you with some more template magic. Particularly handy is that wchar_t maps to either char16 or char32 as needed, so native string literals should just work. If you have your own fixed-width types by the same name, you may have to do a bit more work, because I rely on my templates to help with the signed/unsigned cast.

These are all inline definitions, and I hope the names are pretty obvious, because I haven’t taken the time to comment each method. With the exception of the Validate() methods, none of the other methods attempt any validation, so garbage in, garbage out. This seemed pretty reasonable to me, but YMMV.

Testes, Testes, 1..2..3?

It wouldn’t do much good to write all this code without some reasonable test coverage. I wrote tests to exercise all of the above functions across UTF-8, UTF-16, and UTF-32 input and output for each of the following cases:

I included the test harness with the project. It relies on UnitTest++, for which I include a pre-built MSVC static library. I’ve run the tests under both Visual Studio 2010 Professional, and CodeLite v2.9.0.4684.

TODO: I really ought to test against MultiByteToWideChar()

Miscellany

I used a very lightly customized build of premake4 to create the project files, but I expect the standard build would do nearly as well. I really love premake. It has never been easier to spin up a new C++ project without having to tinker with every last setting.

It is getting late here, so I’m just going to host the code right here for the moment, but I will try to put it up on github soon.