Blog

Apr 16, 2008

Success with MySQL and non-Western Character Sets

Languages are tricky, quickly evolving systems, but almost all of them in their current form predate computers. Indeed, most predate typesetting. So there are a ton of writing systems out there that are designed for writing with a trained human hand, and not optimised for discrete mechanical reproduction. Those languages persist, and computers are finally getting powerful enough to reproduce them elegantly. So how do we deal with them?

Once you’ve done that, you’ll find that regular web applications need a little more to work properly. For instance, what if you want to use full-text indexing on a script that isn’t derived from Latin?

Fortunately, MySQL as of version 4.1.1 makes that a lot easier…but there are some tricks.

You’ll need to learn about collations in MySQL. It is entirely possible in PHP to set everything you’re doing to UTF-8 and have a site that can accept and reproduce lots of different scripts, but the database is storing them as something else. However, you give up a lot. First, your database dumps will be unrecoverable gibberish. Second, functions that depend on MySQL understanding how the underlying data behaves, such as full text indexes, will fail.

MySQL out of the box to have latin1_swedish_ci be the default collation, which I find an extremely odd choice. OK, so they’re fighting Amerocentrism, but they’re not exactly promoting international standardization. So unless you specify that you want the database storing information as utf8_unicode_ci (which I’ve found to be the most hassle-free for the widest range of character sets), you’ll need to specify that your database and all its tables and rows use the utf8_unicode_ci collation, which will also cause them to store everything as UTF-8.

But wait–you’re not done. You’ll also need to make sure your connection defaults to UTF-8. That’s right, you have to have a clean UTF-8 path through your whole application. Just one place where UTF-8 isn’t respected will turn everything to unreadable gibberish or a string of question marks. You can have your application issue 'SET NAMES utf8' for every request. The best option is to make sure you have an environment where you can control the MySQL configuration until they come up with some better defaults.

Once you do this, and providing you do all the things you need to do in HTML and PHP (or the scripting language of your choice) to make them UTF-8-clean as well, you should be able to do all the things you’re used to in other languages.