Persian Sorting for MySQL

It seems like sorting Persian strings stored in a MySQL database is a common problem to many people who develop web-based database applications that support Persian. I’ve been asked about this problem numerous times, and I’ve decided to write an article about the issue, and present my own solution to it as well.

The problem

Speaking in plain, non-technical words, MySQL versions up to 4.0.x don’t support Unicode, so any language (Persian included) which needs to use a Unicode encoding system (such as UTF-8) is not supported. If you’re curious about more technical details, read on; otherwise, feel free to skip the next paragraph.

MySQL versions up to the 4.0.x series do not have support for Unicode strings. All of its support for multilingual data is by use of code-page based encoding systems. Unfortunately, Persian is not among the languages it supports. If Persian used a fixed-length encoding system, then it was possible to write a special collation algorithm and feed it to MySQL for proper sorting of Persian strings, but MySQL 4.0.x doesn’t support variable length encoding systems, and since in UTF-8 (the standard encoding system for Persian) the length of UTF-8 characters varies between 1 and 4 bytes, there is no way to add support for Persian collation (and other string manipulation features) in MySQL 4.0.x, unless you wouldn’t mind reading and patching a huge database source code that has no built-in support for Unicode.

Later MySQL versions

The Unicode Consortium defines a global collation algorithm which is able to sort all languages in the Unicode coding system (practically, all languages spoken on the Earth). MySQL 4.1.x and 5.x are more Unicode compliant, and support UTF-8 internally, so the problem should not happen in those versions. There was a team working on Persian support of MySQL 4.1.x and 5.x at Research Center of Informatic Industries. They once invited me to join their team, but that didn’t work out. I have not followed the results of their efforts yet, but you might find more information on www.rcii-ir.org. Anyway, Support for Persian collation has been added to MySQL since version 4.1 by Jody McIntyre, so my solution may only be needed for those who are still stuck with older MySQL versions (or work in other environment in which support for Persian sorting is not implemented).

The solution

There is a way to store and use your UTF-8 encoded data in MySQL right now. You have to convince MySQL that you’re feeding it ASCII data, and then delay all your string processing (including sorting) until you extract them from the database. That means that you’re going to need support for UTF-8 in the language you’re using to access MySQL.

Many people use the PHP language to access MySQL over the web. I personally do all my web development in PHP, and I’ve written a number of support routines in PHP for sorting the strings read from the database at the PHP side. I’ve released the code under the GPL v2 license, and you can download it here.

Using this code couldn’t be any simpler! The code assumes that you’ve read the data into an array, which has the column names as keys, and data as values. This is the format returned by the mysql_fetch_assoc( ) function in PHP. After including the db_sort.php file inside your PHP script, you should call the mysql_persian_sort( ) function, which takes two parameters. The first one is the name of the array containing the data read from the database, and the second one is a string which contains the name of the column to sort on. For example, if you have a table in MySQL named contacts, with a last_name field, and you want to sort the data fetched using mysql_fetch_assoc( ) into the $rows array according to that field, it’s enough to write the following: