MySQL: Converting an incorrect latin1 column to utf8

I recently stumbled across a major character encoding issue on one of the websites I run. Through resolving the issue, I learned a lot about the complexities of supporting international character sets in a LAMP (Linux, Apache, MySQL, PHP) environment. Fixing the problem was a challenge, so I wanted to share some of the knowledge I gained in case anyone else finds similar issues on their own websites.

The post below is a long yet detailed account of my experience. Let me know if you’ve had similar experiences or found another solution for this type of issue.

Oh My Münchhausen!!!

A couple of days ago I was notified by a visitor of one of my websites that searching for a term with a non-ASCII character in it (in this case, “Münchhausen”) was returning over 500 results, though none of the results actually matched the given search term.

I started looking into the issue, and saw the same thing he was. The debug logs from the search page showed the following SQL query being used:

The exact same query, run instead from the command line, returned 0 rows.

Assuming this had something to do with the “ü” character, I started a long journey of re-learning what characterencodings are all about, including what UTF-8, latin1 and Unicode are, and how they are used in MySQL.

What’s Going On?

latin1, AKA ISO 8859-1 is the default character set in MySQL 5.0. latin1 is a 8-bit-single-byte character encoding, as opposed to UTF-8 which is a 8-bit-multi-byte character encoding. latin1 can represent most of the characters in the English and European alphabets with just a single byte (up to 256 characters at a time). UTF-8, on the other hand, can represent every character in the Unicode character set (over 109,000 currently) and is the best way to communicate on the Internet if you need to store or display any of the world’s various characters.

You can specify a default character set per MySQL server, database, or table. The defaults for a database will get applied to new tables, and the defaults for a table will get applied to new columns. You can change the defaults at any time (ALTER TABLE, ALTER DATABASE), but they will only get applied to new tables and columns.

You can also specify the character set you’re using for client connections (via the command line, or through an API like PHP’s mysql functions).

Getting back to the MünchhausenProblem, one of the things I initially checked was what character set PHP was talking to MySQL with:

Knowing the character “ü” is represented differently in latin1 versus UTF-8 (see below), and taking a wild stab in the dark, I tried to force my PHP application to use UTF-8 when talking to the database to see if this would fix the issue:

mysql_set_charset('utf8');

Voila! The problem was fixed! Searching for “Münchhausen” on the site returned 0 results ( the correct number of matches).

The Problem With “Ã¸” Is That It’s Not “ø”

Or was it? A couple minutes later, I was browsing the site and started coming across funky characters everywhere. For example, a page that previously had the text “Graffiti by Dolk and Pøbel” was now reading “Graffiti by Dolk and PÃ¸bel”. These strange character sequences also looked like an issue I had noticed from time to time in phpMyAdmin with edit fields showing strange characters. As long as I didn’t edit the strange characters, they displayed correctly when PHP spit them back out as HTML, so I hadn’t though much of it until now.

Seeing these strange characters sequences everywhere scared me enough to look into the problem a bit more. I disabled the call to mysql_set_charset() and the site reverted to the previous “correct” behavior of talking to the server via latin1 and displaying “Graffiti by Dolk and Pøbel”.

Back to square one.

It’s All Bits From MySQL’s Point of View

Since the term “Münchhausen” was returning inappropriate results, I tried other search terms that contained non-ASCII characters. For example, I searched for the city “São Paulo”:

This search worked a bit better — it found rows with cities of both Sao Paulo and São Paulo. However, it returned the character sequence “Ã£” for São Paulo for some reason.

The reason for this is, from MySQL’s point of view, the data stored within its tables are all just bits. The column type and character set of a column determine how queries work against the data and how the data is returned as a result of a SELECT query.

It’s probably pretty obvious by now that my city column wasn’t the right character set. It was set to latin1 when the database was created.

The interesting thing is that my web application, which uses PHP, didn’t seem to mind this very much. Storing and retrieving from the city column is binary-safe — that is, MySQL doesn’t modify the data PHP sends it via the mysql extension. Later, MySQL will give PHP the exact same data (bits) back. The problems only occur when you ask MySQL to, on its own, analyze the column or present it.

So all this time, my PHP web application had been storing UTF-8-encoded data in the city column, and later retrieving the exact same (binary) data which it display on the website. MySQL doesn’t modify the data for simple UPDATEs and SELECTs, so the UTF-8 characters were all still displayed properly on the website.

To fix the above SQL query, we can actually force MySQL to re-interpret the data as a specific character encoding by first converting the data to a BINARY type then casting that as UTF-8. If you try to simply CONVERT USING utf8, MySQL will helpfully convert your garbage-latin1 characters to garbage-utf8 characters. The two-step process of temporarily converting to BINARY ensures that MySQL doesn’t try to re-interpret the column in the other character encoding. If we don’t convert to BINARY, MySQL would end up displaying the same Ã£ characters even in UTF-8 output.

So we CAST to BINARY temporarily first, then CONVERT this USING UTF-8:

UTF-8 encoding turns our ã, represented as 0xE3 in latin1, into two bytes, 0xC3A3 in UTF-8.

Note that these two bytes 0xC3 and 0xA3 in UTF-8 happen to look like this in latin1:

0xC3 = Ã in latin1

0xA3 = £ in latin1

So the UTF-8 encoding of ã explains precisely why we see it reinterpreted as Ã£ in latin1.

What To Do, What To Do

At this point, it’s obvious that I messed up somewhere. More precisely, the city column should be UTF-8, since PHP has always been putting UTF-8 data in it. The big reason I hadn’t noticed an issue up to this point is that while the MySQL column is latin1, my PHP app was getting this data and calling htmlentities to convert the UTF-8 characters to HTML codes before displaying them. My website’s visitors saw proper UTF-8 characters on the website even though the MySQL column was latin1.

As we’ve seen, issues start occurring when you do queries against the data. The SELECT above was using a UTF-8 character “ü” for “Münchhausen”, and when comparing this to “latin1” data in the column, MySQL gets confused (can you blame it?). Additional issues can appear with applications that display the “natural” encoding of the column (such as phpMyAdmin): they show the strange character sequences as seen above, instead of UTF-8 decoded characters.

The core of the problem is that the MySQL database was created several years ago and the default collation at the time was latin1_swedish_ci. Over the years, I changed the default to utf8_general_ci for new columns, but existing tables and columns weren’t changed. I have over 100 tables in latin1 that should be UTF-8 and need to be converted.

So I started investigating what it takes to convert my existing latin1 tables to UTF-8 as appropriate. Once I set the character encoding properly, queries against the database should work better and I shouldn’t have to worry about these types of issues in the future.

There are a couple ways to make the conversion.

Some people have successfully exported their data to latin1, converted the resulting file to UTF-8 via iconv or a similar utility, updated their column definitions, then re-imported that data. Unfortunately this requires taking the database down as tables are dropped and re-created, and this can be a bit time-consuming.

I was hoping for a process that I could apply to an online database, and luckily I found some good notes by Paul Kortman and fabio, so I combined some of their ideas and automated the process for my site. I hit a couple issues along the way, so I wanted to share the steps that worked for me.

The Process

Using the method described on fabio’s blog, we can convert latin1 columns that have UTF-8 characters into proper UTF-8 columns by doing the following steps:

Convert the column back to the original type and set the character set to UTF-8 at the same time (ALTER TABLE MyTable MODIFY MyColumn TEXT CHARACTER SET utf8 COLLATE utf8_general_ci)

This is a similar approach to our SELECT CONVERT(CAST(city as BINARY) USING utf8) trick above, where we basically hide the column’s actual data from MySQL by masking it as BINARY temporarily.

If you simply force the column to UTF-8 without the BINARY conversion, MySQL does a data-changing conversion of your “latin1” characters into “UTF-8” and you end up with improperly converted data. Converting the column to BINARY first forces MySQL to not realize the data was in UTF-8 in the first place.

Additionally, the MODIFYs to BINARY and back need to retain the entire column definition. For TEXT types, a simple TEXT to BLOB conversion is sufficient. But for column definitions that have specified lengths, defaults or NOT NULL:

MyColumn VARCHAR(255) DEFAULT '' NOT NULL

We need to MODIFY keeping the same attributes, or the column definition will be fundamentally changed (see notes in ALTER TABLE). In this case, we would specify:

If we don’t specify the length, default and NOT NULL, the columns aren’t the same as before the conversion.

The Script

The script at the bottom of this post automates the conversion of any UTF-8 data stored in latin1 columns to proper UTF-8 columns.

I modified fabio’s script to automate the conversion for all of the latin1 columns for whatever database you configure it to look at. It converts the columns first to the proper BINARY cousin, then to utf8_general_ci, while retaining the column lengths, defaults and NULL attributes.

Warning: This script assumes you know you have UTF-8 characters in a latin1 column. Please test your changes before blindly running the script!

Here are the steps you should take to use the script:

Step 1. Determine Which Columns Need Updating

If you’re like me, you may have a mixture of latin1 and UTF-8 columns in your databases. Not all of the columns in my database needed to be updated from latin1 to UTF-8. For example, some of the tables belonged to other PHP apps on the server, and I only wanted to update the columns that I knew had to be fixed. The script will currently convert all of the tables for the specified database – you could modify the script to change specific tables or columns if you need.

Additionally, the script will only update appropriate text-based columns. Character sets are only appropriate for some types of data: CHAR, VARCHAR, TINYTEXT, TEXT, MEDIUMTEXT and LONGTEXT. Other column types such as numeric (INT) and BLOBs do not have a “character set”.

You can see what character sets your columns are using via the MySQL Administration tool, phpMyAdmin, or even using a SQL query against the information_schema:

mysql> SELECT * FROM COLUMNS WHERE TABLE_SCHEMA = "MyTable" AND COLLATION_NAME LIKE "latin1%";
...
115 rows in set (0.03 sec)

Step 2. Test Convert the Columns

You should test all of the changes before committing them to your database.

The first thing to test is that the SQL generated from the conversion script is correct. To do this, you can dump the structure of your database:

There are TODOs listed in the script where you should make these changes.

Step 3.2. Invalid UTF-8 data

Since my database was over 5 years old, it had acquired some cruft over time. I’m not sure exactly how this happened, but some of the columns had data that are not valid UTF-8 encodings, though they were valid latin1 characters. I believe this occurred before I hardened my PHP application to reject non-UTF-8 data, but I’m not sure. I found this out when initially trying to do the conversion:

At some point, a character sequence that contained invalid UTF-8 characters was entered into the database, and now MySQL refuses to call the column VARCHAR (as UTF-8) because it has these invalid character sequences.

I checked the HTML representation of this column in my PHP website, and sure enough, the garbage shows up there too:

… �? flown …

The � is the actual character that your browser shows. Not the best user experience, and definitely not the correct character.

I fixed that single row (via phpMyAdmin), and ran the ALTER TABLE MODIFY command again — same issue, another row. Looks like there is more than a single corrupt row.

I found a good way of rooting out all of the columns that will cause the conversion to fail. If you SELECT CONVERT (MyColumn USING utf8) as a new column, any NULL columns returned are columns that would cause the ALTER TABLE to fail.

This showed me the specific rows that contained invalid UTF-8, so I hand-edited to fix them. You could manually NULL them out using an UPDATE if you’re not afraid of losing data. I had to do this for 6 columns out of the 115 columns that were converted. Only 30 rows in total were corrupt.

Step 4. Profit?

At this point, it may take some guts for you to hit the go button on your live database.

Personally, I ran the script against a test (empty) database, then a copy of my live data, then a staging server before finally executing it on the live data.

Warning: Please be careful when using the script and test, test, test before committing to it!

I’m not quite getting this to work. I have a InnoDB table which uses utf8_swedish_ci as collation. But I still get the ?-mark when presenting the data on my website. In phpMyAdmin the characters show fine.

I use AJAX to retrieve data from the table in realtime, so I’ve made sure the headers of the retrieved file are using UTF8, but it doesn’t seem to help. I tried your ALTER TABLE-fix, but no change.

The data I filled the table with came from a file, but also that was encoded in UTF8.

Hi, very interesting article and thanks for explaining everything, from the look of it i thought i might have finally found the solution to my problem… but as it looks like i have different problem even if the description is exactly the same… in the end running the convert query i get the exact same result i get when selecting the original data… if i run it using a putty connection, if i run the conosle on my laptop, ssh to the server, and run the query i get the correct italian lettters i’m trying to put in the DB… (à é and so on) in BOTH columns… O_o

I have also
SELECT MyID, MyColumn, CONVERT(MyColumn USING utf8)
FROM MyTable
WHERE CONVERT(MyColumn USING utf8) IS NULL
also returns 0 results. But if I try insert values from MyColumn to other utf8 Table/Column it returns ERROR 1366: Incorrect string value

used your script to convert a typo3 database from 4.2 to 4.7 where character sets seem to have changed, as i had many garbled chars after the update. i just ran it on the live-db after i made a backup and it worked like a charm. all garbled chars are now gone, and i did not even have to change any part of the script. THANKS! this really saved me a lot of time.

Useful script! For the conversion from BINARY back to CHAR, I think the ALTER TABLE command will actually pad extra 0x00 bytes at the end. So if you have an empty string in the column, after converting the column back to CHAR type, it’ll actually inflate your column.

For that case, you may want to do something like this after the ALTER TABLE command: