About MySQL, UTF-8 and saving headaches

One of the most confusing topics ever in web development history is character sets. MySQL has a lot of features to help you with this, but when things go wrong, it can be a real pain to get it solved. Read this and fear no more.

A little background

Most default installations of MySQL (still) default to the latin1 character set. This character set is also known as ISO-8859-1 and is sufficient for most Western languages. It is a superset of the original ASCII table, and is akin to Windows-1252, which is used as the default encoding on Microsoft Windows systems. Typically, content produced on Windows machines (e.g. in Microsoft Office) will use that encoding rather than ISO-8859-1.

All these character sets are based on the principle of code pages: you have (in this case) exactly one byte per character, and each byte value maps to a certain character (or “glyph”). The number of characters available in such a character set is therefore limited by the number of bytes used per character: one-byte code pages offer no more than 256 characters.

The UTF-8 character set, however, provides an extension system in which certain lead bytes define how many bytes (or “octets”) are used for the character in question. The first set of characters, in the range 0–127 (compatible with the ASCII table), take up one byte of space; the next set, mostly characters used in Western languages, take 2 bytes; and the least frequently used characters take up to 4 bytes (the original specification allowed up to 6). Being backward compatible with ASCII, while able to represent nearly all languages, makes UTF-8 the most practical and popular character set on the web.

What does MySQL do with character sets?

MySQL has several features when it comes to handling different character sets. To understand how MySQL handles them, you need to be aware of two types of settings.

1. The character set of a column

The character set of a column defines how MySQL should interpret string operations on that column. For example, counting the number of characters in a string using CHAR_LENGTH() needs to know what character set the content is in, because UTF-8 can have multiple bytes per character:
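For instance, here is a sketch of the kind of query the article refers to; the table and column names are assumptions for illustration:

```sql
-- Hypothetical table with a utf8 column; 'á' is stored as the two bytes C3 A1.
CREATE TABLE utf8_demo (name VARCHAR(10) CHARACTER SET utf8);
INSERT INTO utf8_demo (name) VALUES ('á');

-- LENGTH() counts bytes, CHAR_LENGTH() counts characters:
SELECT LENGTH(name), CHAR_LENGTH(name) FROM utf8_demo;
-- On a correctly populated utf8 column this yields 2 and 1 respectively.
```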

2. The character set of the current connection

The connection character set tells the server which character set the client is “speaking”, since MySQL needs to know what character set the client’s strings are in. A good example, again, is the CHAR_LENGTH() function, which can be used to determine the length of a client-supplied value too:
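A sketch of that, assuming your terminal actually sends UTF-8 bytes: the same literal measures differently depending on the declared connection character set, because the server interprets the client’s bytes accordingly.

```sql
SET NAMES latin1;
SELECT CHAR_LENGTH('á');  -- the client sends the bytes C3 A1;
                          -- interpreted as latin1, that is 2 characters

SET NAMES utf8;
SELECT CHAR_LENGTH('á');  -- the same two bytes now count as 1 utf8 character
```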

Silent encoding conversion

If there is a discrepancy between the connection character set and the column character set, MySQL silently converts the content from one encoding to the other. Say you’re connected to a server over a latin1 connection, while the column specification states that the content in the table is utf8: MySQL will convert any write to that column from latin1 to utf8, and any read vice versa.

However, when you’re actually writing UTF-8 characters to the database over a latin1 connection, each byte of a multi-byte UTF-8 sequence is interpreted as a separate latin1 character. For the first 128 code points (0–127) this poses no problem, because they are ASCII in both latin1 and utf8. But every byte of a multi-byte UTF-8 character is treated as its own latin1 character and converted to UTF-8 individually when writing to the table, so the content that ends up in the storage engine is wrong.

Example 1: UTF-8 data, latin1 connection and latin1 column

Since latin1 is the default character set for most MySQL installations, this is probably the most common situation.

Consider the string ‘á’. This is a lowercase ‘a’ with an acute accent. In UTF-8, the character is represented as two bytes, hexadecimally denoted as C3A1.
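The original code sample is missing here; the following is a reconstruction of the kind of setup the example uses (table and column names are assumptions):

```sql
-- A latin1 column of 10 characters (which, for latin1, also means 10 bytes):
CREATE TABLE test (name CHAR(10)) CHARACTER SET latin1;

-- Over a latin1 connection, the client sends the UTF-8 bytes C3 A1 for 'á':
INSERT INTO test (name) VALUES ('á');
SELECT name, HEX(name) FROM test;  -- stored verbatim as C3A1
```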

As we insert UTF-8 data into this table, MySQL does nothing special: it treats the two bytes sent as two latin1 characters, performs no conversion, and stores them as-is. When reading, again nothing special happens: the application gets the two bytes back and treats them as one two-byte UTF-8 character, so no problem is visible.

However, to demonstrate that MySQL does not actually know you wished to store only one character, try inserting a string of 10 ‘á’ characters into the table. That renders a warning, because only 5 will fit. The problem becomes even more apparent when we change the column to an odd length of 5:
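A sketch of those statements, continuing with the hypothetical table from above:

```sql
-- 10 'á' characters are 20 bytes in UTF-8; only 10 bytes fit in CHAR(10):
INSERT INTO test (name) VALUES ('áááááááááá');
SHOW WARNINGS;  -- Data truncated for column 'name'

-- Now shrink the column to an odd length of 5 bytes:
ALTER TABLE test MODIFY name CHAR(5) CHARACTER SET latin1;
INSERT INTO test (name) VALUES ('ááá');  -- 6 bytes, truncated to 5
SELECT name FROM test;  -- the third character comes back mangled, e.g. 'áá?'
```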

The first 4 bytes are consumed by the first two characters (two bytes each), and the last byte is filled with the first byte of the third character, which MySQL truncates. This leaves an invalid UTF-8 sequence, hence a ‘?’ displayed as the third character. The following demonstrates that:
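Inspecting the stored bytes (a sketch, using the same hypothetical table) shows the truncated, invalid sequence:

```sql
SELECT name, HEX(name) FROM test;
-- C3A1C3A1C3: two complete 'á' sequences plus a dangling C3 lead byte,
-- which is not valid UTF-8 on its own.
```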

Example 2: UTF-8 data, latin1 connection and utf8 column

The string inserted into the table is sent as the two bytes C3 and A1 which, interpreted as latin1, are the characters ‘Ã’¹ and ‘¡’ (an inverted exclamation mark). MySQL interprets these characters separately and converts them into their UTF-8 counterparts, C383 and C2A1 respectively, taking up a total of 4 bytes of space.
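The statements this example refers to are missing; a reconstruction (table and column names are assumptions):

```sql
-- A utf8 column, written to over a latin1 connection:
CREATE TABLE test2 (name CHAR(10)) CHARACTER SET utf8;
SET NAMES latin1;
INSERT INTO test2 (name) VALUES ('á');  -- the client actually sends the bytes C3 A1
```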

The binary equivalent, which shows how MySQL will interpret this query, is as follows:
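A sketch of that binary equivalent, for an insert of the bytes C3 A1 over a latin1 connection:

```sql
INSERT INTO test2 (name) VALUES (_latin1 x'C3A1');
```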

The expression _latin1 x'C3A1' means: “interpret the hexadecimal string ‘C3A1’ as latin1”. Had the connection been utf8, it would have been interpreted as UTF-8.

Now, when you read the data back, MySQL sees two utf8 characters in the table (the four bytes mentioned above) that need to be converted to latin1, since your client connection is set to latin1. The conversion described above is inverted: the first character is converted to the latin1 ‘Ã’ and the second to ‘¡’. As long as no one interferes, these two bytes are sent to your application, which in turn sends them to the browser with a charset=utf-8 content type, and you’ll never notice the problem.

However, the problem does become horribly apparent as soon as you ask MySQL to tell you the length of the value inserted:
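A sketch of that check, using the hypothetical table from Example 2:

```sql
SELECT LENGTH(name), CHAR_LENGTH(name) FROM test2;
-- LENGTH() reports 4 bytes (and CHAR_LENGTH() 2 characters) instead of
-- the expected 2 bytes / 1 character.
```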

Here you can see that MySQL actually stored 4 bytes of data in the table: the string in the table is not 2 bytes long, as you might expect, but 4. Again, displaying the HEX value of the stored data demonstrates this:
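For example:

```sql
SELECT HEX(name) FROM test2;
-- C383C2A1: the UTF-8 encodings of 'Ã' and '¡', not of 'á'.
```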

Setting the right connection encoding

As you might have gathered from the examples above, the culprit isn’t MySQL. It is your application telling MySQL that it’s about to send ISO-8859-1 encoded data while actually sending UTF-8. You need to tell MySQL that the data you’re sending is already UTF-8, so that no conversion takes place when inserting it into the table.

Now MySQL will interpret all characters sent over the connection as UTF-8, and no conversion overhead is incurred internally. You can also be confident that any data originally stored as latin1 will be converted to utf8, which is the character set your application expects. So the decision is not about what you’re planning to store; it’s about what character set the application wants to send to and receive from the connection.

The other character set configurations

There are more places where you can tell MySQL something about character sets. However, they all serve as defaults; you don’t really need them, but it’s good to know they’re there.

Server level: Defines the default character set for newly created databases

Database level: Defines the default character set for newly created tables

Table level: Defines the default character set for columns added to that table

As long as you keep in mind that they provide defaults, ultimately for determining column character sets, there’s really nothing more to know.
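A sketch of how those defaults cascade (names are hypothetical); each level only matters at creation time:

```sql
CREATE DATABASE app DEFAULT CHARACTER SET utf8;            -- default for new tables
CREATE TABLE app.t (name VARCHAR(10)) CHARACTER SET utf8;  -- default for new columns

-- A column specification can still override everything above it:
ALTER TABLE app.t MODIFY name VARCHAR(10) CHARACTER SET latin1;
```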

Figuring out which character set is defined for a column

Typical MySQL clients provide options to show what encoding is used for columns. Whenever you want to ask MySQL directly, you can use the following query:
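The query in question is presumably SHOW FULL COLUMNS, whose output includes a Collation column (the table name here is the hypothetical one from Example 1):

```sql
SHOW FULL COLUMNS FROM test;
```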

The third column displayed is the collation that is used. The first part of the collation name always indicates the character set: latin1_swedish_ci is a collation of the latin1 character set, and likewise utf8_general_ci belongs to the utf8 character set.

Converting UTF-8 data in a latin1 column to utf8

Whenever you alter a column specification from one charset to another, MySQL assumes that the data in that column was encoded correctly, and will therefore convert the original data to the new encoding. This makes sense, because you shouldn’t have been inserting UTF-8 data into a latin1 column in the first place 😉

With the following little recipe you can convert single columns: first alter the column and change its definition to a binary intermediate, then alter it to the desired character set. This applies to the situation in Example 1 above.
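A sketch of that recipe, using the hypothetical table from Example 1 (the binary intermediate is sized generously to avoid truncation):

```sql
-- Step 1: reinterpret the column as raw bytes; no conversion happens:
ALTER TABLE test MODIFY name VARBINARY(20);

-- Step 2: declare those bytes to be utf8; again, no conversion happens:
ALTER TABLE test MODIFY name CHAR(10) CHARACTER SET utf8;
```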

Use BINARY for CHAR, BLOB for TEXT (with the MEDIUM and LONG variants where necessary) and VARBINARY for VARCHAR fields. Remember to include the full column specification (including NULL, DEFAULT and the like); it won’t be remembered otherwise. Also, be careful not to truncate values by assuming that VARCHAR(10) is the same as VARBINARY(10): when converting multi-byte values to binary you probably need more bytes to store all characters. In other words, VARCHAR(10) means at most 10 characters, while VARBINARY(10) means at most 10 bytes. When unsure, use BLOB instead.

Also, be aware of keys: whenever you have indexes on (VAR)CHAR fields, you need to drop them first when using a BLOB intermediate format.

Converting doubly encoded data

In Example 2 above, the data is in fact encoded twice. To fix it, you first need to convert the UTF-8 characters back to latin1. That seems like an unnatural solution, but it is the only right one: you had been telling MySQL you were speaking latin1 while it converted everything to UTF-8, so reverting that process means converting from UTF-8 back to latin1, which recovers the original bytes. This can be accomplished by the following query.
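A sketch of that query, using the hypothetical table from Example 2: converting the stored utf8 back to latin1 recovers the original bytes, which are then reinterpreted as utf8.

```sql
UPDATE test2
SET name = CONVERT(BINARY CONVERT(name USING latin1) USING utf8);
```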

Whenever a headache does come up, rethink the problem in binary form. You’ll get a better understanding of the issue and the way to solve it.

Happy unicoding 🙂

I am pretty sure that ‘Ã’ looks familiar to you, and so probably does ‘Â’, which is C2 in ISO-8859-1. These two characters are the number one indication that UTF-8 content is being displayed as ISO-8859-1, as they are both lead bytes of two-byte UTF-8 characters. ↩

5 Comments

Great article, the basics of good Unicode handling are often skipped because of the ASCII backwards compatibility or just a plain lack of knowledge of character encodings. I have three additions:

UTF-8 characters in MySQL are stored in at most 3 bytes. This makes it incompatible with the ‘normal’ UTF-8 specification, which goes up to 4 bytes, and most importantly: you can’t save Unicode code points that require 4 bytes in UTF-8. Chances are, however, that you won’t use those characters at all.

MySQL uses 3 bytes per character for indexes on UTF-8 columns. With InnoDB, MySQL supports a maximum length of 767 bytes per index, which comes down to 767 / 3 = at most 255 characters. With a latin1 column you could go up to 767 characters. This is tricky, especially when converting columns from one character set to another.

Because of UTF-8’s backwards compatibility, there’s a neat trick with str_replace() in PHP: as long as you’re not searching for or replacing anything outside the original ASCII characters, it will work on your UTF-8 strings. This is because all UTF-8 characters that use 2 bytes or more are outside the ASCII range by design. From PHP 6 on, Unicode support is built in and you can use any character to search and replace.

The best title for this post would be “demystifying encoding points in the application-to-database path”.

Very good.
I was searching for a post that explains exactly what you show here: where the conversions happen in the communication between the client and the database.

I know that this is not the default for all database products, but it gives me a better view of the process.
It also helped me with MySQL. I was using three different clients: the mysql command line, Workbench and a JDBC application.

All three were inserting and reading data incorrectly.
It was driving me crazy.

The problem with the mysql command line (in the Windows command prompt) is that the code page it uses is 850 (the default for MS-DOS). MySQL reads the utf8 column data and, thinking the connection expects latin1 characters, converts it to latin1. But the client expects code page 850 characters, so I see incorrect characters.
Looking at SHOW CHARACTER SET, I found the name ‘cp850’ and from its description concluded I should change the connection with
‘–default-character-set=cp850’. This works!!!!!

Workbench was in latin1, and I changed it to utf8 with SET NAMES.
The JDBC application was correct, but because the other two clients were inserting with the wrong charset, it read the data incorrectly.

I also have charset issues with Apache, Linux and SQL Server, and I think that understanding this process may help me find where the translation problem occurs.

Sorry for my bad English, I don’t use a translator so as to learn to write in the language!
Thanks for sharing your knowledge.

Thank you very much for this very helpful explanation of the app–database dialog.
I’ve come across several partial explanations before, but yours has been the only crystal-clear one, explaining the ‘role’ of all the ‘characters’ involved in this multipart ‘story’.
Thank you very much.

Thank you very much for writing this article. I know I’m years late on this, but I’m trying to fix a site that suffers from doubly encoded data, like in Example 2, and your article has been very helpful.

Would it be smart to finish off by converting the table back to utf8, in case the table is expanded in the future? Does this affect something else that I haven’t thought of?