Database, character set and collation

Upgrading of the core itself is not too difficult as long as the data is in good condition. There are quite a few situations from a distant past that can cause problems with current upgrades.

History

In really old installations the database was in the MySQL default Latin1 character set (this is close to the ISO-8859-1 set that was common for HTML documents back then). There were no settings for a character set to be used in the backend. This was no problem as long as all the editors used the same settings and especially if they all used Western European computer with their default settings.

Forcing the backend character set

The situation changed a bit and multi-language sites were made by editors from around the world, each with their own browser settings. Content was put in the database with different character sets and there was no way to figure out how to convert it in the frontend output. This was solved by forcing the character set in the backend to a generally usable one, mostly UTF-8.

In many cases the database was left alone. For Western European languages nobody noticed the problem because UTF-8 characters for these languages are almost the same as the character from the Latin1 set. MySQL takes care of the conversion as long as there are no characters used which are not part of the Latin1 set.

The database character set

To make it easier for the entire process around the time of TYPO3 4.5 it was advised to have all database tables (and columns) in UTF-8 and to use UTF-8 also for the connection with the database. Combined with a complete UTF-8 backend this meant that no conversion was needed and that most languages of the world could be stored.

In later releases of TYPO3 these settings were made mandatory. It's now important to have a healthy database which used UTF-8 everywhere.

Collations. Uh??

For databases it's not enough to know the character set of a field. Because it needs to compare and sort data it also needs to know if an “ä” has its own place in the 'alphabet', or if it is the same as an “a” (for sorting and comparing only). This information is stored in a 'collation'. You can set one for the entire database (new tables and columns will use this if no collation is explicitly set), for a table (new columns will use this if no collation is explicitly set) and for a column.

For each character set the database has a default collation. If you don't specify anything MySQL will use its default character set (Latin1) and collation (latin1_swedish_ci). These defaults still show the origin of MySQL: MySQL AB (the original creators of MySQL) was a Swedish company.

Which collation should we use?

For UTF-8 the default collation is 'utf8_general_ci'. This is a general purpose collation which is also case insensitive (the '_ci' part at the end). Some people prefer 'utf8_unicode_ci' which is a bit better for some languages, but it seems to be a bit slower (more complex rules).

TYPO3 currently does not handle collations internally at all. Different sorting rules for different languages are not taken into account. Collation handling is completely left to the database. Recently some situations were found where the collation of a column could lead to problems. Without going into the technical details, the current advice is to use 'utf8_general_ci' for the entire database, all tables and columns.

Upgrade to TYPO3 6.2 and the database

Many recent upgrades were done between TYPO3 4.5 LTS and TYPO3 6.2 LTS. Because the releases between these two changed the handling of character sets in the backend and the database connection it's important to have a completely UTF-8 database.

There is a wiki page about this process [1] and the smoothmigration extension [2] also tries to handle it. In various installations one can find really weird situations which require more complex approaches.

A common issue is that UTF-8 encoded text is stored in a Latin1 database column but that it was somehow never converted to Latin1. If you use database tools like PhpMyAdmin, Adminer, etcetera such texts will show up as “Ali GÃ¶kgÃ¶z and GÃ¼ltekin Tarcan” instead of “Ali Gökgöz and Gültekin Tarcan”. This can be fixed with a little effort.

The fix

Years ago when it was advised to use UTF-8 and I encountered databases with the double encoding problem I made a script fix exactly that. It needs a little adjustment to make it do its job:

make a backup of your database; really!

make sure you have a backup of your database! We're going to make a lot of changes to your data and it might go wrong somehow. If you don't have a backup the data is gone for real.

empty tables that are not important to keep; these include the cache tables and the tables of indexed search (except of course the one that contains the indexed search configurations!). The script will take less time to complete.

download the zip [3], extract the PHP script and store it in a subdirectory of the TYPO3 installation (for example fileadmin/)

in the beginning of the script (around line 19) there is a constant “SIMULATE” which is set to TRUE. This prevents the script from making changes to the database; it's like a dry-run option. If you set this to FALSE the script actually changes the database

The “one script to fix it all” variation

A later version of the script I describe above includes detection of the double encoding problem and a nice graphical interface. This tool is only compatible with TYPO3 4.x.

make a backup of your database; really!

make sure you have a backup of your database! We're going to make a lot of changes to your data and it might go wrong somehow. If you don't have a backup the data is gone for real.

empty tables that are not important to keep; these include the cache tables and the tables of indexed search (except of course the one that contains the indexed search configurations!). The script will take less time to complete.

download the zip [4], extract the PHP script and store it in the typo3/install/ folder

make sure there is an ENABLE_INSTALL_TOOL file in the typo3conf folder (just like you do for activating the Install Tool)