0000598: adodb-mysql does not set default character set for db connection, this can lead in unusable database content

Description

Hi,
adodb-mysql.inc.php functions _connect() and _pconnect() do not set default character set for database connections nor does web2project code elsewhere.

This results in wrong encoding of non-ascii strings stored in database depending on host configuration - to be specific if system wide mysql default_character_set differs from utf8, i.e. debian's default value is 'latin1'.

Wrong character encoding has at least following critical consequences:

* Sorting of strings with non-ascii charactes does probably not work correctly.
* Strings will be totaly messed up if mysql's default_character_set will be changed or database will be migrated to host with different configuration.

Adding mysql_set_charset('utf8') call to both mentioned functions or setting this later from web2Project code (if adodb modification is not an option) will solve this. This solution will have a side effect of messing up strings already stored in wrong encoding.

Now lets resume what we want and where we are at, and what more needs to be done to get there.

So... we want utf-8 to be generally applicable to web2Project.

Now, what do we have at the moment...
1) The locales are encoded utf-8
2) The forms for the interfaces are encoded utf-8
3) The HTML headers are set to be encoded utf-8
4) There is loads of code inside we2bProject that specifically use php mb_ functions to not break multibyte characters and therefore utf-8 compatible.
5) Our sql scripts, specifically install/sql/mysql/006_special_characters.sql converts the tables to uft-8 encoding

Having said this, is there a need to keep enforcing it, or having the tables defined as "SET utf8 COLLATE utf8_general_ci" is enough for the mysql engine itself behave?

What I means is... I don't really think this is an issue considering that the tables themselves are already set to be utf-8 compatible from within mysql itself.

Thing is I don't want to be introducing mysql specific code inside web2Project even more than we already have, because when we are to add Oracle or SQL server or Firebird or PostgreSQL support in web2Project we are going to face more headaches.

I will skip consequences of this update script and keep focus on current state.
We have all columns marked as utf8 in database already. But we don't know db connection charset because we don't set it. Instead, system-wide setting is used.

1) Web2project sends utf8 string to db server
2) DB server saw it as latin1 (connection charset)
3) Character set of the column is utf8, so db server converts that string (that is in utf8 already) from latin1 to utf8. There are no problems visible in web2project because of reverse process when querying db. But there are 'hidden' problems:
a) Strings are unreadable in database.
b) Sorting probably does not work correctly.
c) If system-wide connection charset will be changed (i.e. by system
administrator or by server update - btw default MSSQL connection charset was
changed from utf8 to latin1 in debian unstable this week - or setting
connection charset in w2p directly) all goes wrong. See next step.
4) If we set db connection charset to utf8 now, we will receive damaged strings from db server.

So before connection charset will be changed (either by setting it by web2project or by system-wide setting change) user must convert database content:

2) change this column charset to utf8 but make sure no conversion will occur.
i.e.
ALTER TABLE `users` CHANGE `user_username` `user_username` varbinary(255);
ALTER TABLE `users` CHANGE `user_username` `user_username` varchar(255) CHARACTER SET `utf8`;
The reason this works is that there is no conversion when you convert to or from BINARY columns. [1]

3) Repeat 1) and 2) for every affected column in database.
4) Rebuild indexes of converted tables.
5) Set connection charset to utf-8.

This five conversion steps must be done at once and be atomic (web2project should be down for maintenance).

So this is why it is not sufficient to simply change database columns charsets (as it is now) nor forcibly set charset of database connection. Users should convert their databases before and only then (but immediately) change connection charset. I design my patch in this manner. It does nothing until users leave $w2Pconfig['dbcharset'] unset. Scripts generating SQL statements for conversion are available at http://github.com/gajdusek/web2Project-tools/tree/master/db_convert_to_utf8/. Conversion script can not be easily and safely integrated into the web2project update code. I tried it, but it involves more code changes and is probably not reliable (connection timeout).

1) Using set_mysql_charset() function as the preferred way to set connection charset and adding compatible function that uses SET NAMES to includes/backcompat_functions.php, because set_mysql_charset() is implemented only in PHP >= 5.2.3 [2].
This all will probably work only in MySQL >= 5.0.7 or, if you're using MySQL 4, then >= 4.1.13. [3], is this an issue?

2) ADODB has implemented charSet property for setting client charset. It supports i.e. Interbase/Firebird, Oracle, sybase and mysqli (according to documentation) but not mysql. My patch set this for all databases and calls set_mysql_charset extra for mysql database. So my patch is not mysql specific, but i did not test its functionality for other databases then mysql.

3) This is why I added $w2Pconfig['dbcharset'] and set connection charset only if user set this variable to something (i.e. utf8). Only for new installations connection charset will be used by default ($w2Pconfig['dbcharset'] = 'utf8' in includes/config-dist.php)

Attached patch does not (hopefully) affect anything until you leave $w2Pconfig['dbcharset'] unset or do not fresh web2project install.

If you set $w2Pconfig['dbcharset'] to utf8 (or other charset) some requirements must fulfilled:

1) Database columns charset (utf8) and encoding of strings stored in them must match (satisfied on fresh installs or if system-wide database connection charset was already set to utf-8, otherwise database conversion must precede).

a) This cannot be satisfied with mysql server < 5.0.7, so minimal requirements must be adjusted. There is mysql 5.0 requirement in the wiki [1], this must be raised to 5.0.7. Older 5.0.x versions do not support 'SET NAMES' query. And/or make advice to ensure that site-wide mysql connection charset is set to utf8, if applicable.

b) Make sure connection charset setting works well for other databases than mysql. i.e. check support in ADODB or insert custom code as for mysql. I have written no conversion scripts for other databases neither (I use only mysql). This does not seem to be problem now because web2project code appears totally unaware about other databases. There should be no problem with fresh installs using other database servers if ADODB can set connection charset via charSet property.

Converting of tables with existing data should be described in some article on the wiki. I just want to repeat: until user will not set $w2Pconfig['dbcharset'] variable or make fresh install nothing will happen. Finally, I can write some article about converting database content, only if somebody could correct my horrible English.

First of all thanks a lot for all this research, because it is very well done.

You touched some sensitive points there and are:
1) The install/upgrade scripts are not prefix compatible, though the code itself is.
2) We need to handle/fix the connection encoding, because as you mentioned the DB engine we support right now is dumb and needs to be educated.
3) Should we worry about systems where their supervisors do not know (or don't care) if their encodings are correctly set (and match) system wide.

About the 1st, well it took some effort to make this thing prefix compatible and we need to sort something to make it that way. Could be string replacement on the install/upgrade scripts or whatever, and by that I mean another way of doing it.

Now nr 2, I believe we must find a way to set the encoding all right, but I prefer a one line of code approach, depending on DB engine, though for now we only have to worry about mysql as you correctly said.
I see that some Dbs react to $db->charSet before the Connect is done on ADODB, I see that mysql reacts to $db->EXECUTE("set names 'utf8'"); after the Connect.
And yes we go and simply change the wiki to the mysql version necessarily.
I don't see why we should add a w2PConfig for this, because utf-8 is the defacto encoding and that's it, period.

As to 3, too bad if they don't care, if they don't care, why should we?
But I appreciate your effort into creating a conversion procedure, but why should we need a conversion procedure when all you needed to do was to make sure everything matched?

It would bee very nice to have this conversation in the wiki, under its own page, to document this whole utf-8 thing.

As to the code itself, all that I see that would change would be to add that utf-8 setting after the connection.

Truth is, right now, if people are connecting against a badly set server, the data retrieved from their servers is all screwed up, right?
So changing it right away would be better than waiting for converting, right?

Like I said we need this "discussion" to have more visibility, also because it carries solutions for already affected systems.

Again thank you very much for the research, code and work, I'll get back to you with what we will implement to force the utf-8 connection.

> 1) The install/upgrade scripts are not prefix compatible, though the code itself is.

Sounds nice that the code is ready.

>About the 1st, well it took some effort to make this thing prefix compatible
>and we need to sort something to make it that way. Could be string replacement
>on the install/upgrade scripts or whatever, and by that I mean another way of doing it.

It will be nice to make some more abstract mechanism for replacing strings in SQL updates. Maybe even make some layer for SQL scripts generators?

>2) We need to handle/fix the connection encoding, because as you mentioned
> the DB engine we support right now is dumb and needs to be educated.

I was wrong about ADODB library. We should switch to 'mysqli', then setting $db->charSet before connection itself will work for mysql too.

>3) Should we worry about systems where their supervisors do not know (or don't
>care) if their encodings are correctly set (and match) system wide.

This is not about system settings, the encodings mismatch is a web2project issue. Keep on reading.

> Now nr 2, I believe we must find a way to set the encoding all right, but I
> prefer a one line of code approach, depending on DB engine, though for now
> we only have to worry about mysql as you correctly said.
> I see that some Dbs react to $db->charSet before the Connect is done on ADODB,
> I see that mysql reacts to $db->EXECUTE("set names 'utf8'"); after the Connect.

Mysqli will react to $db->charSet before the Connect is done on ADODB too.

> I don't see why we should add a w2PConfig for this, because utf-8 is the
> defacto encoding and that's it, period.

Because we could damage data if we simple set $db->charSet('utf8'). This is because we didn't use utf8 for communication with mysql server. Instead we used some 'unknown' character set. It could be utf8, latin1 or something else. We could not know. And we didn't want to know.

And IMO users should be able to use the most proper character set for their usage, it must not be utf8. Another w2PConfig for setting collation is next important thing, without it we will never have correct sorting.

>Truth is, right now, if people are connecting against a badly set server, the
> data retrieved from their servers is all screwed up, right?

No.

> So changing it right away would be better than waiting for converting, right?

IMHO no.

Let me explain a bit more details:

We have database columns character encoding set to utf8 already (006_special_characters.sql) and web2project has no problem with utf8. The missing part is communication between web2project and mysql server.

There are three mysql system variables that define character sets of communication:
1) character_set_client - character set in which statements are sent by the client
2) character_set_connection - to what character set should the server translate a statement after receiving it
3) character_set_results - character set in which the server returns query

All these three are usual set to same value and for particular connection can be changed from PHP code directly (if mysql skip-character-set-client-handshake option is not set).
i.e mysql_set_charset('utf8') or 'set names "utf8"' sql statement set all these three variables to utf8. But mysql_set_charset('utf8') is the preferred way [1]. I will not cover collation here to keep it simple.

Problem is, web2project does not care to tell mysql server in what character set wants to communicate. So now it is mysql's turn.

It will set these variables according to:
1) mysql global or client configuration, if not defined then
2) options passed to mysqld as parameters or
3) compilation options or
4) mysql default - latin1.

Two most common values are utf8 and latin1. Both are entirely proper settings. You cannot say latin1 setting is bad, bad is that web2project does not care to tell mysql server in what charset they will communicate. Web2project is blindly assuming utf8 but MANY servers have latin1 as default setting. Do you know why so many servers have latin1 as default setting? Because there are still so many old/buggy software or sites that do not set connection charset and system-wide change may damage its database content! Yes, web2project is contributing to this unpleasant status.

So now we have two cases:
1) system-wide setting for db connection is utf8 - all is in order RIGHT now.
2) system-wide setting for db connection is latin1 - the retrieved data from server are still correct RIGHT now, but data stored in database are screwed up and sorting (collation) of non-ASCII characters does not work. In fact, sorting will not fully work until we do not change collation of tables/columns to correct language collation, i.e. utf8_czech_ci.

Now,
1) if we migrate (or simply export) our databases to host with another system-wide character set, data will be all screwed up
2) if system administrator decides to change system-wide setting (i.e. from latin1 to utf8) or update system, data will be all screwed up
3) if web2project set connection character set to utf8 right now, data will be all screwed up on systems that have latin1 system-wide.

** SUMMARY **

This is not about badly set servers at all, this is about the bug in web2project. And IMO we should not fix this bug by simple enforcing utf8 connection charset because it may result in data loss if the database content will not be converted before the change.

If you decide, despite my opinion I will try to summarize in further text, to simple force utf8 connection charset, do it, I will not be pushing on you anymore, but please, please, make some BIG FAT warning about that change available to the web2project users.

Simple because it is a bug in web2project.
We shouldn't have been relying on system-wide connection character set because it does not need to be utf8! It can be set i.e. to latin1 and that is entirely proper configuration (forced by history and still existing old or buggy code as currently web2project is), not administrator fault. Web2project connection charset must match the character set of stored data. And we should have ensured this, not system administrators. We did not ensure this in the past and so, I think, we should make everything possible to guarantee database content will not be damaged if we now fix this web2project bug. Otherwise it will be better to do nothing.

There are new hosting servers with utf8 as default mysql charset but there are still ones with latin1. This will not change very soon. I am not web hosting administrator, but if I would be one and would set up a new mysql server I would set utf8 as default charset, but I would probably never change latin1 to utf8 on servers with already existing databases. Administrators of large hosting servers cannot simple change latin1 to utf8 system-wide or migrate sites to new server with utf8 set as default. This will effectively damage sites that do not set db connection charset explicitly. Their databases must be converted first. That is not trivial for sysadmins to achieve, because they must first know what character set for communication with DB has been using by each site. And it is inconvenient for their customers as well, if they should convert it themselves or take a risk their data will be lost. There are many servers with latin1 because of backwards compatibility with old/buggy/legacy projects they are hosting. Consider web2project as one of these. Web2project can be considered as partial reason why so many servers still set latin1 as default charset.

And there are also latin1 as default character set in mysql upstream [2], maybe for same reason.

Please make decision, I am convinced I am right (but who is not?), but I may be wrong (I was wrong many many times already).

After you make decision and changes in the code I will make appropriate article for wiki.

First of all thanks for beating me to a pulp on this one, and please keep doing it because it is being fun. :)

This is yet one more reason I am not too fond of mysql.

If you noticed we have been having a big deal of effort put in this utf8 subject in the past few months.
And the reason this was not being considered such a big thing, was that our primary markets are companies that mainly live with ascii or latin.
But we know that utf8 does not damage those markets and we can make web2project useful to loads more people if we find our way with utf8.

Now having said that, I find it hard to call it a web2Project bug, but rather work in progress.

The way I see it, we should attack this by:
1) Create a wiki page with all this discussion.
2) You have code to make a conversion, right?
Can it detect mysqls server encoding or how the data is encoded and make a conversion?
So lets talk people into making a conversion into utf8, if they require it.
3) Create an option on the system config, a checkbox saying... enforce utf8 data connections to the Database server. With a FAT NOTE pointing to the wiki page, and stating... "IF YOU ACTIVATE THIS OPTION AND HAVEN'T CONVERTED THE DATA ON THE DATABASE TO UTF8, YOUR DATA CHARACTERS WILL MOST PROBABLY END UP SCRAMBLED. YOU HAVE BEEN WARNED!!!
If you are unsure, leave this option unchecked and ask for support on web2Project Forums."
4) If that setting is true:
After line 18 on includes/db_adodb.php we test this setting and then we, depending on the DB engine we do the $db->charSet, or if we are using mysql (not mysqli as you mentioned) we put $db->EXECUTE("set names 'utf8'"); if that config option is enabled.

By default that options is false, or unchecked.

Again, thank you very much for the effort put in this, we appreciate it and you are far from being annoying, don't worry :)

>2) You have code to make a conversion, right?
>Can it detect mysqls server encoding or how the data is encoded and make a conversion?

It detects system-wide connection charset, if it is already utf8 then no conversion is needed, if it differs from utf8, SQL statements needed to utf8 conversion are printed.

> 3) Create an option on the system config, a checkbox saying... enforce utf8
> data connections to the Database server.

I would not make it available in System admin/System configuration page. It will raise even more problems.

Just place it in config-dist.php set to true (for fresh installations using mysql >= 5.0.7) and advice users to set it to utf8 in config.php after they convert existing database. In the code check if this option is defined, if not consider it to be false. Setting it true to config-dist.php is ok until we are setting utf8 connection charset to mysql only.

Optimal would be adding this option to config.php (with appropriate comment) when updating code and make it true if mysql database is used and system default connection charset is utf8 already or make it false otherwise. But I have found no support for this in current code. And config.php must not be writable. So it is probably bad idea.

> 4) If that setting is true:
> After line 18 on includes/db_adodb.php we test this setting and then we,
> depending on the DB engine we do the $db->charSet, or if we are using mysql
> (not mysqli as you mentioned) we put $db->EXECUTE("set names 'utf8'");
> if that config option is enabled.

I would put (conditional variant of) $db->EXECUTE("set names 'utf8'"); only for mysql (after connection is accomplished), don't bother with other databases right now. $db-charSet will work only for oci8 and ibase drivers. I would wait with implementing this.

This solution is long way to be optimal but it is the most simple I can imagine.

==============

To final robust solution for all databases we should take following research and ideas in consideration:

I check the ADODB code through and must correct my statement about mysqli:
Setting $db->charSet before the connection is working only for 'ibase' and 'oci8' drivers (in ADODB version we use) and 'sybase' (in current upstream ADODB). 'mysqli' and 'postgres7' drivers have SetCharSet() methods implemented, they can be used to set encoding after connection was made. This is probably because mysqli does not allow specify connection charset when making connection. This method is not implemented for mysql - these ADODB drivers' methods are just wrappers to appropriate PHP modules (i.e. mysqli) and mysql PHP module does not have implemented this.

So we can force connection charset to oci8, ibase and sybase (if we update ADDODB to new version) drivers by setting $db->charSet('utf8') before actual connection happens. After connection has accomplished we can change charset for mysqli and postgres7 by $db->SetCharSet('utf8') method call. And finally we can force charset to mysql by mysql_set_charset('utf8',$db->_connectionID) and other databases in their specific way.

Right now I would change this only for mysql as I stated above. BTW we should consider use mysqli (MySQL Improved Extension) rather then mysql which is designed for mysql older then 4.1. See [1].

For mysql we should use either mysql_set_charset('utf8',$db->_connectionID) or execute an appropriate sql statement as you have proposed. Though, mysql_set_charset() is preferred way because it returns false if failed and sets internal mysql->charset properly (I am not sure what it means, but mysql_client_encoding() will not register change otherwise and it has probably something to do with mysql_real_escape_string() ). See link [1] bottom of my previous post. But that function is implemented only in PHP >= 5.2.3. Therefore we could simple make $db->EXECUTE("set names 'utf8'") as you proposed or use preferred mysql_set_charset('utf8',$db->_connectionID) and add this function in backcompat_functions.php. This way it is done in my patch.

Other databases charset should be implemented when we will fully support them in the code so we can make testing.

So, yes , set $db->charSet before connection is made, this will work only for ibase and oci8 drivers but can never hurt. After the connection is initiated call $db->SetCharSet() method if it is implemented. If it is not, set connection encoding in database driver specific way.

I realize we should not make config option that will force all databases connection charset to utf8. Because if we do it, we must ensure charset setting will immediately work for every database driver we will support in future.

So we must ensure all this thing will apply only to mysql right now:

1) Add 'mysql' case to db_connect() switch block - only here we should set connection character set.

Though, a bit refactoring of db_connect function will be appropriate, i.e. moving connection code from switch cases before switch block (that code is repeated in all cases).

Now if somebody adds $w2PConfig['dbconnection_charset'] = true to config.php, utf8 will be used for connection.

As addition to this we should add default (=false) value to config-dist.php. So we can place there a comment.

Better, if we set it to DBCONNECTION_CHARSET string that will be replaced by installer to false or, in case mysql will be chosen and it will be >= 5.0.7 version, to true. So fresh installations will be using utf8.

2) So something like this to config-dist.php

//set this value to true to use UTF8 for database connection, keep it false to use system default
$w2PConfig['dbconnection_charset'] = DBCONNECTION_CHARSET;

along with a FAT NOTE pointing to the wiki page, and stating...
1) KEEP THIS OPTION FALSE IF YOU DO NOT USE MYSQL >= 5.0.7 AS YOUR DATABASE.
2) IF YOU SET THIS TO TRUE AND HAVEN'T CONVERTED THE DATA ON THE DATABASE TO UTF8, YOUR DATA CHARACTERS WILL MOST PROBABLY END UP SCRAMBLED.
YOU HAVE BEEN WARNED!!!

3) Installation script will replace DBCONNECTION_CHARSET with FALSE or with TRUE if mysql database is used and if it is new enough (version check).

That's all.
This way we can add support for other database drivers later with no pain.

There are two more functions connecting directly to DB, please, consider if they must be changed too.

They are in classes/w2p/Core/UpgradeManager.class.php and install/manager.class.php

1) I've logged the "installer should support table prefixes" as 0000648.

2) For new installs, the utf-8 tables should be easy to support. We can add the mysql version check and enable the checkbox if it's 5.0.7 or above. Someone please file a request on that one with all relevant details.

3) For existing installs, this will be trouble.. having a one-time switch in the System Admin seems like a reasonable option *IF* we can detect if the character problem exists in their database:
* If it's just a setting problem but not data, let's change the setting, no problem.
* If there's a problem with their data, I'd prefer to show the disabled setting box with an explanation. I don't want to take responsibility for breaking their stuff.

4) Updating the minimum PHP version can only happen on Major releases (each June). I'm *guessing* we should go to 5.2.6 or 5.2.10 by that time. In the meantime, using the backcompat_functions is the way to go. Nice catch gajdusek.

5) If *either* of you find MySQL specific commands, etc outside the install/sql/mysql directory, could you mark those with something like:

// TODO: mysql specific

I'd like to have those findable as I'm digging through SQL Server support now.

There is no checkbox on installation time, instead this option is autodetected during fresh install or d2p migration. Detection is performed by the w2p_Core_UpgradeManager::DBConnectionCharsetSafetyCheck() method. This could be safely autodetected on upgrades as well, but I've found no support of changing already existing config.php.

IMHO there is no need for checkbox. If it is safe to use utf8 connection then it should be used, because w2p code assumes it already. There is no explanation why this option was detected to be unsafe (or not implemented for used db type), I have tried implement this but gave up. I remember no more why exactly. Instead, explanation will appear in sysadmin module (this way the already existing installations are covered too), but only if some user action can solve the issue (i.e. converting database). Nothing will be displayed if $w2Pconfig['dbconnection_charset'] is already enabled or mysql version is too old or unsupported database type is used.

*** caveeats ***

- Only mysql is supported right now
* the patch has no problem if unsupported DB is used
* support for other database can be easily and safety added anytime later