DBI

XML::LibXML

Sadly, XML::LibXML's serialization code will sometimes encode non-ASCII characters as numeric entities rather than as UTF-8. For example, you might see p&#xE2;te sucr&#xE9;e instead of pâte sucrée. This is valid XML, but it plays havoc with, e.g., MySQL's sorting!

Most perniciously, this transformation can be invisible if you look at the data in a web browser, and of course it depends on both the data being serialized and (apparently) the version of the underlying libxml2.
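To make the problem concrete, here is a minimal sketch in Python (chosen only as a neutral way to look at bytes; it does not use XML::LibXML). The two sample documents and the element name r are made up for illustration. It shows that a parser sees the same text in both serializations, while the serialized bytes differ, which is what trips up anything that stores or sorts the unparsed form.

```python
import xml.etree.ElementTree as ET

# Two serializations of the same text (hypothetical sample documents):
# one uses numeric character references, the other raw UTF-8 bytes.
with_entities = b"<r>p&#xE2;te sucr&#xE9;e</r>"
with_utf8 = "<r>p\u00e2te sucr\u00e9e</r>".encode("utf-8")

# An XML parser recovers identical text from both documents...
text_a = ET.fromstring(with_entities).text
text_b = ET.fromstring(with_utf8).text
same_text = text_a == text_b

# ...but the serialized bytes are different, so a byte-wise comparison
# or sort of the unparsed data treats them as different strings.
same_bytes = with_entities == with_utf8

print(same_text, same_bytes)
```

A browser resolves the entities just as the parser does, which is why the difference stays hidden until you look at the bytes.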

Handy debugging hints

I think this stuff can be quite a pain to debug. There seems to be ample scope for confusion because transformations sometimes happen automatically or invisibly. Ultimately, some sort of hex dump (or the like) gives us unambiguous data.
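As a rough illustration of what "unambiguous" might look like, the following Python sketch prints code points for text and hex for raw bytes. The helper name describe is made up for this example; any hex dumper does the same job.

```python
def describe(value):
    """Return an unambiguous description of a str or bytes value."""
    if isinstance(value, bytes):
        # Raw bytes: show each byte as two hex digits.
        return "bytes: " + value.hex(" ")
    # Text: show each character's code point.
    return "chars: " + " ".join(f"U+{ord(ch):04X}" for ch in value)

print(describe("\u00e9"))                  # chars: U+00E9
print(describe("\u00e9".encode("utf-8")))  # bytes: c3 a9
```

Unlike a terminal or a browser, this output cannot silently re-interpret the data on its way to your eyes.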

It also seems to be handy to memorize the byte sequences you might see. fileformat.info has good Unicode pages.

For example, consider é, which has code point 0xe9. I managed to create these sequences:

The correct UTF-8 encoding: 0xc3, 0xa9.

A broken 'double' UTF-8 encoding: 0xc3, 0x83, 0xc2, 0xa9. To get this I managed to run the data through the UTF-8 encoder twice.
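Both sequences can be reproduced with a short Python sketch. It assumes the second encoding pass treats the already-encoded bytes as Latin-1 characters, which is the usual way this accident happens. The helper undo_double_encoding is a hypothetical name and only a heuristic: it will also "repair" text that legitimately contains characters such as Ã followed by ©.

```python
e_acute = "\u00e9"  # é, code point 0xe9

# Correct: encode the text exactly once.
correct = e_acute.encode("utf-8")

# Broken: the UTF-8 bytes are mistaken for Latin-1 text and encoded again.
double = correct.decode("latin-1").encode("utf-8")

print(correct.hex(" "))  # c3 a9
print(double.hex(" "))   # c3 83 c2 a9

def undo_double_encoding(data):
    """Reverse one extra round of UTF-8 encoding (heuristic).

    Returns the input unchanged when it does not look double-encoded.
    """
    try:
        # Peel off the outer layer, then check that valid UTF-8 remains.
        candidate = data.decode("utf-8").encode("latin-1")
        candidate.decode("utf-8")
    except (UnicodeDecodeError, UnicodeEncodeError):
        return data
    return candidate

print(undo_double_encoding(double).hex(" "))  # c3 a9
```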

Look in the MySQL .MYD file

For applications that load data into a database and then display it, the MySQL data files are a convenient place to bisect the problem.

If you use MySQL's default MyISAM storage engine, then the data in a table are stored in a .MYD file called $datadir/$database/$table.MYD. It's a binary file, but you can always dump it. On Mac OS X: