For long lost friends and stalkers

UTF-8 A-Go-Go

Recently I finished (more or less) converting a Perl/CGI/MySQL website application to use UTF-8 throughout.

The CGI module and the DBI module currently have lousy character encoding support, so I created Perl packages to fix them (relatively) transparently.

Here’s how, and here’s my code:

UPDATE: I’ve just updated the code based on others’ feedback, for which, many thanks. See comments below. (Jan 2007)

Abstract

I here present slap-on fixes for two really common modules in Perl-5.8-based web development—CGI.pm and DBI.pm—to allow them to properly support character encoding within an all-UTF-8 application. Both fixes can be installed with an application, without requiring admin access to the server (so ideal for a web hosted app); both fixes operate as a thin filter layer, requiring minimal changes to your app. The CGI fix is designed to deal correctly with various web browser character encodings, and the DBI fix is designed to allow you to use pure UTF-8 in your database, even when your driver and version of MySQL don’t properly support it.

I wanted to Do The Right Thing with character encoding, and these are the cheap hacks which allowed me to do it. Please feel free to reuse in your project.

There are other approaches, but the UTF-8 character encoding has a host of advantages (not least ASCII-compatibility). Most text-processing systems, including Perl are moving towards it. Joel Spolsky has an excellent overview of Unicode and character encodings that every programmer should read.

Perl 5.8 provides decent UTF-8 support and very good support for converting between other character encodings. (See perldoc topics perluniintro and perlunicode.)

However, support for UTF-8 in libraries is patchy. Common libraries are ‘CGI’, for parsing CGI input* to server-side script, and ‘DBI’, for database access, neither of which (at least the versions I’ve seen installed anywhere), have any character encoding support, and both of which can easily mangle your carefully-wrought UTF-8.

The libraries described below, ‘UTF8CGI’ and ‘UTF8DBI’ are designed to be post facto fixes for these two common packages.

* Some people use CGI package to write HTML. I don’t. My modifications below don’t address this. However, it will probably work fairly seamlessly.

CGI.pm

The CGI library has been around since the dawn of time. It’s the cannonical way for a Perl script to parse CGI parameter and form input. Some people even use it to generate HTML. (In my opinion it should be replaced with something less monolithic and with better standards support, but that’s just my opinion.)

Issues with the existing module

CGI.pm completely ignores character encodings from client.

Many (especially older) web browsers don’t tell you what encoding their form input is in.

Solutions

My solution was to create a subclass of CGI which magically determines the input character encoding (or at least makes a decent stab in the dark), and converts it all into UTF-8, so that your application really doesn’t need to worry about it.

This subclass, UTF8CGI should be imported into wherever you “use CGI”. Instead of creating a “new CGI” object, just create a “new UTF8CGI” object instead. Your code might look like:

#!/usr/bin/perl -wT

use utf8; # If the script itself is written in UTF-8.
use strict; # Be strict. Note we gave ‘-wT’ flags to Perl too, for extra protection.
use CGI qw/:cgi -private_tempfiles/; # …or whatever parameters you like.
use UTF8CGI; # Import the fix.

DBI.pm

Issues with the existing module

Although most modern versions of database engines have character-encoding support, the current versions of Perl DBI don’t really.

Queries return strings as raw bytes, even when the database columns contain correctly-formed UTF-8. To correctly use such strings, the application must ‘decode’ every string from the database, which is error-prone and tedious. I found a good summary of the issues.

Solutions

My approach was to create a subclass of DBI, which puts a lightweight wrapper around every database query, and UTF-8-decodes every string from every query result.

Note that, because of the simple(minded) approach, the wrapper processes every single field of every row of every query, even if it contains numeric data. Potentially, with large data sets, this could become a significant overhead. Suck it and see.

Also, this requires that every string returned from any queries is either ASCII or valid UTF-8. If your database contains strings encoded as Latin-1, or any other character set, they will come back mangled. (However, if your database contains a mixture of character encodings, and you’re using DBI, you already have problems.) So before using, ensure all your database string data is UTF-8.

To use:

Ensure that UTF8DBI is in your library path

Replace ‘use DBI;’ with ‘use UTF8DBI;’

Replace ‘DBI->connect’ with ‘UTF8DBI->connect’

For example:

#!/usr/bin/perl -wT

use utf8; # The script itself is written in UTF-8
use strict; # Be strict. Note we gave ‘-wT’ flags to Perl too, for extra protection.
use UTF8DBI; # Import the fix.

Should work with Perl5.8 or later, and DBI 1.21 and later. (Tested with Perl 5.8.0 and 5.8.4 and DBI 1.32 and 1.46.)

If we find a string with malformed UTF-8, that string is replaced by a string “malformed-UTF8”. You could edit the code and replace this with a ‘die’, so that the code fails if the database is found to contain badly formed UTF-8, but I found the replaced string to be quite useful while debugging.

Summary

I use them, and they work.

They’re small and lightweight.

However, caveat emptor. You may need to tweak them to get them to work. I’m not providing any technical support. You have the source code, so if they blow up your computer and crash your grandmother, you’re on your own.

However, I hope they’re of use to you. If I save you from pulling your hair out as a result of pesky character-encoding issues, I’ll be very glad.

I don’t have much time to maintain these modules, but please feed back any improvements or bug fixes you make.

One more addendum – bet you can’t guess what I’ve spent all day doing! ;)

I think you need to add:
next if $query->upload($key);
after line 50 of UTF8CGI.pm “foreach my $key ($query->param) {”
Otherwise the filehandle (when grabbed with a later query->upload) seems to be invalid…?

Another update for you :-)
perl -wc UTF8DBI.pm complains with:
UTF8DBI::st::_utf8_() called too early to check prototype at UTF8DBI.pm line 10.

UTF8DBI::st::_utf8_() called too early to check prototype at UTF8DBI.pm line 11.

Apparently (according to google ;) ) the fix is to either forward-declare the _utf8_ sub, or to not use prototypes. I chose the former option (because I’m not familiar with prototypes) and managed to get rid of the warning by inserting:
sub _utf8_();
on the line before the “real” sub.

I’m afraid I’m a mere weekend (well, and evenings) Perl programmer. I need all the help I can get, so usually switch on strictness, warnings, and taint mode, too, if the situation seems to benefit from it.

I’ve also changed UTF8DBI for Pavel Kudinov’s version, but for a couple of changes:
* Removed the prototype on the _utf8_ function.
* Added back in the check for duff UTF8. Belt and braces.
* Removed all but one of the fetch… functions, since, it seems, all of the other statement-handle fetch functions go through that one.
* Now that should be all that’s required, in theory, since, as Pavel says, the DBI::dbselect… functions should just be short ways of getting at the st functions, but that doesn’t seem to be the case. It wouldn’t work without me reimplementing the select… functions too. (At least, this is the case in my version of the DBI module, which, as you can imagine, is the one that I care about.)