Perl: Default to UTF-8 encoding

By Arnon Weinberg, on June 20th, 2012

The UTF-8 (Unicode) character encoding system is a well-supported alternative to the older ISO-8859-1 (Latin-1) system that can make it easier to work with special characters and multiple languages. Many developers can exercise sufficient control over their systems to ensure that:

All Perl source code is encoded in UTF-8

All text input files and streams are encoded in UTF-8

Interactions with browsers are encoded in UTF-8

Interactions with all other interfaces are UTF-8 based

All other text data is encoded in UTF-8 by default

If, like me, you have decided to standardize your code on UTF-8 and, like me, are not particularly concerned with exceptions such as alternate encodings, then it’s useful (and reasonably safe) to set up scripts to default to UTF-8 character encoding for all text.

Perl versions 5.8 and later automatically tag text as UTF-8 whenever the interpreter knows for certain that it is UTF-8 encoded. Examples include using chr() with a code point above 255, a string containing the \x{code-point} syntax, data from files opened with a :utf8 encoding layer, or any data explicitly tagged as UTF-8 using (for example) utf8::decode().

Unfortunately, for reasons of backwards compatibility with older code, there are many cases where UTF-8 encoded text is not automatically tagged as UTF-8. Examples include string constants in Perl source code, data read through filehandles that have no explicit UTF-8 layer, environment variables, command-line arguments, data fetched from a database, and CGI input, including unescaped URIs, cookies, parameters, and POSTed data.

Strings that contain UTF-8 encoded characters but are not tagged as UTF-8 are treated as Latin-1 text (equivalent to binary), and may produce different or unexpected results in functions such as length() and sort(), and in regular expressions. Additionally, mixing tagged UTF-8 data with untagged UTF-8 data may lead to double-encoding. One solution is to manually tag all text strings as UTF-8. However, assuming as above that the system only ever interacts with UTF-8 encoded text, it is desirable to make that the default in as many cases as is safe to do so.

The following script serves as a useful test case for demonstrating Perl’s default UTF-8 behaviour, and the features that can be turned on to provide better defaults:
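A minimal sketch of such a test script (the literal strings, labels, and choice of environment variable are illustrative):

#!/usr/bin/perl
use strict;
use warnings;

# Each value below contains UTF-8 encoded text; utf8::is_utf8() reports
# whether Perl has set its internal UTF-8 flag on the string.
my %test =
(
    'string constant' => 'héllo',            # not tagged without 'use utf8'
    'chr code point'  => chr(0x263A),        # tagged: code point above 255
    '\x{} escape'     => "\x{263A}",         # tagged: explicit code-point syntax
    'environment'     => $ENV{LANG} || '',   # not tagged
    'argument'        => $ARGV[0]   || '',   # not tagged
);

foreach my $name ( sort keys %test )
{
    my $value = $test{$name};
    print 'utf8=', utf8::is_utf8($value) ? 1 : 0, " $name: $value\n";
}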

Although the characters do (for the most part) display correctly, this is only because the viewer (a shell terminal in this case) happens to interpret the byte sequences as UTF-8. Most of these test cases are not automatically tagged as UTF-8 (utf8=0), and hence mixing them with text that is tagged may lead to the issues noted above. Also note that the two cases that are tagged as UTF-8 (utf8=1) produce a “Wide character in print” warning, which indicates the same problem – they too display correctly only because of the viewer.

The premise here is that in a controlled environment that is standardized on UTF-8, all of the above text can be assumed by default to be UTF-8 encoded, unless Perl is explicitly told otherwise. How do we do this?

1. Tell Perl that all code, including string constants, is UTF-8 encoded by adding:

use utf8;

2. Tell Perl that all input and output files, handles, and streams are UTF-8 encoded by adding:

use open ( ":encoding(UTF-8)", ":std" );

3. Tag all environment variables as UTF-8 in a loop at the beginning of the script. If you are running in taint mode (as you should), then you can combine this with untainting, e.g.:
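One possible form of that loop (the blanket untainting regex here assumes environment values are otherwise trusted):

foreach my $key ( keys %ENV )
{
    $ENV{$key} = $1 if $ENV{$key} =~ /^(.*)$/s;  # untaint via regex capture
    utf8::decode( $ENV{$key} );                  # tag as UTF-8 if the bytes are valid
}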

Now most of the test cases are tagged as UTF-8, with the exception of the non-UTF-8 cases, which are at least displaying correctly now.

Notes:

The term “tagged as UTF-8” refers to the “UTF-8 flag” – Perl’s internal flag that indicates that a string has a known decoding. This is checked in the test script above using utf8::is_utf8(), and is useful for debugging purposes only.

The “use” pragmas above have lexical scope, which means that every script needs them. In a web server (CGI) environment, this means every web page. A good content management system should be able to automate this. It may also be a good idea to use a boilerplate module such as utf8::all.
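For reference, utf8::all (a CPAN module, not part of core Perl) bundles roughly the same defaults in one line:

use utf8::all;  # 'use utf8', UTF-8 layers on the standard and newly opened handles, and UTF-8 decoded @ARGV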

In some cases it may be desirable to enforce correct encoding by die()ing, instead of just logging a warning, when it is violated:

use warnings ( "FATAL" => "utf8" );

Be sure to set up MySQL to use UTF-8 by adding the following to /etc/my.cnf:
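For example (section and option names reflect MySQL 5.x-era configuration and may vary between versions):

[mysqld]
character-set-server = utf8

[client]
default-character-set = utf8

On the Perl side, DBD::mysql also needs mysql_enable_utf8 => 1 in the DBI->connect() attributes before fetched data is tagged as UTF-8.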

CGI’s -utf8 option only automatically decodes string values returned by CGI::param(). It does not decode:

Parameter names

Values returned by other CGI functions, such as cookie(), url_param(), unescape(), or Vars()

File names

A more comprehensive solution, described below, decodes parameter names and values from other CGI functions as well. However, file names must still be decoded manually when used – as described in the CGI documentation.

This solution overrides some existing CGI methods with UTF-8 friendly versions. It encodes all incoming arguments and decodes all outgoing results – so that CGI itself only ever handles encoded byte strings. An exception is made for CGI::param(), which should set UTF-8 decoded values so as to prevent double-encoding problems with input fields.
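A sketch of that approach (the set of wrapped methods is an illustrative subset, and CGI::param() is deliberately left alone):

{
    package CGI;
    use Encode qw( encode_utf8 decode_utf8 );
    no strict 'refs';
    no warnings 'redefine';

    foreach my $method ( qw( cookie url_param unescape ) )
    {
        my $original = \&{"CGI::$method"};
        *{"CGI::$method"} = sub
        {
            # Encode incoming arguments so CGI sees only byte strings...
            my @args = map { defined($_) && !ref($_) ? encode_utf8($_) : $_ } @_;
            # ...and decode outgoing results back into tagged UTF-8 strings.
            if ( wantarray )
            {
                return map { defined($_) && !ref($_) ? decode_utf8($_) : $_ }
                       $original->( @args );
            }
            my $result = $original->( @args );
            return defined($result) && !ref($result) ? decode_utf8($result) : $result;
        };
    }
}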

There are currently many bugs in Perl and its modules with respect to UTF-8 support, so not everything goes smoothly. Perhaps one of the most significant shortcomings is that several file handling functions, including opendir(), readdir(), and features that rely on them such as globbing (<*>), File::Find, etc, do not respect the use open pragma. This means that file and directory names are not automatically decoded, and there is currently no way to change that default behaviour. If you are using a UTF-8 encoded file system with UTF-8 encoded file and directory names, then you will have to decode them manually:

foreach ( <*> )
{
    utf8::decode($_);  # tag each file name as UTF-8 in place
    ...
}

Another example is Data::Dumper – a module often used to serialize and restore complex variables – which does not handle character encodings, so they are lost unless handled manually. Consider using JSON for this purpose instead.
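A minimal sketch of the JSON round trip (the sample data is illustrative):

use utf8;   # the string literal below contains a non-ASCII character
use JSON;

my $json  = JSON->new->utf8;                       # read and write raw UTF-8 bytes
my $bytes = $json->encode( { name => 'José' } );   # $bytes is UTF-8 encoded
my $data  = $json->decode( $bytes );               # strings come back tagged as UTF-8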