The right way to internationalize your Python app

Recently, as part of our push to ship only Python 3 on the Ubuntu 12.10
desktop, I've helped several projects update their internationalization
(i18n) support. I've seen lots of instances of suboptimal Python 2 i18n code,
which leads to liberal sprinkling of cargo-culted .decode() and
.encode() calls simply to avoid the dreaded UnicodeErrors. These get
worse when the application or library is ported to Python 3 because then even
the workarounds aren't enough to prevent nasty failures in non-ASCII
environments (i.e. the non-English speaking world majority :).

Let's be honest though, the problem is not because these developers are crappy
coders! In fact, far from it, the folks I've talked with are really really
smart, experienced Pythonistas. The fundamental problem is Python 2's 8-bit
string type which doubles as a bytes type, and the terrible API of the
built-in Python 2 gettext module, which does its utmost to sabotage your
Python 2 i18n programs. I take considerable blame for the latter, since I
wrote the original version of that module. At the time, I really didn't
understand unicodes (this is probably also evident in the mess I made of the
email package). Oh, to really have access to Guido's time machine.

The good news is that we now know how to do i18n right, especially in a
bilingual Python 2/3 world, and the Python 3 gettext module fixes the most
egregious problems in the Python 2 version. Hopefully this article does some
measure of making up for my past sins.

Stop right here and go watch Ned Batchelder's talk from PyCon 2012 entitled
Pragmatic Unicode, or How Do I Stop the Pain? It's the single best
description of the background and effective use of Unicode in Python you'll
ever see. Ned does a brilliant job of resolving all the FUD.

Welcome back. Your Python application is multi-language friendly, right? I
mean, I'm as functionally monolinguistic as most Americans, but I love the
diversity of languages we have in the world, and appreciate that people really
want to use their desktop and applications in their native language.
Fortunately, once you know the tricks it's not that hard to write good i18n'd
Python code, and there are many good FLOSS tools available for helping
volunteers translate your application, such as Pootle, Launchpad
translations, Translatewiki, Transifex, and Zanata.

So there really is no excuse not to i18n your Python application. In fact,
GNU Mailman has been i18n'd for many years, and pioneered the supporting
code in Python's standard library, namely the gettext module. As part of
the Mailman 3 effort, I've also written a higher level library called
flufl.i18n which makes it even easier to i18n your application, even in
tricky multi-language contexts such as server programs, where you might need
to get a German translation and a French translation in one operation, then
turn around and get Japanese, Italian, and English for the next operation.

In one recent case, my colleague was having a problem with a simple command
line program. What's common about these types of applications is that you
fire them up once, they run to completion then exit, and they only have to
deal with one language during the entire execution of the program,
specifically the language defined in the user's locale. If you read the
gettext module's documentation, you'd be inclined to do this at the very
start of your application:

    import gettext
    from gettext import gettext as _

    gettext.textdomain(my_program_name)

then, you'd wrap translatable strings in code like this:

    print(_('Here is something I want to tell you'))

What gettext does is look up the source string (i.e. the argument to the
underscore function) in a translation catalog, returning the text in the
appropriate language, which will then be printed. There are some additional
details regarding i18n that I won't go into here. If you're curious, ask in
the comments, and I'll try to fill things in.
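
To make the lookup concrete, here is a minimal sketch using the class-based
API. The domain name my_program is a placeholder, and the sketch assumes no
catalog is installed for it, in which case the source string comes back
unchanged.

```python
import gettext

# 'my_program' is a placeholder domain used only for illustration.
# With fallback=True, a missing catalog yields a NullTranslations
# object instead of raising an error.
catalog = gettext.translation('my_program', fallback=True)

source = 'Here is something I want to tell you'
translated = catalog.gettext(source)

# Without a matching catalog entry, the source string is returned as-is.
print(translated)
```

With a real catalog installed for the user's language, the same call would
return the translated text instead.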

Anyway, if you do write the above code, you'll be in for a heap of trouble, as
my colleague soon found out. Just running his program with --help in a
French locale, he was getting the dreaded UnicodeEncodeError:

"UnicodeEncodeError: 'ascii' codec can't encode character"

I've also seen reports of such errors when trying to send translated strings
to a log file (a practice which I generally discourage, since I think log
messages usually shouldn't be translated). In any case, I'm here to tell you
why the above "obvious" code is wrong, and what you should do instead.

First, why is that code wrong, and why does it lead to the
UnicodeEncodeErrors? What might not be obvious from the Python 2
gettext documentation is that gettext.gettext() always returns 8-bit
strings (a.k.a. byte strings in Python 3 terminology), and these 8-bit strings
are encoded with the charset defined in the language's catalog file.

It's always best practice in Python to deal with human readable text using
unicodes. This is traditionally more problematic in Python 2, where English
programs can cheat and use 8-bit strings and usually not crash, since their
character range is compatible with ASCII and you only ever print to English
locales. As soon as your French friend uses your program though, you're
probably going to run into trouble. By using unicodes everywhere, you can
generally avoid such problems, and in fact it will make your life much easier
when you eventually switch to Python 3.

So the 8-bit strings that gettext.gettext() hands you have already sunk
you, and to avoid the pain, you'd want to convert them back to unicodes before
you use them in any way. However, converting to unicodes makes the i18n APIs
much less convenient, so no one does it until there's way too much broken code
to fix.
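
The failure mode can be reproduced in miniature. The sketch below is written
for Python 3, with a bytes object standing in for what Python 2's
gettext.gettext() returns. The French sample string and the UTF-8 charset are
assumptions for illustration, not taken from any real catalog.

```python
# Stand-in for the 8-bit string Python 2's gettext.gettext() returns:
# bytes encoded with the catalog's charset (assumed to be UTF-8 here).
catalog_charset = 'utf-8'
raw = 'Voilà le résultat'.encode(catalog_charset)

# Convert to text once, at the boundary, and use text everywhere else.
text = raw.decode(catalog_charset)

# Forcing non-ASCII text through the ASCII codec produces the same
# kind of error quoted earlier.
try:
    text.encode('ascii')
    failed = False
except UnicodeEncodeError as error:
    failed = True
    print(error)
```

The point of the sketch is the single decode step: once the value is text, the
only place an encoding matters is the final write to a terminal or file.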

What you really want in Python 2 is something like this:

    from gettext import ugettext as _

which you'd think you should be able to do, the "u" prefix meaning "give me
unicode". But for reasons I can only describe as based on our
misunderstandings of unicode and i18n at the time, you can't actually do that,
because ugettext() is not exposed as a module-level function. It is
available in the class-based API, but that's a more advanced API that again
almost no one uses. Sadly, it's too late to fix this in Python 2. The good
news is that in Python 3 it is fixed, not by exposing ugettext(), but by
changing the most commonly used gettext module APIs to return unicode
strings directly, as they always should have done. In Python 3, the obvious
code just works:

    from gettext import gettext as _

What can you do in Python 2 then? Here's what you should use instead of the
code at the beginning of this article:

    import gettext
    _ = gettext.translation(my_program_name).ugettext

and now you can wrap all your translatable strings in _('Foo') and it
should Just Work.
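
If the same module has to run under both Python 2 and Python 3, one possible
sketch is to pick the method by version, since ugettext() exists only on
Python 2 translation objects. The domain name is a placeholder, and
fallback=True is an addition here so that a missing catalog returns the source
string instead of raising an error.

```python
import gettext
import sys

# Placeholder domain; fallback=True avoids an exception when no
# catalog is installed for the user's language.
translations = gettext.translation('my_program', fallback=True)

if sys.version_info[0] < 3:
    # Python 2: ugettext() returns unicode objects.
    _ = translations.ugettext
else:
    # Python 3: gettext() already returns str, which is unicode.
    _ = translations.gettext

message = _('Foo')
```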

Perhaps more usefully, you can use the gettext.install() function to put
_() into the built-in namespace, so that all your other code can just use
that function without doing anything special. Again, though, we have to work
around the boneheaded Python 2 API. Here's how to write code which works
correctly in both Python 2 and Python 3:

    import sys, gettext

    kwargs = {}
    if sys.version_info[0] < 3:
        # In Python 2, ensure that the _() that gets installed into built-ins
        # always returns unicodes.  This matches the default behavior under
        # Python 3, although that keyword argument is not present in the
        # Python 3 API.
        kwargs['unicode'] = True
    gettext.install(my_program_name, **kwargs)

Or you can use the flufl.i18n API, which always returns unicode
strings in both Python 2 and Python 3.
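
Returning to gettext.install() for a moment, here is a quick check of what it
actually does. This sketch is shown for Python 3 only, with a placeholder
domain and no catalog installed, and it looks up the installed function in the
built-in namespace.

```python
import builtins
import gettext

# Placeholder domain. install() falls back to a null catalog when no
# translation files are found, so this runs without any .mo files.
gettext.install('my_program')

# _() now lives in the built-in namespace, so any module can call it
# without importing anything.
installed = builtins._
greeting = installed('Foo')
```

Because the name lands in built-ins, it is visible to every module in the
process, which is convenient for applications but usually not what a library
wants.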

Also interesting was that I could never reproduce the crash when ssh'd into
the French locale VM. It would only crash for me when I was logged into a
terminal on the VM's graphical desktop. The only difference between the two
that I could tell was that in the desktop's terminal, locale(1) returned
French values (e.g. fr_FR.UTF-8) for everything, but in the ssh console,
it returned the French values for everything except the LC_CTYPE
environment variable. For the life of me, I could not get LC_CTYPE set to
anything other than en_US.UTF-8 in the ssh context, so the reproducible
test case would just return the English text, and not crash. This happened
even if I explicitly set that environment variable either as a separate export
command in the shell, or as a prefix to the normally crashing command. Maybe
there's something in ssh that causes this, but I couldn't find it.
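
When chasing a discrepancy like this, it can help to print what the process
actually sees. The helper below is a hypothetical diagnostic, not part of
gettext. It applies the usual POSIX precedence (LC_ALL, then the specific LC_*
variable, then LANG) to a mapping of environment variables. The sample
environment is an assumption about what the ssh session might have looked
like, not a capture from the real machine.

```python
import os

def effective_locale(category, environ=None):
    """Return the locale name in effect for a category such as 'LC_CTYPE'.

    Follows POSIX precedence: LC_ALL overrides the category variable,
    which overrides LANG. Empty values count as unset.
    """
    if environ is None:
        environ = os.environ
    for name in ('LC_ALL', category, 'LANG'):
        value = environ.get(name)
        if value:
            return value
    # Assumption: report the "C" locale when nothing is set.
    return 'C'

# Hand-built environment: French everywhere except LC_CTYPE.
ssh_env = {'LANG': 'fr_FR.UTF-8', 'LC_CTYPE': 'en_US.UTF-8'}
print(effective_locale('LC_CTYPE', ssh_env))     # en_US.UTF-8
print(effective_locale('LC_MESSAGES', ssh_env))  # fr_FR.UTF-8
```

This only reads the variables. It does not check whether the named locale is
installed, so compare its output with what the locale command reports.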

One last thing. It's important to understand that Python's gettext module
only handles Python strings, and other subsystems may be involved. The
classic example is GObject Introspection, the newest and recommended
interface to the GNOME Object system. If your Python-GI based project needs
to translate strings too (e.g. in menus or other UI elements), you'll have to
use both the gettext API for your Python strings, and set the locale for
the C-based bits using locale.setlocale(). This is because Python's API
does not set the locale automatically, and Python-GI exposes no other way to
control the language it uses for translations.
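
A minimal sketch of that combination follows, shown for Python 3. The function
name setup_i18n and the domain name are placeholders, and catching
locale.Error is an addition for machines where the user's locale is not
installed.

```python
import gettext
import locale

def setup_i18n(domain):
    """Configure both the C library locale and Python's gettext."""
    try:
        # An empty string selects the locale from the user's environment;
        # this is what the C-based libraries consult.
        locale.setlocale(locale.LC_ALL, '')
    except locale.Error:
        # The requested locale is not available on this system;
        # continue with the current locale.
        pass
    # Install _() into built-ins for the Python-level strings.
    gettext.install(domain)

setup_i18n('my_program')  # placeholder domain
```

Call this once, early in startup, before any UI elements are created.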