Teaching CakePHP to be Multilingual (part 2)

Speed: Strings are pulled from a binary file and aggressively cached inside the Apache process.

Robustness: Gettext has been a standard for years and is used in a wide variety of systems reliably.

Friendliness: Due to its age and widespread use, the .po file that is distributed to localizers is widely recognized and has several applications to assist in the translation.

Anything else? Command-line programs for creating and merging gettext files already exist.

Fantastic: it fits the bill and will hopefully be straightforward to implement. One of the first questions that arose was whether we should use actual phrases or placeholder strings in the template files. This would be the difference between, for example, “Welcome to Remora” and “header_welcome” in the template files. Using the actual strings would make translation simpler, but if we wanted to make a minor change to a phrase in English (like adding a comma), we’d have to regenerate the .po files, remerge them, and reverify them. If we used placeholder strings, we’d lose gettext’s built-in fallback of returning the input string when a match can’t be found in the .po file, and the placeholders wouldn’t be as straightforward for localizers to translate.

After polling some people with expertise in localization, we reached a surprisingly unanimous decision to use placeholder strings. We’d just have to make sure our translations existed so that we didn’t need to depend on gettext’s built-in fallback.

The ONLamp tutorial already does a great job of covering the setup and basic use of gettext, so I’ll skip the fundamentals that are explained there (but be sure to read it before you continue here!). One aspect worth mentioning in addition to that tutorial is support for plural forms. In English, this usually means adding an ‘s’ – for example, nacho vs. nachos. In other languages (Polish is the classic example), the plural forms get much more complex, often depending on the exact number of nachos rather than just on whether you have more than one.

Gettext supports multiple plural forms through a “Plural-Forms” header in the .po file. The header can contain a fairly complicated chain of ternary operators that, when evaluated for a given count, produces a number. That number is used as an index into an array of translations in the .po file. That’s a confusing couple of sentences, so let’s look at an example. If we were just dealing with English, we could write something like this to handle plural forms:
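A minimal sketch of that English-only approach, assuming a count of unread messages. The function name header_message() and the message text are made up for illustration:

```php
<?php
// Hypothetical helper: chooses between two hard-coded English strings.
// This works for English only, because it assumes exactly two forms
// (singular when the count is 1, plural otherwise).
function header_message(int $number): string
{
    if ($number == 1) {
        return sprintf('You have %d message.', $number);
    }
    return sprintf('You have %d messages.', $number);
}

echo header_message(1), "\n"; // You have 1 message.
echo header_message(3), "\n"; // You have 3 messages.
```

The if/else is the part that does not survive translation: it bakes the English rule into the code.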

If we were to convert that directly to gettext, we’d be left with two strings to translate, with the only difference being the plural. Luckily for us, gettext supports plurals; unfortunately, it requires an inconvenient change to your code wherever you need them. To make the above code gettext- and plural-friendly, we’ll use the ngettext() function. With it, we can pass the $number variable to gettext so it can determine which array index to return.

<?php
// It looks like we have some redundancy here, but that's the way it works -
// Since we're using placeholder strings, the first and second parameters are the same.
// $number shows up twice because we're passing it to ngettext() and sprintf()
echo sprintf(ngettext('header_message_num', 'header_message_num', $number), $number);
?>

The parts of the English .po file that are relevant to this example would look like:
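A hedged sketch of such an excerpt, assuming the standard two-form English rule given in the gettext manual. Only the header_message_num key comes from the code above; the translated text and the charset line are assumptions:

```
msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\n"
"Plural-Forms: nplurals=2; plural=n != 1;\n"

msgid "header_message_num"
msgid_plural "header_message_num"
msgstr[0] "You have %d message."
msgstr[1] "You have %d messages."
```

Here nplurals=2 declares two forms, and the plural expression yields 0 when n is 1 and 1 otherwise.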

The msgid and the msgid_plural correspond to the first and second parameters to ngettext() (in our case, they’re equal). The $number variable we passed to ngettext() is run through the expression given in the Plural-Forms header, which results in either a zero or a one – the index into the msgstr array. The gettext manual has a section on plural forms that gives more complex examples, including the expressions for other languages.
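To make the index calculation concrete for a language with more than two forms, here is a minimal sketch of the Polish rule listed in the gettext manual (nplurals=3), rewritten as plain PHP. The function name plural_index_pl() is hypothetical; gettext evaluates the header expression internally, so this is only an illustration and not something you would write in application code.

```php
<?php
// Hypothetical re-implementation of the Polish Plural-Forms expression:
//   plural=n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2;
// Written with if statements because PHP's nested ternaries need explicit
// parentheses and are hard to read.
function plural_index_pl(int $n): int
{
    if ($n == 1) {
        return 0; // singular form
    }
    if ($n % 10 >= 2 && $n % 10 <= 4 && ($n % 100 < 10 || $n % 100 >= 20)) {
        return 1; // counts ending in 2-4, except 12-14
    }
    return 2; // everything else
}

foreach ([1, 2, 5, 12, 22, 25] as $n) {
    printf("%d -> msgstr[%d]\n", $n, plural_index_pl($n));
}
// 1 -> msgstr[0]
// 2 -> msgstr[1]
// 5 -> msgstr[2]
// 12 -> msgstr[2]
// 22 -> msgstr[1]
// 25 -> msgstr[2]
```

A Polish .po file would then carry three msgstr entries, msgstr[0] through msgstr[2], for each plural message.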

Overall, gettext works as advertised and fulfills our requirements for static localization, save for a couple of headaches. Firstly, it employs very aggressive caching, and sometimes it gets a little carried away. In fact, we’ve been unable to find a way to disable or flush the gettext cache without restarting Apache. This is an inconvenience but not a deal killer for us. Hopefully our translations won’t change much once they’re finished, but it’s still annoying enough to make us wonder why there isn’t a more convenient solution.

The second headache is with gettext’s feature set: it doesn’t support declensions. Since, as far as I know, this is a foreign concept in English, let’s look at an example in Spanish. Let’s say we have the following sentence we want to represent in gettext (I’ll skip the placeholder strings for the sake of simplicity):

<?php
sprintf(_('Come el %s.'), $fruit);
?>

If you’re familiar with Spanish, you’ll notice that’s a masculine sentence, which would be appropriate if $fruit were “plátano”. However, what if $fruit were “pera”? The sentence would come out as “Come el pera.” when it should say “Come la pera.” As it currently stands, gettext offers no way to deal with the genders of words. For Remora, we’re going to have to depend on some creative wording to help us avoid situations like the above example.
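The mismatch is easy to reproduce. The sketch below assumes a hypothetical eat_sentence() helper that stands in for the _() call, so that it runs without a compiled catalog:

```php
<?php
// Hypothetical stand-in for sprintf(_('Come el %s.'), $fruit).
// The article "el" is fixed in the template, so it cannot follow
// the gender of whatever noun gets substituted in.
function eat_sentence(string $fruit): string
{
    return sprintf('Come el %s.', $fruit);
}

echo eat_sentence('plátano'), "\n"; // Come el plátano.  (correct)
echo eat_sentence('pera'), "\n";    // Come el pera.     (should be "Come la pera.")
```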

Despite the shortcomings, I think gettext was the right decision. It has some hiccups that are frustrating, but it’s still the best thing out there. Look forward to another long and complicated post about dynamic localization in the future…

Addendum: After I wrote this post, I saw in the news that CakePHP 1.2 now boasts gettext() functionality in the core – good work to all involved.

3 responses

That’s the reason why I gave up on gettext and similar interfaces 15 years ago. It’s too anglo-centric to be usable in other languages.

The problem is that the developer assumes that it’s possible to do a straight translation. But that’s not necessarily true. For example, who says that there will always be 1 argument to replace in “You have %d messages”? OK, maybe in this simple example, but the language might dictate otherwise.

As an example, I had a fellow programmer who actually solved the masculine/feminine/neuter problem above. Bzzzzzt! Wrong! Some languages have more forms (Czech has 4). Some have fewer. In Portuguese, a man might say “Muito obrigado” (Thank you very much), but a woman says “Muito obrigada”. Etc., etc. It’s an endless nightmare.

And no, I don’t have a good solution either – there are a few attempts, mainly in the AI area, but most are often unusable in C/C++.

I don’t see the problem going away anytime soon, so if someone has put in some effort to make it better, I’d like to see what they did. We’re toying with the idea of making it better ourselves ( http://wiki.mozilla.org/L20n ) so it would be great to learn from others’ mistakes.

From my analysis, which led to the l20n proposal that Wil linked to above, it’s mighty tricky to really do l10n right from a non-object-oriented language such as C, though it’s easier when things are encapsulated in a somewhat ‘intelligent’ library. Thus I moved the replacement stuff into the library.

I agree that localizing software can only be as good as the data that you get, that is why I hope for good results at least as long as the information is contained inside the localization. Adding computed data to the image will likely result in compromises, and the trick is to find out which.

I feel that somehow the architecture available or at least used today doesn’t put the power and the lack thereof close enough together, and I intend to fix that, too.

Feedback on l20n is welcome, either in the i18n newsgroup, in the wiki, or directly to me.