Writing Multilingual Sites With mod_perl and Template Toolkit

The Perl Journal March 2003

By Stas Bekman and Eric Cholet

Eric and Stas are the authors of the upcoming book Practical mod_perl (O'Reilly and Associates). Eric runs his own consulting business, Logilune, in Paris and can be reached at cholet@logilune.com. Stas is sponsored by TicketMaster to work on mod_perl development and can be reached at stas@stason.org.

Before you search for a solution for your multilingual site, you have to figure out what kind of service you are going to provide: dynamic or static. If the pages are static, you need to evaluate whether there will be many pages to maintain or just a few. If you have only a few pages, the easiest solution is to just prepare each page in each language and forget about it.

If you have many pages, it's pretty much the same whether your pages are dynamic or static: Manual maintenance of many pages is time consuming and error prone; in a word, ineffective. Therefore, the correct solution is to approach the problem as if it were a dynamic site, and generate static pages. Template Toolkit (http://www.template-toolkit.org/) provides a utility program called "ttree" that creates static pages from dynamically generated output. From now on, we will assume that you are developing a dynamic site.

User Language Detection

Another important question is the process of figuring out what language should be used when presenting the content. The following algorithm tries to answer this question:

1. Separate users into two groups: those who are visiting the site for the first time, and those who have previously visited the site.

2. If you use cookies to track users (or some other mechanism that stores the information on the client side), and a user connects from the same machine/account that was used when previously accessing your service, you should already know the language preferences: This bit of information can be stored in the cookie, thereby answering the question of language detection for the second group of users.

3. If you don't use cookies or some other mechanism to track users, it probably doesn't matter whether they have accessed the site beforehand because you have no way to tell what their language preferences are.

4. If users have registered with your service, their language preference will be known after they have logged in, since preferences can be stored on the server side. However, the problem is that you need to know the user's language to display the login form.

5. You can try to figure out the user's language by deducing it from the user's country. One way to determine the user's country is by doing a reverse DNS lookup on the client IP address. This yields the user's computer name. You can then use the top-level domain (TLD) to make some reasonable assumptions about the language: Chances are, most users in the .fr domain can read French, for example. Since many hosts do not have correct reverse DNS mapping, you might also be tempted to deduce the country from the IP address itself. However, this approach is bound to fail in many cases: There are plenty of users whose visible IP address is outside their country; for example, AOL users worldwide use AOL proxies located in the United States.

So, we are back to the first group of users: those we know nothing about. We must provide them with a way to choose a language. The best method is to present a page with all available languages, with each language name written in that language. Each name is linked to the version of the service with content presented in that language.

We can go a little bit further and try to make an intelligent guess of the preferred language. This guess is made by looking at the Accept-Language header sent by most browsers. Localized versions of modern browsers set up the preferred language at install time. If users know that it's possible to adjust the language preference in their browser, there is a chance that they will. For example, they might set the following preferences:

German

English-US

French

which might mean the following: My preferred language is German. I also understand American English to some extent, and I know a little bit of French. (Of course, a user might know all three languages perfectly, but still prefer one language over another).

When a browser sends a request to a server, it generates the following header:

Accept-Language: de,en-us;q=0.7,fr;q=0.3

where the languages are separated by commas, and each alternate language can carry a quality value (the q parameter) expressing the preference level. In this example, American English is weighted at 0.7 and French at 0.3, relative to German at the default weight of 1.

You can parse this header manually, but a better approach is to use standard CPAN modules. If you are using mod_perl, you can use the Apache::Language module; otherwise, use the all-purpose HTTP::Negotiate module bundled into the libwww distribution.
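
If you do want to parse the header by hand, the following is a minimal sketch (for production code, prefer the CPAN modules just mentioned):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A minimal, hand-rolled Accept-Language parser (a sketch only; prefer
# HTTP::Negotiate or Apache::Language in real applications).
sub parse_accept_language {
    my $header = shift;
    my @langs;
    for my $part (split /\s*,\s*/, $header) {
        my ($lang, @params) = split /\s*;\s*/, $part;
        my $q = 1;                           # default quality value
        for my $p (@params) {
            $q = $1 if $p =~ /^q\s*=\s*([\d.]+)/;
        }
        push @langs, [ lc $lang, $q ];
    }
    # highest quality value first (Perl's sort is stable for ties)
    return map { $_->[0] } sort { $b->[1] <=> $a->[1] } @langs;
}

my @preferred = parse_accept_language('de,en-us;q=0.7,fr;q=0.3');
print "@preferred\n";    # de en-us fr
```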

Browsers bundled with OS or ISP packages are usually preconfigured with the language of the country the package was issued in. So if users aren't computer savvy, chances are that the default language setting will be correct. If you receive this header, you may want to try your luck and present the top of the first page in the language derived from the header. But you still have to give users an option to change the language, since the browser setting might be incorrect for the particular user.

Remember that Accept-Language is useful for making your service more user friendly and sparing yourself the hassle of picking the right language, but it doesn't come as a replacement for the standard way of presenting the available languages.

At this point, we know the user's preferred language. In the case of a dynamic site, we proceed with generating the content. Otherwise, we simply direct the user to the right static content.

Generating the Content

When dynamic content is generated, at least two basic ingredients are used:

Invariant data: page headers and footers, side navigation, and other table information that is always the same.

Variant data: data that is not known a priori, which depends on the user or some other input.

When a site is generated in a single language, these two items are easily implemented: Either use templates for invariant data or hardcode it into the code, and retrieve the variant data from the database or through some other method. When the requirements include multilingualism, these tasks become more complex. We'll discuss each of them separately.

Fetching Dynamic Data

We separate the dynamic data requests into two groups: those that require user input and those that don't. A site search feature falls into the first category, whereas browsing the site belongs to the second one.

Searching

Let's take as an example a movie server and a user whose preferred language is French. Our user searches for a movie by entering keywords in a search box.

French includes accented characters, as do many other languages. An accented character usually uses a nonaccented character as a base: For example, the characters â, à, and á are all based on the character a.

Because not all software supports accented characters, or appropriate keyboard maps are not always available, the user might generate input using only the base characters, without accents. In fact, even with proper software and hardware support, most French users will type keywords with no accents. The server is still expected to interpret this input correctly as if the accented characters were used.

We cannot guess which characters were entered without their accents; therefore, the obvious solution is to make the search index free of accented characters. This means that you'll have to keep two versions of the text: one version adjusted for the search, and the original, unaltered version. You need the original because you still have to output the correct text, regardless of the user's input limitations.

In this article, I'll use the ISO-8859-1 character set, which is used by many Western European languages.

Listing 1 allows you to convert accented characters into their base characters. The code generates two methods: iso_8859_1_lc(), which turns any ISO-8859-1 input into a lowercase, accent-free version; and iso_8859_1_uc(), which yields an accent-free uppercase version of its input.
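
Since Listing 1 isn't reproduced here, the following is a hedged sketch of the kind of functions it might generate: closures stored in a %charsets hash that fold case and strip ISO-8859-1 accents with tr///.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;

# A sketch of the generated functions (assumed, not the original
# Listing 1): map ISO-8859-1 accented characters to their base
# characters after folding case.
my %charsets;
$charsets{'iso-8859-1'}{lc} = sub {
    my $s = lc shift;
    $s =~ tr/àáâãäåçèéêëìíîïñòóôõöùúûüýÿ/aaaaaaceeeeiiiinooooouuuuyy/;
    return $s;
};
$charsets{'iso-8859-1'}{uc} = sub {
    my $s = uc shift;
    $s =~ tr/ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ/AAAAAACEEEEIIIINOOOOOUUUUY/;
    return $s;
};
```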

For example, when you call:

$stripped_lc = $charsets{'iso-8859-1'}{lc}->('Bienvenüe');

$stripped_lc will be set to:

bienvenue

These functions are used twice: first, when creating the search index; and second, when accepting the search string, before the actual search is performed. Lowercasing both the index and the query is the usual technique.

In addition, these functions can be used for sorting.
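
The article's original sorting function isn't reproduced here; as a hedged sketch, the idea is to sort by an accent-free, lowercased key, so that a word like 'été' sorts among the e's instead of after 'z' (as a raw byte comparison would place it):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;

# Sketch: build an accent-free, lowercased sort key (a stand-in for
# the generated iso_8859_1_lc() function).
sub sort_key {
    my $s = lc shift;
    $s =~ tr/àáâãäåçèéêëìíîïñòóôõöùúûüýÿ/aaaaaaceeeeiiiinooooouuuuyy/;
    return $s;
}

my @sorted = sort { sort_key($a) cmp sort_key($b) }
             ('zèbre', 'été', 'abricot', 'eau');
# @sorted is now ordered as: abricot, eau, été, zèbre
```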

Browsing

When a user browses the site, preset data inputs are used (be careful to make sure that inputs you assume to be nonchangeable really can't be changed by users). For example, after a search has successfully completed and matched one or more records, you may list all the matched results or a subset of them. From there, the user clicks on one of the links to get to the full record.

At this point, you may want to use the original version of the text, with all its characters; but when building links, these characters should be escaped, since the browser may otherwise interpret the request incorrectly when the link is clicked. To accomplish this, you can use URI::Escape::uri_escape() or Apache::Util::escape_uri() (a much faster implementation under mod_perl).

The text itself should be encoded as well so the browser will not mess it up. The HTML::Entities::encode() or Apache::Util::escape_html() functions can be used for that.
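
For instance (a sketch using the modules just mentioned; the movie title and URL path are hypothetical):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI::Escape    qw(uri_escape);
use HTML::Entities qw(encode_entities);

# Escape a title both for use in a link URL and for HTML output.
my $title = 'Tom & Jerry';
my $link  = '/movie?title=' . uri_escape($title);
my $html  = encode_entities($title);
print "$link\n";   # /movie?title=Tom%20%26%20Jerry
print "$html\n";   # Tom &amp; Jerry
```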

Data Retrieval

One of the big questions is how to build your database so that it will accommodate the site's multilingual capability. Obviously, you should avoid creating language-specific fields in every table that includes multilingual data. For example, a table like:

id
title_en
title_fr
description_en
description_fr

is a bad idea because, as you can see, the table will require many columns per language. Don't forget that the number of columns will actually be doubled, since you need to duplicate each column for the searchable version of the text. Every time you want to support a new language, you'll have to alter the table and add more columns. A better solution is to place all language-specific data in one table:

id
lang
real_text
search_text

where lang specifies the language, real_text holds the real text, and search_text holds the version of the text adjusted for searching. id is needed to map every record back to the table the data belongs to. This link back to the actual data table can be more complex, comprising several fields. In one project, we used three fields to represent a unique data ID:

orig_table
orig_column
orig_id

The concatenation of these three fields gives us a unique mapping from a record in the text table to the data table, and the record it belongs to. For example:

SELECT * FROM lang WHERE orig_table='movies'
AND orig_column='description' AND lang='fr'
AND search_text LIKE '%foo%'

will search only the description columns in the movies table. If we want to retrieve all language fields tied to some record, we can do:

SELECT * FROM lang WHERE orig_table='movies'
AND orig_id=123456
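
As a concrete sketch, such a lookup might run as follows through DBI, here against an in-memory SQLite database standing in for the real backend (this assumes DBD::SQLite is available; the table follows the schema above, and the sample row is hypothetical):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# In-memory SQLite database standing in for the real backend.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1 });
$dbh->do(q{
    CREATE TABLE lang (
        orig_table  TEXT,
        orig_column TEXT,
        orig_id     INTEGER,
        lang        TEXT,
        real_text   TEXT,
        search_text TEXT
    )
});
$dbh->do(q{INSERT INTO lang VALUES
           ('movies', 'description', 123456, 'fr',
            'Un film formidable', 'un film formidable')});

# Fetch the French description of record 123456 in the movies table.
my $rows = $dbh->selectall_arrayref(
    q{SELECT real_text FROM lang
      WHERE orig_table = ? AND orig_id = ? AND lang = ?},
    undef, 'movies', 123456, 'fr');
print $rows->[0][0], "\n";    # Un film formidable
```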

If your particular database implementation cannot cope with all the textual data in all languages in one table, you may consider using one table per language.

Invariant Data

Finally, let's talk about invariant data. Data that doesn't change is either hardcoded in the code or, better yet, placed in a template. Let's take, for example, a search feature. The template will look something like Example 1. A simple mod_perl script that will parse this template and produce output is shown in Example 2.

So, these are the template and code used in a single-language site. When adding multilingualism to the site, we face this question: Should we use a template per language, or one template for all languages?

If you decide to go with the first option, you'll end up with many templates, and keeping them synchronized will be a nightmare, as changes must be carefully applied to each language's template. A better approach is to keep all languages in one template: Modifications are easier because all strings are stored in the same place.

We have to find a way to parse this file and extract only the text in the requested language. Therefore, we've chosen to use XML tags, which will then be parsed by Template Toolkit so that texts in the right language will be selected.

We use the <text> tag for text sections, and two-letter language-code tags for language-specific sections. Example 3 shows the search input template after applying these definitions, and Example 4 shows what the search results output template now looks like.

This parser assumes that a template variable named lang, holding the current language code, will exist at request time when the template is processed. We then use Template Toolkit to process the template and generate the output.
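
The processing step itself might look like this minimal sketch (an inline template stands in for the XML-tagged templates of Examples 3 and 4, and the language test uses plain TT syntax rather than the authors' tag parser):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Template;

# Minimal sketch: select language-specific text with Template Toolkit,
# driven by the 'lang' variable supplied at request time.
my $template = q{[% IF lang == 'fr' %]Recherche[% ELSE %]Search[% END %]};

my $tt = Template->new();
my $output;
$tt->process(\$template, { lang => 'fr' }, \$output)
    or die $tt->error;
print "$output\n";    # Recherche
```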

Handling Dates

Date and time need to be formatted according to the locale. Different countries have different conventions for date and time presentation. Where an American user will read:

Thursday March 22, 2001 2:25pm

a French user will expect:

Jeudi 22 Mars 2001 14h25

In this article, we assume that these conventions are tied to languages rather than countries. That's not strictly correct, but the assumption is good enough for an example. In a real application, you may want to tie the conventions to countries instead, in which case you would need to ask the user for a country preference.

We specify the following data set for each language, as shown in Listing 3. Then we use this data to produce the dates and times in the correct language using the correct format. These are handy compile-time constants that are used in the date and time generators:

use enum qw(YEAR MONTH DAY);
use enum qw(HOUR MINUTE);

Listing 4 shows some useful macros used in the formats above (they are derived from the format used by the strftime() function).
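
Since Listings 3 and 4 aren't reproduced here, the following hedged sketch shows the idea: a per-language data set of day and month names, plus a formatting routine that applies each language's date convention (the names and formats below are illustrative, not the authors' actual data).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Illustrative per-language date data (stand-in for Listing 3).
my %date_data = (
    en => {
        days   => [qw(Sunday Monday Tuesday Wednesday Thursday Friday Saturday)],
        months => [qw(January February March April May June
                      July August September October November December)],
    },
    fr => {
        days   => [qw(Dimanche Lundi Mardi Mercredi Jeudi Vendredi Samedi)],
        months => [qw(Janvier Février Mars Avril Mai Juin
                      Juillet Août Septembre Octobre Novembre Décembre)],
    },
);

# Format a date according to the language's convention.
sub format_date {
    my ($lang, $year, $month, $day, $wday) = @_;
    my $d = $date_data{$lang};
    return $lang eq 'fr'
        ? sprintf('%s %d %s %d', $d->{days}[$wday], $day,
                  $d->{months}[$month - 1], $year)
        : sprintf('%s %s %d, %d', $d->{days}[$wday],
                  $d->{months}[$month - 1], $day, $year);
}

# March 22, 2001 was a Thursday (wday 4, with Sunday == 0).
print format_date('en', 2001, 3, 22, 4), "\n";  # Thursday March 22, 2001
print format_date('fr', 2001, 3, 22, 4), "\n";  # Jeudi 22 Mars 2001
```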

Generating Correct Charset Headers

When the page is produced, it's important to specify a correct charset, so the browser will do the right thing when rendering the output. There are two techniques to accomplish that:

The preferred method to specify the character set is to use the charset parameter of the 'Content-Type' HTTP header. For example, to specify that an HTML document uses ISO-8859-1, a server would send the following header:

Content-Type: text/html; charset=ISO-8859-1

With mod_perl, you can do that with:

my $r = shift;
$r->send_http_header('text/html; charset=ISO-8859-1');

A less preferable method of setting the character encoding is to use the following tag in the 'HEAD' section of an HTML document:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">

This method requires that ASCII characters stand for themselves until after the <META> tag and often causes an annoying redraw with Netscape. The META HTTP-EQUIV method should only be used if you cannot set the charset parameter using the server.

Conclusion

We've discussed the following multilingual site development issues:

It's almost always better to develop a dynamic site rather than a static one.

Language selection is done by asking the user about it and/or looking at the Accept-Language header.

Storing user preference is best done via cookies or by making the user log in.

We have seen that generated output is composed of semistatic template text and dynamic database content.

We have seen how site browsing is different from site searching in terms of multilingual input processing.

We have discussed ways that language-specific data can be stored in the database.

We have seen how multilingual variants of text can coexist in the same template and have the code deal with that.

We have seen how the presentation of dates and times can be adjusted to the user preferences.

Finally, we have learned how to tell client browsers to render the output using a correct language encoding.
