Introduction

While everyone who programs in PHP has to learn some English eventually to get a handle on its function names and language constructs, PHP can create applications that speak just about any language. Some applications need to be used by speakers of many different languages. Taking an application written for French speakers and making it useful for German speakers is made easier by PHP's support for internationalization and localization.

Internationalization (often abbreviated I18N, because 18 letters separate the initial "i" from the final "n") is the process of taking an application designed for just one locale and restructuring it so that it can be used in many different locales. Localization (often abbreviated L10N, for the 10 letters between the "l" and the "n") is the process of adding support for a new locale to an internationalized application.

A locale is a group of settings that describe text formatting and language customs in a particular area of the world. The settings are divided into six categories:

LC_COLLATE

These settings control text sorting: which letters go before and after others in alphabetical order.

LC_CTYPE

These settings control mapping between uppercase and lowercase letters as well as which characters fall into the different character classes, such as alphanumeric characters.

LC_MONETARY

These settings describe the preferred format of currency information, such as what character to use as a decimal point and how to indicate negative amounts.

LC_NUMERIC

These settings describe the preferred format of numeric information, such as how to group numbers and what character is used as a thousands separator.

LC_TIME

These settings describe the preferred format of time and date information, such as names of months and days and whether to use 24- or 12-hour time.

LC_MESSAGES

This category contains text messages used by applications that need to display information in multiple languages.

There is also a metacategory, LC_ALL, that encompasses all the categories.

A locale name generally has three components. The first, an abbreviation that indicates a language, is mandatory. For example, "en" for English or "pt" for Portuguese. Next, after an underscore, comes an optional country specifier, to distinguish between different countries that speak different versions of the same language. For example, "en_US" for U.S. English and "en_GB" for British English, or "pt_BR" for Brazilian Portuguese and "pt_PT" for Portuguese Portuguese. Last, after a period, comes an optional character-set specifier. For example, "zh_TW.Big5" for Taiwanese Chinese using the Big5 character set. While most locale names follow these conventions, some don't. One difficulty in using locales is that they can be arbitrarily named. Finding and setting a locale is discussed in Section 16.2 through Section 16.4.
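As a rough illustration of the naming convention just described, the following sketch splits a conventional locale name into its three components. The function name parse_locale_name( ) is hypothetical, and the pattern is an assumption that covers only the language, country, and character-set form; aliases and other irregular names simply return null.

```php
<?php
// Hypothetical helper: split a conventional locale name into its parts.
// Names that do not follow language[_COUNTRY][.charset] return null.
function parse_locale_name(string $name): ?array {
    $pattern = '/^([a-z]{2,3})(?:_([A-Z]{2}))?(?:\.([A-Za-z0-9-]+))?$/';
    if (!preg_match($pattern, $name, $m)) {
        return null;
    }
    return [
        'language' => $m[1],
        // Optional groups that did not match are either absent or empty.
        'country'  => isset($m[2]) && $m[2] !== '' ? $m[2] : null,
        'charset'  => isset($m[3]) && $m[3] !== '' ? $m[3] : null,
    ];
}

print_r(parse_locale_name('zh_TW.Big5'));
```

A name such as "swedish" fails the pattern and yields null, which is a reminder that such a parser is only a convenience and not a substitute for asking the system which locales it knows.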

Different techniques are necessary for correct localization of plain text, dates and times, and currency. Localization can also be applied to external entities your program uses, such as images and included files. Localizing these kinds of content is covered in Section 16.5 through Section 16.9.

Systems for dealing with large amounts of localization data are discussed in Section 16.10 and Section 16.11. Section 16.10 shows some simple ways to manage the data, and Section 16.11 introduces GNU gettext, a full-featured set of tools that provide localization support.

PHP also has limited support for Unicode. Converting data to and from the Unicode UTF-8 encoding is addressed in Section 16.12.

Listing Available Locales

Problem

You want to know what locales your system supports.

Solution

Use the locale program to list available locales; locale -a prints the locales your system supports.

Discussion

On Linux and Solaris systems, you can find locale at /usr/bin/locale. On Windows, locales are listed in the Regional Options section of the Control Panel.

Your mileage may vary on other operating systems. BSD, for example, includes locale support but has no locale program to list locales. BSD locales are often stored in /usr/share/locale, so looking in that directory may yield a list of usable locales.

While the locale system helps with many localization tasks, its lack of standardization can be frustrating. Systems aren't guaranteed to have the same locales or even use the same names for equivalent locales.

See Also

Your system's locale(1) manpage.

Using a Particular Locale

Problem

You want to tell PHP to use the settings of a particular locale.

Solution

Call setlocale( ) with the appropriate category and locale. Here's how to use the es_US (U.S. Spanish) locale for all categories:

setlocale(LC_ALL,'es_US');

Here's how to use the de_AT (Austrian German) locale for time and date formatting:

setlocale(LC_TIME,'de_AT');

Discussion

To find the current locale without changing it, call setlocale( ) with the string "0" as the locale. (Passing NULL or an empty string instead sets the locale from the environment.)

print setlocale(LC_ALL,'0');
en_US

Many systems also support a set of aliases for common locales, listed in a file such as /usr/share/locale/locale.alias. This file is a series of lines such as:

swedish         sv_SE.ISO-8859-1

The first column of each line is an alias; the second column shows the locale and character set the alias points to. You can use the alias in calls to setlocale( ) instead of the corresponding string the alias points to. For example, you can do:

setlocale(LC_ALL,'swedish');

instead of:

setlocale(LC_ALL,'sv_SE.ISO-8859-1');

On Windows, to change the locale, visit the Control Panel. In the Regional Options section, you can pick a new locale and customize its settings.
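Because locale names differ from system to system, it can help to give setlocale( ) several candidate names and to check what it returns. The sketch below assumes a hypothetical helper named use_first_available_locale( ); it relies on setlocale( ) accepting an array of names and returning false when none of them is available.

```php
<?php
// Try several candidate names in order. setlocale() returns the name it
// actually set, or false if none of the candidates is available.
function use_first_available_locale(int $category, array $candidates) {
    $result = setlocale($category, $candidates);
    if ($result === false) {
        error_log('l10n error: no locale available among: ' . implode(', ', $candidates));
    }
    return $result;
}

// The "C" locale always exists, so it makes a safe last resort.
$chosen = use_first_available_locale(LC_TIME, ['de_AT.UTF-8', 'de_AT', 'C']);
print "Using locale: $chosen\n";
```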

Setting the Default Locale

Problem

You want to set a locale that all your PHP programs use by default.

Solution

At the beginning of a file loaded by the auto_prepend_file configuration directive, call setlocale( ) to set your desired locale:

setlocale(LC_ALL,'es_US');

Discussion

Even if you set up appropriate environment variables before you start your web server or PHP binary, PHP doesn't change its locale until you call setlocale( ). After setting environment variable LC_ALL to es_US, for example, PHP still runs in the default C locale.

Localizing Text Messages

Problem

You want to display text messages in a locale-appropriate language.

Solution

Maintain a message catalog of words and phrases and retrieve the appropriate string from the message catalog before printing it. Here's a simple message catalog with some foods in American and British English and a function to retrieve words from the catalog:
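A minimal sketch of such a catalog follows. The particular words in the arrays are illustrative assumptions; only the overall shape (an array keyed by locale, a global $LANG, and a msg( ) lookup function) comes from the description above.

```php
<?php
// Illustrative catalog: the entries are assumptions chosen for the example.
$LANG = 'en_GB';

$messages = [
    'en_US' => [
        'My favorite foods are' => 'My favorite foods are',
        'french fries'          => 'french fries',
        'potato chips'          => 'potato chips',
        'cookie'                => 'cookie',
    ],
    'en_GB' => [
        'My favorite foods are' => 'My favourite foods are',
        'french fries'          => 'chips',
        'potato chips'          => 'crisps',
        'cookie'                => 'biscuit',
    ],
];

// Return the message for the current $LANG. If it is missing, log the
// problem and fall back to the key itself.
function msg($s) {
    global $LANG, $messages;
    if (isset($messages[$LANG][$s])) {
        return $messages[$LANG][$s];
    }
    error_log("l10n error: LANG: $LANG, message: '$s'");
    return $s;
}

print msg('My favorite foods are') . ":\n";
print msg('french fries') . "\n";
print msg('potato chips') . "\n";
print msg('cookie') . "\n";
```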

To have the program output in American English instead of British English, just set $LANG to en_US.

You can combine the msg( ) message retrieval function with sprintf( ) to store phrases that require values to be substituted into them. For example, consider the English sentence "I am 12 years old." In Spanish, the corresponding phrase is "Tengo 12 años." The Spanish phrase can't be built by stitching together translations of "I am," the numeral 12, and "years old." Instead, store them in the message catalogs as sprintf( )-style format strings:
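One way this might look is sketched here. The catalog keys, the second phrase with years and months, and the helper name phrase( ) are assumptions made for the example; the format strings are single-quoted so that PHP doesn't try to interpolate $d as a variable.

```php
<?php
// Hypothetical catalogs holding sprintf()-style format strings.
$formats = [
    'en_US' => [
        'I am X years old.'              => 'I am %d years old.',
        'I am X years and Y months old.' => 'I am %1$d years and %2$d months old.',
    ],
    'es_US' => [
        'I am X years old.'              => 'Tengo %d años.',
        // The Spanish phrase mentions the months first, so it swaps the arguments.
        'I am X years and Y months old.' => 'Tengo %2$d meses y %1$d años.',
    ],
];

// Look up the format string for a locale and fill in the values.
function phrase(array $formats, string $lang, string $key, ...$args): string {
    return sprintf($formats[$lang][$key], ...$args);
}

print phrase($formats, 'es_US', 'I am X years old.', 12) . "\n";
print phrase($formats, 'es_US', 'I am X years and Y months old.', 12, 7) . "\n";
```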

In the format string, %2$ tells sprintf( ) to use the second argument, and %1$ tells it to use the first.

These phrases can also be stored as a function's return value instead of as a string in an array. Storing the phrases as functions removes the need to use sprintf( ). Functions that return a sentence look like this:
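A sketch of the idea follows. Two functions can't share a name in one program, so this example assumes a naming scheme that appends the locale to each function name, plus a hypothetical phrase_for( ) dispatcher; neither is prescribed by the text above.

```php
<?php
// Assumed naming scheme: one function per phrase per locale.
function i_am_X_years_old_en_US($age) {
    return "I am $age years old.";
}

function i_am_X_years_old_es_US($age) {
    return "Tengo $age años.";
}

// Call the version of a phrase function that matches the locale.
function phrase_for($locale, $phrase, ...$args) {
    $function = "{$phrase}_{$locale}";
    return $function(...$args);
}

print phrase_for('es_US', 'i_am_X_years_old', 12) . "\n";
```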

If some parts of the message catalog belong in an array, and some parts belong in functions, an object is a helpful container for a language's message catalog. A base object and two simple message catalogs look like this:

Each message catalog object extends the pc_MC_Base class to get the msg( ) method, and then defines its own messages (in its constructor) and its own functions that return phrases. Here's how to print text in Spanish:
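The arrangement described above might be sketched as follows. The class names follow the pc_MC_ convention used in this chapter, but the bodies are a reconstruction for illustration: the sample words and the fallback of returning the key when a message is missing are assumptions.

```php
<?php
class pc_MC_Base {
    public $messages = [];
    public $lang;

    // Look up a message; log and return the key itself if it is missing.
    public function msg($s) {
        if (isset($this->messages[$s])) {
            return $this->messages[$s];
        }
        error_log("l10n error: LANG: {$this->lang}, message: '$s'");
        return $s;
    }
}

class pc_MC_en_US extends pc_MC_Base {
    public function __construct() {
        $this->lang = 'en_US';
        $this->messages = ['chicken' => 'chicken', 'cow' => 'cow', 'horse' => 'horse'];
    }

    public function i_am_X_years_old($age) {
        return "I am $age years old.";
    }
}

class pc_MC_es_US extends pc_MC_Base {
    public function __construct() {
        $this->lang = 'es_US';
        $this->messages = ['chicken' => 'pollo', 'cow' => 'vaca', 'horse' => 'caballo'];
    }

    public function i_am_X_years_old($age) {
        return "Tengo $age años.";
    }
}

// Printing text in Spanish:
$MC = new pc_MC_es_US;
print $MC->msg('cow') . "\n";
print $MC->i_am_X_years_old(15) . "\n";
```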

The formatted time string that %c produces, while locale-appropriate, isn't very flexible. If you just want the time, for example, you must pass a different format string to strftime( ). But these format strings themselves vary in different locales. In some locales, displaying an hour from 1 to 12 with an A.M./P.M. designation may be appropriate, while in others the hour should range from 0 to 23. To display appropriate time strings for a locale, add elements to the locale's $messages array for each time format you want. The key for a particular time format, such as %H:%M, is always the same in each locale. The value, however, can vary, such as %H:%M for 24-hour locales or %I:%M %P for 12-hour locales. Then, look up the appropriate format string and pass it to strftime( ):

$MC = new pc_MC_es_US;
print strftime($MC->msg('%H:%M'));
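The lookup itself can be tried without any locale installed. In this sketch the array name $time_formats and the helper time_format( ) are hypothetical; the key is always the 24-hour pattern and the value is whatever suits the locale.

```php
<?php
// Hypothetical per-locale time formats, keyed by a fixed pattern.
$time_formats = [
    'en_US' => ['%H:%M' => '%I:%M %P'],   // 12-hour clock
    'de_AT' => ['%H:%M' => '%H:%M'],      // 24-hour clock
];

// Return the locale's version of a format, or the key itself when the
// locale has no entry for it.
function time_format(array $formats, string $locale, string $key): string {
    return $formats[$locale][$key] ?? $key;
}

print time_format($time_formats, 'en_US', '%H:%M') . "\n";
```

The string that comes back is what gets passed to the time-formatting function, as in the two lines above.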

Changing the locale doesn't change the time zone; it changes only the formatting of the displayed result.

The code in pc_format_currency( ) that puts the currency symbol and sign in the correct place is almost identical for positive and negative amounts; it just uses different elements of the array returned by localeconv( ). The relevant elements of localeconv( )'s returned array are shown in Table 16-1.

Table 16-1. Currency-related information from localeconv( )

Array element        Description
-------------        -----------
currency_symbol      Local currency symbol
mon_decimal_point    Monetary decimal point character
mon_thousands_sep    Monetary thousands separator
positive_sign        Sign for positive values
negative_sign        Sign for negative values
frac_digits          Number of fractional digits
p_cs_precedes        1 if currency_symbol should precede a positive value, 0 if it should follow
p_sep_by_space       1 if a space should separate the currency symbol from a positive value, 0 if not
n_cs_precedes        1 if currency_symbol should precede a negative value, 0 if it should follow
n_sep_by_space       1 if a space should separate currency_symbol from a negative value, 0 if not
p_sign_posn          Positive sign position:
                     0 if parentheses should surround the quantity and currency_symbol
                     1 if the sign string should precede the quantity and currency_symbol
                     2 if the sign string should follow the quantity and currency_symbol
                     3 if the sign string should immediately precede currency_symbol
                     4 if the sign string should immediately follow currency_symbol
n_sign_posn          Negative sign position: same possible values as p_sign_posn

There is a function in the C library called strfmon( ) that does for currency what strftime( ) does for dates and times; however, it isn't implemented in PHP. The pc_format_currency( ) function provides most of the same capabilities.
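To show how the elements in Table 16-1 fit together, here is a simplified sketch. It is not pc_format_currency( ) itself: the function name format_currency_sketch( ) is hypothetical, and it takes an array shaped like localeconv( )'s return value as an argument so that the logic can be tried without depending on which locales are installed. The sample values are hand-written assumptions, not read from the system.

```php
<?php
function format_currency_sketch(float $amount, array $lc): string {
    $negative = $amount < 0;
    $prefix   = $negative ? 'n' : 'p';   // selects the n_* or p_* elements
    $sign     = $negative ? $lc['negative_sign'] : $lc['positive_sign'];

    $number = number_format(abs($amount), $lc['frac_digits'],
                            $lc['mon_decimal_point'], $lc['mon_thousands_sep']);
    $symbol = $lc['currency_symbol'];
    $space  = $lc["{$prefix}_sep_by_space"] ? ' ' : '';
    $posn   = $lc["{$prefix}_sign_posn"];

    // Positions 3 and 4 attach the sign directly to the currency symbol.
    if ($posn == 3) { $symbol = $sign . $symbol; }
    if ($posn == 4) { $symbol = $symbol . $sign; }

    $text = $lc["{$prefix}_cs_precedes"]
          ? $symbol . $space . $number
          : $number . $space . $symbol;

    switch ($posn) {
        case 0:  return "($text)";        // parentheses around everything
        case 1:  return $sign . $text;    // sign before quantity and symbol
        case 2:  return $text . $sign;    // sign after quantity and symbol
        default: return $text;            // 3 and 4 were handled above
    }
}

// Hand-written values resembling a U.S. English locale (an assumption).
$en_US = [
    'currency_symbol' => '$', 'mon_decimal_point' => '.', 'mon_thousands_sep' => ',',
    'positive_sign' => '', 'negative_sign' => '-', 'frac_digits' => 2,
    'p_cs_precedes' => 1, 'p_sep_by_space' => 0, 'p_sign_posn' => 1,
    'n_cs_precedes' => 1, 'n_sep_by_space' => 0, 'n_sign_posn' => 1,
];

print format_currency_sketch(-1234.5, $en_US) . "\n";
```

Passing the settings in as an argument also makes it easy to see the effect of a single element, such as changing n_sign_posn to 0 to get parentheses around negative amounts.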

See Also

Localizing Images

Problem

You want to display images that have text in them and have that text in a locale-appropriate language.

Solution

Make an image directory for each locale you want to support, as well as a global image directory for images that have no locale-specific information in them. Create copies of each locale-specific image in the appropriate locale-specific directory. Make sure that the images have the same filename in the different directories. Instead of printing out image URLs directly, use a wrapper function similar to the msg( ) function in Section 16.5 that prints out locale-specific text.

Discussion

The img( ) wrapper function looks for a locale-specific version of an image first, then a global one. If neither is present, it prints a message to the error log:
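A sketch of such a wrapper is shown here. The directory in $image_base_path is a placeholder, the variable $image_base_url is an assumed name for the URL prefix, and this version returns the URL (an empty string on failure) so that it can be used as an argument to printf( ).

```php
<?php
$LANG            = 'es_US';
$image_base_path = '/usr/local/www/images';   // filesystem path (placeholder)
$image_base_url  = '/images';                 // path from the site's base URL

function img($f) {
    global $LANG, $image_base_path, $image_base_url;
    // Prefer the locale-specific version of the image.
    if (is_readable("$image_base_path/$LANG/$f")) {
        return "$image_base_url/$LANG/$f";
    }
    // Otherwise fall back to the global version.
    if (is_readable("$image_base_path/global/$f")) {
        return "$image_base_url/global/$f";
    }
    error_log("l10n error: LANG: $LANG, image: '$f'");
    return '';
}
```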

This function needs to know both the path to the image file in the filesystem ($image_base_path) and the path to the image from the base URL of your site (/images). It uses the first to test if the file can be read and the second to construct an appropriate URL for the image.

A localized image must have the same filename in each localization directory. For example, an image that says "New!" on a yellow starburst should be called new.gif in both the images/en_US directory and the images/es_US directory, even though the file images/es_US/new.gif is a picture of a yellow starburst with "¡Nuevo!" on it.

Don't forget that the alt text you display in your image tags also needs to be localized. A complete localized <img> tag looks like:

printf('<img src="%s" alt="%s">',img('cancel.png'),msg('Cancel'));

If the localized versions of a particular image have varied dimensions, store image height and width in the message catalog as well:
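For instance, a catalog might carry the dimensions next to the translated alt text. In this sketch the key names (img-cancel-width and img-cancel-height), the pixel sizes, and the helper cancel_button_tag( ) are all assumptions made for illustration.

```php
<?php
// Each locale records the size of its own version of cancel.png.
$catalog = [
    'en_US' => ['Cancel' => 'Cancel',   'img-cancel-width' => 60, 'img-cancel-height' => 20],
    'es_US' => ['Cancel' => 'Cancelar', 'img-cancel-width' => 75, 'img-cancel-height' => 20],
];

// Build a complete <img> tag from one locale's messages and an image URL.
function cancel_button_tag(array $messages, string $src): string {
    return sprintf('<img src="%s" alt="%s" height="%d" width="%d">',
        htmlspecialchars($src),
        htmlspecialchars($messages['Cancel']),
        $messages['img-cancel-height'],
        $messages['img-cancel-width']);
}

print cancel_button_tag($catalog['es_US'], '/images/es_US/cancel.png') . "\n";
```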

Discussion

The $base variable holds the name of the base directory for your included localized files. Files that are not locale-specific go in the global subdirectory of $base, and locale-specific files go in a subdirectory named after their locale (e.g., en_US). Prepending the locale-specific directory and then the global directory to the include path makes them the first two places PHP looks when you include a file. Putting the locale-specific directory first ensures that nonlocalized information is loaded only if localized information isn't available.

This technique is similar to what the img( ) function does in Section 16.8. Here, however, you can take advantage of PHP's include_path feature to have the directory searching happen automatically. For maximum utility, reset include_path as early as possible in your code, preferably at the top of a file loaded via auto_prepend_file on every request.
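The include path change described in this recipe can be sketched in a few lines. The base directory and locale below are placeholders; PATH_SEPARATOR is used so that the same code works whether the platform separates path entries with a colon or a semicolon.

```php
<?php
// Placeholder base directory and locale.
$base = '/usr/local/php-include';
$LANG = 'en_US';

// Put the locale-specific directory first, then the global one,
// ahead of whatever include_path already contains.
$new_path = implode(PATH_SEPARATOR, ["$base/$LANG", "$base/global", get_include_path()]);
set_include_path($new_path);

print get_include_path() . "\n";
```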

Managing Localization Resources

Problem

You need to keep track of your various message catalogs and images.

Solution

Two techniques simplify the management of your localization resources. The first is making a new language's object, for example Canadian English, extend from a similar existing language, such as American English. You only have to change the words and phrases in the new object that differ from the original language.

The second technique: to track what phrases still need to be translated in new languages, put stubs in the new language object that have the same value as in your base language. By finding which values are the same in the base language and the new language, you can then generate a list of words and phrases to translate.
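The first technique might look like the sketch below. The two classes are minimal stand-ins with made-up entries, not the full catalog objects from earlier in the chapter; the point is only that the Canadian English class overrides the entries that differ and inherits the rest.

```php
<?php
// Minimal stand-in for an American English catalog.
class pc_MC_en_US {
    public $messages = ['color' => 'color', 'candy' => 'candy', 'corn' => 'corn'];
}

// Canadian English inherits everything and changes only what differs.
class pc_MC_en_CA extends pc_MC_en_US {
    public function __construct() {
        $this->messages['color'] = 'colour';
    }
}

$MC = new pc_MC_en_CA;
print $MC->messages['color'] . "\n";   // overridden
print $MC->messages['candy'] . "\n";   // inherited
```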

Discussion

The catalog-compare.php program shown in Example 16-2 prints out messages that are the same in two catalogs, as well as messages that are missing from one catalog but present in another.

To use this program, put each message catalog object in a file with the same name as the object (e.g., the pc_MC_en_US class should be in a file named pc_MC_en_US.php, and the pc_MC_es_US class should be in a file named pc_MC_es_US.php). You then call the program with the two locale names as arguments on the command line:

% php catalog-compare.php en_US es_US
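The heart of such a comparison can be sketched with PHP's array functions. This is not the catalog-compare.php program itself: the function name compare_catalogs( ) is hypothetical, and it works on plain arrays instead of loading catalog classes from files.

```php
<?php
function compare_catalogs(array $base, array $other): array {
    return [
        // Same key and same value in both: probably still untranslated.
        'same'             => array_keys(array_intersect_assoc($base, $other)),
        // Present in one catalog but absent from the other.
        'missing_in_other' => array_keys(array_diff_key($base, $other)),
        'missing_in_base'  => array_keys(array_diff_key($other, $base)),
    ];
}

// Made-up sample catalogs.
$en_US = ['chicken' => 'chicken', 'cow' => 'cow', 'horse' => 'horse'];
$es_US = ['chicken' => 'pollo',   'cow' => 'cow', 'pig'   => 'cerdo'];

print_r(compare_catalogs($en_US, $es_US));
```

With these sample catalogs, "cow" is reported as identical in both, "horse" as missing from the Spanish catalog, and "pig" as missing from the English one.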

In a web context, it can be useful to use a different locale and message catalog on a per-request basis. The locale to use may come from the browser (in an Accept-Language header), or it may be explicitly set by the server (different virtual hosts may be set up to display the same content in different languages). If the same code needs to select a message catalog on a per-request basis, the message catalog class can be instantiated like this:
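A sketch of that instantiation follows. The two classes are minimal stand-ins so the example runs on its own, and the helper catalog_for( ), its default locale, and its validation of the locale string are assumptions added for safety, since the locale may come from an untrusted request header.

```php
<?php
// Minimal stand-in catalogs.
class pc_MC_en_US { public $lang = 'en_US'; }
class pc_MC_es_US { public $lang = 'es_US'; }

// Build the class name from the requested locale. Fall back to a default
// when the locale looks suspicious or has no catalog class.
function catalog_for(string $locale, string $default = 'en_US') {
    $classname = "pc_MC_$locale";
    if (!preg_match('/^[A-Za-z_]+$/', $locale) || !class_exists($classname)) {
        $classname = "pc_MC_$default";
    }
    return new $classname;
}

$MC = catalog_for('es_US');
print $MC->lang . "\n";
```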

Discussion

gettext is a set of tools that makes it easier for your application to produce multilingual messages. Compiling PHP with the --with-gettext option enables functions to retrieve the appropriate text from gettext-format message catalogs, and there are a number of external tools to edit the message catalogs.

With gettext, messages are divided into domains, and all messages for a particular domain are stored in the same file. bindtextdomain( ) tells gettext where to find the message catalog for a particular domain. A call to:

bindtextdomain('gnumeric','/usr/share/locale')

indicates that the message catalog for the gnumeric domain in the en_CA locale is in the file /usr/share/locale/en_CA/LC_MESSAGES/gnumeric.mo.

The textdomain('gnumeric') function sets the default domain to gnumeric. Calling gettext( ) retrieves a message from the default domain. There are other functions, such as dgettext( ), that let you retrieve a message from a different domain. When gettext( ) (or dgettext( )) is called, it returns the appropriate message for the current locale. If there's no message in the catalog for the current locale that corresponds to the argument passed to it, gettext( ) (or dgettext( )) returns just its argument. As a result, if you haven't translated all your messages, your code prints out English (or whatever your base language is) for those untranslated messages.

Setting the default domain with textdomain( ) makes each subsequent retrieval of a message from that domain more concise, because you just have to call gettext('Good morning') instead of dgettext('domain','Good morning'). However, if even gettext('Good morning') is too much typing, you can take advantage of a function alias: _( ) for gettext( ). Instead of gettext('Good morning'), use _('Good morning').
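The calls discussed above fit together roughly as in this sketch. The domain name myapp, the choice of catalog directory, and the wrapper function t( ) are assumptions; the wrapper exists only so the example still runs on a PHP build without the gettext extension.

```php
<?php
// Hypothetical domain name and catalog directory.
$domain = 'myapp';
$dir    = sys_get_temp_dir();

if (extension_loaded('gettext')) {
    // Catalogs are looked up under $dir/<locale>/LC_MESSAGES/myapp.mo
    bindtextdomain($domain, $dir);
    // Make myapp the default domain for gettext() and _()
    textdomain($domain);
}

// Degrade to the untranslated string when the extension is missing.
function t(string $message): string {
    return function_exists('gettext') ? gettext($message) : $message;
}

print t('Good morning') . "\n";
```

Because no catalog for myapp exists in that directory, the call simply returns its argument, which is the fallback behavior described earlier for untranslated messages.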

The gettext web site has helpful and detailed information for managing the information flow between programmers and translators and how to efficiently use gettext. It also includes information on other tools you can use to manage your message catalogs, such as a special GNU Emacs mode.

Discussion

A single byte can hold 256 possible values. ASCII standardizes the characters for codes 0 through 127: control characters, letters and numbers, and punctuation. There are different rules, however, for the characters that codes 128-255 map to. One encoding is called ISO-8859-1, which includes characters necessary for writing most European languages, such as the ö in Gödel or the ñ in pestaña. Many languages, though, require more than 256 characters, and a character set that can express more than one language requires even more characters. This is where Unicode saves the day; its UTF-8 encoding can represent more than a million characters.

This increased functionality comes at the cost of space. ASCII characters are stored in just one byte; UTF-8 encoded characters need up to four bytes. Table 16-2 shows the byte representations of UTF-8 encoded characters.

Table 16-2. UTF-8 byte representation

Character code range       Bytes used   Byte 1     Byte 2     Byte 3     Byte 4
--------------------       ----------   ------     ------     ------     ------
0x00000000 - 0x0000007F    1            0xxxxxxx
0x00000080 - 0x000007FF    2            110xxxxx   10xxxxxx
0x00000800 - 0x0000FFFF    3            1110xxxx   10xxxxxx   10xxxxxx
0x00010000 - 0x001FFFFF    4            11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

In Table 16-2, the x positions represent bits used for actual character data. The least significant bit is the rightmost bit in the rightmost byte. In multibyte characters, the number of leading 1 bits in the leftmost byte is the same as the number of bytes in the character.
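The bit layout can be worked through by hand. The following sketch encodes a single code point using only shifts and masks; the function name utf8_encode_codepoint( ) is hypothetical. It stops at 0x10FFFF, the highest code point Unicode defines, even though the four-byte bit pattern has room for values up to 0x1FFFFF, and it makes no attempt to reject surrogate code points.

```php
<?php
function utf8_encode_codepoint(int $cp): string {
    if ($cp < 0 || $cp > 0x10FFFF) {
        throw new InvalidArgumentException('Code point out of range');
    }
    if ($cp <= 0x7F) {
        // 0xxxxxxx
        return chr($cp);
    }
    if ($cp <= 0x7FF) {
        // 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) {
        // 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// The letter ö is code point 0xF6, which needs two bytes.
print bin2hex(utf8_encode_codepoint(0xF6)) . "\n";   // prints c3b6
```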