Internationalization, Part 1

Editor's note: Writing software that is truly multilingual is not an easy task. In this excerpt from Chapter 8 of Java Examples in a Nutshell, 3rd Edition, author David Flanagan offers real-world programming examples covering the three steps to internationalization in Java. This week, he covers how to use Unicode character encoding and how to handle local customs. Next week's excerpt will cover the third step: localizing user-visible messages.

Internationalization is the process of making a program flexible
enough to run correctly in any locale. The required corollary to
internationalization is localization: the process of arranging
for a program to run in a specific locale.

There are several distinct steps to the task of internationalization.
Java (1.1 and later) addresses these steps with several different
mechanisms:

A program must be able to read, write, and manipulate localized text.
Java uses the Unicode character encoding, which by itself is a huge
step toward internationalization. In addition, the
InputStreamReader and
OutputStreamWriter classes convert text from a
locale-specific encoding to Unicode and from Unicode to a
locale-specific encoding, respectively.

A program must conform to local customs when displaying dates and
times, formatting numbers, and sorting strings. Java addresses these
issues with the classes in the java.text package.

A program must display all user-visible text in the local language.
Translating the messages a program displays is always one of the main
tasks in localizing a program. A more important task is writing the
program so that all user-visible text is fetched at runtime, rather
than hardcoded directly into the program. Java facilitates this
process with the ResourceBundle class and its
subclasses in the java.util package.

This chapter discusses all three aspects of internationalization.

A Word About Locales

A locale represents a geographic, political, or
cultural region. In Java, locales are represented by the
java.util.Locale class. A locale is frequently
defined by a language, which is represented by its standard lowercase
two-letter code, such as en (English) or fr (French). Sometimes,
however, language alone is not sufficient to uniquely specify a
locale, and a country is added to the specification. A country is
represented by an uppercase two-letter code. For example, the United
States English locale (en_US) is distinct from the British English
locale (en_GB), and the French spoken in Canada (fr_CA) is different
from the French spoken in France (fr_FR). Occasionally, the scope of
a locale is further narrowed with the addition of a system-dependent
variant string.

The Locale class maintains a static default
locale, which can be set and queried with Locale.setDefault( )
and Locale.getDefault( ).
Locale-sensitive methods in Java typically come in two forms. One
uses the default locale, and the other uses a
Locale object that is explicitly specified as an
argument. A program can create and use any number of nondefault
Locale objects, although it is more common simply
to rely on the default locale, which is inherited from the underlying
default locale on the native platform. Locale-sensitive classes in
Java often provide a method to query the list of locales that they
support.
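
As a minimal sketch of these ideas (not code from the book), the following program queries the default locale, builds an explicitly specified locale, and calls both forms of a locale-sensitive factory method. The class name LocaleSketch and the helper method of( ) are hypothetical. Locale.Builder assumes Java 7 or later; in the Java releases this chapter discusses you would write new Locale("fr", "CA") instead.

```java
import java.text.NumberFormat;
import java.util.Locale;

class LocaleSketch {
    // Hypothetical helper: build a Locale from a language code and a country code.
    static Locale of(String language, String country) {
        return new Locale.Builder().setLanguage(language).setRegion(country).build();
    }

    public static void main(String[] args) {
        // The default locale is inherited from the native platform.
        Locale defaultLocale = Locale.getDefault();
        System.out.println("Default locale: " + defaultLocale);

        // An explicitly specified locale: French as spoken in Canada.
        Locale canadianFrench = of("fr", "CA");
        System.out.println("Explicit locale: " + canadianFrench);

        // Locale-sensitive methods typically come in two forms:
        // one uses the default locale, the other takes a Locale argument.
        NumberFormat usesDefault = NumberFormat.getInstance();
        NumberFormat usesExplicit = NumberFormat.getInstance(canadianFrench);
        System.out.println(usesDefault.format(1234.5));
        System.out.println(usesExplicit.format(1234.5));

        // Locale-sensitive classes often list the locales they support.
        int count = NumberFormat.getAvailableLocales().length;
        System.out.println(count + " locales supported by NumberFormat");
    }
}
```

The formatted numbers and the locale count depend on the platform and on the locale data shipped with the Java runtime, so no output is shown here.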

Finally, note that AWT and Swing GUI components (see Chapter 11) have a locale property, so it is possible for
different components to use different locales. (Most components,
however, are not locale-sensitive; they behave the same in any
locale.)

Unicode

Java uses the Unicode character encoding. (Java 1.3 uses Unicode
Version 2.1. Support for Unicode 3.0 will be included in Java 1.4 or
another future release.) Unicode is a 16-bit character encoding
established by the Unicode Consortium, which describes the standard
as follows (see http://unicode.org ):

The Unicode Standard defines codes for characters used in the major
languages written today. Scripts include the European alphabetic
scripts, Middle Eastern right-to-left scripts, and scripts of Asia.
The Unicode Standard also includes punctuation marks, diacritics,
mathematical symbols, technical symbols, arrows, dingbats, etc. ...
In all, the Unicode Standard provides codes for 49,194 characters
from the world's alphabets, ideograph sets, and
symbol collections.

In the canonical form of Unicode encoding, which is what Java
char and String types use,
every character occupies two bytes. The Unicode characters
\u0020 to \u007E are equivalent
to the ASCII and ISO8859-1 (Latin-1) characters
0x20 through 0x7E. The Unicode
characters \u00A0 to \u00FF are
identical to the ISO8859-1 characters 0xA0 to
0xFF. Thus, there is a trivial mapping between
Latin-1 and Unicode characters. A number of other portions of the
Unicode encoding are based on preexisting standards, such as
ISO8859-5 (Cyrillic) and ISO8859-8 (Hebrew), though the mappings
between these standards and Unicode may not be as trivial as the
Latin-1 mapping.
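
The trivial Latin-1 mapping described above can be checked with a short sketch. It is an illustration rather than code from the book: the class Latin1Mapping and the method decodeLatin1( ) are hypothetical names, and StandardCharsets assumes Java 7 or later. Each ISO8859-1 byte is decoded to a Java char, and the numeric value of the char is compared with the value of the byte.

```java
import java.nio.charset.StandardCharsets;

class Latin1Mapping {
    // Decode a single ISO8859-1 byte (given as 0..255) into a Java char.
    static char decodeLatin1(int byteValue) {
        byte[] bytes = { (byte) byteValue };
        return new String(bytes, StandardCharsets.ISO_8859_1).charAt(0);
    }

    public static void main(String[] args) {
        int[] samples = { 0x20, 0x41, 0x7E, 0xA0, 0xE9, 0xFF };
        for (int b : samples) {
            char c = decodeLatin1(b);
            // For Latin-1, the Unicode value equals the byte value.
            System.out.printf("byte 0x%02X -> U+%04X (same value: %b)%n",
                              b, (int) c, c == b);
        }
    }
}
```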

Note that Unicode support may be limited on many platforms. One of
the difficulties with the use of Unicode is the poor availability of
fonts to display all the Unicode characters. Figure 8-1 shows some of the characters that are
available in the standard fonts that ship with Sun's
Java 1.3 SDK for Linux. (Note that these fonts do not ship with the
Java JRE, so even if they are available on your development platform,
they may not be available on your target platform.) Note the special
box glyph that indicates undefined characters.

Figure 8-1. Some Unicode characters and their encodings

Example 8-1 lists code used to create the displays
of Figure 8-1. Because Unicode characters are
integrated so fundamentally into the Java language, this
UnicodeDisplay program does not perform any
sophisticated internationalization techniques to display Unicode
glyphs. Thus, you'll find that Example 8-1 is more of a Swing GUI example than an
internationalization example. If you haven't read
Chapter 11 yet, you may not understand all the code
in this example.

Character Encodings

Text representation has
traditionally been one of the most difficult problems of
internationalization. Java, however, solves this problem quite
elegantly and hides the difficult issues. Java uses Unicode
internally, so it can represent essentially any character in any
commonly used written language. As I noted earlier, the remaining
task is to convert Unicode to and from locale-specific encodings.
Java includes quite a few internal byte-to-char and char-to-byte
converters that handle converting locale-specific character encodings
to Unicode and vice versa. Although the converters themselves are not
public, they are accessible through the
InputStreamReader and
OutputStreamWriter classes, which are character
streams included in the java.io package.

Any program can automatically handle locale-specific encodings simply
by using these character stream classes to do its textual input and
output. Note that the FileReader and
FileWriter classes use these streams to
automatically read and write text files that use the
platform's default encoding.

Example 8-2
shows a simple program that works with character encodings. It
converts a file from one specified encoding to another by converting
from the first encoding to Unicode and then from Unicode to the
second encoding. Note that most of the program is taken up with the
mechanics of parsing argument lists, handling exceptions, and so on.
Only a few lines are required to create the
InputStreamReader and
OutputStreamWriter classes that perform the two
halves of the conversion. Also note that exceptions are handled by
calling LocalizedError.display( ). This method is
not part of the Java API; it is a custom method shown in Example 8-5 at the end of this chapter.
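
The core of such a conversion can be sketched in a few lines. This is not the book's Example 8-2: the class ConvertSketch and its convert( ) method are hypothetical, the sketch omits argument parsing and error reporting, and it works on in-memory streams instead of files. The encoding names "ISO-8859-1" and "UTF-8" are ones every Java platform is required to support.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

class ConvertSketch {
    // Copy text from in to out, decoding bytes with one encoding
    // and re-encoding the resulting Unicode characters with another.
    static void convert(InputStream in, String fromEncoding,
                        OutputStream out, String toEncoding) throws IOException {
        Reader reader = new InputStreamReader(in, fromEncoding);   // bytes -> Unicode
        Writer writer = new OutputStreamWriter(out, toEncoding);   // Unicode -> bytes
        char[] buffer = new char[4096];
        int count;
        while ((count = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, count);
        }
        writer.flush();  // push any buffered characters through the encoder
    }

    public static void main(String[] args) throws IOException {
        // The word "cafe" with an accented final letter, in ISO8859-1:
        // one byte per character.
        byte[] latin1 = { 0x63, 0x61, 0x66, (byte) 0xE9 };
        ByteArrayOutputStream utf8 = new ByteArrayOutputStream();
        convert(new ByteArrayInputStream(latin1), "ISO-8859-1", utf8, "UTF-8");
        // The accented letter needs two bytes in UTF-8,
        // so this prints "4 bytes -> 5 bytes".
        System.out.println(latin1.length + " bytes -> " + utf8.size() + " bytes");
    }
}
```

The method leaves the streams open so that the caller decides when to close them; a file-based version would wrap FileInputStream and FileOutputStream in the same way.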

Handling Local Customs

The second problem of internationalization is the task of following
local customs and conventions in areas such as date and time
formatting. The java.text package defines classes
to help with this duty.

The NumberFormat class formats numbers, monetary
amounts, and percentages in a locale-dependent way for display to the
user. This is necessary because different locales have different
conventions for number formatting. For example, in France, a comma is
used as a decimal separator instead of a period, as in many
English-speaking countries. A NumberFormat object
can use the default locale or any locale you specify.
NumberFormat has factory methods for obtaining
instances that are suitable for different purposes, such as
displaying monetary quantities or percentages. In Java 1.4 and later,
the java.util.Currency class can be used with a
NumberFormat object so that it prints the
appropriate currency symbol.
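
A brief sketch of these factory methods follows. It is an illustration only; the class NumberFormatSketch and the method formatNumber( ) are hypothetical names, and the sample values are invented. Output comments are given only where the result does not depend on the locale data shipped with a particular Java runtime.

```java
import java.text.NumberFormat;
import java.util.Currency;
import java.util.Locale;

class NumberFormatSketch {
    // Format a number using the conventions of an explicitly specified locale.
    static String formatNumber(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    public static void main(String[] args) {
        System.out.println(formatNumber(3.14, Locale.US));      // 3.14
        System.out.println(formatNumber(3.14, Locale.FRANCE));  // 3,14

        // Factory methods return formatters suited to different purposes.
        NumberFormat percent = NumberFormat.getPercentInstance();  // default locale
        NumberFormat money = NumberFormat.getCurrencyInstance(Locale.US);
        System.out.println(percent.format(0.25));
        System.out.println(money.format(1234.5));

        // Java 1.4 and later: ask the formatter to use a different currency.
        money.setCurrency(Currency.getInstance("EUR"));
        System.out.println(money.format(1234.5));
    }
}
```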

The DateFormat
class formats dates and times in a locale-dependent way for display
to the user. Different countries have different conventions. Should
the month or day be displayed first? Should periods or colons
separate fields of the time? What are the names of the months in the
language of the locale? A DateFormat object can
simply use the default locale, or it can use any locale you specify.
The DateFormat class is used in conjunction with
the TimeZone and Calendar
classes of java.util. The
TimeZone object tells the
DateFormat what time zone the date should be
interpreted in, while the Calendar object
specifies how the date itself should be broken down into days, weeks,
months, and years. Almost all locales use the standard
GregorianCalendar.
SimpleDateFormat is a useful subclass of
DateFormat: it allows dates to be formatted to or
parsed from a date format specified with a simple template string.
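
The following sketch, which is not taken from the book, formats one instant with DateFormat in several locales and then with a SimpleDateFormat template. The class DateFormatSketch, the method formatWithTemplate( ), and the template string are assumptions chosen for illustration; Locale.ROOT assumes Java 6 or later.

```java
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class DateFormatSketch {
    // Format a date with a fixed template, interpreted in the given time zone.
    static String formatWithTemplate(Date date, TimeZone zone) {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm", Locale.ROOT);
        format.setTimeZone(zone);  // the TimeZone decides which wall-clock time is shown
        return format.format(date);
    }

    public static void main(String[] args) {
        Date now = new Date();

        // Same instant, different local conventions for field order and month names.
        Locale[] locales = { Locale.US, Locale.UK, Locale.FRANCE, Locale.JAPAN };
        for (Locale locale : locales) {
            DateFormat format =
                DateFormat.getDateTimeInstance(DateFormat.LONG, DateFormat.SHORT, locale);
            System.out.println(locale + ": " + format.format(now));
        }

        // A template string gives full control over the layout.
        System.out.println(formatWithTemplate(now, TimeZone.getTimeZone("UTC")));
    }
}
```

The locale-dependent lines vary with the Java runtime's locale data, so they are not reproduced here.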

The
Collator class compares strings in a
locale-dependent way. This is necessary because different languages
alphabetize strings in different ways (and some languages
don't even use alphabets). In traditional Spanish,
for example, the letters "ch" are
treated as a single character that comes between
"c" and
"d" for the purposes of sorting.
When you need to sort strings or search for a string within Unicode
text, you should use a Collator object, either one
created to work with the default locale or one created for a
specified locale.
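
As a small illustration (not code from the book), the sketch below sorts the same words twice: once by raw Unicode values and once with a Collator. The class CollatorSketch and the method sorted( ) are hypothetical names. The example uses English words because whether a given runtime's Spanish rules treat "ch" in the traditional way depends on the collation data it ships with.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class CollatorSketch {
    // Return a copy of the words, sorted by the collation rules of the locale.
    static String[] sorted(String[] words, Locale locale) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        String[] words = { "banana", "Zebra", "apple" };

        // Comparing raw Unicode values puts uppercase letters first.
        String[] byCodeUnit = words.clone();
        Arrays.sort(byCodeUnit);
        System.out.println(Arrays.toString(byCodeUnit));   // [Zebra, apple, banana]

        // A Collator orders the words the way a reader would expect.
        System.out.println(Arrays.toString(sorted(words, Locale.US)));  // [apple, banana, Zebra]
    }
}
```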

The
BreakIterator class allows you to locate
character, word, line, and sentence boundaries in a locale-dependent
way. This is useful when you need to recognize such boundaries in
Unicode text, such as when you are implementing a word-wrapping
algorithm.
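
A word-boundary scan with BreakIterator might look like the following sketch. The class BreakIteratorSketch and the method words( ) are hypothetical, and the rule for skipping punctuation and whitespace (keep only pieces that start with a letter or digit) is a simplification chosen for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakIteratorSketch {
    // Split text at word boundaries and keep only the pieces that look like words.
    static List<String> words(String text, Locale locale) {
        BreakIterator boundaries = BreakIterator.getWordInstance(locale);
        boundaries.setText(text);
        List<String> result = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next();
             end != BreakIterator.DONE;
             start = end, end = boundaries.next()) {
            String piece = text.substring(start, end);
            // Boundaries also delimit spaces and punctuation; skip those pieces.
            if (Character.isLetterOrDigit(piece.charAt(0))) {
                result.add(piece);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world! Bye.", Locale.US));  // [Hello, world, Bye]
    }
}
```

The other factory methods, getCharacterInstance( ), getLineInstance( ), and getSentenceInstance( ), are iterated in the same way.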

Example 8-3 shows a class that uses the
NumberFormat and DateFormat
classes to display a hypothetical stock portfolio to the user
following local conventions. The program uses various
NumberFormat and DateFormat
objects to format (using the format( ) method)
different types of numbers, monetary values, and dates. These Format
objects all operate using the default locale but could have been
created with an explicitly specified locale. Figure 8-2 shows example output
in different locales. The output was produced by running the program
in the default locale and again with the arguments "en
GB" and "ja JP".

Setting the Locale

Example 8-3 contains
code that explicitly sets the locale using the language code and the
country code specified on the command line. If these arguments are
not specified, it uses the default locale for your system. When
experimenting with internationalization, you may want to change the
default locale for the entire platform so you can see what happens.
How you do this is platform-dependent. On Unix platforms, you
typically set the locale by setting the LANG
environment variable. For example, to set the locale for Canadian
French, using a Unix csh-style shell, use this
command:

% setenv LANG fr_CA

Or, to set the locale to English as spoken in Great Britain when
using a Unix sh-style shell, use this command:

$ export LANG=en_GB

To set the locale in Windows, use the Regional Settings control on
the Windows Control Panel.
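
Choosing a locale from command-line arguments, with the platform default as the fallback, can be sketched as follows. This is not the code of Example 8-3: the class LocaleFromArgs and the method localeFromArgs( ) are hypothetical, the sample amount is invented, and Locale.Builder assumes Java 7 or later.

```java
import java.text.NumberFormat;
import java.util.Locale;

class LocaleFromArgs {
    // Use "language country" from the command line if given, else the default locale.
    static Locale localeFromArgs(String[] args) {
        if (args.length >= 2) {
            return new Locale.Builder().setLanguage(args[0]).setRegion(args[1]).build();
        }
        return Locale.getDefault();
    }

    public static void main(String[] args) {
        Locale locale = localeFromArgs(args);
        System.out.println("Using locale: " + locale);
        // Any locale-sensitive formatter can now be created for that locale.
        System.out.println(NumberFormat.getCurrencyInstance(locale).format(1234.56));
    }
}
```

Running the sketch with the arguments en GB and then ja JP shows the effect without changing the locale of the entire platform.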

David Flanagan
is the author of a number of O'Reilly books, including
Java in a Nutshell, Java Examples in a Nutshell, Java Foundation Classes in a
Nutshell, JavaScript: The
Definitive Guide, and JavaScript Pocket Reference.