Alphabet Soup: The Internationalization of Linux, Part 1

Mr. Turnbull takes a look at the problems posed by different character sets and the need for standardization.

What is Linux? Since you are reading this
in the Linux Journal, you probably already
know. Still, it is worth emphasizing that Linux is an open-source
software implementation of UNIX. It is created by a process of
distributed development, and a primary application is interaction
via networks with other, independently implemented and administered
systems. In this environment, conformance to public standards is
crucial. Unfortunately, internationalization is a field of
information processing in which current standards and available
methods are hardly satisfactory. The temptation to forfeit
conformance with (international) standards in favor of accurate and
efficient implementation of local standards and customs is often
high.

What is internationalization? It is not simply a matter of
the number of countries where Linux is installed, although that is
certainly indicative of Linux's flexibility. Until recently,
although their native languages varied widely, the bulk of Linux
users have been fluent in certain common not-so-natural languages,
such as C, sh and Perl. Their primary purpose in using Linux has
been as an inexpensive, flexible and reliable platform for software
development and provision of network services. Of course, most also
used Linux for text processing and document dissemination in their
native languages, but this was a relatively minor purpose. Strong
computer skills and hacker orientation made working around the
various problems acceptable.

Today, many new users are coming to Linux seeking a reliable,
flexible platform for activities such as desktop publishing and
content provision on the World Wide Web. Even hackers get tired of
working around software deficiencies, so now a strong demand exists
for software that makes text processing in languages other than
English simple and reliable, and that permits text to be formatted
according to each user's native language and customs.

This process of adapting a system to a new culture is called
localization (abbreviated L10N). Obviously, this requires provision
of character encodings, display fonts and input methods for the
input and display of the user's native language, but it also
involves more subtle adjustments to facilities such as the default
time system (12 hour or 24 hour) and calendar (are numerical dates
given MM/DD/YY as in the U.S., or YY/MM/DD as in the international
standard, or DD/MM/YY?), currency representation and dictionary
sorting order. APIs for automatic handling of these issues have
been standardized by POSIX, but many other issues, such as
line-wrapping and hyphenation conventions, remain. Thus,
localization is more than just providing an appropriate script for
display of the language and, in fact, more than just supporting a
language. American and British people both use the same language as
far as computers can tell, but their currency symbols are
different.
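
As a rough illustration of the POSIX-standardized facilities mentioned above, the following sketch uses Python's locale module, which wraps the C library's setlocale, strftime and strxfrm. The helper names (use_locale, short_date, dictionary_sort) are invented for this example, and the locale names in the loop are assumptions: which ones are installed varies from system to system, so the sketch falls back to the portable "C" locale when a request fails.

```python
import locale
import time


def use_locale(name):
    """Try to switch to the named locale; fall back to the portable "C" locale."""
    try:
        return locale.setlocale(locale.LC_ALL, name)
    except locale.Error:
        # The requested locale is not installed on this system.
        return locale.setlocale(locale.LC_ALL, "C")


def short_date(tm):
    # %x asks the C library for the active locale's preferred date format.
    return time.strftime("%x", tm)


def dictionary_sort(words):
    # strxfrm transforms strings so plain comparison follows LC_COLLATE rules.
    return sorted(words, key=locale.strxfrm)


# 1 March 1999 (a Monday, day 60 of the year) as a nine-field time tuple.
MARCH_FIRST = (1999, 3, 1, 0, 0, 0, 0, 60, 0)

if __name__ == "__main__":
    # These locale names are assumptions; uninstalled ones fall back to "C".
    for name in ("C", "en_US.UTF-8", "en_GB.UTF-8"):
        active = use_locale(name)
        print(name, "->", active, short_date(MARCH_FIRST),
              dictionary_sort(["b", "a", "C"]))
```

The program text never changes between cultures here; only the locale name passed to setlocale does, which is the point of having the API standardized.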

Localization is facilitated by true internationalization, but
can also be accomplished by patching or porting any system
ad hoc. To see the difference, consider that a
Chinese person who wishes to work with Japanese in the Microsoft
Windows environment has two choices: dual booting a Japanized
Windows and a Sinified Windows, or using the Unicode environment,
which is rather unsatisfactory and generally unsupported by
applications. This is localization; porting applications from
Japanized Windows to Sinified Windows is non-trivial, as the same
binaries cannot be used. In an internationalized setup, one would
simply need to change the fonts and input methods and translate the
messages; these would be implemented as loadable modules (or
separate processes). With
respect to applications, the situation in Linux is, at best,
somewhat better (especially from the standpoint of Asian users).
However, the future looks very promising, because many groups are
actively promoting internationalization and developing
internationalized systems for the GNU/Linux environment.

Internationalization (abbreviated I18N) is the process of
adapting a system's data structures and algorithms so that
localizing the system to a new culture is a matter of translating a
database and does not require patching the source. Of course, we
would prefer the binaries to be equally flexible, but for reasons
of efficiency or backward compatibility, localized versions may
implement different data structures and algorithms. Although
internationalization is more difficult than localization, once it
is complete, the process of localizing the internationalized
software to a new environment becomes routine. Furthermore,
localization by its nature is not a strong candidate for
standardization, because each new system to be localized to a
particular environment brings its own new problems.
Internationalization, on the other hand, is by definition a
standard independent of the different cultural environments. An
obvious extension is to jointly standardize those facilities common
to many systems.
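
The idea that localizing internationalized software is "a matter of translating a database" can be sketched in a few lines. Everything below is hypothetical: the Conventions record, the DATABASE table and its entries are illustrative values chosen for the example, not data taken from any real locale definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conventions:
    date_format: str      # str.format template with year/month/day fields
    currency_symbol: str
    decimals: int         # digits shown after the decimal point
    greeting: str


# Hypothetical locale database: supporting a new culture means adding a row,
# not editing render() below.
DATABASE = {
    "en_US": Conventions("{month:02d}/{day:02d}/{year:02d}", "$", 2, "Hello"),
    "en_GB": Conventions("{day:02d}/{month:02d}/{year:02d}", "£", 2, "Hello"),
    "ja_JP": Conventions("{year:02d}/{month:02d}/{day:02d}", "¥", 0, "こんにちは"),
}


def render(locale_name, year, month, day, amount):
    """Format a greeting, a two-digit-year date and a price for one locale."""
    c = DATABASE[locale_name]
    date = c.date_format.format(year=year % 100, month=month, day=day)
    return f"{c.greeting}: {date} {c.currency_symbol}{amount:.{c.decimals}f}"


if __name__ == "__main__":
    print(render("en_US", 1999, 3, 1, 5))
    print(render("en_GB", 1999, 3, 1, 5))
    print(render("ja_JP", 1999, 3, 1, 500))
```

Note that the two English entries share a greeting but differ in date order and currency symbol, echoing the American and British example given earlier.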

Internationalization can be contrasted with
multilingualization. Multilingualization (abbreviated M17N) is the
process of adapting a system to the simultaneous use of several
languages. Obviously more difficult than localization or even
internationalization, multilingualization requires that the system
not only deal with different languages, but also maintain different
contexts for specific parts of the current data set.
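
One way to picture "different contexts for specific parts of the current data set" is a document stored as a sequence of runs, each carrying its own language tag. The sketch below is only an assumed data structure for illustration (the Run type, the tags and the helper functions are all hypothetical); a real multilingual system would attach fonts, input methods and formatting rules to each context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    language: str   # hypothetical tag such as "en", "ja" or "zh"
    text: str


def languages_in(document):
    """Return the language tags in order of first appearance."""
    seen = []
    for run in document:
        if run.language not in seen:
            seen.append(run.language)
    return seen


def extract(document, language):
    """Return the text of every run written in the given language."""
    return [run.text for run in document if run.language == language]


# A single sentence that mixes three languages.
DOCUMENT = [
    Run("en", "The word for 'book' is "),
    Run("ja", "本"),
    Run("en", " in Japanese and "),
    Run("zh", "书"),
    Run("en", " in Chinese."),
]

if __name__ == "__main__":
    print(languages_in(DOCUMENT))
    print(extract(DOCUMENT, "ja"))
```

Because each run knows its language, a display routine could choose a Japanese font for one run and a Chinese font for the next within the same line of text.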

Note that the operating system can be localized,
internationalized or multilingualized while some or all
applications are not, and vice versa. In a certain sense, Linux is
a multilingual operating system; the kernel presents few hindrances
to use of different languages. However, most utilities and
applications are limited to English by availability of fonts and
input methods, as well as their own internal structures and message
databases. Even the kernel panics in English. On the other hand,
GNU Emacs 20, in both the FSF version and the XEmacs variant,
incorporates the Mule (MULtilingual Enhancement to GNU Emacs) facilities
(see “Polyglot Emacs” in this issue). With the availability of
fonts and, where necessary, internationalized terminal emulators,
Emacs can simultaneously handle most of the world's languages. Many
GNU utilities use the GNU gettext
function (see “Internationalizing Messages in Linux Programs” in
this issue), which supports a different catalog of program messages
for each language.
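
The per-language catalog idea can be tried out with Python's standard gettext module, which reads the same compiled .mo catalogs as GNU gettext. This is a minimal sketch, not the C interface the utilities themselves use: the domain name "hello" and the "locale" directory are assumptions made for the example, and no catalogs are shipped with it, so every lookup falls back to the original English string.

```python
import gettext


def load_catalog(language, domain="hello", localedir="locale"):
    # Catalogs are looked up as <localedir>/<language>/LC_MESSAGES/<domain>.mo.
    # With fallback=True a missing catalog yields a pass-through translator
    # instead of raising an error.
    return gettext.translation(
        domain, localedir=localedir, languages=[language], fallback=True
    )


def greet(language, localedir="locale"):
    _ = load_catalog(language, localedir=localedir).gettext
    # The English text doubles as the lookup key in the message catalog.
    return _("Hello, world!")


if __name__ == "__main__":
    for lang in ("en", "ja", "de"):
        print(lang, greet(lang))
```

Installing a translated hello.mo file under the matching language directory would change the output for that language without touching the program.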
