Chapter 1 Unicode and Multilingual Computing

Today's global economy demands global computing solutions. Instant communications across continents--and computer platforms--characterize a business world at work 24 hours a day, 7 days a week. The widespread use of the Internet and e-commerce continues to create new international challenges.

More and more, users are demanding a computing environment to suit their own linguistic and cultural needs. They want applications and file formats they can share around the world, interfaces in their own language, and local time and date displays. Essentially, users want to write and speak at the keyboard the way they write and speak in the office.

The Solaris operating environment multilingual framework (including multiple character sets and multiple cultural attributes) uses the standard universal encoding codeset, Unicode (The Unicode Standard, Version 3.0). Unicode is well-suited to applications such as multilingual databases, e-commerce, and government research and reference.

1.1 Multilingual Computing

"Multilingual" computing can mean:

Multilanguage--multiple launches of one locale, one script.

Multiscript--single launch of one locale, multiple scripts.

Multilingual--single launch of multiple locales, multiple scripts.

The movement from multilanguage to multiscript to multilingual implies an increased level of complexity in the underlying operating environment.

1.1.1 Multilanguage Environment

In a multilanguage environment, a locale supports one script and one set of cultural attributes. An application inherits all the language and cultural attributes of the current locale. Document text is written in one script, and text is manipulated according to the language rules of that locale. A separate application launch in another locale is required to use different language and cultural attributes.

For example, to write a document in Chinese, a user first sets the Chinese locale before launching the application. To write a Russian document, the Russian locale must be separately set and the application launched again. Chinese and Russian text cannot be mixed in the same document.

1.1.2 Multiscript Environment

In a multiscript environment, a locale can support more than one script, but only one locale can be set as current. An application creates a document in different scripts by tagging each separate script run (text in the same script). However, the current locale environment settings apply--for example, text is sorted according to the sorting rules of the current locale.

In the Chinese/Russian example above, rather than create two separate documents, the user creates one multiscript document containing both Chinese and Russian text. The cultural attributes of the active locale still apply--in the Chinese locale, the Chinese sorting rules apply to the mixed-script text.
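The way the current locale governs sorting can be sketched in Python, whose locale module wraps the collation routines of the C library. This is a minimal illustration rather than Solaris-specific code; the helper name sort_in_current_locale and the word list are invented for the example.

```python
import locale

def sort_in_current_locale(words):
    """Sort words using the collation rules of the current locale."""
    try:
        # An empty string selects the locale named by the user's
        # environment (LANG, LC_ALL, LC_COLLATE).
        locale.setlocale(locale.LC_COLLATE, "")
    except locale.Error:
        # The requested locale is not installed; keep the one in effect.
        pass
    # strxfrm maps each string to a key that compares in collation order.
    return sorted(words, key=locale.strxfrm)

print(sort_in_current_locale(["cherry", "apple", "banana"]))
```

Whatever scripts the words contain, the single active locale decides the order, which is the limitation the multiscript environment carries.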

Note -

In a Unicode locale, tagging script runs is not necessary because all language attributes are inherent in the Unicode codeset.

1.1.3 Multilingual Environment

In a multilingual environment, a locale can support multiple scripts and multiple cultural attributes, giving an application greater control over text manipulation. For example, a document containing text in multiple scripts can sort text according to the sort order of each script rather than the current locale.

In the Chinese/Russian example above, the Chinese locale sorting rules apply to the Chinese text and the Russian sorting rules apply to the Russian text.
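A rough Python sketch of per-script sorting follows. It is only an approximation of the idea: the locale names in LOCALE_FOR_SCRIPT are assumptions that may not be installed on a given system, the script detection is a crude heuristic based on Unicode character names, and all helper names are hypothetical.

```python
import locale
import unicodedata

# Assumed locale names; the names actually installed vary by system.
LOCALE_FOR_SCRIPT = {"CJK": "zh_CN.UTF-8", "CYRILLIC": "ru_RU.UTF-8"}

def script_of(word):
    """Rough script guess: first word of the first character's Unicode name."""
    return unicodedata.name(word[0], "UNKNOWN").split()[0]

def sort_under(words, locale_name):
    """Sort words with the collation rules of locale_name, if installed."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query without changing
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return sorted(words, key=locale.strxfrm)
    except locale.Error:
        return sorted(words)  # fallback: plain code point order
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore previous setting

def sort_by_script(words):
    """Group words by script, then sort each group under its own locale."""
    groups = {}
    for word in words:
        groups.setdefault(script_of(word), []).append(word)
    return {
        script: sort_under(group, LOCALE_FOR_SCRIPT.get(script, "C"))
        for script, group in groups.items()
    }

print(sort_by_script(["яблоко", "中文", "арбуз", "北京"]))
```

Switching the collation locale this way changes process-wide state, so the sketch restores the previous setting after each group. A real multilingual framework would provide this control without such switching.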

The multilingual environment is closest to the ideal of multilingual computing. An application uses locale data from numerous locales, while at the same time allowing easy text manipulation in a variety of scripts. All users can easily work in their own language and be understood by others around the world.

1.2 Software Internationalization

Sun Microsystems defines the following levels at which an application can support a customer's international needs:

Internationalization

Localization

Software internationalization is the process of designing and implementing software to transparently manage different linguistic and cultural conventions without additional modification. The same binary copy of an application should run on any localized version of the Solaris operating environment, without requiring source code changes or recompilation.

Software localization is the process of adding language translation (including text messages, icons, buttons, and so on), cultural data, and components (such as input methods and spell checkers) to a product to meet regional market requirements.

The Solaris operating environment is an example of a product that supports both internationalization and localization. It is a single internationalized binary that is localized into various languages (for example, French, Japanese, and Chinese) to support the linguistic and cultural conventions of each.

Properly designed applications can easily accommodate a localized interface without extensive modification. One suggestion for creating easy-to-localize software is to first internationalize the software and then encapsulate the language- and culture-specific elements in a locale-specific database. This greatly simplifies the localization process, should a developer choose to localize in the future.

At a minimum, Sun Microsystems strongly encourages developers to internationalize their software. Internationalized applications can run on any localized version of the Solaris operating environment and easily manage the language and cultural preferences.

A locale is the language and cultural data set by the user and dynamically loaded into memory at run time. The locale settings are applied to the operating system and to subsequent application launches.

The Solaris operating environment includes APIs for developers to directly access language and cultural data in the current locale. Applications can run in any locale without prior input of language or cultural data. For example, an application does not need to hard-code a particular currency symbol; calling the appropriate system API returns the currency symbol of the current locale.
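As an illustration of the currency example, the sketch below uses Python's locale module, which wraps the standard C routines setlocale and localeconv. It is a minimal sketch, not Solaris-specific code, and the function name current_currency_symbol is made up for this example.

```python
import locale

def current_currency_symbol():
    """Return the currency symbol defined by the current locale."""
    try:
        # Adopt the locale named by the user's environment.
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # The environment names a locale that is not installed;
        # continue with the locale already in effect.
        pass
    # localeconv() returns the locale's numeric and monetary conventions.
    return locale.localeconv()["currency_symbol"]

print(repr(current_currency_symbol()))
```

The application contains no currency symbol of its own. In the default "C" locale the returned string is empty, so callers should be prepared for that case.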

A localizable interface accommodates the variations that arise when an interface is translated into another language. The Solaris operating environment provides messaging APIs and utilities to collect, generate, and process messages.

A codeset-independent application does not assume a particular codeset. For example, text-handling routines should not define in advance the size of a character in the codeset.
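The point about character size can be illustrated with a short Python sketch. The helper char_count is hypothetical; it decodes bytes using a named codeset (by default the one reported for the current locale) instead of assuming a fixed number of bytes per character.

```python
import locale

def char_count(raw, codeset=None):
    """Count characters in raw bytes without assuming bytes per character."""
    if codeset is None:
        # Ask the system which codeset the current locale uses.
        codeset = locale.getpreferredencoding(False)
    return len(raw.decode(codeset))

# The same three characters occupy a different number of bytes
# in two different codesets.
for codeset in ("utf-8", "euc_jp"):
    raw = "日本語".encode(codeset)
    print(codeset, len(raw), "bytes,", char_count(raw, codeset), "characters")
```

The text is 9 bytes in UTF-8 and 6 bytes in EUC-JP, yet it is three characters in both. A routine that hard-coded either width would miscount under the other codeset.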

For more information about designing applications with Unicode, see Chapter 3, Technical Considerations. For more information about the internationalization framework, see the whitepaper Asian Language Support in the Solaris Operating Environment.

1.4 Supporting the Unicode Standard

Unicode (Universal Codeset) is a universal character encoding scheme developed and promoted by the Unicode Consortium, a non-profit organization which includes Sun Microsystems. The Unicode standard encompasses most alphabetic, ideographic, and symbolic characters.

Using one universal codeset enables applications to support text from multiple scripts in the same document without elaborate tagging. However, applications must treat Unicode as any other codeset--applying codeset independence to Unicode as well.
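A small Python sketch shows mixed-script text held in one untagged string. The sample text and the helper scripts_in are invented for illustration, and taking the first word of each character name is only a rough stand-in for real script identification.

```python
import unicodedata

def scripts_in(text):
    """Return the rough script labels found in text.

    The label is the first word of each character's Unicode name,
    a crude heuristic used here only for illustration.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if not ch.isspace()
    }

# Chinese and Russian text in a single string, with no script tags.
text = "中文 и русский"
for ch in text:
    if not ch.isspace():
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))

print(scripts_in(text))
```

Each character carries its identity in its code point, so the program can recover the script of any run without markup in the document.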

Unicode locales are selected the same way and function the same way as all other locales in the Solaris operating environment. These locales provide the extra benefits that the Unicode codeset brings to the work environment, including the ability to create text in multiple scripts without having to switch locales. Sun Microsystems provides the same level of Unicode locale support for both 32-bit and 64-bit Solaris environments.

1.5 Benefits of Unicode

Support for Unicode provides many benefits to application developers, including:

Global source and binary.

Support for mixed-script computing environments.

Improved cross-platform data interoperability through a common codeset.

Space-efficient encoding scheme for data storage.

Reduced time-to-market for localized products.

Expanded market access.

Developers can use Unicode to create global applications. Users can exchange data more freely using one flat codeset without elaborate code conversions to comprehend characters.

In the Solaris operating environment internationalization framework, Unicode is "just another codeset." By adopting codeset independence in their design, applications can handle different codesets without extensive code rework to support specific languages.