Towards a Truly Worldwide Web

How XML and Unicode are making it easier to publish multilingual electronic documents.

The Web was originally designed around the ISO 8859-1 character set, which supports
only Western European languages. In the early days, when the development of the Web was
mainly in the US, this was not a problem, but with the growth in the use of the Internet
worldwide, the number of people attempting to distribute non-English content over the Web
has grown substantially. In addition, the ability to provide localized content has become
an important source of competitive advantage for companies competing in the global
marketplace. The need for more robust standards and protocols to support multilingual
publishing on the Internet has become a matter of prime importance.

The recent introduction of a number of new Web technologies and standards has gone some
way towards improving the situation, but this is more than just a character encoding or a font
display problem. The Web is a whole new medium that goes far beyond the possibilities of
traditional publishing. The frontier between document content and application user
interface is increasingly blurred and documents are becoming applications in themselves.
These "dynamic" documents contain a mixture of both document content and
information about the content, or metadata. What is needed is a way to meet the needs of
today's professional Web publishers and those of tomorrow's dynamic document application
architectures.

This is where the Extensible Markup Language (XML) comes in. XML looks set to make
large-scale hypertext document publishing to a worldwide audience a reality at last. At
the same time it will make the life of the multilingual document publisher a whole lot
easier.

Current Problems

HTML, the file format used for Web documents, inherited its reliance on the ISO 8859
character set definitions from the SGML standard on which it is based. ISO 8859 defines a
dozen character sets that are useful for languages that use the Latin, Cyrillic, Arabic,
Greek and Hebrew alphabets. However, these standards have a limited range of application
(eight bits per character, a maximum of 190 characters). Although they are sufficient for
10 or so of the most widely used languages, problems often occur when converting
documents from one character set to another, because the same code is used to
represent different characters in different character sets. Furthermore, ISO 8859 is
totally inadequate for representing more complex languages, such as Japanese or Chinese,
which contain many thousands of characters.
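
The code collision described above is easy to reproduce. The following is a minimal Python sketch, assuming the codec names shipped with the standard library; the byte value 0xE9 and the sample word are chosen purely for illustration.

```python
# One byte value, three ISO 8859 code tables, three different characters.
raw = bytes([0xE9])

for codec in ("iso-8859-1", "iso-8859-5", "iso-8859-7"):
    # Latin-1 gives é, the Cyrillic table gives щ, the Greek table gives ι.
    print(codec, raw.decode(codec))

# Text written under one table and read under another is silently corrupted:
# no error is raised, the reader simply sees the wrong letters.
french = "été"
misread = french.encode("iso-8859-1").decode("iso-8859-5")
print(misread)
```

Nothing in the byte stream itself says which table was intended, which is why documents lose their meaning as soon as they leave the environment that created them.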

For publishers dealing with these more "exotic" languages, the only solution,
until recently, was to rely on national language code standards. Andrew S. Tanenbaum once
said, "the nice thing about standards is that there are so many to choose from."
Nowhere is this more true than in the domain of national language code standards. There
are literally hundreds of different codes available, each created over the years to
satisfy constraints and constantly changing technological limitations. For example, there
are over three dozen codes for the Arabic language alone. This overabundance of standards
significantly complicates life for the international software developer and the
multilingual publisher. But then, you already know that, right?

It was to resolve these problems that the Unicode standard was created. The work of the
Unicode Consortium was subsequently combined with that of the ISO 10646 working group and
version 2.0 of Unicode/ISO 10646 was released in 1996. The Unicode Worldwide Character
Standard is a character coding system designed to support the interchange, processing and
display of the written texts of the diverse languages of the modern world.

Every character in Unicode is coded using two bytes (or 16 bits), which provides over
65,000 separate positions, 38,885 of which have already been defined. This is enough to
represent most of the world's living languages, including those traditionally handled with
single-byte encodings, such as Western European, Eastern European, and bi-directional Middle
Eastern languages, as well as those that required multibyte encodings, such as Chinese,
Japanese and Korean (CJK). And there's plenty of room
left to encode the missing languages as soon as enough of the necessary research is done.
Using Unicode, it is finally possible to display several languages within the same
electronic document, even if they are based on different alphabets, without worrying about
the problem of national language code tables.
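
To make the idea of a single code table concrete, here is a small Python sketch. It assumes the "utf-16-be" codec as a stand-in for the 16-bit form described above, and the five sample characters are arbitrary picks from five scripts. It checks only these samples, not the standard as a whole.

```python
import unicodedata

# Sample characters from five scripts; the Hebrew letter is written as an
# escape only to keep this listing readable in left-to-right editors.
samples = ["é", "λ", "ж", "\u05e9", "日"]

for ch in samples:
    units = ch.encode("utf-16-be")  # 16-bit encoding form, no byte-order mark
    print(f"U+{ord(ch):04X}  {len(units)} bytes  {unicodedata.name(ch)}")

# All five scripts can sit side by side in one string, with no code table switch.
sentence = "".join(samples)
print(len(sentence), "characters,", len(sentence.encode("utf-16-be")), "bytes")
```

Each sample has its own unambiguous code position, so mixing them in one document needs no per-language declaration.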

Adding Unicode support to existing software applications has proved to be a major
undertaking, often requiring a complete rewrite of low-level code. It is only recently
that Unicode support has begun appearing in some key desktop applications. You can now
find support in Windows NT 4.0, Java, HTML 4.0 and (yes, you guessed it) XML. This at last
opens the way for truly multilingual Web-based applications and should accelerate the
adoption of Unicode for other desktop applications.

XML: New Standard for a New Medium

Officially endorsed as a W3C Recommendation on February 10, 1998, Extensible Markup
Language (XML) version 1.0 is a subset of ISO 8879:1986 - Standard Generalized Markup
Language (SGML), the international standard for defining and using content-based markup of
information. The SGML standard specifies how to define a set of markup codes (or
"tags") to describe the content and structure of particular types of documents.
This tag set, and the hierarchical relationships between each tag, are defined in a
Document Type Definition (DTD). HTML is an example of an SGML DTD that was designed
specifically for the creation of simple Web documents.

XML is essentially a simplified and modernized remake of SGML that removes many of the
more complex and less-used features that made SGML somewhat difficult to implement. Unlike
SGML, XML enables you to distribute documents without the DTD that was used to create
them. This greatly simplifies the publishing procedure and makes it far easier to design
tools that support XML. In addition, designers of the XML standard were aware of the
importance of internationalization issues. Accordingly, they specified Unicode as the
fixed reference character set for XML documents and in doing so went a long way to solving
the character encoding problem. XML also provides more robust hyperlinking features than HTML.
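
As a rough illustration of both points (no DTD required, Unicode throughout), the sketch below parses a small well-formed document with Python's standard xml.etree.ElementTree module. The element names "greetings" and "greeting" are invented for this example; only the xml:lang attribute comes from the XML specification.

```python
import xml.etree.ElementTree as ET

# A well-formed document with no DTD attached. Element names are hypothetical.
doc = """<greetings>
  <greeting xml:lang="fr">Bonjour à tous</greeting>
  <greeting xml:lang="el">Καλημέρα</greeting>
  <greeting xml:lang="ja">こんにちは</greeting>
</greetings>"""

# Parse from UTF-8 bytes, the encoding an XML parser assumes when none is declared.
root = ET.fromstring(doc.encode("utf-8"))

# ElementTree exposes the reserved xml: prefix under its namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

for greeting in root.findall("greeting"):
    print(greeting.get(XML_LANG), greeting.text)
```

The parser needs nothing beyond the document itself, and text in three scripts comes back intact.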

The term "extensible" describes the fact that XML enables you to define an unlimited number
of document markup tags, adapted to different types of application. Of course, authors
have been adding all sorts of custom tags, scripts and/or comments to their HTML documents
for ages. This additional information is often referred to as metadata. The XML standard
provides a more flexible encoding method, and represents a long-awaited alternative to the
many incompatible proprietary extensions to HTML currently in use. The clear advantage of
XML, then, is its capacity for handling arbitrary data structures, which opens the way for a
powerful new breed of intelligent Web-based applications. These data structures can be
used to describe a document, with sections that contain rows, columns, cells and so on
(just like in HTML). They may also be used to describe information to be interpreted by a
piece of software (or to control a piece of machinery), or they may combine the two. XML
provides a way to add this additional, machine-readable information to your documents and
data in a way that is not only standardized, but that separates data from the format used
to display that data. Using a style sheet, you can specify which information should be
displayed to the user, and how it should be formatted. Simply by applying a different
style sheet, you can provide a different presentation of your data, without touching the
content of the document.
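
The separation of data from presentation can be sketched in a few lines of Python. This is only an analogy: two ordinary functions stand in for two style sheets, and the "product" record with its tag names is made up for the example. A real deployment would use a style sheet language rather than program code.

```python
import xml.etree.ElementTree as ET

# Hypothetical product record; the tag names are invented for illustration.
record = ET.fromstring(
    "<product><name>Widget</name><price currency='EUR'>12.50</price>"
    "<stock>40</stock></product>"
)

def customer_view(product):
    """First 'style sheet': show only what a shopper needs."""
    price = product.find("price")
    return f"{product.findtext('name')}: {price.text} {price.get('currency')}"

def warehouse_view(product):
    """Second 'style sheet': present the same record for stock control."""
    return f"{product.findtext('name')} (in stock: {product.findtext('stock')})"

print(customer_view(record))   # Widget: 12.50 EUR
print(warehouse_view(record))  # Widget (in stock: 40)
```

The record itself is never modified; only the function applied to it changes.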

The Future of Multilingual Web Publishing

How is XML going to make things easier for you? Well, let's imagine that you have a
collection of documents that describe a particular subject, in a variety of different
languages, and that you want to publish these documents on your Web site. If your
documents are marked up using XML, you can think of this collection as a kind of database,
with each set of XML tags identifying a different "field" of data. The
difference from a real database is that your data fields are organized into separate
documents, rather than rows and columns in a table. Now, with an XML-aware search engine,
you could perform complex searches on your document collection so as to retrieve, for
example, all documents that contain your search text in their abstract, but only if the
document is written in French or Japanese. As a result of your search, you could choose to
generate a new document that contains versions of the original text in each of the
selected languages. You could then choose to hide text in one language, or by applying a
different style sheet, display text in both languages side-by-side, or paragraph by
paragraph. Thanks to HTTP/1.1 (the latest version of the Hypertext Transfer Protocol that
is used for sending and receiving information over the Web), it is now possible for a
server to automatically select the appropriate language version of a document to deliver,
if available, based on the language preferences sent by the user's browser. If a document search is unsuccessful,
the server can send a list of alternative choices. Even if a search is successful, the
server can send a list of related documents to tell the user about the existence of
alternative versions.

One application that is of particular interest to language professionals is OpenTag. The OpenTag format is a markup format based on
XML that can be used to encode text extracted from documents in various formats. Rather
than converting information from "format X" into OpenTag format, OpenTag is
designed so that data can be extracted from "format X," manipulated in an
OpenTag environment, and later merged back into the "format X" file. As
explained on the OpenTag website (www.opentag.org), if your translation memory databases
are stored in OpenTag format, they can become tool and supplier independent. This makes it
possible to share these assets among multiple translation service suppliers, who needn't
be using the same suite of localization tools. A translation customer would have the
enormous benefit of being able to export a document for translation from its native format
(by selecting a "Prepare for Translation..." item from a menu, for example) into
a file that is directly compatible with the localization process and translation tools to
be used. The result of the translation could then easily be returned to the customer in the
document's native format, or published directly in HTML, after a straightforward
conversion process.
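
The extract, translate and merge cycle can be sketched in Python as follows. This is not the OpenTag format itself: the "extracted" and "unit" elements, the key=value "format X" file and the toy translation memory are all assumptions made up for the illustration. Only the overall round trip mirrors the description above.

```python
import xml.etree.ElementTree as ET

# "Format X" here is a made-up key=value resource file.
native = "title=Welcome\nbutton.ok=OK\nbutton.cancel=Cancel"

def extract(text):
    """Pull translatable strings out into a neutral XML tree (hypothetical tags)."""
    root = ET.Element("extracted")
    for line in text.splitlines():
        key, value = line.split("=", 1)
        ET.SubElement(root, "unit", id=key).text = value
    return root

def translate(root, memory):
    """Replace each string with its translation when the memory has one."""
    for unit in root.iter("unit"):
        unit.text = memory.get(unit.text, unit.text)

def merge(root):
    """Write the strings back in the native key=value layout."""
    return "\n".join(f"{u.get('id')}={u.text}" for u in root.iter("unit"))

memory = {"Welcome": "Bienvenue", "Cancel": "Annuler"}  # toy translation memory
units = extract(native)
translate(units, memory)
print(merge(units))
```

The translation step sees only the neutral XML tree, so it does not need to know anything about the native file layout.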