Internationalized Resource Identifiers:
From Specification to Testing

Abstract

Uniform Resource Identifiers (URIs) are a core component of the Web.
Internationalized Resource Identifiers (IRIs) are equivalent to URIs except
that they remove the limitation that only a subset of us-ascii can be used.
Conversion between IRIs and URIs is based on the UTF-8 character encoding
followed by %-escaping. This matches well with an increasing number of URI
schemes and components that use UTF-8 as their encoding. This paper discusses
URI internationalization in detail, including motivation, architecture,
specifications, and testing.

1. Introduction

This section discusses the motivation for the internationalization of URIs
and gives some basic introduction to URIs, their properties, and their
components. Uniform Resource Identifiers (URIs) [RFC2396] are one of the
three basic components of the original World Wide Web architecture (the other
two being HTTP and HTML). URIs are the glue of the World Wide Web, they are
used to identify virtually everything of importance from Web pages and
services to email addresses, telnet connections, and telephone calls.

Motivation for Internationalized Resource Identifiers

Typically, URIs use a mixture of readable parts and syntax that is
cryptic, at least at first glance. For example, this paper will be available
at http://www.w3.org/2001/Talks/0912-IUC-IRI/paper.html.
In this example, the reader may be able to correlate several of the
components with the date of the talk, the conference name, the title, and so
on. Some of these correlations may be wrong, or may go unnoticed. But the
experience with URIs over the last few years, as well as with many other
kinds of identifiers, shows that there is a continuing desire for people to
make use of such correlations. In particular, such correlations are useful
for the following purposes [Dür97a]:

Devise identifiers (what is my identifier for X?)

Memorize identifiers (what was the identifier of X?)

Guess identifiers (what could be the identifier of X?)

Understand identifiers (what does X refer to?)

Manipulate identifiers (write, type,...)

Correct identifiers (spelling errors)

Identify with identifiers (nice identifier, isn't it?)

All these operations are much easier if people can use their native
script. This is a very clear motivation for making sure URIs are
appropriately internationalized. In addition, URIs may contain query parts,
where it is important that characters can be sent to the server reliably (see
Section 4).

Properties of URIs

The basic property of URIs is that they are identifiers, i.e. they stand
for something else. The 'something else' is called the resource, and the
process of obtaining the resource is called resolution. URIs have a number of
additional important properties. The properties most important in the context
of this discussion are uniformity and transcribability. For completeness,
this subsection also briefly discusses universality and the distinctions
between URLs and URNs.

Uniformity refers to the fact that certain syntactic conventions are
associated with certain operations for all URIs. As an example, the
characters '#' or '/' always have the same function whenever they appear in
an URI. This does not mean that every URI has a '/', or that all URI schemes
allow '/', but it guarantees that the operations associated with '/'
characters in URIs can be executed uniformly for all URIs. A thorough
discussion of the importance of uniformity for the current and future
operation of the Web can be found in [Gettys].

Uniformity was in some cases used as an argument against URI
internationalization. Using a small and uniform set of characters would allow
any URI to be used by anybody, on almost any type of device. However, many
URIs are predominantly used by people who know a particular script, and it is
much better to optimize these URIs for those users than for the remaining
small minority not familiar with the script.

While the above discussion applies to the final form of URIs, uniformity
is definitely very important when looking at character encoding issues. As
Section 2 will show, this unfortunately has not been recognized from the
start.

The second important property of URIs in the context of
internationalization is that they are not only used inside digital systems as
protocol elements, but are also used on paper as well as in people's minds.
These kinds of transcriptions are important for URI internationalization in
various ways. As noted above, they are one of the main motivations for
internationalizing URIs to the point where a wide range of characters can
directly be used.

URIs are also known as Universal Resource Identifiers. This
refers to the fact that anything of importance can be given an URI, and any
existing system of identifiers can be subsumed by URIs. It does not mean that
URIs are the only such system possible, but it is currently the most visible,
successful, and widely used one.

URIs are often partitioned into URLs (Uniform Resource Locators) and URNs
(Uniform Resource Names). Internationalization of URIs is orthogonal to this
distinction, and so only a very short summary is given here. Depending on the
context, the distinction is drawn in at least three ways. First, in an
abstract sense, there is an attempt to make a distinction between names and
addresses (locations). This works very well for physical entities such as
human beings or books in a library, but gets heavily blurred in the case of a
digital network with numerous indirections and caching mechanisms. Second, in
an intentional sense, URNs are often positioned for more persistent use.
Third, in a syntactical sense, URNs are distinguished as those URIs that
start with the prefix (scheme name) urn:.

URI Components

URI syntax is defined so that various parts of an URI can be clearly
identified if present. First, according to [RFC2396],
what goes into the href attribute of the <a>
element in HTML and similar places is called an URI Reference. This
includes the part after the #, the so-called fragment identifier.
For [RFC2396] and specifications referring to it, only
the part before the # (if present) is actually called URI. In
everyday language, the term URI is often used for everything including the
fragment identifier; this paper follows this practice because
internationalization considerations apply to the fragment identifier without
exception.

In a well-defined context (e.g. in a Web page that has its own URI), it is
possible to use relative URIs, which can be extremely short. [RFC2396] defines exactly how relative URIs can be
converted into absolute URIs. Again, the distinction between relative and
absolute URIs is not relevant for internationalization. Below, we will use
very short examples, which can be understood to be relative URIs.

The first part of an absolute URI, up to the first colon, is the scheme.
Well-known schemes include http:, ftp:, and so on.
The scheme defines both the syntax (within the general limits of the URI
syntax) and the semantics of the URIs in this scheme, including character
encoding.
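As a trivial illustration (full URI parsing is of course more involved than a
single split), the scheme is everything up to the first colon:

```python
from urllib.parse import urlsplit

uri = "http://www.w3.org/2001/Talks/0912-IUC-IRI/paper.html"

# The scheme is everything up to the first colon:
assert uri.split(":", 1)[0] == "http"
# The standard-library parser agrees:
assert urlsplit(uri).scheme == "http"
```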

Overview

The next section discusses character encoding in URIs, from the legacy of
undefined character encoding towards the consistent use of UTF-8. Section 3
introduces IRIs as the internationalized equivalent of URIs. Section 4 deals
with specific aspects such as query part internationalization, domain name
internationalization, and bidirectionality. Section 5 discusses testing and
future work.

2. From Legacy to Consistency

This section discusses the evolution from legacy URI character handling to
the use of UTF-8 for consistent URI character handling. For completeness,
some other approaches to URI internationalization that have been proposed in
the past are also discussed.

Legacy URI Character Handling

Older specifications for URIs [RFC1630] do not clearly distinguish between
characters and bytes, and to some extent assume the use of iso-8859-1. With
the quick growth of the Web beyond the area covered by iso-8859-1, this
assumption became obsolete.

[RFC2396], the specification currently defining URIs, explains how
characters get encoded into URIs in Section 2.1. A sequence of original
characters (e.g. in a domain name or a file name) is mapped to a sequence of
bytes. This sequence of bytes is then mapped to a sequence of URI characters.
Both mappings can work both ways.

The second mapping, from bytes to URI characters, is well defined. For the
bytes corresponding to a subset of us-ascii, the us-ascii encoding is used.
This subset includes all letters and digits and a small number of symbols
(called unreserved in [RFC2396]: '-', '_', '.', '!', '~', '*', "'", '(',
')'). For all the other bytes, a '%' followed by two hexadecimal digits is
used. This is called %-encoding or %HH-encoding. The escaping also affects
all the syntactically relevant characters such as '/', '#', '%', and so on.
As an example, a simple % has to be escaped to %25 to clearly distinguish a
'payload' % character from a % used in a %-escape. It is also possible to
escape additional bytes. As an example, an 'a' can always be escaped to %61,
although this is done extremely rarely.
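These rules can be illustrated with Python's urllib (a sketch only; note that
Python's quote function follows the slightly newer unreserved set of RFC 3986,
which differs from [RFC2396] in a few symbols):

```python
from urllib.parse import quote, unquote

# Letters and digits map to themselves:
assert quote("March", safe="") == "March"
# Syntactically relevant characters such as '%' must be escaped:
assert quote("%", safe="") == "%25"
# Escaping additional bytes is always possible, though rarely done:
assert unquote("%61") == "a"
```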

Unfortunately, the first mapping, from original characters to bytes, is
not well defined. For characters in the us-ascii range, the us-ascii encoding
is used, but for other characters, [RFC2396] explicitly leaves the encoding
undefined, and defers it to future specifications. The resulting situation is
depicted in Table 1.

  original characters |  encoding   |  bytes          |  URI characters
  --------------------+-------------+-----------------+----------------
  March               |  us-ascii   |  4D 61 72 63 68 |  March
  März                |  iso-8859-1 |  4D E4 72 7A    |  M%E4rz
  März                |  macintosh  |  4D 8A 72 7A    |  M%8Arz
  März                |  utf-8      |  4D C3 A4 72 7A |  M%C3%A4rz

  (original characters <======> bytes: encoding undefined;
   bytes <======> URI characters: us-ascii or %HH)

Table 1: Mapping between original characters and URI characters, with
examples.

This shows clearly that there is a very strong asymmetry between the
characters in the US-ASCII range and other characters. For characters in the
US-ASCII range, the overall mapping is the identity. From protocol designers
to end users, nobody is really aware that there are two mappings; the
identity is taken for granted. For other characters, there is a double
handicap: they get converted to unreadable escapes, and the encoding used
gets lost. When trying to back-convert, one could for example end up with
M‰rz or MÃ¤rz.
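The ambiguity can be demonstrated with Python's urllib (a sketch; the
encodings named mirror Table 1):

```python
from urllib.parse import quote, unquote

word = "M\u00e4rz"  # März

# The same original characters yield different URI characters
# depending on which (undefined) encoding is applied first:
assert quote(word, encoding="iso-8859-1") == "M%E4rz"
assert quote(word, encoding="mac_roman") == "M%8Arz"
assert quote(word, encoding="utf-8") == "M%C3%A4rz"

# Back-converting with the wrong encoding garbles the word:
assert unquote("M%C3%A4rz", encoding="iso-8859-1") == "M\u00c3\u00a4rz"  # MÃ¤rz
```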

Please note that there is no requirement that URIs are constructed from
original characters. It is also possible to directly start with bytes in the
case of digital data. However, the only known URI scheme that allows direct
encoding of (binary) data, the data: scheme [RFC2397], uses base64 for better
readability and compactness.

Character Encoding in URIs based on UTF-8

The above situation can be improved in two steps. The first step consists
in converging on a single encoding. The second step consists in extending the
number of characters allowed in URIs. The first step is described here. The
second step is described in Section 3. Both steps are designed to be
introduced in parallel and to reinforce each other.

Allowing arbitrary encodings for non-ASCII characters in URIs creates
unnecessary confusion. Converging on a single encoding is highly desirable.
UTF-8 is the encoding of choice, mainly because it covers all of Unicode,
is compatible with us-ascii, and produces highly regular byte patterns.

[RFC2396] leaves details of URI syntax, including the issue of character
encoding, to scheme-specific definitions. Some of these in turn leave the
choice of encoding to the individual creators of URIs. Given this situation,
it was not possible to suddenly declare that UTF-8 should be the only
encoding, used in all URIs. [RFC2718], section 2.2.5, however clearly
recommends using UTF-8 for new URI schemes.

There are the following ways in which an URI scheme can adopt UTF-8:

By using UTF-8 in the protocol that is associated with the scheme.
Using the bytes from the protocol directly in the URI was one of the
motivations for originally not fixing the character encoding for URIs. An
example of a protocol that uses UTF-8 is the 'ftp:' URI associated with
the FTP protocol [RFC2640].

By explicitly converting between another representation used in the
protocol and UTF-8 used in the URI. An example for this is the 'imap:'
URI [RFC2192].

By just declaring UTF-8 to be used in cases where there is no protocol
associated with the URI scheme. An example for this is the 'urn:' scheme
[RFC2141]. For URNs, identifier syntax and
resolution mechanisms are highly independent.

By creating resources using UTF-8 as the character encoding for the URI
resolution. The typical example of this is HTTP. Each server can choose
the encoding for each of its resource names. UTF-8 based resource names
can be created either by using UTF-8 in the underlying server system (in
many cases just a file system), or by doing the appropriate conversions
in the server in order to expose UTF-8 rather than the actual back-end
encoding.

There are also parts of URIs that are independent of URI schemes, in
particular the fragment identifier (the part after the #). Fragment
identifiers are separated from URIs before resolution, and applied to the
resolved resource depending on its MIME type. The syntax of fragment
identifiers is defined by the format used for the resource, e.g. HTML. A
syntax for more flexible fragment identifiers is [XPointer], which is also defined to use UTF-8.

Other proposed Solutions

To address the problems with URI internationalization, other approaches
have also been considered initially. However, they were discarded years
ago because each of them had severe problems. We discuss them here
mainly to help understand why the UTF-8 solution was adopted.

Some proposals tried to extend the tagging approach used for MIME (e.g.
email) headers and body parts, i.e. to indicate the encoding used for each
URI or URI component. This was quickly rejected for a large number of
reasons. Adding tags would lengthen the URI considerably. There would be
confusion about whether the tagging indicated the encoding at the current
place, or the encoding to be used when converting from characters to protocol
bytes. Encoding tags would have to be added to URIs on paper, which would be
highly counterintuitive. Also, URI resolvers and other software would have to
know all encodings used.

One other proposal was to use (a variant of) UTF-7 instead of UTF-8. There
would have been some slight advantage in length against the escaped UTF-8
form. However, it would be difficult or impossible to keep original US-ASCII
characters and the characters produced by UTF-7 apart.

A convention that at some time enjoyed some popularity, and is actually
produced or accepted by a few implementations, was to use %uHHHH (where HHHH
is the four-digit hexadecimal number of the code unit in Unicode). The
advantage would be that it is clear that %uHHHH is an encoding of a character
based on Unicode. The problem is that it introduces new syntax that can
easily confuse older implementations.

3. Internationalized Resource Identifiers

Converging on UTF-8 for the conversion between original characters and
URIs is a very important step ahead, but still requires %HH-escaping. The
obvious goal is to get rid of %HH-escaping whenever possible, and to just
reach the same 'identity conversion' as for us-ascii. For this, the
convergence to UTF-8 is an important prerequisite, because otherwise
conversion to (traditional) URIs is not clearly defined.

The resulting construct has been called Internationalized URI and
Globalized URI, but recently, we have adopted the term
Internationalized Resource Identifier (IRI). The change of
terminology made it quite a bit easier to describe the concepts, because it
was possible to avoid lengthy terms such as non-internationalized
URI. However, dropping the 'U' (uniform or universal) does not at all
mean that these principles have been dropped; IRIs maintain these principles,
and in some sense are actually more uniform and universal. It also does not
mean that IRIs should be limited to very special places. IRIs can and should
replace URIs wherever possible. The use of two clearly distinct terms makes
it easier to describe this replacement in specifications. Whether the general
public will ever adopt the term IRI is a different question.

In principle, the definition of IRI is very easy: It is the same as an
URI, except that wherever %HH is allowed, non-URI characters are also
allowed. As a result, using non-ASCII characters in IRIs becomes as easy and
straightforward as using us-ascii characters in URIs. For convenience, the
resolution of IRIs is defined via a conversion to URIs. However, this does
not mean that an actual conversion to URIs is always needed. Conversion from
IRIs to URIs is straightforward. All the characters not allowed in URIs are
%HH-escaped after a conversion to bytes based on UTF-8. This is shown in
Table 2.

  original |  encoding   |  bytes          |  IRI    |  URI
  ---------+-------------+-----------------+---------+-----------
  March    |  utf-8      |  4D 61 72 63 68 |  March  |  March
  März     |  utf-8      |  4D C3 A4 72 7A |  März   |  M%C3%A4rz
  März     |  iso-8859-1 |  4D E4 72 7A    |  M%E4rz |  M%E4rz
  März     |  macintosh  |  4D 8A 72 7A    |  M%8Arz |  M%8Arz

  (original <======> bytes: encoding; bytes <======> IRI: utf-8 or %HH;
   bytes <======> URI: us-ascii or %HH)

Table 2: Original characters, IRIs, and URIs.

Table 2 also shows that IRIs do not exclude URIs based on legacy encodings
(last two rows). However, because these URIs do not use UTF-8, %HH-escaping
has to be used. There are other cases where %HH-escaping has to be used or
can be used; altogether, there are the following cases:

To escape syntax characters when used as 'payload' characters. The
syntax characters of IRIs are exactly the same as those of URIs.

To unambiguously encode code points that cannot be transcribed (e.g.
unassigned codepoints, codepoints not in NFC,...).

To escape bytes resulting from legacy encodings (see examples
above).

To escape characters that the transport medium cannot carry
(e.g. Japanese characters in an iso-8859-1 encoded email). If the
transport medium has its own way of escaping such characters, this may be
preferred. E.g. in a Japanese Web page (encoded in iso-2022-jp), the
above example would be written as
M&auml;rz rather than
M%C3%A4rz because this preserves the
identity of the character a-umlaut.

To escape any non-syntactic character if desired (same as escaping 'a'
to %61 for URIs). This can be used to get around limitations of devices,
or to make sure anybody can transcribe the URI even if he or she is not
familiar with the script used.
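The basic IRI-to-URI conversion described above can be sketched in a few
lines of Python. The safe set below is an assumption covering the unreserved
and reserved characters of [RFC2396] plus '%', so that URI syntax characters
and existing escapes are left intact:

```python
from urllib.parse import quote

def iri_to_uri(iri: str) -> str:
    # Encode non-ASCII characters as UTF-8 bytes, then %HH-escape them.
    # URI syntax characters and existing %-escapes are left untouched.
    safe = "-_.!~*'()" + ";/?:@&=+$," + "#[]" + "%"
    return quote(iri, safe=safe, encoding="utf-8")

assert iri_to_uri("M\u00e4rz") == "M%C3%A4rz"
assert iri_to_uri("http://example.org/M\u00e4rz#a") == "http://example.org/M%C3%A4rz#a"
```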

IRI Specification Details

While the basic idea of IRIs is extremely simple, there are some details
that have to be addressed carefully. These include digital transport of IRIs,
normalization issues, and conversion from URIs to IRIs.

Of course IRIs are designed to be transported digitally. One important
detail which may be easy to overlook in this case is that IRIs, in the same
way as URIs, are sequences of characters, not sequences of bytes.
This is obvious when IRIs appear on paper or other physical transport media.
For digital formats and protocols, it means that the character encoding of
IRIs follows the character encoding conventions of the format or protocol in
question. To take a particular example, an email body part or Web page
encoded in iso-8859-1 would use the E4 byte to encode the
character 'ä' in the IRI März, in the same way it uses
this byte for all the other 'ä' characters. Conversion to UTF-8 only occurs
when converting the IRI to an URI or when directly passing the IRI to the
resolution API (assuming that API uses UTF-8 and not e.g. UTF-16).
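In other words, the IRI is simply a character sequence whose byte
representation follows the carrier's encoding; a minimal sketch:

```python
iri = "M\u00e4rz"  # the IRI 'März' as a character sequence

# The byte representation follows the carrier document's encoding:
assert iri.encode("iso-8859-1") == b"M\xe4rz"   # in a Latin-1 page
assert iri.encode("utf-8") == b"M\xc3\xa4rz"    # in a UTF-8 page
# UTF-8 is involved only when converting the IRI to a URI.
```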

Unicode allows multiple encoding variants for certain characters. For
example, many characters with diacritics can be represented both in
precomposed and in decomposed form. On paper and similar transport media,
there is no difference. When converting from a form that does not make such
distinctions to a form where these distinctions are relevant, a particular
encoding must be chosen consistently. For many reasons outlined in [NormReq],
the form to choose is Normalization Form C (NFC) [UTR15].
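Normalization to NFC can be illustrated with Python's unicodedata module:

```python
import unicodedata

decomposed = "Ma\u0308rz"   # 'a' followed by a combining diaeresis
composed = unicodedata.normalize("NFC", decomposed)

assert composed == "M\u00e4rz"              # precomposed 'ä'
assert len(decomposed) == 5 and len(composed) == 4
```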

While conversion from IRIs to URIs is quite straightforward, the reverse
conversion is more difficult. The main problem is that it is not clear
whether some %HH-escape sequence was the result of encoding characters
using UTF-8 or whether it was produced otherwise. However, UTF-8 byte
sequences are highly regular and very rarely coincide with byte sequences
from legacy encodings (see [Dür97]). Still, just testing
against UTF-8 byte patterns is not sufficient. Before converting back,
various other conditions have to be checked. These include
non-allocated code points, non-normalized code sequences, and characters not
suitable in IRIs, such as formatting characters and various kinds of spaces
and compatibility characters.
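A hedged sketch of the reverse conversion: %HH runs are un-escaped only when
the bytes form valid UTF-8 and decode to non-ASCII characters. A real
implementation must additionally check for unallocated code points,
normalization, and unsuitable characters, as described above:

```python
import re

def uri_to_iri(uri: str) -> str:
    def decode_run(match):
        raw = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            text = raw.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            return match.group(0)          # not UTF-8: keep escaped
        if any(ord(c) < 0x80 for c in text):
            return match.group(0)          # escaped ASCII: keep escaped
        return text
    # Process each maximal run of %HH escapes as one unit.
    return re.sub(r"(?:%[0-9A-Fa-f]{2})+", decode_run, uri)

assert uri_to_iri("M%C3%A4rz") == "M\u00e4rz"   # valid UTF-8: un-escape
assert uri_to_iri("M%E4rz") == "M%E4rz"         # legacy byte: leave alone
```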

Specifications using IRIs

Over the last few years, a number of core Web specifications have adopted what
now is called IRI. The most important ones are:

[HTML4], Section B.2.1, describes a convention for
handling non-ASCII characters in URI attributes. This is worded as
an implementation note to align error behaviour.

[XLink] Section 5.4, defines that the
xlink:href attribute that is used for all XLink links is an
IRI.

[XML Schema] Section 3.2.17, defines the
anyURI datatype as including IRI functionality. The 'any'
prefix covers both the full spectrum of URI syntax (including URI
References) and non-ASCII characters.

[CharMod], still in draft stage, requires the
use of IRIs in the place of URIs for all future W3C specifications.

As [IRI] is still only a draft, these specifications include explicit
definitions of IRI behavior. The following quote shows how this is done in
[XLink]; other specifications contain very similar
provisions.

Some characters are disallowed in URI references, even if they are
allowed in XML; the disallowed characters include all non-ASCII characters,
plus the excluded characters listed in Section 2.4 of [IETF RFC 2396],
except for the number sign (#) and percent sign (%) and the square bracket
characters re-allowed in [IETF RFC 2732]. Disallowed characters must be
escaped as follows:

Each disallowed character is converted to UTF-8 [IETF RFC 2279] as
one or more bytes.

Any bytes corresponding to a disallowed character are escaped with
the URI escaping mechanism (that is, converted to %HH, where HH is the
hexadecimal notation of the byte value).

The original character is replaced by the resulting character
sequence.
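The three steps quoted above can be sketched as follows; treating all ASCII
characters as allowed is a simplification for illustration (the actual rules
also escape the excluded ASCII characters):

```python
from urllib.parse import quote

def escape_disallowed(ref: str) -> str:
    out = []
    for ch in ref:
        if ord(ch) < 0x80:
            out.append(ch)  # simplification: treat all ASCII as allowed
        else:
            # Steps 1-3: convert to UTF-8 bytes, escape each byte as %HH,
            # and replace the original character with the result.
            out.append(quote(ch, safe="", encoding="utf-8"))
    return "".join(out)

assert escape_disallowed("M\u00e4rz") == "M%C3%A4rz"
```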

One remaining problem is the treatment of those us-ascii characters that
are not allowed in URIs, i.e. space and various delimiters such as '<',
'>', '{', '}',... The current wording in [IRI] excludes
them, but most of the specifications above actually allow them. It is
expected that [IRI] will be updated to allow them, but
will contain a strong warning against using them. The reasons for allowing
them are that many actual implementations already tolerate them, and that in
some contexts, in particular in XML attributes, many of them do not cause any
problems, while other characters such as '&' actually have to be escaped
(not with URI %HH-escaping, but with XML escaping such as &amp;). The
reason for a strong warning is that IRIs that contain these characters cannot
easily be transferred from one context to another.

Conditions for Using IRIs

There are two groups of conditions for the use of IRIs. The first group
comprises basic operational conditions to deal with non-ASCII characters on
the computer, and includes keyboard and display logic. The second group
contains the two conditions necessary for IRIs to actually work in a
particular context:

The URI corresponding to the IRI has to be based on UTF-8, because an
appropriate scheme is used or the URI is constructed that way (e.g. via
server settings).

The carrier format for the IRI has to allow IRIs in the particular
location (e.g. XML attribute,...).

4. Specific Aspects of URI Internationalization

Forms and Query Parts

HTML Forms are an important part of interactivity and client-server
communication on the Web. The most frequent way of sending information back
to the server after filling in a form is by appending it to the request URI
after a '?', in the so-called query part.

Internationalization of query parts is very important, because form
character data naturally contains non-ASCII characters. Unfortunately,
because character encoding of URIs was not originally specified, various
heuristics developed:

Sending the data back in the encoding of the page received. This is
what [HTML4] specifies, and has been implemented by most
browsers for several generations. It works if the page is not
transcoded to another character encoding on the way to the browser (true
nowadays for PCs but not necessarily for mobile phones). A precondition
is that the encoding of the page is correctly detected; to help this, it
is a good idea to make sure that there is some relevant non-ASCII text on
the page.

Using a hidden field to indicate the encoding that was used to send the
page out. This supports the above behaviour if the page is generated
dynamically in different encodings, but does not protect against
transcoding.

Using a hidden field with a specific value. If the page is transcoded,
this value will also be transcoded, and so it is possible to trace
well-known transcodings.

Using a field with a specific name that the browser will use to return
the encoding of the query part. This would be close to ideal, but is not
widely deployed.

Guessing the encoding based on return values, by analysing the returned
bit patterns and the information about the browser sent in the
User-Agent: header field. This is very difficult, but
sometimes the only solution.
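The first heuristic, sending form data back in the page's encoding, can be
illustrated with a hypothetical query parameter q (a sketch using Python's
urllib):

```python
from urllib.parse import quote

value = "M\u00e4rz"  # form field content

# The browser encodes the query part in the encoding of the page:
assert "?q=" + quote(value, encoding="iso-8859-1") == "?q=M%E4rz"   # Latin-1 page
assert "?q=" + quote(value, encoding="utf-8") == "?q=M%C3%A4rz"     # UTF-8 page
```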

Sending out a page in UTF-8 alleviates many problems with Web
internationalization, including the problem of query parts. [XForms], the new generation of forms currently under
development at W3C, provides hope for a final solution to all the above
problems.

Domain Names

Many kinds of URIs can contain domain names (for example www.example.com).
Currently, domain names only allow a very restricted set of characters, a
subset of the characters allowed in URIs. The IETF IDN Working Group [IDN] is working on a solution to allow a large number of
characters from Unicode in domain names. The solution that currently has the
most support for use inside the domain name system itself is a mapping of
Unicode characters back to ASCII characters, a so-called ACE (ASCII
Compatible Encoding).

Independent of the solution chosen for the domain name system itself,
uniformity of character representation is important for domain name
components in URIs and IRIs, too. [IDN-URI] therefore
proposes to adopt UTF-8 followed by %HH escaping for the domain name parts of
URIs, which means that in IRIs, domain names from all kinds of scripts can be
used naturally. [IDN-URI] also extends the generic URI
syntax of [RFC2396] to allow %HH-escapes in the domain
name part.
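To illustrate the two representations (an ACE inside the domain name system,
UTF-8 plus %HH in URIs), here is a sketch using Python's idna codec, which
implements the Punycode-based ACE that was eventually standardized (after
this paper was written):

```python
from urllib.parse import quote

name = "b\u00fccher.example"  # bücher.example

# ACE form, for use inside the domain name system:
assert name.encode("idna") == b"xn--bcher-kva.example"

# UTF-8 + %HH form, as proposed by [IDN-URI] for the domain part of URIs:
assert quote(name, safe=".") == "b%C3%BCcher.example"
```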

Bidirectionality

Scripts such as Arabic and Hebrew are written from right to left. Combined
with letters from other scripts or digits, this leads to the problem of mixed
directionality or bidirectionality. Bidirectionality for URIs and IRIs is a
difficult problem due to three contradicting requirements:

The principle of uniformity for URIs and IRIs requires that components
and characters from right-to-left scripts be stored in the same order
as those from left-to-right scripts for processing, i.e. in logical
order.

The fact that IRIs are transcribed visually, which requires that the
visual order of bidirectional components and characters be clearly
defined.

The Unicode bidirectional reordering algorithm (converting from logical
ordered backing store to visual display) works well for natural text. But
IRI syntax is different from natural text and therefore needs special
precautions.

[Atkin] proposes a reordering algorithm for domain
names that strikes a good balance between limitations on character combinations
(e.g. both Latin and Arabic in the same component) and complexity. [IRI-bidi] explains how to use bidi control characters to
bridge the gap between an IRI-specific algorithm and the Unicode algorithm. A
combination of ideas from both proposals may lead to the best solution under
the given boundary conditions.

5. Testing and Future Work

The previous chapters have shown how to move towards a consistent and
user-friendly architecture for internationalized URIs. IRIs allow a wide
range of characters to be used directly with the same syntax as URIs.
Conversion to URIs is done by encoding in UTF-8 and then using %HH-escaping
as necessary. This fits together with the adoption of UTF-8 by more and more
URI schemes.

Various experiences with W3C specifications over the last few years have
shown that test suites can be a very efficient way to improve specifications
and their implementation and deployment. We are therefore currently working
on some tests for IRIs. A first version is publicly available at http://www.w3.org/2001/08/iri-test.
Tests will include documents conforming to different specifications (HTML,
XML, XML Schema,...) in different encodings. Each test will test the
functionality of a particular IRI in the document. Some tests will be added
to make sure that the basic functionality for the test is available and that
the test is executed correctly. These will include both tests using us-ascii
only and basic tests for each of the encodings used. A first version
of the tests only available to W3C members already showed some encouraging
results.

Other future work in particular includes moving the various specifications
currently in draft stage further along the W3C or IETF specification
process.

Acknowledgements

All opinions and errors in this paper are purely those of the author.
There are many people who have to be acknowledged for URI
internationalization, too many to list them all. The main thanks go to
François Yergeau, who had the idea both for using UTF-8 and for how to
address bidi problems, and to Larry Masinter, for providing both help as well
as creative pushback.

References

RFCs are available at many locations, among others from
http://www.ietf.org/rfc/rfcNNNN.txt, where NNNN is the RFC number.
Internet-Drafts are work in progress and are frequently updated. They can be
found at http://www.ietf.org/internet-drafts/xxxx, where xxxx is the name of
the draft. Please check whether there is a new version, with a higher
sequence number (e.g. -08.txt in place of -07.txt).

Martin J. Dürst, The Properties and Promises of UTF-8, Proc. 11th
International Unicode Conference, San Jose, CA, Sept. 1997, also available at
http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf.