This section is not part of the specification:
it is simply an explanation of the
way in which the specification was
derived.

Design criteria

The syntax was designed to be

Extensible

New naming schemes may
be added later.

Complete

It is possible to encode
any naming scheme.

Printable

It is possible to express
any URI using 7-bit ASCII characters
so that URIs may if necessary be
passed using pen and ink.

Choices for a universal syntax

For the syntax itself there is little
choice except for the order and punctuation
of the elements, and the acceptable
characters and escaping rules.

The extensibility requirement is
met by allowing an arbitrary (but
registered) string to be used as
a prefix. A prefix rather than a
suffix is chosen because left-to-right
parsing is more common than
right-to-left. The choice
of a colon as separator of the prefix
from the rest of the URI was arbitrary.

The decoding of the rest of the string
is defined as a function of the prefix.
New prefixes are introduced for new
schemes as necessary, in agreement
with the registration authority.
The registration of a new scheme
clearly requires the definition of
the decoding of the URI into a given
name space, and a definition of the
properties and, where applicable,
resolution protocols, for the name
space.
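
As an illustration only, the dispatch on
the prefix described above might be sketched
as follows. The registry, the function names
and the "example" scheme are hypothetical and
are not part of the specification.

```python
# Hypothetical registry: scheme prefix -> decoding function.
DECODERS = {}


def register(scheme, decoder):
    """Record the decoding function for a newly registered scheme."""
    DECODERS[scheme] = decoder


def decode(uri):
    """Split a URI at its first colon and decode the rest by prefix."""
    prefix, sep, rest = uri.partition(":")
    if not sep:
        raise ValueError("no scheme prefix in %r" % uri)
    try:
        decoder = DECODERS[prefix]
    except KeyError:
        raise ValueError("unregistered scheme %r" % prefix) from None
    # The meaning of `rest` is entirely up to the scheme's decoder.
    return decoder(rest)


# An invented scheme whose names are "/"-separated components.
register("example", lambda rest: rest.split("/"))
```

Only the first colon is treated as the
separator, so a later colon simply belongs
to the scheme-specific part of the string.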

The completeness requirement is easily
met by allowing particularly strange
or plain binary names to be encoded
in base 16 or 64 using the acceptable
characters.
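
A minimal sketch of such an encoding, using
the Python standard library, is given below.
The sample bytes are invented, and the use
of the "URL-safe" base 64 alphabet is an
assumption: which base 64 characters are
acceptable depends on the safe set chosen.

```python
import base64

# An invented binary name with unprintable bytes.
name = bytes([0x00, 0xFF, 0x20, 0x7F])

# Base 16: two hexadecimal digits per byte.
hex_form = name.hex()

# Base 64, using the alphabet that replaces "+" and "/" with
# "-" and "_"; "=" padding may still appear at the end.
b64_form = base64.urlsafe_b64encode(name).decode("ascii")

# Both forms are plain ASCII and decode back to the same bytes.
assert bytes.fromhex(hex_form) == name
assert base64.urlsafe_b64decode(b64_form) == name
```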

The printability requirement could
have been met by requiring all schemes
to encode characters not part of
a basic set. This led to many discussions
of what the basic set should be.
A difficult case, for example, is
an ISO Latin-1 string appearing
in a URL: within an application
with ISO Latin-1 capability it can
be handled intact, but for
transport in general the non-ASCII
characters need to be escaped.

The solution to this was to specify
a safe set of characters, and a general
escaping scheme which may be used
for encoding "unsafe" characters.
This "safe" set is suitable, for
example, for use in electronic mail.
This is the canonical form of a
URI.

The choice of escape character for
introducing representations of non-allowed
characters also tends to be a matter
of taste. The ANSI standard for the
C language uses the backslash
character "\". The use of this character
on Unix command lines, however, can
be a problem as it is interpreted
by many shell programs, and would
itself have to be escaped. It is
also a character which is not available
on certain keyboards. The equals
sign is commonly used in the encoding
of names having attribute=value pairs,
so it too was unsuitable.
The percent sign was eventually chosen
as a suitable escape character.

There is a conflict between the need
to be able to represent many characters
including spaces within a URI directly,
and the need to be able to use a
URI in environments which have limited
character sets or in which certain
characters are prone to corruption.
This conflict has been resolved by
use of a hexadecimal escaping method
which may be applied to any characters
forbidden in a given context. When
URLs are moved between contexts,
the set of characters escaped may
be enlarged or reduced unambiguously.
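
The following sketch illustrates the idea
under stated assumptions: characters are
taken to be ISO Latin-1 bytes, and the two
character sets STRICT and RELAXED are
invented for the example rather than taken
from the specification.

```python
import string

# Two invented contexts with different sets of permitted characters.
STRICT = frozenset(string.ascii_letters + string.digits)
RELAXED = STRICT | frozenset(" -_.")


def escape(text, safe):
    """Write every character outside `safe` as "%" plus two hex digits."""
    out = []
    for byte in text.encode("latin-1"):
        ch = chr(byte)
        # "%" is always escaped so that decoding stays unambiguous.
        if ch in safe and ch != "%":
            out.append(ch)
        else:
            out.append("%{:02X}".format(byte))
    return "".join(out)


def unescape(text):
    """Reverse escape(): turn each "%XX" back into one character."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "%":
            out.append(int(text[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(ord(text[i]))
            i += 1
    return out.decode("latin-1")


def recode(escaped, safe):
    """Move an escaped string into a context with a different safe set."""
    return escape(unescape(escaped), safe)
```

Under these assumptions, "a b.c" is written
as "a%20b%2Ec" in the STRICT context and as
"a b.c" in the RELAXED one, and recode()
converts either form into the other.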

The use of white space characters
is risky in URIs to be printed or
sent by electronic mail, and the
use of multiple white space characters
is very risky. This is because of
the frequent introduction of extraneous
white space when lines are wrapped
by systems such as mail, or by the
sheer necessity of a narrow column width,
and because of the inter-conversion
of various forms of white space which
occurs during character code conversion
and the transfer of text between
applications. This is why the canonical
form for URIs has all white spaces
encoded.
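
One consequence can be sketched in code: if
a URI is known to be in canonical form, any
white space found in it on receipt must have
been introduced in transit and can be
discarded. The function name and the sample
URI are hypothetical.

```python
def strip_introduced_whitespace(received):
    """Remove all white space from a URI assumed to be canonical.

    In canonical form every white space character is escaped, so
    any literal white space was added by wrapping or conversion.
    """
    return "".join(received.split())


# A canonical URI broken across two lines by a mail system.
wrapped = "example:a%\n    20b"
recovered = strip_introduced_whitespace(wrapped)
```

Here the value of recovered is
"example:a%20b", in which the escaped space
survives even though the line break and
indentation have been removed.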