2012/3/11 Jeremy Shaw <jeremy at n-heptane.com>:
> Also, URIs are not defined in terms of octets.. but in terms
> of characters. If you write a URI down on a piece of paper --
> what octets are you using? None.. it's some scribbles on a
> paper. It is the characters that are important, not the bit
> representation.
Well, to quote one example from RFC 3986:
    2.1.  Percent-Encoding

    A percent-encoding mechanism is used to represent a data octet
    in a component when that octet's corresponding character is
    outside the allowed set or is being used as a delimiter of, or
    within, the component.
The syntax of URIs is a mechanism for describing data octets,
not Unicode code points. Describing URIs in terms of Unicode
code points is at variance with the RFC.
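
To make the octet view concrete, here is a rough sketch of
percent-encoding over a ByteString. The names isUnreserved and
pctEncode are made up for illustration and are not from any
library; the sketch escapes everything outside the RFC's
"unreserved" set, which is stricter than most components need.

```haskell
import qualified Data.ByteString as B
import Data.Char (chr, intToDigit, toUpper)
import Data.Word (Word8)

-- | Octets of the RFC 3986 "unreserved" characters:
-- ALPHA / DIGIT / "-" / "." / "_" / "~"
-- (hypothetical helper, named here for illustration only)
isUnreserved :: Word8 -> Bool
isUnreserved w =
     (w >= 0x41 && w <= 0x5A)            -- A-Z
  || (w >= 0x61 && w <= 0x7A)            -- a-z
  || (w >= 0x30 && w <= 0x39)            -- 0-9
  || w `elem` [0x2D, 0x2E, 0x5F, 0x7E]   -- - . _ ~

-- | Percent-encode arbitrary octets. Assumption: every octet
-- outside the unreserved set is escaped, including sub-delims.
pctEncode :: B.ByteString -> String
pctEncode = concatMap enc . B.unpack
  where
    enc :: Word8 -> String
    enc w
      | isUnreserved w = [chr (fromIntegral w)]
      | otherwise      = ['%', hex (w `div` 16), hex (w `mod` 16)]

    -- One uppercase hex digit for a value in 0..15.
    hex :: Word8 -> Char
    hex = toUpper . intToDigit . fromIntegral
```

With this, pctEncode (B.pack [0xFF, 0x00]) gives "%FF%00": the
input is a pair of octets that is not text in any encoding, yet
it has a perfectly good URI spelling.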
> If you render a URI in a utf-8 encoded document versus a
> utf-16 encoded document.. the octets will be different, but
> the meaning will be the same. Because it is the characters
> that are important. For a URI Text would be a more compact
> representation than String.. but ByteString is a bit dodgy
> since it is not well defined what those bytes represent.
> (though if you use a newtype wrapper around ByteString to
> declare that it is Ascii, then that would be fine).
This is all well and good for what a URI is parsed from and
what it is serialized to; but once parsed, the major components
of a URI are all octets, pure and simple. Take the "host" part
of the authority:
    host     = IP-literal / IPv4address / reg-name
    ...
    reg-name = *( unreserved / pct-encoded / sub-delims )
The reg-name production is enough to show that, once the host
portion is parsed, it can contain any bytes whatsoever, since
pct-encoded can spell any octet from %00 to %FF. ByteString is
the only correct representation for a parsed host and userinfo,
as well as for a parsed path, query or fragment.
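
As a minimal sketch of what "once parsed" means here, the
following decodes a percent-encoded component to a ByteString.
The name pctDecode is hypothetical, and the sketch does not check
that unescaped characters are legal for any particular component;
it only rejects malformed escapes and non-ASCII input.

```haskell
import qualified Data.ByteString as B
import Data.Char (digitToInt, isHexDigit)
import Data.Word (Word8)

-- | Percent-decode a URI component into raw octets.
-- Returns Nothing on a malformed escape or a non-ASCII character.
-- (hypothetical helper, named here for illustration only)
pctDecode :: String -> Maybe B.ByteString
pctDecode = fmap B.pack . go
  where
    go :: String -> Maybe [Word8]
    go [] = Just []
    -- "%HH" becomes the single octet with that hex value.
    go ('%' : h : l : rest)
      | isHexDigit h && isHexDigit l =
          (fromIntegral (16 * digitToInt h + digitToInt l) :) <$> go rest
    -- Any other use of '%' is a malformed escape.
    go ('%' : _) = Nothing
    -- Literal ASCII characters stand for their own octet.
    go (c : rest)
      | c < '\x80' = (fromIntegral (fromEnum c) :) <$> go rest
      | otherwise  = Nothing
```

Here fmap B.unpack (pctDecode "%FF%00") is Just [255,0]. That
result is not valid UTF-8, so a Text or String representation
of the parsed host would have nowhere to put it.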
--
Jason Dusek
pgp /// solidsnack 1FD4C6C1 FED18A2B