ok, here is what I have found out so far. First, I tested 3 html generation libraries to see if they do any escaping on the arguments passed to href (Text.Html, Text.XHtml, and HSP):<div><br></div><div><div>{-# OPTIONS -F -pgmFtrhsx #-}</div>
<div>module Main where</div><div><br></div><div>import System.IO</div><div>import qualified Text.Html as H</div><div>import qualified Text.XHtml as X</div><div>import HSP</div><div>import HSP.Identity</div><div>import HSP.HTML</div>
<div><br></div><div>main :: IO ()</div><div>main =&nbsp;</div><div>&nbsp;&nbsp;do hSetEncoding stdout utf8</div><div>&nbsp;&nbsp; &nbsp; let nihongo = &quot;日本語&quot;</div><div>&nbsp;&nbsp; &nbsp; putStrLn nihongo</div><div>&nbsp;&nbsp; &nbsp; putStrLn $ H.renderHtml $ H.anchor H.! [H.href nihongo] H.&lt;&lt; (H.toHtml &quot;nihongo&quot;)</div>
<div>&nbsp;&nbsp; &nbsp; putStrLn $ X.renderHtml $ X.anchor X.! [X.href nihongo] X.&lt;&lt; (X.toHtml &quot;nihongo&quot;) &nbsp; &nbsp;&nbsp;</div><div>&nbsp;&nbsp; &nbsp; putStrLn $ renderAsHTML $ evalIdentity $ &lt;a href=nihongo&gt;nihongo&lt;/a&gt;</div><div><br>
</div><div>The output produced was:</div><div><br></div><div><div>*Main Text.Html System.IO&gt; main</div><div>日本語</div><div>&lt;!DOCTYPE HTML PUBLIC &quot;-//W3C//DTD HTML 3.2 FINAL//EN&quot;&gt;</div><div>&lt;!--Rendered using the Haskell Html Library v0.2--&gt;</div>
<div>&lt;HTML</div><div>&gt;&lt;A HREF = &quot;日本語&quot;</div><div>&nbsp;&nbsp;&gt;nihongo&lt;/A</div><div>&nbsp;&nbsp;&gt;&lt;/HTML</div><div>&gt;</div><div><br></div><div>&lt;!DOCTYPE html PUBLIC &quot;-//W3C//DTD XHTML 1.0 Transitional//EN&quot; &quot;<a href="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd</a>&quot;&gt;</div>
<div>&lt;html xmlns=&quot;<a href="http://www.w3.org/1999/xhtml">http://www.w3.org/1999/xhtml</a>&quot;</div><div>&gt;&lt;a href=&quot;&amp;#26085;&amp;#26412;&amp;#35486;&quot;</div><div>&nbsp;&nbsp;&gt;nihongo&lt;/a</div><div>&nbsp;&nbsp;&gt;&lt;/html</div>
<div>&gt;</div><div><br></div><div>&lt;a href=&quot;日本語&quot;</div><div>&gt;nihongo&lt;/a</div><div>&gt;</div><div><br></div><div>So, none of them attempted to convert the String into a valid URL. &nbsp;The XHtml library did make an attempt to encode the string, but that encoding does not really make it a valid URL. (And the other two utf-8 encoded the string, because they utf-8 encoded the whole document -- which is the correct thing to do).</div>
<div><br></div><div>The behavior of these libraries seems correct -- if they attempted to do more url encoding, I think that would just make things worse.</div><div><br></div><div>Next there is the question of what you are supposed to do with non-ASCII characters in a URI. This is described in section 2.1 of RFC 2396:</div>
<div><br></div><div><a href="http://www.ietf.org/rfc/rfc2396.txt">http://www.ietf.org/rfc/rfc2396.txt</a></div></div><div><br></div><div><div>&nbsp;&nbsp; The relationship between URI and characters has been a source of</div><div>&nbsp;&nbsp; confusion for characters that are not part of US-ASCII. To describe</div>
<div>&nbsp;&nbsp; the relationship, it is useful to distinguish between a &quot;character&quot;</div><div>&nbsp;&nbsp; (as a distinguishable semantic entity) and an &quot;octet&quot; (an 8-bit</div><div>&nbsp;&nbsp; byte). There are two mappings, one from URI characters to octets, and</div>
<div>&nbsp;&nbsp; a second from octets to original characters:</div><div><br></div><div>&nbsp;&nbsp; URI character sequence-&gt;octet sequence-&gt;original character sequence</div><div><br></div><div>&nbsp;&nbsp; A URI is represented as a sequence of characters, not as a sequence</div>
<div>&nbsp;&nbsp; of octets. That is because URI might be &quot;transported&quot; by means that</div><div>&nbsp;&nbsp; are not through a computer network, e.g., printed on paper, read over</div><div>&nbsp;&nbsp; the radio, etc.</div><div><br></div></div>
<div>So a URI is a character sequence (of a restricted set of characters that are found in ASCII). A URI does not have a &#39;binary representation&#39;, because it could be transmitted via non-binary forms, such as a business card, etc. It is the characters that matter. A uri that has been utf-8 encoded and utf-16 encoded is still the same uri because the characters represented by those encodings are the same.</div>
<div><br></div><div>So, there is actually another little piece missing in that sequence when data is transmitted via the computer. Namely, extracting the URI from the raw octets.</div><div><br></div><div>&nbsp;raw octets for uri -&gt; URI character sequence -&gt; octet sequence -&gt; original character sequence</div>
<div><br></div><div>For example, let&#39;s pretend a web page was sent as: Content-type: text/html; charset=utf-32</div><div><br></div><div>The utf-32 octets representing the uri must first be decoded to characters (aka the uri character sequence). That seems outside the scope of URLT: that stage of decoding should be done before URLT gets the data, because it requires looking at HTTP headers, the meta http-equiv tag, etc. Next we can convert the uri character sequence into a new sequence of octets representing 8-bit encoded data. That is done by converting normal ascii characters to their 8-bit ascii equivalent, and by converting % encoded values to their equivalent 8-bit values. So the character &#39;a&#39; in the URI would be converted to 0x61, and the sequence %35 would be converted to 0x35. Finally, the octet sequence is converted to the original character sequence.</div>
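<div><br></div><div>To make that middle step concrete, here is a minimal sketch of the &#39;URI character sequence to octet sequence&#39; mapping. It is not URLT code; the name uriCharsToOctets is made up for illustration, and rejecting non-ASCII input with Nothing is my own assumption about how to handle malformed input.</div><div><br></div>

```haskell
import Data.Char (digitToInt, isHexDigit, ord)
import Data.Word (Word8)

-- | Map a URI character sequence to the octet sequence it denotes.
-- Plain ASCII characters become their ASCII code, and "%XY" becomes
-- the single octet 0xXY. (Hypothetical helper, not part of URLT.)
-- Returns Nothing for a malformed escape or a non-ASCII character,
-- since neither can appear in a valid URI.
uriCharsToOctets :: String -> Maybe [Word8]
uriCharsToOctets [] = Just []
uriCharsToOctets ('%' : h : l : rest)
  | isHexDigit h && isHexDigit l =
      (fromIntegral (16 * digitToInt h + digitToInt l) :) <$> uriCharsToOctets rest
uriCharsToOctets ('%' : _) = Nothing -- '%' not followed by two hex digits
uriCharsToOctets (c : rest)
  | ord c < 128 = (fromIntegral (ord c) :) <$> uriCharsToOctets rest
  | otherwise   = Nothing
```

<div><br></div><div>With this, uriCharsToOctets &quot;a%35&quot; gives Just [0x61, 0x35], matching the example above. Note that the result is still just octets; nothing here says which character encoding they are in.</div>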
<div><br></div><div>There are a few things that make this tricky.&nbsp;</div><div><br></div><div>&nbsp;1. the encoding of the octet sequence in the middle is not specified in the uri. So when you are converting back to the original character sequence, you don&#39;t know if the octet sequence represents ascii, utf-8, or something else.</div>
<div><br></div><div>&nbsp;2. normalization and reserved characters</div><div><br></div><div>&nbsp;&nbsp;Every character *can* be percent encoded, though you are only supposed to percent encode a limited set. URL normalization dictates that&nbsp;the following three URIs are equivalent:</div>
<div><div><br></div><div>&nbsp;&nbsp; &nbsp; &nbsp;<a href="http://example.com:80/~smith/home.html">http://example.com:80/~smith/home.html</a></div><div>&nbsp;&nbsp; &nbsp; &nbsp;http://EXAMPLE.com/%7Esmith/home.html</div><div>&nbsp;&nbsp; &nbsp; &nbsp;http://EXAMPLE.com:/%7esmith/home.html</div>
<div>&nbsp;</div><div>&nbsp;The %7E and ~ are equal, because ~ is *not* a reserved character. But&nbsp;</div><div><br></div><div>&nbsp;&nbsp; /foo/bar/baz/</div><div>&nbsp;&nbsp; /foo%2Fbar/baz/</div><div><br></div><div>&nbsp;are *not* equal because / is a reserved character.</div>
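<div><br></div><div>A small sketch of that distinction: the function below decodes only those percent escapes that stand for unreserved characters, and leaves every other escape alone (just upper-casing its hex digits). The names isUnreservedChar and normalizeEscapes are hypothetical, and this covers only the percent-encoding part of normalization, not the scheme, host, or port handling.</div><div><br></div>

```haskell
import Data.Char (chr, digitToInt, isAlphaNum, isAscii, isHexDigit, toUpper)

-- | The RFC 3986 unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~".
isUnreservedChar :: Char -> Bool
isUnreservedChar c = isAscii c && (isAlphaNum c || c `elem` "-._~")

-- | Decode percent escapes of unreserved characters only.
-- Escapes of reserved characters (such as %2F for '/') are kept,
-- because decoding them would turn data into a delimiter.
normalizeEscapes :: String -> String
normalizeEscapes ('%' : h : l : rest)
  | isHexDigit h && isHexDigit l =
      let c = chr (16 * digitToInt h + digitToInt l)
      in if isUnreservedChar c
           then c : normalizeEscapes rest
           else '%' : toUpper h : toUpper l : normalizeEscapes rest
normalizeEscapes (c : rest) = c : normalizeEscapes rest
normalizeEscapes [] = []
```

<div><br></div><div>So /%7esmith/home.html and /~smith/home.html normalize to the same string, while /foo%2Fbar/baz/ stays distinct from /foo/bar/baz/.</div>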
<div><br></div><div>RFC3986 has this to say about when to encode and decode:</div><div><br></div><div><div>2.4. &nbsp;When to Encode or Decode</div><div><br></div><div>&nbsp;&nbsp; Under normal circumstances, the only time when octets within a URI</div>
<div>&nbsp;&nbsp; are percent-encoded is during the process of producing the URI from</div><div>&nbsp;&nbsp; its component parts. &nbsp;This is when an implementation determines which</div><div>&nbsp;&nbsp; of the reserved characters are to be used as subcomponent delimiters</div>
<div>&nbsp;&nbsp; and which can be safely used as data. &nbsp;Once produced, a URI is always</div><div>&nbsp;&nbsp; in its percent-encoded form.</div><div><br></div><div>&nbsp;&nbsp; When a URI is dereferenced, the components and subcomponents</div><div>&nbsp;&nbsp; significant to the scheme-specific dereferencing process (if any)</div>
<div>&nbsp;&nbsp; must be parsed and separated before the percent-encoded octets within</div><div>&nbsp;&nbsp; those components can be safely decoded, as otherwise the data may be</div><div>&nbsp;&nbsp; mistaken for component delimiters. &nbsp;The only exception is for</div>
<div>&nbsp;&nbsp; percent-encoded octets corresponding to characters in the unreserved</div><div>&nbsp;&nbsp; set, which can be decoded at any time. &nbsp;For example, the octet</div><div>&nbsp;&nbsp; corresponding to the tilde (&quot;~&quot;) character is often encoded as &quot;%7E&quot;</div>
<div>&nbsp;&nbsp; by older URI processing implementations; the &quot;%7E&quot; can be replaced by</div><div>&nbsp;&nbsp; &quot;~&quot; without changing its interpretation.</div><div><br></div><div>&nbsp;&nbsp; Because the percent (&quot;%&quot;) character serves as the indicator for</div>
<div>&nbsp;&nbsp; percent-encoded octets, it must be percent-encoded as &quot;%25&quot; for that</div><div>&nbsp;&nbsp; octet to be used as data within a URI. &nbsp;Implementations must not</div><div>&nbsp;&nbsp; percent-encode or decode the same string more than once, as decoding</div>
<div>&nbsp;&nbsp; an already decoded string might lead to misinterpreting a percent</div><div>&nbsp;&nbsp; data octet as the beginning of a percent-encoding, or vice versa in</div><div>&nbsp;&nbsp; the case of percent-encoding an already percent-encoded string.</div>
<div><br></div></div><div><br></div><div>It also has this to say about encoding Unicode data:</div><div><br></div><div><div>&nbsp;&nbsp; When a new URI scheme defines a component that represents textual</div><div>&nbsp;&nbsp; data consisting of characters from the Universal Character Set [UCS],</div>
<div>&nbsp;&nbsp; the data should first be encoded as octets according to the UTF-8</div><div>&nbsp;&nbsp; character encoding [STD63]; then only those octets that do not</div><div>&nbsp;&nbsp; correspond to characters in the unreserved set should be percent-</div>
<div>&nbsp;&nbsp; encoded. &nbsp;For example, the character A would be represented as &quot;A&quot;,</div><div>&nbsp;&nbsp; the character LATIN CAPITAL LETTER A WITH GRAVE would be represented</div><div>&nbsp;&nbsp; as &quot;%C3%80&quot;, and the character KATAKANA LETTER A would be represented</div>
<div>&nbsp;&nbsp; as &quot;%E3%82%A2&quot;.</div></div><div><br></div><div>I can&#39;t find an official stamp of approval, but I believe the http scheme now specifies that the octets in the middle step are utf-8 encoded.</div></div>
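<div><br></div><div>Here is a rough, dependency-free sketch of the rule quoted above: utf-8 encode each character, then percent encode every octet that is not an unreserved character. The function names are my own, and the hand-written utf-8 encoder is only there to keep the example self-contained (in real code I would expect to use a library such as utf8-string for that part).</div><div><br></div>

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (intToDigit, isAlphaNum, isAscii, ord, toUpper)

-- | The RFC 3986 unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~".
isUnreservedChar :: Char -> Bool
isUnreservedChar c = isAscii c && (isAlphaNum c || c `elem` "-._~")

-- | UTF-8 encode a single character as a list of octets (0..255).
utf8Octets :: Char -> [Int]
utf8Octets ch
  | n < 0x80    = [n]
  | n < 0x800   = [0xC0 .|. shiftR n 6, cont n]
  | n < 0x10000 = [0xE0 .|. shiftR n 12, cont (shiftR n 6), cont n]
  | otherwise   = [0xF0 .|. shiftR n 18, cont (shiftR n 12), cont (shiftR n 6), cont n]
  where
    n = ord ch
    cont x = 0x80 .|. (x .&. 0x3F) -- continuation byte: 10xxxxxx

-- | Render one octet as "%XY" with upper-case hex digits.
percentEncodeOctet :: Int -> String
percentEncodeOctet o = ['%', hex (shiftR o 4), hex (o .&. 0xF)]
  where
    hex = toUpper . intToDigit

-- | Encode one URI component: unreserved characters pass through,
-- everything else is UTF-8 encoded and then percent encoded.
encodeComponent :: String -> String
encodeComponent = concatMap encodeChar
  where
    encodeChar c
      | isUnreservedChar c = [c]
      | otherwise          = concatMap percentEncodeOctet (utf8Octets c)
```

<div><br></div><div>This reproduces the three examples from the RFC text: A stays &quot;A&quot;, LATIN CAPITAL LETTER A WITH GRAVE becomes &quot;%C3%80&quot;, and KATAKANA LETTER A becomes &quot;%E3%82%A2&quot;.</div>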
<div><br></div><div>So, here is a starting example of what I think should happen for encoding, and then decoding.</div><div><br></div><div>1. We start with a list of path components [&quot;foo/bar&quot;,&quot;baz&quot;]</div>
<div>2. We then convert the sequence to a String containing the utf-8 encoded octets (a String not a bytestring)</div><div>3. We percent encode everything that is not an unreserved character</div><div>4. We add the delimiters</div>
<div><br></div><div>We now have a proper URI. Note that we have a String and that the URI is made up of the characters in that String. The final step happens when the URI is actually used:</div><div><br></div><div>&nbsp;5. the URI is inserted into an HTML document (etc). The document is then encoded according to whatever encoding the document is supposed to have (could be anything), converting the URI into some encoding.</div>
<div><br></div><div>So a URI is actually encoded twice. We use a similar process to decode the URI. Here is some code that does what I described:</div><div><br></div><div><div>import Codec.Binary.UTF8.String (encodeString, decodeString)</div>
<div>import Network.URI</div><div>import System.FilePath.Posix (joinPath, splitDirectories)</div><div><br></div><div>encodeUrl :: [String] -&gt; String</div><div>encodeUrl paths =&nbsp;</div><div>&nbsp;&nbsp;let step1 = map encodeString paths -- utf-8 encode the data characters in path components (we have not added any delimiters yet)</div>
<div>&nbsp;&nbsp; &nbsp; &nbsp;step2 = map (escapeURIString isUnreserved) step1 -- percent encode the characters</div><div>&nbsp;&nbsp; &nbsp; &nbsp;step3 = joinPath step2 -- add in the delimiters</div><div>&nbsp;&nbsp;in step3</div><div>&nbsp;&nbsp; &nbsp;&nbsp;</div><div>decodeUrl :: String -&gt; [String] &nbsp; &nbsp;&nbsp;</div>
<div>decodeUrl str =</div><div>&nbsp;&nbsp;let step1 = splitDirectories str &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;-- split path on delimiters</div><div>&nbsp;&nbsp; &nbsp; &nbsp;step2 = map unEscapeString step1 -- decode any percent encoded characters</div><div>&nbsp;&nbsp; &nbsp; &nbsp;step3 = map decodeString step2 &nbsp; -- decode octets</div>
<div>&nbsp;&nbsp;in step3</div><div>&nbsp;&nbsp;</div><div>f = encodeString &quot;日本語&quot; &nbsp; &nbsp;&nbsp;</div><div>&nbsp;&nbsp; &nbsp;&nbsp;</div><div>test =&nbsp;</div><div>&nbsp;&nbsp;let p = [&quot;foo/bar&quot;, &quot;日本語&quot;] &nbsp; &nbsp;&nbsp;</div><div>&nbsp;&nbsp; &nbsp; &nbsp;e = encodeUrl p</div><div>&nbsp;&nbsp; &nbsp; &nbsp;d = decodeUrl e</div>
<div>&nbsp;&nbsp;in (d == p, p, e ,d)</div><div><br></div></div><div>The problem with using [String] is that it assumes the only delimiter we care about is &#39;/&#39;. But we might also want to deal with the other delimiters such as : # ?. (For example, if we want to use the urlt system to generate the query string as well as the path..). But [String] does not give us a way to do that. Instead it seems like we would need a type that would allow us to specify the path, the query string, the fragment, etc. namely a real uri type? Perhaps there is something on hackage we can leverage.</div>
<div><br></div><div>I think that having each individual set of toUrl / fromUrl functions deal with the encoding / decoding is not a good way to go. Makes it too easy to get it wrong. Having it all done correctly in one place makes life easier for people adding new instances or methods of generating instances.</div>
<div><br></div><div>I think that urlt dealing with ByteString or [ByteString] is never the right thing. The only time that the URI is a &#39;byte string&#39; is when it is encoded in an html document, or encoded in the http headers. But at the URLT level, we don&#39;t know what encoding that is. Instead we want the bytestring decoded, and we want to receive a &#39;URI character sequence.&#39; Or we want to give a &#39;URI character sequence&#39; to the html library, and let it worry about the encoding of the document.</div>
<div><br></div><div>At present, I think I am still ok with the fromURL and toURL functions producing and consuming String values. But what we need is an intermediate URL type like:</div><div><br></div><div>data URL = URL { paths :: [String], queryString :: String, frag :: String }</div>
<div><br></div><div>and functions that properly do, encodeURL :: URL -&gt; String, decodeURL :: String -&gt; URL.</div><div><br></div><div>The AsURL class would look like:</div><div><br></div><div>class AsURL u where</div>
<div>&nbsp;&nbsp;toURLC :: u -&gt; URL</div><div>&nbsp;&nbsp;fromURLC :: URL -&gt; Failing u</div><div><br></div><div>instance AsURL URL where</div><div>&nbsp;&nbsp;toURLC = id</div><div>&nbsp;&nbsp;fromURLC = Success</div><div><br></div><div>And then toURL / fromURL would be like:</div>
<div><br></div><div>toURL :: (AsURL u) =&gt; u -&gt; String</div><div>toURL = encodeURL . toURLC</div><div><br></div><div>fromURL :: (AsURL u) =&gt; String -&gt; Failing u</div><div>fromURL = fromURLC . decodeURL</div><div><br></div>
<div>The Strings in the URL type would not require any special encoding/decoding. The encoding / decoding would be handled by the encodeURL / decodeURL functions.</div><div><br></div><div>In other words, when the user creates a URL type by hand, they do not have to know anything about url encoding rules, it just happens like magic. That should make it much easier to write AsURL instances by hand.</div>
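<div><br></div><div>To show what I mean, here is a very rough sketch of the encoding half only. Everything beyond the URL record itself is an assumption made for illustration: that each path segment gets a leading slash, that the query string and fragment are treated as opaque text and encoded whole, and that empty ones are omitted. The helper names are made up, and a real version would also need the matching decodeURL.</div><div><br></div>

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (intToDigit, isAlphaNum, isAscii, ord, toUpper)

-- | The intermediate URL type: plain, unencoded Strings.
data URL = URL { paths :: [String], queryString :: String, frag :: String }

-- | The RFC 3986 unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~".
isUnreservedChar :: Char -> Bool
isUnreservedChar c = isAscii c && (isAlphaNum c || c `elem` "-._~")

-- | UTF-8 encode a single character as a list of octets (0..255).
utf8Octets :: Char -> [Int]
utf8Octets ch
  | n < 0x80    = [n]
  | n < 0x800   = [0xC0 .|. shiftR n 6, cont n]
  | n < 0x10000 = [0xE0 .|. shiftR n 12, cont (shiftR n 6), cont n]
  | otherwise   = [0xF0 .|. shiftR n 18, cont (shiftR n 12), cont (shiftR n 6), cont n]
  where
    n = ord ch
    cont x = 0x80 .|. (x .&. 0x3F)

-- | UTF-8 encode, then percent encode everything that is not unreserved.
encodeComponent :: String -> String
encodeComponent = concatMap encodeChar
  where
    encodeChar c
      | isUnreservedChar c = [c]
      | otherwise          = concatMap pct (utf8Octets c)
    pct o = ['%', hex (shiftR o 4), hex (o .&. 0xF)]
    hex = toUpper . intToDigit

-- | Encode each part first, and only then add the delimiters.
-- ASSUMPTION: leading slash per segment; empty query/fragment omitted;
-- query and fragment are encoded as opaque text.
encodeURL :: URL -> String
encodeURL u =
  concatMap (('/' :) . encodeComponent) (paths u)
    ++ part '?' (queryString u)
    ++ part '#' (frag u)
  where
    part sep s
      | null s    = ""
      | otherwise = sep : encodeComponent s
```

<div><br></div><div>The point is that an instance author only ever builds something like URL [&quot;foo/bar&quot;, &quot;日本語&quot;] &quot;&quot; &quot;&quot; and never sees a percent sign; the slash inside the first segment comes out as %2F because the delimiters are added after encoding.</div>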
<div><br></div><div>Does this make sense to you?</div><div><br></div><div>The key now is seeing if someone has already created a suitable URL type that we can use...</div><div><br></div><div>- jeremy</div><div><br></div><div class="gmail_quote">
On Fri, Mar 19, 2010 at 5:55 PM, Michael Snoyman <span dir="ltr">&lt;<a href="mailto:michael@snoyman.com">michael@snoyman.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div><div></div><div class="h5"><a href="http://www.ietf.org/rfc/rfc2396.txt">http://www.ietf.org/rfc/rfc2396.txt</a><br><br><div class="gmail_quote">On Fri, Mar 19, 2010 at 2:41 PM, Jeremy Shaw <span dir="ltr">&lt;<a href="mailto:jeremy@n-heptane.com" target="_blank">jeremy@n-heptane.com</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div>On Fri, Mar 19, 2010 at 5:22 PM, Michael Snoyman <span dir="ltr">&lt;<a href="mailto:michael@snoyman.com" target="_blank">michael@snoyman.com</a>&gt;</span> wrote:<br></div><div class="gmail_quote"><div>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="gmail_quote"><div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="gmail_quote"><div>I am not going to have time to look at this again until Saturday or Sunday. There are a few minor details that have been swept under the rug that need to be addressed. For example, when exactly should url encoding / decoding take place? It&#39;s not good if that happens twice or not at all.</div>

<div><br></div><font color="#888888"><div><font color="#000000"><font color="#888888"><br></font></font></div></font></div></blockquote></div><div>Just to confuse the topic even more: if we do real URL encoding/decoding, I believe we would have to assume a certain character set. I had to deal with a site that was encoded in non-UTF8 just a bit ago, and dealing with query parameters is not fun.</div>

<div><br></div><div>That said, perhaps we should consider making the type of PathInfo &quot;PathInfo ByteString&quot; so we make it clear that we&#39;re doing no character encoding.</div></div></blockquote><div><br></div>

</div><div>Yeah. I dunno. I just know it needs to be solved :)&nbsp;</div><div><div>&nbsp;</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="gmail_quote">
<div></div><div>Another issue in the same vein is dealing with leading and trailing slashes, though I think this is fairly simple in practice: the web app knows what to do about the trailing slashes, and each plugin should always pass a leading slash.</div>

<div><br></div><div>(ignoring any escaping &nbsp;that needs to happen in title, and ignoring an AbsPath / PathInfo stuff).</div><div><br></div><div>But we could, of course, do it the other way:</div><div><br></div><div><div><br>

- jeremy</div></font></div>
</blockquote></div><br></div></div><div>Then here&#39;s a proposal for both issues at once:</div><div><br></div><div>* PathInfo is a ByteString</div><div>* handleWai strips the leading slash from the path-info</div><div>
* every component parses and generates URLs without a leading slash. Trailing slash is application&#39;s choice.</div>
<div><br></div><div>Regarding URL encoding, let me point out that the following are two different URLs (just try clicking on them):</div><div class="im"><div><br></div><div><a href="http://www.snoyman.com/blog/entry/persistent-plugs/" target="_blank">http://www.snoyman.com/blog/entry/persistent-plugs/</a></div>

</div><div><a href="http://www.snoyman.com/blog/entry/persistent-plugs/" target="_blank">http://www.snoyman.com/blog/entry%2Fpersistent-plugs/</a></div><div><br></div><div>In other words, if we ever URL-decode the string before it reaches the application, we will have conflated unique URLs. I see two options here:</div>

<div><br></div><div>* We specify that PathInfo contains URL-encoded values. Any fromUrl/toUrl functions must be aware of this fact.</div><div>* We change the type of PathInfo to [ByteString], where we split the PathInfo by slashes, and specify that the pieces are *not* URL-encoded. In order to preserve the original value perfectly, we should not combine adjacent delimiters. In other words:</div>

<div>/foo%2Fbar/baz/ -&gt; [&quot;foo/bar&quot;, &quot;baz&quot;, &quot;&quot;]</div><div>/foo//bar/baz -&gt; [&quot;foo&quot;, &quot;&quot;, &quot;bar&quot;, &quot;baz&quot;]</div><div><br></div><div>I&#39;m not strongly attached to any of this. Also, my original motivation for breaking up the pieces (easier pattern matching) will be mitigated by the usage of ByteStrings.</div>