New patch (patch10). Details on Rietveld review tracker
(http://codereview.appspot.com/2827).
Another update on the remaining "outstanding issues":
Resolved issues since last time:
> Should unquote accept a bytes/bytearray as well as a str?
No. But see below.
> Lib/email/utils.py:
> Should encode_rfc2231 with charset=None accept strings with non-ASCII
> characters, and just encode them to UTF-8?
Implemented Antoine's fix ("or 'ascii'").
> Should quote accept safe characters outside the
> ASCII range (thereby potentially producing invalid URIs)?
No.
New issues:
unquote_to_bytes doesn't cope well with non-ASCII characters in a str
input (it currently encodes them as UTF-8; there is not a lot we can do,
since this is a str->bytes operation). However, we can allow it to accept
bytes as input (which unquote does not), in which case it preserves the
bytes precisely.
Discussion at http://codereview.appspot.com/2827/diff/82/84, line 265.
I have *implemented* that suggestion, so unquote_to_bytes now accepts
either bytes or str, while unquote accepts only str. No changes need
to be made unless there is disagreement on that decision.
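
To illustrate the difference between the two input types, here is a
minimal sketch. It assumes a Python 3 interpreter in which
urllib.parse.unquote_to_bytes is available as described in this patch;
the sample strings are made up for illustration.

```python
from urllib.parse import unquote_to_bytes

# str input: percent-escapes become raw bytes, and any non-ASCII
# characters left in the str are encoded as UTF-8.
from_str = unquote_to_bytes("caf\u00e9%20%FF")

# bytes input: non-ASCII bytes pass through untouched.
from_bytes = unquote_to_bytes(b"caf\xe9%20%FF")

print(from_str)    # b'caf\xc3\xa9 \xff'
print(from_bytes)  # b'caf\xe9 \xff'
```

Only the bytes input keeps the original 0xE9 octet; the str input has
to pick an encoding for it, and UTF-8 turns it into two bytes.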
I also emailed Barry Warsaw about the email/utils.py patch (because we
weren't sure exactly what that code was doing). However, I'm sure that
this patch isn't breaking anything there, because I call unquote with
encoding="latin-1", which is the same behaviour as the current head.
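
A small sketch of why encoding="latin-1" is lossless here. This is not
the code in email/utils.py, just an illustration with a made-up input,
assuming the "encoding" argument behaves as described in this patch.

```python
from urllib.parse import quote, unquote

# Latin-1 maps byte value N to code point N, so every percent-encoded
# octet decodes to exactly one character and nothing is lost.
decoded = unquote("na%EFve%FF", encoding="latin-1")

# Re-quoting with the same encoding reproduces the original octets.
requoted = quote(decoded, encoding="latin-1")

print(requoted)  # na%EFve%FF
```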
That's all the issues I have left over in this patch.
Attaching patch 10 (for revision 65675).
Commit log for patch 10:
Fix for issue 3300.
urllib.parse.unquote:
Added "encoding" and "errors" optional arguments, allowing the caller
to determine the decoding of percent-encoded octets.
As per RFC 3986, default is "utf-8" (previously implicitly decoded
as ISO-8859-1).
Fixed a bug in which mixed-case hex digits (such as "%aF") weren't
being decoded at all.
urllib.parse.quote:
Added "encoding" and "errors" optional arguments, allowing the
caller to determine the encoding of non-ASCII characters
before being percent-encoded.
Default is "utf-8" (previously characters in range(128, 256)
were encoded as ISO-8859-1, and characters above that as UTF-8).
Characters/bytes with values 128 and above are no longer allowed to be "safe".
Now allows either bytes or strings.
Optimised "Quoter"; now inherits from defaultdict.
Added functions urllib.parse.quote_from_bytes,
urllib.parse.unquote_to_bytes.
All quote/unquote functions now exported from the module.
Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
reflect new interface, added quote_from_bytes and unquote_to_bytes.
Lib/test/test_urllib.py: Added many new test cases testing encoding
and decoding Unicode strings with various encodings, as well as testing
the new functions.
Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py,
Lib/test/test_wsgiref.py: Updated and added test cases to deal with
UTF-8-encoded URIs.
Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding="latin-1", to preserve existing behaviour (which the email
module is dependent upon).
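
For reference, a short sketch of the interface the commit log describes.
It assumes the functions and the "encoding"/"errors" arguments behave as
listed above; the inputs are illustrative only.

```python
from urllib.parse import quote, quote_from_bytes, unquote

# unquote: the default decoding is UTF-8; "encoding" overrides it.
euro = unquote("%E2%82%AC")                         # U+20AC
e_acute = unquote("%E9", encoding="latin-1")        # U+00E9
# Mixed-case hex digits are decoded as well.
macron = unquote("%aF", encoding="latin-1")         # U+00AF
# "errors" controls what happens to octets that are invalid in the
# chosen encoding; "replace" yields U+FFFD.
replaced = unquote("%E9", encoding="utf-8", errors="replace")

# quote: non-ASCII characters are encoded as UTF-8 by default.
q_utf8 = quote("\u00e9")                            # '%C3%A9'
q_latin1 = quote("\u00e9", encoding="latin-1")      # '%E9'
# bytes are percent-encoded directly, with no text encoding step.
q_bytes = quote_from_bytes(b"a b\xff")              # 'a%20b%FF'
```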