In case it isn't clear what the problem is: the bug occurs when an app that uses non-utf8 encoded data (such as a bittorrent tracker, apparently) in a query parameter has one of its urls arrive without a required trailing slash. The middleware that appends the slash is attempting to use request.GET.urlencode() to restore the query string to its original percent-encoded state, but that doesn't work. The request.GET QueryDict was created under an assumption that the data was utf8 encoded, but given that it was not, the resulting unicode data in request.GET is likely full of unicode replacement characters for each invalid set of bytes in the original input. So the redirect that gets sent does have a trailing slash, but does not contain the same query string as the original request.

There's no way to 'undo' the incorrect decoding of what's in request.GET, but the original percent-encoded string is available in request.META['QueryString'], so that is what should be used here, I believe.

In case it's useful, there is a way to get the bytes that were part of the request: set request.encoding to iso-8859-1 or latin1. It's a Python "feature" that the latin1 codec is actually an identity mapping, even for bytes that aren't part of latin1/ISO-8859-1. So you'll then get back a unicode type object whose contents are the original bytes.

Ah, I did not realize that about the latin1 codec in Python. So alternatively the middleware could set request.encoding so that request.GET QueryDict gets (re?)built with no mangling before doing the urlencode(). Seems more straightforward to just go ahead and use request.META['QueryString'] though? Or is there some case where that won't work?

The only advantage of using request.GET that I can see is that it will always be present. In theory, QUERY_STRING should always be present, too, but I notice that in the different handlers (modpython and wsgi), we construct the GET dictionary via different methods, which makes me wonder why. However, if the raw input is always available, then using that will avoid any decoding/encoding round trip changes (ordering, normalisation, whatever may happen). So either way is probably sufficient.