Problem with urllib2 as it stands: many things would be
nice to implement as a handler rather than by overriding
methods (and inevitably duplicating code and increasing
fragility), but it's not always possible to do so. For
example (all from HTTP), automatically adding Referer
headers, handling 200 responses that should have been
non-2xx errors if the server were sane, handling cookies,
handling HTTP-EQUIV headers as if they were normal
HTTP headers, automatically making responses
seekable, and following Refresh headers. I've done all
these things, but I had to duplicate code to do it,
because I don't see how to do it with handlers. I've now
rewritten this code by adding a 'processor' scheme to
urllib2 (I'm *not* using 'scheme' in the technical URL
sense here!).
Processors work quite similarly to handlers, except that
1. They always *all* get run (rather than just the first to
handle a request or response -- unlike handlers).
2. The methods that get called on processors are of the
form <proto>_request and <proto>_response, and are
called, respectively, immediately before and immediately
after the normal OpenerDirector.open machinery.
http_request, for example, gets called on all processors
before, and pre-processes HTTP requests; http_response
post-processes HTTP responses.
3. <proto>_request methods return request objects, and
<proto>_response methods return response objects.
4. Even 200 responses get processed.
You use it like this:
# just pass processors to build_opener as if they were
handlers
opener = build_opener(FooHandler, BarProcessor,
BazHandler)
response = opener.open("http://www.example.com/")
I've reimplemented all my stuff (the features listed in the
first paragraph, above) in terms of this scheme, and it all
seems to be working fine (but no unit tests yet). So, the
scheme does achieve the extensibility it aims for. The
patch I've attached here doesn't include most of those
features -- the only new functionality it adds is an
HTTPRefererProcessor. If this gets accepted, I intend to
submit patches adding new processors for cookie
handling etc. later.
Two things I'd like to know: 1. will my solution break
people's code 2. is there a better way?
For 1., I *think* it shouldn't break code.
For 2., the obvious problem with my solution (below) is
that handlers are pretty similar to my processors already.
The thing is, I can't see how to reimplement these things
in terms of handlers. First, I need to *see* all requests
(responses) -- I can do that using handlers by giving
them low (high) .handler_order in Python 2.3 and
returning None from http_open (http_error_xxx).
However, 200 responses never get seen by
http_error_xxx, so that doesn't work (and changing that
would break old code). Second, I need to actually
modify the requests and responses. Sometimes I'd much
rather do that by making a new request or response than
modifying the old one in-place (redirections, for
example) -- and in general, even if I *am* just modifying
in-place, I'd still prefer to explictly return the object than
rely on side-effects. Perhaps just adding a couple of
hooks to AbstractHTTPHandler might get these jobs
done, but I think the increased simplicity of
AbstractHTTPHandler.do_open and the various
processors makes my approach worthwhile (assuming it
actually works & is backwards-compat., of course...).
Comments?
A few notes:
Some headers (Content-Length, Referer, ...) mustn't be
copied to requests for a redirected URL. This requires
the addition of a new dict to Request. I've added an
add_unredirected_headers method, too. The old
implementation just sends these headers directly, but
that's not possible if you want to use procesors to
implement these things.
The current response object (httplib.HTTPResponse,
wrapped with urllib.addinfourl) doesn't include response
code or message (because code is always 200). The
patch just assigns .code and .msg attributes (maybe they
should be methods, for uniformity with the rest of the
response interface).
Backwards-compatibility notes:
People who override AbstractHTTPHandler.do_open will
do non-200 response handling there, which will break
processors, but that's a forwards-compat. issue. I don't
think the existence of overridden do_open methods in old
code should be a problem for backwards-compatibility.
Note that, though processors see all responses, the end
user still only gets 200 responses returned.
ErrorProcessor ensures that by passing non-200
responses on to the existing urllib2 error machinery.
John

Logged In: YES
user_id=261020
I just noticed the patch breaks on https. Trivially fixed by
adding lines like https_request = http_request to the various
processor classes.
Also, another use case: gzip Content-encoding.

Logged In: YES
user_id=357491
Sounds kind of like Apache's filters. The idea seems fine, but
perhaps this could instead be worked in with Guido's sio package
in the CVS sandbox? Seems to be a similar idea. Perhaps there
could some way of hooking that code into urllib2?

Logged In: YES
user_id=261020
Possibly similar to Apache filters, but sio's filters seem to be for
filtering stream data rather than request / response objects --
has no concept of headers, for example.

Logged In: YES
user_id=261020
If anybody wants to see some concrete examples of use of this
patch, ask me & I'll mail them to you (actually, they use my
urllib2 extension module rather than the patch, but the
differences are trivial).
BTW Jeremy, any guess about when in your Copious Free
Time you're likely to get to this? I'm wondering whether to just
release my code as-is, or wait for your comments.

Logged In: YES
user_id=261020
Just to note a minor change that should happen if this gets
accepted: processors shouldn't be separate objects, but rather
just a new interface that handler objects can support. That
way, a single object can implement both interfaces. This
makes implementing response cache handlers easier, for
example.
Not uploading a new patch, since it's a trivial code change.
It might also be useful to have default_request and
default_response methods (by analogy with default_open), for
example for any response caching that's independent of
protocol scheme.