DESCRIPTION

When you process content from a website, you fetch pages one by one and extract data from each page with a regexp, a DOM parser or something else. Sometimes we make errors in our data extractors and realize this only after all 1_000_000 pages have been processed. Then we have to fix the extraction logic and start the whole process from the beginning. Please STOP! How about a cache? Yes, you can cache all responses, and the second, third and later attempts will be very fast.

LWP::UserAgent::Cached is yet another LWP::UserAgent subclass with cache support. It stores the cache in files on the local filesystem and, if a response is already available in the cache, returns it instead of making an HTTP request. This module was written because the other available alternatives didn't meet my needs.

cache_dir() or cache_dir($dir)

Gets or sets the path to the directory where the cache will be stored. If not set, the useragent will behave as LWP::UserAgent without cache support.
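A minimal sketch of the getter and setter forms described above. The temporary directory is only an assumption made to keep the example self-contained; any existing, writable directory should do.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use LWP::UserAgent::Cached;

# Hypothetical cache location: a temporary directory removed at exit.
my $dir = tempdir(CLEANUP => 1);

# Without cache_dir the object behaves like a plain LWP::UserAgent.
my $ua = LWP::UserAgent::Cached->new;

# Setter form: enable caching into $dir.
$ua->cache_dir($dir);

# Getter form: returns the currently configured directory.
print "cache dir: ", $ua->cache_dir, "\n";
```

The same value can also be passed to the constructor as cache_dir => $dir.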

nocache_if() or nocache_if($sub)

Gets or sets a reference to a subroutine which will be called after receiving each non-cached response. The first parameter of this subroutine is the HTTP::Response object. The subroutine should return true if the response should not be cached and false otherwise. If not set, all responses will be cached.
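As an illustration, the hook below keeps server errors out of the cache. The policy (status 500 and above) and the name $skip_server_errors are assumptions chosen for this sketch, not something the module prescribes.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use LWP::UserAgent::Cached;

my $ua = LWP::UserAgent::Cached->new(cache_dir => tempdir(CLEANUP => 1));

# Return true for responses that should NOT be written to the cache.
# Here: any 5xx status (an example policy only).
my $skip_server_errors = sub {
    my ($response) = @_;
    return $response->code >= 500;
};

$ua->nocache_if($skip_server_errors);
```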

recache_if() or recache_if($sub)

Gets or sets a reference to a subroutine which will be called for each response available in the cache. The first parameter of this subroutine is the HTTP::Response object, the second is the path to the cache file, and the third is the HTTP::Request object. The subroutine should return true if the response needs to be recached (a new HTTP request will be made) and false otherwise. It is called only if the response is already available in the cache. Here you can also modify the request for your needs; this will not change the name of the cache file.
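A sketch of such a hook, using all three parameters named above. The rule itself (refetch cached 404 responses older than one day) is just an example policy, and $refresh_old_404 is a hypothetical name.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use LWP::UserAgent::Cached;

my $ua = LWP::UserAgent::Cached->new(cache_dir => tempdir(CLEANUP => 1));

# Return true when the cached copy should be thrown away and refetched.
my $refresh_old_404 = sub {
    my ($response, $path, $request) = @_;
    # -M gives the age of the cache file in days (relative to script start).
    return $response->code == 404 && (-M $path) > 1;
};

$ua->recache_if($refresh_old_404);
```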

on_uncached() or on_uncached($sub)

Gets or sets a reference to a subroutine which will be called for each non-cached HTTP request, before the request is actually sent. The first parameter of this subroutine is the HTTP::Request object. Here you can also modify the request for your needs; this will not change the name of the cache file.
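For instance, the hook can count real network requests and adjust the outgoing request. The rotating User-Agent list and the names @agents, $uncached_requests and $before_fetch are assumptions made for this sketch.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use LWP::UserAgent::Cached;

my $ua = LWP::UserAgent::Cached->new(cache_dir => tempdir(CLEANUP => 1));

my @agents = ('Mozilla/5.0', 'Mozilla/4.0');   # hypothetical values
my $uncached_requests = 0;

# Called only when a request is about to go to the network.
my $before_fetch = sub {
    my ($request) = @_;
    # Rotate the User-Agent header; per the description above, changes
    # made here do not affect the name of the cache file.
    $request->header('User-Agent' => $agents[$uncached_requests % @agents]);
    $uncached_requests++;
};

$ua->on_uncached($before_fetch);
```

A delay such as sleep could be placed in the same hook to slow down only the requests that actually reach the server.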

cachename_spec() or cachename_spec($spec)

Gets or sets a hash reference with the cache naming specification. The cache name for each request is based on the request content; internally it is md5_hex($request->as_string). But what if some of the request headers in your program change dynamically, e.g. User-Agent or Cookie? In that case caching will not work properly for you, so we need some way to omit these headers when calculating the cache name. This option is what you need. The specification hash should contain header names and the header values which will be used (instead of the values in the request) while calculating the cache name.
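The effect of a changing header can be seen with a small standalone sketch. naive_cache_name() is a hypothetical helper that only mirrors the md5_hex($request->as_string) formula quoted above; it is not the module's actual code, and the URL is a placeholder.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use HTTP::Request;

# Hypothetical helper: hash the full textual form of the request.
sub naive_cache_name {
    my ($request) = @_;
    return md5_hex($request->as_string);
}

# Two requests for the same URL that differ only in User-Agent.
my $first  = HTTP::Request->new(GET => 'http://example.com/', ['User-Agent' => 'Mozilla/5.0']);
my $second = HTTP::Request->new(GET => 'http://example.com/', ['User-Agent' => 'Mozilla/4.0']);

my $name_first  = naive_cache_name($first);
my $name_second = naive_cache_name($second);

print "$name_first\n$name_second\n";
print "names differ\n" if $name_first ne $name_second;
```

Because the header value is part of the hashed string, the second request would miss a cache entry written for the first one.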

For example, suppose we already have a cache where the 'User-Agent' value in the headers was 'Mozilla/5.0', but in the current version of the program it changes for each request. So we specify that for cache name calculation 'User-Agent' should always be 'Mozilla/5.0'. The cached requests had no 'Accept' header, but in the current version they have one. So we specify that this header must not be included in the cache name calculation.
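That scenario might be configured as in the sketch below. It assumes that an undef value means "leave this header out", by analogy with the '_body' key; the temporary cache directory is only there to make the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use LWP::UserAgent::Cached;

my $ua = LWP::UserAgent::Cached->new(
    cache_dir      => tempdir(CLEANUP => 1),
    cachename_spec => {
        'User-Agent' => 'Mozilla/5.0',  # always use this value for the cache name
        'Accept'     => undef,          # assumed: undef omits the header
    },
);
```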

The specification hash may contain two special keys: '_body' and '_headers'. With the '_body' key you can specify the body content to be used for cache name calculation. For example, to exclude the body from cache name calculation, set '_body' to undef or to an empty string. With the '_headers' key you can specify which headers of $request should be included in cache name calculation. For example, you can say to include only 'Host' and 'Referer'. The '_headers' value should be an array reference:

    cachename_spec => {
        _body    => undef,    # omit body
        _headers => ['Host'], # include only Host, with the value from the request
        # It will be something like:
        # md5_hex("METHOD url\r\nHost: host\r\n\r\n")
        # method and url will be included in any case
    }

Another example: omit the body, include only the 'Host' and 'User-Agent' headers, use the 'Host' value from the request and a specified 'User-Agent' value, and in addition include the referrer with a specified value ('Referer' is not listed in '_headers', but values from the main specification hash have higher priority):
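One way the specification just described might be written is sketched here. The 'Referer' value is a placeholder and the temporary cache directory is an assumption made so the example runs on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use LWP::UserAgent::Cached;

my $ua = LWP::UserAgent::Cached->new(
    cache_dir      => tempdir(CLEANUP => 1),
    cachename_spec => {
        _body        => undef,                   # omit body
        _headers     => ['Host', 'User-Agent'],  # only these come from the request
        'User-Agent' => 'Mozilla/5.0',           # but force this User-Agent value
        'Referer'    => 'http://example.com/',   # placeholder value, always included
    },
);
```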

last_cached()

Returns a list of paths to the cache files stored by the last non-cached response. The list may contain more than one element if there was a redirect.

last_used_cache()

Returns a list of paths to the cache files used in the last response. This includes files just stored (last_cached) and files that may already exist (cached earlier). The list may contain more than one element if there was a redirect.
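The difference between the two lists can be observed with a small offline sketch. It stubs the network with a standard LWP "request_send" handler so that no real request is made; the URL and the canned response are assumptions for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use HTTP::Response;
use LWP::UserAgent::Cached;

my $ua = LWP::UserAgent::Cached->new(cache_dir => tempdir(CLEANUP => 1));

# Answer every request locally instead of going to the network.
$ua->add_handler(request_send => sub {
    return HTTP::Response->new(200, 'OK', ['Content-Type' => 'text/plain'], "hello\n");
});

my $url = 'http://example.com/';   # hypothetical URL

# First call: nothing is cached yet, so the response gets stored.
$ua->get($url);
my @stored_first = $ua->last_cached;
my @used_first   = $ua->last_used_cache;

# Second call: the response should now come from the cache.
$ua->get($url);
my @stored_second = $ua->last_cached;
my @used_second   = $ua->last_used_cache;

print "first call:  stored=", scalar(@stored_first),  " used=", scalar(@used_first),  "\n";
print "second call: stored=", scalar(@stored_second), " used=", scalar(@used_second), "\n";
```

Going by the descriptions above, the second call should report nothing newly stored while still listing the cache file it used.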

uncache()

Proxy and cache name

Here you can see how changing the proxy for the useragent affects the cache name.

HTTP proxy

HTTP proxy support works out of the box and causes no problems. Changing the proxy server will not affect the cache name.

HTTPS proxy

Proper HTTPS proxy support was added in LWP 6.06 and causes no problems. Changing the proxy server will not affect the cache name.

CONNECT proxy

CONNECT proxy support may be added using LWP::Protocol::connect. The problem is that this module uses LWP's request() to create the CONNECT tunnel, so the tunnel response will be cached, although it shouldn't be. To work around this, install a nocache_if hook.
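Such a hook could look like the sketch below, which refuses to cache any response whose originating request used the CONNECT method. The name $skip_connect is hypothetical, and loading LWP::Protocol::connect and configuring the proxy are left out.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use LWP::UserAgent::Cached;

my $ua = LWP::UserAgent::Cached->new(cache_dir => tempdir(CLEANUP => 1));

# Do not cache the response produced while creating the tunnel.
my $skip_connect = sub {
    my ($response) = @_;
    return $response->request->method eq 'CONNECT';
};

$ua->nocache_if($skip_connect);
```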