Common Lisp Async Web Scraping

The set of tools for web scraping in Common Lisp is pretty complete
and pleasant. In this short tutorial we’ll see how to make HTTP
requests, parse HTML, extract content and make asynchronous requests.

Our simple task will be to extract the list of links on the CL
Cookbook’s index page and check if they are reachable.

Note: to find the CSS selector of the element I’m interested in, I
right-click on the element in the browser and choose “Inspect
element”. This opens my browser’s web dev tools inspector, where I can
study the page structure.

So the links I want to extract are inside an element with the id
“content”, in regular list items (li).
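With the usual CL scraping stack (Dexador to fetch, Plump to parse, lquery to query with CSS selectors), the extraction could be sketched like this. The variable names and the exact selector are assumptions on my part, not the tutorial’s own code:

```lisp
;; Sketch: fetch the index page, parse it, and collect the link hrefs.
;; Assumes the dexador, plump and lquery libraries are loaded
;; (e.g. with (ql:quickload '("dexador" "plump" "lquery"))).
(defvar *index* (dex:get "https://lispcookbook.github.io/cl-cookbook/"))

;; lquery's `initialize` parses the HTML string into a Plump document.
(defvar *parsed* (lquery:$ (initialize *index*)))

;; Query the list items under #content; `attr` returns a vector of strings.
(defvar *urls* (lquery:$ *parsed* "#content li a" (attr :href)))
```

lquery results are vectors, which is why the rest of the tutorial manipulates the urls with vector operations.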

And now to the real work. For every url, we want to request it and
check that its return code is 200. We have to ignore certain errors:
indeed, a request can time out, be redirected (we don’t want that) or
return an error code.

To be in real conditions we’ll add a link that times out in our list:

```lisp
(setf (aref *filtered-urls* 0) "http://lisp.org") ;; too bad indeed
```

We’ll take the simple approach of ignoring errors and returning nil in
that case. If all goes well, we return the return code, which should
be 200.

As we saw at the beginning, dex:get returns multiple values, including
the return code. We’ll capture only this one with nth-value (instead
of all of them with multiple-value-bind), and we’ll use
ignore-errors, which returns nil in case of an error. We could also
use handler-case and catch specific error types (see examples in
Dexador’s documentation) or (better yet?) use handler-bind to catch
any condition.
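Putting that together, a minimal version of the check might look like this (the function name fetch-status is my own; what the tutorial relies on is that dex:get’s second return value is the status code):

```lisp
;; Return the HTTP status code for URL, or NIL on timeout,
;; connection failure or any other error condition.
(defun fetch-status (url)
  (ignore-errors
    (nth-value 1 (dex:get url))))
```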

(ignore-errors has the caveat that when there’s an error, we cannot
return the element it came from. We’ll manage anyway.)

Bingo. It still takes more than 10 seconds, because we wait 10 seconds
for the one request that times out. But otherwise it processes all the
HTTP requests in parallel, so it is much faster.
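For reference, a parallel mapping like this can be done with lparallel’s pmap. I’m assuming here that lparallel is the library in play; the kernel size and variable names are illustrative:

```lisp
;; Assumes the lparallel library is loaded. A kernel (thread pool)
;; must be created once per session.
(setf lparallel:*kernel* (lparallel:make-kernel 8))

;; pmap has the same interface as map, but runs the function on the
;; kernel's worker threads, so all requests are in flight at once.
(lparallel:pmap 'vector
                (lambda (url)
                  (ignore-errors (nth-value 1 (dex:get url))))
                *filtered-urls*)
```

The total time is then bounded by the slowest single request rather than the sum of them all.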

Shall we get the urls that aren’t reachable, remove them from our
list, and measure the execution time in the sync and async cases?

What we do is: instead of returning only the return code, we check
that it is valid and we return the url:

```lisp
... (if (and status (= 200 status)) it) ...
(defvar *valid-urls* *)
```
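Spelled out without the anaphoric `it`, and assuming an lparallel kernel is set up and *filtered-urls* holds our vector of urls, the full mapping might be:

```lisp
;; Keep the url when its status is 200, NIL otherwise.
(defvar *valid-urls*
  (lparallel:pmap 'vector
                  (lambda (url)
                    (let ((status (ignore-errors
                                    (nth-value 1 (dex:get url)))))
                      (when (and status (= 200 status))
                        url)))
                  *filtered-urls*))
```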

We get a vector of urls with a couple of nils: indeed, I thought I
would have only one unreachable url, but I discovered another
one. Hopefully I’ll have pushed a fix before you try this tutorial.

But what are they? We saw the status codes but not the urls :S We
have a vector with all the urls and another with the valid ones. We’ll
simply treat them as sets and compute their difference, which will
show us the bad ones. We must transform our vectors to lists for that.
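With the standard set-difference, which works on lists, that could look like this (variable names carried over from earlier; coerce converts a vector to a list):

```lisp
;; Elements of the first list that are absent from the second,
;; i.e. the unreachable urls. The stray NILs in *valid-urls* don't
;; hurt: they simply never match a url under #'equal.
(set-difference (coerce *filtered-urls* 'list)
                (coerce *valid-urls* 'list)
                :test #'equal)
```

The :test #'equal is needed because the urls are strings, and set-difference compares with eql by default.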