Project description

Have you ever written a web scraper, only to find out after
a long time that there’s some extra data on the pages you
should’ve been scraping all along?

Or has a change to a website broken your scraper, costing you
days or weeks of data before you could find the time to fix it?

This library aims to solve these problems by splitting a Scrapy scraper into two separate, asynchronously run stages:

1. Download stage - the website is crawled, and the pages to
   be scraped are downloaded and saved to disk.

2. Extract stage - the saved pages are loaded from disk, the
   desired data is extracted, and the results are exported (e.g. to
   a file or database).

The crawler logic for the download stage should be kept as simple
as possible. It would typically open a known URL and perform very
simple actions such as clicking a “next page” button or submitting
a search query. This reduces the risk of the downloader breaking
when minor changes are made to the website.

And because all of the raw data is saved, you can change your
extraction logic at any time and simply re-run the extractor over
all of the data that has already been downloaded.