SCRAPY with REDIS – A Distributed Approach

Anyone who wants to scrape content from web pages should take a look at Scrapy, a highly comprehensive web scraping framework.

Scrapy is an application framework for crawling websites and extracting structured data, which can be used for a wide range of useful applications, like data mining, information processing, or historical archival.

(This blog is not a tutorial for learning Scrapy; refer to the Scrapy docs for that.)

Redis, on the other hand, is a “NoSQL” key-value data store. More precisely, it is a data structure server. The closest analogue is probably Memcached, but with built-in persistence. Persistence to disk means you can use Redis as a real database instead of just a volatile cache: unlike with Memcached, the data won’t disappear when you restart.

Like Memcached, the entire data set is stored in memory, so Redis is extremely fast, often even faster than Memcached. Redis used to offer virtual memory, where rarely used values were swapped out to disk so that only the keys had to fit in memory, but this feature has been deprecated. Going forward, the use cases for Redis are those where it is possible (and desirable) for the entire data set to fit into memory.

Redis is a fantastic choice if you want a highly scalable data store shared by multiple processes, multiple applications, or multiple servers.
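To make the shared-queue idea concrete, here is a minimal sketch of the LPUSH/RPOP semantics a distributed crawler relies on. A tiny in-memory stand-in is used so the snippet runs without a live server; in production you would create a `redis.Redis(...)` client instead, and the key name is only an illustration:

```python
from collections import defaultdict, deque

class FakeRedis:
    """In-memory stand-in mimicking the two Redis list commands we need."""
    def __init__(self):
        self._lists = defaultdict(deque)

    def lpush(self, key, *values):
        # Push each value onto the left (head) of the list, like Redis LPUSH.
        for v in values:
            self._lists[key].appendleft(v)
        return len(self._lists[key])

    def rpop(self, key):
        # Pop from the right (tail); returns None when the list is empty.
        try:
            return self._lists[key].pop()
        except IndexError:
            return None

r = FakeRedis()

# One process seeds the queue with URLs to crawl...
r.lpush("myspider:start_urls",
        "http://example.com/page/1",
        "http://example.com/page/2")

# ...and each spider process pops the next URL, giving FIFO order.
url = r.rpop("myspider:start_urls")
print(url)  # -> http://example.com/page/1
```

Because every spider talks to the same list, two workers never crawl the same seed URL: a pop removes it for everyone.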

How do you coordinate Scrapy and Redis to obtain a distributed framework?

To obtain this useful distributed architecture you need your own servers (machines running a Redis-supported operating system), or you can use the APIs of paid server providers such as Linode or DigitalOcean. With these APIs you create as many servers as the amount of data you need to scrape, and the time available to scrape it, demand. You also need a redis-server whose bind address has been changed to 0.0.0.0, so that it accepts connections from the other servers' IP addresses and not just the loopback address 127.0.0.1.
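The bind address is set in redis.conf; a minimal sketch of the change (assuming the default config path on a Debian/Ubuntu server) looks like this:

```conf
# /etc/redis/redis.conf
# Listen on all interfaces so the worker servers can reach the queue.
# Warning: 0.0.0.0 exposes Redis to the whole network, so only do this
# behind a firewall or with a password (requirepass) configured.
bind 0.0.0.0
port 6379
```

Restart the redis-server process after editing so the new bind address takes effect.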

Steps to accomplish this task:

We first need a dedicated redis-server, either your own physical machine or one created using the Linode API, to store the URLs of the webpages to scrape data from.
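One common way to wire the spiders to that dedicated redis-server is the open-source scrapy-redis extension, which swaps Scrapy's scheduler and duplicate filter for Redis-backed versions so every spider instance pulls from the same queue. A sketch of the relevant settings.py entries follows; the Redis address is a placeholder, not a real host:

```python
# settings.py -- sketch assuming the scrapy-redis extension is installed
# (pip install scrapy-redis); the host below is a placeholder.

# Use the Redis-backed scheduler so all spider processes share one queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests across every server via a shared Redis set.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue in Redis between runs instead of clearing it on close.
SCHEDULER_PERSIST = True

# Point every worker at the dedicated redis-server described above.
REDIS_URL = "redis://203.0.113.10:6379"
```

With these settings, the same spider code can be deployed unchanged to every worker server; only REDIS_URL ties them together.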

Then run a web app that calls the Linode API to create the worker servers, deploys your Scrapy spiders onto them, and starts the spiders.