Crawl a website with Scrapy

Introduction

In this article, we are going to see how to scrape information from a website, in particular, from all pages with a common URL pattern. We will see how to do that with Scrapy, a very powerful, and yet simple, scraping and web-crawling framework.

For example, you might be interested in scraping information about each article of a blog and storing that information in a database. To achieve this, we will see how to implement a simple spider using Scrapy, which will crawl the blog and store the extracted data in a MongoDB database.

We will assume that you have a working MongoDB server, and that you have installed the pymongo and scrapy Python packages, both installable with pip.

Our spider inherits from CrawlSpider, which “provides a convenient mechanism for following links by defining a set of rules”. See the CrawlSpider section of the Scrapy documentation for more information.

We then define two simple rules:

Follow links pointing to http://isbullsh.it/page/X

Extract information from pages defined by a URL of pattern http://isbullsh.it/YYYY/MM/title, using the callback method parse_blogpost.
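To make the two rules concrete, here is a minimal sketch of the URL matching they rely on, using only Python's re module. The two regular expressions are my assumptions (the exact patterns are up to you), and classify is a hypothetical helper, not part of Scrapy. In a CrawlSpider, patterns like these would be given to each rule's link extractor through its allow argument.

```python
import re

# Assumed patterns for the two rules described above.
PAGINATION = re.compile(r"page/\d+")        # http://isbullsh.it/page/X
BLOGPOST = re.compile(r"\d{4}/\d{2}/\w+")   # http://isbullsh.it/YYYY/MM/title


def classify(url):
    """Say what the spider would do with a link (hypothetical helper)."""
    if BLOGPOST.search(url):
        return "parse_blogpost"   # extract data with the callback
    if PAGINATION.search(url):
        return "follow"           # just follow the link to find more posts
    return "ignore"


# The blog post URL below is made up for the example.
for url in ("http://isbullsh.it/page/2",
            "http://isbullsh.it/2012/04/some-title",
            "http://isbullsh.it/about"):
    print(url, "->", classify(url))
```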

Extracting the data

To extract the title, author, etc., from the HTML code, we’ll use the scrapy.selector.HtmlXPathSelector object, which uses the libxml2 HTML parser. If you’re not familiar with this object, you should read the XPathSelector documentation.

We’ll now define the extraction logic in the parse_blogpost method. I’ll only define it for the title and tag(s); the logic is pretty much the same for the other fields.
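As an illustration of that logic, here is a minimal sketch that runs with the standard library only. It swaps Scrapy's selector for xml.etree.ElementTree, which supports a small subset of XPath and needs well-formed markup. The sample page, its structure (an h1 inside a header, tag links inside a paragraph with class "tags") and the post URL are all made up, so the XPath expressions will have to be adapted to the real pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a blog post page.
SAMPLE_PAGE = """
<html>
  <body>
    <header><h1>Crawl a website with Scrapy</h1></header>
    <p class="tags">
      <a href="/tag/python">python</a>
      <a href="/tag/scrapy">scrapy</a>
    </p>
  </body>
</html>
"""


def parse_blogpost(page_source, url):
    """Extract the title and tags of a post into a plain dict."""
    root = ET.fromstring(page_source.strip())
    title_node = root.find(".//header/h1")
    tag_nodes = root.findall(".//p[@class='tags']/a")
    return {
        "url": url,
        "title": title_node.text.strip() if title_node is not None else None,
        "tags": [node.text.strip() for node in tag_nodes],
    }


item = parse_blogpost(SAMPLE_PAGE, "http://isbullsh.it/2012/04/some-title")
print(item)
```

In the real spider, the same two lookups would be XPath queries run through the selector, and the result would be a Scrapy item rather than a dict.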

Note: To be sure of the XPath selectors you define, I’d advise you to use Firebug, Firefox Inspect, or equivalent, to inspect the HTML code of a page, and then test the selector in a Scrapy shell. Keep in mind that a selector only works if the data sits at the same position in the HTML of every page you crawl.

Store the results in MongoDB

Each time the parse_blogpost method returns an item, we want it to be sent to a pipeline which will validate the data and store everything in our Mongo collection.
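The pipeline shown further down reads four settings (MONGODB_SERVER, MONGODB_PORT, MONGODB_DB and MONGODB_COLLECTION), and it has to be enabled in the project settings. A minimal sketch of the corresponding settings.py fragment could look like this. The values are placeholders, and the module path assumes a project package named isbullshit; adjust both to your setup. Recent Scrapy releases expect ITEM_PIPELINES to be a dict mapping the class path to an integer that orders the pipelines.

```python
# settings.py (fragment)

# Assumed module path: a project package called "isbullshit" whose
# pipelines.py defines MongoDBPipeline. The integer orders the pipelines.
ITEM_PIPELINES = {
    "isbullshit.pipelines.MongoDBPipeline": 300,
}

# Settings read by the pipeline; the values are placeholders.
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "isbullshit"
MONGODB_COLLECTION = "blogposts"
```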

Now that we’ve defined our pipeline, our MongoDB database and collection, we’re just left with the pipeline implementation. We just want to be sure that we do not have any missing data (for example, a blogpost without a title or an author).

Here is our pipelines.py file:

import pymongo

from scrapy.exceptions import DropItem
from scrapy.conf import settings
from scrapy import log


class MongoDBPipeline(object):
    def __init__(self):
        # connect once, when the pipeline is created
        connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        valid = True
        for data in item:
            # here we only check if the data is not null
            # but we could do any crazy validation we want
            # (iterating over an item yields field names, so we test the value)
            if not item[data]:
                valid = False
                raise DropItem("Missing %s of blogpost from %s" % (data, item['url']))
        if valid:
            self.collection.insert(dict(item))
            log.msg("Item written to MongoDB database %s/%s" %
                    (settings['MONGODB_DB'], settings['MONGODB_COLLECTION']),
                    level=log.DEBUG, spider=spider)
        return item
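If you want to play with the validation rule without a MongoDB server, it can be tried on plain dictionaries. This is a minimal sketch, assuming that an item behaves like a dict and that "missing" means an empty or None value. missing_fields is a hypothetical helper and the sample data is made up.

```python
def missing_fields(item):
    """Return the names of the fields whose value is empty or None."""
    return [field for field, value in item.items() if not value]


complete = {
    "url": "http://isbullsh.it/2012/04/some-title",
    "title": "Some title",
    "tags": ["python"],
}
incomplete = {
    "url": "http://isbullsh.it/2012/04/some-title",
    "title": "",
    "tags": [],
}

print(missing_fields(complete))    # []
print(missing_fields(incomplete))  # ['title', 'tags']
```

A pipeline could drop the item as soon as this list is not empty, and name the missing fields in the error message.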

Release the spider!

Now, all we have to do is change directory to the root of our project and execute:

$ scrapy crawl isbullshit

The spider will then follow all links pointing to a blogpost, retrieve the post title, author name, date, etc., validate the extracted data, and store it all in a MongoDB collection if validation went well.

Pretty neat, hm?

Conclusion

This case is pretty simplistic: all URLs follow a similar pattern and all links are hard-coded in the HTML; there is no JS involved. In the case where the links you want to reach are generated by JS, you’d probably want to check out Selenium. You could make the spider more complex by adding new rules or more complicated regular expressions, but I just wanted to demo how Scrapy works, without getting into crazy regex explanations.

Also, be aware that sometimes, there’s a thin line between playing with web-scraping and getting into trouble.

Finally, when toying with web-crawling, keep in mind that you might just flood the server with requests, which can sometimes get you IP-blocked :)
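One way to go easier on the server is to throttle the spider from settings.py. The setting names below are standard Scrapy settings, but the values are arbitrary starting points of mine, not recommendations from any documentation, so tune them for the site you crawl.

```python
# settings.py (fragment): the values are arbitrary starting points
DOWNLOAD_DELAY = 1.0                 # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 2   # cap on simultaneous requests per domain
AUTOTHROTTLE_ENABLED = True          # let Scrapy adapt the delay to the server's latency
ROBOTSTXT_OBEY = True                # respect the site's robots.txt rules
```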