
Distributed Web Crawling with Tornado and Gearman

Welcome to Technology Tuesday, where the iAcquire development team will be sharing a little piece of our world with you once a week. Over the past three years we've been quietly working behind the scenes, building the technology that makes iAcquire's services impossible to compete with. Today we're going to share a method of high-performance distributed web crawling using Tornado, an asynchronous Python web framework, along with Gearman, a high-performance distributed task queue.

What Exactly Is A Distributed Web Crawler?

Let’s define the terms so we know what we’re talking about here:

Web Crawler – a computer program that pulls down a set of URLs and processes the pages in an automated fashion. A typical web crawler runs on a single computer and churns through a list of URLs one by one, or in parallel using multiple simultaneous connections, still from that single machine.

Distributed Web Crawler – a crawler whose fetch-and-parse workload is spread across multiple machines, coordinated through a shared queue of URLs, so throughput scales with the number of nodes instead of being capped by one computer's bandwidth and CPU.
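The single-machine, linear version described above fits in a few lines of Python. This is a minimal sketch using only the standard library; `extract_links` and `crawl` are our illustrative names, not code from this post:

```python
# Minimal sketch of a single-machine crawler: fetch each URL in turn and
# pull out the links on the page using the stdlib HTML parser.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def crawl(urls):
    """Churn through a URL list one by one, yielding (url, links) pairs."""
    for url in urls:
        html = urlopen(url).read().decode("utf-8", errors="replace")
        yield url, extract_links(html, url)
```

Simple, but every page waits on the one before it – which is exactly the bottleneck the distributed version removes.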

There Are Many Ways To Build A Web Crawler But None This Simple And Powerful

Sure, you can write yourself a Java crawler and deploy it on your own Hadoop cluster, or use Amazon Elastic MapReduce. You could write custom plugins for an existing solution – Nutch comes to mind. It's a nice crawler, but it's not something you can use to gather live data. We've made heavy use of Nutch, and we know what it takes to implement solid solutions on top of it: time. Today we're presenting a simple distributed crawler pattern we've affectionately named GearNado. It brings two easy-to-use, wonderfully designed systems together in harmony. GearNado lets you build out real-time crawl/parse operations to test your theories and get results, fast.
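The shape of the pattern is worth seeing on its own: a client pushes URLs onto a job queue, and any number of workers pull jobs off and fetch/parse them. In GearNado the queue is a Gearman job server and the workers live on separate machines; as a hedged, stdlib-only sketch of that same producer/worker shape (threads and `queue.Queue` standing in for Gearman – the function names are ours):

```python
# Stdlib stand-in for the GearNado pattern: a shared job queue feeding a
# pool of workers. In the real system the queue is a Gearman job server
# and each worker is a separate process, possibly on another machine.
import queue
import threading


def run_workers(urls, handle, num_workers=4):
    """Push every URL through `handle` using `num_workers` concurrent workers."""
    jobs = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            url = jobs.get()
            if url is None:          # sentinel: no more work for this worker
                break
            result = handle(url)     # the fetch/parse step goes here
            with lock:
                results.append(result)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for url in urls:
        jobs.put(url)
    for _ in threads:
        jobs.put(None)               # one shutdown sentinel per worker
    for t in threads:
        t.join()
    return results
```

Swap the in-process queue for a Gearman server and the threads for worker processes on other boxes, and the same few lines become a distributed crawler.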

Demo Time – Introducing TweetHandler And TweetScout

We’ve built a simple proof-of-concept Twitter username crawler on top of GearNado. Let’s say you have a list of authoritative URLs and you want to find the Twitter accounts that appear most frequently on the pages within the set. You don’t have time to sit around and wait for a desktop crawler app to crunch through the massive list of pages you’ve got. Enter our distributed real-time crawler – capable of fetching, parsing, and analyzing over 50 pages per second on a single node, and of being distributed across a nearly limitless number of nodes.
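The parse-and-analyze step a TweetScout worker performs can be sketched as a regex pass over each fetched page plus a running tally. The pattern and function names below are our illustration, not the production code:

```python
# Sketch of the analysis step: pull Twitter usernames out of page HTML and
# count how often each one appears across the whole URL set.
import re
from collections import Counter

# Matches twitter.com profile links (including old-style #!/ URLs) and
# plain @mentions. Illustrative, not the production pattern.
TWITTER_RE = re.compile(
    r"twitter\.com/(?:#!/)?([A-Za-z0-9_]{1,15})|@([A-Za-z0-9_]{1,15})"
)


def extract_usernames(html):
    names = []
    for link_name, mention_name in TWITTER_RE.findall(html):
        name = (link_name or mention_name).lower()
        if name not in ("share", "intent", "home"):  # skip common non-profiles
            names.append(name)
    return names


def count_usernames(pages):
    """Aggregate username frequency over many pages' HTML."""
    counts = Counter()
    for html in pages:
        counts.update(extract_usernames(html))
    return counts
```

`Counter.most_common()` then gives you the ranked list of accounts for the whole set.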

Get Yourself A Relevant URL List In Your Target Sector And Get Ready To Rock

There are numerous ways to obtain lists of authoritative URLs in the space you are researching. To get a good sample set for this post, I used a browser toolbar to export the top 400 results for the following Google keyword searches:

python web crawling

tornado web crawling

gearman web crawling

This resulted in a list of 1,134 URLs after filtering out PDFs and other unwanted URLs (twitter.com, etc). Let’s see what it looks like when we process these with 30 TweetScout workers:
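That filtering step – dropping PDFs and domains like twitter.com itself, which would skew the username counts – is a quick pass over the list. A hedged sketch, with the exclusion list as our assumption:

```python
# Sketch of the URL-list cleanup step: drop PDFs and domains we don't
# want to crawl before handing the list to the workers.
from urllib.parse import urlparse

EXCLUDED_DOMAINS = {"twitter.com", "www.twitter.com"}  # illustrative list


def filter_urls(urls):
    kept = []
    for url in urls:
        parsed = urlparse(url)
        if parsed.path.lower().endswith(".pdf"):
            continue
        if parsed.netloc.lower() in EXCLUDED_DOMAINS:
            continue
        kept.append(url)
    return kept
```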

OK, I Get It – This Distributed Web Crawler Kicks Ass… How Do I Use It?

First off, you’re going to need a Linux machine to get started. I’m going to provide instructions for getting everything set up using Ubuntu, but just about any modern Linux distribution will do.
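As a rough sketch of the pieces involved, the Ubuntu install boils down to the job server plus the two Python libraries. The package and pip names below are our assumptions about a typical install, not copied from the post:

```shell
# Install the Gearman job server plus Python tooling (package names assumed).
sudo apt-get update
sudo apt-get install -y gearman-job-server python-pip

# Install the Python libraries the crawler builds on.
sudo pip install tornado gearman

# Make sure the job server is running (it listens on port 4730 by default).
sudo service gearman-job-server start
```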

About jeffnappi

4 responses to “Distributed Web Crawling with Tornado and Gearman”

Something quite important I didn’t address in the post – the tests shown here are in fact only running on a single node. To run it in a distributed manner, one would just launch TweetScout workers on additional nodes with the --jobserver=master-server-ip:4730 parameter. I will be making a follow-up post about this in the coming weeks.
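To make that concrete, the launch sequence might look like this – gearmand's -d flag is standard, but beyond the --jobserver and --url_file parameters named in this thread, the TweetHandler.py/TweetScout.py invocations are our guesses:

```shell
# On the master node: start the Gearman job server, then submit the URL jobs.
gearmand -d                                  # daemonize the job server
python TweetHandler.py --url_file=urls.txt   # urls.txt: one URL per line

# On each additional worker node: point TweetScout at the master.
python TweetScout.py --jobserver=master-server-ip:4730
```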

Hi Jon, sorry for the delayed reply. I plan on providing an easy-to-use Amazon Machine Image with a web-based interface to this in the future, but for now, yes, you only need a single Ubuntu instance to run this. One way to do this is to sign up for Amazon Web Services and start by launching this AMI: https://console.aws.amazon.com/ec2/home?region=us-east-1#launchAmi=ami-a29943cb

Once you have everything set up by following the directions in the post, you would supply a list of URLs in a text file (one per line) and pass the name of the text file to the TweetHandler.py script with the --url_file= parameter.

Feel free to contact me via e-mail if you have more questions – jeff _a_ iacquire.com