Beating Google With CouchDB, Celery and Whoosh (Part 1)

Ok, let’s get this out of the way right at the start – the title is a huge overstatement. This series of posts will show you how to create a search engine using standard Python tools like Django, Celery and Whoosh with CouchDB as the backend.

Celery is a message passing library that makes it really easy to run background tasks and to spread them across a number of nodes. The most recent release added the NoSQL database CouchDB as a possible backend. I’m a huge fan of CouchDB, and the idea of running both my database and message passing backend on the same software really appealed to me. Unfortunately the documentation doesn’t make it clear what you need to do to get CouchDB working, and what the downsides are. I decided to write this series partly to explain how Celery and CouchDB work, but also to experiment with using them together.

In this series I’m going to talk about setting up Celery to work with Django, using CouchDB as a backend. I’m also going to show you how to use Celery to create a web-crawler. We’ll then index the crawled pages using Whoosh and use a PageRank-like algorithm to help rank the results. Finally, we’ll attach a simple Django frontend to the search engine for querying it.

Let’s consider what we need to implement for our webcrawler to work, and be a good citizen of the internet. First and foremost, we must read and respect robots.txt. This is a file that specifies which areas of a site crawlers are banned from. We must also not hit a site too hard, or too often. It is very easy to write a crawler that repeatedly hits a site, and requests the same document over and over again. These are very big no-nos. Lastly, we must make sure that we use a custom User Agent so our bot is identifiable.
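As a rough sketch of those rules, the snippet below uses `urllib.robotparser` from the Python 3 standard library to check robots.txt, and a small helper to enforce a gap between requests to the same host. The bot name `ExampleCrawler/0.1`, the five second delay and the `PolitenessGate` class are assumptions made up for illustration; they are not the code used later in this series.

```python
import time
from urllib import robotparser
from urllib.parse import urlparse

USER_AGENT = "ExampleCrawler/0.1"  # hypothetical bot name, pick your own
MIN_DELAY = 5.0                    # assumed gap (seconds) between hits to one host


def build_parser(robots_txt):
    """Build a robots.txt parser from the text of the file."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class PolitenessGate:
    """Refuse URLs that robots.txt bans or that would hit a host too soon."""

    def __init__(self, min_delay=MIN_DELAY, clock=time.monotonic):
        self.min_delay = min_delay
        self.clock = clock
        self.last_hit = {}  # host -> time of the last allowed request

    def may_fetch(self, parser, url):
        if not parser.can_fetch(USER_AGENT, url):
            return False
        host = urlparse(url).netloc
        now = self.clock()
        last = self.last_hit.get(host)
        if last is not None and now - last < self.min_delay:
            return False
        self.last_hit[host] = now
        return True
```

A real crawler would download each site's robots.txt once, cache the parser, and send `USER_AGENT` in the `User-Agent` header of every request.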

We’ll divide the algorithm for our webcrawler into three parts. Firstly we’ll need a set of urls. The crawler picks a url, retrieves the page, then stores it in the database. The second stage takes the page content, parses it for links, and adds the links to the set of urls to be crawled. The final stage is to index the retrieved text. This is done by watching for pages that are retrieved by the first stage, and adding them to the full text index.
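To make the second stage concrete, here is a minimal sketch of link extraction using only the Python 3 standard library. The names `LinkParser`, `extract_links` and `add_new_links` are hypothetical, and a production crawler would probably want a more forgiving HTML parser.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect absolute http(s) links from the <a href> tags of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links, then drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute.startswith(("http://", "https://")):
                    self.links.add(absolute)


def extract_links(base_url, html):
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links


def add_new_links(to_crawl, seen, base_url, html):
    """Add links we have not crawled yet to the set of urls to crawl."""
    to_crawl |= extract_links(base_url, html) - seen
```

Dropping the fragment means `page.html#top` and `page.html` count as the same url, which avoids requesting the same document twice.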

Celery allows you to create ‘tasks’. These are units of work that are triggered by a piece of code and then executed, after a period of time, on any node in your system. For the crawler we’ll need two separate tasks. The first retrieves and stores a given url. When it completes it will trigger a second task, one that parses the links from the page. To begin the process we’ll need to use an external command to feed some initial urls into the system, but after that it will continuously crawl until it runs out of links. A real search engine would want to monitor its index for stale pages and reload those, but I won’t implement that in this example.
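The hand-off between the two tasks can be sketched without any services running. The snippet below is not Celery code: it uses a plain in-process queue as a stand-in for the message broker, and the names `run_tasks`, `retrieve_page` and `find_page_links` are made up for illustration. With Celery, each inner function would be a task and each `queue.append` would become a call to the task’s `delay()` method.

```python
from collections import deque


def run_tasks(seed_urls, fetch, find_links):
    """Run the retrieve/parse hand-off until no work is left.

    fetch(url) returns the page body; find_links(url, body) returns urls.
    Both are supplied by the caller so the sketch stays self-contained.
    """
    queue = deque(("retrieve", url) for url in seed_urls)
    scheduled = set(seed_urls)  # urls already queued, to avoid repeats
    pages = {}                  # stand-in for the database

    def retrieve_page(url):
        pages[url] = fetch(url)
        queue.append(("parse", url))  # first task triggers the second

    def find_page_links(url):
        for link in find_links(url, pages[url]):
            if link not in scheduled:
                scheduled.add(link)
                queue.append(("retrieve", link))

    tasks = {"retrieve": retrieve_page, "parse": find_page_links}
    while queue:
        name, url = queue.popleft()
        tasks[name](url)
    return pages
```

The seed urls play the role of the external command: once they are queued, the two tasks keep feeding each other until no new links turn up.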

I’m going to assume that you have a decent level of knowledge about Python and Django, so you might want to read some tutorials on those first. If you’re following along at home, create yourself a blank Django project with a single app inside. You’ll also need to install django-celery, the CouchDB Python library, and have a working install of CouchDB available.

BEATING Google? Title shows a form of arrogance. To make that claim you need PETABYTES of data. I know the article says otherwise off the bat, but the title is misleading.

But there’s a LOT of functionality Whoosh is missing that you have to build that comes with Solr – stemming, multi-language support, synonyms, boost queries, easy administration, and sooo much more. I don’t mean just adding UTF-8 for multi language support either.

I think it’s great to have something like this in place, but these days if you want real awesome textual search features using large corpuses you’d be silly not to consider Solr before building a custom solution.

I think the title is more hyperbole than arrogance! You need a hook to get people reading :-)

You’re right, Whoosh is not as fully featured as Solr and of course even Solr would not be enough to scale up to petabytes of data. The big advantage that Whoosh gives you is that it is very easy to set up. Just a few lines of code and you’ve got a full text search engine. No messing around with daemons or xml configuration files. For a series of short blog posts that was the deciding factor.

I use both Solr and Whoosh at work, depending on the context. I think they both have their place.

Hi Andrew, a quick question on ensuring uniqueness of pages. I have run a quick example based on your tutorial and in some cases ended up with multiple documents being copies of the same page. What would be the consequences of ensuring that the page is unique by storing the md5 of the url as the _id of the CouchDB document? Is it a good idea?

I don’t think storing the md5 of the url will be much help. If the url is different then the md5 would be different too. In my code if a url issues a 301/302 redirect then you’ll get one document under the url which redirects and one for the url it redirects to. I think the best solution is to check what url the page was actually loaded from and only store the document under that.

Thanks Andrew, you are absolutely right about redirects. In my case they did not matter too much and I have ended up storing hashlib.sha224(url).hexdigest() as id – that would spare me one view and make look-ups slightly more direct. Many thanks for the great walk-through! Thomas
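For anyone following this thread, the two suggestions above can be combined: store each page under the url it was actually loaded from, and derive the document id from a hash of that url. This is only a sketch in Python 3, assuming a plain dict in place of the CouchDB database; `fetch`, `page_id`, `store_page` and the bot name are hypothetical names, not code from the post.

```python
import hashlib
import urllib.request

USER_AGENT = "ExampleCrawler/0.1"  # hypothetical bot name


def fetch(url):
    """Return (final_url, body); final_url is the address after redirects."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.geturl(), response.read()


def page_id(url):
    # In Python 3 sha224 needs bytes, so encode the url first.
    return hashlib.sha224(url.encode("utf-8")).hexdigest()


def store_page(documents, requested_url, fetch=fetch):
    """Store the page under an id derived from the url it was loaded from."""
    final_url, body = fetch(requested_url)
    doc_id = page_id(final_url)
    documents[doc_id] = {"_id": doc_id, "url": final_url, "content": body}
    return doc_id
```

With this arrangement a url that redirects and the url it redirects to end up as the same document, rather than two copies of one page.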