Sunday, April 1, 2012

Natural Language Processing (NLP): a marvelous world of possibilities! Fortunately, it is also yet another application domain for which Python is wonderfully well equipped.

I have been playing with Python and NLP for a couple of years now, integrating its tools into a reasonably large project. I hope to demo this project really soon, but it is not the topic of this post.

The topic of this post is the performance gains that are possible on typical NLP problems simply by using PyPy. PyPy has come a long way recently, and can now be used as a drop-in replacement for CPython in many applications, with large performance gains.

I'll start by showing how you can use PyPy for your day-to-day development needs with very little effort. First you need to install two very powerful Python tools: virtualenv and virtualenvwrapper. These can be easily installed with easy_install or pip.

sudo easy_install -U virtualenv virtualenvwrapper

Follow the post-install configuration for virtualenvwrapper. Then download the most recent stable release tarball from the PyPy page, and extract it somewhere on your system.

Now you have to create your own virtualenv to work with PyPy instead of the standard CPython installation on your system:

mkvirtualenv -p pypy-1.8/bin/pypy pypyEnv

From this point on, whenever you want to use PyPy, all you need to do is type:

workon pypyEnv

anywhere on your system.
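
To confirm that the active environment really runs PyPy, a quick check like the one below can help. It is only a sketch: the helper name describe_interpreter is made up for this post, while platform.python_implementation() comes from the standard library and reports 'PyPy' under PyPy and 'CPython' under the standard interpreter.

```python
import platform
import sys


def describe_interpreter():
    """Return a short string naming the running interpreter and its version.

    Hypothetical helper, not part of virtualenv, PyPy or NLTK.
    """
    # python_implementation() returns e.g. 'CPython' or 'PyPy'
    name = platform.python_implementation()
    # python_version() returns the language version, e.g. '2.7.2'
    version = platform.python_version()
    return "%s %s" % (name, version)


if __name__ == "__main__":
    print(describe_interpreter())
    # sys.executable is the path of the running binary; once the
    # virtualenv is active it should point inside that environment
    print(sys.executable)
```

Running this script inside and outside the virtualenv should make it obvious which interpreter each shell is using.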

Now that we've got all this environment setup out of our way, we can focus on testing NLTK with PyPy and comparing it to CPython. By the way, NLTK can be installed in the same way as virtualenv.

Since PyPy has a very extensive benchmarking system, I decided to keep all my benchmarking code visible so that the project devs can take advantage of it to further improve PyPy if they want to. The code is on GitHub.

The benchmarks (see the GitHub page) indicate big gains on some operations and smaller ones on others. In a couple of cases PyPy is slower, though I didn't investigate why.

The main purpose of having such a benchmark is to provide some experimental grounds for the improvement of PyPy. Its core developers say that "If it is not faster than CPython then it's a bug". But I also wanted to let fellow developers know that NLTK seems to be fully compatible with PyPy (as far as I tested it), and can benefit from performance improvements when run with PyPy.

Naturally, this benchmark can be vastly improved and extended. I count on YOU to help me with this: just add a quick function to the benchmark.py module and send me a pull request.
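
For anyone wondering what such a quick function might look like, here is a rough sketch. It does not reproduce the actual layout of benchmark.py: the names word_frequencies and run_benchmark are invented for illustration, and the workload is a pure-Python stand-in that you would replace with a call into NLTK. Only timeit, collections and platform from the standard library are used.

```python
import platform
import timeit
from collections import Counter

# Hypothetical sample input; a real benchmark would load an NLTK corpus.
SAMPLE_TEXT = "the quick brown fox jumps over the lazy dog " * 2000


def word_frequencies(text=SAMPLE_TEXT):
    """Stand-in workload: count word occurrences in pure Python."""
    return Counter(text.split())


def run_benchmark(func, repeat=3, number=5):
    """Return the best average time, in seconds, for one call of func.

    timeit.repeat runs `number` calls per round for `repeat` rounds;
    taking the minimum round reduces noise from other processes.
    """
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


if __name__ == "__main__":
    best = run_benchmark(word_frequencies)
    # Print the interpreter name so CPython and PyPy runs can be compared
    print("%s: %.6f s per call" % (platform.python_implementation(), best))
```

Run the same script once under CPython and once after workon pypyEnv to compare the two numbers. Keep in mind that PyPy's JIT needs some warm-up, so very short runs may understate its advantage.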