Scaling applications with Python

This year’s PyCon had quite a few presentations on how Python is being used to scale applications to massive volumes. Many companies use Python for their high-traffic web sites. A shortlist:-

Here, I’m going to look at a big Python stack that came up a number of times.

Jinja2 is a template engine that is similar in many respects to Django templates. It does not set out to be the fastest template engine. Instead it strives to be easy to use, to provide easily configurable syntax, to be easy to debug, and to provide sandboxing so that third-party templates can run in a safe environment.

Although it doesn’t set out to be the fastest, in some benchmarks it is more than 10 times faster than Django’s template engine. It’s this balance of features and speed that makes it so well suited to high-traffic sites.

Bottle is a very fast and simple WSGI web framework. It is pure Python and fits into a single file. Bottle uses an @route function decorator to map a URL to the function that handles it, and lets you return either a generator, for streaming straight back to the caller, or a dictionary to be used by a template defined with an @view decorator. That is pretty much all that Bottle does. It has its own simple template engine and a single-threaded server, which are both fine for development. It also supports a multitude of other template engines and multi-threaded servers.
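To make the decorator idea concrete, here is a minimal sketch built only on the standard library's WSGI conventions. It is not Bottle's code or API: `route`, `ROUTES` and `app` are hypothetical names, and the lookup is an exact match on the path rather than Bottle's URL patterns. It shows one handler returning a generator that is streamed back chunk by chunk, and one returning a plain string.

```python
# Hypothetical mini-framework; it illustrates the decorator idea, not Bottle itself.
ROUTES = {}


def route(path):
    """Register the decorated function as the handler for an exact URL path."""
    def decorator(func):
        ROUTES[path] = func
        return func
    return decorator


@route("/countdown")
def countdown():
    # A generator result is handed to the server one chunk at a time.
    for n in (3, 2, 1):
        yield "%d\n" % n


@route("/hello")
def hello():
    return "Hello from a single-file app\n"


def app(environ, start_response):
    """WSGI entry point: look up the handler and encode whatever it returns."""
    handler = ROUTES.get(environ.get("PATH_INFO", "/"))
    if handler is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not found\n"]
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    result = handler()
    if isinstance(result, str):
        result = [result]
    # Encode lazily so a generator is never buffered in full.
    return (chunk.encode("utf-8") for chunk in result)
```

To try it locally, `wsgiref.simple_server.make_server("localhost", 8080, app).serve_forever()` runs the app on the standard library's single-threaded development server, which is fine for experimenting but not for production traffic.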

At three of the presentations on scaling apps, the presenters said they use Paste’s httpserver to serve their applications. In two other presentations, the applications were running on twisted.web. Paste is written by the talented Ian Bicking, and is more than just a multi-threaded HTTP server. In the presentations I went to, though, they were just using the httpserver component.

RabbitMQ is a highly reliable messaging system based on the emerging AMQP standard. It came up in many different talks during the conference, so much so that it appears to be the primary choice for the Python community at the moment. The interesting point for me was that it’s written in Erlang, so the attraction isn’t driven by any Python street cred. RabbitMQ was used extensively to offload any processing that can be done asynchronously. In the larger-volume applications, a common pattern was to look for anything that could possibly be done asynchronously, so that a response could be returned as quickly as possible. There was an excellent presentation by Jinal Jhaveri on Scaling Python webapps from zero to 50 million users. This was the story of a game developed for Facebook that went from 0 users to 1 million in its first week. Messaging was the key architectural component that enabled them to maintain a good user experience.
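The offloading pattern can be sketched with nothing but the standard library. In the sketch below an in-process `queue.Queue` and a worker thread stand in for the broker and its consumer. This is an assumption made to keep the example self-contained: a real deployment would publish to RabbitMQ through an AMQP client, and the consumer would run in a separate process or on another machine. The task name `update_leaderboard` and the helper names are hypothetical.

```python
import queue
import threading

# Stand-in for a broker queue; with RabbitMQ this would be an AMQP queue.
jobs = queue.Queue()
processed = []


def handle_request(user_id):
    """Do only the essential work, queue the rest, and respond immediately."""
    jobs.put({"task": "update_leaderboard", "user_id": user_id})
    return {"status": "ok", "user_id": user_id}


def worker():
    """Consume messages until a None sentinel arrives."""
    while True:
        message = jobs.get()
        try:
            if message is None:
                return
            # The slow work happens here, off the request path.
            processed.append(message["user_id"])
        finally:
            jobs.task_done()


consumer = threading.Thread(target=worker, daemon=True)
consumer.start()

responses = [handle_request(uid) for uid in (1, 2, 3)]
jobs.put(None)  # ask the worker to stop
jobs.join()     # wait until every queued message has been handled
```

The request handler never waits for the slow work, which is why the response time stays flat as the queue absorbs bursts of traffic.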

The Twisted framework has been around for a long time and is already widely used. I went to a presentation on cooperative multi-tasking, which looked at ways of substantially increasing the performance of your code. Networking and filesystem access are well-known sources of blocking, but your own code is a common one too. Twisted offers non-blocking sockets for networking, and constructs you can use in your own code to prevent blocking.

The primary target in the presentation was for loops. Calling a function on every element of a long for loop ties up the process until the loop finishes, and Twisted has tools to improve this. You supply an iterator, and Twisted schedules it to run a step at a time, interleaved with any other tasks. In Python, starting multiple threads is not a good solution for CPU-bound work, because of the Global Interpreter Lock, but multiple processes are, and Twisted makes multi-processing easy.
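The idea of handing an iterator to a scheduler can be illustrated in plain Python. The sketch below is not Twisted's implementation: `process_items` and `run_cooperatively` are hypothetical names, and everything runs in a single thread for simplicity. In Twisted itself, `twisted.internet.task.cooperate` is the function that accepts an iterator for this kind of scheduling.

```python
from collections import deque


def process_items(name, items, log):
    """Handle one item per step, then yield control back to the scheduler."""
    for item in items:
        log.append((name, item))  # stand-in for the real per-item work
        yield


def run_cooperatively(tasks):
    """Round-robin over the iterators until all of them are exhausted."""
    pending = deque(tasks)
    while pending:
        task = pending.popleft()
        try:
            next(task)
        except StopIteration:
            continue  # this task is finished; drop it
        pending.append(task)


log = []
run_cooperatively([
    process_items("a", [1, 2, 3], log),
    process_items("b", [10, 20], log),
])
# log is now [("a", 1), ("b", 10), ("a", 2), ("b", 20), ("a", 3)]
```

Because each loop gives up control after every element, a long loop can no longer starve the shorter one, which is the effect the presentation was after.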

There were a lot of talks about NoSQL databases, or document stores, during the conference, with presentations covering MongoDB, Redis, Cassandra and Neo4j (Neo4j is particularly targeted at persisting graphs). Of them all, the one that stood out for me was MongoDB, both for its maturity and for the relatively easy mental transition from a relational database model. The lead developer of the Python driver for MongoDB, Mike Dirolf, presented in an Open Spaces session, which was excellent. MongoDB is written in C++ and stores JSON-style documents in collections. It supports a JSON query language that has a SQL-like feel, and is blisteringly fast. There are production deployments of MongoDB holding more than 600 million documents. SourceForge are moving more and more of their data onto MongoDB, including page caching.
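To give a feel for what a JSON query looks like, here is a toy in-memory matcher. It is not MongoDB's implementation and it needs no database. The operator names (`$gt`, `$gte`, `$lt`, `$lte`, `$ne`, `$in`) are taken from MongoDB's query language, but `matches`, `find` and the sample data are hypothetical, and only this small subset is handled. With the PyMongo driver, a dictionary of the same shape would be passed to a collection's `find` method.

```python
import operator

# A small subset of comparison operators, named as in MongoDB's query language.
OPERATORS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$ne": operator.ne,
    "$in": lambda value, choices: value in choices,
}


def matches(document, query):
    """Return True if the document satisfies every condition in the query."""
    for field, condition in query.items():
        if field not in document:
            return False  # simplification: a missing field never matches
        value = document[field]
        if isinstance(condition, dict):
            # {"age": {"$gt": 30}} style: every operator must hold.
            if not all(OPERATORS[op](value, arg) for op, arg in condition.items()):
                return False
        elif value != condition:
            # {"city": "London"} style: plain equality.
            return False
    return True


def find(collection, query):
    """Filter an in-memory list of dicts, loosely like a collection query."""
    return [doc for doc in collection if matches(doc, query)]


users = [
    {"name": "Ada", "age": 36, "city": "London"},
    {"name": "Bob", "age": 25, "city": "Leeds"},
    {"name": "Cy", "age": 41, "city": "London"},
]
older_londoners = find(users, {"city": "London", "age": {"$gt": 40}})
```

The query reads much like `WHERE city = 'London' AND age > 40`, which is a large part of why the transition from a relational model feels easy.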

It was great to see so many presentations on using Python for big, complex, popular applications, and the wealth of tools out there to make it possible. It’s clear that Python is well and truly Production Ready. In my next blog post, I’ll write about using some of these components to replace an existing large Java REST service with something small, simple and pythonic that is functionally equivalent.