I’ve had more time to work on Labour (originally posted here, inspired by this and that), the WSGI Server Durability Benchmark. I’m relatively happy with progress so far, and will probably try to ‘market’ Labour a bit shortly. Two things that are still sorely missing from Labour (and are blocking me from posting some news of it to mailing lists et al) are better reporting facilities (read: flashy graphs) and mod_wsgi support.

That said, Labour can already benchmark nine WSGI Servers with a single commandline just seconds after you pull it off the sourcecode repository, and that’s something (do note that you’d have to have the servers installed, Labour can benchmark all installed servers but won’t install them for you). The output is still not pretty, but you can unfold the item below to see sample results from an invocation of Labour. Do note that being a library, this is just one test configuration out of several. So while it’s possible to see numbers-divided-by-second-real-quick for all the benchmark junkies out there, if you want to do anything meaningful, you better read the README file (aww…).

We see that Labour benchmarked nine servers with four different tests. Each test starts the server and hits it with a certain number of HTTP Requests. The requests cause the Labour WSGI Application (contained in the server) to do all sorts of trouble. For example, the Plain test in the table above is comprised of 99% requests that will return normally (200 OK and a body) and 1% of requests that will return 404 NOT FOUND. The SIGSEGV test, on the other hand, is made of 50% OK requests and 50% requests that cause a SIGSEGV signal to be sent to whichever process running out application. The two sleeping tests cause the application to sleep(), eating up server workers, the heavy test more often and for longer times than the lighter variant.

Each cell in the result table contains two numbers – the percentage of ‘failed’ responses accumulated in the test, and the number of responses per second. In a Labout test, a “failed” response is an HTTP response which doesn’t match what’s expected from the Behaviour specified in the HTTP request that triggered it. For example, if we asked the server to respond with 404 NOT FOUND and the server responded 200 OK, that’s considered a failure. Some Behaviours allow quite a broad range of responses (I don’t expect anything from a SIGSEGV Behaviour, even a socket error would be counted as OK), as they are issued to measure their effect on other requests (maybe some WSGI Servers will insulate other requests from a SIGSEGV‘s). More than 10% of failed responses in a test will consider the whole test as failed.

Do note that the data in the table above should not be used to infer anything about the servers listed. First, it was created using only 1,000 requests per test (hey, I’m on vacation and all I got is a crummy lil’ Netbook). Second, so far Labour does nothing to try and tune the various servers with similar parameters. So, for example, maybe some of these servers have multiprocessing support which I didn’t enable, while others forked 16 kids. Third and last, I didn’t think much about the tests I’ve created so far, and am mostly focused with developing Labour itself rather than producing quality comparisons (yet).

That’s it, for now. You’re welcome to try Labour, and by all means I’d be happy to hear comments about this benchmarking methodology, the kind of test profiles you’d like to see constructed (try looking into labour.tester.factories and labour.behaviours, the syntax of test-construction is self-teaching), etc. I won’t wreak havoc on the sources on purpose, but no API is stable until declared otherwise (…).

I just pushed an update to labour which supports forking multiple HTTP clients, alongside with some other small improvements. For the sake of this update I wrote a simple module called multicall which enables a process to distribute a callable among several subprocesses (POSIX only). More elegant solutions exist (‘import multiprocessing’ come to mind), but I think it’s hard to beat multicall’s simplicity, and it doesn’t use Python threads /at all/ (unlike multiprocessing), which lands some performance improvement.

I think the next area which requires work is probably reporting, and I’m not entirely sure what kind of reports would be useful. Labour isn’t a performance oriented benchmark, but a bit more like a cross between a benchmark and a test-suite for WSGI servers. I’m not sure if the output should be a graph (using behaviour set ‘foo’, servers X, Y and Z reach A, B and C hits per second) or a table (server X handles SIGSEGV well, server Y doesn’t, server X doesn’t handle a slow worker well, server Y does).

Also, and this is something that’s been troubling me for a while, I think I know what the results of most benchmarks will be. Let’s start with the fact that all servers that don’t fork out processes will not survive pretty much all the wedging/SIGSEGVing/memory leaking benchmarks. Second, I’m pretty sure most servers will handle somewhat different scenarios (IO/CPU/sleep) in exactly the same way, which is, not notice anything went wrong and have the worker pool reduced by one.

Assuming this is indeed the way the result matrix will look like, and assuming the response of the respective WSGI servers’ programmers’ will be as I expect (“fix the workers”) – then why am I doing this benchmark, anyway?

The main answer which comes to mind is that this is an ‘everyone fails’ exam which opens the door to a future where some will fail and some will succeed (or fail more gracefully). I’m pretty sure this is a future we should be heading to. I never liked the ‘fix the workers’ response, we’re hackers, not mathematicians – failing well is part of the job. That said, I’m not sure Labour will have enough weight in the WSGI/Python Web community to make this change.

If anyone wants to chip in, or even just send moral support, go ahead. I’ll post updates about Labour the next time I revisit this project.

A few days ago I ran into an interesting post by Ian Bicking about a benchmark that Nicholas Piël ran on WSGI servers. Go ahead and read the original posts, but the the skinny is that Nicholas’ (and many others’) focus was about performance, wherein Ian (and I…) feel more attention should be given to the ‘robustness’ of the server, so to speak.

What’s robustness, really? Well, worker-code can misbehave in a lot of ways (and for a lot of reasons), and a robust server will deal gracefully with these misbehaviours. The puritans in the crowd may well argue that the proper solution is to fix the worker (or any of its hierarchy of dependencies). However, all too often purity is a luxury that a web programmer can’t afford (and maybe shouldn’t bother to afford, but maybe we can talk about that some other time).

So I sat down and jotted something called Labour which is a library/tool to ‘benchmark’ WSGI servers not on the merit of how many jiffies they take to parse HTTP headers and what kind of scheduler algorithm they got, but rather on how they’re handling bad workers (socialist pun intended but not necessarily achieved). Labour lets a test author write elegant (ahem) code to declare a certain mixture of bad worker behaviour, and then run it against any one of several WSGI servers.

Let’s look at the simplest example (you can find a few more in Labour’s repository):

At its current early stage, Labour supports these (self explanatory) behaviours: PlainResponse, Sleeping, IOBound, LeakMemory, PythonWedge and SIGSEGV, more can be added with ease. It also supports the following servers (at least to some extent): WSGIRef, Twisted, CherryPy, Cogen, FAPWS3, Gevent and Paste. Significant work should be done in the areas of forking test clients in parallel (the library is near meaningless without this), supporting more WSGI servers and better exploitation of those already supported, richer and better reporting and cleaning up the potential mess after a test.

This long list of caveats said – it is actually working code and it’s currently on github (link above). So either grab it while it’s hot or follow this space to see updates in the (near) future, I’d love to hear what you think about it.