To help understand why requests are occasionally slow, we’ve written a Django middleware that, without interrupting a request, takes a backtrace of its stack and emails it to us. In the short time it has been on our production servers it has helped diagnose issues with the same clarity as regular 500-style error reports.

We’ve packaged up this slow request watchdog middleware and released it as open source so others can benefit from it as well. Give it a try and let us know how you go!

Bitbucket has seen significant growth over the last year, which, combined with an increasing number of features, has at times had some impact on our availability. We have therefore been working hard to understand and fix any runtime problems that occur running the site.

Bitbucket is a fairly standard Django application, configured to automatically email backtraces whenever a request crashes with a 500 error. While you see the kicked-over bucket, we get a full backtrace emailed with all the request’s details and variables. This is hugely helpful and often provides all the info needed to identify the problem.

However, there is a class of server-side problems that is a little harder to investigate.

Slow Requests

Occasionally you might get an error page as a result of a server-side timeout. Our servers are configured to automatically kill stalled requests that take longer than 30 seconds to respond. We run Gunicorn, for which this culling of slow requests is standard behaviour: the worker pool can only handle so many concurrent requests, so letting slow requests just sit could tie up the whole pool and effectively stall the entire site. Our timeout is 30 seconds, so next time you see an error page after exactly 30 seconds, you most likely hit that timeout.
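For context, this cutoff is an ordinary Gunicorn setting. A minimal sketch of a config file with the relevant directive (the worker count is illustrative; 30 seconds matches the timeout described above):

```python
# gunicorn.conf.py -- minimal sketch
# Workers that are silent for longer than `timeout` seconds are
# killed and restarted by the Gunicorn master process.
workers = 4    # illustrative; tune to your hardware
timeout = 30   # seconds before a stalled worker is killed
```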

Getting information about why some requests are slow is difficult when they get killed automatically. Since there is no exception or response, there is no backtrace and therefore no email. All we know is that workers sometimes get killed by Gunicorn, but that’s about it.

Tracking down these silent assassinations was crucial though, as killing a process that is in the middle of making changes to your database and filesystem can have some nasty consequences. Rolled back transactions will protect the internal integrity of the database, but modifications to your repository on the file system cannot be rolled back by the database and so the two could get out of sync.

Gunicorn has no mercy for unresponsive worker processes. It sends them a SIGKILL signal, which makes it impossible to gather any forensic evidence: SIGKILL cannot be intercepted in the worker, so there is no opportunity to log anything.
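This is easy to demonstrate from Python: the kernel never delivers SIGKILL to a user-space handler, so attempting to install one fails outright (a small illustration, Unix-only; the helper name is our own):

```python
import signal

def can_trap(signum):
    """Return True if a handler can be installed for this signal."""
    try:
        signal.signal(signum, lambda s, f: None)
        return True
    except (OSError, ValueError, RuntimeError):
        return False

print(can_trap(signal.SIGTERM))  # SIGTERM can be handled
print(can_trap(signal.SIGKILL))  # SIGKILL cannot: the kernel refuses
```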

Watchdog

Leveraging team member Brodie Rao‘s Tipper project, which adds a signal handler that dumps a live backtrace of a Python program on demand, Java-style, we built a Django middleware class that adds a watchdog timer to every request. The watchdog automatically logs a backtrace if the request does not return in time. By setting the timeout to 25 seconds, just before Gunicorn shoots the worker, we get a peek at these problematic requests.

The watchdog is implemented as a separate background thread that reads the stack of the request thread without interrupting or otherwise disturbing it. It logs the backtrace together with all stack variables and emails it straight to the team.
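The full implementation lives in the project's source; a minimal sketch of the underlying technique, using CPython's sys._current_frames() to read another thread's stack and a threading.Timer as the watchdog (all names here are our own, not dogslow's API):

```python
import sys
import threading
import traceback

def start_watchdog(thread_id, timeout, sink):
    """Arm a background timer that, after `timeout` seconds, reads the
    watched thread's current stack frame via sys._current_frames()
    without interrupting it, and appends the formatted backtrace
    to `sink`."""
    def peek():
        frame = sys._current_frames().get(thread_id)
        if frame is not None:
            sink.append("".join(traceback.format_stack(frame)))
    timer = threading.Timer(timeout, peek)
    timer.daemon = True
    timer.start()
    return timer  # call .cancel() when the request returns in time

# Simulate a slow "request": the watchdog fires while we sleep.
import time
backtraces = []
timer = start_watchdog(threading.get_ident(), 0.1, backtraces)
time.sleep(0.3)  # stands in for a request blocked on I/O
timer.cancel()
print(backtraces[0])  # backtrace shows we were inside time.sleep
```

A real middleware would arm the timer in process_request and cancel it in process_response, so well-behaved requests never trigger a report.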

One such backtrace showed that the request thread was blocked on a network read, waiting for a response from one of our Redis servers.

Results

After we deployed the watchdog on our production servers, we got our first insight into this previously invisible problem. Within hours, patterns arose in these emails, pinpointing weak spots in our code: slow database queries, expensive file system interaction, and networking issues.

The backtrace above was one of a series that identified issues with Redis.

Redis is a fast, in-memory key/value store, somewhat similar to memcached but with persistence. We use it to store pre-computed user and repository newsfeeds. While it’s supposed to be very fast, our watchdog clearly showed it was not responding very quickly, causing requests to time out and get killed.

Closer examination of these emails revealed an interesting pattern: they came in at sharp 5-minute intervals.

This peculiar interval turned out to perfectly match Redis’ periodic BGSAVE cycle, during which it writes its entire database to disk. At almost 40GB, our in-memory Redis database is significant, and because BGSAVE forks the process, even copy-on-write semantics couldn’t keep us out of swap space.

We have since fixed our Redis configuration: we switched to append-only mode, moved Redis to a box with more memory, and no longer do a full file dump every 5 minutes.
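For reference, the relevant redis.conf directives look roughly like this (a sketch; check the Redis documentation for your version before copying):

```
save ""              # disable periodic BGSAVE snapshots
appendonly yes       # persist via the append-only file instead
appendfsync everysec # fsync the AOF once per second
```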

Open Source

Even though it was deployed only a week ago, this relatively simple middleware has helped us a lot, giving very detailed forensic data on a class of problems that was largely invisible before. To let others benefit from this too, we have packaged it up as a standalone project and released it under the LGPL. We christened our watchdog for slow requests Dogslow.

To enable it in your Django app, simply add it to your list of installed middlewares in settings.py:

MIDDLEWARE_CLASSES = (
    'dogslow.WatchdogMiddleware',
    ...
)

Dogslow’s default configuration is to watch all requests with a 25-second timeout. If you have a webserver setup similar to Bitbucket’s, one that forcibly kills slow requests, make sure you set dogslow’s timeout a little lower than your webserver’s.
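The timeout is configured in settings.py; we assume here the setting is named DOGSLOW_TIMER (check the project's README for the exact names):

```python
# settings.py -- sketch; the setting name DOGSLOW_TIMER is our
# assumption, consult dogslow's README for the exact names.
DOGSLOW_TIMER = 25  # seconds; keep below your webserver's kill timeout
```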

Dogslow is completely silent as long as your requests return normally; slow request backtraces are logged to individual files in the system’s tmp directory. You can tell it to log elsewhere by adding the following to