Django management commands and out-of-memory kills

tl;dr: Smart developers run database-heavy operations as management commands or celery tasks. If DEBUG=True, Django records every database query in memory. Ensure DEBUG=False to avoid running out of memory.

1. Measure the performance of the existing code against real-world data, to establish the benchmark.

2. Create a representative data set that will demonstrate the problem but allow for reasonable iteration.

3. Make changes and measure improvements against the test data, until significant improvements are gained.

4. Measure against the real-world data again, and report the gains.

I was having a lot of trouble with step 1. First, I had to clear ticket 10, which was preventing me from importing the Southeast Queensland data at all. Next, I started an import on my MacBook Pro (2.4 GHz Intel Core i7 w/ 8 GB RAM) overnight, and woke to find my laptop unusable, requiring a hard reboot. I spent some pennies and moved to the cloud. A t1.micro instance was unable to import the Tulsa feed, killed by the out-of-memory killer. That seemed odd, but I knew the free tier instances were memory constrained (600 MB RAM), with no swap. I then moved to a m3.medium instance (3.75 GB RAM). The Tulsa feed imported in 50 minutes (vs. 30 minutes on my laptop, which isn’t surprising). However, importing the Southeast Queensland feed also triggered the OOM killer.

How can importing 165 MB of text use up almost 4 GB of memory?

My suspicion was a memory leak, or some out-of-control caching. Memory leaks are rare in Python, but possible. Caching is a more likely danger – some helpful code that speeds up a web request could cause an offline task to eat up memory. I remembered a recent Planet Python post on memory leaks, and quickly found mem_top. There are a few variations that all do the same basic thing: ask Python about current memory usage and identify the top memory users. mem_top is an easy pip install, and easy to call:

import logging

from mem_top import mem_top

logging.debug(mem_top())

I added this code in several places, tested it against a small sample feed, and then ran it against a small but real feed.

The top memory consumer, which grew as the feed was imported, was a list of dictionaries holding SQL queries and their timings. That wasn’t enough to nail down the issue, but it was enough to trigger my memory of the common advice to never run Django with DEBUG=True in production. Django’s FAQ lists it under “Why is Django leaking memory?”, which then links to “How can I see the raw SQL queries Django is running?” When DEBUG=True in your settings, every query the connection runs is recorded on connection.queries.

When you are running Django as a web server, the request_started signal resets the queries to an empty list, so you get fresh data for each request. This is exactly what you want for single-machine debugging, and it’s how tools like the Django Debug Toolbar give you valuable information about your SQL queries. Lots of SQL queries are usually the thing making your requests slow. The solution is usually to add smarter indexes, combine similar queries to reduce total database requests, read from a cache instead of the database, and defer writes to an async task runner like celery. I usually leave these optimizations until I’m pretty sure the business team is happy with the design.

This same query caching logic runs in offline tasks. Loading an image from disk into a BLOB? The base64-encoded version is in memory. Crunching usage data for engagement metrics? Here’s a handful of SQL queries for every instance in the system. And, surprising to me, running a management command also logs all of the SQL queries.
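As a back-of-envelope answer to the 4 GB question — with figures that are guesses for illustration, not measurements from the import — a feed that issues a few queries per row across millions of rows keeps every SQL string in that list:

```python
# All three figures are assumptions, not measured values.
rows = 5_000_000        # rows in a large GTFS feed
queries_per_row = 3     # inserts plus lookups per row
bytes_per_entry = 250   # SQL text plus dict overhead

total = rows * queries_per_row * bytes_per_entry
print(f"{total / 2**30:.1f} GiB")  # 3.5 GiB
```

Numbers in that ballpark are enough to exhaust a 3.75 GB instance before the import finishes.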

Setting DEBUG=False will disable this behavior. I suspect New Relic acts like DEBUG="Yes and lots of it". For one project, I was running my celery tasks under the New Relic Python agent, but I’m going to stop doing that, and see if some memory issues around stats crunching go away.

I want DEBUG=True for local testing, but I want to be able to run database-intensive management commands without having to remember to turn it off. My solution is to add a short snippet to the management command.

This uses the undocumented use_debug_cursor attribute, so it may not work with exotic database backends. It only disables debugging on the default connection, so you’ll have to do more work if you use multiple database connections. Iterating through django.db.connections should do the trick (see the reset_queries code).

Also, it looks like changes are in progress for Django 1.7. The “memory leak” question has been removed from the FAQ, and the queries code has recently been updated to use a limited queue, and to always turn on database query recording when DEBUG=True. Maybe this change will take care of the issue, but I’ll miss the ability to disable this for management commands, since I just discovered it last night. I’ve added ticket #22873; we’ll see if they agree with my patch.

Update: The Django developer doesn’t want to change the behavior back. The attribute use_debug_cursor should probably be named force_debug_cursor, since the intention is to force SQL logging for support of assertNumQueries. I’ve got ideas for supporting my use case in Django 1.7, but the code will get more complex. Oh well.