Jarcoal

These days it’s hard to imagine developing in Python without virtualenv. Having isolated environments prevents a whole class of subtle (and not so subtle) bugs, and makes it much easier to manage dependencies.

But virtualenv isn’t without its pain points. Having to activate and deactivate is a nuisance (virtualenvwrapper helps a bit) that makes jumping into, out of, and between environments harder than it needs to be. Ideally we should be able to pip install django and ./manage.py runserver without thinking about anything else.

The solution is pipenv. pipenv is a replacement for both pip and virtualenv, and is going to let us say goodbye to activate/deactivate forever.

I’m not going to cover it in detail for this post, but in addition to abstracting away virtual environments, pipenv also gives us a replacement for requirements.txt that ensures we get deterministic builds. See the pipenv documentation to learn more about what those files look like.

Installing pipenv

First, install pipenv itself:

$ pip install pipenv

Now that we have pipenv installed, create a directory for your project:

$ mkdir -p ~/projects/pipenv_test/ && cd ~/projects/pipenv_test/

Using pipenv

Here’s where the magic starts. Let’s install a dependency:

$ pipenv install django

That’s it. Behind the scenes pipenv has created a virtual environment and installed django into it. It has also recorded django as a dependency in Pipfile and Pipfile.lock. Again, we won’t go into those files in depth here, but like requirements.txt they should go into version control.

Continuing with our django example, let’s create a project:

$ pipenv run django-admin.py startproject pipenvtest

So what exactly is happening here? By prefixing our normal django-admin.py command with pipenv run, it will be run within the context of the virtual environment that we are no longer thinking about.

Now we can start the dev server:

$ cd pipenvtest
$ pipenv run ./manage.py runserver

Silky smooth. What’s a virtual environment?

Environment Variables

If you’re doing things by the book then you’re probably storing your various secrets/tokens/etc. in environment variables. Rather than exporting them by hand in every new shell before you pipenv run ..., add a .env file to the root of your project (i.e. next to Pipfile) and populate it with your vars; pipenv loads it automatically:

$ echo 'DATABASE_PASSWORD=supersecretstuff' >> .env

Now when you pipenv run ... those environment variables will be in context.

One of the cool new features rolled out in Django 1.4 is the prefetch_related() method added to querysets. The purpose for this method is to improve performance for situations like this:
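Picture a product listing page that renders each product along with its features. In Django terms (Product and feature_set are illustrative names):

```python
products = Product.objects.all()

for product in products:
    print(product.name)
    # This inner loop hits the database once per product.
    for feature in product.feature_set.all():
        print(feature.name)
```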

See that inner loop for the product features? That thing is banging against your database with each iteration. If you have 100 products, that means 100 extra queries just to fetch their features. No good.

Calling prefetch_related('feature_set') on the initial products queryset lets us get all of the features in one query to the database, then Django does some magic behind the scenes “joining” them to their parent product.

That’s cool, now we’re down to just two queries. All is well? Not quite.

To find the limitations with prefetch_related, we need to take a peek at the SQL that it’s generating. If you haven’t figured this out already, you can see the SQL that Django’s ORM is building by printing out the query attribute of any queryset like so:
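For example (the filter here is just for illustration):

```python
products = Product.objects.filter(is_active=True)

# .query holds the SQL that Django will send to the database.
print(products.query)
```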

That said, I highly recommend you look into django-debug-toolbar. It makes analyzing your queries/SQL a lot easier.

The first query Django makes to retrieve the products is the same as any other query you’d run. It’s just a quick select with any filters you might have expressed as where clauses:
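Roughly this, assuming a simple product table (table and column names are illustrative):

```sql
SELECT "product"."id", "product"."name"
FROM "product"
WHERE "product"."is_active" = true;
```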

Looking good so far. Now that Django has found your products, it’s going to prefetch the features for those products. For the sake of this example, let’s say the previous query returned 5 products, with IDs 1-5. Here is what your next query will look like:
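Along these lines (again, names are illustrative):

```sql
SELECT "productfeature"."id", "productfeature"."product_id", "productfeature"."name"
FROM "productfeature"
WHERE "productfeature"."product_id" IN (1, 2, 3, 4, 5);
```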

A bit different from the first query. The IN portion of the where clause is the key thing to note here. This tells your database: “find all of the product features associated with products 1-5”.

Sounds good, and in this particular case it is likely a win for performance. But can you imagine what this query will look like if your first query returned 500 products? 5000 products? Not only will it become absolutely giant, requiring tons of data to be sent over the wire to your DB, but it’s also not cacheable by your database.

Most databases perform simple query caching by looking at the incoming statement and checking to see if it has been parsed before. If it has, the execution plan comes out of the cache and your DB saves a bunch of cycles that it would have spent parsing/explaining. This is one reason why bind variables are such a clear win. Not only do they protect your query from injection, they make it very easy on your database to identify previously run queries because rather than the values in your statement changing, they remain consistent thanks to statically named variables.
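To make that concrete outside of Django, here is the bind-variable pattern using Python’s sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO product (name) VALUES (?)", [("widget",), ("gadget",)])

# The statement text never changes; only the bound value does, so the
# database can match it against previously parsed statements.
row = conn.execute("SELECT name FROM product WHERE id = ?", (1,)).fetchone()
print(row[0])  # widget
```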

Besides the IN statement bloating up in size, it also changes constantly depending on the result set of the previous query (in this case the products query), so your database can’t get a good cache of the feature statements you’re sending it.

The real take-away here is that prefetch_related() works very well for queries that you expect to return a small number of results, but as your result set grows, you will see that performance benefit turn into a loss quickly.

I have a user profile model for every project I build in Django, so when django-debug-toolbar told me that the auth app was sloppily selecting just my User object, then making another trip back to the database when I called get_profile(), I knew this would not stand. Here’s a quick little patch to select your user and profile data in one query.

Somewhere in your project (utils or misc file would be good), add this snippet:
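Something along these lines, assuming your profile model hangs off User via a one-to-one with the related name 'profile' (adjust the select_related target to match yours):

```python
from django.contrib.auth.backends import ModelBackend
from django.contrib.auth.models import User


class ProfileBackend(ModelBackend):
    """ModelBackend that fetches the user and profile in a single query."""

    def get_user(self, user_id):
        try:
            return User.objects.select_related('profile').get(pk=user_id)
        except User.DoesNotExist:
            return None
```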

Then in your settings:
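Point AUTHENTICATION_BACKENDS at wherever you put the class (the myproject.utils path here is hypothetical):

```python
AUTHENTICATION_BACKENDS = (
    'myproject.utils.ProfileBackend',
)
```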

After adding this code, make sure to log out of your app and back in (Django caches your backend after logging in, so it will keep going back to the original ModelBackend if you don’t). You should see one less query in your debug tools.

If you frequently access a model even deeper than your profile, you can easily change up the 'profile' to something like 'profile__company' and fetch them all in a single query.

So you found the wonderful django-storages and now you’re using S3 to store your static and media files. Perfect. All those 9’s should keep your content safe from datacenter drama (hopefully).

But what about fat fingers or malicious users? One bad DELETE request issued to S3 is all it takes to wipe out your data. Your static files can easily be pulled back out of version control, but your media files are gone.

Enter S3 versioning. Versioning is a great feature of S3 that gets very little press. The idea is simple: when enabled, S3 will archive and version any file that gets uploaded to your bucket and the newest version of a file gets served when a request comes in. In addition, if you issue a DELETE request against that same file, it will insert a marker at the top of the version stack indicating the file is deleted, so it knows that the next time it’s requested, a 404 should be served. The end result is that your files can never be deleted with a simple DELETE request.

So how do you enable versioning on a bucket? Simple: issue an authorized PUT request against the bucket like so (note ?versioning query string):
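Something like the following, with your bucket name and a signed Authorization header in place of the placeholders:

```http
PUT /?versioning HTTP/1.1
Host: my-media-bucket.s3.amazonaws.com
Authorization: <signed AWS authorization header>
Content-Type: text/xml

<VersioningConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Status>Enabled</Status>
</VersioningConfiguration>
```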

Now we could probably just leave it at that. After all, your files are now safe from an accidental or rogue DELETE. But your static files that are already in version control and can easily be restored are now being re-versioned by S3 with every collectstatic you issue. This is obviously redundant and costly, so let’s fix that.

If you’re anything like me, somewhere in your settings.py you have something like this:
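Something like this, with a single bucket serving double duty (keys and bucket name are placeholders):

```python
DEFAULT_FILE_STORAGE = 'storages.backends.s3boto.S3BotoStorage'
STATICFILES_STORAGE = 'storages.backends.s3boto.S3BotoStorage'

AWS_ACCESS_KEY_ID = 'your-access-key'
AWS_SECRET_ACCESS_KEY = 'your-secret-key'
AWS_STORAGE_BUCKET_NAME = 'my-bucket'
```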

This is great, but unfortunately it binds both your media and static to the same bucket. This isn’t going to work if we want to keep our static files off of the versioned bucket we created for our media. Let’s separate that out into two buckets: one versioned bucket for our media, and one regular bucket for our static. Somewhere in your project, create an s3config.py that looks like this:
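A sketch using the S3BotoStorage backend’s bucket argument (bucket names are placeholders):

```python
from storages.backends.s3boto import S3BotoStorage


class MediaS3Storage(S3BotoStorage):
    """Media files go to the versioned bucket."""
    def __init__(self, *args, **kwargs):
        kwargs['bucket'] = 'my-media-bucket'
        super(MediaS3Storage, self).__init__(*args, **kwargs)


class StaticS3Storage(S3BotoStorage):
    """Static files go to the plain, unversioned bucket."""
    def __init__(self, *args, **kwargs):
        kwargs['bucket'] = 'my-static-bucket'
        super(StaticS3Storage, self).__init__(*args, **kwargs)
```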

Now back in your settings.py, let’s separate out those storage backend configs:
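Assuming the file above lives at myproject/s3config.py (adjust the dotted path to your layout):

```python
DEFAULT_FILE_STORAGE = 'myproject.s3config.MediaS3Storage'
STATICFILES_STORAGE = 'myproject.s3config.StaticS3Storage'
```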

We’re looking good now. Our media files will get uploaded to the versioned bucket, and our replaceable static files are uploaded to a regular bucket.

That just about wraps things up. There are a couple of issues I didn’t tackle in this post, such as how to actually recover from an old version if you do accidentally delete your files. There are good docs on AWS that explain this process.