Initial Data in Django

November 10, 2015

I've struggled to find an ideal way to load initial data for Django projects. By “initial data,” I'm referring to the kind of data that you need on a new system for it to be functional, but could change later. These are largely lists of possible choices, such as time zones, countries, or crayon colors.

Here are my requirements:

Fairly simple to run on initial server deploy, initial development environment setup, and when starting a test run.

Does not risk overwriting changes that are made to records in the live database after they're initially created.

Not too hard to update from the current live data, so that future deploys get the latest data.

Copes well as models evolve, because they will.

Well supported by Django.

Not too great a performance impact on testing.

Here are some of the approaches that I've tried.

Fixtures

Fixtures are how Django used to recommend loading initial data.

Pros:

It's fairly easy to update the fixtures as the "initial" data evolves - e.g. you've added more options in your live server, and want to preserve them in the initial data, so just do another dumpdata.

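Fixtures don't slow down test startup if they're not named "initial_data.XXXX" or you're using a recent version of Django (automatic loading is deprecated as of Django 1.7 and is skipped for apps that use migrations), because they don't get loaded automatically.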

Easy enough to load at the beginning of tests that need them by adding a fixtures attribute to the test case class.

Cons:

fatal - If a fixture is loaded again, it overwrites any changed data in the database with the original values

Discouraged by current Django documentation

Hard to keep valid when models evolve. The right way would be, every time a model changes, to update the fixtures from the current data, then create a fresh temporary database without applying the new migrations, load the current fixtures, apply the new migrations, and make a fresh dump of the initial data. But that's a lot of work, and hard to remember to do every time models change.

Data is not automatically available during tests, and since our system won't run correctly without some of this data, you have to arrange to load or create it at test setup.

Not loaded automatically, so:

When setting up a new development environment, you either document the step, which is still easily overlooked, or get developers to run a setup script that includes it.

For automated deploys, not safe to run on every deploy. Probably the only safe approach is to run manually after the first deploy.

Summary: rejected due to risk of data loss, inconvenience during development, and negative recommendation from Django documentation.

Fixture hack

I played around with a modified loaddata command that checked (using natural keys) if a record in the fixture was already in the database and did not overwrite any data if the record had previously been loaded.

This means it's safer to add to scripts and automatic deploys.
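The heart of such a command is the filtering step. Here is a minimal sketch of that step in plain Python. It is not the actual command: it assumes a JSON fixture dumped with natural keys, a hypothetical crayons.color model whose natural key is its name, and that the caller has already fetched the natural keys present in the database. All function and variable names are made up for illustration.

```python
import json

# Hypothetical fixture, in the shape dumpdata produces when primary keys
# are replaced by natural keys.
FIXTURE = """[
  {"model": "crayons.color", "fields": {"name": "red", "hex_code": "#ff0000"}},
  {"model": "crayons.color", "fields": {"name": "blue", "hex_code": "#0000ff"}}
]"""


def natural_key(record, key_fields):
    """Build a hashable natural key from a fixture record's fields."""
    return tuple(record["fields"][name] for name in key_fields)


def select_new_records(fixture_text, existing_keys, key_fields):
    """Return the fixture records whose natural key is not in the database."""
    seen = set(existing_keys)
    new_records = []
    for record in json.loads(fixture_text):
        key = natural_key(record, key_fields)
        if key in seen:
            continue  # loaded before; leave the database row untouched
        seen.add(key)
        new_records.append(record)
    return new_records


# "red" is already in the database, so only "blue" is left to load.
to_load = select_new_records(FIXTURE, {("red",)}, ["name"])
```

A real command would then deserialize and save only the returned records, so rows that were loaded earlier (and possibly edited since) are never rewritten.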

Pros:

Fairly easy to update as "initial" data evolves - e.g. you've added more options in your live server, and want to preserve them in the initial data, so just do another dumpdata

Fixtures don't slow down test startup if they're not named "initial_data.XXXX" or you're using a recent version of Django, because they don't get loaded automatically

Easy enough to load at the beginning of tests that need them by adding a fixtures attribute to the test case class.

Can add to env setup scripts and automated deploys safely

Cons:

Hard to keep valid when models evolve

Data is not automatically available during tests

Not loaded automatically, so when setting up a new development environment, you either document the step, which is still easily overlooked, or get developers to run a setup script that includes it

Summary: rejected; it mitigates one problem with fixtures, but all the others remain.

Post-migrate signal

Something else I experimented with was running code to create the new records in a handler for the post_migrate signal, even though the docs warn against data modification in that signal.
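For illustration, a handler of this kind might look roughly like the sketch below. The TimeZone model, the DEFAULT_TIME_ZONES list, and the function name are assumptions, not code from the project described here. The handler relies only on the app config that Django passes as the signal's sender.

```python
# Hypothetical initial data for a hypothetical TimeZone model.
DEFAULT_TIME_ZONES = ("UTC", "America/New_York", "Europe/London")


def create_default_time_zones(sender, **kwargs):
    """Create any missing TimeZone rows after migrations have run.

    `sender` is the app config of the app that was just migrated. Its
    get_model() returns the current model class from models.py, which is
    why this breaks when the tables are in an older or newer state.
    """
    TimeZone = sender.get_model("TimeZone")
    for name in DEFAULT_TIME_ZONES:
        # get_or_create inserts only when no row with this name exists.
        TimeZone.objects.get_or_create(name=name)
```

The handler would typically be connected in the app config's ready() method with post_migrate.connect(create_default_time_zones, sender=self), where post_migrate is imported from django.db.models.signals.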

Pros:

Runs automatically each time migrations are run, so will automatically get run during most automated deploys

Runs automatically when tests are setting up the test database, so all tests have the data available - but is part of the initial database, so we don't have the overhead of loading initial data during every test's setUp.

Cons:

fatal - Runs every time migrations are run, even reverse migrations - so it can run while the tables don't match the current models, which breaks development whenever you migrate forward and back

If it fails, the whole migration fails, so you can't just ignore a failure even if you didn't care about creating the initial data that time

Slows down database creation when running tests, unless you use --keepdb

Summary: rejected; not a valid way to load initial data.

In a migration

Add a migration that creates the initial records.
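As a rough sketch, the function at the center of such a migration could look like this. The crayons app, the Color model, its fields, and the INITIAL_COLORS list are all hypothetical names chosen for the example.

```python
# Hypothetical initial data for a hypothetical "crayons" app.
INITIAL_COLORS = [
    ("red", "#ff0000"),
    ("green", "#00ff00"),
    ("blue", "#0000ff"),
]


def load_initial_colors(apps, schema_editor):
    """Create any missing Color rows without touching existing ones."""
    # apps.get_model returns the historical model for this point in the
    # migration history, not the class defined in models.py.
    Color = apps.get_model("crayons", "Color")
    for name, hex_code in INITIAL_COLORS:
        # Rows are looked up by name; `defaults` is used only on insert,
        # so a hex_code edited later in the database is preserved.
        Color.objects.get_or_create(name=name, defaults={"hex_code": hex_code})
```

In the migration file, this function would be listed in the Migration class's operations as migrations.RunPython(load_initial_colors, migrations.RunPython.noop), with a dependency on the migration that creates the table. The noop reverse keeps the migration reversible without deleting any data.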

Pros:

This is what the Django documentation currently recommends

Runs automatically

The migration only runs when the database schema matches what it was when you wrote it, so it won't break as models evolve

You can write it to ignore records that already exist, so it won't overwrite later changes in the database

Cons:

fatal in some cases - migrations don't use the actual model class, so models with custom behavior (like MPTTModel) won't get created correctly. You might be able to find workarounds for this on a case-by-case basis.

Slows down database creation when running tests, unless you use --keepdb

Harder than fixtures to update as the initial data evolves. Options:

Go back and edit the original migration - but then it won't run on existing databases and they won't get the new records

Add a new migration that adds the whole updated initial data set, then go back and comment out the code in the previous initial data migration since there's no point running it twice on new database setup

Add yet another migration for just the new data - probably the simplest in terms of updating the migrations, but it'll be harder to extract just the new data from the current database than to just extract the whole dataset again. Also, you don't preserve any edits that might have been made over time to older records.
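To make the trade-off in that last option concrete, here is a small sketch of extracting just the changes between two dumps of the initial data. It assumes each dump has already been reduced to a list of plain dicts keyed by a unique name field; the function name and sample rows are hypothetical.

```python
def diff_initial_data(old_rows, new_rows, key="name"):
    """Split the current dump into rows added and rows edited since the last dump."""
    old_by_key = {row[key]: row for row in old_rows}
    added, edited = [], []
    for row in new_rows:
        previous = old_by_key.get(row[key])
        if previous is None:
            added.append(row)      # not in the earlier initial data at all
        elif previous != row:
            edited.append(row)     # same key, but some field has changed
    return added, edited


OLD = [{"name": "red", "hex_code": "#ff0000"}]
NEW = [
    {"name": "red", "hex_code": "#cc0000"},   # edited on the live server
    {"name": "teal", "hex_code": "#008080"},  # added on the live server
]
added, edited = diff_initial_data(OLD, NEW)
```

A follow-up migration that only inserts the rows in added would leave out everything in edited, so new databases would still get the old values for those records.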

Summary: best option so far. It has some drawbacks, but not as bad as the other options.

Conclusion

The best approach in most cases is probably to load initial data in a migration, as the Django documentation recommends. It's not perfect, but it avoids some of the fatal flaws of other approaches. And the new (in Django 1.8) --keepdb option helps ameliorate the slow test startup.

I'm still curious if there are other approaches that I haven't considered, though.