How to deploy a database

Published on: 24 Nov 2010

When I'm building an application that stores data, there are a few things I try to make sure we can do when it comes to managing the lifecycle of our storage. I'm going to focus on SQL Server, but this applies to other relational storage.

What's in a database?

Even outside of production, a database is more than schema. A database "definition" consists of:

Tables, columns, constraints, and indexes

Views, stored procedures, functions, and triggers

Reference data - the lookup rows the application can't run without

Security - logins, roles, and permissions

It may even include management job definitions (backup, shrink), though they tend to be left to DBAs.

Transitions, not States

As each iteration progresses, we'll make changes to our database.

One approach to upgrading existing databases is to "diff" the old state of the database with our new state, and then to manually modify this diff until we have something that works. We then employ the "hope and pray" strategy of managing our database, which generally means "leave it to the DBA" for production. Tools like Red Gate SQL Compare and Visual Studio Database Edition encourage this approach.

I never understood the state or model-driven approach, and I'll explain why. Here's a simple example of some T-SQL:
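Something like the following captures the idea (a sketch - the table and column names are illustrative):

```sql
-- A transition: add a NOT NULL column to a table that may already hold rows.
-- The default constraint gives the existing rows a value, so the script
-- succeeds even on a populated database.
alter table dbo.Customer
    add IsPreferred bit not null
        constraint DF_Customer_IsPreferred default (0);
```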

Again, an instruction. A transition. I don't tell the database what the state is - I tell it how to get there. I can provide a lot more context, and I have a lot more power.

You'll note that I used a default constraint above - that's because my table might have had data already. Since I was thinking about transitions, I was forced to think about these issues.

Our databases are designed for transitions; attempting to bolt a state-based approach on them is about as dumb as bolting a stateful model on top of the web (and Visual Studio Database Edition is about as much fun as ViewState).

Keep in mind that making changes to databases can be complicated. Here are some things we might do:

Add a column to an existing table (what will the default values be?)

Split a column into two columns (how will you deal with data in the existing column?)

Move a column from one table onto another (remember to move it, not to drop and create the column and lose the data)

Duplicate data from a column on one table into a column on another (to reduce joins) (don't just create the empty column - figure out how to get the data there)

Rename a column (don't just create a new one and delete the old one)

Change the type of a column (how will you convert the old data? What if some rows won't convert?)

Change the data within a column (maybe all order #'s need to be prefixed with the customer code?)

You can see how performing a "diff" on the old and new state can miss some of the intricacies of real-life data management.
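To make one of these concrete, here's a hedged sketch of splitting a column - the transition has to deal with the rows that are already there, not just the new shape (table and column names are illustrative):

```sql
-- Split Person.FullName into FirstName and LastName, preserving existing data.
alter table dbo.Person
    add FirstName nvarchar(100) null,
        LastName  nvarchar(100) null;
go

-- A naive split on the first space; real data would need more care.
update dbo.Person
set FirstName = left(FullName, charindex(' ', FullName + ' ') - 1),
    LastName  = ltrim(substring(FullName, charindex(' ', FullName + ' ') + 1, 100));
go

alter table dbo.Person
    drop column FullName;
```

A state-based diff of the before and after schemas would only see "drop FullName, add FirstName and LastName" - the update in the middle is exactly what it cannot express.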

Successful database management

Testability
I want to be able to write an integration test that takes a backup of the old state, performs the upgrade to the current state, and verifies that the data wasn't corrupted.

Continuous integration
I want those tests run by my build server, every time I check in. I'd like a CI build that takes a production backup, restores it, and runs and tests any upgrades nightly.

No shared databases
Every developer should be able to have a copy of the database on their own machine. Deploying that database - with sample data - should be one click.

Dogfooding upgrades
If Susan makes a change to the database, Harry should be able to execute her transitions on his own database. If he had different test data to her, he might find bugs she didn't. Harry shouldn't just blow away his database and start again.

The benefits to this are enormous. By testing my transitions every single day - on my own dev test data, in my integration tests, on my build server, against production backups - I'm going to be confident that my changes will work in production.

Versions table

There should be something that I can query to know which transitions have been run against my database. The simplest way is with a Versions table, which tells me the scripts that were run, when they were run, and who they were run by.

When it comes to upgrading, I can query this table, skip the transitions that have been run, and execute the ones that haven't.
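A minimal sketch of such a table (the column names here are my own choice, not a standard):

```sql
create table dbo.Versions (
    Id          int identity(1,1) primary key,
    ScriptName  nvarchar(255) not null,
    ExecutedOn  datetime      not null default (getdate()),
    ExecutedBy  nvarchar(128) not null default (suser_sname())
);

-- The upgrade runner skips anything already recorded here:
-- if not exists (select * from dbo.Versions where ScriptName = @script)
--     ... execute the script, then insert a row into dbo.Versions ...
```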

Sample data

Development teams often need access to a good set of sample data, ideally lots of it. Again, these should be transition scripts that can be optionally run (since I might not want sample data in production), and in source control.
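An optional sample-data transition might look like this (a sketch - the table and values are made up):

```sql
-- Idempotent sample data: safe to re-run, and easy to leave out of production.
if not exists (select * from dbo.Customer where Name = 'Sample Customer')
    insert into dbo.Customer (Name, IsPreferred)
    values ('Sample Customer', 0);
```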

Document Databases

Most of these principles apply to document databases too. In fact, in some ways the problems are harder. While you don't have a fixed schema, you're probably mapping your documents to objects - what if the structure of those objects changes? You may need to run transition scripts over the document database to manipulate the existing documents. You may also need to re-define your indexes. You want those deployment scenarios to be testable and trusted.

Welcome, my name is Paul Stovell. I live in Brisbane and work on Octopus Deploy, an automated deployment tool for .NET applications.

Prior to founding Octopus Deploy, I worked for an investment bank in London building WPF applications, and before that I worked for Readify, an Australian .NET consulting firm. I also worked on a number of open source projects and was an active user group presenter. I was a Microsoft MVP for WPF from 2006 to 2013.

Stacy

02 Dec 2010

Very topical for me at the moment.

Our latest project is very database-heavy.

We're going to be using tarantino to run state transitions (but we are using SQL Compare to generate the first hack of state change).

We tried DB Pro for about a day. It failed us with strange compile issues (even though the db schema is valid and creates a DB) and generated an 8MB xml file.

Rich Fantozzi

02 Dec 2010

Good article - I agree with a lot of it. I wonder if you extend the versioning from the database to the code that is being run on it. For example, in our CI there are tests that look at code version vs. database version and ensure that they can be run against each other. We found that you can have code that builds, tests that pass, but something in the db is forgotten. So each code version has to be mapped to a valid db version. While it is not foolproof, it has greatly limited the number of misses.

I largely agree with the approach taken by Paul, and have used it successfully on some very large systems.

Actions like keeping your database scripts under version control, having isolated development environments (including the databases), and having automated builds and deployments of the changes are the minimum of what I would recommend.

One distinction I do make is between the schema (mostly tables) and other (executable code) elements.

For any table that holds data, it is usually a non-trivial exercise to move from v1 -> v2 -> v3. Migrations need to be made in order and you cannot simply dump v3 on top of v1.

On the other hand, a stored procedure is much more similar to your client application code. You can easily jump from v1 -> v3, or even then jump back to v2 without the complications of potential data loss.

You can still have problems if the version doesn't match the schema -- the same problems if client code doesn't match -- but this is easier to fix than data loss.

I've also used differencing tools (DB Pro) and had some success, but did find scenarios where the comparison failed. Without a trustworthy tool I found myself using comparison to generate a change script which I then had to manually verify and update.

A differencing tool can fail if you make too many changes, e.g. if I change the name, type (size) and position of a column: following on from the above if I change the FullName column to DisplayName --

Paul,
Can you share any examples of Versions Table?
What I meant is, some sample scripts and how / when do you insert data into the table etc!

Regards,
Antony.

Ujjaval Suthar

09 Dec 2010

Hey Paul,

Great post.

Just checking: when you say 'Source control the database', do you mean source controlling just the scripts that define the database, or actually putting the database containing all master/test data into source control?

I can imagine that would be a really expensive approach if it sits on top of the DBAs' regular backup strategies for databases.

So, if you don't have a shared database, exactly what are the integration tests (on the build server) running against? Surely you don't want them to point to your local DB; they should point to a central dev DB?
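Each test spins up its own throwaway database - roughly like this (a sketch; the TestDatabase class and its members are assumed names, not a library):

```csharp
// Sketch of a self-contained integration test (NUnit + NHibernate;
// the TestDatabase helper and its members are illustrative names).
[Test]
public void UpgradeDoesNotCorruptData()
{
    using (var database = new TestDatabase())
    {
        database.Deploy();

        using (var session = database.SessionFactory.OpenSession())
        {
            // ... exercise the schema and verify the data ...
        }
    } // Dispose drops the throwaway database again
}
```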

Where database.Deploy() creates a new database (with a GUID for the name) on the local SQL instance, deploys the schema/sample data using scripts, and does all the other things from above. database.SessionFactory would create and configure an NHibernate session factory connected to that database. Finally, database.Dispose would delete the SQL database.

These integration tests would run on my dev machine (so long as I have a local SQL Server and the right permissions), and on the build server. The tests could also be run in any order.

Using a configuration option, I might be able to make database.Deploy act differently - in one configuration, it might create a new, empty database - in another, it might restore a scrubbed production database from backup and upgrade it. That way the same test suite could be run in a "new install" vs. "upgrade" scenario.

That seems really inefficient though (i.e. each test does that?) And surely it means trouble if someone else has a test testing something slightly different?

I'll be honest; I don't think that is quite ... "ideal". I'd be intrigued to see if you find this system actually works (legitimately interested); seems too "heavy" to be practical. But I don't know; I haven't given it much thought since I posted my response, which is so long ago now that I've basically forgotten what I was thinking at the time :P

Anton Gogolev

21 Jan 2011

Take a peek at octalforty Wizardby. This is my open-source project which actually provides a platform-independent framework for database versioning & schema migration.