Sunday, March 31, 2013

I've long been a fan of SQL Azure (under its various names), mainly because it's usually provided the easiest first step for migrating existing .NET apps into the cloud. That said, there have always been hundreds of features it's lacking compared to "real" SQL Server, and I've never been particularly satisfied with any of the official means for data migration.

So imagine (if you can!) my excitement last September with the announcement that it had become possible to add Azure SQL DB as a linked server. As well as allowing queries which join tables across multiple servers, I also had high hopes that it would make migrating data from one to the other much easier. Between two linked SQL Servers, I've often used the nice and simple SELECT INTO command to copy both the data and schema of a table in one easy step.
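
For reference, between two ordinary linked SQL Servers that one-step copy looks something like this (the server, database and table names are just placeholders):

    -- Copy both the schema and data of a remote table into a new local table in one step.
    SELECT *
    INTO   dbo.Customers
    FROM   [SourceServer].[SourceDb].[dbo].[Customers];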

On my veritable roller coaster of emotions that day, imagine now the devastation caused by the following response from Azure:

Msg 40510, Level 16, State 1, Line 1
Statement 'SELECT INTO' is not supported in this version of SQL Server.

...due, it seems, to something about tables without clustered indices not being supported :-(

After my initial disappointment, I figured that I could still move data from one place to another using INSERT INTO ... SELECT, provided the schema was already in place on the destination. After lots of playing, I finally came up with the procedure defined below for copying an entire database from a classic SQL Server into a linked Azure database - whilst also maintaining referential integrity. Various other restrictions encountered along the way mean that it gets a bit ugly in places (especially due to all the dynamic SQL), but it seems to work pretty well, and I've found it especially useful in scenarios where I've needed to re-run the data push several times over during development.
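
The basic shape of each push is something like this (all the names are placeholders, and I'm assuming the Azure database has been added as a linked server called [AzureServer]):

    -- Push rows through the linked server; the destination table must already exist.
    INSERT INTO [AzureServer].[MyAzureDb].[dbo].[Customers] (Name, Email)
    SELECT Name, Email
    FROM   dbo.Customers;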

1. Build the schema on Azure

The apps I'm writing at the moment are based on Entity Framework (code-first), so I must admit that building the actual schema of databases is something I get for free these days. If you're in a similar situation to me, just plug into your Web.config the connection string which the Azure portal provides for your server, and fire up your app to have Entity Framework build the empty database.

Otherwise, you could use Management Studio's "Script Database as" menu item to give you the SQL to create the schema.

3. Ready the Azure schema for migration
You'll need two new columns on each table to facilitate the data migration: DataImport_OriginalId (to contain the original primary key value of the data that will be transferred - needed to keep foreign keys valid across tables), and DataImport_IsClone for some of the jiggery-pokery that takes place in (5). I'm assuming that each table's primary key is a single int identity column.

You'll also need to disable FK constraint checking in readiness for the data to be bulk loaded (because it's quite likely that referenced rows won't yet exist at the time of loading).
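
A minimal sketch of the kind of script that does this - walking every user table via INFORMATION_SCHEMA and PRINTing the statements to be run against the Azure database - looks something like the following (adapt it to your own schema):

    -- Sketch only: emit the ALTER TABLE statements needed on every user table.
    DECLARE @schema SYSNAME, @table SYSNAME;

    DECLARE table_cursor CURSOR FOR
        SELECT TABLE_SCHEMA, TABLE_NAME
        FROM   INFORMATION_SCHEMA.TABLES
        WHERE  TABLE_TYPE = 'BASE TABLE';

    OPEN table_cursor;
    FETCH NEXT FROM table_cursor INTO @schema, @table;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        PRINT 'ALTER TABLE [' + @schema + '].[' + @table + '] ADD DataImport_OriginalId INT NULL;';
        PRINT 'ALTER TABLE [' + @schema + '].[' + @table + '] ADD DataImport_IsClone BIT NOT NULL DEFAULT 0;';
        PRINT 'ALTER TABLE [' + @schema + '].[' + @table + '] NOCHECK CONSTRAINT ALL;';
        FETCH NEXT FROM table_cursor INTO @schema, @table;
    END;

    CLOSE table_cursor;
    DEALLOCATE table_cursor;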

This generates some SQL to add the columns to each table, and disable the constraints. If you're brave, replace the PRINT commands with EXEC, but I prefer to view the SQL it generates and then copy and paste to run it manually.

5. Tidy up!
We now have all of the data sitting in our Azure database, but the primary keys will almost certainly differ from their originals, so any foreign key values will reference non-existent or incorrect rows. That's because we can't SET IDENTITY_INSERT ON across successive connections, and with a linked server you have no control over which connection each SQL statement runs on. What we do have, though, are the original IDs, preserved in the DataImport_OriginalId column we added in (3).

Because we can neither modify the newly assigned PKs, nor even reseed the identity columns (DBCC CHECKIDENT isn't supported in Azure SQL DB), the only way I've found to get the rows back with their original PK values is as follows:

1. Switch on IDENTITY_INSERT
2. Find a PK value suitably distant from anything that currently exists or will need to exist... lazily, I've used MAX(PK) + MAX(DataImport_OriginalId)
3. Insert (and then delete) one row to set the identity seed to this value
4. Switch off IDENTITY_INSERT
5. Clone all rows, flagging them as cloned using the DataImport_IsClone column
6. Delete all original rows (where DataImport_IsClone = 0)
7. Switch on IDENTITY_INSERT
8. Clone all the cloned rows, with the PK now explicitly set to DataImport_OriginalId
9. Switch off IDENTITY_INSERT
10. Delete all rows from the first clone!

Repeat for all tables; then (and only then) re-enable FK constraint checking.

Here comes the humdinger of a query to generate the SQL for all tables:
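
The query boils down to emitting, for each table, a block along the lines of the sketch below - shown here for a hypothetical table [dbo].[Customers] with an int identity primary key [CustomerId] and a single data column [Name]:

    -- Sketch of the generated SQL for one table, following the numbered steps above.
    DECLARE @seed INT;
    SELECT @seed = MAX(CustomerId) + MAX(DataImport_OriginalId) FROM dbo.Customers;

    -- Steps 1 to 4: insert (then delete) a row with an explicit, suitably distant PK,
    -- which pushes the identity seed well clear of any value we'll need later.
    SET IDENTITY_INSERT dbo.Customers ON;
    INSERT INTO dbo.Customers (CustomerId, Name, DataImport_OriginalId, DataImport_IsClone)
    VALUES (@seed, N'seed row', 0, 0);
    SET IDENTITY_INSERT dbo.Customers OFF;
    DELETE FROM dbo.Customers WHERE CustomerId = @seed;

    -- Step 5: clone every row, flagging the clones.
    INSERT INTO dbo.Customers (Name, DataImport_OriginalId, DataImport_IsClone)
    SELECT Name, DataImport_OriginalId, 1 FROM dbo.Customers;

    -- Step 6: delete the original rows.
    DELETE FROM dbo.Customers WHERE DataImport_IsClone = 0;

    -- Steps 7 to 9: clone the clones, this time with the original PK values.
    SET IDENTITY_INSERT dbo.Customers ON;
    INSERT INTO dbo.Customers (CustomerId, Name, DataImport_OriginalId, DataImport_IsClone)
    SELECT DataImport_OriginalId, Name, DataImport_OriginalId, 0
    FROM   dbo.Customers
    WHERE  DataImport_IsClone = 1;
    SET IDENTITY_INSERT dbo.Customers OFF;

    -- Step 10: delete the first set of clones.
    DELETE FROM dbo.Customers WHERE DataImport_IsClone = 1;

Once every table has been through this, each table's foreign key constraints can be re-validated with ALTER TABLE ... WITH CHECK CHECK CONSTRAINT ALL.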

Note that I've deliberately left behind the two columns I created in (3), because they have no negative impact on my application, and they allow me to repeat this process as many times as I need by emptying the database and re-running (4) and (5).

Improvements?
I find this technique for migrating data into the cloud useful, but it would be great to find a way to remove the need for all the data shuffling that goes on in (5) just to restore the original identities. Something like the ability to switch on IDENTITY_INSERT for all connections would make things a whole lot nicer.

If you have any thoughts or suggestions to share for making this cleaner, I'd love to hear from you!

Tuesday, March 19, 2013

If you're using Entity Framework code-first without database migrations, you'll be only too aware of the pain of updating the live database schema whenever you make changes to the model.

Assuming you're using a DropCreateDatabaseIfModelChanges initializer on your debug build, your development database will be rebuilt automatically whenever the model changes, but (assuming you don't want to lose data) any changes will need to be made manually on the live server before deployment.

The following scripts are what I use to identify the differences between development and live schemas.
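
As a sketch of the kind of comparison involved - assuming the live database can be reached as a linked server called [LiveServer] (the name is just a placeholder) - a full outer join between the two INFORMATION_SCHEMA.COLUMNS views does the job:

    -- Sketch: list columns that exist on only one side of the dev/live pair.
    SELECT  dev.TABLE_NAME               AS [Table],
            dev.COLUMN_NAME              AS [Column],
            dev.DATA_TYPE                AS [Type],
            dev.CHARACTER_MAXIMUM_LENGTH AS [Length],
            live.TABLE_NAME              AS [Remote Table],
            live.COLUMN_NAME             AS [Remote Column]
    FROM    INFORMATION_SCHEMA.COLUMNS AS dev
            FULL OUTER JOIN [LiveServer].[LiveDb].INFORMATION_SCHEMA.COLUMNS AS live
                ON  dev.TABLE_NAME  = live.TABLE_NAME
                AND dev.COLUMN_NAME = live.COLUMN_NAME
    WHERE   dev.COLUMN_NAME IS NULL
       OR   live.COLUMN_NAME IS NULL;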

Running script (2) will return a set of modifications to be made on the live server.

Items that appear with 'Remote Table' and 'Remote Column' as NULL need to be added to the live server, according to the details given in the other columns. Items that have entries for 'Remote Table' and 'Remote Column' but NULLs elsewhere need to be removed from the live server.

So, for example, the results might show that a new column HomeAddress of type nvarchar(max) needs to be added to the Students table on the live server, and that Telephone2 needs to be removed from it.

4. Update the hash

Script (2) can be run as many times as needed. When it returns no rows, you know that the schemas on the development and live servers match. In order to let Entity Framework work with the new version of the schema, you'll also need to update the model hash on the live database manually, copying across the hash that was generated automatically on the development server.
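
If you're on one of the older code-first versions that stores this hash in an EdmMetadata table (later versions track the model in the __MigrationHistory table instead, where you'd copy the whole row rather than a simple hash), the manual copy amounts to something like this sketch:

    -- On the development server: read the hash EF generated for the new model.
    SELECT ModelHash FROM dbo.EdmMetadata;

    -- On the live server: overwrite the stored hash with the value copied above.
    UPDATE dbo.EdmMetadata
    SET    ModelHash = '<hash value copied from the development server>';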