Monday, 23 August 2010

Incremental Load - TSQL

For my first technical post in a short while I thought I would look at something I have worked on over the summer that I thought would be good for me to document and share. I needed to develop a relatively simple incremental load process for a very small web app that took the contents of a spreadsheet and loaded that spreadsheet into a SQL Server database table.

At this stage I have to point out that I learned many of the techniques I used here from reading Andy Leonards' (Blog | Twitter) Chapter in SQL Server MVP Deep Dives - The Autonomy of an Incremental Load.

I'll provide a brief overview of my process before I get down to the nitty gritty of the load itself.

I take the spreadsheet that holds the data I want to load and load it into a staging table in my database. This was done using SSIS.

I then compare the staging table with actual live table and decide if any updates need to be made.

If they are identical I simply send an email to the 'keepers of the spreadsheet' informing them.

If there are differences between the staging table and live then I want to:

Insert any new rows in staging but not in the destination table

Update any rows in the destination table where the keys match rows in both tables but expected date column is different.

If rows exist in the destination but don't exist in the staging table then it has been deleted from the spreadsheet and should be removed from the destination table.

The 'Keepers of the spreadsheet' should then be emailed informing them that the database table has been updated with the changes that they made.

Before we begin lets create a database, a staging table and a destination table for us to work from:

So we now have data in the source that doesn't exist in the destination table. Running a simple SELECT * against both tables will show us this:

--SELECT * shows the differences between the tablesSELECT * FROM dbo.StagingTableSELECT * FROM dbo.DestTable

Part of my requirement was to email the 'Keepers of the Spreadsheet' if the no updates were needed. I decided that it would be good to check if the incremental needs to be run by checking for differences in the staging and destination tables first. If there are differences then I run the incremental load, If there are no differences my script won't do anything but email the necessary people saying there were no changes made.

To do this I stumbled across a function that I have not previously used called CHECKSUM_AGG. Books Online says the following about this function:

CHECKSUM_AGG can be used to detect changes in a table.

The order of the rows in the table does not affect the result of CHECKSUM_AGG. Also, CHECKSUM_AGG functions may be used with the DISTINCT keyword and the GROUP BY clause.

If one of the values in the expression list changes, the checksum of the list also generally changes. However, there is a small chance that the checksum will not change.

If you run the following script against the database tables and data created above you should get a printed message that says there are differences between the source and destination:

IF (@DestCheckSum <> @SrcCheckSum OR @DestCheckSum ISNULL) BEGINPRINT'There are difference between source and destination tables'END

ELSEPRINT'No changes to make'

We have run our check which tells us we need to run the load so what do we do, well if rows exist in the staging table but not in the Destination table we want to insert those rows into the destination table. So how do we do that with TSQL? The answer is an INSERT INTO, a SELECT and a LEFT JOIN.

We insert into the destination table the result of the select from the source table joined against the destination table with a left join. The left join means that all the rows in the left table (the source table) are returned regardless of whether a match is made on the join key (id) in the other table. If no record exists or is not known for that key in the destination table then a NULL value is returned in the left join result set for the rows in the destination table (the table in the right hand side of the JOIN), The WHERE clause

WHERE d.id IS NULL

Ensure that the query only pulls out only rows that exist in the source table, as a NULL for the ID column are not allowed in the destination table we can safely assume that these only exist in the source table and should be inserted into the destination table.

If we run the insert above we get (6 row(s) affected) If we run it again straight after we get (0 row(s) affected) because the rows exist in both tables.

To confirm that your tables are now identical you can run the checksum_agg script above. I get a " No changes to make" message.

The next thing we need to do is update columns in the destination table that have the same key as rows in the staging table but the other field has a different value.

Lets update a row in the staging table, so we have the same record ID but different ExpectedCompletion date in the staging and destination tables.

We then need to run the incremental load but this time not inserting new rows but update existing records where the ID exists but the expectedcompletion date is different between the staging table and the destination table.

This is achieved by running an update statement against the destination table which includes an inner join with the staging table on the ID fields. If the ID's match in both tables but the expected completion date is different, we update the destination table record (d) with the corresponding value for the same record in staging. Here's the TSQL:

We have shown how to load new records and we have shown how to update records that have changed, all that's left is to delete records that have been removed from the spreadsheet and hence needs to be removed from the destination table as part of the incremental load.

Again we will use our staging table and destination table as an example and we will delete a record from the staging table so I can then demonstrate how the incremental load handles deletes. In reality this is very similar to the insertion of new records using an outer join.

First up, lets delete a record from staging so we are in a situation where we need to remove a record from our destination table.

DELETEFROM dbo.StagingTableWHERE id = 1

We now have a record in the destination table that no longer exists in the spreadsheet and staging table. How does the incremental load cope with this? The following TSQL show's how to delete the records from the destination table where the corresponding record in the staging table no longer exists:

So there we have it, the components of my incremental load laid out, from inserting new records, updating changed records and deleting removed records. Andy Leonards' chapter in SQL Server Deep Dives also demonstrates how to perform this using SSIS. If you are interested in this I suggest that you check out the book and Andy's chapter.