An expert in Business Intelligence, Richard works upstream, helping companies articulate the business-critical strategies that will yield tangible results through data mining and analysis. As a consultant, he leverages his career with IBM and Microsoft, as well as his MBA from London Business School, to ensure positive outcomes for his clients. Areas of speciality include Microsoft SQL Server BI development; relational database performance; OLAP/data mining; Microsoft Virtual TS.

Saturday, June 9, 2012

SQL Server 2012 Integration Services is similar to previous versions in that it supports extremely fast (optionally minimally logged) table loading. As in previous versions, fast loading is assisted by large commit sizes; that is, many thousands of rows are inserted in the same transaction. This can be a challenge when there are occasional records which error on load, such as a duplicate primary key or a data type error. This post is about achieving very fast data loads while handling infrequent errors. Essentially, the pitfall to avoid is that when one record has an error, all records in that transaction are sent to the rejects, when you only want to send the error record(s) to the rejects. Note, this isn't the only way, or necessarily the recommended solution, but it does illustrate the issue well, and will help you determine the best architecture to implement at your site.
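You can see the same "one bad row spoils the batch" behaviour in plain T-SQL, where a single multi-row INSERT plays the role of one commit batch. This is a minimal, hypothetical illustration with a demo table, not part of the package itself:

    -- Hypothetical demo: a single multi-row INSERT behaves like one commit batch.
    -- The duplicate key on the second row fails the whole statement, so the two
    -- good rows are not inserted either.
    CREATE TABLE dbo.DemoTarget (Id int PRIMARY KEY, Payload varchar(50));
    INSERT dbo.DemoTarget (Id, Payload) VALUES (1, 'already loaded');

    INSERT dbo.DemoTarget (Id, Payload)
    VALUES (2, 'good row'),
           (1, 'duplicate primary key'),
           (3, 'good row');   -- statement fails; rows 2 and 3 are lost with it

That is exactly what happens to an error batch in the data flow, just at the scale of the commit size rather than three rows.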
Imagine the scenario where we have 2,000,000 rows to insert into a 50 billion row table. We want to do it fast, but there is a possibility that some of the rows have already been inserted (so we want to ignore them) and other rows might have data out of range and are truly rejects. If we just implement a Data Flow with rejects going to an error table, we will have good records going to the rejects, because all records in the "error" batch go to the error destination.
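For concreteness, a rejects table shaped like an SSIS error output might look like the sketch below. The table and column names are assumed for illustration; ErrorCode and ErrorColumn are the columns SSIS adds to an error output, and everything is left nullable and unconstrained so the reject insert itself cannot fail.

    -- A hypothetical rejects table mirroring the data flow's error output.
    CREATE TABLE dbo.Rejects (
        Id          int          NULL,
        Payload     varchar(50)  NULL,
        ErrorCode   int          NULL,   -- added by the SSIS error output
        ErrorColumn int          NULL    -- added by the SSIS error output
    );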

So what we can do is take the error output and perform a lookup. If the PK exists already, we can ignore the record. If the PK doesn't exist, we try to insert it into the destination table in a second attempt. This time, we set the commit size to 1, so that any record that errors will not take good records with it. The error output from this second attempt contains truly bad records, since they don't already exist in the destination and they have caused an error themselves. During most executions there won't be duplicate records or errors, so extremely fast loading will be achieved. When there are errors, they will be handled efficiently.
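Expressed as pseudo-T-SQL, the logic for each row on the error output looks roughly like this. The table, column and variable names are assumptions for the sketch, not taken from the package, and it reuses the hypothetical Rejects table from above:

    -- Pseudo-T-SQL sketch of the second attempt, per error-output row.
    DECLARE @Id int = 1, @Payload varchar(50) = 'row from the error output';

    IF NOT EXISTS (SELECT 1 FROM dbo.FactTable WHERE Id = @Id)
    BEGIN
        BEGIN TRY
            -- Commit size 1: each row is its own transaction, so a failure
            -- here cannot drag good rows into the rejects.
            INSERT dbo.FactTable (Id, Payload) VALUES (@Id, @Payload);
        END TRY
        BEGIN CATCH
            -- A row that does not exist yet and still fails is a true reject.
            INSERT dbo.Rejects (Id, Payload, ErrorCode)
            VALUES (@Id, @Payload, ERROR_NUMBER());
        END CATCH;
    END;
    -- Rows whose key already exists just go to the row count and are discarded.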

Here is an example Data Flow from SQL Server 2012. The input data has 2,000,001 records, one of which already exists in the Data Warehouse and can be ignored. Notice how 2,000 rows are rejected by the Data Warehouse destination; this is the batch size. There is one duplicate record, and the other 1,999 rows in the transaction are valid. The Lookup finds the match and sends that row to a Row Count (for logging purposes), while the remaining 1,999 rows can be loaded into the Data Warehouse successfully. The second attempt to insert is done with a batch size of 1. If there were any errors in this second attempt they would be redirected to the rejects table. Generally the rejects table has much looser data constraints, so those inserts don't error.

These are the connection properties on the Data Warehouse destination (first attempt). Notice that the Maximum insert commit size is 2000.
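The same commit-size idea exists in plain T-SQL bulk loading, which may help if you are more familiar with BULK INSERT than with the OLE DB Destination property sheet. The file path and table name here are hypothetical:

    -- T-SQL analogue of the commit size: every 2000 rows is committed as its
    -- own transaction, so an error can only roll back the batch it occurs in.
    BULK INSERT dbo.FactTable
    FROM 'C:\loads\staging.dat'
    WITH (BATCHSIZE = 2000, TABLOCK);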

This is the lookup component that checks whether the PK exists already. This lookup runs without a cache, as the destination table contains billions of rows and the package will generally be looking up few, if any, rows. A full cache would be extremely expensive on such a large destination table; don't do that. A partial cache is probably of no use either, since the lookups will tend to be for different PKs.
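With no cache, the Lookup issues a parameterised query against the destination for each incoming row. SSIS wraps it in a derived table, but the essence is just this (column and table names assumed):

    -- Roughly the query a no-cache Lookup runs per incoming row.
    SELECT Id
    FROM dbo.FactTable
    WHERE Id = ?;   -- parameter mapped to the data flow's primary key column

Because the error output is usually tiny, paying one indexed seek per row here is far cheaper than caching a multi-billion-row table.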

These are the connection properties on the Data Warehouse destination (second attempt). Notice that the Maximum insert commit size is 1, so any error affects only one record.

As I mentioned earlier, there are many ways to perform a fast load. The main takeaway from this post is that an error record will spoil its whole batch, so you have to deal with that. The mechanism above essentially retries the records from the spoilt batches with a commit size of 1, so the residual errors are truly errors. I have seen developers do something similar by performing checks before the load. You should determine what is appropriate for your application, keeping in mind the behaviour of error batches.
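One such pre-load check, sketched set-based with assumed staging and destination table names, filters out existing keys so duplicates never reach the destination. Out-of-range data would still need its own handling, for example a WHERE clause or a retry path like the one above:

    -- Hypothetical pre-load check: skip rows whose key is already loaded.
    INSERT dbo.FactTable (Id, Payload)
    SELECT s.Id, s.Payload
    FROM dbo.Staging AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.FactTable AS f WHERE f.Id = s.Id);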

About Me

A Business Intelligence and database expert, Richard spent his early career with IBM and Microsoft as a database specialist. He has combined that with an MBA from the London Business School, giving an unusual combination of business intelligence know-how and business savvy. This makes Richard a useful person to help unearth innovative business strategies and create high-performing database solutions.