For most SSIS developers, the first experience loading slowly changing dimensions is the SSIS Slowly Changing Dimension Wizard. It’s fairly easy and straightforward to use, but offers painfully poor performance. There are also third-party SQL Server Integration Services (SSIS) SCD components available, but I’ve tested all of them and never found one that I was happy with. As you can probably guess, I prefer to use T-SQL Merge statements in an SSIS Execute SQL Task. There are several reasons for this.

What are the Pros and Cons of using Merge to load Slowly Changing Dimensions?

Pros

Merge statements are very versatile and allow for a high degree of customization. This is important when you run into a scenario which deviates from standard data warehousing practices.

The Merge source can be a table, view, common table expression, etc. In fact, the Merge destination (or target) could be a view as well.

Cons

The Source and Destination should be on the same SQL instance. Yes, you can perform Merge with linked servers, but I wouldn’t do so.

Merge statements can be wordy and tedious to write, as you’ll see in a minute. Merging a pair of tables with 20+ columns is tedious at best.

Merge performance relies upon proper indexing and setup. Small mistakes can result in drastically reduced performance.

The initial load of a table can be painful since all of the information is typically written twice, once in staging and then again in the destination table.

Merge statements can be confusing at first so there’s a bit of a learning curve. However, once you get the hang of it, I think you’ll agree that Merge is a great way to go.

When discussing Merge today we are going to stipulate the following:

Each entity in our Dimension will have 1 active record which will be designated by a RowIsCurrent (or another variation) field and/or RowStartDate and RowEndDate.

Entities in our Dimension will be continuous. So, for our DimCustomer, we’ll assume that we don’t have customers cancelling their accounts and then re-starting them. There will be no gaps in time.

On each load, there will only be 1 record per entity to Merge. In other words, for a particular customer, there will only be 1 changed record per load. You’ll notice that my code includes a field called “RankNo” which handles situations where this is not the case such as initial loads. I’ll dive deeper into that process in a future article.

Let’s Get to the Code!

First download the following file which will create a database called JediMergeDemo and the necessary schema and tables. For those who read my last article, these tables and data will look very familiar. As in the previous article, CustomerName will be a Type 1 change, while Planet will be Type 2. I’m including the Fact Table even though we aren’t going to load it. I’m including it because Foreign Key relationships affect how you write your Merge statements.

First, let’s pre-load our Dimension table and also our Staging table. We’ll start by loading a record that requires a Type 1 change in staging. Obi-Wan Kenobi changes his name to Ben Kenobi before he goes into hiding.
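The staging load itself isn’t shown in this copy, but based on the demo schema it presumably looks something like this sketch (the CustomerNum and Planet values are my assumptions, not from the original download):

```sql
-- Hypothetical staging load for the Type 1 test case;
-- CustomerNum and Planet values are assumed.
TRUNCATE TABLE Staging.DimCustomer;

INSERT INTO Staging.DimCustomer (CustomerNum, CustomerName, Planet)
VALUES (1, 'Ben Kenobi', 'Tatooine');  -- name change only: a Type 1 change
```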

Type 1 Merge

Unfortunately, using T-SQL Merge to process Slowly Changing Dimensions typically requires two separate merge statements. Also, it’s important to note that I’m covering the Type 1 Merge process first because it is the simplest to understand. In a real-world scenario, I typically handle the Type 1 Merge step AFTER I process Type 2. I’ll explain why in a future article. So, did you read the previous articles on Merge? If you did, the Type 1 Merge code will look eerily similar. The Type 1 Merge statement is simple because all it does is check for changes to the Type 1 columns of the dimension. If any of them have changed, an update is run against every row for that entity with the new value. As I mentioned earlier, we only have a single Type 1 column, CustomerName. The Type 1 Merge statement is as follows:

USING (Select CustomerNum, CustomerName From Staging.DimCustomer) as Source
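Only the USING clause survives in this copy; assembled from the surrounding discussion (and the reader-supplied version further down the page), the complete Type 1 statement is presumably along these lines:

```sql
MERGE dbo.DimCustomer AS Target
USING (SELECT CustomerNum, CustomerName FROM Staging.DimCustomer) AS Source
    ON Target.CustomerNum = Source.CustomerNum
WHEN MATCHED AND EXISTS
    -- EXCEPT returns a row when any compared value differs, NULLs included
    (SELECT Source.CustomerName
     EXCEPT
     SELECT Target.CustomerName)
THEN
    UPDATE SET Target.CustomerName = Source.CustomerName;
```

The EXISTS/EXCEPT pattern is used instead of `<>` comparisons so that NULLs compare correctly without extra ISNULL wrappers.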

However, after extensive testing, even with a dimension where only 1 of many attributes is handled as Type 1, there was no difference in performance or query plan. So, Type 1 Merge is simple, right? Well, Type 2 Merge is a lot more difficult.

Type 2 Merge

Note: Properly handling Type 2 changes with T-SQL Merge is complex, and most people don’t perform it correctly. In order to convey the basics, as well as the complexities, I’m going to present this in layers. If you simply want to see the end result, feel free to scroll down to the final statement.
So, before we tackle the Type 2 Merge, let’s review what our code needs to accomplish. To process a Type 2 change, our code must:

Find the current record for each entity (per business key) in the Dimension. Most often this is done by finding the record where RowIsCurrent = ‘Y’ or some variation. However, in certain situations, you may need to find the current record by comparing the change date to RowStartDate and RowEndDate.

If there are changes to any of the Type 2 columns for that current row, expire it. This is typically done by setting the RowIsCurrent value to ‘N’ as well as assigning the RowEndDate equal to the change date. In the last article, I explained that some people handle these dates differently.

Insert a record with a new Surrogate Key for that Dimension record and insert all of the new values into it. The RowEndDate will be ‘12/31/9999’, NULL, or some other value you choose.

Merge dbo.DimCustomer AS Target
USING Staging.DimCustomer as Source
ON Target.CustomerNum = Source.CustomerNum

Followed by a WHEN NOT MATCHED BY TARGET clause which inserts any new records. If my Staging table were durable, meaning it kept every single record that had been loaded into the Dimension, I might want to add a WHEN NOT MATCHED BY SOURCE section as well.

The previous snippet inserts brand new records, but what about handling the Type 2 expiration process? Well, this next snippet of code handles just the expiration, not the insertion of the newly created replacement records.
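The expiration snippet did not survive in this copy; given the schema described earlier, it presumably takes a shape like this fragment (only the Type 2 column, Planet, is compared; expiring with GETDATE() is my assumption, a change date from the source could be used instead):

```sql
WHEN MATCHED AND Target.RowIsCurrent = 'Y' AND EXISTS
    -- a difference in any Type 2 column means the current row must expire
    (SELECT Source.Planet
     EXCEPT
     SELECT Target.Planet)
THEN
    UPDATE SET Target.RowIsCurrent = 'N',
               Target.RowEndDate   = GETDATE()
```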

Now, this is one of the places Type 2 Merge statements get tricky. You cannot re-insert those expired records from within the Merge statement itself. You must capture them with an Output clause, then filter for only the updated records. Enclose the entire statement (so far) in parentheses and alias the sub-query.

So, we’re filtering for only the expired records (Update) and outputting all of the columns, not just the Type 2 columns, because we’re going to use the results from this sub-query as the source of an insert statement. Before I share that portion of the code, I want to point out another common error people make when they write these statements. Suppose you appended the following to the beginning of the Merge statement and ran it. You’d receive an error like the following:

The target table ‘dbo.DimCustomer’ of the INSERT statement cannot be on either side of a (primary key, foreign key) relationship when the FROM clause contains a nested INSERT, UPDATE, DELETE, or MERGE statement. Found reference constraint ‘fk_FactSalesDetails….’

So, how exactly do we get around this? As my good friend Peter Schott points out, a good method is to first insert these records in a Temp Table, and then insert them into the Dimension from there. Let’s look at the code:
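The code itself is missing from this copy; assembled from the pieces discussed so far, the temp-table version presumably looks like this sketch (column names follow the demo schema, but the types and lengths are my assumptions):

```sql
-- Hypothetical reconstruction; types/lengths are assumed.
CREATE TABLE #DimCustomer
(
    CustomerNum   int         NOT NULL,
    CustomerName  varchar(50) NOT NULL,
    Planet        varchar(50) NOT NULL,
    RowIsCurrent  char(1)     NOT NULL,
    RowStartDate  datetime    NOT NULL,
    RowEndDate    datetime    NOT NULL
);

-- The Merge inserts new entities and expires changed rows; the OUTPUT clause
-- captures the values needed to build each expired entity's replacement row.
INSERT INTO #DimCustomer
    (CustomerNum, CustomerName, Planet, RowIsCurrent, RowStartDate, RowEndDate)
SELECT CustomerNum, CustomerName, Planet, RowIsCurrent, RowStartDate, RowEndDate
FROM
(
    MERGE dbo.DimCustomer AS Target
    USING Staging.DimCustomer AS Source
        ON Target.CustomerNum = Source.CustomerNum
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerNum, CustomerName, Planet, RowIsCurrent, RowStartDate, RowEndDate)
        VALUES (Source.CustomerNum, Source.CustomerName, Source.Planet,
                'Y', GETDATE(), '12/31/9999')
    WHEN MATCHED AND Target.RowIsCurrent = 'Y' AND EXISTS
        (SELECT Source.Planet EXCEPT SELECT Target.Planet)
    THEN
        UPDATE SET Target.RowIsCurrent = 'N',
                   Target.RowEndDate   = GETDATE()
    OUTPUT $action AS MergeAction,
           Source.CustomerNum, Source.CustomerName, Source.Planet,
           'Y' AS RowIsCurrent, GETDATE() AS RowStartDate,
           CAST('12/31/9999' AS datetime) AS RowEndDate
) AS MergeOutput
WHERE MergeAction = 'UPDATE';

-- Insert the replacement (new current) rows from the temp table,
-- sidestepping the FK error a direct nested insert would raise.
INSERT INTO dbo.DimCustomer
    (CustomerNum, CustomerName, Planet, RowIsCurrent, RowStartDate, RowEndDate)
SELECT CustomerNum, CustomerName, Planet, RowIsCurrent, RowStartDate, RowEndDate
FROM #DimCustomer;

DROP TABLE #DimCustomer;
```

Note that the nested MERGE with OUTPUT is legal here only because it is the immediate source of rows for an INSERT statement.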

As far as I’m concerned, it is not absolutely necessary to define the Temp Table the way I have. I’ve seen it successfully done using a Select * into #DimCustomer from….. Since my Merge statements are mostly generated automatically, I spell it out. Also, it should be noted that you could simply have another version of the destination table that acts as temporary storage for these records. That way, you wouldn’t have to regenerate a Temp Table each time, though you would need to remember to Truncate the additional holding table. Anyway, let’s test it with the following Type 2 change. You’ll notice that I’m truncating the Staging table before a new Merge. Yoda moves to Dagobah.
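The test script isn’t shown in this copy; it presumably resembles the following (Yoda’s CustomerNum is an assumed value):

```sql
TRUNCATE TABLE Staging.DimCustomer;

-- Yoda changes planet: a Type 2 change. CustomerNum 4 is assumed.
INSERT INTO Staging.DimCustomer (CustomerNum, CustomerName, Planet)
VALUES (4, 'Yoda', 'Dagobah');
```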

So, I bet you’re thinking, “Well, that was a bit more complex than I expected, but at least that’s it. That is it, right? Right?” Well, not completely.

Transactions and Error Handling

The code above will work under the conditions I stipulated when we started. However, what happens if there’s an error? When these statements are executed, I want the entire Merge set (both statements) to succeed or fail together. I also tend to include code which records any errors in a table for later analysis. For the sake of this demo, I’ll just display them to the user. So, if we add that code to our Merge statements, we have:
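The final wrapped version isn’t shown in this copy; a sketch of the transaction and error-handling shell would be along these lines (the comments stand in for the full Merge statements above; surfacing the error via a SELECT is my assumption):

```sql
BEGIN TRY
    BEGIN TRANSACTION;

    -- Type 2 Merge (expire changed rows, insert replacements) goes here.
    -- Type 1 Merge goes here.

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;

    -- For this demo, just surface the error to the user;
    -- in production I'd log it to an error table instead.
    SELECT ERROR_NUMBER()  AS ErrorNumber,
           ERROR_MESSAGE() AS ErrorMessage;
END CATCH;
```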

To test this, let’s deliberately cause an error. We’ll add two records for Yoda. If you’ll remember, one of the stipulations above was that we would have only 1 changed record per entity per merge. Run the following code and re-run our final merge statement.
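The error-provoking script is missing here; it would be something along these lines (values assumed):

```sql
TRUNCATE TABLE Staging.DimCustomer;

-- Two changed rows for the same business key violates our one-change-per-load
-- stipulation, so the Merge will try to update the same target row twice.
INSERT INTO Staging.DimCustomer (CustomerNum, CustomerName, Planet)
VALUES (4, 'Yoda', 'Coruscant'),
       (4, 'Yoda', 'Dagobah');
```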

The MERGE statement attempted to UPDATE or DELETE the same row more than once. This happens when a target row matches more than one source row. A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. Refine the ON clause to ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows.

Feel free to take the code samples and test out merging records with both Type 1 and Type 2 changes. Also, I’m constantly refining my processes, including Merge so if anyone has suggestions for improvement, I’d love to hear them.

In future articles, I plan to cover:

A detailed explanation of how historical Data Warehouse loads (should) work.

How to perform those kinds of loads with T-SQL Merge. This will also handle incremental load situations where more than one Type 2 change may occur between extracts.

Merge statement performance tuning.

How to optimally Load Fact Tables using Merge.

Options for T-SQL Merge statement “automatic” generation. This will include 3rd party products, free open source projects, and my own generator code. I have T-SQL Code which generates Merge statements, but I’m also generating them using BIML Script.

I’m busy working on an SSAS cube solution (multidimensional model). I need to model a many-to-many relationship where both dimensions are slowly changing. The Fact table has month-end data (snapshot measures).

I used the following process:
1. Use your approach (with slight modification to handle reloads) to model the Dim1 and Dim2 dimensions. Dim1 and Dim2 are slowly changing, and between them there is a many-to-many relationship.
2. Create a bridge table between Dim1 and Dim2 with fields Dim1Key, Dim2Key, and MonthEndDateId. The bridge table handles the relationship between the two tables.
3. Join Dim1 to the Fact table: the Dim1 join results in distinct records in the Fact table.
4. In SSDT Dimension Usage, I included a relationship on the Date dimension joining to the Bridge table on MonthEndDateId. This gives me the ability to see changes in Dim1 and Dim2 over time.

It’s working fine, but I wanted to know if there is a better implementation of this solution. If I denormalise the dimension, each record in the Fact table gets multiplied about 20 times.

Great article, you’ve convinced me that sp is the way to go; not only for scd, but for all ETL tasks. I’m sure Microsoft will get rid of SSIS in about 5 years and everything will have to be redone anyway.

Very nice description of loading SCDs. I need some advice from your side. Typically in a DW environment, an SCD dimension contains many columns, and every column has its own definition: some columns have the Type 1 property and others may have the Type 2 property. We want to write T-SQL procedures like the example you have given, where we put OR conditions for the Type 1 columns, like this, and so on for Type 2.
MERGE dbo.DimCustomer AS Target
USING Staging.DimCustomer as Source
ON Target.CustomerNum = Source.CustomerNum
WHEN MATCHED AND EXISTS
(SELECT Source.CustomerNum, Source.CustomerName
EXCEPT
SELECT Target.CustomerNum, Target.CustomerName)
-- I can put OR conditions here for other Type 1 columns
THEN
UPDATE SET Target.CustomerName = Source.CustomerName;

Can you suggest an approach? I can think of creating meta tables as well, where I can store all the Dim columns’ properties and build a generic SP for handling SCDs.

Brian, my staging tables that I use for the Merge are typically not durable. The data in them is temporary and only represents the changed data since the last extraction. If I added a “When not matched by Source” clause, then I would erroneously expire records. Incidentally, in one of my latest Biml presentations I generate merge statements with “when not matched by Source” and do expire them.

Hi, GREAT article. I need some help creating a dim table to track the employment status of an employee. Basically I have to look in two tables:
Employee_details
(employee_details_id, employment_Start_Date, Employment_end_date) and
Employee_status_history table
(Status_hitory_Id, employee_details_id, StatusName, StatusStartDate, StatusEndDate)

There is no entry in the employment_status_history table when a person joins the company; a row only appears when one of the statuses below (employment_status) changes for the employee. In essence, not all employees will have an entry in employment_status_history – those who don’t have one are assumed to be active employees since their start date (i.e. employee_details.start_date).

When there is an entry in Employee_details END_date, it means the employee is inactive.
In the employment_status_history table, only statuses 8 and 9 (Permanent Retirement and Left Service) represent inactive employment, and the end date is NULL; if there is an end_date for Left Service, it means the employee has re-joined.
I need to keep track of two statuses: Employment_Status (Active or Inactive) and, within employment, various statuses (Maternity Leave, Long Sick, Onsite deputation, etc.).

Great example and tutorial. I thoroughly enjoyed it. I have a request where I will use the Type 2 change, but I am also required to record the previous value of a field. For example, attribute 1 = color, attribute 2 = previous color. Attribute 1 is treated as a Type 2 change. Once the change occurs, the business rules state the previous value must be recorded in attribute 2. How can I accomplish this with the merge? Thank you.

Do you have any information on how to use the “Select * into #DimCustomer from…..” syntax instead of manually defining each column? I’ve done some research and am not quite sure how to do this. I receive the following error and can’t seem to find my way out: “A nested INSERT, UPDATE, DELETE, or MERGE statement is not allowed in a SELECT statement that is not the immediate source of rows for an INSERT statement”. Thanks!

Apparently they will be kept in the dimension, nonetheless what if we would like to flag those as ‘inactive’ or something similar?

The deleted records are not being updated that means that they stay at the Current= “Y” status. How would you update the deleted records to ‘Inactive’ assuming we have another column “Active” in our dimension.

Thanks for these great articles and best practices. I really like the Merge statement, but it’s easy to make little mistakes when writing ETL using Merge. Do you have a kind of code generator (stored procedures) which generates ETL stored procedures using metadata (like a source view or something)? If you have, maybe it would be nice to post an article about this; I think a lot of people would be interested. I can build one myself, but why do it if somebody else already did?

Loved this post. I’m involved now in a dwh project for which I would like to use the MERGE statement on the type 2 dim tables. In this project some extra columns are added, to keep track of history. Among them are WA_Previous_ID and WA_Root_ID. The first holds the surrogate key of the record that was closed by a change, the second holds the surrogate key for the first occurrence.

To obtain these values I tried to take them from the [Target] records and include them in the output. Unfortunately it seems impossible to do so: ‘The multi-part identifier “Target.WA_Root_ID” could not be bound.’

Is there a way to incorporate attributes from the [Target] row that will expire in the MERGE output?

This is an excellent article/tutorial and I thank you for taking the time to put it together. It is presented in a way that, unlike the MS docs, is possible to understand and very explanatory. There is one thing that threw me a little, and that’s the test data. When you initially populate dbo.DimCustomer, RowIsCurrent and RowEndDate are not populated. Also, when expiring rows, you switch from CustomerName to Planet in the match clause. Both of these result in no records being added or updated in the initial Merge statements. This was easily resolved by modifying the data, but it was a little confusing to me while I struggled to get a complete understanding of the Merge statement. Again, I thank you, for this is a great help to me.

I believe I’ve found a way to perform the update and row retirement in a single merge statement. This updates the original row and writes original columns to the retired record, maintaining the original surrogate key. Please reply if you find any caveats which I have missed:

Late to the game on this thread, but I am wondering if it is necessary to include the PK in the WHEN MATCHED AND EXISTS portion of this SCD2 merge statement. I wouldn’t think so, but I noticed that in this example you included it, while in your post related just to the SCD1 merge statement, the PK was not included in the WHEN MATCHED AND EXISTS portion of the SQL. The post I am referring to: “Writing T-SQL Merge Statements the Right Way”.