Tuesday, September 27, 2011

Learn SSIS : SCD - Slowly Changing Dimensions transformation

Slowly changing dimension transformation was one of the mystery transformation for me till now, as i never had chance to actually implement it. So today i block my time to implement this transformation by taking a simple example.

The concept of slowly changing dimension is explained very well and in detail on wikipedia, i would request readers to have a look at below link if you are not aware of SCD concept.

Before starting on our example, let me give you some glance of the SCD transformation provided in SSIS 2008. This transformation is used to implement Type 1 and Type 2 SCD, for other types we need to add some custom logic to our ETL. (Type 1: Identify the changes and update the record, no history will be recorded, Type 2: Any change identified we expire the old record and create new record with new values, here we save history information in old record)

OK.. Lets take simple Employee dimension example... in this example i am getting EmployeeFeed.xls file as input for my DimEmployee table (which is my dimension table) and i am using SCD transformation to identify any changes and implement DimEmployee as Type 2.

So before we start implementing any SCD, we need to first identify attribute in the dimension table for which we need to track changes. In this example i want to track changes for Designation, City and Phone attributes. I am expecting no change in Name attribute or column.

You might have noticed that there are two columns EmpId and EmpKey why these columns are needed in dimension table??

Ans:EmpId : This is a Business Key, which uniquely identifies a employee in entire data warehouse system.

EmpKey : Is a Surrogate key, which uniquely identifies record in dimension table, and also its a key to identify historical records.

We also have two more columns StartDate and EndDate, these two columns are used to track time of changes, if EndDate is null it means the record is most recent record in dimension table.

Steps to Implement SCD in a data flow.

1. After we add Source (which is excel in our case EmployeeFeed.xls), we need to add Data conversion transformation, to correct if there are any Data type conflicts.2. Then we add SCD transformation to Data flow, and this will open SCD wizard, Click next on welcome screen.

3. On Select a Dimensoin Table And Keys page, select your dimension table in this case its DimEmployee, Map all the columns from source excel to destination DimEmployee table. One important thing here we do is identify Business Key, which in our case is EmpID. Then click Next

4. On Slowly changing dimension columns page, we need to select appropriate change type of the Dimension Columns and here we have three types : Fixed Attribute--> No change expectedChanging Attribute --> Changes are expected, but no need to record history, same record will be updated.Historical Attribute--> If this attribute is changed, old record will be expired (by setting EndDate as current date) and new record will be inserted with new attribute value

In our example, we don't expect any change for Name Attribute hence we selected this as Fixed Attribute, and rest all (Phone,Designation and City ) will be selected as Historical Attribute. Once we are done Click Next

5. On Historical Attribute Options Screen, we have two option, we can use any flag column to show which record is expired and which is most recent and other option is to use StartDate and EndDate, in this example we are using second option, and also selected StartDate and EndDate column appropriately.

6. For all other screens in this wizard just select next, and on last screen select Finish.

If you have noticed, we have two outputs from SCD transformation, New Output and Historical Attribute Output. So if there are any new records which are not present in dimension table those records will be redirected to New Output, and all existing records with some changing attributes will be redirected to Historical Attribute output.

Running the data flow..

I have 9 records in my sample EmployeeFeed.xls file..

So when i run my data flow for first time, all these 9 records will be redirected to New Output and will be inserted to DimEmployee Table.

Next, I did some changes in EmployeeFeed.xls, Changes are marked in yellow... so there are 4 records which are changed and 2 new records added.

If you can see the data flow, two records are redirected through New Output pipeline and 4 moved through Historical Attribute output, so what happens to those 4 records is we update the EndDate to latest date, then again insert them with new changed attrribute keeping EndDate as null. as shown below.

Thanks pointing out the error, the reason why the dates are same is on Historical Attribute option page, I have selected variable to set date as System::CreationDate rather i should have selected System::StartTime.

I will correct it soon, thanks again for your reply. Please also let me know if you want me to cover few more topics.

HiI don't understand the last part, how does the SSIS know you changed those fields?Where does it compare the new file and the old one?I am trying this with a sql table, I changed the value with an sql statement, ran the SSIS again and it did not realize I made a difference..Your help is highly appreciatedThanks!

SSIS compares the input (in my case its excel file) with the existing Dimension table (DimEmployee in my case) based on BusinessKey ( in my case its Emp_Id). Please make sure the business key exist in your dimension table to see differences between input and output of SCD transformation.

getting The below errorCannot map columns of different types.Column 'Copy of ID' is of type 'System.Double' and column 'EmpId' is of type 'System.Int32'.Column 'Copy of phone' is of type 'System.Double' and column 'Phone' is of type 'System.String'.