21 Sep 2011

As I have mentioned in previous posts, I've been working on a data warehouse project. Within the project we decided to use SQL Server 2008 R2 Master Data Services (MDS) to store all the warehouse specific reference data. Here are some reasons why MDS was used:

Allowed us to manage the reference data.

Reference data could be loaded into the warehouse like a source system

Would allow the client’s data governance team to easily update the reference data and keep the warehouse up-to-date

For active development of entities and data loads we use a sandpit instance. When the ETL was ready to use the new entities or loaded data, a cut of the sandpit instance would be promoted to the development environment. We came across a problem when we needed to identify some changes which were accidentally made on the development instance.

I came up with a method which helped to identify the changes fairly easily, below is the method I used. I will say that it’s not a perfect solution and might not work for everyone or continue to work when MDS updates have been applied.

Run the following SQL script on both instance of the MDS database servers:

Then use the visual studio 2010 data compare tool to highlight any data differences.

While using the above methods to do the comparisons I found some differences which I had to ignore. The reason we had to ignore them was because of the following scenario:

While loading data through the batch staging process we had to set the default name attribute of the entities to be an empty string. This is because it would not allow nulls and we did not want to use this attribute. However the development instance has the default name attribute set with nulls. I believe this was because the MDS deployment tool had converted the empty element of the deployment package, which was created because of the empty string, to null while uploading.

19 Sep 2011

Introduction

I have, as mentioned in a previous blog, been working on a data warehouse project using CDC for extracting the source system data. I would like to share some of the experiences and evaluation criteria used for selecting a CDC tool for our project.
The reason for using a specialist tool was as follows:

Couldn't move all the source databases to SQL Server 2008

Needed the capture change data to be sent to another server instance

A consistent management experience for CDC between all our source systems. The two database vendors the tool needed to support was SQL Server and IBM AS400

Timescales

First we underestimated how long it would take to select a CDC tool. We plan for 1 month, for both evaluation criteria and running the tests, and it took us about 2/3 months. The time was taken in evaluating the tools against the criteria and gaining access to database systems from other projects.
However our development of the ETL solution was able to continue, while the CDC tool hadn’t been selected, as we used generic columns for the CDC metadata. Our requirements allowed this approach because only a selected set of metadata was needed from the CDC tool which most tools offered. The method to ensure that the changes were picked up in the right order was to use a row ID, which was a big integer identity seed, within the CDC destination table. This was able to work as most CDC tool records the changes in the same order as they are made.

Evaluation criteria

Here are some categories of criteria, which our client used, to select which CDC tool to evaluate, these are common sense categories:

Impact: How CDC would impact both source and target systems servers and databases. The Server impact was measured in how much resources it would take to use on the server: memory, CPU, etc.

The database impact was whether it made any schema changes and if it had performance implications for the application. This was important to know as there was a third-party tool where the support contract would have been invalidated if any schema changes were made to their database. To tests for schema changes I use a simple procedure which was to use VSDBCMD tool to import the schema into a schema file before installing CDC. After the install import again to another file and use VS2010 schema compare tool to compare against the schema files.

Schema changes: Does the CDC tool cope with schema changes being made on the source system. Could the tool continue to work if the schema changes were made onto columns which weren't being captured or if any new columns were added.

Speed: How quickly were changes committed to the target server. The metrics for this was by the volume of changes and speed to commit them. Also how quickly the initial synchronisation took to complete.

Management: What was the management tool, how easy was the management tool to use, how quickly CDC was recoverable from errors or disasters.

While evaluating CDC tool we also found that we had to consider some other factors. One consideration was how it impacted the operational procedures of the source systems, for example backups, recovery and deployments.
Another factor was the number of databases and tables which we wanted to capture data from. We had 20 source databases, which had about 15 tables each. Depending on the CDC tool more time would have to be spent on the development and deploying the CDC solution.

To develop or to configure that is the question? (sorry couldn't resist the Shakespeare pun)
That is the question that we found ourselves asking. As some tools required development of code to create a CDC solution. Also with the tools which required development the experience and environment had to be considered as part of the evaluation. Such as the level of privileges required on the development computer; is there any integration with source control or how to protect from loss of work; how easy is it to transfer development from one developer to another.
Another aspect of the CDC tools evaluation was how easy it was to deploy from our development environment to production. How easy was the CDC solution to deploy and was there any automation through scripting or exporting.

Observations

While evaluating we found the same tool gave a different level of change data capture experience depending on the database vendor it was configured against. For example: when configured against the AS400 the tool was able to give a full row of data. But when configured against and SQL server it was only able to give the columns which had changed. The reason for this was how the tool had implemented CDC for the SQL server. There was no requirement for the replication components for SQL server to be installed, without this component the SQL server log file only records the columns which had changed. Hence why the tool was only able to give only the change columns data and not the full row of data.
We also found that the same CDC tool behaved differently when running against different processing editions (x86 and x64). The difference was with settings and memory requirements.

Introduction

I have been working on a data warehouse project with a difference. The difference is that the ETL is not doing the classic extract of source system data, instead the source system is going to send its data to the data warehouse by using change data capture (CDC). The decision for using CDC was as follows:

To have the ETL only process the data that has changed within the source system between each run of the ETL. This would help the ETL perform as it would only have to process the change data and not work out what the changes were first.

Not to have the ETL processes impact the central source system database. The CDC would be responsible for delivering the changes of the source system to another database server. Then the ETL could be run at any time within the day and wouldn't be responsible for blocking the transactional source system. As there was plans to have the ETL run at the end of the day for two time zones e.g. US and UK.

I would like to share some of the experiences, decisions and challenges that we face while trying to build a data warehouse using CDC. The first decision, which I will put into another post, was which CDC software to use. As a consequence of using CDC we faced a challenge on how to extract data from source systems where CDC couldn't be applied, One such system was Master Data Services (MDS) this was due to how the data is stored within its database. As this would mean the transforms for this source system would have to use a different approach to where CDC was used. What we decided to do was mocked CDC through use of SSIS and then stored the delta changes that we found through our dataflow process. The reason for choosing this approach was so that all the transforms had a consistent approach.

Development

Early into development we discovered that we had to define what was meant by processing intraday changes. Did this mean every row which was captured by CDC within that Day? In other words, that every change which was captured should create a history change within the data warehouse leaving the last change as current. Or did this mean capture events changes within the day? For example scanned the changes for that day and find the flags for the records that are required. For us this meant capture event changes. Importance of understanding the meaning of intraday changes had an impact on how we approached coding the transform procedures.

Another challenge that we faced was how to deal with transforms which require data from two tables. The reason why this was a challenge is because of the following scenario: An entity within the warehouse has a business rule which requires data from 2 tables from the source system before it can be evaluated. However within the source system data has been changed and captured for one table only. For example: business rule order status requires data from the OrderHeader table and the OrderLine table. As illustrated below.

The source system updates the OrderLine table and a CDC captures the changes and sends them to the ETL. Then the ETL process runs with only the OrderLine changes tries to evaluate the order status business rule but how to evaluate this when OrderHeader data has not been supplied? As illustrated below.

There are many ways this can be resolved, here are some examples:

Change the application process to create updates to the required linked tables. This might not solve the problem as the CDC software may not have committed all of the changes before the ETL process starts running, unless the CDC is part of the ETL. Also this may cause performance problems for the application team.

To design the data warehouse scheme and business rules so that the transforms don't require joins between tables to create a data warehouse. This I will highlight may not be practical as it could mean it basically becomes a copy of source systems schemas which would give you no value for reporting and analysing.

To get transform to read the missing data from the data warehouse if CDC hasn’t passed on the data. This was a good option to consider however there are few considerations to be aware of:

Could have performance issues with the transform for some of the following reason: Using left outer joins on warehouse tables and depending on the warehouse schema could require a large number of tables; have use of function calls to IsNull or Coalesce; The size of the tables data warehouse could make query tuning very difficult.

Adding complexity to untranslate business rules or transforms if source system data isn’t being natively stored

Harder to read and maintain as the transform isn’t just transforming source data

The data warehouse could become even wide or larger if the source systems native data is stored within the warehouse to help the transform. This will have an impact for querying and processing performance; also add overhead cost for storage.

We solve the problem by extending one of our client's requirements. The requirement was: keep a copy of the source system data changes for each day for a further 30 days. To meet this we configured the CDC to send all the changes to an archive database. Then have the ETL process to extract the day changes from the archive database. Rather than have CDC drop the changes to a landing database and then have the ETL copy the data again to the archive database as well as transform the data. The extension of the requirement was to keep all changes of the latest version of each primary key within the archive database. This then allowed us to write a post-extract process to extract any missing data required for any of the transforms where data wasn't present. This then allowed us to write the transforms in a manner that always assumed that all data was in the same location.

Testing

We found that we had to change the way that we prepared to test our ETL process by using datasets. Normally we would create some datasets which would mirror the source system in a particular state for testing certain scenarios for transforms and business rules. However this time we only needed to create change records for testing but the problem was how to ensure that the change records were fed in a controlled manner to ensure consistent mirrored states of source system and valid transitions were maintained. We resolved this problem by investing time to create a test harness solution. Which mimic our CDC tool of choice by inserting the test data into the archive database and then running the ETL process. One thing I would recommend is that you trial your CDC tool over your source systems to see what data changes will be actually captured. This will help to create valid test data for testing transforms and business rules within your ETL process