SSIS – Package design pattern for loading a data warehouse

I recently had a chat with some BI developers about the design patterns they’re using in SSIS when building an ETL system. We all agreed on creating multiple packages for the dimensions and fact tables, plus one master package to execute them all.

These developers even created multiple packages per single dimension/fact table:

One extract package where the extract(E) logic of all dim/fact tables is stored

One dim/fact package with the transform(T) logic of a single dim/fact table

One dim/fact package with the load(L) logic of a single dim/fact table

I like the idea of building the Extract, Transform and Load logic separately, but I do not like the way the logic was spread over multiple packages. I asked why they chose this solution, and they gave multiple reasons:

Enable running the E/T/L parts separately, for example: run only the entire T phase of all dim/fact tables.

Run the extracts of all dimensions and fact tables simultaneously to keep the loading window on the source system as short as possible.

To me these are good reasons: running the E/T/L phases separately is something a developer often wants during the development and testing of an ETL system, and keeping the loading window on the source system as short as possible is critical in some projects.

Despite the good arguments to design their ETL system like this, I still prefer the idea of having one package per dimension / fact table, with complete E/T/L logic, for the following reasons:

All the logic is in one place

Increase understandability

Perform unit testing

If there is an issue with a dimension or fact table, you only have to make changes in one place, which is safer and more efficient

You can see your packages as separate ETL “puzzle pieces” that are reusable

It’s good from a project manager point of view; let your customer accept dimensions and fact tables one by one and freeze the appropriate package afterwards

A better overview in BIDS; an enormous number of packages does not make things clearer ;-)

Simplifies deployment after changes have been made

Changes are easier to track in source control systems

Team development will be easier; multiple developers can work on different dim/fact tables without bothering each other.

So basically my goal was clear: build a solution that offers everything those developers asked for, but with one package per dimension/fact table; the best of both worlds.

Solution:

The solution I’ve created is based on a parent-child package structure. One parent (master) package will execute multiple child (dim/fact) packages. This solution is based on a single (child) package for each dimension and fact table. Each of these packages contains the following Sequence Containers in the Control Flow:

Normally it would not be possible to execute only the Extract, Transform, Load or (cube) Process Sequence Containers of the child (dim/fact) packages simultaneously.

To make this possible I have created four Parent package variable configurations, one for each ETL phase Sequence Container in the child package:

Each of these configurations is set on the Disable property of one of the Sequence Containers:

Using this technique makes it possible to run separate Sequence Containers of the child package from the master package, simply by enabling or disabling the appropriate Sequence Containers with parent package variables. Because the default value of the Disable property of a Sequence Container is False, you can still run an entire standalone child package without the need to change anything.
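The mechanics of this technique can be sketched in plain Python. This is only an illustration of the idea, not SSIS code; the variable names such as `DisableExtract` are my own invention, standing in for the parent package variables described above:

```python
PHASES = ["Extract", "Transform", "Load", "Process"]

class ChildPackage:
    """Stands in for a dim/fact package with one Sequence Container per phase."""
    def __init__(self, name):
        self.name = name
        # Default Disable = False, so a standalone run executes every phase.
        self.disabled = {phase: False for phase in PHASES}

    def configure(self, parent_variables):
        # Mirrors the parent package variable configurations: each parent
        # variable is mapped onto the Disable property of one container.
        for phase in PHASES:
            self.disabled[phase] = parent_variables.get(f"Disable{phase}", False)

    def execute(self):
        # Run only the Sequence Containers that are not disabled.
        return [phase for phase in PHASES if not self.disabled[phase]]

# Standalone run: nothing is disabled, so all four phases execute.
dim_product = ChildPackage("DimProduct")
print(dim_product.execute())  # ['Extract', 'Transform', 'Load', 'Process']

# Run from a master package that only wants the Transform phase:
dim_product.configure({f"Disable{p}": p != "Transform" for p in PHASES})
print(dim_product.execute())  # ['Transform']
```

The important property is visible in the defaults: because every Disable flag starts as False, a child package run on its own still executes the full E/T/L/P sequence.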

Ok, so far, so good. But, how do I execute only one phase of all the dimension and fact packages simultaneously? Well quite simple:

First add 4 Sequence Containers to the Master package. One for each phase of the ETL, just like in the child packages

Add Execute Package Tasks for all your packages in every Sequence Container

If you executed this master package now, every child package would run 4 times, as each of the 4 Sequence Containers holds an Execute Package Task that runs the same package. To get the required functionality I created 4 variables inside each Sequence Container (scope). These are used as parent variables to set the Disable properties in the child packages. So basically I created 4 variables x 4 Sequence Containers = 16 variables for the entire master package.
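Put together, the master package logic can be sketched as follows, again in plain Python as an illustration only. The package names are hypothetical, and where SSIS runs the Execute Package Tasks in a container in parallel, this sketch runs them sequentially for clarity:

```python
PHASES = ["Extract", "Transform", "Load", "Process"]
CHILDREN = ["DimCustomer", "DimProduct", "FactSales"]  # hypothetical package names

def run_child(name, parent_vars):
    # A child package executes only the containers whose Disable variable is False.
    return [(name, p) for p in PHASES if not parent_vars.get(f"Disable{p}", False)]

def run_master(enabled_phases=PHASES):
    log = []
    for phase in enabled_phases:          # one master Sequence Container per phase
        # The 4 variables scoped to each container (4 x 4 = 16 for the whole
        # master package) disable every phase except the container's own.
        parent_vars = {f"Disable{p}": p != phase for p in PHASES}
        for child in CHILDREN:            # one Execute Package Task per child
            log.extend(run_child(child, parent_vars))
    return log

log = run_master()
# Every child package is started once per phase, but executes only that
# phase each time, instead of running all four phases four times over.
print(log[:3])  # [('DimCustomer', 'Extract'), ('DimProduct', 'Extract'), ('FactSales', 'Extract')]
```

Running a subset of the ETL, for example only T and L, corresponds to disabling the other phase containers in the master package: `run_master(["Transform", "Load"])` in this sketch.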

The LOAD and PROCESS Sequence Containers contain variables based on the same technique.

Results:

Run all phases of a standalone package: Just execute the package:

Run a single phase of the ETL system (Extract/Transform/Load/Process): Execute the desired sequence container in the main package:

Run a single phase of a single package from the master package:

Run multiple phases of the ETL system, for example only the T and L: Disable the Sequence Containers of the phases that need to be excluded in the master package:

Run all the child packages in the right order from the master package: When you add a breakpoint on, for example, the LOAD Sequence Container you see that all the child packages are at the same ETL phase as their parent:

When pressing Continue the package completes:

Conclusion:

This parent/child package design pattern for loading a Data Warehouse gives you all the flexibility and functionality you need. It’s ready and easy to use during development and production without the need to change anything.

With only a single SSIS package for each dimension and fact table you now have the functionality that separate packages would offer. You will be able to, for example, run all the Extracts for all dimensions and fact tables simultaneously like the developers asked for and still have the benefits that come with the one package per dimension/fact table approach.

Of course having a single package per dimension or fact table will not be the right choice in all cases, but I think it is a good standard approach. The same applies to the ETL phases (Sequence Containers). I use E/T/L/P, but if your phases differ, that's fine; you can still use the same technique.

Download the solution with template packages from the URLs below. The only thing you need to do is change the connection managers to the child packages (to your location on disk) and run the master package!


Comments

Nice article and examples. I agree with the design of the master-child implementation and using package variables to control the execution. But I have a slightly different opinion on implementing the T and L parts.

I feel we do not need a separate package altogether for T and L. With a proper design, for example using package variables to decide whether to stage data or load it directly after the transform, and using variable values from the configuration to indicate whether to load data from the staging area or from the stream, the same can be implemented in one package. More layering, by creating more packages for a single ETL cycle of an entity, also means that if different developers are working on it, they need a standard interface for passing data. Also, separating T and L this much for each and every package makes staging inevitable after the transform.

You say you have a slightly different opinion on implementing the T and L part; can you clarify this some more?

You say we don't need a separate package for T and L and I agree with that, that's why I want to use a single package with all phases in it. So I wonder on which specific part your opinion is different, as it looks like we say the same here.

Sorry, I missed the detail. I understand that you are including all the phases in one package, and executing the same package 4 times, and each time you execute a different phase from the same package.

The tricky part of this design is staging becomes inevitable to achieve flexibility of execution of each phase E/T/L at will. But this seems a valid trade-off, compared to the flexibility achieved.

Yes, you are correct. We share the same ideas; it's just that I thought you had a separate package for each E/T/L/P part of a single dim/fact, i.e. 4 packages per entity. Now I feel it's even better than I thought it was :)

Nice article! I am only wondering about the Extract step. How do you implement the extract of a table that is used in multiple dimensions? Do you keep track of the tables you have already extracted, or do you simply extract them multiple times (which would mean doing more work than necessary)?

Nice to hear that you like the article and thanks for the useful comment!

I would say that's typically logic that should be handled in the master package. So I would still create the dimension packages as stand alone "objects" that have all their logic in it and handle things like this from the master package. How I would handle it exactly depends on the specific situation, it can be as easy as running only the extract of one of the two dimensions. If it's more complicated you can create some more variables to handle this, just like Siddharth already suggested. I think there are many ways to solve challenges like that.
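One way the "track what you already extracted" idea from this exchange could be handled at the master level is sketched below. This is purely my own illustration, with hypothetical dimension names and source-table mappings, not part of the article's solution:

```python
# Hypothetical mapping of dimension packages to the source tables they extract.
EXTRACT_SOURCES = {
    "DimCustomer": ["Customer", "Region"],
    "DimStore":    ["Store", "Region"],   # Region is shared with DimCustomer
}

def plan_extracts(dimensions):
    """Build an extract plan where each shared source table is pulled only once."""
    extracted, plan = set(), []
    for dim in dimensions:
        for table in EXTRACT_SOURCES[dim]:
            if table not in extracted:     # skip tables another dim already pulled
                extracted.add(table)
                plan.append((dim, table))
    return plan

print(plan_extracts(["DimCustomer", "DimStore"]))
# [('DimCustomer', 'Customer'), ('DimCustomer', 'Region'), ('DimStore', 'Store')]
```

In SSIS terms, the master package (or a variable-driven precedence constraint, as suggested in the comments) would play the role of the `extracted` set, deciding which child extract to skip.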

The ETL steps in all my child packages have an OLE DB source (E), a Data Conversion task (T), and an ADO.NET destination (L). I was trying to separate them into the ETLP blocks of the child package but couldn't figure it out. I looked into your SQL 2008 example, but apart from the blocks in the Control Flow section I couldn't find any tasks in the Data Flow section. I am new to this and totally confused. Do you have a complete example that I can refer to?

Guys, could someone describe to me exactly what goes in the Transform part of this exercise? For me, I get the data out of the source system and store it in a cache table (Extract), then I have a Kimball method component which loads the data into the dimension/fact table (Load). The query the Kimball method component takes is a SQL query which effectively does the "Transform" part, but this is transitory. I presume that to have a Transform layer you would store the output of the Kimball method component in a number of places and then, in the Load, simply merge the data in. Could someone give me an idea what the benefit of this extra storage step is, as I can't see any?

As far as I understand SSIS sequences in the model above, in every sequence (for example TRANSFORM) the packages execute simultaneously. Here you have just a couple of dimensions and facts, but what if you have 70 dimensions and 10 facts? Does this reduce performance when all dims and facts are being processed simultaneously? And second, what about the readability of such a big SSIS project?

I love the design with regard to separating out the phases of the ETL process, as well as the ability to control different aspects through configuration. My only concern would be the unwieldiness of deployments and configurations if the warehouse contains a large number of facts/dimensions.

The post is awesome. This is the first time I have even heard about SSIS, for the purpose of my project, and I followed your method. I was able to do it easily, considering this is my first time. Everything works well for me, except that when I use a "Derived Column" component to see what is loaded into my fact table, the keys are "Missing Reference". My master package structure is exactly like yours, except that I have two different sets of Dim and Fact packages. I created the variables exactly like yours and connected the two child packages with the parent one. I am not sure where I went wrong. Will you be able to help me with this? Thanks a lot in advance.

Oh, I forgot to mention: this problem only occurs when I try to run individual sequences from the master package, or run the entire sequence from the master package. When I run the child packages individually, it's perfect! Thanks.

How do you justify your comment that having a smaller number of packages improves source code control? We have split up our packages to be very granular just because having larger packages makes it difficult for multiple developers to work on the same project at the same time (the file format for dtsx makes it almost impossible to logically merge two changed dtsx files in a source code control system)