Introduction

Microsoft SQL Server 2005 Integration Services (SSIS) is a platform for building high-performance data integration solutions, including extraction, transformation, and loading (ETL) packages for data warehousing. (Read more on MSDN.)

Business Intelligence Development Studio (BIDS) provides graphical tools for designing and debugging packages; tasks for performing workflow functions such as FTP operations, executing SQL statements, or sending e-mail messages; data sources and destinations for extracting and loading data; and transformations for cleaning, aggregating, merging, and copying data. SSIS also ships with a rich set of application programming interfaces (APIs) for programming the Integration Services object model.

Integration Services replaces Data Transformation Services (DTS), which was first introduced as a component of SQL Server 7.0.

In this article we will explore the SSIS APIs and how we can use them for our own purposes.

Background

A couple of months ago, I had to work with the SSIS APIs to implement a custom import wizard (similar to the SQL Server Import Wizard), and during that time I explored these APIs. I found that SSIS is a great improvement over its predecessor, DTS.

Unfortunately, I did not find adequate code samples for these APIs that could help me get up to speed quickly. That's why I wrote an article on CodeProject explaining how to programmatically create an SSIS package. I received several emails asking me to explain more, and that's why I am writing another article on the subject.

A quick architectural overview

SSIS architecture (image source: MSDN)

The Integration Services (SSIS) architecture consists of several containers and task components. These elements can be categorized into four sections:

Integration Services Service

Integration Services Object Model

Integration Services Runtime

Integration Services Data Flow

The first one, the Integration Services service, is a Windows service that monitors running SSIS packages and manages the storage of packages. It is accessible from SQL Server Management Studio.

The SSIS object model is a managed set of APIs that allows us to write custom applications, command-line utilities, and so on.

The SSIS runtime is the environment that runs SSIS packages, manages transactions, and so on. Runtime executables are the units that run inside this environment; packages, containers, tasks, and event handlers are all runtime executables.

Data flow is the most important (and most critical) part of the SSIS architecture. It contains a data flow engine that manages the data flow components. A significant number of data flow components ship with the SSIS installation, and they can be grouped into three categories: source components (which extract data from a system), transformation components (which perform transformations and modifications on the extracted data), and destination components (which load the data into the destination systems). Besides the stock data flow components, we can write our own custom data flow components to meet any custom requirement.

Although the underlying architecture is implemented in native code (C++/COM), we can use any CLR language to program against these APIs.

For a detailed knowledge about the architecture of SSIS, please go to MSDN.

Start writing a pilot project

I believe that, in order to understand the details, it is always better to start by implementing a "pilot" application first. Therefore, let's implement a simple package. For this pilot project we will use C#.

The pilot package in the designer

I am going to create a vanilla SSIS package that consists of a single data flow task. The data flow task contains two data flow components: a source component that reads a source (a CSV file) and a destination component that loads the data into a destination (a SQL Server database, in this sample).

The dataflow components

Before writing any code, let's create the comma-separated file that we will use as our data source. Create a file named Sample.csv, paste the following data into the file, and save it.
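The data listing itself is missing here. Based on the column names that appear later in the article (Name, Age, JoinDate, Salary, Retierd), a plausible Sample.csv could look like this; the rows themselves are made up for illustration:

```
"Name","Age","JoinDate","Salary","Retierd"
"John Smith","42","2001-05-12","35000","false"
"Jane Doe","36","2003-11-02","42000","false"
"Tom Black","61","1998-01-20","51000","true"
```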

It is a CSV file delimited by commas, with a double quote (") as the text qualifier.

Open Visual Studio 2005 and create a new application (a Windows application or a console application, whichever you like). Now add references to the assemblies of the SSIS object model. The assemblies are:

Microsoft.SQLServer.DTSPipelineWrap

Microsoft.SQLServer.DTSRuntimeWrap

Microsoft.SQLServer.ManagedDTS

Microsoft.SQLServer.PipelineHost

Our project references will look something like this:

Create a new class named ImportPackage and add the following using directives at the top of the ImportPackage.cs code file.

using Microsoft.SqlServer.Dts.Runtime;
using PipeLineWrapper = Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using RuntimeWrapper = Microsoft.SqlServer.Dts.Runtime.Wrapper;

We are now going to declare some member variables that will be required later.

This method simply instantiates a new Package object and sets some initial properties.
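Neither the member declarations nor the method body are reproduced above. A minimal sketch might look like the following; the class layout, variable names, and property values are my assumptions, not the author's original listing:

```csharp
using System.Collections.Specialized;
using Microsoft.SqlServer.Dts.Runtime;
using PipeLineWrapper = Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using RuntimeWrapper = Microsoft.SqlServer.Dts.Runtime.Wrapper;

public class ImportPackage
{
    // the package under construction
    private Package package;
    // connection managers for the flat file source and the OLEDB destination
    private ConnectionManager flatFileConnectionManager;
    private ConnectionManager oleDbConnectionManager;
    // the pipeline object behind the data flow task
    private PipeLineWrapper.MainPipe dataFlowTask;
    // hardcoded source column names (normally these would be read
    // from the CSV header row)
    private StringCollection sourceColumnNames = new StringCollection();

    public ImportPackage()
    {
        sourceColumnNames.AddRange(
            new string[] { "Name", "Age", "JoinDate", "Salary", "Retierd" });
    }

    private void CreatePackage()
    {
        package = new Package();
        package.Name = "SampleImportPackage";
        package.Description = "Imports Sample.csv into a SQL Server table";
        // postpone validation until execution time
        package.DelayValidation = true;
    }
}
```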

According to the diagram created in the SSIS designer (which we saw earlier), we will now create a data flow task. The data flow task will contain two data flow components (source and destination); both components will require a connection manager to extract data from the source and load it into the destination. So let's create the connection manager for the source first.
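The connection manager listing is not shown above. A sketch, assuming a member variable named flatFileConnectionManager and a file at C:\Sample.csv (both assumptions), might be:

```csharp
// Create a FLATFILE connection manager for the source CSV.
flatFileConnectionManager = package.Connections.Add("FLATFILE");
flatFileConnectionManager.ConnectionString = @"C:\Sample.csv";
flatFileConnectionManager.Name = "SSIS Connection Manager for Files";
flatFileConnectionManager.Description = "Flat file connection for the source CSV";

// Configure the flat file format: delimited, header row present,
// double quote as the text qualifier.
flatFileConnectionManager.Properties["Format"]
    .SetValue(flatFileConnectionManager, "Delimited");
flatFileConnectionManager.Properties["ColumnNamesInFirstDataRow"]
    .SetValue(flatFileConnectionManager, true);
flatFileConnectionManager.Properties["HeaderRowDelimiter"]
    .SetValue(flatFileConnectionManager, "\r\n");
flatFileConnectionManager.Properties["TextQualifier"]
    .SetValue(flatFileConnectionManager, "\"");
```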

It is worth mentioning one thing here: for a flat file connection manager we need to explicitly create the source columns (the columns that are available in the CSV file). We are going to write a method CreateSourceColumns() which serves this purpose. You might have already guessed that we need to read the available column names from the source file to do this. For this pilot application we simply hardcode them, but they really should be read from the file using the .NET IO APIs (you could use the Microsoft.Jet.OLEDB.4.0 provider as an alternative to the .NET IO APIs). To hardcode the available column names, I am using a string collection as a member variable of the class.
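The CreateSourceColumns() body is not reproduced above. A sketch, assuming the sourceColumnNames collection and flatFileConnectionManager member introduced earlier (assumed names), could be:

```csharp
private void CreateSourceColumns()
{
    // get the native flat file connection object behind the manager
    RuntimeWrapper.IDTSConnectionManagerFlatFile90 flatFileConnection =
        flatFileConnectionManager.InnerObject
        as RuntimeWrapper.IDTSConnectionManagerFlatFile90;

    RuntimeWrapper.IDTSConnectionManagerFlatFileColumn90 column = null;
    foreach (string columnName in sourceColumnNames)
    {
        column = flatFileConnection.Columns.Add();
        column.ColumnType = "Delimited";
        column.ColumnDelimiter = ",";   // each column ends with a comma...
        column.TextQualified = true;    // ...and is wrapped in double quotes
        column.DataType = RuntimeWrapper.DataType.DT_STR;
        column.ColumnWidth = 50;
        ((RuntimeWrapper.IDTSName90)column).Name = columnName;
    }

    // the delimiter of the last column is the row delimiter
    if (column != null)
        column.ColumnDelimiter = "\r\n";
}
```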

Here we create a data flow task executable by using a moniker. A data flow task is an executable that can contain one or more data flow components (i.e., source, transformation, or destination components).
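Creating the task via its moniker might look like this (the task name is an assumption):

```csharp
// Add a data flow task (the pipeline) to the package using its moniker.
Executable executable = package.Executables.Add("DTS.Pipeline.1");
TaskHost taskHost = executable as TaskHost;
taskHost.Name = "Import DataFlow Task";
// grab the pipeline object so we can add data flow components to it
dataFlowTask = taskHost.InnerObject as PipeLineWrapper.MainPipe;
```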

Generally, all data flow components have two collection properties: an input collection and an output collection. As the names indicate, the input collection contains one or more inputs (which can come from different sources), while the output collection contains one or more outputs (which can provide data to different destinations) plus one error output, which carries the errors that have taken place during data manipulation inside the component.

Now we can create those data flow components under this dataflowTask object.
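The creation of the source component's metadata is not shown above. A sketch (the component name is an assumption; the ProgID is that of the stock SQL Server 2005 flat file source adapter):

```csharp
// Create metadata for the source component under the data flow task.
PipeLineWrapper.IDTSComponentMetaData90 sourceComponent =
    dataFlowTask.ComponentMetaDataCollection.New();
sourceComponent.Name = "FlatFileSourceComponent";
// ProgID of the stock flat file source adapter
sourceComponent.ComponentClassID = "DTSAdapter.FlatFileSource.1";
```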

The data flow components are COM objects. When we invoke the data flow task's ComponentMetaDataCollection.New(), it creates a generic component metadata object that is not yet populated with the properties specific to the source component. The following two lines actually create an instance of a specific data flow component (in this case a source component, according to the given class ID).

// load the COM object for the given class ID
PipeLineWrapper.CManagedComponentWrapper managedFlatFileInstance
    = sourceComponent.Instantiate();
// populate the component's properties
managedFlatFileInstance.ProvideComponentProperties();

The source component needs to refer to the connection manager object that we created earlier.
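The wiring itself is not shown. A sketch, assuming the flatFileConnectionManager member introduced earlier (an assumed name):

```csharp
// Point the source component's runtime connection at the
// flat file connection manager created earlier.
sourceComponent.RuntimeConnectionCollection[0].ConnectionManagerID =
    flatFileConnectionManager.ID;
sourceComponent.RuntimeConnectionCollection[0].ConnectionManager =
    DtsConvert.ToConnectionManager90(flatFileConnectionManager);

// Let the component read the column metadata from the connection.
managedFlatFileInstance.AcquireConnections(null);
managedFlatFileInstance.ReinitializeMetaData();
managedFlatFileInstance.ReleaseConnections();
```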

managedFlatFileInstance.ReinitializeMetaData() reads the column definitions from the connection manager object and automatically populates the source component's column metadata. Now we need to create the mapping between the input and output columns, so that the component's output contains the values it will read from the source file.

It is important to understand that every object in a package has an ID. In fact, each column (output or input) has an ID that is unique within the scope of the package. I am going to use a dictionary object to keep track of each output column's ID and its name. We will need this information later.
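A sketch of that bookkeeping (the dictionary name is mine; sourceComponent is the assumed variable holding the source component's metadata; requires using System.Collections.Generic):

```csharp
// Remember which lineage ID belongs to which output column name.
Dictionary<int, string> sourceColumnLineageIDs = new Dictionary<int, string>();
PipeLineWrapper.IDTSOutput90 output = sourceComponent.OutputCollection[0];
foreach (PipeLineWrapper.IDTSOutputColumn90 outputColumn
         in output.OutputColumnCollection)
{
    sourceColumnLineageIDs.Add(outputColumn.LineageID, outputColumn.Name);
}
```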

So the source data flow component is ready to use. Let's now create the destination data flow component. Before that, we need to create the database and the data table that will serve as the destination, so let's create the data table using the following SQL script. (For the sake of simplicity we are creating the database manually, though it could be created programmatically as well. In any case, it must exist before invoking the destination data flow component's ReinitializeMetaData(); otherwise that call will generate an error.)
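The script itself is not reproduced above. Since the article states that at this stage all destination columns are defined as varchar, a plausible version (the table name and column widths are assumptions) is:

```sql
CREATE TABLE SampleTable
(
    [Name]     VARCHAR(50) NULL,
    [Age]      VARCHAR(50) NULL,
    [JoinDate] VARCHAR(50) NULL,
    [Salary]   VARCHAR(50) NULL,
    [Retierd]  VARCHAR(50) NULL
)
```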

The destination data flow component is now created. We need to create a path between the source component's output and the destination component's input, so that the data coming out of the source component feeds directly into the input of the destination component.
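The path creation and column mapping might be sketched as follows; destinationComponent and managedOleDbInstance are assumed names for the destination component's metadata and its wrapper instance:

```csharp
// Connect the source's output to the destination's input.
PipeLineWrapper.IDTSPath90 path = dataFlowTask.PathCollection.New();
path.AttachPathAndPropagateNotifications(
    sourceComponent.OutputCollection[0],
    destinationComponent.InputCollection[0]);

// Select every upstream column and map it to the external
// metadata column (i.e. the table column) with the same name.
PipeLineWrapper.IDTSInput90 input = destinationComponent.InputCollection[0];
PipeLineWrapper.IDTSVirtualInput90 virtualInput = input.GetVirtualInput();
foreach (PipeLineWrapper.IDTSVirtualInputColumn90 virtualColumn
         in virtualInput.VirtualInputColumnCollection)
{
    PipeLineWrapper.IDTSInputColumn90 inputColumn =
        managedOleDbInstance.SetUsageType(input.ID, virtualInput,
            virtualColumn.LineageID, PipeLineWrapper.DTSUsageType.UT_READONLY);

    foreach (PipeLineWrapper.IDTSExternalMetadataColumn90 externalColumn
             in input.ExternalMetadataColumnCollection)
    {
        if (externalColumn.Name == inputColumn.Name)
        {
            managedOleDbInstance.MapInputColumn(
                input.ID, inputColumn.ID, externalColumn.ID);
            break;
        }
    }
}
```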

That's enough for the pilot package project. If we run our application, it will move the data from the CSV file to the destination database.
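To actually run (and optionally save) the finished package, something like the following can be used; the file path is an assumption:

```csharp
// Save the package as a .dtsx file so it can be inspected in BIDS...
Application app = new Application();
app.SaveToXml(@"C:\dts.dtsx", package, null);

// ...then execute it in-process and check the result.
DTSExecResult result = package.Execute();
Console.WriteLine("Package execution result: " + result);
```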

Using Data Conversion component

So far, our package has assumed the same data types for the destination columns as for the source columns (that's why the columns of our data table are defined as varchar). Now we will modify the package so that it can change the data types before moving the data to the destination component.

In order to change the data types we need to use a different data flow component: the data conversion component. Our target package will look something like the following:

We need to create a new data flow component (the data conversion component) and place it between the source and the destination. Therefore, we will now modify the CreateDataFlowTask() method. Let's insert the following lines just after the managedFlatFileInstance.ReleaseConnections(); line.
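The inserted lines are not reproduced above. A sketch of creating the conversion component and re-routing the source into it (variable names are assumptions; the ProgID is that of the stock SQL Server 2005 data conversion transform):

```csharp
// Create the data conversion component between source and destination.
PipeLineWrapper.IDTSComponentMetaData90 dataConversionComponent =
    dataFlowTask.ComponentMetaDataCollection.New();
dataConversionComponent.Name = "DataConversionComponent";
dataConversionComponent.ComponentClassID = "DTSTransform.DataConvert.1";
PipeLineWrapper.CManagedComponentWrapper managedConversionInstance =
    dataConversionComponent.Instantiate();
managedConversionInstance.ProvideComponentProperties();

// Re-route the pipeline: the source output now feeds the
// conversion component's input instead of the destination.
PipeLineWrapper.IDTSPath90 sourceToConversion = dataFlowTask.PathCollection.New();
sourceToConversion.AttachPathAndPropagateNotifications(
    sourceComponent.OutputCollection[0],
    dataConversionComponent.InputCollection[0]);
```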

We need to update the data destination configuration code as well. Basically, we need to use the derivedLineageIdentifiers dictionary to retrieve the source lineage IDs and update the data types of the destination component's external metadata columns.

Before doing this, let's quickly change the data table structure into the SQL database as follows:
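The modified script is not shown; presumably the Age, JoinDate, and Salary columns now get real types. A plausible version (the exact types are my guess) is:

```sql
DROP TABLE SampleTable
GO
CREATE TABLE SampleTable
(
    [Name]     VARCHAR(50) NULL,
    [Age]      INT NULL,
    [JoinDate] DATETIME NULL,
    [Salary]   NUMERIC(18, 2) NULL,
    [Retierd]  VARCHAR(50) NULL
)
```

With typed columns in place, suppose the CSV now also contains a row with an empty JoinDate and a row with non-numeric text in Salary; those two rows are the "error rows" discussed next.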

Notice that the first row contains missing data in the JoinDate field and the second row contains bad data in the Salary field, so the data conversion component will fail to convert these two rows. If we want to report these error rows during data flow processing, we need to add an extra destination data flow component to the package. This extra destination component keeps track of the error rows and moves them into a SQL database. To accomplish this, let's modify the code base again. Our desired package will now look something like this:

You might have already guessed that we are going to use another OLEDB destination component to keep track of the error rows. So let's create a connection manager for the new error destination component (we could reuse the previous one, but let's create another so that the errors can be moved to a different database than the destination database).
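A sketch of the error connection manager, the error destination component, and the error path; the connection string, names, and the dataConversionComponent variable are assumptions:

```csharp
// A second OLEDB connection manager for the error destination.
ConnectionManager errorConnectionManager = package.Connections.Add("OLEDB");
errorConnectionManager.ConnectionString =
    "Provider=SQLOLEDB.1;Integrated Security=SSPI;"
    + "Initial Catalog=ErrorDB;Data Source=(local);";
errorConnectionManager.Name = "SSIS Connection Manager for Error Oledb";

// The error destination is just another OLEDB destination component.
PipeLineWrapper.IDTSComponentMetaData90 errorDestinationComponent =
    dataFlowTask.ComponentMetaDataCollection.New();
errorDestinationComponent.Name = "ErrorOleDbDestinationComponent";
errorDestinationComponent.ComponentClassID = "DTSAdapter.OLEDBDestination.1";
PipeLineWrapper.CManagedComponentWrapper managedErrorDestinationInstance =
    errorDestinationComponent.Instantiate();
managedErrorDestinationInstance.ProvideComponentProperties();

// Attach the conversion component's error output (index 1)
// to the error destination's input.
PipeLineWrapper.IDTSPath90 errorPath = dataFlowTask.PathCollection.New();
errorPath.AttachPathAndPropagateNotifications(
    dataConversionComponent.OutputCollection[1],
    errorDestinationComponent.InputCollection[0]);
```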

As we discussed earlier in the article, the output collection contains two outputs: one for the data and one for the errors. That is why we created the new path with dataconversionComponent.OutputCollection[1].

An error output typically stores the error code and the lineage ID of the error column along with the entire row data. So the error data table contains the columns available in the CSV file plus two extra columns (ErrorCode and ErrorColumn).
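The error table script is not shown; following that description, a plausible version (names and widths are assumptions) is:

```sql
CREATE TABLE SampleErrorTable
(
    [Name]        VARCHAR(50) NULL,
    [Age]         VARCHAR(50) NULL,
    [JoinDate]    VARCHAR(50) NULL,
    [Salary]      VARCHAR(50) NULL,
    [Retierd]     VARCHAR(50) NULL,
    [ErrorCode]   INT NULL,
    [ErrorColumn] INT NULL
)
```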

In this way we only get the error code and the lineage ID of the error column (which is simply an identifier). But we can write some extra code to find the error description and the column name rather than its identifier. I found an interesting article on this.

Now if we run the application, it will move those two error rows into the error table and all other rows into the data table.

Using Derived Column Component

The CSV file that we last discussed has two error rows, and our current implementation moves both of them into the error destination table. But what do we do if we need to bring the first row into the destination database rather than moving it to the error destination? For example, let's assume our destination table schema says that the JoinDate and Salary fields can contain null. Our current implementation will never move the missing-data rows into the destination database; it will always fail to convert the missing data and will treat the row as an error row. For some businesses this may well be a problem. To resolve it, we can replace the data conversion component with a derived column component.

The derived column component is a powerful component that is mainly used to transform data. The Derived Column transformation creates new column values by applying expressions to transformation input columns. An expression can contain any combination of variables, functions, operators, and columns from the transformation input. The result can be added as a new column or inserted into an existing column as a replacement value. The Derived Column transformation can define multiple derived columns, and any variable or input column can appear in multiple expressions. For more details, please read MSDN.

In order to replace the data conversion component with a derived column component we need to make some minor change into the existing code base that we developed so far.

Creating a derived column component is exactly the same as creating a data conversion component except for the component class ID. The Component class id for a derived column component is "{9CF90BF0-5BCC-4C63-B91D-1F322DC12C26}".

As you can see, the component creation is pretty much the same as for the data conversion component. Now we will write the configuration code for the derived column component. This is also similar to the way we configured the data conversion component, except that here we create a new property for the expression. The configuration code is as follows:
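The configuration listing is not reproduced above. A hedged sketch of adding one derived output column with an expression follows; the derivedColumnComponent variable, the choice of the JoinDate column, and the exact expression text are my assumptions. Note that the real Expression custom property references input columns by lineage ID (e.g. #123), while FriendlyExpression uses column names; for brevity this sketch uses the friendly form in both:

```csharp
// Add a derived output column that converts the source string to a
// database timestamp, yielding NULL for an empty input value.
PipeLineWrapper.IDTSOutputColumn90 derivedColumn =
    derivedColumnComponent.OutputCollection[0].OutputColumnCollection.New();
derivedColumn.Name = "Converted_JoinDate";
derivedColumn.SetDataTypeProperties(
    RuntimeWrapper.DataType.DT_DBTIMESTAMP, 0, 0, 0, 0);
derivedColumn.ErrorRowDisposition =
    PipeLineWrapper.DTSRowDisposition.RD_FailComponent;
derivedColumn.TruncationRowDisposition =
    PipeLineWrapper.DTSRowDisposition.RD_FailComponent;

// An empty string becomes NULL instead of a conversion failure.
string expression =
    "TRIM(JoinDate) == \"\" ? NULL(DT_DBTIMESTAMP) : (DT_DBTIMESTAMP)JoinDate";

PipeLineWrapper.IDTSCustomProperty90 exprProperty =
    derivedColumn.CustomPropertyCollection.New();
exprProperty.Name = "Expression";
exprProperty.Value = expression;

PipeLineWrapper.IDTSCustomProperty90 friendlyProperty =
    derivedColumn.CustomPropertyCollection.New();
friendlyProperty.Name = "FriendlyExpression";
friendlyProperty.Value = expression;
```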

That's all for our derived column implementation. Now our package brings the missing-data rows into the destination table, as it is supposed to. Although the derived column component is usually used for genuine data-derivation tasks, it can be used to resolve this issue as well.

Conclusion

In this article I have described how we can programmatically use different data flow components for our own custom purposes. I hope you enjoyed it.

Hello Moim.
I downloaded your code and executed it on my machine.
I changed the flat file path and connection string, and created the data table and error table on my server.
I corrected all the initial errors and now the package executes and is saved on my C drive as dts.dtsx.
But the problem is that it does not transfer any of the rows into the database, neither into the data table nor into the error table.

I am very new to Business Intelligence. Most of my work is on ASP; that's why I don't know about the classes used in the programming. I have created a demo project using BIDS without programming (by wizard).
I want to do it through programming now.

I may have got a solution. There is an article which says that I will have to execute the .dtsx package from a machine where SSIS is installed.
So I am going to try that, and if it doesn't work I will post again.
Otherwise thanks for the article.
bye

In addition, I am unable to get the code on the page to work in a program, because a number of the properties/methods don't seem to exist on some of the objects - for instance Column.ColumnLength doesn't exist, but Column.Length does.

Do you have a downloadable example or some better code for the derived column transform?

I am getting a System.Runtime.InteropServices.COMException error.
It occurs on AcquireConnections(null).
Please reply as fast as possible.
My code is the same as what I downloaded from this link;
only the data source and flat file path have been changed.
Thanks.

Try to open the .dtsx that is generated by your code from BIDS and see if the error happens there too. Alternatively, create the same package using BIDS on the same machine with the same OLEDB settings and see if it produces the error. Please let me know the result.

Yes, I have done that also, but it gives an error that the output columns are not synchronized with the external metadata columns:

column "Name" has to be added to the externalmetadatacolumn
column "Age" has to be added to the externalmetadatacolumn
column "JoinDate" has to be added to the externalmetadatacolumn
column "salary" has to be added to the externalmetadatacolumn
column "Retierd" has to be added to the externalmetadatacolumn

It is very urgent, can you please reply fast?
I am trying to store a CSV file to OLEDB.

I can't resend the code because I am far away from home right now; I am traveling.

Anyway, I think the code that is available here will work, because it still works for me.
I can suggest the following:

Make it run from the designer first. You can manually create the synchronization from BIDS, then see if it works from BIDS. It's important to identify whether AcquireConnections fails from BIDS or not. If it works from BIDS, then it will be easier to step down to the particular trouble you are having from C#.

Okay, good. Now open the dtsx file using any XML editor, open the dtsx generated by the program using another instance of that editor, and compare whether there is any difference in the connection manager XML blocks.

Go through the properties of the destination connection manager and, if possible, try the entire thing on another machine and see if it still generates the error. If you have some kind of version-mismatch issue, you will find that your code runs on all other boxes; but if it says the same on any other machine, then you are certainly making a mistake in the configuration properties of the connection manager.

Hi,
yes, my code is also not working on the other machine.
You are right, I have a problem in the configuration property of the destination component. But I have used the same code provided by you; if it is not creating a problem for you, then why on my machine?
I manually changed the configuration property, in which I used
SSIBRD14.India Test instead of SSIS Connection Manager for OLEDB.

This was done manually, and then it works.
But how do I change it programmatically? Because I have to do it programmatically.