Tuesday, January 21, 2014

SSIS Tips: Deduping Flat File Data without using Staging Table

Recently an SSIS challenge was brought up by one of my colleagues in his project. The project involves importing data from flat files daily. In some cases the file is found to contain duplicate entries. The client wanted a deduplication process to remove them, but didn't want to use a staging table as they didn't have access to create objects. The question was how to develop an alternate logic in SSIS. This is the logic I came up with, which I'm sharing through this blog.

For illustration, let's take the case of a very simple flat file in the below format.

The above represents a simple illustration of a file arriving from the client. As you can see, it has multiple instances of ID, Name values in it. The attempt is to eliminate the duplicate occurrences and save just a single occurrence of each name (the one with the earliest date) in the database.

The package for the above requirement looks like the following

The package consists of two data flow tasks. The first data flow task is used for storing the aggregate information from the source data. The data flow consists of the following tasks

The flat file source is used to extract data from the source file. The aggregate task groups by the relevant fields (ID, Name in this case) to get unique values and applies an aggregation (MIN) over the remaining fields (Date in the above example). The result is then stored in a cache using the Cache Transform.
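To make the aggregate-plus-cache step concrete, here is a minimal sketch of the equivalent logic in Python (for illustration only; the column names ID, Name, Date come from the example above, while the function name and date format are my own assumptions):

```python
import csv
from datetime import datetime

def build_cache(path):
    """Mimic the Aggregate task + Cache Transform: group by (ID, Name)
    and keep only the minimum Date for each group, like GROUP BY + MIN."""
    cache = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            key = (row["ID"], row["Name"])
            d = datetime.strptime(row["Date"], "%Y-%m-%d")  # assumed date format
            if key not in cache or d < cache[key]:
                cache[key] = d
    return cache
```

The dictionary here plays the role of the SSIS cache: one entry per unique (ID, Name) pair, holding the earliest date seen for that pair.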
Now that we have the aggregated data ready, we can utilize it in a lookup to get the unique values from it. The second data flow is used for this purpose, with the following workflow

The flat file data is looked up against the aggregated data stored in the cache. This will give a match only for the rows where the date value is the minimum within the group of ID and Name fields, so what effectively comes out of the match output are the unique values. Link it to an OLE DB destination task pointing to the destination table to get the unique values out.
The Lookup task makes use of the cache connection manager to connect to the cache populated in the earlier data flow task and tries to match source values against the reference aggregated values.
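The lookup-and-match step amounts to passing through only those source rows whose date equals the cached minimum for their (ID, Name) key. A minimal sketch of that logic (again, Python for illustration only; the cache is assumed to be a dict of (ID, Name) keys to minimum dates, as produced by the aggregation step):

```python
import csv
from datetime import datetime

def dedupe(path, cache):
    """Mimic the Lookup match output: a source row passes through only
    when its Date equals the cached MIN(Date) for its (ID, Name) key,
    so one occurrence per group survives."""
    unique = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            key = (row["ID"], row["Name"])
            d = datetime.strptime(row["Date"], "%Y-%m-%d")  # assumed date format
            if cache.get(key) == d:  # match output of the Lookup
                unique.append(row)
    return unique
```

The rows returned here correspond to what flows down the Lookup's match output into the OLE DB destination.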
Now if you go ahead and check the destination, you can see only unique values populated in the destination table, as shown below

This gives a method for eliminating duplicates from a flat file without using any staging table, utilizing the cache as the temporary store instead. Hopefully it will help you out in similar scenarios.
The package and source file used can be downloaded from the below links