Importing Files in Parallel with SSIS

Introduction

With “big data” becoming more prevalent in everyday scenarios, we need ways to import data faster and more efficiently. When faced with importing thousands, if not millions, of files, there are a few cloud-based solutions that make this fairly simple. However, due to the policies within many organisations, cloud solutions may not yet be feasible options. Nevertheless, many companies do have existing SSIS implementations, and these can be leveraged instead.

Problem

The main problem is that if you are importing thousands of files with the exact same file structure, your package still needs to process each file sequentially. Even at only one second per file, 10,000 files will take 10,000 seconds (almost three hours) to import.

Solution

One way to improve the speed of the import is to run multiple instances of your package in parallel, whereby each instance processes a portion of the total number of files.

Step 1

Create an SSIS solution and add some project-level parameters:

SourceDirectory – is the path to all the source files

PackageName – is the full path of the package that will import the files

NoOfPackageInstances – is the number of parallel instances of the package to run

FileMask – is the filename filter in case you only want to process certain files within the folder

Figure 1

Step 2

Create a new package with a “Script Task” component:

Figure 2

Step 3

Open the Script Task, add the project-level parameters, and then click on “Edit Script”:

Figure 3

Step 4 (see the full code at the end of the article)

Declare variables and assign them the values from the project parameters. Declare variables to keep track of the filenames to be imported as well as which group they’ll be split into:

Figure 4
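As a rough sketch of what Figure 4 shows (the exact code is in the full listing at the end of the article; this assumes the parameters from Step 1 have been added to the Script Task's ReadOnlyVariables list):

```csharp
// Read the project parameters into local variables.
string sourceDirectory = Dts.Variables["$Project::SourceDirectory"].Value.ToString();
string packageName = Dts.Variables["$Project::PackageName"].Value.ToString();
int noOfPackageInstances = Convert.ToInt32(Dts.Variables["$Project::NoOfPackageInstances"].Value);
string fileMask = Dts.Variables["$Project::FileMask"].Value.ToString();

// One list of filenames per package instance.
List<string>[] fileGroups = new List<string>[noOfPackageInstances];
for (int i = 0; i < noOfPackageInstances; i++)
{
    fileGroups[i] = new List<string>();
}
```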

Loop through each file and assign it to a particular group based on the number of package instances:

Figure 5
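One simple way to do the split, as a sketch of the code in Figure 5, is a round-robin assignment so that each group ends up with roughly the same number of files:

```csharp
// Enumerate the matching files and distribute them round-robin
// across the groups, one group per package instance.
string[] files = System.IO.Directory.GetFiles(sourceDirectory, fileMask);
for (int i = 0; i < files.Length; i++)
{
    fileGroups[i % noOfPackageInstances].Add(files[i]);
}
```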

Loop through each file group and execute an instance of the package:

Figure 6
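A sketch of the loop in Figure 6, assuming the instances are launched concurrently with .NET Tasks (the full listing at the end of the article shows the author's actual approach):

```csharp
// Start one package instance per non-empty file group, then wait for all of them.
var tasks = new List<System.Threading.Tasks.Task>();
foreach (List<string> group in fileGroups)
{
    if (group.Count == 0) continue;
    List<string> filesForInstance = group; // avoid capturing the loop variable
    tasks.Add(System.Threading.Tasks.Task.Run(
        () => ExecutePackage(packageName, filesForInstance)));
}
System.Threading.Tasks.Task.WaitAll(tasks.ToArray());
```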

The ExecutePackage method:

Figure 7
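Sketched with the SSIS object model, the method in Figure 7 might look something like this: it loads the child package, hands it one file group via the FileList variable (see Step 5), and runs it:

```csharp
private void ExecutePackage(string packagePath, List<string> files)
{
    var app = new Microsoft.SqlServer.Dts.Runtime.Application();
    Microsoft.SqlServer.Dts.Runtime.Package pkg = app.LoadPackage(packagePath, null);

    // The child package enumerates this collection with the
    // Foreach From Variable Enumerator (Step 7).
    pkg.Variables["User::FileList"].Value = files;

    pkg.Execute();
}
```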

Step 5

Create a new package and add some variables:

FileList – is the list of files to process. Its name must match the variable name used in the code in Step 4, and it must be writable

FilePath – is the name of the individual file being processed

Figure 8

Step 6

Create a Foreach Loop Container in the new package that contains a Data Flow Task inside:

Figure 9

Step 7

In the Foreach Loop Container, select “Foreach From Variable Enumerator”, and select the “User::FileList” variable:

Figure 10

Then select “User::FilePath” under “Variable Mappings”:

Figure 11

Step 8

Update the Data Flow so that it has a source, destination and any required transformations:

Figure 12

Step 9

In the source connection manager, set an expression on the ConnectionString property that evaluates to the @[User::FilePath] variable:

Figure 13

Step 10

In the destination connection manager make sure that "Table lock" is not selected, otherwise each package instance won't be able to write to the table concurrently:

Figure 14

Full Script Task Code

Please excuse my code if it isn't the neatest or most efficient way of doing things.