Description

There should be a way to prioritize control flow tasks in SSIS. When multiple tasks are available for execution at the same time there does not appear to be any way to influence which tasks will begin executing first. If there are more tasks than available worker threads then each thread picks a task to execute, apparently at random, and the other tasks have to wait for a thread to complete before they can begin executing. So far there have been two scenarios identified where this can lead to some highly inefficient flow patterns.
Scenario 1: Multiple tasks running in parallel, having significant differences in execution times.
Assume 10 tasks with no dependencies and 4 worker threads. Tasks A&B each require 10 minutes, the remaining 8 tasks each require 2 minutes. In the best case scenario the first two threads will pick tasks A&B to start executing, the remaining two threads will select two of the remaining tasks (C&D). Two minutes later tasks C&D will be complete and those two threads will start on E&F, etc. End result is that the entire process takes 10 minutes with two threads sitting idle for the final two minutes. Total idle time is 4 minutes. The worst case scenario is that tasks A&B are the lasts tasks to be picked, in which case they do not get to start until 4 minutes into the process and then they take an addition 10 minutes to run. End result is that the same process takes 14 minutes with two threads sitting idle for ten minutes. Total idle time is 20 minutes.
Scenario 2: Multiple tasks running in parallel having varying levels of dependencies.
This scenario is much more difficult to describe. It involves having multiple branches of sequential tasks. In some ways you could look at each branch as being a single task from scenario 1. The issue here though is not how long any one tasks takes to complete but how many other tasks have to wait for it to finish before they can start processing. In an optimal run (assumes all tasks have similar run times) the task with the most dependencies should be executed before the tasks with fewer dependencies. What I have often observed however is tasks with no dependencies being executed before tasks with a great number of dependencies. This results in threads sitting idle waiting for a single task to complete so that they can move on to process its dependencies.

Assign To

You describe a fairly sophisticated use of our product, one that we would like to improve upon in a future release, but unfortunately not in the next release. We are hoping to exploit other Microsoft platform technologies in a future version of control flow support, and when we do that, we should be able to address your issue. Meanwhile, as a somewhat inelegant workaround, you could try putting each task in a package persisted in a separate file and then use a script task to execute these packages. In the script task, you might employ multiple threads and manipulate their thread priority.

We welcome any further feedback and are happy to discuss ideas to make your efforts successful.

-- The SSIS Team

Posted by mstout on 10/19/2010 at 6:28 AM

Here is a detailed case that falls under scenario 2.

I recently converted a large data warehouse load to SSIS in order to take advantage of parallel processing. Our load consists of about 60 control flow tasks, some of which have many dependencies and/or long run times. One of the goals in this conversion was to maximum use of available processing power, i.e. minimize the amount of time that threads were idle while waiting on something else to complete.

When my package starts there are 16 tasks that can be executed initially (no upstream dependencies or ancestors). One of these tasks has over 40 downstream dependencies or descendants, three of the tasks have between 8 and 11 descendants, one task has a single child, and the remaining eleven tasks are completely independent. (Yes there is overlap in terms of the dependencies.) A similar pattern occurs at the next level, the task with the 40+ descendants has 13 immediate children that can be executed as soon as the original task is complete. One of these new tasks has over 20 descendant tasks dependent on it, nine have no descendants and the others have between 3 and 10 descendants. This general pattern of one task having significantly more dependencies than its siblings repeats itself again and again down the flow though the relationships start getting more complicated as we have to start accounting for multiple ancestors.

Ideally this package should execute with the first thread working the task with the 40+ dependencies, then next three threads working on the tasks with 8-11 dependencies, then the fifth thread working the task with the single dependency, then any remaining threads start working on the tasks with no dependencies. When each task completes its children get added to the queue in order of their importance. So once the first thread is complete the 13 children of the original task get added to the queue with the task with 20+ dependencies jumping to the head of the line the three with lesser dependencies falling in behind it, and the remaining 9 joining the remnants of the original 16 tasks at the back of the line.

However what generally tends to happen is the first 4 or 5 threads pick tasks with no dependencies and the task with the 40+ descendents does not get picked up until the 7th or 8th thread. So instead of my tasks with no dependencies being used as filler to keep threads working while waiting on dependencies they are getting knocked out up front and later on I have one or two threads working while the rest are sitting idle.

Adding sequencing that forces the low priority tasks to wait for the high priority tasks to finish defeats the purpose of this conversion, which is to take maximum advantage of the parallel processing potential of SSIS. What we need is a mechanism that makes a low priority task wait for higher priority tasks to START but does not make it wait for them to finish.

I rasied this issue in the SSIS forums and we came up with a partial work around in this thread: http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/579b9f3a-5e66-41a9-940e-0b08405a2441

The short version is that you use a series of Boolean variables to control loops that act like roadblocks and prevent the flow from reaching the lower priority tasks until the higher priority items have started.

The steps involved are:• Declare a bool variable for each of the high priority tasks and default the values to false.• Create a pre-execute event for each of the high priority tasks.• In the pre-execute event create a script task which sets the appropriate bool to true.• At each choke point insert a for each loop that will loop while the appropriate bool(s) are false. (I have a script with a 1 second sleep inside each loop but it also works with empty loops.)

If done properly this gives you a tool where at each choke point the package has some number of high priority tasks ready to run and a blocking loop that keeps it from proceeding down the lower priority branches until said high priority items are running. Once all of the high priority tasks have been started the loop clears and allows any remaining threads to move on to lower priority tasks. Worst case is one thread sits in the loop while waiting for other threads to come along and pick up the high priority tasks.

The major drawback to this approach is the risk of deadlocking the package if you have too many blocking loops get queued up at the same time, or misread your dependencies and have loops waiting for tasks that never start. Careful analysis is needed to decide which items deserved higher priority and where exactly to insert the blocks.