Dataflow (Task Parallel Library)

.NET Framework 4.5

The Task Parallel Library (TPL) provides dataflow components to help increase the robustness of concurrency-enabled applications. These dataflow components are collectively referred to as the TPL Dataflow Library. This dataflow model promotes actor-based programming by providing in-process message passing for coarse-grained dataflow and pipelining tasks. The dataflow components build on the types and scheduling infrastructure of the TPL and integrate with the C#, Visual Basic, and F# language support for asynchronous programming. These dataflow components are useful when you have multiple operations that must communicate with one another asynchronously or when you want to process data as it becomes available. For example, consider an application that processes image data from a web camera. By using the dataflow model, the application can process image frames as they become available. If the application enhances image frames, for example, by performing light correction or red-eye reduction, you can create a pipeline of dataflow components. Each stage of the pipeline might use more coarse-grained parallelism functionality, such as the functionality that is provided by the TPL, to transform the image.

This document provides an overview of the TPL Dataflow Library. It describes the programming model, the predefined dataflow block types, and how to configure dataflow blocks to meet the specific requirements of your applications.

Tip

The TPL Dataflow Library (System.Threading.Tasks.Dataflow namespace) is not distributed with the .NET Framework 4.5. To install the System.Threading.Tasks.Dataflow namespace, open your project in Visual Studio 2012, choose Manage NuGet Packages from the Project menu, and search online for the Microsoft.Tpl.Dataflow package.

The TPL Dataflow Library provides a foundation for message passing and parallelizing CPU-intensive and I/O-intensive applications that have high throughput and low latency. It also gives you explicit control over how data is buffered and moves around the system. To better understand the dataflow programming model, consider an application that asynchronously loads images from disk and creates a composite of those images. Traditional programming models typically require that you use callbacks and synchronization objects, such as locks, to coordinate tasks and access to shared data. By using the dataflow programming model, you can create dataflow objects that process images as they are read from disk. Under the dataflow model, you declare how data is handled when it becomes available, and also any dependencies between data. Because the runtime manages dependencies between data, you can often avoid the requirement to synchronize access to shared data. In addition, because the runtime schedules work based on the asynchronous arrival of data, dataflow can improve responsiveness and throughput by efficiently managing the underlying threads. For an example that uses the dataflow programming model to implement image processing in a Windows Forms application, see Walkthrough: Using Dataflow in a Windows Forms Application.

You can connect dataflow blocks to form pipelines, which are linear sequences of dataflow blocks, or networks, which are graphs of dataflow blocks. A pipeline is one form of network. In a pipeline or network, sources asynchronously propagate data to targets as that data becomes available. The ISourceBlock(Of TOutput).LinkTo method links a source dataflow block to a target block. A source can be linked to zero or more targets; targets can be linked from zero or more sources. You can add or remove dataflow blocks to or from a pipeline or network concurrently. The predefined dataflow block types handle all thread-safety aspects of linking and unlinking.

When you call the ISourceBlock(Of TOutput).LinkTo method to link a source to a target, you can supply a delegate that determines whether the target block accepts or rejects a message based on the value of that message. This filtering mechanism is a useful way to guarantee that a dataflow block receives only certain values. For most of the predefined dataflow block types, if a source block is connected to multiple target blocks, when a target block rejects a message, the source offers that message to the next target. The order in which a source offers messages to targets is defined by the source and can vary according to the type of the source. Most source block types stop offering a message after one target accepts that message. One exception to this rule is the BroadcastBlock(Of T) class, which offers each message to all targets, even if some targets reject the message. For an example that uses filtering to process only certain messages, see Walkthrough: Using Dataflow in a Windows Forms Application.
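As a rough illustration of filtered linking, the following sketch routes even values to one target and lets an unfiltered second link pick up everything the first target declines. The module name, the SplitEvenOdd helper, and the sample values are assumptions made for this illustration only.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Threading.Tasks.Dataflow

Module FilteredLinkSketch
    ' Hypothetical helper: splits the input into even and odd values
    ' by linking one source to two targets.
    Function SplitEvenOdd(values As IEnumerable(Of Integer)) As Tuple(Of List(Of Integer), List(Of Integer))
        Dim evens As New List(Of Integer)()
        Dim odds As New List(Of Integer)()

        Dim source = New BufferBlock(Of Integer)()
        ' Each ActionBlock keeps the default MaxDegreeOfParallelism of 1,
        ' so each list is modified by one delegate call at a time.
        Dim evenTarget = New ActionBlock(Of Integer)(Sub(n) evens.Add(n))
        Dim oddTarget = New ActionBlock(Of Integer)(Sub(n) odds.Add(n))

        Dim linkOptions = New DataflowLinkOptions() With {.PropagateCompletion = True}

        ' The predicate decides whether this target accepts a message.
        source.LinkTo(evenTarget, linkOptions, Function(n) n Mod 2 = 0)
        ' The unfiltered link receives every message the first target declined,
        ' so no message stays stuck at the head of the source.
        source.LinkTo(oddTarget, linkOptions)

        For Each v In values
            source.Post(v)
        Next
        source.Complete()

        evenTarget.Completion.Wait()
        oddTarget.Completion.Wait()
        Return Tuple.Create(evens, odds)
    End Function

    Sub Main()
        Dim result = SplitEvenOdd({1, 2, 3, 4, 5, 6})
        Console.WriteLine("Evens: " & String.Join(", ", result.Item1))
        Console.WriteLine("Odds: " & String.Join(", ", result.Item2))
    End Sub
End Module
```

Linking the unfiltered target last is a deliberate choice in this sketch: it guarantees that some target accepts every message.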

Important

Because each predefined source dataflow block type guarantees that messages are propagated out in the order in which they are received, every message must be read from the source block before the source block can process the next message. Therefore, when you use filtering to connect multiple targets to a source, make sure that at least one target block receives each message. Otherwise, your application might deadlock.

The dataflow programming model is related to the concept of message passing, where independent components of a program communicate with one another by sending messages. One way to propagate messages among application components is to call the Post(Of TInput) and DataflowBlock.SendAsync methods to send messages to target dataflow blocks (Post(Of TInput) acts synchronously; SendAsync acts asynchronously) and the Receive, ReceiveAsync, and TryReceive(Of TOutput) methods to receive messages from source blocks. You can combine these methods with dataflow pipelines or networks by sending input data to the head node (a target block), and receiving output data from the terminal node of the pipeline or the terminal nodes of the network (one or more source blocks). You can also use the Choose method to read from the first of the provided sources that has data available and perform an action on that data.
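The asynchronous send and receive methods can be sketched as follows, assuming a compiler that supports Async and Await. The module name and the RoundTripAsync helper are hypothetical; the block is a single unbounded BufferBlock(Of T), so each SendAsync call here is expected to complete right away.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Threading.Tasks
Imports System.Threading.Tasks.Dataflow

Module MessagePassingSketch
    ' Hypothetical helper: sends every value to a buffer and reads them back.
    Async Function RoundTripAsync(values As Integer()) As Task(Of List(Of Integer))
        Dim buffer = New BufferBlock(Of Integer)()

        ' SendAsync returns a task instead of blocking the caller.
        For Each v In values
            Await buffer.SendAsync(v)
        Next

        Dim received As New List(Of Integer)()
        For i As Integer = 1 To values.Length
            received.Add(Await buffer.ReceiveAsync())
        Next i

        ' TryReceive returns False immediately when nothing is buffered.
        Dim extra As Integer
        If Not buffer.TryReceive(extra) Then
            Console.WriteLine("The buffer is empty.")
        End If

        Return received
    End Function

    Sub Main()
        Dim result = RoundTripAsync({1, 2, 3}).Result
        Console.WriteLine(String.Join(", ", result))
    End Sub
End Module
```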

Source blocks offer data to target blocks by calling the ITargetBlock(Of TInput).OfferMessage method. The target block responds to an offered message in one of three ways: it can accept the message, decline the message, or postpone the message. When the target accepts the message, the OfferMessage method returns Accepted. When the target declines the message, the OfferMessage method returns Declined. When the target requires that it no longer receives any messages from the source, OfferMessage returns DecliningPermanently. The predefined source block types do not offer messages to linked targets after such a return value is received, and they automatically unlink from such targets.

When a target block postpones the message for later use, the OfferMessage method returns Postponed. A target block that postpones a message can later call the ISourceBlock(Of TOutput).ReserveMessage method to try to reserve the offered message. At this point, the message is either still available and can be used by the target block, or the message has been taken by another target. When the target block later requires the message or no longer needs the message, it calls the ISourceBlock(Of TOutput).ConsumeMessage or ReleaseReservation method, respectively. Message reservation is typically used by the dataflow block types that operate in non-greedy mode. Non-greedy mode is explained later in this document. Instead of reserving a postponed message, a target block can also use the ISourceBlock(Of TOutput).ConsumeMessage method to attempt to directly consume the postponed message.

Dataflow blocks also support the concept of completion. A dataflow block that is in the completed state does not perform any further work. Each dataflow block has an associated System.Threading.Tasks.Task object, known as a completion task, that represents the completion status of the block. Because you can wait for a Task object to finish, by using completion tasks, you can wait for one or more terminal nodes of a dataflow network to finish. The IDataflowBlock interface defines the Complete method, which informs the dataflow block of a request for it to complete, and the Completion property, which returns the completion task for the dataflow block. Both ISourceBlock(Of TOutput) and ITargetBlock(Of TInput) inherit the IDataflowBlock interface.

This example demonstrates the case in which an exception goes unhandled in the delegate of an execution dataflow block. We recommend that you handle exceptions in the bodies of such blocks. However, if you are unable to do so, the block behaves as though it was canceled and does not process incoming messages.
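A minimal sketch of that case, assuming an ActionBlock(Of TInput) whose delegate throws for negative input, might look like the following. The module name, the RunAndObserveFault helper, and the posted values are illustrative assumptions.

```vb
Imports System
Imports System.Threading.Tasks.Dataflow

Module FaultedBlockSketch
    ' Hypothetical helper: returns the final status of a block whose delegate throws.
    Function RunAndObserveFault() As String
        Dim throwIfNegative = New ActionBlock(Of Integer)(
            Sub(n)
                Console.WriteLine("n = {0}", n)
                If n < 0 Then
                    Throw New ArgumentOutOfRangeException("n")
                End If
            End Sub)

        throwIfNegative.Post(0)
        throwIfNegative.Post(-1)
        ' This value might never be processed, because the block
        ' stops processing once its delegate has thrown.
        throwIfNegative.Post(1)
        throwIfNegative.Complete()

        ' Wait for completion inside a Try block; unhandled exceptions from
        ' the delegate arrive wrapped in an AggregateException.
        Try
            throwIfNegative.Completion.Wait()
        Catch ae As AggregateException
            ae.Handle(Function(e)
                          Console.WriteLine("Encountered {0}", e.GetType().Name)
                          Return True
                      End Function)
        End Try

        Return throwIfNegative.Completion.Status.ToString()
    End Function

    Sub Main()
        Console.WriteLine("Final status: " & RunAndObserveFault())
    End Sub
End Module
```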

The second way to determine the completion status of a dataflow block is to use a continuation off of the completion task, or to use the asynchronous language features of C# and Visual Basic to asynchronously wait for the completion task. The delegate that you provide to the Task.ContinueWith method takes a Task object that represents the antecedent task. In the case of the Completion property, the delegate for the continuation takes the completion task itself. The following example resembles the previous one, except that it also uses the ContinueWith method to create a continuation task that prints the status of the overall dataflow operation.
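The continuation approach can be sketched as follows. The module name and the RunAndReportStatus helper are assumptions for illustration; the continuation reads only the Status property of the antecedent task.

```vb
Imports System
Imports System.Threading.Tasks
Imports System.Threading.Tasks.Dataflow

Module CompletionContinuationSketch
    ' Hypothetical helper: returns the status observed by the continuation.
    Function RunAndReportStatus() As TaskStatus
        Dim throwIfNegative = New ActionBlock(Of Integer)(
            Sub(n)
                If n < 0 Then Throw New ArgumentOutOfRangeException("n")
            End Sub)

        ' The continuation receives the completion task itself as its argument.
        Dim report As Task(Of TaskStatus) = throwIfNegative.Completion.ContinueWith(
            Function(t As Task)
                Console.WriteLine("The status of the completion task is '{0}'.", t.Status)
                Return t.Status
            End Function)

        throwIfNegative.Post(0)
        throwIfNegative.Post(-1)
        throwIfNegative.Complete()

        ' Waiting on the continuation does not rethrow the block's exception.
        Return report.Result
    End Function

    Sub Main()
        Console.WriteLine(RunAndReportStatus())
    End Sub
End Module
```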

The TPL Dataflow Library provides several predefined dataflow block types. These types are divided into three categories: buffering blocks, execution blocks, and grouping blocks. The following sections describe the block types that make up these categories.

The BufferBlock(Of T) class represents a general-purpose asynchronous messaging structure. This class stores a first in, first out (FIFO) queue of messages that can be written to by multiple sources or read from by multiple targets. When a target receives a message from a BufferBlock(Of T) object, that message is removed from the message queue. Therefore, although a BufferBlock(Of T) object can have multiple targets, only one target will receive each message. The BufferBlock(Of T) class is useful when you want to pass multiple messages to another component, and that component must receive each message.

The following basic example posts several Int32 values to a BufferBlock(Of T) object and then reads those values back from that object.

' Create a BufferBlock(Of Integer) object.
Dim bufferBlock = New BufferBlock(Of Integer)()

' Post several messages to the block.
For i As Integer = 0 To 2
    bufferBlock.Post(i)
Next i

' Receive the messages back from the block.
For i As Integer = 0 To 2
    Console.WriteLine(bufferBlock.Receive())
Next i

' Output:
'   0
'   1
'   2

The BroadcastBlock(Of T) class is useful when you must pass multiple messages to another component, but that component needs only the most recent value. This class is also useful when you want to broadcast a message to multiple components.

The following basic example posts a Double value to a BroadcastBlock(Of T) object and then reads that value back from that object several times. Because values are not removed from BroadcastBlock(Of T) objects after they are read, the same value is available every time.
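A sketch of this behavior, assuming that no cloning function is needed for a Double value (Nothing is passed to the constructor), could look like this. The module name and the ReadThreeTimes helper are hypothetical.

```vb
Imports System
Imports System.Threading.Tasks.Dataflow

Module BroadcastBlockSketch
    ' Hypothetical helper: posts one value and reads it back three times.
    Function ReadThreeTimes(value As Double) As Double()
        ' Nothing means that the block hands out the stored value without cloning it.
        Dim broadcaster = New BroadcastBlock(Of Double)(Nothing)
        broadcaster.Post(value)

        Dim results(2) As Double
        For i As Integer = 0 To 2
            ' Receiving does not remove the value from the block.
            results(i) = broadcaster.Receive()
        Next i
        Return results
    End Function

    Sub Main()
        For Each v In ReadThreeTimes(Math.PI)
            Console.WriteLine(v)
        Next
    End Sub
End Module
```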

The ActionBlock(Of TInput) class is a target block that calls a delegate when it receives data. Think of an ActionBlock(Of TInput) object as a delegate that runs asynchronously when data becomes available. The delegate that you provide to an ActionBlock(Of TInput) object can be of type Action(Of TInput) or type Func(Of TInput, Task). When you use an ActionBlock(Of TInput) object with Action(Of TInput), processing of each input element is considered completed when the delegate returns. When you use an ActionBlock(Of TInput) object with Func(Of TInput, Task), processing of each input element is considered completed only when the returned Task object is completed. By using these two mechanisms, you can use ActionBlock(Of TInput) for both synchronous and asynchronous processing of each input element.

The following basic example posts multiple Int32 values to an ActionBlock(Of TInput) object. The ActionBlock(Of TInput) object prints those values to the console. This example then sets the block to the completed state and waits for all dataflow tasks to finish.

' Create an ActionBlock(Of Integer) object that prints values
' to the console.
Dim actionBlock = New ActionBlock(Of Integer)(Sub(n) Console.WriteLine(n))

' Post several messages to the block.
For i As Integer = 0 To 2
    actionBlock.Post(i * 10)
Next i

' Set the block to the completed state and wait for all
' tasks to finish.
actionBlock.Complete()
actionBlock.Completion.Wait()

' Output:
'   0
'   10
'   20

The BatchBlock(Of T) class combines sets of input data, which are known as batches, into arrays of output data. You specify the size of each batch when you create a BatchBlock(Of T) object. When the BatchBlock(Of T) object receives the specified count of input elements, it asynchronously propagates out an array that contains those elements. If a BatchBlock(Of T) object is set to the completed state but does not contain enough elements to form a batch, it propagates out a final array that contains the remaining input elements.

The BatchBlock(Of T) class operates in either greedy or non-greedy mode. In greedy mode, which is the default, a BatchBlock(Of T) object accepts every message that it is offered and propagates out an array after it receives the specified count of elements. In non-greedy mode, a BatchBlock(Of T) object postpones all incoming messages until enough sources have offered messages to the block to form a batch. Greedy mode typically performs better than non-greedy mode because it requires less processing overhead. However, you can use non-greedy mode when you must coordinate consumption from multiple sources in an atomic fashion. Specify non-greedy mode by setting Greedy to False in the dataflowBlockOptions parameter in the BatchBlock(Of T) constructor.

' Create a BatchBlock(Of Integer) object that holds ten
' elements per batch.
Dim batchBlock = New BatchBlock(Of Integer)(10)

' Post several values to the block.
For i As Integer = 0 To 12
    batchBlock.Post(i)
Next i

' Set the block to the completed state. This causes
' the block to propagate out any remaining
' values as a final batch.
batchBlock.Complete()

' Print the sum of both batches.
Console.WriteLine("The sum of the elements in batch 1 is {0}.", batchBlock.Receive().Sum())
Console.WriteLine("The sum of the elements in batch 2 is {0}.", batchBlock.Receive().Sum())

' Output:
'   The sum of the elements in batch 1 is 45.
'   The sum of the elements in batch 2 is 33.

Like BatchBlock(Of T), JoinBlock(Of T1, T2) and JoinBlock(Of T1, T2, T3) operate in either greedy or non-greedy mode. In greedy mode, which is the default, a JoinBlock(Of T1, T2) or JoinBlock(Of T1, T2, T3) object accepts every message that it is offered and propagates out a tuple after each of its targets receives at least one message. In non-greedy mode, a JoinBlock(Of T1, T2) or JoinBlock(Of T1, T2, T3) object postpones all incoming messages until all targets have been offered the data that is required to create a tuple. At this point, the block engages in a two-phase commit protocol to atomically retrieve all required items from the sources. This postponement makes it possible for another entity to consume the data in the meantime, to allow the overall system to make forward progress.
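For the default greedy mode, a small sketch along these lines shows a JoinBlock(Of T1, T2) object pairing one value from each of its targets into a tuple. The module name, the JoinPair helper, and the sample values are assumptions for illustration.

```vb
Imports System
Imports System.Threading.Tasks.Dataflow

Module JoinBlockSketch
    ' Hypothetical helper: joins one number with one label.
    Function JoinPair(number As Integer, label As String) As Tuple(Of Integer, String)
        ' Greedy mode is the default, so each target accepts its message immediately.
        Dim joiner = New JoinBlock(Of Integer, String)()

        joiner.Target1.Post(number)
        joiner.Target2.Post(label)

        ' A tuple becomes available once both targets hold a message.
        Return joiner.Receive()
    End Function

    Sub Main()
        Dim pair = JoinPair(1, "one")
        Console.WriteLine("{0} -> {1}", pair.Item1, pair.Item2)
    End Sub
End Module
```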

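You can enable additional options by providing a System.Threading.Tasks.Dataflow.DataflowBlockOptions object to the constructor of dataflow block types. These options control behavior such as the scheduler that manages the underlying task and the degree of parallelism. The DataflowBlockOptions class also has derived types that specify behavior that is specific to certain dataflow block types. The buffering blocks (BufferBlock(Of T), BroadcastBlock(Of T), and WriteOnceBlock(Of T)) take a DataflowBlockOptions object. The execution blocks (ActionBlock(Of TInput), TransformBlock(Of TInput, TOutput), and TransformManyBlock(Of TInput, TOutput)) take an ExecutionDataflowBlockOptions object. The grouping blocks (BatchBlock(Of T), JoinBlock(Of T1, T2), and BatchedJoinBlock(Of T1, T2)) take a GroupingDataflowBlockOptions object.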

Every predefined dataflow block uses the TPL task scheduling mechanism to perform activities such as propagating data to a target, receiving data from a source, and running user-defined delegates when data becomes available. TaskScheduler is an abstract class that represents a task scheduler that queues tasks onto threads. The default task scheduler, TaskScheduler.Default, uses the ThreadPool class to queue and execute work. You can override the default task scheduler by setting the TaskScheduler property when you construct a dataflow block object.

The default value of MaxDegreeOfParallelism is 1, which guarantees that the dataflow block processes one message at a time. Setting this property to a value that is larger than 1 enables the dataflow block to process multiple messages concurrently. Setting this property to DataflowBlockOptions.Unbounded enables the underlying task scheduler to manage the maximum degree of concurrency.

Important

When you specify a maximum degree of parallelism that is larger than 1, multiple messages are processed simultaneously, and therefore, messages might not be processed in the order in which they are received. However, the block still outputs messages in the order in which it received them.

Because the MaxDegreeOfParallelism property represents the maximum degree of parallelism, the dataflow block might execute with a lesser degree of parallelism than you specify. The dataflow block might use a lesser degree of parallelism to meet its functional requirements or because there is a lack of available system resources. A dataflow block never chooses more parallelism than you specify.

The value of the MaxDegreeOfParallelism property is exclusive to each dataflow block object. For example, if four dataflow block objects each specify 1 for the maximum degree of parallelism, all four dataflow block objects can potentially run in parallel.
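The effect of MaxDegreeOfParallelism can be sketched with a TransformBlock(Of TInput, TOutput) object, as shown below. The module name, the SquareAll helper, and the Thread.Sleep call that simulates work are assumptions for illustration; the sleep makes later inputs finish sooner, so that results arriving in input order cannot be explained by processing order alone.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Threading
Imports System.Threading.Tasks.Dataflow

Module ParallelismSketch
    ' Hypothetical helper: squares 0 .. count - 1 with the given degree of parallelism.
    Function SquareAll(count As Integer, degree As Integer) As List(Of Integer)
        Dim options = New ExecutionDataflowBlockOptions() With {.MaxDegreeOfParallelism = degree}

        Dim square = New TransformBlock(Of Integer, Integer)(
            Function(n)
                ' Simulated work: smaller inputs take longer than larger ones.
                Thread.Sleep((count - n) * 5)
                Return n * n
            End Function, options)

        For i As Integer = 0 To count - 1
            square.Post(i)
        Next i
        square.Complete()

        ' The block outputs results in the order in which it received the inputs.
        Dim results As New List(Of Integer)()
        For i As Integer = 0 To count - 1
            results.Add(square.Receive())
        Next i
        Return results
    End Function

    Sub Main()
        Console.WriteLine(String.Join(", ", SquareAll(6, 4)))
    End Sub
End Module
```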

The predefined dataflow block types use tasks to process multiple input elements. This helps minimize the number of task objects that are required to process data, which enables applications to run more efficiently. However, when the tasks from one set of dataflow blocks are processing data, the tasks from other dataflow blocks might need to wait for processing time by queuing messages. To enable better fairness among dataflow tasks, set the MaxMessagesPerTask property. When MaxMessagesPerTask is set to DataflowBlockOptions.Unbounded, which is the default, the task used by a dataflow block processes as many messages as are available. When MaxMessagesPerTask is set to a value other than Unbounded, the dataflow block processes at most this number of messages per Task object. Although setting the MaxMessagesPerTask property can increase fairness among tasks, it can cause the system to create more tasks than are necessary, which can decrease performance.

The TPL provides a mechanism that enables tasks to coordinate cancellation in a cooperative manner. To enable dataflow blocks to participate in this cancellation mechanism, set the CancellationToken property. When this CancellationToken object is set to the canceled state, all dataflow blocks that monitor this token finish execution of their current item but do not start processing subsequent items. These dataflow blocks also clear any buffered messages, release connections to any source and target blocks, and transition to the canceled state. When a block transitions to the canceled state, the Status property of its completion task is set to Canceled, unless an exception occurred during processing. In that case, Status is set to Faulted.
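Cooperative cancellation of a block can be sketched as follows. The module name, the CancelAndGetStatus helper, the number of posted messages, and the 50-millisecond simulated work are all assumptions for illustration.

```vb
Imports System
Imports System.Threading
Imports System.Threading.Tasks
Imports System.Threading.Tasks.Dataflow

Module CancellationSketch
    ' Hypothetical helper: cancels a busy block and returns its final status.
    Function CancelAndGetStatus() As TaskStatus
        Dim cts = New CancellationTokenSource()
        Dim options = New ExecutionDataflowBlockOptions() With {.CancellationToken = cts.Token}

        ' Each message simulates a small amount of work.
        Dim worker = New ActionBlock(Of Integer)(Sub(n) Thread.Sleep(50), options)

        For i As Integer = 0 To 99
            worker.Post(i)
        Next i

        ' Request cancellation; the block finishes its current item
        ' and discards the messages that are still buffered.
        cts.Cancel()

        Try
            worker.Completion.Wait()
        Catch ae As AggregateException
            ' Waiting on a canceled task surfaces a TaskCanceledException.
            ae.Handle(Function(e) TypeOf e Is TaskCanceledException)
        End Try

        Return worker.Completion.Status
    End Function

    Sub Main()
        Console.WriteLine(CancelAndGetStatus())
    End Sub
End Module
```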

Several grouping dataflow block types can operate in either greedy or non-greedy mode. By default, the predefined dataflow block types operate in greedy mode.

For join block types such as JoinBlock(Of T1, T2), greedy mode means that the block immediately accepts data even if the corresponding data with which to join is not yet available. Non-greedy mode means that the block postpones all incoming messages until one is available on each of its targets to complete the join. If any of the postponed messages are no longer available, the join block releases all postponed messages and restarts the process. For the BatchBlock(Of T) class, greedy and non-greedy behavior is similar, except that under non-greedy mode, a BatchBlock(Of T) object postpones all incoming messages until enough are available from distinct sources to complete a batch.
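Non-greedy batching can be sketched with two BufferBlock(Of T) sources linked to one BatchBlock(Of T) object, as shown below. The module name, the FirstBatchSorted helper, and the sample values are assumptions for illustration. The batch is sorted before it is returned because this sketch makes no claim about the order of elements within a batch.

```vb
Imports System
Imports System.Threading.Tasks.Dataflow

Module NonGreedyBatchSketch
    ' Hypothetical helper: forms one batch of two from two distinct sources.
    Function FirstBatchSorted() As Integer()
        Dim options = New GroupingDataflowBlockOptions() With {.Greedy = False}
        Dim batcher = New BatchBlock(Of Integer)(2, options)

        Dim sourceA = New BufferBlock(Of Integer)()
        Dim sourceB = New BufferBlock(Of Integer)()
        sourceA.LinkTo(batcher)
        sourceB.LinkTo(batcher)

        ' sourceA holds two messages but offers only the one at its head,
        ' so the batch cannot be completed from sourceA alone.
        sourceA.Post(1)
        sourceA.Post(2)

        ' After sourceB also offers a message, the block can consume
        ' one message from each source and form a batch.
        sourceB.Post(10)

        Dim batch = batcher.Receive()
        Array.Sort(batch)
        Return batch
    End Function

    Sub Main()
        Console.WriteLine(String.Join(", ", FirstBatchSorted()))
    End Sub
End Module
```

The messages come from linked sources rather than from direct Post calls on the batch block, because postponing a message requires a source that the block can later reserve it from.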

Explains how to use the JoinBlock(Of T1, T2) class to perform an operation when data is available from multiple sources, and how to use non-greedy mode to enable multiple join blocks to share a data source more efficiently.

Describes how to use the BatchBlock(Of T) class to improve the efficiency of database insert operations, and how to use the BatchedJoinBlock(Of T1, T2) class to capture both the results and any exceptions that occur while the program reads from a database.