Monday, April 22, 2013

Abstract: With .NET 4.5, the Task Parallel Library team went a step further and built a little-known library called the Parallel Dataflow Library. This article explains why, as a .NET developer, you should know about this powerful library.

By now, we’ve all heard about the Task Parallel Library (TPL) introduced in .NET 4.0. TPL is really neat and goes a long way in making parallel programming palatable for a bigger slice of developers. With .NET 4.5, the TPL team went a step further and built a little-known library called the Parallel Dataflow Library.

Parallel what… you ask?

Yeah well, the naming isn’t very helpful, nor has it been blogged about much! But the Parallel Dataflow Library is actually quite powerful. It builds on the TPL to give us in-process message passing. This helps in creating a ‘pipeline’ of ‘tasks’ when we are trying to build asynchronous applications. Confused? Let’s try with an example.

Let’s think of this scenario! You need to process multiple images (resizing them, applying Instagram-like filters or something else, and then saving the new images). The images are going to come into your system intermittently. Once they arrive, you need to process them and put them at a destination location. How shall we go about it?

Well, one way is to wait for images to arrive and, when they do (either one at a time or in bulk), pick each available image, load it, process it, move it to the output and go back to waiting. We can do this in two ways.

The first approach, as we can see, is a monolithic and synchronous action made up of three potentially independent steps. For every file we perform all three steps while the next file waits to be considered.
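To make the monolithic version concrete, here is a minimal sketch of it. The names SequentialImages, Load and Process are hypothetical, and plain strings stand in for real image data so that the example stays self-contained.

```csharp
using System;
using System.Collections.Generic;

public static class SequentialImages
{
    // Hypothetical stand-ins for the real work; a string plays the role of an image.
    public static string Load(string path) => "pixels:" + path;
    public static string Process(string image) => image.ToUpperInvariant();

    // Load, process and save one file at a time; no step overlaps with another.
    public static List<string> Run(IEnumerable<string> paths)
    {
        var saved = new List<string>();
        foreach (var path in paths)
        {
            var image = Load(path);        // step 1: load
            var result = Process(image);   // step 2: process
            saved.Add(result);             // step 3: "save" into the output list
        }
        return saved;
    }

    public static void Main()
    {
        foreach (var item in Run(new[] { "a.png", "b.png" }))
            Console.WriteLine(item);
    }
}
```

Notice that the second file is not even loaded until the first one has been saved, which is exactly the waiting described above.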

In the second approach, once a file is loaded, the ‘loader’ can hand over the image to the ‘processor’ and go back to loading again! Similarly, once the processor has finished with a file, it can hand it over to the output system to save and go back to processing the next file. The same goes for the output step. So these three tasks can actually be piped together: on the input side we have the file or files as they come in, and on the output side we have each file as it gets done. The above image would get transformed into the following:

...

As we can see, it’s a pipeline of smaller tasks that communicate with each other as data flows through them over time. The Parallel Dataflow Library helps us build this kind of solution with relative ease. The parallel execution and asynchronous behavior are taken care of by the TPL, on which the Dataflow library is built.
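Here is a rough sketch of how such a pipeline might look with the System.Threading.Tasks.Dataflow namespace. The class name ImagePipeline and the string-based "images" are assumptions made for illustration, not code from a real image processor. Depending on your target framework, you may need to reference the System.Threading.Tasks.Dataflow NuGet package.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class ImagePipeline
{
    // Builds loader -> processor -> saver, feeds it paths, and returns what was "saved".
    public static async Task<string[]> RunAsync(params string[] paths)
    {
        var saved = new ConcurrentQueue<string>();

        // Each block runs its delegate on the TPL as messages arrive.
        var loader = new TransformBlock<string, string>(path => "pixels:" + path);
        var processor = new TransformBlock<string, string>(image => image.ToUpperInvariant());
        var saver = new ActionBlock<string>(image => saved.Enqueue(image));

        // PropagateCompletion passes the "no more data" signal down the chain.
        var link = new DataflowLinkOptions { PropagateCompletion = true };
        loader.LinkTo(processor, link);
        processor.LinkTo(saver, link);

        foreach (var path in paths)
            loader.Post(path);

        loader.Complete();        // signal that no more input is coming
        await saver.Completion;   // wait until the last stage has drained
        return saved.ToArray();
    }

    public static async Task Main()
    {
        foreach (var item in await RunAsync("a.png", "b.png"))
            Console.WriteLine(item);
    }
}
```

While the processor works on one item, the loader is already free to pick up the next one, so the three stages can overlap instead of waiting on each other.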

Now that we know what it does, let’s see some of the types and structures that help the Parallel Dataflow Library achieve this.

Parallel Dataflow Library – Programming Model

Continuing with our example above, if we look at the three blocks we have, the first one can be considered a Source Block, the last one a Target Block, and the one in the middle a Transformation Block that moves data from source to target while doing some work on it.
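These three roles map onto interfaces in the library: ISourceBlock, ITargetBlock and IPropagatorBlock (a block that is both a target and a source). The sketch below uses integers instead of images to keep things small; the class name BlockRoles and the choice of BufferBlock, TransformBlock and ActionBlock for each role are just one possible arrangement.

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class BlockRoles
{
    public static async Task<int> RunAsync()
    {
        int sum = 0;

        // Source: holds posted messages and offers them to linked targets.
        var buffer = new BufferBlock<int>();
        ISourceBlock<int> source = buffer;

        // Transformation: receives a value, does some work, passes the result on.
        IPropagatorBlock<int, int> transform = new TransformBlock<int, int>(n => n * 2);

        // Target: the end of the line, it only consumes.
        ITargetBlock<int> target = new ActionBlock<int>(n => sum += n);

        var link = new DataflowLinkOptions { PropagateCompletion = true };
        source.LinkTo(transform, link);
        transform.LinkTo(target, link);

        buffer.Post(1);
        buffer.Post(2);
        buffer.Post(3);
        buffer.Complete();

        await target.Completion;
        return sum;   // 2 + 4 + 6
    }

    public static async Task Main()
    {
        Console.WriteLine(await RunAsync());   // prints 12
    }
}
```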

...

Conclusion

That was a very quick and brief introduction to the Parallel Dataflow Library of TPL! It is quite extensive in terms of the blocks it supports, and the rest is a matter of cooking up the use case we want to throw at it. I hope to have at least set up the premise under which such an architecture should be considered.

Overall, the Dataflow library helps us do ‘in-process’ message passing that can be used for dataflow and pipelining at a much higher level, that is, with objects in managed code, as opposed to with bits and bytes inside CPU execution pipelines!

If you have a bunch of data or files that flow from one state to the next, and you want or need to live in a multithreaded world (and who doesn't these days?), you might want to check out the Task Parallel Dataflow library. I mean, it's there already, so you might as well, right?