Introduction

This is the third part of my proposed series of articles on TPL. Last time I introduced Continuations, and covered this ground:

Some more TPL background

Continuation, what's that

Simple continuation

WPF synchronization

Continue "WhenAny"

Continue "WhenAll"

Using a Continuation for exception handling

Using Continuation as pipeline

Catching exception in Continuation antecedent

Cancelling a Continuation

This time we are going to be looking at how to use Parallel for/foreach loops. We shall also be looking at how to do the usual TPL-like things, such as cancelling and dealing with exceptions, as well as how to break out of parallel loops and how to use Thread Local Storage within them.

Article Series Roadmap

This is article 3 of a possible 6, which I hope people will like. Shown below is the rough outline of what I would like to cover:

Now, I am aware that some folk will simply read this article and state that it is similar to what is currently available on MSDN, and I in part agree with that; however, there are several reasons I have chosen to still take on the task of writing up these articles, which are as follows:

It will only really be the first couple of articles which show similar ideas to MSDN; after that, I feel the material I will get into will not be on MSDN, and will be the result of some TPL research on my part, which I will be outlining in the article(s), so you will benefit from my research, which you can just read...Aye, nice.

There will be screenshots of live output here, which is something MSDN does not have that much of, and which may help reinforce the article(s) text for some readers.

There may be some readers out there that have never even heard of the Task Parallel Library, so they would never think to look for it on MSDN; you know the old story, you have to know what you are looking for in the first place.

I enjoy threading articles, so I like doing them; I did them, will do them, have done them, and will continue to do them.

All that said, if people, having read this article, truly think it is too similar to MSDN (which I still hope it won't be), let me know that as well, and I will try and adjust the upcoming articles to make amends.

Parallel For/Foreach

A lot of us probably write a lot of sequential code like this:

foreach (SomeObject x in ListOfSomeObjects)
{
    x.DoSomething();
}

Here we are doing something to each of the items in some source, and there is no relationship at all between the objects in the source list; we simply want something to happen to every item in some source collection of these objects. The fact that we want to do something to all these objects, and the fact that they are not tightly coupled to each other, makes this sort of thing an ideal candidate for parallelism. The designers of TPL thought so too, so they came up with the ability to create Parallel.For and Parallel.ForEach loops.

The rest of this article will look at ways in which you can use Parallel.For and Parallel.ForEach loops in your own code. Obviously, since we are dealing with parallelism, there are a few added complications, but overall, it's still pretty easy to follow.

Creating a Simple Parallel For/Foreach

Demo solution project: SimpleParallel

Let's start by creating a dead simple Parallel.For and Parallel.ForEach loop; here is an example of each:
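In sketch form (not the exact demo code; the range, list contents, and messages are illustrative):

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class SimpleParallelDemo
{
    static void Main()
    {
        // Parallel.For : iterate over a range of ints in parallel
        Parallel.For(0, 10, i =>
        {
            Console.WriteLine("Parallel.For index {0}", i);
        });

        // Parallel.ForEach : iterate over a collection in parallel
        List<string> items = new List<string> { "one", "two", "three", "four" };
        Parallel.ForEach(items, item =>
        {
            Console.WriteLine("Parallel.ForEach item {0}", item);
        });
    }
}

The delegate bodies may run on different ThreadPool worker threads, so the output order will vary from run to run.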

Question is, can you also do this sort of thing (break out of the loop early, as you would with break in a sequential loop) when using Parallel.For and Parallel.ForEach loops? Well, yes you can; we just need to use a TPL class called ParallelLoopState, which we can get hold of via one of the many overloads of TPL's Parallel.For and Parallel.ForEach.

By using ParallelLoopState, we can use two loop control methods: Stop() and Break().

ParallelLoopState.Stop()

Communicates that the System.Threading.Tasks.Parallel loop should cease execution at the system's earliest convenience.

ParallelLoopState.Break()

Communicates that the System.Threading.Tasks.Parallel loop should cease execution of iterations beyond the current iteration at the system's earliest convenience.

We can also get details about how the Parallel.For or Parallel.ForEach loop's work proceeded using the ParallelLoopResult struct, which we get as the return value of a TPL Parallel.For or Parallel.ForEach loop. Usage of the TPL ParallelLoopResult struct to examine the status of a loop is also shown in the following three examples:
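In condensed sketch form (not the exact demo listings; the range and cut-off index are illustrative), the Stop()/Break() pattern looks like this:

using System;
using System.Threading.Tasks;

class LoopStateDemo
{
    static void Main()
    {
        // Stop(): no further iterations are guaranteed to run,
        // including ones with lower indexes that have not started yet
        ParallelLoopResult stopResult = Parallel.For(0, 1000, (i, loopState) =>
        {
            if (i >= 100)
                loopState.Stop();
        });

        // Break(): iterations with an index lower than the current
        // one are still allowed to run to completion
        ParallelLoopResult breakResult = Parallel.For(0, 1000, (i, loopState) =>
        {
            if (i >= 100)
                loopState.Break();
        });

        // IsCompleted is false when the loop ended early;
        // LowestBreakIteration is only set when Break() was used
        Console.WriteLine("Stop : IsCompleted={0}, LowestBreakIteration={1}",
            stopResult.IsCompleted, stopResult.LowestBreakIteration);
        Console.WriteLine("Break: IsCompleted={0}, LowestBreakIteration={1}",
            breakResult.IsCompleted, breakResult.LowestBreakIteration);
    }
}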

And here is a screenshot of this running. See how the first two (the ParallelLoopState.Stop() ones) do not provide a ParallelLoopResult.LowestBreakIteration when queried. This is due to the fact that they used ParallelLoopState.Stop() and therefore did not actually break.

Handling Exceptions

Demo solution project: HandlingExceptions

We covered general Task exception handling in article 1, and you can use any of those techniques here; however, I find by far the easiest is to use the familiar construct we all use in non-asynchronous code, i.e. try/catch, where we simply catch the TPL AggregateException. The pattern is something like this:
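A minimal sketch of that pattern (the range is illustrative; the exception message is the one from the demo):

using System;
using System.Threading.Tasks;

class ExceptionDemo
{
    static void Main()
    {
        try
        {
            Parallel.For(0, 10, i =>
            {
                if (i > 5)
                    throw new InvalidOperationException("Don't like nums > 5");
                Console.WriteLine("Processed {0}", i);
            });
        }
        catch (AggregateException aggEx)
        {
            // every exception thrown by an iteration that got scheduled
            // ends up wrapped inside the one AggregateException
            foreach (Exception inner in aggEx.InnerExceptions)
                Console.WriteLine("Caught: {0}", inner.Message);
        }
    }
}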

It can be seen that we got an AggregateException thrown by the Parallel.For loop containing two inner exceptions; this is because the two iterations that hit the silly throw new InvalidOperationException("Don't like nums > 5") had already been scheduled, and therefore ran, so two exceptions occurred and were aggregated and caught.

Cancelling a Parallel Loop

Demo solution project: CancellingLoops

We covered general Task cancellation in article 1, and the idea is still pretty much the same one you will see crop up again and again when using TPL. How do you cancel a parallel operation? Use a CancellationToken, that's how.

From the two preceding articles, we know that we should catch AggregateException, and that we should typically include a line like this within our parallel code:

token.ThrowIfCancellationRequested();

Which is what we have been doing until now. Unfortunately, this is not what we need to do when working with Parallel.For and Parallel.ForEach loops, and it will lead to an unhandled OperationCanceledException, as shown in the figure below:

So what must we do? Well, it is quite simple in this particular case (i.e., cancellation). We must make sure to provide a catch handler specifically for OperationCanceledException, and it seems to make no odds whether we include the line:

token.ThrowIfCancellationRequested();

or not. The Parallel.For and Parallel.ForEach loops always end up throwing an OperationCanceledException when their respective CancellationToken is cancelled. I guess it is down to personal preference whether you include that line, at the end of the day. Since it seems to make no difference as to whether an OperationCanceledException is thrown, I am choosing not to include that line in the Parallel.For and Parallel.ForEach loop demos.

Here is a full listing of a Parallel.For and a Parallel.ForEach loop, both of which get cancelled by the same CancellationToken after 5 seconds.
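In condensed sketch form (not the exact demo listing; the iteration counts, sleeps, and messages are illustrative):

using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class CancellationDemo
{
    static void Main()
    {
        CancellationTokenSource tokenSource = new CancellationTokenSource();
        ParallelOptions options = new ParallelOptions
        {
            CancellationToken = tokenSource.Token
        };

        // cancel both loops via the same token after 5 seconds
        Task.Factory.StartNew(() =>
        {
            Thread.Sleep(5000);
            tokenSource.Cancel();
        });

        try
        {
            Parallel.For(0, 10000, options, i =>
            {
                if (i % 10 == 0)
                    Console.WriteLine("Parallel.For index {0}", i);
                Thread.Sleep(100); // simulate some work
            });
        }
        catch (OperationCanceledException)
        {
            Console.WriteLine("Parallel.For loop was cancelled");
        }

        try
        {
            // by the time we get here the token has (on this timing)
            // already been cancelled, so no ForEach work is scheduled
            Parallel.ForEach(Enumerable.Range(0, 10000), options, item =>
            {
                Console.WriteLine("Parallel.ForEach item {0}", item);
                Thread.Sleep(100);
            });
        }
        catch (OperationCanceledException)
        {
            Console.WriteLine("Parallel.ForEach loop was cancelled");
        }
    }
}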

We do indeed see our cancellation messages, and nothing was scheduled from the Parallel.ForEach.

However, this looks weird, right?

I would draw your attention to this: note how the for loop only printed 10 and 500 (OK, I designed it to only print when index % (modulus) 10 == 0, so there is no real way to tell exactly what went on), but this does show that ordering is certainly not guaranteed by Parallel.For loops; you should not rely on any ordering at all, and to do so would spell certain doom.

TPL will schedule Parallel.For delegates onto worker threads as and when it sees fit, as shown above. We may have had other indexes called, but all we really know is that we got indexes 0 and 500 called before we cancelled. What about 100, 200, 300...? The order is not maintained, so please do not expect any ordering to be maintained by TPL; that will not happen.

That said, it does nicely demonstrate that the Parallel.ForEach work did not get scheduled at all, as we see no output from it above. This may vary depending on how many cores your PC has; mine only has 2, so that is the output I saw.

Partitioning for Better Performance

Demo solution project: OrderedPartitioner

I don't know how many of you have been eagle-eyed enough to notice that when we run a Task/Continuation or a Parallel.For/Parallel.ForEach loop, we are effectively queuing up a delegate of work that will eventually be run on a ThreadPool worker thread. The thing is, what can happen, especially in the case of Parallel.For/Parallel.ForEach loops with small delegate bodies, is that the overhead of creating/swapping out these small delegate payloads can have an adverse effect on performance.

Question is, is there anything we can do about it? Well yes, TPL does allow us to create our own partitioning chunks, where the idea is that the overall Parallel.For/Parallel.ForEach workload is broken up into chunks of a size that we specify in our own code. We are effectively creating a custom partitioning algorithm. If we go with the default Partitioner, it uses a default partitioning algorithm that takes into account the number of cores, amongst other things, and may not give the best results.

To demonstrate this, I have included three examples below:

No partitioning at all

Use the default TPL partitioning algorithm (my laptop has 2 cores)

Use our own custom partitioning logic

Here is the full code listing to demonstrate these three scenarios. The idea is that I use a simple Task to fill an array with some dummy data, and then each of the three scenarios outlined above is run (and timed) in turn. Each of the three scenarios performs the same job: it gets an item from the original dummy list, squares it, and writes that value to another results array. By making sure all three scenarios do exactly the same thing, we should be able to get a realistic comparison.

At the end of the block under test, we also verify that all the elements were hit, by checking the results array for 0; 0 should only be present when the results array is reinitialised at the start of each of the three scenarios. We are basically ensuring every element from the original list got hit by the scenario.

I am just using standard threading synchronization primitives (a ManualResetEventSlim) to control the running of the three scenarios, only allowing one to run at a time; they run in the order in which they are declared in the source code file.
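In condensed sketch form (not the exact demo listing, and without the ManualResetEventSlim plumbing; the data size and chunk size are illustrative), the three scenarios look like this:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

class PartitioningDemo
{
    static void Main()
    {
        const int size = 10000000;
        // values 1..size, so a squared result can never be 0
        List<int> inputData = Enumerable.Range(1, size).ToList();
        long[] results = new long[size];
        Stopwatch watch = new Stopwatch();

        // Scenario 1: Parallel.ForEach straight over the list
        Array.Clear(results, 0, size);
        watch.Restart();
        Parallel.ForEach(inputData, item =>
        {
            results[item - 1] = (long)item * item;
        });
        watch.Stop();
        Console.WriteLine("Scenario 1: {0} ms, zeros left: {1}",
            watch.ElapsedMilliseconds, results.Count(r => r == 0));

        // Scenario 2: the default TPL partitioner, created explicitly
        Array.Clear(results, 0, size);
        watch.Restart();
        Parallel.ForEach(Partitioner.Create(inputData), item =>
        {
            results[item - 1] = (long)item * item;
        });
        watch.Stop();
        Console.WriteLine("Scenario 2: {0} ms, zeros left: {1}",
            watch.ElapsedMilliseconds, results.Count(r => r == 0));

        // Scenario 3: our own chunk size, via a range partitioner
        Array.Clear(results, 0, size);
        watch.Restart();
        Parallel.ForEach(Partitioner.Create(0, size, size / 10), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
                results[i] = (long)inputData[i] * inputData[i];
        });
        watch.Stop();
        Console.WriteLine("Scenario 3: {0} ms, zeros left: {1}",
            watch.ElapsedMilliseconds, results.Count(r => r == 0));
    }
}

Note that scenario 3 hands each worker a whole Tuple<int, int> range to chew through, which is what cuts down the per-delegate overhead.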

And here is the result of running this code (again my laptop has 2 cores, so your results may differ):

I think the results speak for themselves. Using no explicit partitioning is OK; then we use the default partitioner and things get worse (more than likely due to my laptop only having 2 cores, amongst other factors); but look what happens when we take full control over the partition size to use: that gave us the best performance of all.

So there you go, food for thought.

Thread Local Storage

Sometimes what we need to do inside a loop is accumulate some sort of running total, something like this in sequential code:
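A quick illustrative snippet (the collection name and values are made up):

using System;
using System.Collections.Generic;

class SequentialTotalDemo
{
    static void Main()
    {
        List<double> sourceValues = new List<double> { 1.5, 2.5, 3.0 };

        double total = 0;
        foreach (double value in sourceValues)
        {
            // accumulate a running total over the source
            total += value;
        }

        Console.WriteLine("Total: {0}", total);
    }
}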

So how would we do something like this using Parallel.For/Parallel.ForEach loops? The answer lies in using thread-local storage, which has been around for a while, but TPL puts a new spin on it.

Luckily, when constructing Parallel.For/Parallel.ForEach loops, we have all the relevant overloads we need to use thread-local storage; it's just a matter of understanding how to do so. To this end, I have provided two examples, which are as follows:

Demo solution project: ThreadLocalStorage

This example searches a source list of words for a particular search word, and keeps a count of how many times that word is found. It is important to know that when we access the count variable, we are effectively accessing a shared object that needs thread synchronization of some sort. I have chosen to use lock(..) (which really uses Monitor.Enter()/Monitor.Exit() behind the scenes), but you could use any threading synchronization primitive that you like.

The format of the loop is pretty much dictated by TPL if you want to use thread-local storage; it is just a pattern that you will have to learn and follow. This is what the method signature and comments look like within .NET 4.0 itself:

//
// Summary:
//     Executes a for each operation on an System.Collections.IEnumerable{TSource}
//     in which iterations may run in parallel.
//
// Parameters:
//   source:
//     An enumerable data source.
//
//   localInit:
//     The function delegate that returns the initial state of the local data for
//     each thread.
//
//   body:
//     The delegate that is invoked once per iteration.
//
//   localFinally:
//     The delegate that performs a final action on the local state of each thread.
//
// Type parameters:
//   TSource:
//     The type of the data in the source.
//
//   TLocal:
//     The type of the thread-local data.
//
// Returns:
//     A System.Threading.Tasks.ParallelLoopResult structure that contains information
//     on what portion of the loop completed.
//
// Exceptions:
//   System.ArgumentNullException:
//     The exception that is thrown when the source argument is null.-or-The exception
//     that is thrown when the body argument is null.-or-The exception that is thrown
//     when the localInit argument is null.-or-The exception that is thrown when
//     the localFinally argument is null.
public static ParallelLoopResult ForEach<TSource, TLocal>(IEnumerable<TSource> source,
    Func<TLocal> localInit, Func<TSource, ParallelLoopState, TLocal, TLocal> body,
    Action<TLocal> localFinally);

It looks a bit hairy, but I think the code is somehow easier to understand than this TPL metadata.
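In sketch form (not the exact demo listing; the word list and search word are made up), the word-count example following that signature looks like this:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class WordCountDemo
{
    static void Main()
    {
        List<string> words = new List<string>
        {
            "cat", "dog", "cat", "fish", "cat", "dog"
        };
        string searchWord = "cat";

        object syncLock = new object();
        int count = 0;

        Parallel.ForEach(
            words,
            // localInit: each thread starts its own count at 0
            () => 0,
            // body: bump the thread-local count, no locking needed here
            (word, loopState, localCount) =>
            {
                if (word.Equals(searchWord, StringComparison.OrdinalIgnoreCase))
                    localCount++;
                return localCount;
            },
            // localFinally: fold each thread's subtotal into the shared
            // total; this is the only place synchronization is required
            localCount =>
            {
                lock (syncLock)
                {
                    count += localCount;
                }
            });

        Console.WriteLine("Found '{0}' {1} times", searchWord, count);
    }
}

Notice that the lock is only taken once per thread, in localFinally, rather than once per iteration; that is the whole point of the thread-local pattern.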

Here is what this code looks like when run:

Demo solution project: ThreadLocalStorage2

This example initializes a source list with random double values, and then adds these input values up to form a single output value. Again, it does this using thread-local storage, and as such, we are again accessing a shared variable that needs thread synchronization of some sort. Again, I have chosen to use lock(..). The other thing to note is that since the data type I am dealing with in this demo is a double, I must use the generic Parallel.For<TLocal> overload, where TLocal is double in this case.
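In sketch form (not the exact demo listing; the array size is illustrative):

using System;
using System.Threading.Tasks;

class SumDemo
{
    static void Main()
    {
        Random random = new Random();
        double[] inputData = new double[1000000];
        for (int i = 0; i < inputData.Length; i++)
            inputData[i] = random.NextDouble();

        object syncLock = new object();
        double total = 0;

        Parallel.For<double>(
            0, inputData.Length,
            // localInit: each thread starts its own subtotal at 0
            () => 0.0,
            // body: accumulate into the thread-local subtotal
            (i, loopState, localTotal) => localTotal + inputData[i],
            // localFinally: merge the subtotal into the shared total
            localTotal =>
            {
                lock (syncLock)
                {
                    total += localTotal;
                }
            });

        Console.WriteLine("Total: {0}", total);
    }
}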

That's it for Now

That is all I wanted to say in this article. I hope you liked it, and want more. If you did, could you spare some time to leave a comment and a vote? Many thanks.

Hopefully, see you at the next one, and the one after that, and the one after that; yes, 6 in total. I'd better get busy.

Comments

Ah, it's just because that is the last block of the 3 scenarios; granted, it does not need to signal the ManualResetEvent any more, as no one is waiting after this 3rd block, but it really doesn't matter that much either way.

Generally this has been a well written and easy to read series. Thanks!

Just to point out a couple of things that may assist your readers:

1. You do not need to use a cancellation token's ThrowIfCancellationRequested() within a Parallel.For and Parallel.ForEach, simply because the main loop body within the Parallel class will do this for you on each (or some short sequence of) iterations automatically. The reason you would include it within the body would be the case where the body may itself take longer to execute, perhaps with an inner loop or other logic. Various articles from Microsoft recommend checking the cancellation token every few thousand IL instructions. So, for a simple body this would not be needed, yet these loops will still terminate rather quickly when cancelled.

2. The reason you need to catch an OperationCanceledException instead of AggregateException when cancelling a Parallel.For/ForEach (that is, if you put it inside a try/catch) is that Task (which Parallel.For/ForEach is built upon) will aggregate all exceptions except a single OperationCanceledException that holds the expected cancellation token. This is used to distinguish a fault situation from an acknowledgement of a cancellation request; the internal tasks are left in different states, marked as Faulted in the former case and Cancelled in the latter. Also, this is not unique to Parallel.For/ForEach; it would be the same if you wrapped a Task.Wait or Task.Result in a try/catch.

3. I've seen lots of discussions of "ThreadLocal" variables used for Parallel.For/ForEach when using the method signatures that allow for later aggregation; this is even in a lot of the MSDN documentation and blogs. Perhaps this was changed prior to release, but Parallel.For/ForEach do not use ThreadLocal for this purpose, but rather a simple stack variable within the special internal Task that processes a specific range/partition. The stack variable is satisfactory, as it will (normally) only be used on that one thread, and is then passed upon completion of that Task body to the special final delegate.

4. In your partitioning example with 3 scenarios, you state that scenario #1 is "No partitioning at all". However, this is not correct; it will in fact use the default internal partitioner. At least, I do not see any way that you are disabling the partitioner. The actual library code would seem to indicate that this specific method call does not allow disabling the partitioner either: ForEach(IEnumerable<TSource>, Action<..,..,..>) calls the Parallel.ForEachWorker method, which uses Parallel.PartitionerForEachWorker with a partitioner created by Partitioner.Create(source) to do the work. Your scenario #2 is using the default partitioner; I've yet to understand, though, why the great difference in performance. The last point here is that your data sample and operation are likely simple enough that they would execute faster in a single thread without the overhead of TPL, so scenario #3 "wins" only because you have forced a large data set to be processed by each Task spawned by the Parallel.ForEach, and thus fewer Tasks. If the work was slightly more difficult, then the default partitioner (which is actually scenario #1) may have performed better.

So I think your conclusion on that point may not be wholly accurate, as the results do not quite speak for themselves. This could lead a reader to believe that in most cases it would be better to create a custom Partitioner, when in fact the default one is actually quite good and should be used first, until performance constraints and benchmarking actually determine otherwise.

[Edit: I now see your statement that this example was for showing how a custom partitioner can help when you have a small body. That didn't stand out for me the first time I read through, but I noticed it in the reference back from the 4th article, and now I see it, so your point is correct. In my continued reading, I did find out the reason for the difference between scenarios #1 and #2. If you call Parallel.ForEach without a Partitioner and with an IEnumerable<T> that is also an IList<T> or a T[], the ForEach will in fact be executed as a Parallel.For, which has fewer locking constraints than Parallel.ForEach. You can force a true ForEach by using an enumerable partitioner, as in #2, or by converting the List<T>/array to something that is only an IEnumerable<T> (e.g., inputData.Select(s => s)). So, scenario #1 is really a Parallel.For with a default indexing partitioner, and scenario #2 is really a Parallel.ForEach with a default IEnumerable<T> partitioner. ForEach on a true IEnumerable<T> adds additional locking for iterating through the enumerable.]

Thank you again. This has been a good study and encouraged me to dig deeper to understand this better.

I only just saw this message; there is something wrong with the way I am getting notified by the forums, and this one did not come through, so I apologize for that. You have some good points, and I will try and address them when I find some time.

Great article once again, Sacha. If I may, I feel there is an addition that can be made to the SynchronizationContext section. In your example, if I understand correctly, it assigns the value back to the original thread at the end of the work; however, I was trying to find a way for the background thread to report progress during its execution (I'm accustomed to BackgroundWorker and its ReportProgress feature).

As it turns out, this can also be done using the SynchronizationContext, taking advantage of the .Post(delegate, state) method.
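For example, something along these lines (a rough sketch; the timings and percentages are made up):

using System;
using System.Threading;
using System.Threading.Tasks;

public class ProgressReportingWorker
{
    public void Start()
    {
        // capture the UI thread's context; this must be called
        // from the UI thread for the capture to be meaningful
        SynchronizationContext uiContext = SynchronizationContext.Current;

        Task.Factory.StartNew(() =>
        {
            for (int percent = 0; percent <= 100; percent += 10)
            {
                Thread.Sleep(200); // simulate a chunk of work

                // Post is the asynchronous counterpart of Send; the
                // worker does not block waiting for the UI to handle it
                uiContext.Post(state =>
                {
                    // here you would update a progress bar / label
                    Console.WriteLine("Progress: {0}%", (int)state);
                }, percent);
            }
        });
    }
}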

Ah, many thanks for your comments. I did not mention progress as such, though if you see this post and my answer, you will see it's something that has been done with TPL before: Task Parallel Library: 1 of n[^]

I will look at including progress once I figure out a good way of doing it, where the Task is not right in the code-behind, which I consider to be untestable code. There needs to be a way that the form/WPF app can tell the task where to report progress to, without a strong link to the code-behind. I'll have a think about it.

Probably subclass Task and expose a ReportProgress event or something like that. Do you see what I mean? If you have Tasks started in your code-behind, that code can't be unit tested, and I am massively in favour of unit testing. I like patterns like MVP/MVC/MVVM that allow better separation of concerns, and testability.

I am not meaning to be arrogant or dismissive; it's just that I don't want to write up something about progress when I have not figured it out properly myself yet...I'll stew on this a while.

Sacha Barber

Microsoft Visual C# MVP 2008-2011

Codeproject MVP 2008-2011

Your best friend is you.
I'm my best friend too. We share the same views, and hardly ever argue

Oh true, I do see what you're saying. Placing the Task in the code-behind tightly couples the view to the threading code, which defeats the purpose of patterns like the ones you mentioned in your comment.

My example aside, in the actual application I was writing, I had the background thread update an ObservableCollection which I had bound a WPF control's ItemsSource to. It worked out alright and I was able to unit test it, though I'm still not confident I would advertise it as the best approach; I certainly understand wanting more time to work with the idea before including it in an article.

All nice and testable, and this would be even nicer in MVVM land. You could also abstract this even further by inheriting from Task, or having a Task wrapper which exposes an event, has properties to allow Min/Max to be set, and accepts an Action/Func to do the actual work.

This example has no Model, but you see my point right?
