Introduction

"With great power comes great responsibility"... As the hardware world is booming and its processing power increases, the responsibility lies with developers to utilize such power. OK, so the comparison is lame, but I am a fan of the Spider Man movie and was always looking for a way to use its slogan... guess I should keep looking :)

Seriously though, multi-core (and multi-processer) machines are dominant these days. Soon enough, as reported by some lead vendors, you will not be able to find machines with less than 8 cores. As such, the traditional way of writing serial programs will not be able to make use of such power, by default. These programs should be written in a way which utilizes the power of multi-cores.

Do we have to? Some might say. Wouldn't running my program on a multi-core machine automatically improve performance; after all aren't there more threads? The answer is no! Running a serial program on a multi-core machine has nearly zero performance enhancements. With the exception of IO operations, your program will still use one thread of a single core at a time. Thread switching occurs - which incurs overhead by the way - but at all times, only one thread will be active (again, with the exception of IO operations, which can happen on different threads).

Multi-threaded programming is the way which allows developers to take advantage of multiple threads concurrently. And while it has been available for many years, multi-threaded programming has often been avoided due to its complexity. Developers had to put most of their effort in writing clean multi-threaded code rather than focusing on the business problem.

What does multithreading have to do with multi-cores? Can't we write multi-threaded code if we have a single core? Yes, we can. But again, as long as we have a single core, only one thread at a time will work since all threads belong to that same single core... logical enough. When your multi-threaded program runs on a multi-core machine, threads belonging to different cores will be able to run together and hence your program will take advantage of the hardware power.

One final note before going on: multithreaded programming does not differ if you have a multi-core machine or multi-processer machine. This is a hardware separation; underneath, your code will be working with threads all the same.

Parallel vs. Concurrent

Well, both concepts are related but not the same: in its simplest definition, Concurrency is about executing multiple un-related tasks in a concurrent mode. What these un-related tasks share are system resources, but they do not share a common problem to solve; meaning, they are logically independent. Now, think of these tasks as threads. And, since these threads share system resources, you face the globally common problems of concurrent programming such as deadlocks and data races. This makes concurrent programming extremely difficult and debugging complicated.

What is Parallelism? Parallelism is taking a certain task and dividing it into a set of related tasks to be executed concurrently. Sweet! So again, thinking of these tasks as threads, Parallel Programming still has the same problems of concurrent programming (deadlocks, data races, etc...), and introduces new challenges, most notably, sharing and partitioning data across these related threads. This makes Parallel Programming even more difficult and debugging even more complicated :(

However, it is not all bad news. .NET 4.0 parallel programming is a great step onwards. The new API solves a lot of problems (but not all) of Parallel Programming, and greatly eases up parallel debugging...

Setting the Context of .NET Parallel Programming

The previous sections introduced multi-threaded programming and its need. This section introduces the context of Parallel Programming in the .NET Framework in a gradual manner:

We have multi-core machines and we need to think of a way to utilize their power. So is the need to think of Parallel Programming.

What is Parallel Programming all about? It's about how to partition a single piece of work into multiple concurrent units.

What partitioning options do we have? Option 1 would be static partitioning. Static partitioning is "hard-coding" the number of units and threads you want to use. For example, saying that you want to divide a large chunk of work into "n" units, each running on a single thread over a single core. The problem with this approach is that it is not scalable. While this solution would be fine on a machine with "n" cores, it quickly will become un-scalable for machines with "n+m" cores.

The natural evolution of solving the partitioning problem would be dynamic partitioning. This means that you divide your work into a single unit per available core. While this would solve the scalability problem, it introduces another one: load-imbalance. In real problems, it is really unlikely that all work units will take the same amount of time to complete. This means that some cores will finish their work while other cores will still be busy without being able to utilize the now-free cores.

Moving up the chain of solutions, we now have one more option: partition work into the smallest feasible units which will make use of the most feasible number of available threads. This will solve the load-imbalance issue as work is divided into the smallest possible units where each unit will run over an available thread. However, now we have a problem: thread overhead. Creating and killing threads is an expensive procedure.

This brings us to the final partitioning option: the .NET Thread Pool. The .NET Thread Pool maintains a scheduler and a thread queue. Instead of killing a thread once it finishes its work, this thread is pushed into the queue and it is now available to be picked up by the next unit of work scheduled for execution. The thread pool is exposed through the System.Threading.ThreadPool namespace, and it has been available since before .NET v 4.0.

Wait a minute, so what is exactly new in the .NET 4.0 Parallel Programming model if the above discussion applies to pre-v 4.0? The answer will come by inspecting the shortcomings of the thread pool as discussed. The thread pool is a fire-and-forget programming model. Once you create a thread by code, you basically have no way to perform any operations on it. There is no support for continuing, composing, monitoring data flow, performing work cancellation, and work waiting.

So what is the solution? It is the .NET 4.0 Task-based Programming. In v 4.0 of the Framework, you can program over Tasks as opposed to the lower-level threads. Now you can divide your work over Tasks instead of threads. This will give you much more power and control over your parallel work, as we will see later. Of course, you can still work with the thread pool directly if you want to work with the deeper-detail of threads.

.NET 4.0 Parallel Programming

The Parallel Programming model of .NET 4.0 is composed of the following:

The Task Parallel Library (TPL): this is the base of the Task-based programming discussed in the previous section. It consists of:

Task class: the unit of work you will code against instead of the previous thread model.

Parallel class: a static class that exposes a task-based version of some parallel-nature problems. More specifically, it contains the following methods:

For

Foreach

Invoke

Parallel LINQ (PLINQ): built on top of the TPL and exposes the familiar LINQ as Parallel extensions.

About the Examples

All examples discussed are downloadable. To illustrate the main ideas, some code samples will be shown in the below sections.

All examples are built using VS2010 RC1; nothing else is required. In order to feel the value of Parallel Programming, you will need a multi-core (or multi-processer) machine; the results shown are that of running the examples on a dual-core machine.

Example 1: Tasks

The first example shows the new Tasks programming model. As explained previously, Tasks - as opposed to the Thread Pool - grants more control over thread programming. In fact, you now deal with Tasks and not threads.

In the above code, we first create a new Task using the Task class, and we pass to the lambda expression the code to run inside that Task (in a separate thread).

Another feature is the ability to "continue" on a running Task. What this means is that once the Task is completed (in our example, Task "t"), start running another Task "t2" with the code to run supplied in an anonymous method (through a delegate).

Now uncomment block 1. Here, we see another feature which is to get a result back from a Task. Notice that we have assigned a value for Task "t2" called "Tasks Example" through the "return" statement. Now in block 1, we can retrieve that value. This kicks in a question: what will happen if the statement "t2.Result" is executed before "t2" finishes execution (recall that "t2" is running on a separate thread than the thread of the main program)? Force this behavior (that is why I am using Thread.Sleep), and you will see that in this case, the main thread will wait at "t2.Result" until "t2" finishes execution and its "return" statement is executed. Neat! A power which was not possible before.

Now uncomment block 2. Set two break points at the first and second "Console.WriteLine" statements. Run the program and notice that the line "Waiting my task" is printed before "Here I am". This indicates that the main program (thread) was waiting for task "t2" running on another thread to finish. Once "t2" finishes (delayed by using Thread.Sleep), its statement is printed.

Example 2: Parallel

This example introduces the Parallel class. Specifically, it shows how to use the new For method to perform what usually was done serially in a parallel manner.

The above code shows the difference in execution time of performing a unit of work (simulated by a Sleep operation) serially versus in parallel. In my dual-core machine, the time is cut by half. Now, this is a huge improvement! Imagine all the work you usually do inside For loops (or Foreach loops) and think of the time enhancement you now can get by utilizing the new Parallel.For and Parallel.Foreach methods.

One important note: sometimes it's just not that easy though. When parallelizing your work, sometimes you have to take care of special problems like synchronization and dead-locks; you still have to handle these, and I will discuss them in the next examples.

For the sake of variation, let's see an overload of the Parallel.For method. In order to support breaking out of a parallel loop, there is an overload of the Parallel.For method which takes ParallelLoopState as a parameter in the Action delegate:

The above code shows how to perform a deliberately time-expensive IsPrime function using LINQ (and For) vs. PLINQ. The AsParallel extension allows your code to run using PLINQ instead of LINQ. The step of using AsParallel wraps the data source with a ParallelQuery wrapper and causes the remaining extension methods in the query to bind to PLINQ rather than to LINQ to Objects.

Run the example and you will notice that while the time for execution using a serial For loop and traditional LINQ is nearly the same (actually opposite to many beliefs, LINQ has a performance overhead over For loop... but that's another story), the execution time using PLINQ is nearly cut in half.

Example 4: PLINQ Extensions

PLINQ has many extensions that affect how work partitioning is done. In this example, we will discuss one of these extensions. Examine the below code:

This code is similar to the previous example. It shows how to execute a function using PLINQ's AsParallel extension. Now run the program. What you will see is something that you might not have expected. Notice that values 0 through 199 are printed; however, they are not printed in the correct order.

Think about it: we are using an enumerator from 0 to 200. However, since we are using PLINQ and not LINQ, the partitioning itself is not serial. The work is divided into random parallel units; for example, from 0 till 10 is assigned to thread A, while from 11 till 30 is assigned to thread B, and so on. Once a unit of work is done, its output is immediately flushed back to the final result. That's why the output is not arranged. The below image gives an indication about this:

Run the example again, and this time notice that the output is arranged from 0 till 199. This time the AsOrdered extension forces the use of a buffer where all work units' outputs are gathered and arranged before being flushed back to the final output. The below image shows the process:

Example 5: Data Races

In Example 2, I mentioned that data races can occur when doing multi-threaded programming. This example shows a Data Race scenario and how to solve it.

The above code uses the Parallel.For method to execute a simple Sleep statement. A counter gets incremented each time the loop body gets executed.

What number do you expect to see if you run the program on a multi-core machine? 100000 is the logical result. Now, run the program and you will notice that the number is less than that (if you do get 100000, you are just being lucky, so try again). So what happened here? Why did we get this wrong when it seemed to be so obvious? The answer is data race occurred.

Let's explain this. The trick here is in understanding how the statement counter++ gets executed. When at Intermediate Language (IL) level, counter++ is not a single statement anymore. Actually, it gets executed in 4 IL steps by the JIT (Just In-time Compiler):

Get the current value of the counter

Get the value of 1

Add the two numbers together

Save the result back to the counter

So what does this have to do with Parallel Programming and using Parallel.For? When executed over multiple cores as we do now, the body of the loop gets executed in parallel over multiple threads. The following image explains what happens in this case:

As a result, one step increment was overlooked due to the fact that two threads were granted access to the variable counter at the same time. The above scenario occurs many times, and you will get the final value wrong, as we did when we ran the program. This is a classical case of Data Race.

The solution to such a case is to lock the variable so that once a thread is accessing that variable, some other thread has to wait until the lock is released before getting access to the variable. Granted this introduces a time delay, but it is a cost we have to pay for solving data races.

There are many ways we could lock a variable, and a full discussion is outside the scope of this article, but I will use the best option in my case, which is the Interlocked.Increment method as shown below (the "lock" statement is another method which is used in another example; the comments show how to use it instead):

The above code uses PLINQ to print all prime numbers of a given range. There is nothing special about the parallel code itself, except that the IsPrime function has a logical error deliberately thrown inside it: on running the program, you will notice that the numbers 9, 15, and 25 are reported as prime numbers, which is wrong. Now, say, you want to debug the IsPrime function and set a breakpoint inside it. Run the program and you will quickly notice the problem here: because multiple threads are executing, the debugging behavior is unclear as focus will switch from one thread to another as threads take turn hitting the break point.

Through this window, you can see all the running threads in Debug mode. You can select to freeze all threads except the one that you want to actually debug, and once done, you can run the other threads.

Through this window, you can see a graphical representation of the running threads of your program and how they originated.

Example 7: Deadlocks

Note: this example was built by Microsoft.

Example 5 explained Data Races and how to deal with them through locks. However, you should take care that locking can bite you in the ugly form of Deadlocks. Simply put, a deadlock is a case where a thread TA is locking resource R1 and a thread TB is locking resource R2, and now, TA needs R2 to continue while TB needs R1 to continue. Both threads are stuck and cannot continue, thus causing a deadlock.

The above code uses Parallel.Invoke to concurrently perform money transfer between bank accounts "a" to "b" and vice versa. Inside the Transfer method, locks are applied on both objects (using a "lock" statement) with the intention of making it thread-safe for concurrent execution. This method of locking will cause a deadlock scenario.

The "WatchDeadlock" class is used to force break using Debugger.Break if it notices that no new transfers have been completed after a certain amount of time.

Run the program and see how we have a deadlock between our hands.

Let's see how the VS 2010 debugging support helps us identify such deadlocks. With the program running, fire the Parallel Tasks window. The window will clearly show you a case of deadlock. Not only that, it will specifically tell you about the two threads causing the lock and how each thread is waiting for a resource (in this case, a Bank object) owned by the other thread:

So how do you avoid deadlocks? The answer is: simply do not do them. When doing parallel programming and locks, always program in a way that deadlocks will not occur!

Takeouts

It's obvious that the new Parallel Programming in .NET 4.0 solves a lot of the multi-threading complexities. It grants you more control over threads in the form of tasks, it gives you many method overloads for performing parallel operations like For, Foreach, and Invoke, it gives you the power of PLINQ, it does take care of partitioning for you, and it introduces new Thread-Safe collections in .NET 4.0 such as ConcurrentDictionary, ConcurrentQueue, and ConcurrentStack. In addition, improved debugging support makes it easier to handle parallel programming related problems. However, you still have to take care of multi-threading related concepts such as Data Races and Deadlocks.

The better news is that for Microsoft, Parallel Programming is still an ongoing project so we can expect more features and improved support in the next releases.

C# Parallel.For multi thread parallel, in native win7 64 test can be multiple threads to run in parallel, but deployed after the Windows Server 2008 R2, a maximum of 16 threads running in parallel, people know this is why? Have set the maximum thread does not limit.

I really enjoyed your article. I wonder if you would consider reviewing and providing feedback on a distributed parallel computing for .NET project I've launched on GitHub. If so, please send me a private reply and I will send you the link.

Thanks for the very clear and very useful article. It is an excellent introduction.

Just two things: (a) The debugger windows that you mention (Parallel Tasks and Parallel Stacks) do not appear to be available in Visual C# 2010 Express; (b) WatchDeadlock is not part of .Net, but is part of your downloadable source code (this was not clear from your text).

I was expecting PLINQ to be around 7/8, ran it a few times and got similar results. This was a live data server running mysql and and other services and only has 2Gb memory, where as the desktop was just running this example and has 4Gb of memory.

I ran the example when the server was idle, had task manager up, CPU load was about 52% for both the For loop and LINQ and 100% for the PLINQ

you have done a great job to introduce this new feature, but I'm not sure if you could control which thread will execute the work,for example,all 200 threads are busying doing heavy work,then,is it possible that all other light and heavy requests need to wait for a long time to be served? Can parallel handle this case? thanks.

the problem of date races could be solved perhaps locking and decrementing the variable totalNumber where of course available.For each task would not be necessary to know the actual worked Number.Maybe using a thread with just the task to manage the totalNumber variable and decrement it whenever the other threads ask it.

It would seem that many CPU-intensive tasks which will take a non-trivial amount of time to execute can be naturally broken down into a number of large but discrete steps. What are the performance implications of e.g. rendering a bunch of video a frame at a time using four CPU cores to handle each frame, versus having each frame handled by one CPU, but allowing multiple frames to be processed concurrently? I would think that if the cores share a level-2 or level-3 cache, having all of the cores working on the same frame of video might help performance of those caches, but since the level-1 and possibly level-2 caches of the cores would be separate, having the cores working on the same frame of video might impede the performance of those caches.