Introducing Parallelism into your programs

Overview

A little over a month ago, I volunteered to develop a session for the Windows 7 / Windows Server 2008 R2 launches that are being delivered across the country and beyond. I affectionately call the session “An Introduction to Parallel Programming”. In it, I wanted to illustrate various approaches you can take when introducing parallelism into your applications: 1) fine-grained parallelism, 2) structured parallelism and 3) PLINQ.

The approach I chose was to start with a sequential application and parallelize it using each of the 3 approaches. The application simply creates a thumbnail image for every image in a specific directory. This post links to a screencast I did on this subject, provides a brief overview of each approach and shows how my sample program looks with each one.

Screencast

Approaches for introducing parallelism into your applications

Fine-grained parallelism (aka imperative task parallelism)

This approach provides you with the most control. Accordingly, it also requires the most work. The following are some attributes of this approach:

Express potential parallelism via expressions and statements that take the form of lightweight tasks

Affords fine-grained control

The semantics are similar to how threads and the thread pool work today

You tend to use it when the other approaches do not suit your problem

Structured parallelism (imperative data parallelism)

Structured parallelism offers a higher level of abstraction. With this approach you tend to think in terms of parallelizing ‘blocks of code’. Daniel Moth explained this well in his PDC 2008 talk: one thread goes into the block, many threads can execute the code within the block, and one thread comes out. To this end, we provide a ‘Parallel’ type that allows you to easily parallelize for loops, foreach loops and arrays of actions. The following are some attributes of this approach:

Mechanisms used to express common imperative data-oriented operations

For loops

Foreach loops

Invoke

Think in terms of blocks of code

Parallelize loops
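The sample later in this post demonstrates Parallel.ForEach; for completeness, here is a minimal sketch (not from the sample) of the other two mechanisms, Parallel.For and Parallel.Invoke:

```csharp
using System;
using System.Threading.Tasks;

class ParallelSketch
{
    static void Main()
    {
        // Parallelize an indexed for loop; iterations may run on different threads.
        Parallel.For(0, 4, i => Console.WriteLine("for iteration " + i));

        // Run a fixed set of independent actions in parallel.
        Parallel.Invoke(
            () => Console.WriteLine("action one"),
            () => Console.WriteLine("action two"));
    }
}
```

As with Parallel.ForEach, one thread enters each of these calls, many threads may execute the work inside, and one thread comes out when all of it is complete.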

PLINQ (Declarative Data Parallelism)

As a declarative approach, you simply define what you want done, not how you want it done. You do this with PLINQ, a parallel implementation of LINQ to Objects. PLINQ replaces the LINQ to Objects extension methods with new ones that internally use parallelism. In many cases, it is as simple as adding a call to the AsParallel extension method to your query. The following are some attributes of this approach:

Implementation of LINQ-to-Objects that executes queries in parallel

Express what you want to accomplish, rather than how you want to accomplish it

Minimal impact on existing queries

Illustrating the 3 Approaches

To illustrate these 3 approaches, I decided to take a simple console application that was implemented sequentially and convert it to use each approach. The sample application simply creates a thumbnail image for each image in a specified directory. I will show you the code for the sequential implementation as well as for each parallel approach, explain the code and, lastly, show you the timings. I am running this on a dual-core laptop using Beta 1 of Visual Studio 2010 and .NET 4. I will run the code 3 times for each approach.

* Please note that the results in this post should not be used to evaluate which of the 3 parallel approaches is generally more performant than the others; that will be scenario-driven. The numbers simply illustrate the relative performance gains the parallel approaches provide over the sequential approach.
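Sequential Implementation

The Code

The sequential version is sketched below from the description in the Code Summary; imageDir, imageExt, TID and CreateThumbnail are defined elsewhere in the sample.

```csharp
// Sequential baseline: process one image at a time on a single thread.
private static void CreateThumbnails()
{
    string[] pics = Directory.GetFiles(imageDir, imageExt);

    foreach (string pic in pics)
    {
        CreateThumbnail(pic);
        Console.WriteLine(pic + TID);
    }

    Console.WriteLine("Done");
}
```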

Code Summary

As you can see, in CreateThumbnails we create a string array of image paths. We then iterate over that array, calling CreateThumbnail for each item and writing the path and the id of the executing thread to the console. Lastly, we write ‘Done’ to the console when complete. The implementation of CreateThumbnail is relatively unimportant to the theme of this post; it uses the Bitmap and Graphics types from the GDI+ libraries to create a thumbnail image (max height / width of 128 px). If you are interested, you can download the sample code here.

Performance

Time (in milliseconds): 2274, 2289, 2286

Fine Grained Parallelism

The Code

 1: private static void CreateThumbnails()
 2: {
 3:     string[] pics = Directory.GetFiles(imageDir, imageExt);
 4:
 5:     var tasks = new Task[pics.Length];
 6:     for (int i = 0; i < pics.Length; i++)
 7:     {
 8:         tasks[i] = Task.Factory.StartNew(state =>
 9:         {
10:             CreateThumbnail((string)state);
11:             Console.WriteLine(state + TID);
12:         }, pics[i]);
13:     }
14:     Task.WaitAll(tasks);
15:
16:
17:     Console.WriteLine("Done");
18: }

Code Summary

The first thing you might notice is that this code is a bit more involved. In the screencast I did on Channel 9 about introducing parallelism into your applications, I describe in more detail how we get from the sequential implementation to this code.

Again, the first thing we do is create a string array of image paths. Next (line 5), we create an array of Tasks (Task is the type we use to implement fine-grained parallelism); we will need this array later. We then loop once for each path in the array of image paths. Now comes the interesting part. On line 8, we call Task.Factory.StartNew, assigning the result to the appropriate item in the Task array. StartNew is a helper method that is analogous to creating a Task and then calling Start. In code terms, it is similar to doing the following, yet a bit more efficient:
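A sketch of that create-then-start form for the loop body:

```csharp
// Create the Task explicitly, then start it -- equivalent in effect to
// Task.Factory.StartNew, though StartNew is a bit more efficient.
var t = new Task(state =>
{
    CreateThumbnail((string)state);
    Console.WriteLine(state + TID);
}, pics[i]);
t.Start();
tasks[i] = t;
```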

You will notice that we are using an overload of StartNew that explicitly passes state into the Task (pics[i] is passed in line 12). This ensures that the Task is, in fact, working on the image path we intended. If you are new to this, that may sound confusing, so let’s rewrite the code for a moment to illustrate what may happen if we didn’t use the state object (or a proper closure):
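A sketch of that problematic version (with the C# 4 compiler used here, the foreach loop variable ‘pic’ is a single shared variable; C# 5 later changed foreach to give each iteration a fresh variable):

```csharp
// BROKEN sketch: each lambda captures the shared loop variable 'pic',
// so a task that starts late may see a later iteration's value.
var tasks = new List<Task>();
foreach (string pic in pics)
{
    tasks.Add(Task.Factory.StartNew(() =>
    {
        CreateThumbnail(pic);
        Console.WriteLine(pic + TID);
    }));
}
Task.WaitAll(tasks.ToArray());
```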

At first glance, this code doesn’t appear to have any issues. However, to quote Inigo Montoya from The Princess Bride: “I do not think it means what you think it means”. In this example, the pic variable is not captured correctly in the closure. What this ultimately means is that by the time the Task wakes up and calls CreateThumbnail, the pic variable may no longer hold the value it had when the Task was created in the outer loop.

Getting back to the proper implementation, notice that in lines 10 and 11 both CreateThumbnail and the Console.WriteLine call operate on the state object. The last thing to point out is the call to Task.WaitAll in line 14. We pass the Task array we created in line 5 and assigned to in line 8; this call ensures that all of the Tasks complete before we write ‘Done’ to the console.

Performance

Time (in milliseconds): 1486, 1485, 1510

As you can clearly see, this is a dramatic performance increase. That is because we were able to distribute the work across the 2 cores on my machine, instead of just one as in the sequential implementation. However, the code was a bit involved, so let’s look at a couple of other options that provide a higher level of abstraction.

Structured Parallelism

The Code

 1: private static void CreateThumbnails()
 2: {
 3:     string[] pics = Directory.GetFiles(imageDir, imageExt);
 4:
 5:     //foreach (string pic in pics)
 6:     Parallel.ForEach(pics, pic =>
 7:     {
 8:         CreateThumbnail(pic);
 9:
10:         Console.WriteLine(pic + TID);
11:     });
12:
13:     Console.WriteLine("Done");
14: }

Code Summary

The first thing you will notice is that this code is much simpler. In fact, it is semantically similar to the sequential implementation. Line 5 contains the original foreach construct that I commented out. I replaced it with a call to Parallel.ForEach in line 6. Notice how similar they are.

The ForEach method takes the IEnumerable you want to iterate over as its first argument and an Action delegate as its second; the delegate is invoked for each element of the collection. In our case, we pass a lambda that calls CreateThumbnail and Console.WriteLine.

Performance

Time (in milliseconds): 1596, 1534, 1497

Here we see similar performance increases over the sequential solution.

PLINQ

LINQ Code

 1: private static void CreateThumbnails()
 2: {
 3:     string[] pics = Directory.GetFiles(imageDir, imageExt);
 4:
 5:     var output = pics.Select(pic =>
 6:     {
 7:         CreateThumbnail(pic);
 8:         return pic + TID;
 9:     });
10:
11:     foreach (var line in output) Console.WriteLine(line);
12:
13:
14:     Console.WriteLine("Done");
15: }

PLINQ Code

 1: private static void CreateThumbnails()
 2: {
 3:     string[] pics = Directory.GetFiles(imageDir, imageExt);
 4:
 5:     var output = pics.AsParallel().Select(pic =>
 6:     {
 7:         CreateThumbnail(pic);
 8:         return pic + TID;
 9:     });
10:
11:     foreach (var line in output) Console.WriteLine(line);
12:
13:
14:     Console.WriteLine("Done");
15: }

Code Summary

Think back to when you were a kid. Remember the game where you were shown 2 pictures and tasked with spotting the differences between them? In that spirit, the only difference between the LINQ and PLINQ code is the call to AsParallel. By simply adding a call to this extension method, you have introduced parallelism into your solution (see the description above).

Performance

Time (in milliseconds)

LINQ: 2181, 2180, 2240
PLINQ: 1683, 1533, 1550

Sample Code

Acknowledgements

I would like to thank Stephen Toub, Rick Brewster, Huseyin Yildiz, Chris Dern and Aaran Layman for their help answering my questions and offering guidance. It should also be noted that this sample was inspired by, and based upon, Daniel Moth’s PDC 2008 talk.

Summary

As you can see, there are various approaches to introducing parallelism into your applications, each with its own pros and cons. Some provide greater control (fine-grained), while others provide a simpler model (structured, PLINQ). Hopefully one of them will meet your needs and allow your application to take advantage of the hardware it is running on.
