Concurrent Programming - A Primer

An overview of Microsoft's Parallel FX initiative, including the Task Parallel Library and PLINQ.

Introduction

I have always been interested in parallel computing, starting in my teens when I worked on a robotics application in a language called SAIL12 (huh, look at that, it was Bill Gates' favorite language). I even implemented an extension to Commodore PET Basic, in 6502 assembly, that supported concurrent subroutine execution. It was clear to me by the 1980s that concurrent programming was something we developers would have to wrap our heads around, and soon. Apparently, "soon" meant more than 25 years later, as there is finally a dawning in the mainstream (meaning Microsoft, quite frankly) world that concurrent programming is not only the next logical step in software development, but the next necessary step.

This article is essentially a formalized journal of my research into what's going on right now with Parallel LINQ (PLINQ) and Microsoft's Task Parallel Library (TPL). There is obviously a push to utilize the functionality, no pun intended, of LINQ and lambda expressions in C# 3.0 and .NET 3.5 to gain a foothold in concurrent programming. While there is a plethora of information on parallel computing across numerous languages, it is in my interest (and I suspect in this community's interest) to see what Microsoft is doing with regards to C# and .NET (and obviously, VB.NET as well), as this will affect us more directly than other work (past or present). So yes, this article is biased in that it focuses primarily on Microsoft technologies.

I've put together my notes into a hopefully readable article on the topic. There are plenty of references at the end of the article for you to dig deeper, and I've quoted liberally from other people, as they, not I, are the experts. I discovered that concurrent programming, the Microsoft "way", involves digging into not just the TPL but the foundations of LINQ, lambda expressions, and functional programming. This has been revealing: I feel it has created a more complete picture of the different technologies involved in concurrent programming, how they interact with each other, and their strengths and weaknesses.

As a disclaimer, any errors in understanding these technologies are entirely mine; please correct me if I've made wrong statements or drawn wrong conclusions.

What is Concurrent Programming?

The answer is obvious: executing multiple tasks (or work units) in parallel. While true parallelism doesn't occur on a single-core computer (instead, the CPU switches between tasks), on systems with multiple CPUs (and multiple cores per CPU), true parallelism is achieved. Nor is parallel computing confined to utilizing the cores of one physical machine. Distributed computing is a form of parallel computing in which work units are distributed across numerous machines. However, distributed computing adds additional requirements to task management (namely task distribution), and is not discussed here.

The essence of concurrent programming involves two things: task management and communication. A task manager is necessary to distribute work units to available threads, and communication involves setting up the initial parameters for a task and obtaining the result of the task's work. It is this second aspect, task communication, that is the most difficult. This is where incorrect locking mechanisms can kill any performance gains and, even worse, create subtle bugs: multiple tasks changing the same memory locations simultaneously, or tasks deadlocking, each waiting for the other to complete its work.

The issues of task communication are more broadly categorized as state and memory sharing issues. A typical solution to the synchronization issues involved with shared state and memory is to use locks, monitors, semaphores, and other techniques to block threads from altering state while one single thread makes changes. Synchronization is hard to test, and this is where bugs and performance problems develop. In Erlang (a functional language), synchronization is not even an issue. This is achieved by treating variables as immutable, and by messaging between threads, where the message is a copy of the sender thread's variables.
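For concreteness, the classic lock-based approach looks like this in C# (a hypothetical counter class of my own, not from any of the referenced articles):

```csharp
public class SharedCounter
{
    private readonly object sync = new object();
    private int count;

    public void Increment()
    {
        // Without the lock, two threads could read the same value and
        // both write back count + 1, silently losing an increment.
        lock (sync)
        {
            ++count;
        }
    }

    public int Count
    {
        get { lock (sync) { return count; } }
    }
}
```

Every reader and writer must remember to take the same lock; forget one, and you have exactly the kind of bug that is hard to test for.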

Another technique is transactional memory. Locking is a pessimistic approach--a lock assumes that the memory will be written to by someone else, who is also going to lock the memory. Transactional memory is optimistic: "...a thread completes modifications to shared memory without regard for what other threads might be doing, recording every read and write that it is performing in a log. Instead of placing the onus on the writer to make sure it does not adversely affect other operations in progress, it is placed on the reader, who, after completing an entire transaction, verifies that other threads have not concurrently made changes to memory that it accessed in the past."10

So, fundamentally, concurrent programming involves task management and some form of state management, be it synchronization techniques, messaging with serialization, or transactional memory. There may be other techniques for state management that I'm not aware of. However, as you can see, task management is, well, task management. State management, ah, there's the rub!

What is Microsoft Doing?

F#

First off, F# doesn't directly have anything to do with concurrent programming. "F# is a typed functional programming language for the .NET framework"1. However, F# includes the concept of asynchronous workflows, primarily for "writing concurrent and reactive programs that perform asynchronous I/O where you don't want to block threads."2 The TPL will eventually be incorporated into F# "as a key underlying technology for F# asynchronous workflows."4 Also, as I discuss later, functional languages are, by their nature, well suited to concurrent programming, and so it is natural that the F# team is interested in parallel computing.

The Task Parallel Library (TPL)

"The Task Parallel Library (TPL) is designed to make it much easier to write managed code that can automatically use multiple processors. Using the library, you can conveniently express potential parallelism in existing sequential code, where the exposed parallel tasks will be run concurrently on all available processors." 3

What To Do, Not How

I've heard it said: you can tell a man what to do, or you can tell a man how to do it, but you'd better not tell a man what to do and how to do it. The TPL, via the vehicle of PLINQ and the TPL API, eliminates the "how to do it" part of your imperative code with regard to performing work in parallel. You express what to do as a work unit, and the TPL figures out how best to do that work. PLINQ takes advantage of the TPL by designating the query iterations as work units for the TPL to assign to threads (typically one per processor core).

Parallel Task Basics

Let's take CPian livibetter's Mandelbrot Set with Smooth Drawing13 and parallelize it, demonstrating the Parallel.For loop. The Parallel.For loop is the simplest "structured" method in the Parallel class. But, if you download livibetter's code to do this exercise, make sure you:

Convert the solution to a Visual Studio 2008 solution

Change the "Ms" project's target framework to .NET Framework 3.5

Add a reference to System.Threading

You will, of course, also need to download and install the TPL CTP.

First, we have to make the code thread safe by removing side effects. The main computational loop looks like this, and I've added comments to the side-effect code:
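livibetter's original listing is not reproduced here; a minimal sketch of its shape, with names such as z, xStep, and p assumed from the discussion that follows, looks something like this:

```csharp
// Sketch of the sequential loop (ComplexNumber, xStep, yStep, xMin, yMin,
// and p are assumed names; see livibetter's article for the real code).
ComplexNumber z = new ComplexNumber(xMin, yMin);
for (int x = 0; x < p.Width; x++)
{
    z.Im = yMin;
    for (int y = 0; y < p.Height; y++)
    {
        PlotEscapeCount(x, y, z);  // hypothetical helper: iterate z' = z^2 + c
        z.Im += yStep;             // side effect: z is shared across iterations
    }
    z.Re += xStep;                 // side effect: cannot survive parallelization
}
```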

If we parallelize the outer loop, it's obvious that each work unit needs its own instance of z; otherwise, each work unit will be manipulating the same instance, which will have severe consequences to the work each unit is doing. Second, the increment of z.Re by xStep cannot be done anymore. The loop will be parallelized against x, so z.Re will have to be calculated for each x rather than simply incremented. The resulting parallelized version looks like this (using a lambda expression as a delegate):
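A sketch of that parallelized loop, with the same assumed names (this is my illustration of the approach, not the article's exact listing):

```csharp
// Each work unit builds its own z, and z.Re is computed from x
// rather than incremented.
Parallel.For(0, p.Width, x =>
{
    ComplexNumber z = new ComplexNumber(xMin + x * xStep, yMin);
    for (int y = 0; y < p.Height; y++)
    {
        PlotEscapeCount(x, y, z);  // hypothetical helper, as before
        z.Im += yStep;             // safe: this z is local to the work unit
    }
});
```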

The program will now utilize the available cores to generate the Mandelbrot fractal. It definitely runs faster, and you can clearly see the CPU utilization go to 100%.

An interesting question is, why parallelize the outer loop rather than the inner loop? One answer is that parallelizing the inner loop adds overhead to the TPL's task manager. With the outer loop parallelized, the task manager splits the work (p.Width work units) across the cores, and each work unit executes the entire inner loop. If you parallelize the inner loop instead, each work unit is much smaller--the while loop that determines the number of iterations before z escapes--and the overhead of fetching the next work unit is higher (now p.Width * p.Height fetches, rather than just p.Width), not to mention that Parallel.For queues the work units p.Width times, rather than just once as happens when the outer loop is parallelized.

Other Parallel Task Concepts

If you read Optimize Managed Code for Multi-Core Machines3, you'll note it also discusses an Aggregate method. This has either been dropped, or did not make it into the CTP. Currently, another concept supported by the Parallel class is the Parallel.Do method. This method takes, as a parameter, an array of Action instances, and the task manager will execute each Action asynchronously. Similarly, the Parallel.ForEach method takes a single action on an enumerable data source, and potentially processes the action on each item in the data source in parallel.
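As a sketch, the two calls look like this (the methods and the customers collection passed in are hypothetical):

```csharp
// Parallel.Do takes an array of Actions and runs them potentially in parallel.
Parallel.Do(
    () => LoadConfiguration(),   // hypothetical method
    () => WarmCaches());         // hypothetical method

// Parallel.ForEach applies one action to each item of an enumerable source.
Parallel.ForEach(customers, customer =>
{
    Process(customer);           // hypothetical method
});
```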

Another concept is called "a future". "A future, which is a task that computes a result, is constructed not with a normal action, but with an action that returns a result. This result is a delegate with the Func<T> type, where T is the type of the future value. The result of the future is retrieved through the Value property. The Value property calls Wait internally to ensure that the task has completed and the result value has been computed."3
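In the CTP, that looks roughly like the following; treat this as a sketch (ExpensiveSum and DoOtherWork are hypothetical, and the Future.Create factory is my reading of the CTP, not a definitive API reference):

```csharp
// Create a future from a Func<int>; the task manager may start
// computing it on another thread immediately.
Future<int> sum = Future.Create(() => ExpensiveSum());  // hypothetical method

DoOtherWork();   // hypothetical: the caller proceeds in parallel

// Value calls Wait internally, blocking until the result is computed.
int result = sum.Value;
```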

The Task Manager

There really isn't any better way of stating it than Daan Leijen and Judd Hall already have: "All tasks belong to a task manager, which, as the name implies, manages the tasks and oversees worker threads to execute tasks. While there is always a default task manager available, an application can also explicitly create a task manager."3

Exception Handling

With regard to exception handling, exceptions thrown by tasks are propagated to the code that invoked the parallel processing. To quote: "...the Parallel.For and Parallel.Do functions accumulate all exceptions thrown and are re-raised when all tasks complete. This ensures that exceptions are never lost and are properly propagated to dependents."3
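A sketch of what that means in practice (SomethingIsWrong and Log are hypothetical):

```csharp
try
{
    Parallel.For(0, 100, i =>
    {
        if (SomethingIsWrong(i))   // hypothetical check
            throw new InvalidOperationException("work unit " + i + " failed");
    });
}
catch (Exception e)
{
    // Exceptions from the work units are accumulated and re-raised here,
    // after all tasks have completed.
    Log(e);   // hypothetical logger
}
```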

The Parallel FX Library (PFX)

What exactly is PFX? That's a bit hard to describe. TPL is said to "use the Parallel FX Library"3. I'm unclear as to what that means, and perhaps Microsoft is suffering from a bit of acronym dyslexia itself. So, the only reason I mention it here is that it seems to be the umbrella assembly/library/extension for PLINQ and the TPL.

The Parallel Language Integrated Query (PLINQ)

PLINQ is LINQ where the query is run in parallel. Converting a query from sequential to parallel execution is accomplished very easily, by adding "AsParallel()" to the data source. For example14:
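A sketch in the spirit of reference 14, where numbers is any IEnumerable<int>:

```csharp
// Sequential LINQ query:
var squares = from n in numbers
              where n % 2 == 0
              select n * n;

// The same query run in parallel by PLINQ -- only the source changes:
var squaresInParallel = from n in numbers.AsParallel()
                        where n % 2 == 0
                        select n * n;
```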

"LINQ's set-at-a-time programming model for expressing computations places an emphasis on specifying what needs to get done instead of how it is to be done. Without LINQ, the how part would typically otherwise be expressed through the use of loops and intermediary data structures, but by encoding so much specific information, the compiler and runtime cannot parallelize as easily. LINQ's declarative nature, on the other hand, leaves the flexibility for a clever implementation like PLINQ to use parallelization to obtain the same results."14

State and Memory Sharing

Does it strike you that the efforts of TPL, PFX, and PLINQ are focused on parallel task processing whilst completely ignoring state and memory sharing issues? It certainly struck me. It seems to me that the cart is being put before the horse. So far, the TPL addresses parallel computing only within the context of loop parallelism, where the tasks are essentially autonomous. To really leverage parallel processing for real-world tasks, in which tasks need to communicate with each other, it becomes essential that state and memory sharing in a concurrent environment are resolved. One approach (taken by Erlang, to mention one language) is to completely eliminate shared state (tasks get copies of objects, so they don't have to take locks or deal with race conditions) and to eliminate shared memory (copies of objects are passed between tasks rather than references, again eliminating the need for locks). Instead, messages are used to communicate between tasks, as discussed next.

Messaging and Message Serialization

In Slava Akhmechet's blog on Erlang-style concurrency15, he takes the reader through the process of redesigning Java with message-passing concurrency in order to eliminate the issues of state and memory sharing, and thus the synchronization issues. He does so by eliminating the synchronization keywords and by instantiating objects on a heap specific to each thread, so that access to objects on another thread is simply not possible. Of course, threads do have to communicate with each other, and this is accomplished by sending messages. Each thread gets its own message queue, and blocks until there is a message to process. However, since the message sent by thread 1 will be placed in the queue for thread 2, thread 2 would then have a reference into thread 1's heap, which is a violation of the goal. So instead, the message is serialized. Slava Akhmechet next points out a very interesting thing--because the message is serialized and the send/receive messages implement an interface, we now have the ability to distribute work not only across cores, but across machines. However, this opens another can of worms--determining whether the receiver actually received the message.
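To make the mechanics concrete, here is a minimal C# sketch of such a mailbox; the deep copy via serialization is the key step. This is my illustration, not Akhmechet's code, and it assumes message types are marked [Serializable]:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;
using System.Threading;

public class Mailbox<T>
{
    private readonly Queue<T> queue = new Queue<T>();
    private readonly object sync = new object();

    public void Send(T message)
    {
        T copy = DeepCopy(message);   // the receiver never sees the sender's object
        lock (sync)
        {
            queue.Enqueue(copy);
            Monitor.Pulse(sync);      // wake a blocked receiver
        }
    }

    public T Receive()
    {
        lock (sync)
        {
            while (queue.Count == 0)
                Monitor.Wait(sync);   // block until a message arrives
            return queue.Dequeue();
        }
    }

    private static T DeepCopy(T message)
    {
        // Serialize/deserialize to hand the receiver a private copy;
        // this round trip is the performance cost discussed under Drawbacks.
        BinaryFormatter formatter = new BinaryFormatter();
        using (MemoryStream stream = new MemoryStream())
        {
            formatter.Serialize(stream, message);
            stream.Position = 0;
            return (T)formatter.Deserialize(stream);
        }
    }
}
```

Because the message crosses the boundary as bytes, the same interface could just as well deliver it to another machine, which is Akhmechet's point about distribution.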

Drawbacks

One significant drawback of this approach is the performance hit you take when serializing an object to be passed between tasks. Since the target task receives a copy, it must be serialized by the sender, and the copy de-serialized by the receiver. I can imagine major performance problems if you have extensive communication between tasks using non-trivial (basically, non-value type) objects.

Software Transactional Memory

Another approach that appears to be on the radar is STM. "In computer science, Software Transactional Memory (STM) is a concurrency control mechanism analogous to database transactions for controlling access to shared memory in concurrent computing. It functions as an alternative to lock-based synchronization, and is typically implemented in a lock-free way."10 In the Channel 9 video5, the interviewer asks about "transactional task management", which is referring to software transactional memory. As Anders Hejlsberg points out, this is a Microsoft research project. See Further Reading for links.
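The CTP doesn't include STM, but the optimistic read-validate-commit idea it is built on can be sketched with a compare-and-swap retry loop. To be clear, this is lock-free optimistic concurrency on a single word, not transactional memory:

```csharp
using System.Threading;

public static class OptimisticCounter
{
    private static int total;

    public static void Add(int amount)
    {
        int seen, proposed;
        do
        {
            seen = total;              // read without locking
            proposed = seen + amount;  // compute the new value locally
        }
        // Commit only if total still holds the value we read; if another
        // thread changed it in the meantime, re-read and retry.
        while (Interlocked.CompareExchange(ref total, proposed, seen) != seen);
    }

    public static int Total { get { return total; } }
}
```

STM generalizes this: reads and writes across many locations are logged, and the whole transaction is validated and committed (or retried) at once.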

Considerations

Task Management

Not all tasks take up 100% of a CPU's cycles; therefore, it would be useful to be able to organize tasks in a manner that gives the developer control over the thread pool size for a particular category of tasks. Or, conversely, the underlying task manager should be capable of assigning additional threads, realizing that the processors are underutilized. With the TPL, an application can create its own task manager. As stated: "...you might want to use multiple task managers, where each has a different concurrency level or each handles a separate set of tasks."3
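Sketched against the CTP (the TaskManagerPolicy constructor arguments here are an assumption on my part; check the CTP documentation for the exact signature):

```csharp
// A task manager with its own, smaller concurrency level for a category
// of background tasks, independent of the default task manager.
TaskManagerPolicy policy = new TaskManagerPolicy(1 /* min */, 2 /* ideal */);
TaskManager backgroundManager = new TaskManager(policy);
// Tasks created against backgroundManager are scheduled on its threads
// rather than on the default manager's pool.
```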

Debugging

It's not hard to imagine the complexity in debugging an application with numerous concurrent tasks. As stated: "Another important use of the task manager interface is to run all code sequentially using a single worker thread."3 While this obviously doesn't solve concurrency related bugs, it does provide a mechanism to debug your application in a sequential execution mode.

Also, with a concurrency model that supports messaging, debugging is facilitated by being able to audit the messages.

Side Effect Free

Anders Hejlsberg makes the following interesting statement regarding PLINQ: "...we can be much smarter about how to run these queries. We can make, maybe, some simple assumptions about whatever functions you call in here are pure and side effect free..."5 The key phrase here is "side effect free". This is the crux of concurrent programming: tasks must be side effect free. One mechanism to achieve this is, of course, the stateless, no-shared-memory paradigm used by Erlang and others. However, I would be hard pressed to call these assumptions "simple": verifying that the functions you call really are pure is anything but.

To reiterate, as Joe Duffy states: "If you can express your program side-effect free in a mostly functional manner using PLINQ..."5 There is a lot of discussion that TPL makes it easy to write concurrent applications; however, the hard work, making your tasks side-effect free, is apparently being largely ignored.

Confession time: "This library doesn't change the way you synchronize access to data or the way you transact it...but when it comes to access to shared resources or shared data, you still have locks, monitors, or whatever." (Anders Hejlsberg)5 Parallel task management is the simple side of the concurrent programming equation, and as Anders Hejlsberg says, "while the debate rages" on how to manage shared memory and state, the TPL simplifies task management. Unfortunately, the developer is left holding the bag in the real-world use cases of the TPL, where shared memory and state are a consideration.

Does Concurrent Programming Require Functional Languages?

When asked why LINQ can take advantage of parallelism, Anders Hejlsberg replies: "It's because of the more declarative or functional style of writing programs and queries instead of the more traditional, imperative statement-like way of doing it."5 How does a more declarative style of writing programs facilitate parallelism? This quote on "Declarative Concurrency Permits the Runtime to Optimize" helps to explain the rationale:

"Serializing access to data with locks involves trading correctness for parallelism and scalability. Clever locking schemes can be used to reap the benefits of both, but require complex code and engineering. If you're crafting a highly reusable, scientific-quality library, this amount of work might be par for the course. But, if you're writing an application, the tax is simply too high.

Locks employ a technique known in database circles as pessimistic concurrency, while transactional memory prefers optimistic concurrency. The former prohibits concurrent access to data structures while the transaction executes, while the latter detects (and sometimes even tolerates) problematic conflicts at commit time. If two concurrent tasks work on the same data structure and do not conflict, for example, serialization is not necessary; yet, a naïve lock-based algorithm might impose it. Permitting the transaction manager to choose an appropriate execution strategy using runtime heuristics can lead to highly scalable code.

Moreover, in-memory transactions can eliminate hard-to-debug "heisenbugs" such as deadlocks, which often result in program hangs and are hard to provoke, and priority inversion and lock convoys, both of which can lead to fairness and scalability problems. Because the transaction manager controls locking details, it can ensure fair and deadlock-free commit protocols."11

However, when trying to understand why functional languages are well suited for concurrent programming, the primary reason I've found is that state is immutable. This becomes clearer in Chris Sells' Functional Language Summary6, to quote:

All "variables" are immutable, often called "symbols"

Program state is kept in functions (specifically, arguments passed on the stack to functions), not variables

The second point, that state is managed on the stack and not in variables, is the key to why functional programming is well suited for parallel computing. To quote again from Chris Sells:

Functions cannot cause side effects ("variables" are immutable) ...

No need for multi-threaded locks, as state is immutable

This makes functional programs automatically parallelizable

Ah ha! So, this is a much more definite answer as to the advantages of functional languages with regards to concurrent programming. And, since LINQ utilizes lambda calculus as the underlying mechanism in its query expressions, and "Lambda calculus provides a theoretical framework for describing functions and their evaluation"7, we have an association between LINQ and its application in concurrent programming.

So, does concurrent programming require a functional language? No, of course not. But it is greatly facilitated by the immutable state implicit in a functional language. In other words, the developer doesn't have to worry about the locks, semaphores, etc. that are required for concurrent programming in which state is shared.

But is LINQ Truly Functional?

No. "The LINQ query language uses "lambda expressions", an idea originating in functional programming languages such as LISP - a lambda expression defines an unnamed function (and lambda expressions will also be a feature of the next C++ standard). In LINQ, lambdas are passed as arguments to its operators (such as Where, OrderBy, and Select) - you can also use named methods or anonymous methods similarly - and are fragments of code much like delegates, which act as filters."9 The caveat is that C#'s lambda expressions (and therefore LINQ) do not implement true functional programming; a lambda expression can access mutable variables outside of its function, and program state is still kept outside of the functions themselves. Therefore, the programmer must still be diligent about writing lambda expressions that are thread safe, which also means writing LINQ statements that are thread safe. For example8:

int i = 0;
var q = from n in numbers select ++i; // i is captured mutable state shared by all evaluations

is not thread safe. As the PFX blog points out, "There is a set of 101 LINQ samples available at MSDN. Unfortunately, many of these samples rely on implementation details and behaviors that don't necessarily hold when moving to a parallel model like the one employed by PLINQ. In fact, some of them are dangerous when it comes to PLINQ."8
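One side-effect-free alternative is to derive each value from the element's position instead of mutating captured state, using the indexed overload of Select (method syntax, since query syntax has no index parameter):

```csharp
// Side effect free: each result depends only on the element's position,
// so there is no shared variable to race on.
var q = numbers.Select((n, index) => index + 1);
```

Note that PLINQ makes no ordering promise by default, so even index-based results may arrive out of order; but unlike ++i, they can no longer race.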

Work Stealing

Threads can steal work assigned to other threads. This helps distribute the work units across free cores/threads when units of work result in unbalanced thread usage. The intent of the TPL is to support this capability.

Conclusion

The .NET Parallel class definitely eases the development of threaded applications from the perspective of work unit management. State and shared memory issues remain for the programmer to solve. Adding concurrent programming capabilities to LINQ is a natural extension of LINQ, and should prove to be valuable in a variety of scenarios. In the end, the PFX, if I'm using the term correctly, should help in leveraging multi-core systems. I will be interested to see if the architecture Microsoft is putting together can be extended further into true distributed computing. I will also be interested to see what sort of state and shared memory solutions can work with PLINQ and TPL. Clearly, we are on the edge of another series of technological advances, this time in concurrent programming.

Personal Observations

In watching the videos with Anders Hejlsberg, I can't help but feel disappointed that there isn't any mention of Erlang. In fact, there seems to be an avoidance of the whole issue of mutability and state management. Granted, Anders Hejlsberg has a "we must be honest about this" statement when he points out that the TPL doesn't change the way you have to deal with synchronization. It is clear to me that he and Joe Duffy don't want to touch this elephant as the rest of the world "rages on in the debate" on how to deal with synchronization. Fair enough, the TPL certainly does make it easier to add parallelism to well defined work units. Still, I'm disappointed that the real meat of the issue, synchronization, has been pushed aside. I tend to draw the conclusion that, because .NET's lambda expressions aren't true functional programming, the .NET languages will never be capable of truly eliminating the developer's work (and therefore bugs) in dealing with synchronization in the way that languages like Erlang achieve success in resolving the synchronization issue.