Multi-cores are here, and they are here to stay. Industry trends show that each individual core is likely to become smaller and slower (see my post to understand the reason). Improving the performance of a single program with a multi-core requires that the program be split into threads that can run on multiple cores concurrently. In effect, this pushes the problem of finding parallelism in the code onto the programmers. I have noticed that many hardware designers do not understand the challenges of multi-threaded (MT) programming, since they have never written MT apps themselves. This post shows them the tip of this massive iceberg.

Update 5/26/2011: I have also written a case study for parallel programming which may interest you.

Why is finding parallelism hard?

Some jobs are easy to parallelize, e.g., if it takes one guy 8 hours to paint a room, then two guys working in parallel can paint it in four hours. Similarly, two software threads can convert a picture from color to grayscale 2x faster by working on different halves of the picture concurrently. Note: programs in this category are already being parallelized, e.g., scientific computing workloads, graphics, Photoshop, and even open-source apps like ImageMagick.
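The grayscale example can be sketched with two pthreads, each owning one half of the picture. This is only an illustrative sketch: the pixel layout (3 bytes of RGB per pixel) and the integer luma weights are my assumptions, not part of the original post.

```c
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical image: n pixels, 3 bytes each (R,G,B); output 1 byte each. */
typedef struct {
    const uint8_t *rgb;
    uint8_t *gray;
    size_t begin, end;   /* half-open pixel range this thread owns */
} gray_job;

static void *gray_worker(void *arg) {
    gray_job *j = (gray_job *)arg;
    for (size_t i = j->begin; i < j->end; i++) {
        const uint8_t *p = j->rgb + 3 * i;
        /* integer approximation of the usual luma weights */
        j->gray[i] = (uint8_t)((299 * p[0] + 587 * p[1] + 114 * p[2]) / 1000);
    }
    return NULL;
}

/* Split the picture into two halves. No pixel is touched by both threads,
   so the loop needs no synchronization: this is the easy case. */
void to_gray_two_threads(const uint8_t *rgb, uint8_t *gray, size_t n) {
    gray_job a = { rgb, gray, 0,     n / 2 };
    gray_job b = { rgb, gray, n / 2, n     };
    pthread_t t;
    pthread_create(&t, NULL, gray_worker, &a);
    gray_worker(&b);          /* the main thread does the second half */
    pthread_join(t, NULL);
}
```

Because the two halves are independent, doubling the threads roughly halves the time, just like the two painters.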

There are also programs that are sequential in nature, e.g., two guys will not be able to cook 2x faster than one guy because the task isn’t fully parallelizable: there are inter-task dependencies and the cooks end up waiting for each other at times. Unfortunately, a lot of programs have artificial inter-task dependencies because the programmers wrote them with a single-threaded (ST) mindset. For example, consider this code excerpt from the H.264 reference code (I have removed unnecessary details to highlight my point):
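The pattern in question looks roughly like this. This is a simplified sketch, not the actual H.264 reference code; the structure and function names are stand-ins for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the real macroblock routines. */
typedef struct { int data[16]; } Macroblock;

static Macroblock mb;   /* global: reused to avoid repeated allocation */

static void init_macroblock(Macroblock *m, int idx) { m->data[0] = idx; }
static int  decode_macroblock(const Macroblock *m)  { return m->data[0] * 2; }

/* Every iteration overwrites mb and never reads the previous iteration's
   value. Yet because mb is a single global, the iterations of this OUTER
   loop now depend on each other and cannot safely run in parallel. */
int decode_all(int num_mbs) {
    int checksum = 0;
    for (int i = 0; i < num_mbs; i++) {   /* the OUTER loop */
        init_macroblock(&mb, i);
        checksum += decode_macroblock(&mb);
    }
    return checksum;
}
```

Making mb a local variable inside the loop body would remove the artificial dependency, at the cost of touching every function that currently reaches for the global.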

Notice how the variable mb is written every iteration, and no iteration uses the mb written by the previous iteration. However, mb was declared as a global variable, probably to avoid its repeated allocation and deallocation. This is a reasonable ST optimization. From an MT standpoint, however, the iterations of the OUTER loop now have a dependency among each other and cannot run in parallel. To parallelize this code, the programmer first has to identify that the dependency is artificial. He/she then has to inspect thousands of lines of code to ensure that this assumption isn’t mistaken. Lastly, he/she has to change the code throughout to make mb a local, per-iteration variable. All this is difficult to achieve (I parallelized H.264 for this paper).

So here is the status: leaving the artificial dependencies in the code limits parallelism, mistakenly removing a real one breaks the program, and reaching the perfect balance requires prohibitive effort. Since it’s hard to identify all dependencies correctly the first time, programmers make errors and debugging begins.

Why is debugging difficult?

Debugging multi-threaded code is very hard because bugs show up randomly. Consider the following:

Say two threads T0 and T1 each need to increment the variable X. The C/C++/Java code for this will be

X = X + 1

Their assembly code will look as follows (instructions are labeled A-F):

T0:
  A: Load X, R0
  B: Increment R0
  C: Store R0, X

T1:
  D: Load X, R0
  E: Increment R0
  F: Store R0, X

(Each thread has its own copy of register R0.)

The programmer wants X to be incremented by 2 after both threads are done. However, when the threads run concurrently, their instructions can interleave in any order, and the final value of X depends on the interleaving (assume X was 0 before the two threads tried to increment it). For example,

ABCDEF: X = 2 (correct)

DEAFBC: X = 1 (incorrect)

ADBCEF: X = 1 (incorrect)

Basically, there is a dependency: D shall not execute before C (or A shall not execute before F). The programmer has missed this dependency. However, the code does work fine much of the time, making the bug very hard to reproduce and a fix very hard to test. Moreover, traditional debugging techniques like printf and gdb become useless because they perturb the system, thereby changing the code’s behavior and oftentimes masking the bug.
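One standard fix for the missed dependency is to make the load-increment-store sequence atomic with a lock. A minimal pthreads sketch (the function names run_both and increment are mine, for illustration):

```c
#include <pthread.h>

long X = 0;
pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;

/* Guarding the read-modify-write with a mutex rules out the bad
   interleavings DEAFBC and ADBCEF above: A can no longer execute
   between D and F, nor D between A and C. */
static void *increment(void *arg) {
    (void)arg;
    pthread_mutex_lock(&x_lock);
    X = X + 1;                  /* load, increment, store: now atomic */
    pthread_mutex_unlock(&x_lock);
    return NULL;
}

/* Run T0 and T1; with the lock, X is always 2 afterwards. */
long run_both(void) {
    pthread_t t0, t1;
    X = 0;
    pthread_create(&t0, NULL, increment, NULL);
    pthread_create(&t1, NULL, increment, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return X;
}
```

Of course, finding every spot that needs such a lock, without over-locking and serializing the program, is exactly the hard part.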

Why is optimizing for performance so important and challenging?

The sole purpose of MT is performance. It is very common that the first working version of the parallel code is slower than the serial version. There are two common reasons:

Still too many dependencies (real or artificial): Programmers often iteratively remove dependencies and sometimes even re-write the whole program to reduce these dependencies.

Contention for a hardware resource: Threads can also get serialized if there is contention for some hardware resource, such as a shared cache. Programmers have to identify and reduce this contention. Note that identifying these bottlenecks is especially challenging because hardware performance counters are not reliable.

After several iterations, the code becomes good enough for the performance target.

The work does not end here.

Unlike ST code, which gets faster every process generation, MT code has complex non-deterministic interactions that can make its performance swing widely when the hardware changes. For example, I had a branch-and-bound algorithm (a 16-Puzzle solver) which would slow down with more cores because the algorithm would end up on a different path when more threads were running. Even a simple kernel like histogram computation can behave very differently with different inputs or machine configurations (see this paper). Thus parallel programmers are also burdened with the task of making their code robust to changes.

Conclusion

My goal here was not to teach parallel programming but merely provide a flavor of what it takes to write a good parallel program. It is indeed a complex job and I assert that it is not possible to appreciate the challenges without actually writing a parallel program. For those who have not written one yet, here is my call for action:

Write a parallel program to compute the dot-product of two arrays and get it to scale perfectly (e.g., 4x speedup on 4 cores). It is simple, but you will learn more than you expect.

Search for the keywords pthreads or winthreads to learn the syntax on Linux or Windows, respectively. Share your experiences!
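To get started on the dot-product exercise, here is one possible pthreads skeleton. The chunking scheme and names are my own; a real attempt should experiment with thread counts and array sizes.

```c
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4

typedef struct {
    const double *a, *b;
    size_t begin, end;   /* half-open index range for this thread */
    double partial;      /* each thread writes only its own slot */
} dot_job;

static void *dot_worker(void *arg) {
    dot_job *j = (dot_job *)arg;
    double s = 0.0;      /* accumulate locally: no shared writes in the loop */
    for (size_t i = j->begin; i < j->end; i++)
        s += j->a[i] * j->b[i];
    j->partial = s;
    return NULL;
}

/* Each thread reduces a disjoint chunk; the main thread sums the partials.
   Summing into one shared variable instead would reintroduce exactly the
   X = X + 1 race shown earlier. */
double dot_parallel(const double *a, const double *b, size_t n) {
    pthread_t t[NTHREADS];
    dot_job jobs[NTHREADS];
    size_t chunk = n / NTHREADS;
    for (int k = 0; k < NTHREADS; k++) {
        jobs[k].a = a;  jobs[k].b = b;
        jobs[k].begin = (size_t)k * chunk;
        jobs[k].end = (k == NTHREADS - 1) ? n : (size_t)(k + 1) * chunk;
        pthread_create(&t[k], NULL, dot_worker, &jobs[k]);
    }
    double sum = 0.0;
    for (int k = 0; k < NTHREADS; k++) {
        pthread_join(t[k], NULL);
        sum += jobs[k].partial;
    }
    return sum;
}
```

Getting this to actually show a 4x speedup (rather than losing to thread-creation overhead on small arrays) is where the learning happens.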

Update 5/30/2011: Thank you, everyone, for your feedback and readership. I have written a third post on parallel programming. It describes a profile-based solution for choosing the best task granularity in parallel programs.

56 Responses to “What makes parallel programming hard?”

Great article on MT programming. I’m not a programmer; I’ve done Logo, Basic, QBasic, Pascal, C, and Cobol at school, and now I’m more of an OS enthusiast. I don’t really want to start a debate here, but I’m curious whether you’ve checked what Apple has done to solve many of the challenges parallel programming has.

In my limited understanding, Apple’s addition of the Blocks extension to C solves the variable problem, since a block captures the variables of the function it runs in. And with libdispatch, you send blocks (jobs), serially or in parallel, to Grand Central Dispatch (GCD), which runs all the “wagons” coming from all apps on its own threads. This has the advantage of simplifying code: as you say above, loops and filters are very easy to transform into ^blocks and dispatch serially or in parallel for execution on external threads. Another advantage is scalability: GCD adjusts how many threads run concurrently given the hardware, and takes advantage of shared-cache CPU cores vs. multiple distinct CPUs. GCD optimizes jobs dispatched on a multi-threaded, multi-core, multi-CPU design like the Mac Pro by dispatching all jobs of the same queue onto threads that can share the same cache on the same CPU. I’m not sure, but I think you can even dispatch OpenCL code to multiple GPUs this way.

It’s a great comment and all your points are indisputable. GCD does ease the task of actually writing parallel code. As you point out, the block construct forces programmers to think about dependences and “tricks” them into doing the right thing. The serial and parallel constructs make it easier to enforce thread dependencies, and dynamic thread allocation relieves the programmer of choosing the right number of threads.

However, GCD still does not solve some of the challenges I discuss above:

First and foremost, the programmer still has to find parallel sub-tasks to insert in the queues. The programmer still has to identify all true dependencies to ensure that tasks are pushed in the correct order. Furthermore, they still need to think about thread synchronization to decide what goes in a serial queue and what goes in a parallel queue. All this implies that a big chunk of these challenges stays.

I have not coded in GCD enough to comment on the debugging experience; maybe someone else can enlighten us.

GCD as such does not help with robustness. However, indirectly it does help a lot. Robustness and scalability usually become a problem because programmers make wrong choices. By forcing programmers to do the right thing and taking control of execution, GCD does improve scalability.

I will learn more about GCD and update my answer ASAP. It’s a great topic to discuss.

I greatly enjoyed your explanation, and I see the value of parallel programming for computation-heavy programs; with a complex algorithm, it can be a very difficult task to split the work into chunks that can be executed separately.

But I think parallel programming can (and will) be used for more common purposes, as Apple is doing with GCD in all its software. Most programs, like Mail, iPhoto, or iTunes, are about managing and filtering content, executing repetitive tasks over multiple items. Mail filters are a great example of tasks that can be “transformed” into blocks for parallel asynchronous execution. The GCD solution is more than parallel execution; it’s a way of running “packetized” code through a centralized dispatcher. I think the new generation of mobile devices with multiple CPUs needs more than parallel programming alone: some sort of OS-controlled threading that allows multiple multi-threaded apps to run at the same time.

I like this description of GCD: Islands of Serialisation in a Sea of Concurrency.

I find your point on artificial dependencies very relevant. This is mostly due to shortcomings in current programming languages.

Let me give an example. You want to increment both x and y. In C, Java or similar languages, you will have to write either { x++; y++; } or { y++; x++; }. That is, you have to introduce an artificial dependency, although there was none, just because the language syntax does not allow otherwise. Then parallelizing tools will work hard attempting to guess whether such dependencies are real or artificial.

In a language with parallel composition such as Ateji PX for Java, the syntax makes it possible to write [ x++; || y++; ]. The parallel bar || stands for parallel composition: it runs both statements in no particular order, or even in parallel when running on parallel hardware. This example shows that with an appropriate syntax, there is no need to introduce artificial dependencies any more.

I think you’re right that it’s mostly a language issue: the semantics of the majority of programming languages have a logical “top to bottom, left to right” temporal ordering. It is so deeply ingrained that it is difficult for most programmers to think of it any other way: line 10 executes before line 11, and line 11 executes only after line 10. A very strong, very deeply ingrained mental model of temporal causality.

There is only one family of languages that I know of where this isn’t true- Hardware Description Languages, such as Verilog or VHDL.

Hardware Description Languages have a completely different logical and mental model of temporal causality because they have to: these are the languages used to design CPUs, ASICs, etc. When you’re dealing with hardware like that, it makes no sense to think “this transistor executes before that transistor”: it’s nonsensical, because every single transistor is executing at the exact same time. HDLs reflect that fact: every single character of every single line is executing at the same time. There are of course ways to control causality, and at what point “something happens”, but it is unlike anything you’ll find in a typical programming language.

For me, it was quite an eye-opener. It also made me realize that there’s nothing inherently difficult about doing things in parallel or concurrently- it’s actually pretty trivial… once you are forced to use a language where doing things in a sequential, step by step fashion is simply not possible.

In Verilog, a <= b; b <= a; and b <= a; a <= b; mean exactly the same thing (I’m glossing over some important details, though). The short version of why this is so is that in Verilog, you have to explicitly declare when things happen. This typically means that you write something that says “At the start of the tick of the Clock, do this…”. So, at the start of the tick of the clock, the “inputs” are on the right side of the statement, and the assignment to the left side takes place “before the next start of the tick of the Clock, but have yet to take place during this tick of the Clock..” There’s even two different types of assignment, <= and =, where = means “the assignment takes place and is completed by the end of the statement containing the assignment.”

I have been thinking about the similarities and differences between VHDL and parallel programming a lot myself. The way I see it, there are two major differences: (1) the synchronous clock (as you point out), (2) communication latency trade-offs. This is what makes parallel programming a lot harder: Software has to deal with latencies at a very different granularity and asynchronous systems are much, much harder to design (which is why we have never seen an asynchronous chip).

I have seen ParC and it’s a pretty neat concept. I like the idea of sticking to C++. My only concern is that it does require asynchronous design and doesn’t simplify the communication latency issues. I may be mistaken, as I only spent an hour toying with it. What’s your opinion? Does it solve either of these problems?

ParC supports both asynchronous and synchronous (RTL) style – the working examples on the site are synchronous and asynchronous versions of the same algorithm (game of life).

The asynchronous/no-shared-memory level is the best for extreme parallelism and hardware design. If you use the right methodology (e.g. CSP) the compiler and static analysis tools will do a lot of the hard work for you. The compiler/synthesis tools would be driven by latency, power and area constraints (using AoP).

See Achronix and Tiempo for companies delivering asynchronous chips.

Hardware designers need to move up from the RTL level so that the tools have more flexibility in creating the clocking schemes (which are really too complicated for humans now). The verification/design methodology also needs to be more power-aware, and ParC attempts to help by supporting more transistor-level/analog modeling than Verilog.

[...] Aater Suleman, Intel This post is a follow up on the previous post titled why parallel programming is hard. To demonstrate parallel programming, this article presents a case study of parallelizing a kernel [...]

Hi, interesting read! I was wondering what your thoughts were on using OS processes vs. threads. I’ve been finding that relying on the OS lets me focus more on my real software problems (see link). Of course, communication between processes is going to be harder than between threads, but it seems like this might be a fair trade-off.

You are right that process-level parallelism is good for several things, but MT is required for cases where threads communicate. There are ways to get processes to talk to each other as well, e.g., D-Bus or IPC or pipes, but I have written programs that way and the communication overhead is prohibitive. I would look at PostgreSQL as an example. By the way, for process-level parallelism, make is your poster child, and then there is GNU parallel as well.

Having said that, I do think multi-threaded code will be the right way to do it if you want to speed up a single problem.

What makes parallel programming hard is the legacy of tools we have to work with, the unholy trinity of CPU, OS and Programming Language.

X86 architecture and instruction sets aren’t parallel friendly; it is the operating system that performs multi-threading and multi-tasking, and the operating system itself is just an application.

The resulting Operating System ABIs to implement parallelism are hefty and expensive.

Take the maximally extreme case of

x++; y++;

At the lowest levels, the CPU will probably pipeline that, but to get one code-stream to execute those operations in parallel the cost is gigantic.

Under the hood, multi-core is just a hacky version of SMP. As a result, no real progress is being made in parallel programs outside of academic and extreme domain-specific scopes.

I would hazard a guess that the root of the problem is the old hardware guy vs software guy issue; what portion of the pan-domain, pan-discipline, pan-language programming community is working in x86 machine code or assembler?

I’d be surprised if it was more than a single-digit percentage. If accurate, that means that 99% of all programmers are working at least one layer separated from actual machine instructions. When it comes to parallelism that’s a big deal, especially since parallelism itself is a pseudo-implementation.

Of course, the OS devs don’t want applications tearing up threads/whatever that do their own thing because … well, botnet anyone?

Languages like C/C++ carry on their legacy of single-threadedness because you have to find large jobs of work or else you actually decrease your app’s performance due to the sheer volume of x86 instructions required to start a thread or dispatch work across cores.

It’s more than just the ABIs/APIs; very few languages have adopted constructs like “is thread safe” decorators. So the compilers have their work cut out to do any kind of auto-optimization and, in my experience, most of them work well in test cases or if a programmer exactingly follows very careful coding sequences, but in general production environments they just don’t work out as well as they could.

“Under the hood, multi-core is just a hacky version of SMP. As a result, no real progress is being made in parallel programs outside of academic and extreme domain-specific scopes.”

It’s much different from SMP because the communication overhead is much lower and the trade-offs are very different. I measured that on a dual-chip machine (a kind of SMP), core-to-core cache misses were going through memory and cost about 250 cycles. On a real CMP, like Nehalem, that cost is just tens of cycles. Hence my disagreement that it’s the same as SMP.

I also disagree that no real progress is being made. ImageMagick is a good example. Adobe has done a lot of parallel programming. Open-source efforts are firing up quickly. I guess it’s subjective, but in my opinion the motivation is now high and hence the work is catching fire.

I agree with the hardware-guy/software-guy issue. Fixing that issue is the theme of this blog. This post was for hardware guys to learn the troubles a software guy goes through, so they don’t just stick it to the software (I am a hardware guy with software know-how).

Legacy tools, languages, and hardware architectures indeed make it harder. Read my post here where I highlighted this very issue:

A very nice article. This article is very relevant to me since I am taking a Concurrency/Parallel Programming course this quarter.

While we parallelized a few known algorithms as part of the course assignments, we discussed and used an awesome tool to identify and debug hidden concurrency bugs. It’s called “mchess”, part of Alpaca, developed by Microsoft Research, Redmond.
This tool is open source and can be efficiently used to identify really-hard-to-find concurrency bugs.

It seems to solve some of the issues you describe by enforcing variable declarations to be either inside or outside the parallelizable code. I’m not sure that it allows you to share resources like in your ABCDEF example, but maybe it’s just safer that way!

My vague and unthought-out ideas: most algorithms fall into one of two categories, parallelizable or not.
Why don’t we have standard implementations of each algorithm in most of the popular languages, available online to reference? Why re-invent the wheel?
Then all we need to do is minimize the path / execution-time length in the big picture.

Thanks for reading and taking the time to share your thoughts. You are right about having more code on the internet for reference. It’s an absolute must. My only concern with standard implementations is that most interesting software tailors algorithms to its needs, which makes it hard to use standard code. I do, however, believe that reference examples can help a lot.

On a side note, I do want to clarify that there are algorithms which are mid-way, e.g., a 16-puzzle problem is parallelizable but only partially.

One of the main reasons parallel programming is hard is that most of it is done using the POSIX threads model. Hoare’s CSP offers an approach that is much easier to reason about and get correct. In some ways it mirrors electrical circuits in hardware. A motherboard is an immensely complicated parallel system when viewed as millions of transistors, yet we can now design them with decent results.

The starting point for reading up on the computer languages influenced by CSP, including Google’s Go mentioned above, is Russ Cox’s Bell Labs and CSP Threads.

Personally I am of the opinion that parallel computing is hard because of the very poor memory model of the C based languages: mutable shared memory.

If you try programming in a near side-effect free language and use immutable objects the very real pain of parallelism is dramatically reduced.
If you add such tools as “share-nothing”, “use messages for communication between processes”, and/or Software Transactional Memory the pain drops further.

I cannot help but recommend looking at languages such as Clojure, Erlang or Scala with Akka if you want to reduce your pain with parallel programming. Whatever you do leave Java / C behind as its memory model is (in my opinion) broken.

I do agree with you that we need low-effort parallel languages. There is need for that and the examples you mention fill this important space.

At the same time, I don’t think they make parallel programming in C irrelevant. The reason is that these side-effect-free languages are often outperformed by C. My concern is that the MT speedups of these languages are generally reported over single-threaded code written in the same language. I often wonder how the numbers would change if we used a C implementation as the single-threaded baseline (unless you argue that they also ease single-threaded programming, thus making my requested speedup unfair).

I’ve been doing MT software development for years in C/C++ and in C#. The majority of problems occur when the developer fails to recognize that a single-access resource is being used and fails to protect it as a single access point when threads modify it. This is an educational issue for developers, not a problem of the language being used. I had to get into the habit of distinguishing single-access resources from in-thread resources. In other words, reduce the usage of globally available single-access resources to the minimum necessary.

The difficulty in creating a language that will do those “thread safe” operations on single access resources is that you then create a situation in which the compiler has to determine which resource is single access and put in place the thread safe mechanism around it. What I don’t see is how this is different than a developer doing this themselves. Yes it’s more automated, but there are cases in which multiple access to a single access resource is perfectly ok (i.e. reading the value of a counter to be reported on screen).

It is my opinion that better teaching of MT threaded programming is necessary and that articles such as this are necessary to get software developers to go and learn about it. I don’t think we need a new language to replace the existing ones.

Actually, I am also of the opinion that we may not need a brand new language. My experience is that the programmer can be trained or tricked into writing thread safe code in C as well. My top concern is that raising the level of abstraction will take away the ability to do performance optimizations. I do believe that automated tools cannot beat the expert programmers.

At the same time, I do see the arguments on the other side. Think of it as python vs. C argument. C is faster and flexible but python does indeed increase productivity. I think there is a shortage of low-effort languages like “python” in the parallel world.

[...] case of parallel programming that’s particularly true. Aater Suleman has written an article on what makes parallel programming hard, partly to educate hardware designers, and the story has ignited some debate [...]

Thanks for a good and simple Introduction.
Huh, one thing is not good, though:
“Basically, there is a dependency that D shall not execute before C (or A should not happen before F).”
This makes no sense, because what you say obliterates the ‘parallel programming’. You just want both increments to be done sequentially, and then there is zero parallelism, alas. A bad example.

I think the source of confusion is that you are thinking that those 6 instructions are the entire program. I did not intend it to be that way. They are just a part of the program which is why parallelism is not zero. The other 99.9% can still be fully parallelizable.

Just to clarify, that example shows that there are parts of parallel programs that cannot be done in parallel (and the programmer is required to recognize these dependencies and insert synchronization). It does not obliterate parallel programming; it just shows that our world is not 100% parallel. In case you disagree, I will give a simple example: try adding 1000 numbers with 100% parallelism.
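The 1000-numbers point can be made concrete with a pairwise (tree) reduction, the standard way to parallelize a sum. A sketch (the function name is mine, for illustration):

```c
#include <stddef.h>

/* Pairwise (tree) reduction over v[0..n-1], modifying v in place; n >= 1.
   Within one level, the additions are independent and could run on
   different cores. But each level depends on the previous one, so even
   with unlimited cores the sum takes about log2(n) dependent steps:
   highly parallelizable, yet never 100% parallel. */
double tree_sum(double *v, size_t n) {
    for (size_t stride = 1; stride < n; stride *= 2)
        for (size_t i = 0; i + stride < n; i += 2 * stride)
            v[i] += v[i + stride];   /* this level's adds are independent */
    return v[0];
}
```

For n = 1000 that is about 10 levels of unavoidable sequential dependency, which is exactly the point of the example.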

[...] design idioms. While researching the background on this article, I found a fantastic write up on What Makes Parallel Programs Hard. The author contends that parallel programs are hard because of inter-task dependencies. This [...]

Part of what makes parallel programming hard is identifying parallelism opportunities, and part is harnessing them.

A first problem is identifying dependencies accurately. One needs very smart people or very sophisticated analysis tools to discover dependencies. But this discovery also depends on the ability to harness; big chunks of code have more dependencies than little chunks, and thus the number of opportunities to find useful bits of parallelism depends mightily on how much code is required for efficient parallel execution. This in turn depends on the time to create and destroy units of parallelism.

Most general-purpose programming languages have no model of parallelism; if any is harnessed, it comes from additional libraries (e.g., OS fork calls from C code). While these libraries offer parallelism capability, the overhead is quite high (thousands of machine instructions), which means the parallel activity must be much bigger than that for the parallelism overhead to be modest, let alone overcome. Try writing a loop whose iterations run in parallel using such a library; you end up with a huge performance hit.

What you want is a language in which the parallelism constructs have extremely low overhead, to allow very small bits of code to be considered as parallelism opportunities. As a practical matter, I think this means the parallelism constructs have to be built into the language directly, so the compiler can see and optimize them. Having explicitly managed parallelism constructs also helps one avoid programs that are simply broken because of incorrect parallelism API usage. (What, forgot to set aside enough stack space?)

PARLANSE is a language designed around this idea. Forked grains have some 50 machine instructions of overhead; if you can find a block of several hundred machine instructions (representing a high-level code block), it can reasonably be executed with modest overhead; you have a chance of running one (or perhaps a few) iterations in parallel. Multiple versions can run in parallel. See http://www.semanticdesigns.com/products/Parlanse/index.html for some examples.

We use PARLANSE to support irregular symbolic computations that in effect reason about program structures and information flow. When you have 10 million lines of code, the information flows are very irregular and incredibly vast.

[...] of strangers – right? Footnotes [1] Some of the issues in parallel programming are summarized here. [2] Here’s that Mythbusters video. [3] The Cathedral and the Bazaar by Eric S. Raymond [4] One [...]

Thanks for a fantastic article! I am new to parallel programming, and you have explained everything so clearly here! I especially loved the analogies about painting a room vs. cooking as parallelizable and non-parallelisable jobs. I know that reading your article will improve my programming! Keep the great articles coming.
