Posted by Unknown Lamer on Friday May 27, 2011 @02:24PM from the double-the-threads-double-the-fun dept.

An anonymous reader writes "Intel's Aater Suleman writes about why parallel programming is difficult. ... I was unaware ... that a major challenge in multi-threaded programming lies in optimizing parallel programs, not just getting them to run. His analysis is insightful and the case study is very enlightening if you are unfamiliar with parallel code debugging."

That's a different problem, related to the fact that people buy "parallel" but forget to read the part where it says "synchronize". It's kind of like not validating input, or ignoring return values. Not so much a pitfall as a failure to have basic skills.

The problem these folks are having is that they bought parallel but decided it just wasn't optimizing their stuff enough. It'd be really funny if they found they could optimize-away the parallelization, like, by precomputing certain results and using a l

I have to disagree on this. Synching multi-threaded programs is often VERY far from trivial, and takes much more than "basic skills". TFA made that very point; often there are hidden dependencies that may not be obvious even to the developer who originally wrote the program. If it only took "basic" skills then we would have a lot more multi-threaded programs these days.

I ran into this problem myself recently: I wrote a program that interfaced with websites via http, and the problem was that while it was

Your understanding of Amdahl's Law isn't quite correct. I guess if you want to use the street race analogy, it's more like you each have your own lane for part of the race, but for another part of the race you have to go single file into a single lane, and if you're parallel to someone as you approach the choke point you have to arbitrarily let them ahead of you or pass them based on some set of rules that may not necessarily be fair. And the goal isn't to win the race but you're all delivering something an

While true, this isn't due to the problem of not being able to parallelize more tasks. If you want to use the "add more doesn't mean more gets created" metric, use the myth of adding more people to a project to speed it up. Even aside from training and other overhead, you cannot simply parallelize everything in software creation. Let's do a simple example.

There are two tasks that can be run in parallel: an administrative task (like meetings and organizing crap) and the programming task of creating the actual progra

Amdahl's law is saying that no matter how much parallelization you get, you are still going to be limited by the longest non-parallelizable path. This is no different than your real life example - your longest non-parallelizable path (say CPU->PCB assembly->ROM flashing->final assembly->packaging) is going to limit how fast you can manufacture your iPhones. It doesn't matter how fast you can manufacture glass backs, or RAM; if it's not on that critical path, you won't be speeding up your result at all.
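For reference, the usual textbook form of Amdahl's law: if a fraction p of the runtime parallelizes perfectly across n workers, the best possible speedup is 1 / ((1 - p) + p / n). Here is a minimal sketch in Python; the function name and the 90% figure are made up for illustration and are not from TFA.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Best-case speedup when `parallel_fraction` of the runtime is spread over `workers`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is paid in full; only the parallel part shrinks.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical workload: 90% of the runtime parallelizes perfectly.
    for n in (1, 2, 8, 64, 1024):
        print(f"{n:5d} workers -> {amdahl_speedup(0.9, n):.2f}x")
```

With 10% of the work serial, the speedup can never reach 10x no matter how many workers are added, which is the "critical path" point in formula form.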

Importantly, comparisons to factories are generally unwise because manufacturing is 'embarrassingly parallel' -- it doesn't matter if you have to make the whole iPhone in one single step that takes two straight hours and you can't divide it into any constituent part. You can still manufacture as many iPhones a day as you like because you just set up however many manufacturing units which each make one iPhone every two hours, until you have the desired output capacity.

A couple of problems with the analogy:
1. In manufacturing, the idea is to do the exact same thing a jillion times with the exact same result. Interchangeable parts make different rates of production easier to deal with. In computing this isn't the case. The assembly line may be identical each of the jillion times, but the data going through it is not.
2. Manufacturing plants are expensive to build. It makes sense for an engineer to spend weeks or even months optimizing the process. We don't have th

You're a bit smarter than me but I think you're saying there are lots of easy tasks that are easy to run on a single linear code thread but are hard to split and recombine with any less loss of time / resources.

I am kinda curious how anyone even tangentially involved in programming could not be aware that the problem with writing parallel programs is doing it for a gain in efficiency. Making a thread or process is generally just a couple of lines of code; synchronization with data separation, mutexes, and avoiding deadlocks and race conditions has been solved since almost the beginning of parallelism.

synchronization with data separation, mutexes, and avoiding deadlocks and race conditions has been solved since almost the beginning of parallelism

And yet people constantly get these details wrong in practice.

It's an extra layer of complexity and it introduces extra chances to make mistakes, even around areas where a programmer could know better. There's not much way around that. If people only made coding mistakes around difficult problems software would be dramatically more bulletproof than it actually is.

I tend to think of it as an extra dimension in code. With non-parallel code, the code you have (its sequence) is the same as what its sequence would be when run. With parallel code, the run-time sequence is different than the code as it's laid out in source.
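A small illustration of source order versus run-time order, sketched with Python's asyncio rather than anything from the thread. The event labels are invented; the callback is registered between two statements but only runs after both of them.

```python
import asyncio


async def demo() -> list:
    events = []
    loop = asyncio.get_running_loop()

    events.append("1: before scheduling the callback")
    # The callback appears here in the source, but it only runs once
    # this coroutine yields control back to the event loop.
    loop.call_soon(events.append, "3: callback runs")
    events.append("2: line after the callback in the source")

    await asyncio.sleep(0)  # yield to the event loop so the callback can run
    return events


if __name__ == "__main__":
    for event in asyncio.run(demo()):
        print(event)
```

The numbers in the labels show the order in which the lines actually execute, which is not the order in which they are written.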

I see people have trouble with just async stuff (eg. AJAX) and have a hard time wrapping their mind around the fact that even though the callback function is in-sequence with the rest of the code, it's not actually called in that sequence - hence the '

I see people have trouble with just async stuff (e.g. AJAX) and have a hard time wrapping their mind around the fact that even though the callback function is in-sequence with the rest of the code, it's not actually called in that sequence - hence the 'callback'.

In my experience, the reason people have so much trouble with async stuff is that every single JavaScript API I've seen that does things asynchronously is rather fundamentally designed *wrong*. They don't provide any clean mechanism for passing dat

I blame that mostly on the languages not being explicit about what operations and methods are or are not thread safe. And for capable programmers those errors generally only occur when you try to make things more efficient by avoiding excess mutex and data duplication in the pursuit of efficiency.

I took a course in parallel programming. We worked with MPI in C if anyone is curious or interested. The hardest part was completely changing the way you thought about programming. It was like the first time you looked at Prolog. Or the first time you tried a functional language like Scheme. If you've only ever done procedural, it can be quite a big deal to switch. Also, it's quite a bit harder to debug multithreaded code, as the exact order of the instructions is different every time you run it. Tracki

The key to parallel programming is compartmentalization. With a good enough foundation that compartmentalizes properly, parallel programming would only be a matter of coding the compartments and synchronizing their results.

That having been said, parallel programming these days solves a fairly niche problem. The speed of modern processors makes faking parallelism with interrupts viable. The only area where parallelism truly makes sense is when working with extremely large amounts of data in large chunks.

I am kinda curious how anyone even tangentially involved in programming could not be aware that the problem with writing parallel programs is doing it for a gain in efficiency. Making a thread or process is generally just a couple of lines of code; synchronization with data separation, mutexes, and avoiding deadlocks and race conditions has been solved since almost the beginning of parallelism.

Actually, interestingly enough, parallelizing code not only results in computational efficiency gains, but it frequ

Good tutorial for someone who wants to jump into some parallel programming, but it's mostly Operating Systems 101 (or 601).

Honestly though, if you have not optimized your algorithm or code for parallelism and you want to do it now, you would probably be better off writing the whole thing from scratch, and the tutorial explains why very nicely.

Assuming that there is a good way to take advantage of multiple processors for the particular problem your code solves. For a course project once, we found ourselves trying to solve a linear program, and someone suggested using the research cluster to speed things up. As it turns out, linear programming problems are P-complete, and there is no known way to make meaningful use of multiple cores (it is very likely that no such method even exists, and that the problem is inherently sequential).

I'll take a stab and reply to you that "8 processors was enough for anyone", in the sense that multiplexing 8 programs is just insane. Better to just run 8 programs, each on its own core, and use some progs that can use 4 cores at a time. That leaves 4 free.

(Overly simplistic) I agree, but 1024 cores is not the answer. We need the next generation in raw core power to move computing forward. 8 killer cores will beat 1024 mediocre cores.

While you're correct from a temporarily practical measure, I disagree in theory. OS theory 20 or more years ago was about one very simple concept: keeping all resources utilized. Instead of buying 20 cheap, slow full systems (at a meager $6k each), you can buy 1 $50k machine and time-share it. All your disk-IO will be maximized, all your CPUs will be maximized, network etc. Any given person is running slower, but you're saving money overall.

If I have a single 8 core machine but it's attached to a NetApp disk-array of 100 platters over a network, then the latency means that the round trip of a single-threaded program is almost guaranteed to leave platters idle. If, instead, I split a problem up into multiple threads / processes (or use async-IO concepts), then each thread can schedule IO and immediately react to IO-completion, thereby turning around and requesting the next random disk block. While async-IO removes the advantage of multiple CPUs, it's MASSIVELY error-prone programming compared to blocking parallel threads/processes.
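A rough sketch of the blocking-threads approach in Python, assuming a made-up `read_block` that just sleeps to stand in for disk/network latency. The names and the 50 ms figure are illustrative only.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_block(block_id: int, latency: float = 0.05) -> str:
    """Stand-in for a blocking read; the sleep plays the role of disk/network latency."""
    time.sleep(latency)
    return f"block-{block_id}"


def read_serial(block_ids):
    # One request at a time: every round trip is paid in full.
    return [read_block(b) for b in block_ids]


def read_threaded(block_ids, workers: int = 8):
    # Each worker blocks on its own request, so the waits overlap.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(read_block, block_ids))


if __name__ == "__main__":
    ids = range(16)
    for fn in (read_serial, read_threaded):
        start = time.perf_counter()
        fn(ids)
        print(f"{fn.__name__}: {time.perf_counter() - start:.2f}s")
```

Each worker is plain blocking code, which is why this style is easier to get right than hand-rolled async-IO. The cost is one thread per outstanding request.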

A given configuration will have its own practical maximum and over-saturation point. And for most disk/network sub-systems, 8 cores TODAY is sufficient. But with appropriate NUMA supported motherboards and cache coherence isolation, it's possible that a thousand-thread application-suite could leverage more than 8 cores efficiently. But I've regularly over-committed 8 core machine farms with 3 to 5 thousand threads and never had responsiveness issues (each thread group (client application) was predominantly IO bound). Here, higher numbers of CPUs allow fewer CPU transfers during rare periods of competing hot CPU sections. If I have 6 hot threads on 4 cores, the CPU context switches leech a measurable amount of user-time. But by going hyper-threading (e.g. doubling the number of context registers), we can reduce the overhead slightly.

Now for HPC, where you have a single problem you're trying to solve quickly/cheaply - I'll admit it's hard to scale up. Cache contention KILLS performance - bringing critical region execution to near DRAM speeds. And unless you have MOESI, even non-contentious shared memory regions run at BUS speeds. You really need copy-on-write and message passing. Of course, not every problem is efficient with copy-on-write algorithms (e.g. sorting), so YMMV. But this, too, was an argument for over-committing: while YOUR problem doesn't divide, you can take the hardware farm and run two separate problems on it. It'll run somewhat slower, but you get nearly double your money's worth in the hardware - lowering costs, and thus reducing the barrier to entry to TRY and solve hard problems with compute farms. Amazon EC, anyone?

In my opinion, many problems with software development are just as applicable in other domains of our life, and parallel programming is definitely one of them. We equally well have problems managing large teams of people working in parallel. These are problems of logistics, management and also (and no I am not joking) - cooking. And we're as bad at handling these as we now are at handling software development. It may be right, however, to start solving this with the computers - no need to throw away rotten food and/or

There has been quite a bit of work on formalizing parallel computing. NP problems are exactly that: problems that can be solved efficiently on a computer that can explore an unbounded number of solution paths in parallel. There is also the NC hierarchy, which can be thought of as problems that can be solved efficiently on a sequential computer and "much more efficiently" on a parallel computer (that is, polynomial time on a sequential computer, and polylogarithmic time on a parallel computer with a polyn

That's a monumental and exceedingly involved task - to create such a novel concept of a language, partly because few have been written, and even fewer are usable to mere mortals, who happen to be using imperative languages just fine, mind you. Articles upon articles full of either terms-within-terms or lack thereof have been written. Some people once set upon defining this new computing upon which you seem to touch and created www.tunes.org, which has since stagnated.

That you have to do everything all at once. How would you tell 50 kids to sort 50 toy cars? How would you tell 50 footballers to line up by height all at once? How would you have 50 editors edit the 50 pages of a screenplay all at once so that it makes sense from a continuity perspective? All these problems are very easy, but become very hard when you have to do it all at once...

The ultimate solution to most of those appears to be, "have to do it fifty times, and assign one job to a person".

That said, not all problems are so easily dealt with as bulk operations. Perhaps you have some real need to take one item and split the work up among multiple people. The real driving force for all the scenarios I can think of is a desire or need for an upper bound on job latency.

I do this kind of work as my day job. I've also got some experience in managing groups of coders (at work, not particu

I think the language Chapel [cray.com] being developed by Cray is taking concrete steps to make Parallel Programming easier. It is very similar in syntax to C++ and even supports some higher level constructs.

This looks like an advertisement for Chapel but I have no relation to Cray. Having taken a graduate parallel programming course, I cannot agree more with the statement that "Parallel Programming is difficult". I struggled a lot with pthreads and MPI before doing the final assignment in Chapel which was a plea

20 years ago when I was working with transputers we used Occam. It was a very pure parallel programming language and it wasn't too difficult. However, writing parallel code meant starting from scratch (getting rid of the dusty decks of old algorithms as my professor described it). However, this never really happened and we've ended up with primitive parallelisation nailed on to sequential code. There are many parallel architectures, SIMD, MIMD, distributed memory, shared memory and combinations of them all

Occam still exists. KROC supports the latest version of Occam (Occam-Pi), which supports mobile processes, generics and other concepts acquired from the developments in serial programming and computer clusters.

I consider it to be one of the finest languages out there for learning parallel programming and consider that most of the modern failures in the field are a result of people not knowing it and therefore not knowing the fundamentals. You can't run if you can't walk.

Using locks and the like makes it very easy to do multithreaded and parallel programs.

The big problem comes when you need multiple locks because you find your program is waiting more on locks than anything else which is gumming up the whole works, and that can easily lead to deadlocks and other fun stuff.

Another way is to consider lockless algorithms, which don't have such blocking mechanisms. However, then you get into issues where atomicity isn't quite so atomic thanks to memory queues and re-ordering done in the modern CPU, and thus have to start adding memory barriers before doing your atomic exchanges.

Raymond Chen (of Microsoft) did a nice write up of the lockfree ways to do things and what Windows provides to accomplish them.

Worse, typical synchronization primitives such as in pthreads and Windows are optimized not for speed but for error handling etc. It is easy to beat their speed with custom implementations, sometimes with dramatic speed increases (see for example numerous articles at http://locklessinc.com/articles/ [locklessinc.com] and I've implemented some of the mutexes and ticket locks and my Windows/Linux software has become faster as a result). In addition, using lock-free algorithms whenever possible can provide a f

>>Using locks and the like makes it very easy to do multithreaded and parallel programs.

Eh, the Dining Philosophers would like to ask you out to eat. Locks can introduce timing issues which can result in a program locking up at random.
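For anyone who hasn't met the Dining Philosophers: each philosopher needs two forks (locks), and if everyone grabs their left fork first, they can all end up waiting on each other forever. One classic fix, sketched here in Python, is to always take the locks in a single global order. The function and parameter names are invented for the example.

```python
import threading


def dine(num_philosophers: int = 5, meals_each: int = 100) -> list:
    if num_philosophers < 2:
        raise ValueError("need at least two philosophers (and two forks)")
    forks = [threading.Lock() for _ in range(num_philosophers)]
    meals = [0] * num_philosophers

    def philosopher(i: int) -> None:
        left, right = i, (i + 1) % num_philosophers
        # Always pick up the lower-numbered fork first. With one global
        # order, a cycle of philosophers each waiting on the next cannot form.
        first, second = min(left, right), max(left, right)
        for _ in range(meals_each):
            with forks[first]:
                with forks[second]:
                    meals[i] += 1  # only philosopher i ever writes meals[i]

    threads = [
        threading.Thread(target=philosopher, args=(i,))
        for i in range(num_philosophers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return meals
```

If the `min`/`max` line is replaced with "left first, then right", the same program can hang at random, which is the timing issue described above.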

The real difficulty of parallel programming comes from two things (speaking as someone who has a Master's in the subject):
1) The development environment isn't as well supported as single-threaded coding.
2) It requires a different mindset to write code solidly. Remember ho

It also does not help that threads are very easy to mess up. Look at all the traditional programs that used multiple threads for nonparallel concurrency, and how much trouble the developers have with deadlocks, or forgetting to use locks on a shared variable access, or calling code intended only to be used by the other thread, etc.

So even if you have the ideal parallel version of the algorithm all planned out, actually implementing it correctly can still be problematic.

that most of today's popular programming languages do not accommodate higher-level forms of expression required for easy parallelism. Declarative languages have a slight edge at being able to express where sequential dependencies are.

One more reason why functional programming matters. Many programs become trivial to parallelize when you avoid mutation and side-effects outside of limited, carefully-controlled contexts.

It's truly a joy when you can parallelize your code by changing a single character (from "map" to "pmap"). There's usually a little more to it than that, but you almost never have to deal with explicit locks or synchronization.
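The "map" to "pmap" swap above is Clojure. A rough Python analogue of the same idea is sketched below with `concurrent.futures`. The Collatz function is just an arbitrary side-effect-free example, and a thread pool is used to keep the sketch simple. For CPU-bound work in CPython you would more likely use `ProcessPoolExecutor`, which has the same `map` method.

```python
from concurrent.futures import ThreadPoolExecutor


def collatz_steps(n: int) -> int:
    """A pure function of a positive integer: the result depends only on the argument."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


def sequential(numbers):
    return list(map(collatz_steps, numbers))


def parallel(numbers):
    # Same shape as the line above; only the `map` being called changes.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(collatz_steps, numbers))
```

Because `collatz_steps` touches no shared state, the two versions return the same list and no locks are needed anywhere.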

One more reason why functional programming matters. Many programs become trivial to parallelize when you avoid mutation and side-effects outside of limited, carefully-controlled contexts.

More specifically, the problem of parallel programming is the problem of structuring state for concurrent access. Everything else (the mechanics of locks etc) is trivial and mainly a typing exercise.

The reason this is difficult is that it's an optimization; normally when we optimize code we write the naive version first (or near-naive), then optimize it. We might change a hash table to a vector, memcpy() to a few lines of inline assembler for a particular target, hand code CRC/checksumming/RC4 in assemble

Java doesn't produce great programmers IMHO. But Intel is going to need people with lower level skills, and C, assembly, machine code, and Forth are just too much to ask for a general curriculum. Assume they are going to have to pick that up on the job, IMHO.

What makes parallel programming hard is poor languages. Languages that allow state changes and don't keep them isolated. Isolate changes of state, all changes of state and be careful about what kinds of combinators to use. Google map-reduce works whenever

a) You can organize your data into an array
b) You can do computations on array elements in isolation
c) You can do computations on the entire array via associative operations pairwise

And most programs do meet those properties but they slip in all sorts of

There's a reason Mapreduce is said to operate on "embarrassingly parallel" problems. There are a lot of them. But there are also a lot of problems which are not embarrassingly parallel; for instance, they have nonlocal data dependencies.

No, I hadn't read your futurechips article, though I agree with what you wrote. But frankly the parallelism is obvious; it's only by using C that you are making things complex:

a) construct a function that takes an array of char and returns a count hash
b) construct a function that takes two count hashes and adds them to produce a count hash
c) construct a function that splits an array of char evenly into n pieces
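Those three functions can be sketched in a few lines of Python. This is one possible reading of the recipe, not the poster's code. All the names are invented, and a thread pool stands in for whatever would actually run the pieces in parallel.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_chars(chunk: str) -> Counter:
    """(a) Take a run of characters and return a count hash."""
    return Counter(chunk)


def merge_counts(left: Counter, right: Counter) -> Counter:
    """(b) Add two count hashes to produce a count hash."""
    return left + right


def split_evenly(text: str, n: int) -> list:
    """(c) Split the characters into n pieces whose sizes differ by at most one."""
    base, extra = divmod(len(text), n)
    pieces, start = [], 0
    for i in range(n):
        end = start + base + (1 if i < extra else 0)
        pieces.append(text[start:end])
        start = end
    return pieces


def parallel_char_count(text: str, n: int = 4) -> Counter:
    with ThreadPoolExecutor(max_workers=n) as pool:
        # Each piece is counted in isolation, then the partial results are
        # combined pairwise with an associative operation.
        partial_counts = pool.map(count_chars, split_evenly(text, n))
        return reduce(merge_counts, partial_counts, Counter())
```

Since adding count hashes is associative, the merge step could itself be done as a tree instead of the left-to-right `reduce` used here.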

The dirty secret of parallel programming is that it's *NOT* so widely needed. I think a lot of academics got funding to study automatic parallelization or other parallel techniques, and they latch on to multicore as a justification for it, but it's not.

There is only one GOOD reason to use multithreading -- because your work is compute-bound. This typically happens on large-data applications like audio/video processing (for which you just call out to libraries that someone else has written), or else on your own large-data problems that have embarrassingly trivial parallelization: e.g.

var results = from c in Customers.AsParallel() where c.OrderStatus == "unfulfilled" select new {Name=c.Name, Cost=c.Cost};

Here, using ParallelLINQ, it's as simple as just sticking in "AsParallel()". The commonest sort of large-data problems don't have any complicated scheduling.

There are also BAD reasons why people have used multithreading, particularly to deal with long-latency operations like network requests. But this is a BAD reason, and you shouldn't use multithreading for it. There are better alternatives, as shown by the Async feature in F#/VB/C# (which I worked on, and which was also copied into JavaScript with Google's traceur compiler). e.g.

Here it kicks off two tasks in parallel. But they are cooperatively multitasked on the same main thread at the "await" points. Therefore there is *NO* issue about race conditions; *NO* need to use semaphores/mutexes/condition-variables. The potential for unwanted interleaving is dramatically reduced.
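The snippet this comment refers to is not in the thread. As a stand-in, here is a rough Python asyncio analogue of the same pattern (not the C#/VB Async feature itself); `fetch`, `run_both` and the delays are made up. Two tasks overlap their waiting, both run on one thread, and they can only be interleaved at the `await`.

```python
import asyncio


async def fetch(name: str, delay: float, log: list) -> str:
    log.append(f"{name}: request sent")
    # `await` is the only place this task can be suspended.
    await asyncio.sleep(delay)  # stand-in for a network round trip
    log.append(f"{name}: response handled")
    return name.upper()


async def run_both() -> tuple:
    log = []  # shared and unlocked; both tasks run on the same thread
    first = asyncio.create_task(fetch("a", 0.05, log))
    second = asyncio.create_task(fetch("b", 0.01, log))
    results = await asyncio.gather(first, second)
    return results, log


if __name__ == "__main__":
    results, log = asyncio.run(run_both())
    print(results)
    print("\n".join(log))
```

The shared `log` list needs no mutex here because no other task can run between two statements that have no `await` between them.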

So in the end, who are the people who still need to develop multithreaded algorithms? There are very few. I think they're just the people who write high-performance multithreaded libraries.

The dirty secret of parallel programming is that it's *NOT* so widely needed.

That's kind of begging the question, there.

Those who need it know they need it. Those who think it's neato-keen and want to play with it try to come up with ways to use it that are maybe not obvious, and for which it is maybe not even necessary. I've known about this at least since the first Thinking Machines came out and our school got one of the bigger ones, solved the problem they used in the grant proposal that paid for it in about a week, then realized they had a multi-million-dollar computer with

The problem is that people tend to focus on single-threaded designs for 3rd party libraries. Then when those libraries get linked to larger libraries (or main apps) which are MT, the whole world comes crashing down. Now you have to treat every function call of the ST-library as a critical region. Thus while YOU may not care about MT, you should strive to make all your code reentrant at the very least, to whatever degree this allows contention-free memory access (e.g. ZERO global variables). This fut

even image processing stuff like ImageMagick did it. Graphics has been doing it for a while

Graphics is some low hanging fruit that can get a fair bit of benefit from a lot of threads. A lot of operations with graphics involve doing the same thing to a large amount of data that can easily be carved into individual chunks and dealt with in parallel. It's the same with video where the same transformation or filter gets applied independently to thousands of frames and all you care about is getting it done qui

The dirty secret of parallel programming is that it's *NOT* so widely needed. I think a lot of academics got funding to study automatic parallelization or other parallel techniques, and they latch on to multicore as a justification for it, but it's not.

It is widely needed, but perhaps not by you.

There is only one GOOD reason to use multithreading -- because your work is compute-bound.

Ok, now you're confusing the issue. Are you talking about parallel programming or multi-threaded programming? Parallel programming is larger in scope than simple multi-threaded programming.

So in the end, who are the people who still need to develop multithreaded algorithms? There are very few. I think they're just the people who write high-performance multithreaded libraries.

No, there are quite a few and many of them make quite a lot of money to do so. Do you think the programmers at ILM, 3DS, Pixar, and NASA are just sitting around doing nothing? Does your MT algorithm library also know how to optimize for the GPU as well? Are you sure that a one si

There is only one GOOD reason to use multithreading -- because your work is compute-bound.

...and therefore, most of the people in the world won't need anything beyond an Intel Atom because their tasks aren't compute-bound.

Seriously, I don't know what kind of code you write for a living, but the code I write almost always has some portion of compute-bound submodules, even if what I do has nothing to do with video codecs or 3d or whatever field has convenient libraries.

The basic problem with parallel programming is that, in most widely used languages, all data is by default shared by all threads. C, C++, and Python all work that way. The usual bug is race conditions.
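A minimal Python sketch of that default-sharing problem; the class and function names are invented. Every thread sees the same object, so an unlocked read-modify-write can lose updates, while the locked version cannot.

```python
import threading


class SharedCounter:
    """Visible to every thread that holds a reference; nothing is private by default."""

    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def unsafe_increment(self) -> None:
        # Read-modify-write with no lock: two threads can read the same
        # old value, and then one of the two updates is lost.
        current = self.value
        self.value = current + 1

    def safe_increment(self) -> None:
        with self._lock:  # one thread at a time performs the read-modify-write
            self.value += 1


def run(increment_name: str, threads: int = 4, per_thread: int = 10_000) -> int:
    counter = SharedCounter()
    increment = getattr(counter, increment_name)

    def worker() -> None:
        for _ in range(per_thread):
            increment()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter.value
```

`run("safe_increment")` always returns threads * per_thread. `run("unsafe_increment")` may come up short, and whether it does on any given run depends on timing, which is exactly what makes this class of bug hard to reproduce.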

There have been many languages for parallel programming which don't have default sharing, but they've never taken over outside some narrow niches. Partly because most of them weren't that useful outside their niche.

The other classic problem is that in most shared-data languages with locks, the language d

The Scala [scala-lang.org] community has tried to move the problem into a more practical realm by adding things like parallel collections [infoq.com], DSLs [scala-lang.org] to abstract out the problem for specific applications and the Akka Project [akka.io] for simpler concurrency.

Most of the parallel programming discussion I've seen is very complicated and not likely to appeal to those who have to do practical day-to-day business projects. By pushing the abstractions up a level, I think the Scala folks have made parallel programming more accessible for the av

For certain types of problems, the Linda coordination primitives and shared tuple-space make parallel programming much easier. I used the original C-Linda many, many years ago, and IBM's TSpaces for Java more recently. If you're trying to do little bitty actions on lots of data with tight coordination, the overhead is pretty bad. Looking into PyLinda is on my list of things to do...

I've seen one example of a threaded serial task. Intel has a Hyper-Threading optimization PDF somewhere showing some interesting tricks using HT. The example they had was on an i7, so fairly recent. They had a serial task that iterated through an array. They loaded an extra thread that synced with the primary thread, ran it on the other virtual thread, and all it did was call the prefetch instruction on the array.

Even though they had a modern architecture with advanced prefetching and linear memory access, using

Okay, then the only problem is getting something useful out of Erlang.

Back in 1985 the Japanese government announced a "fifth generation" computing project, with software to be developed in Prolog. So I went and learned Prolog, an intriguing and amusing language. Only problem is, it was totally useless for any actual application, as the Japanese found out.

Sorry, but in order to believe any of the promises of one of those non-vonNeumann languages, I have to see a practical working application first.

Von Neumann is an architecture; the term you are looking for is imperative languages. Those "other" languages could be functional or logical. If you really want to get down to brass tacks, it comes down to whether you want stateful programming (imperative requires state), or stateless programming (functional and logical programming).

In that case you can restate the original question as "how do I implement an Erlang compiler and runtime?" If, indeed, Erlang embodies the most efficient way to solve all parallel programming problems. Which of course is absurd. Erlang is useless for implementing servers, kernels, runtimes, networking stacks, file systems, device drivers, VM systems, database servers, or anything else that actually makes a computer tick. It's simply an interface to an already-implemented computing system. It's good to

When I first read this I assumed the poster was being sarcastic, but reading again they actually believe it. Erlang useless for implementing servers? Erlang is used to implement loads of servers. Look at the back end of a lot of the top 100 companies' web services and you will find them using Erlang to implement their server functionality. Erlang is used to provide database systems too.

The idea that Erlang is some sort of toy academic language that is not used for anything practical is a joke. Erlang

Mutexes and semaphores are a bit more than bits and integers. You have to be able to modify them atomically. If you don't design your PLC correctly, you end up with two paths attempting to modify the same bit at the same time. Without atomicity, you can't know which won.

A bit is a mutex if your processor has an atomic test and set instruction. Which these days they all do. A semaphore is an integer protected by a mutex. He oversimplified a bit, but if you know what you're doing it is that simple.
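To make "a semaphore is an integer protected by a mutex" concrete, here is a minimal Python sketch. The class name is invented, and a condition variable is added so that waiters sleep instead of spinning. Python already ships `threading.Semaphore`; this only shows the construction.

```python
import threading


class CountingSemaphore:
    """An integer guarded by a mutex, plus a condition variable so waiters can sleep."""

    def __init__(self, permits: int) -> None:
        if permits < 0:
            raise ValueError("permits must be non-negative")
        self._permits = permits
        self._changed = threading.Condition(threading.Lock())

    def acquire(self) -> None:
        with self._changed:  # take the mutex
            while self._permits == 0:
                # Releases the mutex while sleeping, re-takes it on wake-up.
                self._changed.wait()
            self._permits -= 1

    def release(self) -> None:
        with self._changed:
            self._permits += 1
            self._changed.notify()  # wake one waiter, if any
```

The integer is only ever read or written while the mutex is held, which is the atomicity requirement the parent comment was pointing at.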

Funny, just went through three different manufacturers' PLC instruction sets and none of them listed test and set or any other kind of discernible atomic instruction. Do you have any references you could point me to? I'd be interested to know for sure.

If that's the case, there is no need for mutex or semaphore constructs because those are only needed for parallelism. Parent was stating that he's used both in PLCs. If that's the case, why? Either he has no clue, or you're wrong. I don't really care either way, just addressing the fact that you can't just implement a mutex as a flag or a semaphore as an integer as stated by parent.

In the end, what makes it harder is that the things you do in software are usually way more complex than the things you do in hardware. If not for that, that argument would make perfect sense, as hardware also has issues with communication latency, non-determinism, and localized data (not just caches), to an even higher degree than software. And you also must make it robust.

That is only true in a very theoretical sense and completely wrong in practice. Almost everything that burns CPU on your computer today is easily parallelizable: your video encoder doesn't really care if the other CPU is crunching away on the next scene in the movie, most image filters work just fine when applied to only a section of an image, a game could easily do AI for each unit in parallel, and your web browser shouldn't really care if each webpage is rendered on a separate thread either.

You are absolutely right on this one. Obviously we've hit limits of CPU performance and parallel is the way to go.

As an aside though let me point one other option which Intel was exploring in the late 80s-early 90s: break the CPU up into a series of processors each with different complementary instruction sets. Intel played around with 486/i960 combo where the 486 offered great task switching, high instruction speed, built in floating point and the i960 offered rapid vector calculation. IBM RS/6000 line

Because they require several different kinds of completely alien thinking; knowing imperative programming is almost a disadvantage to learning functional programming. Also, most functional languages were too purist to allow actually getting much real work done - the libraries were usually very weak so every problem involved reinventing dozens of wheels. Clojure [clojure.org] may have solved this problem - it isn't too purist and it can use all the Java libraries. It still requires a lot from the programmer, though perhap