I'm doing some research on multicore processors; specifically, I'm looking at both writing code for multicore processors and compiling code for them.

I'm curious about the major problems in this field that would currently prevent a widespread adoption of programming techniques and practices to fully leverage the power of multicore architectures.

I am aware of the following efforts (some of these don't seem directly related to multicore architectures, but seem to have more to do with parallel-programming models, multi-threading, and concurrency):

Erlang (I know that Erlang includes constructs to facilitate concurrency, but I am not sure how exactly it is being leveraged for multicore architectures)

OpenMP (seems mostly related to multiprocessing and leveraging the power of clusters)

Intel Threading Building Blocks (this seems to be directly related to multicore systems, which makes sense as it comes from Intel. In addition to defining certain programming constructs, it also seems to have features that tell the compiler to optimize the code for multicore architectures)

In general, from what little experience I have with multithreaded programming, I know that programming with concurrency and parallelism in mind is difficult. I am also aware that multithreaded programming and multicore programming are two different things. In multithreaded programming on a single-CPU system, you are ensuring that the CPU does not remain idle; as far as I know, you cannot truly run operations in parallel there. (As James pointed out, the OS can schedule different threads to run on different cores, but I'm more interested in describing the parallel operations in the language itself, or via the compiler.) In multicore systems, you should be able to perform truly parallel operations.

So it seems to me that currently the problems facing multicore programming are:

Multicore programming is a difficult concept that requires significant skill

There are no native constructs in today's programming languages that provide a good abstraction to program for a multicore environment

Other than Intel's TBB library, I haven't found efforts in other programming languages to leverage the power of multicore architectures for compilation (for example, I don't know whether the Java or C# compiler optimizes the bytecode for multicore systems, or even whether the JIT compiler does that)

I'm interested in knowing what other problems there might be, and if there are any solutions in the works to address these problems. Links to research papers (and things of that nature) would be helpful. Thanks!

EDIT

If I had to condense my question down to one sentence, it would be this: What are the problems that face multicore programming today and what research is going on in the field to solve these problems?

UPDATE

It also seems to me that there are three levels where multicore needs to be concerned:

Language level: Constructs/concepts/frameworks that abstract parallelization and concurrency and make it easy for programmers to express them

Compiler level: If the compiler is aware of what architecture it is compiling for, it can optimize the compiled code for that architecture.

OS level: The OS optimizes the running process and perhaps schedules different threads/processes to run on different cores.
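A minimal Java sketch of how the three levels can divide the work (Java only because the question mentions it; `LevelsDemo` and its methods are hypothetical names). The language/library level is the `.parallel()` request on a standard `LongStream`: the program says nothing about threads or cores. The runtime splits the range across the worker threads of the common `ForkJoinPool`, and the OS decides which cores run those threads. Nothing here relies on the compiler or JIT emitting multicore-specific code.

```java
import java.util.stream.LongStream;

// Language/library level: data parallelism is requested, not hand-built.
class LevelsDemo {
    // Sum of squares of 1..n, expressed without any explicit threads.
    static long sumOfSquares(long n) {
        return LongStream.rangeClosed(1, n)
                .parallel()        // ask the runtime to split the work
                .map(x -> x * x)   // pure function: no shared mutable state
                .sum();            // associative reduction, safe to combine in any order
    }

    // Hardware threads visible to the JVM; the OS schedules workers onto these.
    static int cores() {
        return Runtime.getRuntime().availableProcessors();
    }
}
```

For example, `LevelsDemo.sumOfSquares(10)` returns 385 regardless of how many cores the machine has.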

I've searched on ACM and IEEE and have found a few papers. Most of them talk about how difficult it is to think concurrently and also how current languages don't have a proper way to express concurrency. Some have gone so far as to claim that the current model of concurrency that we have (threads) is not a good way to handle concurrency (even on multiple cores). I'm interested in hearing other views.

Can you explain how designing for parallel programming models and multi-threading is different from programming for multicores?
– James Black, Sep 5 '10 at 19:47

If your language supports multi-core and your OS has good support, it can ensure the threads run on different cores or processors, so your multithreaded app can run concurrently.
– James Black, Sep 5 '10 at 19:49

@James I think parallel programming and multicore programming have more in common than multithreading and multicore programming. I should have mentioned in my question that I was talking about multithreading on a single-CPU system. I can see how the OS could schedule different threads to run on different cores, but I am more interested in using the language/compiler itself to describe the parallelism.
– Vivin Paliath, Sep 5 '10 at 19:54

@Vivin, there really isn't a difference between parallel/multithreaded/multicore programming, from a development standpoint, at least. Libraries such as Intel Threading Building Blocks and OpenMP are merely abstractions that make parallelization easier by hiding the gritty details. Multicore programming is essentially the same as multi-processor programming (the devil is in the details, though). You simply have more than one physical CPU on which to execute threads. Additionally, you'll sometimes find differing behavior between single- and multi-processor machines.
– Nathan Ernst, Sep 6 '10 at 3:56
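To make "hiding the gritty details" concrete, here is a hypothetical hand-rolled Java parallel sum (`ManualParallelSum` is an illustrative name, not from any library) with the partitioning, thread creation, joining and combining written out explicitly. This is roughly the bookkeeping that parallel-loop constructs in libraries like TBB or OpenMP take care of for you. Nothing in it distinguishes multicore from multiprocessor machines: the OS simply places the threads on whatever CPUs or cores exist.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-rolled parallel sum: every step a parallel-loop library would hide is explicit.
class ManualParallelSum {
    static long sum(long[] data, int nThreads) {
        long[] partial = new long[nThreads];                  // one result slot per thread
        List<Thread> threads = new ArrayList<>();
        int chunk = (data.length + nThreads - 1) / nThreads;  // ceiling division
        for (int t = 0; t < nThreads; t++) {
            final int id = t;
            final int from = Math.min(data.length, t * chunk);
            final int to = Math.min(data.length, from + chunk);
            Thread worker = new Thread(() -> {
                long s = 0;                                   // accumulate locally,
                for (int i = from; i < to; i++) s += data[i];
                partial[id] = s;                              // then write once
            });
            threads.add(worker);
            worker.start();                                   // the OS picks the core
        }
        for (Thread worker : threads) {
            try {
                worker.join();                                // also makes partial[] visible
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
        }
        long total = 0;
        for (long p : partial) total += p;
        return total;
    }
}
```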

5 Answers

The major problems with multicore programming are the same as those of writing any other concurrent application. But whereas it used to be uncommon to have multiple CPUs in a computer, it is now hard to find a modern computer with only one core, so taking advantage of multicore, multi-CPU architectures brings new challenges.

But this is an old problem: whenever computer architectures get ahead of what compilers can exploit, the fallback solution seems to be to move back toward functional programming. That paradigm, if strictly followed, can yield very parallelizable programs because, for example, there are no global mutable variables.
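A minimal Java sketch of why the absence of shared mutable state matters (`SharedVsPure` and its methods are hypothetical names). Both methods count the elements of a range with a parallel stream. The first mutates one shared variable from many threads with an unsynchronized read-modify-write; the second only combines the results of a side-effect-free function, which is the style functional programming enforces.

```java
import java.util.stream.IntStream;

class SharedVsPure {
    // BROKEN on purpose: parallel threads do counter[0]++ without synchronization,
    // so increments can be lost and the result may be smaller than n.
    static long racyCount(int n) {
        long[] counter = {0};
        IntStream.range(0, n).parallel().forEach(i -> counter[0]++);
        return counter[0];
    }

    // Pure version: each element maps to 1, and the runtime sums partial results.
    static long pureCount(int n) {
        return IntStream.range(0, n).parallel().mapToLong(i -> 1L).sum();
    }
}
```

On a multicore machine `racyCount(1_000_000)` can return less than 1,000,000; `pureCount` always returns `n`.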

But not all problems are easy to solve with FP, so the goal is to make other programming paradigms easy to use on multicores as well.

The first obstacle is that many programmers have avoided writing good multithreaded applications, so there isn't a large pool of developers prepared for this, and the habits they have learned will make their coding harder.

But, as with most changes to the CPU, you can look at how to change the compiler, and for that you can look at Scala, Haskell, Erlang and F#.

For libraries, you can look at Microsoft's Parallel Extensions to the .NET Framework as a way to make concurrent programming easier.

I'm at work and can't check, but recently either IEEE Spectrum or IEEE Computer had articles on multicore programming issues, so look at the IEEE and ACM articles written on these issues to get more ideas about what is being investigated.

I think the biggest impediment will be getting programmers to change languages, as FP is very different from OOP.

Besides developing languages that work well this way, one area of research is how to handle multiple threads accessing memory. As with much in this area, Haskell seems to be at the forefront in testing ideas for this, so look at what is going on with Haskell.
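One idea explored in Haskell for shared memory is software transactional memory (STM), shipped with GHC as the `stm` library. Java has no STM in its standard library, so this sketch shows only the simplest related idea: an optimistic, lock-free update of a single `AtomicReference`. A thread reads an immutable snapshot, builds a new one, and publishes it with compare-and-set only if nobody replaced the snapshot meanwhile; otherwise it retries. `OptimisticAccount` is a hypothetical name, and this is not STM, which extends the retry idea to many variables at once.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.stream.IntStream;

class OptimisticAccount {
    // Immutable snapshot: both fields always change together.
    static final class Snapshot {
        final long balance;
        final int txCount;
        Snapshot(long balance, int txCount) {
            this.balance = balance;
            this.txCount = txCount;
        }
    }

    private final AtomicReference<Snapshot> state =
            new AtomicReference<>(new Snapshot(0, 0));

    void deposit(long amount) {
        while (true) {
            Snapshot current = state.get();
            Snapshot next = new Snapshot(current.balance + amount, current.txCount + 1);
            if (state.compareAndSet(current, next)) {
                return; // committed
            }
            // Another thread committed first: loop and retry with its snapshot.
        }
    }

    Snapshot snapshot() {
        return state.get();
    }

    // Usage helper: n concurrent deposits of 1, with no lock anywhere.
    static OptimisticAccount depositInParallel(int n) {
        OptimisticAccount acc = new OptimisticAccount();
        IntStream.range(0, n).parallel().forEach(i -> acc.deposit(1));
        return acc;
    }
}
```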

Ultimately there will be new languages, and we may get DSLs that abstract more of the details away from the developer, but educating programmers in all this will be a challenge.

I'm curious about the major problems in this field that would currently prevent a widespread adoption of programming techniques and practices to fully leverage the power of multicore architectures.

Inertia. (BTW: that's pretty much the answer to all "what does prevent the widespread adoption" questions, whether that be models of parallel programming, garbage collection, type safety or fuel-efficient automobiles.)

We have known since the 1960s that the threads+locks model is fundamentally broken. By ~1980, we had about a dozen better models. And yet, the vast majority of languages that are in use today (including languages that were newly created from scratch long after 1980), offer only threads+locks.
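A minimal Java sketch of one commonly cited weakness of threads+locks: locks don't compose. A transfer needs two locks; if one thread runs `transfer(a, b)` while another runs `transfer(b, a)` and each locks its "from" account first, both can wait forever. The usual fix below is a global lock-ordering convention (lowest `id` first), which nothing in the language enforces. `LockOrdering`, `Account` and the helper are hypothetical names.

```java
import java.util.concurrent.locks.ReentrantLock;

class LockOrdering {
    static final class Account {
        final int id;        // used only to define a global lock order
        long balance;        // guarded by lock
        final ReentrantLock lock = new ReentrantLock();
        Account(int id, long balance) { this.id = id; this.balance = balance; }
    }

    static void transfer(Account from, Account to, long amount) {
        // Always lock the smaller id first, whatever the transfer direction.
        Account first = from.id < to.id ? from : to;
        Account second = from.id < to.id ? to : from;
        first.lock.lock();
        try {
            second.lock.lock();
            try {
                from.balance -= amount;
                to.balance += amount;
            } finally {
                second.lock.unlock();
            }
        } finally {
            first.lock.unlock();
        }
    }

    // Usage helper: two threads transfer in opposite directions; returns final balances.
    static long[] runOpposingTransfers(int iterations) {
        Account a = new Account(1, 1_000), b = new Account(2, 1_000);
        Thread t1 = new Thread(() -> { for (int i = 0; i < iterations; i++) transfer(a, b, 1); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < iterations; i++) transfer(b, a, 1); });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new long[] { a.balance, b.balance };
    }
}
```

Swapping `first`/`second` for plain `from`/`to` would compile just as happily and could deadlock; that gap between "compiles" and "correct" is the complaint.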

Thanks! What are the better models? I'm assuming that you're talking about the actor-model and "promises and futures"? I see that this is pretty common in concurrent/coordinating languages.
– Vivin Paliath, Sep 7 '10 at 23:56


@Vivin Paliath: join-calculus, π-calculus, actor model, CSP, message-passing in general, dataflow, nested data parallelism, software transactional memory, transactions in general, agents, ... There's a lot, and we really haven't yet found out which ones work. We only really know that threads+locks don't. As long as we haven't figured that out yet, I like the Clojure model: instead of picking one concurrency model, provide a sane update model and build many different concurrency models on top of it (I believe there are six in Clojure 1.2 right now).
– Jörg W Mittag, Sep 8 '10 at 0:06
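Of the models listed in that comment, the actor model is easy to sketch with the Java standard library. In this hypothetical `CounterActor` (not Erlang's or any actor library's API), the counter's state is owned by a single worker thread and changes only in response to messages taken from a mailbox, so the state itself needs no locks. Queries are answered through `CompletableFuture`s, which also touches on the "promises and futures" mentioned above.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;

class CounterActor {
    private interface Message {}
    private static final class Increment implements Message {
        final long by;
        Increment(long by) { this.by = by; }
    }
    private static final class Get implements Message {
        final CompletableFuture<Long> reply = new CompletableFuture<>();
    }
    private static final class Stop implements Message {}

    private final BlockingQueue<Message> mailbox = new LinkedBlockingQueue<>();
    private long count = 0; // read and written only by the worker thread

    CounterActor() {
        Thread worker = new Thread(this::loop);
        worker.setDaemon(true); // don't keep the JVM alive if stop() is never sent
        worker.start();
    }

    private void loop() {
        try {
            while (true) {
                Message m = mailbox.take(); // handle one message at a time
                if (m instanceof Increment) {
                    count += ((Increment) m).by;
                } else if (m instanceof Get) {
                    ((Get) m).reply.complete(count);
                } else if (m instanceof Stop) {
                    return;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // exit if interrupted
        }
    }

    // The actor's interface: each call only enqueues a message; callers never touch count.
    void increment(long by) { mailbox.add(new Increment(by)); }

    CompletableFuture<Long> get() {
        Get g = new Get();
        mailbox.add(g);
        return g.reply;
    }

    void stop() { mailbox.add(new Stop()); }
}
```

Because the mailbox is FIFO, a `get()` sent after many `increment()` calls have returned sees all of them, even if those calls came from different threads.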

One of the answers mentioned the Parallel Extensions for the .NET Framework, and since you mentioned C#, it's definitely something I would investigate. Microsoft has done some interesting things there, though many of their efforts seem to me better suited to language enhancements in C# than to a separate, distinct library for concurrent programming. But I think their efforts deserve applause, and we're still early here. (Disclaimer: I was the marketing director for Visual Studio about 3 years ago.)

The Intel Threading Building Blocks are also quite interesting (Intel recently released a new version, and I'm excited to head down to Intel Developer Forum next week to learn more about how to use it properly).

Lastly, I work for Corensic, a software quality startup in Seattle. We've got a tool called Jinx that is designed to detect concurrency errors in your code. A 30-day trial edition is available for Windows and Linux, so you might want to check it out. (www.corensic.com)

In a nutshell, Jinx is a very thin hypervisor that, when activated, slips in between the processor and operating system. Jinx then intelligently takes slices of execution and runs simulations of various thread timings to look for bugs. When we find a particular thread timing that will cause a bug to happen, we make that timing "reality" on your machine (e.g., if you're using Visual Studio, the debugger will stop at that point). We then point out the area in your code where the bug was caused. There are no false positives with Jinx. When it detects a bug, it's definitely a bug.
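As a toy illustration of the general idea of exploring thread timings systematically (this is NOT how Jinx is implemented; `InterleavingExplorer` is a hypothetical name), the sketch below enumerates every interleaving of two threads that each run the non-atomic increment `tmp = x; x = tmp + 1;` and counts how many schedules end with each final value of `x`. Of the 6 possible schedules, only the 2 fully sequential ones give the intended 2; the other 4 lose an update, a bug that a stress test may hit only occasionally.

```java
import java.util.Map;
import java.util.TreeMap;

class InterleavingExplorer {
    static final int STEPS = 2; // step 0: read x into tmp; step 1: write tmp + 1 to x

    static Map<Integer, Integer> finalValues() {
        Map<Integer, Integer> counts = new TreeMap<>();
        explore(new int[2], 0, new int[2], counts);
        return counts;
    }

    // pc[t]: next step of thread t; x: shared variable; tmp[t]: thread t's local copy.
    private static void explore(int[] pc, int x, int[] tmp, Map<Integer, Integer> counts) {
        if (pc[0] == STEPS && pc[1] == STEPS) {
            counts.merge(x, 1, Integer::sum);   // one complete schedule finished
            return;
        }
        for (int t = 0; t < 2; t++) {
            if (pc[t] == STEPS) continue;       // thread t has already finished
            int[] pc2 = pc.clone();
            int[] tmp2 = tmp.clone();
            int x2 = x;
            if (pc2[t] == 0) {
                tmp2[t] = x;                    // read
            } else {
                x2 = tmp2[t] + 1;               // write
            }
            pc2[t]++;
            explore(pc2, x2, tmp2, counts);     // continue this schedule
        }
    }
}
```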

Jinx works on Linux and Windows, and in both native and managed code. It is language and application platform agnostic and can work with all your existing tools.

If you check it out, please send us feedback on what works and doesn't work. We've been running Jinx on some big open source projects and are already seeing situations where Jinx can find bugs 50-100 times faster than simply stress testing code.

Look up fork/join frameworks and work-stealing runtimes: two names for the same, or at least closely related, approaches, which recursively subdivide large tasks into lightweight units so that all available parallelism is exploited without having to know in advance how much parallelism there is. The idea is that the code should run at serial speed on a uniprocessor but get a linear speedup with multiple cores.

Sort of a horizontal analogue of cache-oblivious algorithms if you look at it right.
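Java's standard library includes a fork/join framework built on work stealing (`ForkJoinPool`, `RecursiveTask`). Here is a minimal sketch of the recursive subdivision described above; the class name `ForkJoinSum` and the 10,000-element serial cutoff are assumptions chosen for illustration, and in practice the cutoff would be tuned by profiling.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Split the range in half until it is small, fork one half, compute the other,
// then join. Idle pool workers steal forked halves from busy workers' deques,
// so the code never needs to know how many cores exist.
class ForkJoinSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000; // assumed serial cutoff
    private final long[] data;
    private final int from, to;

    ForkJoinSum(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {          // small enough: run serially
            long s = 0;
            for (int i = from; i < to; i++) s += data[i];
            return s;
        }
        int mid = (from + to) >>> 1;
        ForkJoinSum left = new ForkJoinSum(data, from, mid);
        ForkJoinSum right = new ForkJoinSum(data, mid, to);
        left.fork();                           // make the left half available for stealing
        long r = right.compute();              // work on the right half in this thread
        return left.join() + r;                // wait for the (possibly stolen) left half
    }

    static long sum(long[] data) {
        return ForkJoinPool.commonPool().invoke(new ForkJoinSum(data, 0, data.length));
    }
}
```

On one core the whole tree simply runs in the calling/worker thread with little overhead beyond task objects; with more cores, stolen subtrees run in parallel.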

But I'd say the main problem facing multicore programming is that the great majority of computations remain stubbornly serial. There's just no way to throw multiple cores at those computations and make them stick.
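A standard way to quantify this point (not mentioned in the answer itself) is Amdahl's law: if a fraction p of the work can be parallelized and the rest is serial, the best-case speedup on n cores is 1 / ((1 - p) + p / n), ignoring memory and synchronization overheads. A minimal Java sketch with a hypothetical `Amdahl` class:

```java
// Amdahl's law: upper bound on speedup when only part of the work parallelizes.
class Amdahl {
    static double speedup(double parallelFraction, int cores) {
        double serial = 1.0 - parallelFraction;            // the stubbornly serial part
        return 1.0 / (serial + parallelFraction / cores);
    }
}
```

With p = 0.9, ten cores give at most about 5.3x, and with p = 0.5 no number of cores gets past 2x.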

The bottleneck of any high-performance application (written in C or C++) designed to make efficient use of more than one processor/core is the memory system (caches and RAM). A single core usually saturates the memory system with its reads and writes, so it is easy to see why adding extra cores and threads can make an application run slower. If a queue of people passes through a door one at a time, adding extra queues will not only clog the door but also make each individual's passage through it less efficient.

The key to any multi-core application is optimizing and economizing on memory accesses. This means structuring data and code to work as much as possible inside their own caches, where they don't disturb the other cores with accesses to the shared cache (L3) or RAM. Once in a while a core needs to venture there, but the trick is to minimize those situations. In particular, data needs to be structured around and adapted to cache lines and their sizes (currently typically 64 bytes), and code needs to be compact and not call and jump all over the place, which also disrupts pipelines.
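One concrete case of adapting data to cache lines is avoiding false sharing, where per-thread data that happens to share a line forces cores to keep invalidating each other's copies. A minimal Java sketch, assuming 64-byte lines (the class `PaddedCounters` and its `STRIDE` are hypothetical): with 8-byte `long`s, a stride of 8 slots puts each thread's counter 64 bytes from its neighbour, so no two counters can share a line. Java gives no control over array alignment; the stride alone is what guarantees separation. With stride 1, adjacent counters could share a line.

```java
// Per-thread counters padded to (assumed) 64-byte cache lines.
class PaddedCounters {
    static final int STRIDE = 8; // 64-byte line / 8-byte long (assumption)
    private final long[] slots;

    PaddedCounters(int threads) { slots = new long[threads * STRIDE]; }

    // Called only by thread t, so its slot needs no synchronization.
    void add(int t, long delta) { slots[t * STRIDE] += delta; }

    // Call only after all writer threads have been joined.
    long total() {
        long s = 0;
        for (int i = 0; i < slots.length; i += STRIDE) s += slots[i];
        return s;
    }

    // Usage helper: each thread increments its own padded slot n times.
    static long run(int threads, int n) {
        PaddedCounters c = new PaddedCounters(threads);
        Thread[] ts = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            ts[t] = new Thread(() -> { for (int i = 0; i < n; i++) c.add(id, 1); });
            ts[t].start();
        }
        for (Thread th : ts) {
            try {
                th.join();   // join() also makes each thread's writes visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return c.total();
    }
}
```

Whether the padding pays off on a given machine is exactly the kind of thing the profiling mentioned below should decide.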

My experience is that efficient solutions are unique to the application in question. The generic guidelines above are a basis on which to construct code, but the tweaks that result from profiling will not be obvious to anyone who was not involved in the optimization work.