It is time to make Parallelism a full First Class Citizen in C and C++. Hardware is once again ahead of software, and we need to close the gap so that application development can better utilize the hardware without low-level programming.

The time has come for high level constructs for task and data parallelism to be explicitly added to C and C++. This will enable Parallel Programming in C and C++ to be fully portable, easily intelligible, and consistently decipherable by a compiler.

Language solutions are superior to library solutions, though libraries provide fertile ground for exploration. Intel Threading Building Blocks (TBB) has been the most popular library solution for parallelism in C++. For more than five years, TBB has grown and proven itself. It is time to take the next step and move to language solutions to support what we can now be confident is needed in C and C++.

We need mechanisms for parallelism that have strong space-time guarantees, simple program understanding, and serialization semantics – all things we do not have in C and C++ today.

We should tackle task and data parallelism both, and as an industry we know how.

Task parallelism: the goal is to abstract this enough that a programmer is not explicitly mapping work to individual processor cores. Mapping should be the job of tools (including run-time schedulers), not explicit programming. Hence, the goal is to shift all programming to tasks, not threads. This has many benefits that have been demonstrated often. Simple fork/join support is fundamental (spawn/sync in Cilk Plus terminology). A “parallel for” avoids spawning iterations serially from a loop and thereby expresses the parallelism directly.

Data parallelism: the goal is to abstract this enough that a programmer is not explicitly mapping work to SIMD instructions vs. multiple processor cores vs. attached computing (GPUs or co-processors). Mapping should be the job of tools, not explicit programming. Hence, the goal is to shift all programming back to mathematical expressions, not intrinsics or explicitly parallel algorithm decompositions.

The solutions that show the most promise are documented in the Cilk™ Plus open specification (http://cilkplus.org). They are as follows:

For data parallelism:

Explicit array syntax, to eliminate the need to loop (serially) across elements in order to apply the same operation multiple times

Elemental functions, to eliminate the need for function authors to write anything other than the simple, single-wide version of a function. This leaves the compiler to create wider versions for efficiency by coalescing operations to match the width of the SIMD instructions.

Support for reduction operations in a way that makes semantic sense. For instance, this can be done via an explicit way to give parallel tasks private copies of data that shadow global data, even when the number of tasks is not known to the program explicitly. Perhaps this starts to overlap with task parallelism…

For task parallelism:

spawn/sync, to spawn a function as a task and to wait for spawned tasks to complete

parallel for, to avoid the need to serially spawn individual loop body instances – and to make it very clear that all iterations are ready to run concurrently.

Like other popular programming languages, neither C nor C++ was designed as a parallel programming language. Parallelism is always hidden from a compiler and needs “discovery.” Compilers are not good at complex discovery – they are much better at optimizing and packaging up things that are explicit. Explicit constructs for parallelism solve this and make compiler support more likely. The constructs do not need to be numerous, just enough for other constructs to build upon… fewer is better!

For something as involved as parallelism, incorporating parallel semantics into the programming language improves both the expressiveness of the language and the efficiency with which the compiler can implement parallelism.

Years of investigation and experimentation have had some great results. Compiler writers have found they can offer substantial benefits for ease of programming, performance, debugging and portability. These have appeared in a variety of papers and talks over the years, and could be the topic of future blogs.

Top of mind thoughts are:

Both C and C++ are important. No solution should be specific to only one.

There is strong value in adding some basic task parallelism and data parallel support as a first class citizen into both C and C++. The time has come.

Task parallelism: nothing is more proven than the simple spawn/sync and parallel for of Cilk Plus.

Data parallelism: nothing is simpler than extending syntax to make data-parallel operations explicit via array operations such as a[:] = b[:] + c[:]. Fortran 90 added similar capabilities over two decades ago!

Data parallelism: elemental functions have an important role and should be included.

Task parallelism goal: shift parallel programming to explicit language features, making parallelism easy to express and exploit, so that it can be optimized by a compiler and more easily tested and debugged for data races and deadlock.

We need strong space-time guarantees in parallelism constructs.

Data parallelism goal: shift programming to EXPLICIT and EASY-TO-FIND (exploitable) data parallelism, so all varieties of hardware can be addressed.

Everything proposed here is incredibly easy to teach. The power that can be placed underneath via a compiler is a big bonus, of course! That power makes all these very compelling. The data parallelism is immediately convincing by its compact form, but the task parallel constructs are convincing as well.

None of this is radical – and none of it need be proprietary.

If we keep it simple (KISS) and don’t get carried away adding other features, we can add these and make a fundamental and important advance for task and data parallelism in C and C++… the languages that lead the evolution to parallelism, and that are becoming more (not less) important in the future.

Comments (16)

We now use Intel Parallel Studio XE 2013 SP1 for Linux on openSUSE 12.3 and 13.1, and also on SUSE Enterprise Server 11 SP3. It works fine. After lots of tests we found the C/C++ Compiler (14.0) approximately 34 to 40% faster than the standard gcc 4.8 or 4.9. We use a mixed development environment and prefer CUDA for high-performance computing; I was surprised that ICC/ICPC works fine in our environment with CUDA 5.5. For performance-critical parts we only use the Parallel Studio C/C++ Compiler and Tools.

> [regarding C and C++] I would be (pleasantly) surprised if anyone found a solution that works well for both at once.

Many things work for both, and Cilk Plus is one of them. I agree many things can be difficult.
Cilk Plus is a very simple set of language extensions that look just fine in both C and C++.
If you think otherwise - let's find a way to update Cilk Plus to deal with it.

Q: Cilk is a tm for Intel. If PathScale takes the runtime and makes a Cilk-compatible solution what do we call it?

Implement the specification, and call it "Cilk(tm) Plus." [Yes, "Cilk" is registered to Intel Corp, originally registered by Cilk Arts, Inc.]
It is the same thing PathScale does for OpenMP(tm) [tm owned by the OpenMP Architecture Review Board, initially registered by SGI].

The Cilk trademark is available to anyone implementing the specification to refer to their implementation. This is just like the OpenMP trademark – pretty simple, and it protects the name "Cilk" so it remains available for the language specification. Intel did not register the trademark originally; we acquired it from a company called Cilk Arts so we could do this for the community when we open sourced Cilk Plus. The license will be spelled out in a future version of the specification. I've been busy with travel and haven't been able to finish it with the team – I'm the bottleneck in releasing the language. My bad… drop me a note if you need more before an updated specification is posted. The trademark for OpenMP had a similar evolution: it was owned by SGI at one time, but it all worked out, and it is now owned by the OpenMP board.

Cilk is a tm for Intel. If PathScale takes the runtime and makes a Cilk-compatible solution, what do we call it? /* Serious question */ I don't know what it's being called for GCC, but clarification so others don't get in trouble for making "PathScale Cilk" would be good. Thanks

My operating system, LoseThos, has support for explicit parallelization. How's the hopey, changey stuff working out for you? I happen to do graphics rendering for games with multicore because I don't use a GPU.

You can compile my kernel with a flag and it will never put cores in a HLT state, so they can be dispatched quicker with no intercore interrupt. Obviously, the lower the overhead on dispatching a job, the finer-grained the parallelization can be. I don't remember exactly, but I think I see improvement in execution time for as little as adding 10,000 numbers. Don't quote me. I think the benchmark should be how small a job you can parallelize and break even.