Just as with sequential sorting, parallel sorting has been studied for years, and the literature contains many efficient implementations, each with different tradeoffs. To implement a parallel sort_population function using native threads, you could select and hand-code one of those known algorithms.

But before doing that, here's the code for a parallel version of sort_population using TBB:

As you can see, TBB supplies a parallel sort, so all you need to do is replace the std::sort call in the serial version with a tbb::parallel_sort call. Even without completing the exercise of implementing a parallel sort with native threads, it's safe to conclude that the native version would require more code, would likely be more prone to bugs, and would suffer from the same performance portability problems as the two previous native code snippets.
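The substitution can be sketched roughly as follows. This is a minimal illustration rather than the article's own listing: the City record, the by_population comparator, and the USE_TBB build flag are all assumptions introduced here. With USE_TBB defined (and TBB installed and linked), the function calls tbb::parallel_sort; otherwise it falls back to the serial std::sort, so the two calls can be compared side by side.

```cpp
#include <algorithm>
#include <string>
#include <vector>

#ifdef USE_TBB                 // hypothetical build flag, not part of TBB
#include <tbb/parallel_sort.h>
#endif

// Hypothetical record type, assumed for illustration only.
struct City {
    std::string name;
    long population;
};

// Orders cities by ascending population.
inline bool by_population(const City& a, const City& b) {
    return a.population < b.population;
}

// Sorts the cities in place by population.
void sort_population(std::vector<City>& cities) {
#ifdef USE_TBB
    // Parallel version: same iterator range, same comparator.
    tbb::parallel_sort(cities.begin(), cities.end(), by_population);
#else
    // Serial version.
    std::sort(cities.begin(), cities.end(), by_population);
#endif
}
```

Because both calls take the same iterator range and comparator, nothing else in the function has to change when switching between them. Like std::sort, tbb::parallel_sort is not a stable sort, so cities with equal populations may end up in either order.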

Use Your Concurrency Platform Wisely

As the examples in this article show, using a concurrency platform over native threads has several advantages:

You can focus more on application features.

It's easier to write the code.

The library manages the concurrency, allowing the same code to adapt to new hardware platforms (almost) for free as the number of cores increases, memory hierarchies change, or architectures change in even more radical ways.

But those advantages don't remove the requirement to use concurrency libraries wisely, and they don't automatically guarantee future scalability on new platforms. Most libraries provide some features that are more future-proof than others. For example, the TBB library provides tbb::tbb_thread, which is just a thin wrapper around the native OS threading library. That makes the interface easy to misuse, potentially creating the same types of problems that you saw with the natively threaded example.

To avoid that, here are a few recommended guidelines to consider when using a concurrency platform. Following these will help ensure that your code's performance can adapt to new architectures as they arrive:

Avoid using "threads" in your application directly. Use the higher-level parallel algorithms provided by the concurrency platform of your choice instead. If you need to create a new algorithm, use the platform's more flexible tasking interfaces.

Avoid using thread IDs to divide up the data items or distribute work. For example, the OpenMP API provides routines that return the executing thread's ID (omp_get_thread_num) and the number of threads in the current team (omp_get_num_threads). It's easy to fall into the trap of using these routines to schedule work yourself in the hope of improving efficiency. But as new machine architectures emerge, your hand-coded scheduling may no longer be the best choice. Leave these decisions to the runtime library.

Avoid using persistent thread-local storage. Making specific threads hold specific information may force you into scheduling computations in a particular way. The OpenMP API provides a threadprivate directive, and the open-source version of TBB provides tbb::enumerable_thread_specific. Both of these features have safe uses, but use them sparingly (if at all) to avoid tying specific computations to specific threads.
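To make the last two guidelines concrete, here is a hedged sketch that sums a list of populations. The function names, the use of plain long long values, and the USE_TBB build flag are assumptions made for illustration. static_chunk shows the kind of thread-ID arithmetic the guidelines warn against; total_population expresses the same reduction with tbb::parallel_reduce (falling back to std::accumulate when USE_TBB is not defined), with no thread IDs and no per-thread accumulators.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

#ifdef USE_TBB                 // hypothetical build flag, not part of TBB
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>
#endif

// Half-open index range [begin, end).
struct Chunk {
    std::size_t begin;
    std::size_t end;
};

// Anti-pattern, shown for contrast: a static split of n items keyed on a
// thread ID. The schedule is baked into the application and cannot adapt
// to load imbalance or to a different machine. Requires nthreads > 0.
Chunk static_chunk(std::size_t tid, std::size_t nthreads, std::size_t n) {
    const std::size_t per = n / nthreads;
    const std::size_t extra = n % nthreads;  // first `extra` threads get one more
    const std::size_t begin = tid * per + std::min(tid, extra);
    const std::size_t end = begin + per + (tid < extra ? 1 : 0);
    return {begin, end};
}

// Preferred: describe the reduction and let the runtime pick the split.
long long total_population(const std::vector<long long>& pops) {
#ifdef USE_TBB
    using BlockedRange = tbb::blocked_range<std::size_t>;
    return tbb::parallel_reduce(
        BlockedRange(0, pops.size()),
        0LL,  // identity value for addition
        // Sums one subrange, starting from the partial sum passed in.
        [&pops](const BlockedRange& r, long long partial) {
            for (std::size_t i = r.begin(); i != r.end(); ++i) {
                partial += pops[i];
            }
            return partial;
        },
        std::plus<long long>());  // combines two partial sums
#else
    // Serial fallback so the sketch builds without TBB.
    return std::accumulate(pops.begin(), pops.end(), 0LL);
#endif
}
```

In the parallel_reduce version, the runtime decides how to split the range and which threads run each piece. The partial sums travel as ordinary values rather than living in thread-local storage, so no computation is tied to a particular thread.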

As multicore architectures become more widely available, getting better performance out of them requires some additional work from programmers. A wisely chosen programming model can result in applications that continue to scale as platforms advance. The key design choice is to express the concurrency in an application without explicitly managing the scheduling of that concurrency onto the hardware. Using a concurrency platform allows such designs; the platform's high-level algorithms and tasking interfaces allow you to express parallelism while leaving the low-level scheduling decisions to the runtime libraries.

Michael Voss is a Senior Staff Software Engineer at Intel Corp. Voss received his Ph.D. in Electrical Engineering from Purdue University in 2001. He is currently a Senior Staff Software Engineer in the Performance Analysis and Threading Lab at Intel and an adjunct Professor in the Edward S. Rogers Sr. Department of Electrical and Computer Engineering at the University of Toronto. His interests include parallel programming and optimizing compilers.