In Parallel Counting Sort: Part 1, I transformed the Counting Sort algorithm into a parallel implementation. On a quad-core hyper-threaded processor, this resulted in a 3X speedup when sorting arrays of 8-bit numbers and a 2.5X speedup when sorting arrays of 16-bit numbers, compared to the sequential/scalar implementation. To simplify parallel programming, I used the Intel Threading Building Blocks (TBB). Counting Sort is an O(n) algorithm with a small constant factor, performing n reads and n writes when sorting arrays of 8-bit or 16-bit numbers. The parallel implementation is up to 65X faster than STL sort() for large arrays of 8-bit numbers, and up to 77X faster for large arrays of 16-bit numbers.

In Part 1, a portion of the algorithm was not converted to run in parallel on multiple processor cores. In this installment, I parallelize the rest of the algorithm, pushing the limits of the processor and parallel programming capabilities.

Templatized Starting Point

Listing 1 shows a portion of the Parallel Counting Sort implementation for arrays of unsigned 8-bit and 16-bit numbers, which will be the starting point. The core of this implementation was developed in Part 1, except for the two templatizations.

At the top level, the two CountSortInPlace_TBB_L1() functions are the ones users call: one sorts an array of unsigned 8-bit numbers, and the other sorts an array of unsigned 16-bit numbers. These two overloaded functions are wrappers over the _internal_CountSortInPlace_TBB_L1() template function, and they restrict usage to arrays of unsigned 8-bit and 16-bit numbers. _internal_CountSortInPlace_TBB_L1() is the first templatization, handling both the unsigned 8-bit and 16-bit implementations.
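The wrapper pattern described above can be sketched roughly as follows. This is not the article's Listing 1: the names internalCountSortInPlace() and countSortInPlace() are hypothetical, and the internal template here is a plain sequential Counting Sort, so only the "two overloads in front of one template" structure is illustrated.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical stand-in for the internal template function.
// Sequential on purpose: the point is the wrapper structure, not the parallelism.
template <class T>
inline void internalCountSortInPlace(T* a, std::size_t n)
{
    assert(a != nullptr || n == 0);
    // One counter per possible value of T (256 for 8-bit, 65536 for 16-bit).
    const std::size_t numberOfCounts =
        static_cast<std::size_t>(std::numeric_limits<T>::max()) + 1;
    std::vector<std::size_t> count(numberOfCounts, 0);

    for (std::size_t i = 0; i < n; ++i)   // counting pass: n reads
        ++count[a[i]];

    T* out = a;                           // writing pass: n writes
    for (std::size_t v = 0; v < numberOfCounts; ++v) {
        std::fill(out, out + count[v], static_cast<T>(v));
        out += count[v];
    }
}

// Public overloads: only these two element types are accepted by name,
// so callers cannot accidentally instantiate the template for a wider type.
inline void countSortInPlace(std::uint8_t* a, std::size_t n)
{
    internalCountSortInPlace(a, n);
}

inline void countSortInPlace(std::uint16_t* a, std::size_t n)
{
    internalCountSortInPlace(a, n);
}
```

Exposing plain overloads rather than the template itself keeps the count array from being sized for, say, a 32-bit type, where one counter per possible value would be impractical.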

CountTemplateType is a templatization of the CountType developed in Part 1, to handle both unsigned 8-bit and 16-bit data types. Listing 2 shows the CountTemplateType implementation, where the main difference is the ability to set the count array size based on the data type. The numberOfCounts constant is derived from the _Type passed to the template.
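As a rough illustration of deriving the count array size from the element type, consider the sketch below. It is not Listing 2: the struct name CountArray is hypothetical, and using std::numeric_limits to compute numberOfCounts is an assumption about one reasonable way to do it.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical count holder whose array size follows from the element type.
template <class T>
struct CountArray {
    static_assert(std::is_same<T, std::uint8_t>::value ||
                  std::is_same<T, std::uint16_t>::value,
                  "only unsigned 8-bit and 16-bit element types are supported");

    // Largest representable value plus one: 256 for 8-bit, 65536 for 16-bit.
    static constexpr std::size_t numberOfCounts =
        static_cast<std::size_t>(std::numeric_limits<T>::max()) + 1;

    std::size_t count[numberOfCounts] = {};   // zero-initialized counters
};
```

With this arrangement, CountArray<std::uint8_t> carries 256 counters and CountArray<std::uint16_t> carries 65536, with no size parameter for the caller to get wrong.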

The rest of the Parallel Counting Sort implementation has a few slight improvements over Part 1. The overall implementation is more compact due to its use of templates, while being as simple as the one in Part 1.

The Parallel Counting Sort uses Intel TBB parallel_reduce() to split the counting portion of the algorithm among multiple cores. Each core keeps track of its own counts independently while processing its portion of the input array. The results are merged (joined) when the cores are done processing. parallel_reduce() enables the Parallel Counting Sort to scale with the number of processors available, while remaining oblivious to the actual number of processors it is running on.
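The count-then-join structure can be sketched without TBB. The imperative form of parallel_reduce() expects a body object with a splitting constructor, an operator() that processes a sub-range, and a join() that merges another body's result. The sketch below keeps simplified versions of the last two and drives them with an ordinary sequential loop over chunks, so it shows the data flow only and does not run in parallel. ByteCounter and countInChunks() are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical reduction body: private counts for one portion of the input.
struct ByteCounter {
    std::size_t count[256] = {};

    // Count the values in the sub-range [begin, end) of a.
    void operator()(const std::uint8_t* a, std::size_t begin, std::size_t end)
    {
        for (std::size_t i = begin; i < end; ++i)
            ++count[a[i]];
    }

    // Merge another worker's counts into this one.
    void join(const ByteCounter& other)
    {
        for (std::size_t v = 0; v < 256; ++v)
            count[v] += other.count[v];
    }
};

// Sequential stand-in for the scheduler: one private counter per chunk,
// joined into a running total. A parallel scheduler would run the chunks
// on different cores and join the partial results as they complete.
inline ByteCounter countInChunks(const std::uint8_t* a, std::size_t n,
                                 std::size_t chunkSize)
{
    assert(chunkSize > 0);
    ByteCounter total;
    for (std::size_t begin = 0; begin < n; begin += chunkSize) {
        ByteCounter partial;
        partial(a, begin, std::min(n, begin + chunkSize));
        total.join(partial);
    }
    return total;
}
```

Because addition of counts is associative and commutative, the totals do not depend on how the input is chunked, which is why the algorithm does not need to know the number of cores.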

The final step of the algorithm writes the sorted values back into the original input array (a[]), since Counting Sort is in-place. This writing step, which generates the sorted result array, is accomplished by two nested for loops running on a single processor. Accelerating this final writing step will be the focus of this article.
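For reference, a sequential write-back with two nested for loops might look like the sketch below. The function name writeBackSorted() and the use of std::vector for the counts are assumptions made for illustration, not the article's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical write-back step: counts[v] holds how many times value v
// occurred in the input; a[] is overwritten with the sorted sequence.
template <class T>
inline void writeBackSorted(T* a, std::size_t n,
                            const std::vector<std::size_t>& counts)
{
    std::size_t out = 0;                                  // next position to write
    for (std::size_t v = 0; v < counts.size(); ++v)       // each possible value
        for (std::size_t j = 0; j < counts[v]; ++j)       // repeated counts[v] times
            a[out++] = static_cast<T>(v);
    assert(out == n);                                     // counts must sum to n
    (void)n;                                              // silence unused warning when asserts are off
}
```

Note that the running index out is carried from one outer iteration to the next: where a value's run begins depends on the sum of all preceding counts. That dependency is what keeps this loop nest on a single core as written.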
