
A Robust Histogram for Massive Parallelism

By Rob Farber, September 24, 2013

Preserving highly parallel performance when every thread is simultaneously trying to increment a single object

A Dynamic Parallelism Histogram Example

It is fairly straightforward to create a histogram example program using shadowed classes. The source code in Listing Two, histogram.cu, demonstrates one such application. To make it interesting, dynamic parallelism is used to show how the ability to spawn device-side kernels cleans up the code, and how useful it is to be able to specify different execution configurations in kernel calls made on the device.

The example code increments only bin one when the preprocessor variable PATHOLOGICAL is defined. This provides a worst-case histogram data distribution, where all the data fits into a single bin.

Performance Comparison

Table 1 shows essentially no difference between the two ParallelCounter-based runtimes, demonstrating that the data distribution causes no performance degradation. The simpleHisto.cu example is just as susceptible as the NVIDIA histogram SDK example to performance degradation from a small number of bins or a non-uniform data distribution. The comparison of the two pathological cases reveals a more than 10x difference in performance!

Executable                    nSamples    nBins   Runtime
histogram.exe                 4 billion   16      538.15 ms
histogramPathological.exe     4 billion   16      538.37 ms
simpleHisto.exe               4 billion   16      711.87 ms
simpleHistoPathological.exe   4 billion   16      5710 ms

Table 1: Observed performance results using nvprof on a K20c GPU.

Conclusion

The example code in this tutorial demonstrates that a vector of ParallelCounter objects can provide more than an order-of-magnitude greater performance than a vector of atomically incremented counters when many or all of the threads need to increment the same bin in the histogram. Meanwhile, performance on uniformly distributed datasets is as fast or slightly faster. Based on these performance results, I recommend that CUDA programmers utilize some form of this low-wait parallel counter. (Multicore programmers can benefit from the low-wait approach as well.)

Astute readers will note that the histogram.cu example had to preallocate storage for a known number of results. The next article in this series will provide the ability to generate results in a massively parallel manner on the device where the number of results is not known beforehand.

The integration of the general-purpose SHADOW_MACRO() into the ParallelCounter class adds much of the transparency and simplicity of mapped memory without sacrificing speed. Host-side STL classes can work with SHADOW_MACRO()-enabled classes to leverage the power and convenience of the STL, and potentially deliver significant performance gains compared with the host. For these performance and convenience reasons, developers should consider incorporating SHADOW_MACRO() into their classes.

Rob Farber is a frequent contributor to Dr. Dobb's on CPU and GPGPU programming topics.

