A Massively Parallel Stack for Data Allocation

By Rob Farber, October 01, 2013

A fast, constant-time memory allocator and parallel stack are essential for initiating kernel launches from the CUDA device

To make this approach work, it is essential that no element in the count array exceed maxColCount. For this reason, the getIndex() method checks that the atomicAdd() never exceeds the maxColCount limit passed in the constructor. If it does, the count is reduced by atomicSub(), and getIndex() attempts to scavenge an index from a different element in the count array. The host version of getIndex() attempts to pick uniformly from all elements in the count array by choosing an index based on the CPU Time Stamp Counter (TSC). Picking from a uniform distribution keeps serial CPU code from causing unnecessary scavenge() operations on the GPU. It is assumed that the CUDA threads will request memory arbitrarily, which means that the value of threadIdx.x can be used to increment the count array uniformly. The code should be modified if either of these assumptions is incorrect.

Listing Three: Source for the BoundedParallelCounter class (part 1 of 2 of MappedTypeArray.hpp).

The MappedTypeArray Class

The MappedTypeArray class utilizes SHADOW_MACRO(), which means it can transfer itself for use on the device as well as the host. It is important to note that the constructor allocates an array of type T objects in mapped memory. For simplicity, mapped memory was required so the data could be accessed on both the host and device from the same pointer. The use of mapped memory is also why cudaMemcpy() operations on an object that contains a pointer work: the single mapped pointer can access the same data on both the host and device.

Much like malloc(), programmers call the MappedTypeArray alloc() method to get a new object of type T. For simplicity, no free() method is provided in this example. The programmer can use the allocated object on either the host or device, subject to the performance limitations of mapped memory (meaning the data is not cached, so each access can incur a PCIe data transfer cost). When required, the alloc() method will scavenge to find unallocated objects so that all the mapped memory objects will be utilized. The [] operator is defined along with size() so the allocated memory regions can be conveniently walked using a for loop. As will be demonstrated in the test code, the for loop can be a parallel OpenMP for loop on the host.

The best way to understand the MappedTypeArray class is to use it on both the host and a GPU in an example.

Test Code

Listing Five demonstrates how to use MappedTypeArray on both the host and GPU. Succinctly, this test code creates a histogram of nBins bins of class ParallelCounter. These bins are filled uniformly based on the integers in the range [0..nSamples) modulo nBins (for example, histo[tid % nBins]++). Both nBins and nSamples are set by user-provided command-line arguments.

Adding objects to the MappedTypeArray object on both the host and device demonstrates the usefulness of this class for complex code.

The counts of each bin of this histogram are saved in parallel on the device to output, the device side of a MappedTypeArray object. Note that output was preloaded with a few objects of HistoType on the host. In this example, a kernel-side printf() reports the size of output before and after the update. After the device-side update to output, a few more HistoType objects were added from the host side of the MappedTypeArray object.

The code uses __is_standard_layout() in an assertion to ensure that the programmer has not instantiated the MappedTypeArray template with a class that violates the C++ standard-layout requirements.

An OpenMP parallel for loop demonstrates how to use a host-side parallel loop to walk the allocated HistoType objects. This parallel loop fills a host-side STL vector that is used to sort the data on the host. The results are then sanity-checked for consistency.
