CUDA: Unifying Host/Device Interactions with a Single C++ Macro

By Rob Farber, September 16, 2013

A general method to move data transparently between the host and the CUDA device.

Eventually, CUDA will offer some form of high-performance cached mapped memory. Until then, the SHADOW_MACRO() code is useful because it encapsulates device-side data allocation and data movement between the host and device. Regardless of how the data moves between the device(s), the programmer must ensure that all C++ classes provide a data layout that is usable on both the host and the device. This implies that C++ type traits and C++ compiler methods to check type traits will remain important for the foreseeable future.

A Test Code

The following code demonstrates how to use ParallelCounter on both the host and GPU for several common scenarios:

Allocate an object on the stack on the host, use it on the device, and get the results on the host.

Dynamically allocate the object with new on the host and use it on the device.

Dynamically allocate an array of objects on the host and use one of them on the device.

Create an STL vector of objects and use one on the device.

Dynamically allocate an array of objects on the host and initialize them in parallel on the host prior to using one on the device.

Dynamically allocate an array of objects on the device and use one in a calculation on the device.

Walking through the code starting at main() shows that it uses C++ exceptions to catch errors. Currently, GPU kernels and CUDA library functions do not throw exceptions on errors, which is why this example uses cudaPeekAtLastError() to decide if an error needs to be thrown. The cudaPeekAtLastError() method does not clear the error, so cudaGetLastError() can be used to retrieve the error for printing with cudaGetErrorString().
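The error-handling pattern described above can be wrapped in a small helper. This is a sketch, not the article's exact listing; it relies only on the documented CUDA runtime calls named in the text:

```cpp
#include <stdexcept>
#include <cuda_runtime.h>

// Throw a C++ exception if the most recent CUDA call or kernel launch failed.
// cudaPeekAtLastError() reports the error without clearing it;
// cudaGetLastError() then retrieves (and clears) it for the message string.
inline void throwOnCudaError()
{
  if (cudaPeekAtLastError() != cudaSuccess)
    throw std::runtime_error(cudaGetErrorString(cudaGetLastError()));
}

// Typical use after an asynchronous kernel launch:
//   doTest1<<<nBlocks, nThreads>>>(foo.d_ptr(), nSamples);
//   throwOnCudaError();
```

Because kernel launches are asynchronous, this catches launch-configuration errors immediately; errors that occur during kernel execution surface at the next synchronizing call.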

The application requires two command-line arguments:

The number of times the counter will be incremented. For consistency with the histogram example, this number is referred to as nSamples.

The number of the CUDA device to use, which makes it easy to compare Fermi and Kepler GPU performance in mixed GPU systems.

For convenience, the application prints out information about the runtime configuration. In particular, note that the C-preprocessor variable SPREAD can be defined at compile time to test the impact of distributing the atomic operations across various sizes of the internal ParallelCounter count vector. See the previous article for a detailed discussion of the ParallelCounter class.
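A compile-time definition such as SPREAD is passed on the nvcc command line. The file name and values below are illustrative:

```shell
# Build with a 32-way spread of the internal atomic counters (names illustrative)
nvcc -O3 -DSPREAD=32 testParallelCounter.cu -o testParallelCounter

# Run: nSamples followed by the CUDA device number
./testParallelCounter 100000000 0
```

Rebuilding with different -DSPREAD values makes it easy to measure how distributing the atomic operations affects performance on a given GPU.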

Instantiate an Object on the Host Stack, Use it on the Device, and Get the Results on the Host.

A very common scenario is to instantiate one or more objects on the stack, use them, and destroy them once they go out of scope.

In the following test, the object foo is instantiated, and the constructor initializes it to zero on the host. Aside from the CUDA kernel call, this example looks like standard C++ running on the host.

The interface to the GPU starts with the call to the d_ptr() method that passes the pointer of a device-side copy of foo to the doTest1() kernel. This method initiates a set of actions that include: allocating space for a ParallelCounter object on the device, using cudaMemcpy() to transfer the contents of the host version to the device, and returning the pointer to the space for the device-side object. The host then queues the doTest1() kernel to run on the GPU.

CUDA kernel calls are asynchronous, which means the host immediately proceeds to call the getCount() method. The first thing getCount() does is call cpyDtoH(), which ultimately calls cudaMemcpy(). CUDA programmers know that cudaMemcpy() waits until all kernels have completed before performing the copy. Once the data is on the host, the getCount() method can then finish and return the overall sum of how many times foo has been incremented in parallel by the GPU. An assert() on the host then checks that this count is correct.
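The flow just described can be sketched as follows. This assumes the ParallelCounter/SHADOW_MACRO interface discussed in the prose (d_ptr(), getCount(), an overloaded += on the device); the full class appears in the previous article, so treat the details here as illustrative:

```cpp
#include <cassert>
#include <cstdint>

// Device-side increment kernel (test 1); each thread bumps the counter once.
__global__ void doTest1(ParallelCounter<SPREAD> *counter, uint32_t nSamples)
{
  uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < nSamples) *counter += 1;
}

void test1(uint32_t nSamples)
{
  ParallelCounter<SPREAD> foo;   // constructed (zeroed) on the host stack

  int nThreads = 256;
  int nBlocks  = (nSamples + nThreads - 1) / nThreads;

  // d_ptr() allocates the device-side copy, transfers foo's contents with
  // cudaMemcpy(), and returns the device pointer for the kernel.
  doTest1<<<nBlocks, nThreads>>>(foo.d_ptr(), nSamples);

  // getCount() calls cpyDtoH(); the underlying cudaMemcpy() waits for the
  // kernel to finish, so the host-side sum is consistent.
  assert(foo.getCount() == nSamples);
}   // foo goes out of scope and is destroyed as usual
```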

Dynamically Allocate the Object with new on the Host and Use it on the Device.

A variation of the previous scenario is to dynamically allocate the object on the host with the new operator. Aside from the use of a pointer to foo, Listing Six (test 2) is identical to the first test.
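The heap-allocated variant differs only in how foo is created and accessed. Again a sketch against the interface described in the prose, not the article's exact Listing Six:

```cpp
#include <cassert>
#include <cstdint>

void test2(uint32_t nSamples)
{
  // Dynamically allocate the counter on the host with new.
  ParallelCounter<SPREAD> *foo = new ParallelCounter<SPREAD>;

  int nThreads = 256;
  int nBlocks  = (nSamples + nThreads - 1) / nThreads;

  doTest1<<<nBlocks, nThreads>>>(foo->d_ptr(), nSamples);
  assert(foo->getCount() == nSamples);

  delete foo;   // the destructor also releases the device-side copy
}
```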

Create an STL Vector of Objects to Use on the Device.

CUDA C++ does not support the use of the STL (Standard Template Library) on the device. However, it is important that host-side STL containers be able to hold GPU-accelerated C++ objects. Listing Eight (test 4) uses a host-side STL vector rather than a C array to demonstrate that objects wrapped with SHADOW_MACRO() retain both the convenience of STL containers and their GPU acceleration.
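A minimal sketch of this scenario follows; the container lives entirely on the host, and only the device-side copy created by d_ptr() is touched by the kernel (interface names are from the prose, details illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

void test4(uint32_t nSamples)
{
  // The STL vector and its elements exist only in host memory.
  std::vector< ParallelCounter<SPREAD> > vec(10);

  int nThreads = 256;
  int nBlocks  = (nSamples + nThreads - 1) / nThreads;

  // d_ptr() creates a device-side copy of one element; the kernel never
  // sees the std::vector itself.
  doTest1<<<nBlocks, nThreads>>>(vec[0].d_ptr(), nSamples);
  assert(vec[0].getCount() == nSamples);
}
```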
