CUDA: Unifying Host/Device Interactions with a Single C++ Macro

By Rob Farber, September 16, 2013

A general method to move data transparently between the host and the CUDA device.

To facilitate the creation and use of complex data structures such as trees and graphs on both the host and device (a convenient and arguably essential requirement for dynamic parallelism), this article will focus on a set of macros that can be used in a broad spectrum of C++ classes to mimic the convenience of mapped memory. Use of the presented SHADOW_MACRO() avoids the two-orders-of-magnitude decrease in CUDA performance when data in mapped memory is heavily reused, as demonstrated in the previous article in this series, Atomic Operations and Low-Wait Algorithms in CUDA.

NVIDIA is clearly moving toward a unified virtual environment, where objects will be transparently accessible from both host and GPU devices. NVIDIA does not currently support the caching of mapped memory on the device (as of CUDA 5), which is why I focus on a general method to move data reasonably transparently between the host and device.

I provide a complete working histogram example that shows how simple it is to use SHADOW_MACRO() in real applications. The histogram is implemented using the ParallelCounter class from the previous article, which preserves highly parallel performance even when confronted with pathological situations where every thread is simultaneously trying to increment a single ParallelCounter object.

Why Use a Macro Instead of C++ Inheritance?

Guaranteeing that all the devices can use the layout of a C++ object is a constant challenge for CUDA C++ programmers in a multi-device environment. The use of cudaMemcpy() to transfer data is analogous to the use of memcpy() (or read and write operations) to transfer data to/from disk or between multiple devices. As a result, CUDA C++ developers can leverage existing work by the C++ standards committee and the C++ compiler development community to define the conditions under which sequential byte-oriented copy operations can be used without affecting the usability of a C++ object.

The previous article discussed POD_structs, an early C++98 definition of a stringent set of conditions under which layout compatibility is guaranteed after a byte-copy operation. Unfortunately, POD_structs do not allow user-defined constructors or destructors and require that all data be declared public (meaning no protected or private data), among other losses of useful C++ capabilities.

The end result of this discussion has been a revision to the C++ type_traits definitions. Key to this article are the is_standard_layout() and the more general is_trivially_copyable() methods to check the copyability of a C++ class:

is_standard_layout(): A standard-layout type is a type with a simple linear data structure and access control that can easily be used to communicate with code written in other programming languages, such as C, either cv-qualified or not. This is true for scalar types, standard-layout classes, and arrays of any such types.

is_trivially_copyable(): A trivially copyable type is a type whose storage is contiguous (thus, its copy implies a trivial memory block copy, as if performed with memcpy), either cv-qualified or not. This is true for scalar types, trivially copyable classes, and arrays of any such types.

While waiting for a full C++11 type_traits implementation, CUDA C++ programmers can use the existing GNU C++ front-end methods to check C++ type traits. Calling compiler front-end methods is certainly not the most desirable solution, but it is acceptable because these methods only check the type of a class rather than perform some functional operation. In particular, the GNU __is_pod(), __is_standard_layout(), and __has_trivial_copy() methods are useful to check that C++ classes used with SHADOW_MACRO() are byte transferable. Microsoft users can use the is_pod(), is_standard_layout(), and has_trivial_copy() methods.

C++ classes that conform to the __has_trivial_copy() traits are certainly the most general, while the classes that meet the __is_standard_layout() criteria provide a lowest common denominator that can work with both C and C++ code. Listing One illustrates the differences.

Compiling and running this example shows that the most common (and general) case of a derived class that contains internal data is still byte-copyable, but is not usable as a C struct. For the greatest generality, this article avoids inheritance and instead uses a macro to preserve standard-layout compatibility with C.

ParallelCounter.hpp

To conform to the lowest common denominator of C structure compatibility, the source code for ParallelCounter.hpp (Listing Three) implements most of the code to transparently move data between the host and device as a macro: SHADOW_MACRO(). Be aware that using a macro can cause name conflicts, among other issues.

The macro is passed the type of the class via the variable TYPE, which acts much like a C++ template argument and gives SHADOW_MACRO() the ability to be used in many classes, structures, and C++ templates. The macro defines the member variable my_d_ptr, which points to the device-side memory. The Boolean variable usrManaged flags whether this code performs the memory management itself or whether an external, user-allocated memory region is used for the host/device transfers with cudaMemcpy(). Public methods include:

d_ptr(): This method is called to get the device-side pointer. As needed, it performs any data allocation and/or initiates the data transfer between the host and device.

set_d_ptr(): This sets the device pointer to a user-allocated region of memory and sets usrManaged to true so this code will not call cudaFree().

free_d_ptr(): This frees the device pointer when appropriate.

cpyHtoD(): This copies data from the host to the device with cudaMemcpy(). If needed, memory is allocated on the device.

cpyDtoH(): This copies data from the device to the host with cudaMemcpy().

The supporting macros SHADOW_MACRO_INIT and SHADOW_MACRO_FINI define code for inclusion in the constructor and destructor.
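The actual implementation appears in Listing Three; as a rough, hypothetical sketch of how the pieces described above might fit together (the method names follow the article's description, but the bodies here are illustrative, not the article's exact code):

```cpp
#include <cuda_runtime.h>

// Illustrative sketch only; see Listing Three for the real code. All
// members are left at the enclosing class's access level so a class
// using the macro can keep uniform access control (a standard-layout
// requirement).
#define SHADOW_MACRO(TYPE)                                              \
  TYPE *my_d_ptr;   /* device-side copy of this object */              \
  bool usrManaged;  /* true when the user supplied my_d_ptr */         \
  void set_d_ptr(TYPE *ptr) { my_d_ptr = ptr; usrManaged = true; }     \
  TYPE* d_ptr() { if (!my_d_ptr) cpyHtoD(); return my_d_ptr; }         \
  void free_d_ptr() {                                                  \
    if (my_d_ptr && !usrManaged) cudaFree(my_d_ptr);                   \
    my_d_ptr = 0; usrManaged = false;                                  \
  }                                                                    \
  void cpyHtoD() {  /* allocate on first use, then byte-copy */        \
    if (!my_d_ptr) cudaMalloc((void**)&my_d_ptr, sizeof(TYPE));        \
    cudaMemcpy(my_d_ptr, this, sizeof(TYPE), cudaMemcpyHostToDevice);  \
  }                                                                    \
  void cpyDtoH() {                                                     \
    cudaMemcpy(this, my_d_ptr, sizeof(TYPE), cudaMemcpyDeviceToHost);  \
  }

// For inclusion in the constructor and destructor, respectively.
#define SHADOW_MACRO_INIT() { my_d_ptr = 0; usrManaged = false; }
#define SHADOW_MACRO_FINI() { free_d_ptr(); }
```

Because the entire object is byte-copied, the device-side copy carries the same my_d_ptr and usrManaged values as the host copy, so a later cpyDtoH() restores a consistent host object.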

The ParallelCounter class from the previous article is modified to make use of SHADOW_MACRO. Note that the TYPE passed to the macro is ParallelCounter<N_ATOMIC>.

It is important to note that this code is predicated on the assumption that the object size and layout are identical on the host processor and the GPU. It is therefore crucial to check that classes that include SHADOW_MACRO() are at least trivially copyable.

