
Atomic Operations and Low-Wait Algorithms in CUDA

Used correctly, atomic operations can help implement a wide range of generic data structures and algorithms in the massively threaded GPU programming environment. However, incorrect usage can turn massively parallel GPUs into poorly performing sequential processors.

The NVIDIA Kepler architecture significantly improved the ability of threads to communicate outside a threadblock via atomic operations. Atomic operations essentially lock a memory location until they complete. This tutorial demonstrates how to implement a massively parallel, low-wait counter. Benchmark results show that the provided code is 8x faster on Kepler GPUs and 40x faster on Fermi hardware compared with traditional counters that use atomicAdd() to increment a single memory location. Unlike traditional atomic counters, the massively parallel counter implemented in the ParallelCounter class below is not susceptible to performance degradation from pathological usage, such as having every thread increment the counter at the same time.

Atomic Operations Are Great, But Don't Use Them

To understand what a low-wait massively parallel counter is, it is necessary to first understand the benefits and challenges of using atomic operations in global memory.

The chief feature of an atomic operation is that it locks the affected memory location until the operation is complete. Calling atomicAdd(&foo, 1), for example, means that only the thread that holds the lock can increment the variable foo by one. All other threads that wish to read or write foo must wait until the lock is released. An atomic operation is required whenever a thread updates a location in global memory that might be used by other threads. While the C/C++ construct foo++ looks like a single operation, the hardware might in reality carry out three separate steps when performing the increment: (1) fetch foo into a register, (2) increment the register by one, and (3) write the register back to foo in global memory. Without a lock, two or more parallel threads might read foo into a register at the same time, which means they would be unaware of the increments in progress in the other threads. While the end result of a write by multiple threads to the same location in global memory is undefined, it is likely that foo will reflect an incorrect number of increment operations.
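The difference between an unprotected increment and atomicAdd() can be sketched with two small kernels. This is only an illustration: the kernel names and the runIncrement() host helper are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Unsafe: every thread performs a separate load, add, and store on *counter,
// so concurrent increments can overwrite one another.
__global__ void incrementRacy(unsigned int *counter)
{
    *counter = *counter + 1;
}

// Safe: the read-modify-write is performed as one indivisible operation.
__global__ void incrementAtomic(unsigned int *counter)
{
    atomicAdd(counter, 1u);
}

// Hypothetical host helper: launches nBlocks x nThreads threads that each
// increment the counter once, then returns the final counter value.
unsigned int runIncrement(bool useAtomic, int nBlocks, int nThreads)
{
    unsigned int *d_counter = nullptr;
    unsigned int h_counter = 0;
    cudaMalloc(&d_counter, sizeof(unsigned int));
    cudaMemset(d_counter, 0, sizeof(unsigned int));

    if (useAtomic)
        incrementAtomic<<<nBlocks, nThreads>>>(d_counter);
    else
        incrementRacy<<<nBlocks, nThreads>>>(d_counter);

    // cudaMemcpy waits for the kernel on the default stream to finish.
    cudaMemcpy(&h_counter, d_counter, sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    return h_counter;
}
```

With the atomic kernel, runIncrement(true, 64, 256) should return exactly 64 * 256. The racy kernel gives no such guarantee, and in practice it usually reports far fewer increments than were issued.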

Atomic operations in a parallel environment present a real challenge because they serialize execution. Instead of seeing an nProcessor parallel speedup, or an O(nThreads/nProcessor) runtime, applications that perform an atomic operation on a single counter will only exhibit a sequential runtime of O(nThreads). Incrementing a single counter with atomicAdd() means that the counter has to be locked, thus forcing all the parallel threads to stop and wait so they can perform the increment operation individually, one after the other. In other words, it's the antithesis of parallel programming.

A low-wait algorithm is an algorithm that still uses locking (atomic operations), so there will be some serialization, but the algorithm is designed to keep the number of threads that must wait for the lock to be released to a minimum. In other words, the algorithm attempts to keep as many parallel threads active as possible.

An NVIDIA SDK histogram example demonstrated a form of low-wait algorithm via the use of a vector of counters that are incremented with atomicAdd() operations. Each element in the vector contains the count for a single bin in the histogram. For uniformly distributed data, the SDK example will keep a number of threads active that is equivalent to the number of active bins. When this number is large, the SDK histogram will demonstrate high performance because many threads will be actively incrementing histogram counts. Performance suffers when the data is not uniformly distributed, causing many of the items to fall into a few bins. A pathological case occurs when all the histogram data fits into a single bin.
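A histogram built on a vector of atomic counters might look roughly like the following. This is a minimal sketch and not the SDK code: the 256-bin layout for 8-bit samples, the kernel name, and the histogram() host helper are all assumptions made for illustration, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define N_BINS 256  // assumed: one bin per possible 8-bit sample value

// One atomic counter per bin. Threads whose samples land in different bins
// do not wait on each other; threads hitting the same bin are serialized.
__global__ void histogramKernel(const unsigned char *data, int n,
                                unsigned int *bins)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&bins[data[i]], 1u);
}

// Hypothetical host helper: returns the N_BINS bin counts for data.
std::vector<unsigned int> histogram(const std::vector<unsigned char> &data)
{
    const int n = static_cast<int>(data.size());
    unsigned char *d_data = nullptr;
    unsigned int *d_bins = nullptr;
    cudaMalloc(&d_data, n);
    cudaMalloc(&d_bins, N_BINS * sizeof(unsigned int));
    cudaMemcpy(d_data, data.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, N_BINS * sizeof(unsigned int));

    histogramKernel<<<64, 256>>>(d_data, n, d_bins);

    std::vector<unsigned int> bins(N_BINS);
    cudaMemcpy(bins.data(), d_bins, N_BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_bins);
    return bins;
}
```

Feeding this sketch input in which every sample has the same value reproduces the pathological case: every thread then contends for a single bin.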

From a hardware point of view, implementing high-performance atomic operations is difficult because of all the complexity required to enforce a lock and to preserve coherency in a massively parallel environment. Kudos to the NVIDIA engineers who have made atomic operations so much faster on the new Kepler GPUs. Even though atomic operations can now approach the speed of global memory, using a lock in a parallel algorithm will likely have dramatic performance implications that must be avoided.

A Parallel Counter Class

The following implementation of a parallel counter C++ class spreads the value of the counter across a vector of several counters. Atomic increment (or decrement) operations are forced to be uniformly distributed across the count vector through a modulus operation (% in C/C++) on the CUDA variable threadIdx.x. As discussed previously, the uniformity of access means that, on average, at least N_ATOMIC threads will always be active, which means the parallel performance should degrade gracefully even in the most extreme pathological case where all the threads on the GPU are incrementing the same counter at the same time.

Even on fast GPUs, the modulus operation is expensive, so it is highly recommended that N_ATOMIC be a power of two because the compiler can then convert the expression (threadIdx.x % N_ATOMIC) to (threadIdx.x & (N_ATOMIC-1)). Bitwise AND operations are fast relative to the modulus operation. To make best use of all the threads in a warp, it is also recommended that N_ATOMIC be a multiple of the warp size.
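The equivalence between the modulus and the bitwise AND for power-of-two sizes can be checked with a short sketch. The default value of 32 for N_ATOMIC and the helper names slotMod() and slotAnd() are assumptions made only for this illustration.

```cuda
#include <cuda_runtime.h>

#ifndef N_ATOMIC
#define N_ATOMIC 32  // assumed default: a power of two equal to the warp size
#endif

// The AND form is only valid when N_ATOMIC is a power of two.
static_assert(N_ATOMIC > 0 && (N_ATOMIC & (N_ATOMIC - 1)) == 0,
              "N_ATOMIC must be a power of two");

// Slot index computed with the modulus operator.
__host__ __device__ inline unsigned int slotMod(unsigned int tid)
{
    return tid % N_ATOMIC;
}

// Slot index computed with a bitwise AND; same result for unsigned tid.
__host__ __device__ inline unsigned int slotAnd(unsigned int tid)
{
    return tid & (N_ATOMIC - 1);
}
```

Both helpers return the same slot for any unsigned thread index, which is why the compiler is free to make the substitution when the divisor is a compile-time power of two.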

The getCount() method is defined using the __device__ and __host__ qualifiers, which means it can be called by either the host or the CUDA device. For simplicity, it is assumed that device-side calls to getCount() are performed by a single CUDA thread and only after all atomic updates are complete. Similarly, the set() method and constructor/destructors are qualified so they can run on either the host or device.
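A counter along these lines might be sketched as follows. This is not the article's listing: the unsigned int element type, the default N_ATOMIC of 32, the addSamples() kernel, and the countOnDevice() host helper are assumptions, and constructors, destructors, and error checking are omitted for brevity.

```cuda
#include <cuda_runtime.h>

#ifndef N_ATOMIC
#define N_ATOMIC 32  // assumed default: a power of two equal to the warp size
#endif

// Hypothetical sketch of the counter described above.
struct ParallelCounter {
    unsigned int count[N_ATOMIC];

    // Device only: spread the atomic updates across the slots by thread index.
    __device__ void operator+=(unsigned int x)
    {
        atomicAdd(&count[threadIdx.x & (N_ATOMIC - 1)], x);
    }

    // Host or device: call from one thread after all updates have finished.
    __host__ __device__ unsigned int getCount() const
    {
        unsigned int sum = 0;
        for (int i = 0; i < N_ATOMIC; ++i) sum += count[i];
        return sum;
    }

    // Host or device: store x in the first slot and zero the others.
    __host__ __device__ void set(unsigned int x)
    {
        count[0] = x;
        for (int i = 1; i < N_ATOMIC; ++i) count[i] = 0;
    }
};

// Increments the counter nSamples times in total using a grid-stride loop.
__global__ void addSamples(ParallelCounter *c, unsigned int nSamples)
{
    unsigned int stride = blockDim.x * gridDim.x;
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < nSamples; i += stride)
        *c += 1u;
}

// Hypothetical host helper: runs the kernel and sums the slots on the host.
unsigned int countOnDevice(unsigned int nSamples)
{
    ParallelCounter h_counter;
    h_counter.set(0);
    ParallelCounter *d_counter = nullptr;
    cudaMalloc(&d_counter, sizeof(ParallelCounter));
    cudaMemcpy(d_counter, &h_counter, sizeof(ParallelCounter),
               cudaMemcpyHostToDevice);
    addSamples<<<64, 256>>>(d_counter, nSamples);
    cudaMemcpy(&h_counter, d_counter, sizeof(ParallelCounter),
               cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    return h_counter.getCount();
}
```

Because the slot is chosen from threadIdx.x, consecutive threads in a warp update different slots, so at most a few threads per block contend for any one location.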

Testing Usability

The following firstCounter.cu example code demonstrates three possible ways to use a single ParallelCounter class:

Utilize a C++ object entirely on the GPU.

Utilize an object on both the host and GPU.

Map an object into Unified Virtual Addressing (UVA) for use by both devices.

For convenience, the source in Listing One includes the ParallelCounter structure definition to make copying the code to a file easy. In addition, a test for structure compatibility with the GPU is performed using the __is_pod() compiler intrinsic. POD_struct compatibility is also highlighted by defining the type as "struct ParallelCounter" rather than "class ParallelCounter". Instead of instrumenting this code, kernel execution times reported by the NVIDIA nvprof text profiler will be used to compare performance in the next section.

The number of times the counter will be incremented. For consistency with the histogram example, this number is referred to as nSamples.

The number of the CUDA device to use, which makes it easy to compare Fermi and Kepler GPU performance in mixed GPU systems.

Walking through the code starting at main() shows that this code uses C++ exceptions to report errors. Currently, GPU kernels and CUDA library functions do not throw exceptions on errors, which is why this example uses cudaPeekAtLastError() to determine whether an error needs to be thrown. The cudaPeekAtLastError() function does not clear the error, so cudaGetLastError() can then be used to retrieve the error for printing with cudaGetErrorString().
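One way to turn CUDA errors into C++ exceptions along these lines is sketched below. The helper throwOnCudaError(), the emptyKernel() kernel, and the launchThrows() wrapper are hypothetical names used only for illustration; the CUDA runtime calls themselves are the ones named above.

```cuda
#include <cuda_runtime.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: throws if the CUDA runtime has recorded an error.
inline void throwOnCudaError(const char *what)
{
    // Peek first: this reports the pending error without clearing it.
    if (cudaPeekAtLastError() != cudaSuccess) {
        // cudaGetLastError() returns the same error and resets it.
        cudaError_t err = cudaGetLastError();
        throw std::runtime_error(std::string(what) + ": " +
                                 cudaGetErrorString(err));
    }
}

__global__ void emptyKernel() {}

// Hypothetical wrapper: returns true if launching emptyKernel with the
// given block size caused an exception to be thrown.
bool launchThrows(int threadsPerBlock)
{
    try {
        emptyKernel<<<1, threadsPerBlock>>>();
        throwOnCudaError("emptyKernel launch");
        cudaDeviceSynchronize();
        throwOnCudaError("emptyKernel execution");
    } catch (const std::runtime_error &) {
        return true;
    }
    return false;
}
```

Requesting an oversized block, for example launchThrows(1 << 20), should produce an invalid-configuration error and therefore an exception, while a valid launch such as launchThrows(32) should not.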

Next, the result variable is allocated on the device. This variable is used to return the value of the parallel counter for the on-device test.

For convenience, the application prints out information about the runtime configuration. In particular, note that the C preprocessor variable SPREAD can be defined at compile time to test the impact of distributing the atomic operations across various sizes of the internal ParallelCounter count vector.

Test 1: Utilize The Counter Entirely on the GPU

Listing Three simply initializes the ParallelCounter object myCounter on the GPU to zero. The counter is then incremented nSamples times in parallel on the device and the count is returned in the result variable. The call to assert() checks that the counter did indeed work correctly.

The initCounter() kernel zeros out the myCounter object with a call to the set() method. Because the C++ code does not have control over the execution configuration, a check is made to ensure that only a single thread calls the set() method.

The doCounter() kernel increments the counter nSamples times by having each thread use the += operator. Finally, the finiCounter() kernel calls the getCount() method on the GPU so the state of the counter can be returned in result.
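The three kernels described above might be arranged roughly as follows. This is a sketch under stated assumptions, not the article's Listing Three: the minimal ParallelCounter struct, the default N_ATOMIC of 32, the launch configurations, and the runOnDeviceTest() host helper are all hypothetical, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

#ifndef N_ATOMIC
#define N_ATOMIC 32  // assumed default: a power of two
#endif

// Hypothetical minimal counter, included so the sketch is self-contained.
struct ParallelCounter {
    unsigned int count[N_ATOMIC];
    __device__ void operator+=(unsigned int x)
    {
        atomicAdd(&count[threadIdx.x & (N_ATOMIC - 1)], x);
    }
    __host__ __device__ unsigned int getCount() const
    {
        unsigned int sum = 0;
        for (int i = 0; i < N_ATOMIC; ++i) sum += count[i];
        return sum;
    }
    __host__ __device__ void set(unsigned int x)
    {
        count[0] = x;
        for (int i = 1; i < N_ATOMIC; ++i) count[i] = 0;
    }
};

// The counter object lives entirely in device memory.
__device__ ParallelCounter myCounter;

// Only one thread resets the counter, whatever the launch configuration.
__global__ void initCounter()
{
    if (blockIdx.x == 0 && threadIdx.x == 0) myCounter.set(0);
}

// The grid as a whole applies += exactly nSamples times.
__global__ void doCounter(unsigned int nSamples)
{
    unsigned int stride = blockDim.x * gridDim.x;
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < nSamples; i += stride)
        myCounter += 1u;
}

// A single thread sums the slots after all updates are complete.
__global__ void finiCounter(unsigned int *result)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) *result = myCounter.getCount();
}

// Hypothetical host helper: runs the three kernels and returns the count.
unsigned int runOnDeviceTest(unsigned int nSamples)
{
    unsigned int *d_result = nullptr;
    unsigned int h_result = 0;
    cudaMalloc(&d_result, sizeof(unsigned int));
    initCounter<<<1, 1>>>();
    doCounter<<<64, 256>>>(nSamples);   // same stream, so kernels run in order
    finiCounter<<<1, 1>>>(d_result);
    cudaMemcpy(&h_result, d_result, sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_result);
    return h_result;
}
```

A caller could then check the result with assert(runOnDeviceTest(nSamples) == nSamples), mirroring the assert() mentioned above.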

Test 2: Porting C++ Objects

The second test instantiates an object foo on the host that is copied to the variable d_foo on the GPU with cudaMemcpy().

Portable C++ object size and layout compatibility is affected by numerous issues including features of the C++ language, such as virtual functions and inheritance, decisions made by the compiler authors, as well as hardware issues such as type size and alignment.

POD_structs (or Plain Old Data structs) are the only types of objects guaranteed by the C++ standard to hold the same value when the contents of the object are copied into an array of char or unsigned char with memcpy(), and then copied back into the object with memcpy(). A POD_struct is essentially a C struct. This compatibility means that a POD_struct can be copied to the remote device and utilized there. It is also useful for transferring the C++ object with read() and write() operations.

The stackoverflow.com post, "What are Aggregates and PODs and How/Why Are They Special?" provides the following code snippet that makes the memcpy() capability of POD_structs clear. That article provides a more detailed discussion about POD_structs, including limitations imposed by the C++ standard about what types of objects can be considered POD_structs.

#define N sizeof(T)
char buf[N];
T obj; // obj initialized to its original value
memcpy(buf, &obj, N); // between these two calls to memcpy,
// obj might be modified
memcpy(&obj, buf, N); // at this point, each subobject of obj of scalar type
// holds its original value

Substituting cudaMemcpy() for memcpy() illustrates how POD_structs can be utilized by both the host and GPU.
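The substitution can be sketched as a host-to-device-to-host round trip. The Sample struct, the bumpCounts() kernel, and the roundTrip() helper are hypothetical, and the check uses the standard C++11 type traits std::is_trivially_copyable and std::is_standard_layout as a stand-in for a POD test rather than the intrinsic the article uses; error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <type_traits>

// Hypothetical POD-style struct used only for this illustration.
struct Sample {
    int id;
    float weight;
    unsigned int count[4];
};

// Compile-time check that a byte-wise copy of Sample is well defined.
static_assert(std::is_trivially_copyable<Sample>::value &&
              std::is_standard_layout<Sample>::value,
              "Sample must be safe to copy byte-wise between host and device");

// A single thread modifies the device copy so the round trip is observable.
__global__ void bumpCounts(Sample *s)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        for (int i = 0; i < 4; ++i) s->count[i] += 1;
}

// Copies the object to the device, updates it there, and copies it back.
Sample roundTrip(Sample hostObj)
{
    Sample *d_obj = nullptr;
    cudaMalloc(&d_obj, sizeof(Sample));
    cudaMemcpy(d_obj, &hostObj, sizeof(Sample), cudaMemcpyHostToDevice);
    bumpCounts<<<1, 1>>>(d_obj);
    cudaMemcpy(&hostObj, d_obj, sizeof(Sample), cudaMemcpyDeviceToHost);
    cudaFree(d_obj);
    return hostObj;
}
```

Fields the kernel does not touch (id and weight here) come back with their original values, while each element of count comes back incremented by one.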

NVIDIA notes they make every effort to ensure that GPU and CPU objects have the same size and layout. In particular, the NVIDIA CUDA C Programming Guide states:

"On Windows, the CUDA compiler may produce a different memory layout, compared to the host Microsoft compiler, for a C++ object of class type T that satisfies any of the following conditions:

T has virtual functions or derives from a direct or indirect base class that has virtual functions;

T has a direct or indirect virtual base class;

T has multiple inheritance with more than one direct or indirect empty base class."

Bottom line: C++ object compatibility between devices, for both mapped memory and memory explicitly copied with cudaMemcpy(), requires careful attention to the limitations imposed by the C++ standard. Use of the __is_pod() intrinsic is recommended to check for POD_struct compatibility.
