Atomic Operations and Low-Wait Algorithms in CUDA

Used correctly, atomic operations can help implement a wide range of generic data structures and algorithms in the massively threaded GPU programming environment. However, incorrect usage can turn massively parallel GPUs into poorly performing sequential processors.

Test 3: Utilizing a C++ Object on Both the Host and GPU

In this test (Listing Five), the entire ParallelCounter object is copied to and from the host with cudaMemcpy(). The object foo is initialized on the host, incremented on the GPU with the doPassedCounter() kernel, and copied back to the host, where the getCount() method is used to check the result.

Notice that the ability to call getCount() on both the GPU (in the previous example) and on the host (in this example) is enabled by annotating the getCount() method in the ParallelCounter class definition with __host__ and __device__ qualifiers.
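The pattern can be sketched as follows. This is an illustrative reconstruction, not the article's exact listing: the class name, the fixed spread of 32, and the kernel are stand-ins for the real ParallelCounter code.

```cuda
#include <cstdio>

// Illustrative stand-in for the ParallelCounter idea: a POD struct whose
// methods are callable from both host and device code.
struct ToyCounter {
  unsigned int count[32];  // one slot per warp lane spreads contention

  __host__ __device__ void set(unsigned int v) {
    for (int i = 0; i < 32; ++i) count[i] = v;
  }
  // __host__ __device__ lets the same method run on either side.
  __host__ __device__ unsigned int getCount() const {
    unsigned int sum = 0;
    for (int i = 0; i < 32; ++i) sum += count[i];
    return sum;
  }
  __device__ void increment() {
    atomicAdd(&count[threadIdx.x & 31], 1u);  // spread across 32 locations
  }
};

__global__ void incrementKernel(ToyCounter *c, unsigned int n) {
  unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < n) c->increment();
}

int main() {
  ToyCounter foo;           // initialized on the host
  foo.set(0);

  ToyCounter *dev;          // the entire object moves with cudaMemcpy()
  cudaMalloc(&dev, sizeof(ToyCounter));
  cudaMemcpy(dev, &foo, sizeof(ToyCounter), cudaMemcpyHostToDevice);

  incrementKernel<<<256, 256>>>(dev, 256 * 256);
  cudaMemcpy(&foo, dev, sizeof(ToyCounter), cudaMemcpyDeviceToHost);

  printf("count = %u\n", foo.getCount());  // getCount() called on the host
  cudaFree(dev);
  return 0;
}
```

Because the object is a POD_struct, a raw bitwise copy in either direction is safe, which is what makes the simple cudaMemcpy() round-trip possible.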

Test 4: Map an Object into UVA Mapped Memory for Use by Both Devices

Clearly, NVIDIA is moving toward unified virtual addressing (UVA), where objects are transparently accessible from both the host and GPU devices. At the moment, the NVIDIA API call cudaMallocHost() must be used to map a region of memory into both address spaces.

Listing Six creates a ParallelCounter object in mapped memory. The counter is set to zero on the host and then utilized in the doPassedCounter() kernel. A cudaDeviceSynchronize() call ensures that the kernel has completed; after which, the state of the counter is read on the host. Note that no explicit memory transfers are required!
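The mapped-memory pattern can be sketched as follows. This is a simplified illustration, not Listing Six itself; it assumes a 64-bit UVA-capable system where pinned memory from cudaMallocHost() is visible to both sides through the same pointer.

```cuda
#include <cstdio>

struct ToyCounter { unsigned int count; };  // illustrative, simplified counter

__global__ void incrementKernel(ToyCounter *c, unsigned int n) {
  unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < n) atomicAdd(&c->count, 1u);
}

int main() {
  ToyCounter *c;
  // On a 64-bit UVA system, pinned memory from cudaMallocHost() is mapped
  // into both address spaces: the same pointer works on host and device.
  cudaMallocHost(&c, sizeof(ToyCounter));

  c->count = 0;                          // set to zero on the host
  incrementKernel<<<128, 128>>>(c, 128 * 128);
  cudaDeviceSynchronize();               // ensure the kernel has completed

  printf("count = %u\n", c->count);      // read on the host -- no cudaMemcpy
  cudaFreeHost(c);
  return 0;
}
```

Note that the cudaDeviceSynchronize() call is essential: the kernel launch is asynchronous, so without it the host could read the counter before the GPU finishes.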

While convenient for many problems, mapped memory is not cached on the GPU as of CUDA 5. This means that any computation that accesses a location in mapped memory many times will probably perform badly. The performance analysis section that follows shows a two-order-of-magnitude slowdown for this approach.

Don’t let the poor computational performance of mapped memory prevent you from using it. The performance analysis in this article merely highlights the need to use mapped memory appropriately. In particular, the ability to use one pointer to access data on both the host and device is essential to many code implementations. In short, enjoy the convenience of mapped memory, but just be aware that high performance requires a copy operation to/from global memory. As will be discussed in the next article, it is possible to implement a C++ base class that provides the convenience of mapped memory with high-performance C++ objects.

Performance

The NVIDIA nvprof text-based profiler is used to provide the following performance data. This choice eliminates the need to manually instrument the example code, thus making it cleaner and simpler.

To build the test code, save the source code in Listing Two to a file, firstCounter.cu. This file can be compiled under Linux for sm_20 and later devices with the NVIDIA compiler command in Listing Seven:

nvcc -O3 -DSPREAD=32 -arch=sm_20 firstCounter.cu -o first

Listing Seven: The nvcc command to build firstCounter.cu.

The test code is profiled while incrementing the counter 4 billion times with the nvprof command-line shown in Listing Eight:

nvprof ./first 4000000000 0

Listing Eight: The nvprof command used to run the example.

Results 1 shows the output produced when running on a Kepler K20c installed as device 0.

These results show that the Kepler card runs roughly 2x faster than the Fermi card when using the ParallelCounter class (0.71 seconds vs. 1.85 seconds).

Profiling Atomic Add of a Single Memory Location

The simple source code in Listing Nine performs the same work as the firstCounter.cu example. The two kernels initCounter() and doCounter() should be self-explanatory. The rest of the code follows the same logic as firstCounter.cu.
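The single-location version can be sketched as follows. This is an illustrative reconstruction of what such kernels look like, not the verbatim contents of Listing Nine.

```cuda
#include <cstdio>

__global__ void initCounter(unsigned int *counter) {
  *counter = 0;                 // a single thread zeroes the one location
}

__global__ void doCounter(unsigned int *counter, unsigned int n) {
  unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
  // Every thread targets the same address, so the hardware must serialize
  // the atomic operations -- exactly the contention the ParallelCounter
  // class avoids by spreading increments across SPREAD separate locations.
  if (tid < n) atomicAdd(counter, 1u);
}

int main() {
  unsigned int *counter, result;
  cudaMalloc(&counter, sizeof(unsigned int));

  initCounter<<<1, 1>>>(counter);
  doCounter<<<128, 128>>>(counter, 128 * 128);

  cudaMemcpy(&result, counter, sizeof(result), cudaMemcpyDeviceToHost);
  printf("count = %u\n", result);
  cudaFree(counter);
  return 0;
}
```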

The code in simpleCounter.cu can be built with the nvcc command in Listing Ten:

nvcc -O3 -arch=sm_20 simpleCounter.cu -o simpleCounter

Listing Ten: The nvcc compilation command for simpleCounter.cu.

Figure 1 shows the excellent performance that can be achieved with the ParallelCounter class. Due to excessive runtime, the C2050 simpleCounter.cu runtimes are reported only up to an nSamples of 400 million. The speed of the Kepler atomicAdd() compared with the C2050 is clearly shown by the green line. Even so, a Fermi GPU using the ParallelCounter class runs faster than a Kepler running simpleCounter.cu, and the K20c is clearly the fastest when using the ParallelCounter class. (Note that compiling the applications with sm_35 for Kepler did not affect the reported runtimes.)

Figure 1: Observed performance of simpleCounter.cu and the ParallelCounter class on a K20c and C2050 GPU.

The profiling results reported by nvprof after compiling firstCounter.cu with USE_MAPPED defined (Results 2) show the dramatic impact that the lack of caching has on mapped memory. Note that the runtime increased from 712 ms to 182 seconds (first two lines), which is a 255x slowdown!

Results 2: Performance results on Kepler including the use of mapped memory.

Even though currently restricted from a performance point of view, mapped memory is very useful for creating and moving complex objects between host and GPU memory, especially objects that contain pointers. The keys to remember with mapped memory are: use layout- and size-compatible POD_struct objects, and copy heavily utilized regions of mapped memory to faster global memory.
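The second key point can be sketched as follows: keep the convenient mapped pointer for host/device sharing, but stage heavily used data through fast global memory before compute-intensive kernels run. The buffer sizes and names here are illustrative.

```cuda
#include <cstdio>

int main() {
  const size_t n = 1 << 20;
  float *mapped, *dGlobal;

  cudaMallocHost(&mapped, n * sizeof(float));   // mapped: visible to both sides
  cudaMalloc(&dGlobal, n * sizeof(float));      // fast, cached device memory

  for (size_t i = 0; i < n; ++i) mapped[i] = 1.0f;   // fill on the host

  // Copy the heavily utilized region into global memory once, so repeated
  // kernel accesses hit fast memory instead of uncached mapped memory.
  cudaMemcpy(dGlobal, mapped, n * sizeof(float), cudaMemcpyHostToDevice);

  // ...run compute-heavy kernels against dGlobal rather than mapped...

  float first;
  cudaMemcpy(&first, dGlobal, sizeof(float), cudaMemcpyDeviceToHost);
  printf("staged value = %g\n", first);

  cudaFree(dGlobal);
  cudaFreeHost(mapped);
  return 0;
}
```

A single copy in each direction amortizes across many kernel accesses, which is why this pattern recovers the performance that raw mapped-memory access gives up.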

Conclusion

The performance graph in Figure 1 really tells the story of this article. The ParallelCounter class is all about robust performance regardless of how it is used in a parallel environment. The ability to maintain high performance regardless of how it is used — including pathological cases where all the threads increment the counter at the same time — makes the ParallelCounter class useful in applications ranging from histograms to parallel stacks and data allocators.

C++ developers should note the object layout and size compatibility between the host and device. This article discussed and used POD_structs, which are the simplest and most restrictive form of C++ compatibility. Newer, less restrictive forms of C++ object compatibility can be verified with type traits such as std::is_standard_layout and std::is_trivially_copyable.

In the future, it is likely that the need for transparent data movement will almost entirely be removed when NVIDIA enables a cached form of mapped memory. Perhaps some form of the Linux madvise() API will be used. When writing the examples for this article, I observed that mapped memory ran as fast as global memory whenever all the data fit inside a single cache line. This indicates that cached mapped memory has the potential to become the de facto method of sharing memory between the host and device(s).

Rob Farber is a frequent contributor to Dr. Dobb's on CPU and GPGPU programming topics.