
CUDA, Supercomputing for the Masses: Part 12

By Rob Farber, May 14, 2009

CUDA 2.2 Changes the Data Movement Paradigm

In CUDA, Supercomputing for the Masses: Part 11 of this article series on CUDA, I revisited CUDA memory spaces and introduced the concept of "texture memory". In this installment, I discuss some paradigm-changing features of the just-released CUDA version 2.2 -- namely "mapped" pinned system memory, which allows compute kernels to share host system memory and provides zero-copy support for direct access to host system memory when running on many newer CUDA-enabled graphics processors. The next article in this series will resume the discussion of texture memory and cover new CUDA 2.2 features such as the ability to write to global memory on the GPU that has a texture bound to it. (Go here for more on CUDA 2.2.)

Prior to CUDA 2.2, CUDA kernels could not access host system memory directly. For that reason, CUDA programmers used the design pattern introduced in Part 1 and Part 2:

Move data to the GPU.

Perform calculation on GPU.

Move result(s) from the GPU to host.

This paradigm has now changed: CUDA 2.2 introduces new APIs that allow host memory to be mapped into the device address space via a new function called cudaHostAlloc (or cuMemHostAlloc in the CUDA driver API). Memory allocated this way has the following characteristics:

For discrete GPUs, mapped pinned memory is only a performance win in certain cases. Since the memory is not cached by the GPU:

It should be read or written exactly once.

The global loads and stores that read or write the memory must be coalesced to avoid a 2x-7x PCIe performance penalty.

At best, it will only deliver PCIe bandwidth performance, but this can be 2x faster than cudaMemcpy because mapped memory is able to exploit the full-duplex capability of the PCIe bus by reading and writing at the same time. A call to cudaMemcpy can only move data in one direction at a time (i.e., half duplex).

Further, a drawback of the current CUDA 2.2 release is that all pinned allocations are mapped into the GPU's 32-bit linear address space, regardless of whether the device pointer is needed or not. (NVIDIA indicates this will be changed to a per-allocation basis in a later release.)
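To make the mapped ("zero-copy") path concrete, here is a minimal sketch of a kernel operating directly on host memory. It assumes a CUDA 2.2 toolkit and a device whose canMapHostMemory property is nonzero; the kernel, sizes, and scale factor are illustrative, and error checking is omitted for brevity:

```cuda
// Sketch: a kernel reading and writing host memory directly through a
// mapped pinned buffer (zero-copy). No cudaMemcpy is needed.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *data, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;   // coalesced; each element read and written once
}

int main(void)
{
    const int n = 1 << 20;
    float *hostPtr, *devPtr;

    cudaSetDeviceFlags(cudaDeviceMapHost);   // must precede any mapped allocation
    cudaHostAlloc((void**)&hostPtr, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&devPtr, hostPtr, 0);

    for (int i = 0; i < n; ++i) hostPtr[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(devPtr, n, 2.0f);
    cudaThreadSynchronize();                 // kernel writes land in host memory

    printf("hostPtr[0] = %f\n", hostPtr[0]); // host reads the result directly
    cudaFreeHost(hostPtr);
    return 0;
}
```

Note that the loads and stores in the kernel are coalesced, and each element crosses the bus exactly once, matching the guidelines above.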

"WC" (write-combined) memory can provide higher performance:

Since WC memory is neither cached nor cache-coherent, greater PCIe performance can be achieved because the memory is not snooped during transfers across the PCI Express bus. NVIDIA notes in its "CUDA 2.2 Pinned Memory APIs" document that WC memory may perform as much as 40% faster on certain PCI Express 2.0 implementations.

It may increase the host processor's write performance to host memory because individual writes are first combined (via an internal processor write-buffer) so that only a single burst write containing many aggregated individual writes need be issued. (Intel claims to have observed actual performance increases of over 10x, but this is not typical.) For more information, please see the Intel publication Write Combining Memory Implementation Guidelines.

Host-side calculations and applications may run faster because write-combined memory does not pollute the internal processor caches such as the L1 and L2 caches. Because WC does not enforce cache coherency, host processor efficiency can increase through fewer cache misses and the avoided overhead of enforcing coherency. Write-combining also sidesteps cache pollution by using a separate, dedicated internal write-buffer that bypasses the other internal processor caches and leaves them untouched.

WC memory does have drawbacks, and CUDA programmers should not treat a WC memory region as general-purpose memory because it is "weakly-ordered". In other words, reading from a WC memory location may return unexpected -- and incorrect -- data because a previous write to that location might have been delayed in order to combine it with other writes. Without programmer-enforced coherency through a "fence" operation, it is possible that a read of WC memory may actually "read" stale or even uninitialized data.

Unfortunately, enforcing coherent reads from WC memory may incur a performance penalty on some host processor architectures. Happily, processors with the SSE4.1 instruction set provide a streaming load instruction (MOVNTDQA) that can efficiently read from WC memory. (To check for SSE4.1 support, execute the CPUID instruction with EAX==1 and test bit 19 of ECX.) Please see the Intel publication, Increasing Memory Throughput With Intel Streaming SIMD Extensions 4 (Intel SSE4) Streaming Load.

It is unclear if and when a CUDA programmer needs to take any action (such as using a memory fence) to ensure that the WC memory is in place and ready for use by the host or graphics processor(s). The Intel documentation states that "[a] 'memory fence' instruction should be used to properly ensure consistency between the data producer and data consumer." The CUDA driver does use WC memory internally and must issue a store fence instruction whenever it sends a command to the GPU. For this reason, the NVIDIA documentation notes, "the application may not have to use store fences at all" (emphasis added). A rough rule of thumb that appears to work: if a CUDA command was issued just before the WC memory is referenced, assume the driver has already issued a fence on your behalf. Otherwise, use your compiler's intrinsic operations to issue a store fence and guarantee that every preceding store is globally visible. The intrinsic is compiler-dependent: Linux compilers will generally understand _mm_sfence, while Windows compilers will probably use _WriteBarrier.

Each of these memory features can be used individually or in any combination -- you can allocate a portable, write-combined buffer, a portable pinned buffer, a write-combined buffer that is neither portable nor pinned, or any other permutation enabled by the flags.
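The permutations are requested by OR-ing flags together in a single cudaHostAlloc call. A minimal sketch (buffer size is arbitrary; error checking omitted):

```cuda
// Sketch: combining cudaHostAlloc() flags (CUDA 2.2 runtime API).
#include <cuda_runtime.h>

int main(void)
{
    float *portableWC, *mappedPinned;

    // Required before any mapped allocation.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // A portable, write-combined buffer.
    cudaHostAlloc((void**)&portableWC, 1024 * sizeof(float),
                  cudaHostAllocPortable | cudaHostAllocWriteCombined);

    // A mapped ("zero-copy") pinned buffer.
    cudaHostAlloc((void**)&mappedPinned, 1024 * sizeof(float),
                  cudaHostAllocMapped);

    cudaFreeHost(portableWC);
    cudaFreeHost(mappedPinned);
    return 0;
}
```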

In a nutshell, these new features add convenience and performance while also adding complexity and creating version dependencies on the CUDA driver, the CUDA hardware, and the host processors. Even so, many types of applications can benefit from these new features.
