CUDA, Supercomputing for the Masses: Part 5

By Rob Farber, June 30, 2008

Understanding and using shared memory (2)

Shared Memory Cautions

Watch out for shared memory bank conflicts, which can slow performance.

All dynamically allocated shared variables in a kernel start at the same memory address. Using more than one dynamically allocated shared memory array requires manually generating the offset. For example, if you want dynamically allocated shared memory to contain two arrays, a and b, then you need to do something like:
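A sketch of the manual-offset approach (the kernel name and sizes here are illustrative; the arrays a and b follow the text):

```cuda
__global__ void twoArraysKernel(int aSize)
{
    // One dynamically sized shared allocation covers the whole block
    extern __shared__ float sData[];

    // Both arrays live in the same allocation, so the second must be
    // offset past the end of the first by hand
    float *a = sData;        // a occupies the first aSize floats
    float *b = &a[aSize];    // b starts immediately after a

    // ... use a[] and b[] as ordinary shared arrays ...
}
```

The execution configuration at launch must then request enough shared memory for both arrays combined, for example (aSize + bSize) * sizeof(float) as the optional third parameter.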


Register/Local Memory Cautions

Register variables can be transparently spilled into local memory, which can be a cause of poor performance. Check the PTX assembly code, or look for lmem in the output of nvcc when run with "--ptxas-options=-v".

Arrays indexed by constants known at compile time typically reside in registers, but arrays indexed by variables cannot. This creates a conundrum for the developer, because loop unrolling may be required to keep array elements in register memory as opposed to slow global memory. However, unrolling loops can greatly increase register usage, which may result in variables spilling to local memory -- obviating any benefit of loop unrolling. It is possible to use the nvcc option --maxrregcount=value to tell the compiler the maximum number of registers each thread may use. (Note: the maximum register count that can be specified is 128.) This is a tradeoff: allowing more registers per thread means fewer threads can be resident at once, which may reduce the opportunities to hide memory latency. With some architectures, use of this option may also prevent kernels from starting due to insufficient resources.
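To illustrate the indexing issue, here is a sketch (the kernel and array names are invented for illustration). With a compile-time trip count, #pragma unroll lets the compiler turn every access to the per-thread array into a constant-indexed access, so the array can stay in registers:

```cuda
#define N 4

__global__ void unrollExample(float *out)
{
    float r[N];  // small per-thread array

    // With a fixed trip count, #pragma unroll lets nvcc resolve every
    // index into r[] at compile time, so r[] can live in registers
    // instead of slow local memory
    #pragma unroll
    for (int i = 0; i < N; i++)
        r[i] = i * 2.0f;

    float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < N; i++)
        sum += r[i];

    out[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}
```

Compiling with --ptxas-options=-v will show whether any lmem (local memory) was actually used.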

A Shared Memory Kernel

Both programs reverseArray_multiblock.cu and reverseArray_multiblock_fast.cu perform the same tasks. They create a 1D array of integers, h_a, containing the integer values [0 .. dimA-1]. The array is then moved via cudaMemcpy to the device, and the host then launches the reverseArrayBlock kernel to reverse the order of the array contents. Again, cudaMemcpy is used to transfer the data from the device back to the host, where a check is performed to verify that the device produced the correct result (that is, [dimA-1 .. 0]).

The difference is that reverseArray_multiblock_fast.cu uses shared memory to improve the performance of the kernel, while reverseArray_multiblock.cu operates entirely in global memory. Try timing the two programs and verify for yourself the difference in performance. Also, reverseArray_multiblock.cu accesses global memory in an inefficient manner. We will use the CUDA profiler to help diagnose and fix this performance issue in a future column, and show how improvements in the new 10 series architecture eliminate the need for these types of optimizations in many cases.

Deciding on the amount of shared memory at runtime requires some setup in both host and device code. In this example, the amount of shared memory (in bytes) for each block in a kernel is specified in the execution configuration on the host as an optional third parameter. (Setup on the host side is only required if the amount of shared memory is specified at kernel launch. If it's fixed at compile time, no setup is required on the host side.) By default, the execution configuration assumes no shared memory is used. For example, in the host code of reverseArray_multiblock_fast.cu, the following code snippet allocates shared memory for an array of integers containing a number of elements equal to the number of threads in a block:

// Part 1 of 2: Compute the number of bytes of shared memory needed
// This is used in the kernel invocation below
int sharedMemSize = numThreadsPerBlock * sizeof(int);
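Part 2 then passes this size as the optional third execution-configuration parameter at launch. A sketch (the launch geometry and the device pointers d_b and d_a are assumed names, shown here for illustration):

```cuda
// Part 2 of 2: pass the shared-memory size (in bytes) as the optional
// third execution-configuration parameter when launching the kernel
dim3 dimGrid(numBlocks);
dim3 dimBlock(numThreadsPerBlock);
reverseArrayBlock<<< dimGrid, dimBlock, sharedMemSize >>>( d_b, d_a );
```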

Looking at the reverseArrayBlock kernel, the shared memory is declared with the following:

extern __shared__ int s_data[];

Note that the size is not indicated in the kernel -- rather it is obtained from the host through the execution configuration.
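As an illustration of how the kernel can use this dynamically sized array, here is a sketch of a shared-memory block reversal. The logic mirrors what reverseArray_multiblock_fast.cu is described as doing, though details may differ from the shipped source:

```cuda
__global__ void reverseArrayBlock(int *d_out, int *d_in)
{
    // Size is set by the third execution-configuration parameter
    extern __shared__ int s_data[];

    int inOffset = blockDim.x * blockIdx.x;
    int in = inOffset + threadIdx.x;

    // Each thread loads one element, storing it into shared memory
    // in reversed order within the block
    s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];

    // Wait until every thread in the block has written shared memory
    __syncthreads();

    // Blocks write to mirrored positions: block 0 writes where the
    // last block's data belongs, and so on
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int out = outOffset + threadIdx.x;
    d_out[out] = s_data[threadIdx.x];
}
```

The __syncthreads() barrier is essential: without it, a thread could read s_data before another thread has finished writing it.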

Until the next column on profiling, I recommend looking at reverseArray_multiblock.cu. Do you think there is a performance problem in accessing global memory? If you think there is a problem, try to fix it.

Rob Farber is a senior scientist at Pacific Northwest National Laboratory. He has worked in massively parallel computing at several national laboratories and as co-founder of several startups. He can be reached at rmfarber@gmail.com.

