CUDA Programming

This feature is not supported on the Wolfram Cloud.

CUDA is a general, C-like programming language developed by NVIDIA to program graphics processing units (GPUs). CUDALink provides an easy interface to program the GPU by removing many of the required steps: compilation, linking, data transfer, and so on are all handled by the Wolfram Language's CUDALink. This allows the user to concentrate on writing the algorithm rather than on interface and boilerplate code.

This section describes how to start programming CUDA in the Wolfram Language.

This document describes the GPU architecture and how to write a CUDA kernel, and concludes by demonstrating several applications written using CUDALink.

Introduction

The Compute Unified Device Architecture (CUDA) was developed by NVIDIA in late 2007 as a way to make the GPU more general. While GPU programming has been around for many years, the difficulty of programming the GPU limited its adoption. CUDALink aims to make GPU programming easy and to accelerate its adoption.

When using the Wolfram Language, you need not worry about many of these steps; you only need to write the CUDA kernel. This is made possible by utilizing all levels of the NVIDIA architecture stack.

CUDA Architecture

CUDA programming is based on the data-parallel model. From a high-level standpoint, the problem is first partitioned onto hundreds or thousands of threads for computation. If the following is your computation:

OutputData = Table[fun[InputData[[i, j]]], {i, 10000}, {j, 10000}]

then in the CUDA programming paradigm, this computation is equivalent to:

CUDALaunch[fun, InputData, OutputData, {10000, 10000}]

where fun is a computation function. The above launches 10000×10000 threads, passes each thread its index, applies the function fun to InputData, and places the results in OutputData. CUDALink's equivalent to CUDALaunch is CUDAFunction.

The reason CUDA can launch thousands of threads lies in its hardware architecture. The following sections discuss this, along with how threads are partitioned for execution.

Threads in CUDA are organized in a hierarchy: each grid contains multiple blocks, and each block contains multiple threads.

Choosing whether to have a one-, two-, or three-dimensional thread configuration depends on the problem. In the case of image processing, for example, you map the image onto a two-dimensional grid of threads and apply a function to each pixel.

A one-dimensional cellular automaton, by contrast, maps naturally onto a one-dimensional grid.

CUDA Program Cycle

The gist of CUDA programming is to copy data from the host (CPU) to the device (GPU), launch many threads (typically in the thousands), wait until the GPU execution finishes (or perform CPU calculations while waiting), and finally copy the result from the device back to the host.

The following steps detail the typical cycle of a CUDA program; a sketch in CUDALink terms follows the list.

1. Allocate memory on the GPU. GPU and CPU memory are physically separate, and the programmer must manage the allocation and copying.

2. Copy the memory from the CPU to the GPU.

3. Choose the thread configuration: select the block and grid dimensions appropriate for the problem.

4. Launch the configured threads.

5. Synchronize the CUDA threads to ensure that the device has completed all its tasks before doing further operations on the GPU memory.

6. Once the threads have completed, copy the memory back from the GPU to the CPU.

7. Free the GPU memory.
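The same cycle can be traced explicitly through CUDALink's memory functions. The following is a minimal sketch, assuming a CUDAFunction named fun has already been loaded with CUDAFunctionLoad (the variable names are illustrative):

Needs["CUDALink`"]

(* steps 1 and 2: allocate GPU memory and copy the input list to it *)
mem = CUDAMemoryLoad[Range[100]];

(* steps 3 through 5: apply the loaded CUDAFunction to the memory; *)
(* CUDALink configures, launches, and synchronizes the threads *)
fun[mem, 100];

(* step 6: copy the result back from the GPU to the CPU *)
result = CUDAMemoryGet[mem];

(* step 7: free the GPU memory *)
CUDAMemoryUnload[mem]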

When using the Wolfram Language, you need not worry about many of the steps. With the Wolfram Language, you only need to write CUDA kernels.

Memory Hierarchy

CUDA memory is divided into different levels, each with its own advantages and limitations. The following sections describe the types of memory available to CUDA.

Global Memory

The most abundant (but slowest) memory available on the GPU. This is the memory advertised on the packaging: 128, 256, or 512 MB. All threads can access elements in global memory, although for performance reasons these accesses tend to be kept to a minimum and have further restrictions on them.

The performance restrictions on global memory have been relaxed on recent hardware and will likely be relaxed even further. The general rule is that performance deteriorates if the same global memory is accessed more than once; such repeated accesses are better staged in shared memory.

Texture Memory

Texture memory resides in the same location as global memory, but it is read-only. Texture memory does not suffer from the performance deterioration found in global memory. On the flip side, only char, int, and float are supported types.

Constant Memory

Fast constant memory that is accessible from any thread in the grid. The memory is cached, but limited to 64 KB globally.

Shared Memory

Fast memory that is local to a specific block. On current hardware, the amount of shared memory is limited to 16 KB per block.
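As an illustration, the following hedged kernel sketch (the name reverseBlocks_kernel and the 256-thread block size are assumptions, and the input length is assumed to be a multiple of the block size) stages data in shared memory; __syncthreads() ensures every thread in the block has written before any thread reads:

__global__ void reverseBlocks_kernel(mint * arr, mint len) {
    __shared__ mint tmp[256];            /* fast, per-block scratch space */
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index >= len) return;
    tmp[threadIdx.x] = arr[index];       /* stage the element in shared memory */
    __syncthreads();                     /* wait for the whole block to finish writing */
    arr[index] = tmp[blockDim.x - 1 - threadIdx.x];  /* reverse within the block */
}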

Local Memory

Local memory is local to each thread, but resides in global memory unless the compiler places the variables in registers. A general performance consideration is to keep local memory use to a minimum, since the memory accesses are slow, although they do not have the same access-pattern problems as global memory.

Compute Capability

The compute capability determines what operations the device is capable of. Currently, only compute capabilities 1.0, 1.1, 1.2, 1.3, and 2.0 exist, with the main differences listed below.

Compute Capability    Extra Features
1.0                   base implementation
1.1                   atomic operations
1.2                   shared atomic operations and warp vote functions
1.3                   support for double-precision operations
2.0                   double precision, L2 cache, and concurrent kernels

Information about the compute capability on the current system can be retrieved using CUDAInformation.
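For example, assuming "Compute Capabilities" as the property name (check the output of CUDAInformation[] for the exact keys on your installation), the compute capability of the first device can be queried as follows:

Needs["CUDALink`"]
CUDAInformation[1, "Compute Capabilities"]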

Multiple CUDA Devices

If the system hardware supports it, CUDA allows you to select the device on which computation is performed. By default, the fastest device is chosen, but this can be overridden by the user. Once a device is set (whether chosen automatically or by the user), the device cannot be changed in the kernel session.
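A minimal sketch of overriding the default, assuming CUDALink's $CUDADevice variable (set it before any CUDA computation is performed in the session):

Needs["CUDALink`"]
$CUDADevice = 2;  (* perform subsequent CUDA computations on the second device *)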

Writing a CUDA Kernel

CUDA kernels are small functions that are called many times, usually corresponding to a few lines inside the body of a program's For loop. The following sections develop simple example kernels.

First Kernel

A CUDA kernel is a small piece of code that performs a computation on each element of an input list. Your first kernel will add 2 to each element.

__global__ void addTwo_kernel(mint * arry, mint len) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index >= len) return;
    arry[index] += 2;
}

__global__ is a function qualifier that instructs the compiler that the function should be run on the GPU. __global__ functions can be called from C. The other function qualifier is __device__, which denotes functions that can be called from __global__ or other __device__ functions, but cannot be called from C.

A CUDA kernel must have void output, so to get a result you must pass in pointer inputs and overwrite them. In this case, you pass arry and overwrite its elements (you can think of arry as an input/output parameter).

The line int index = threadIdx.x + blockIdx.x * blockDim.x; gets the index of the current thread. CUDA provides the following built-in variables, which are set depending on the launch grid and block size:

threadIdx — index of the current thread; the thread index is between 0 and blockDim - 1

blockIdx — index of the current block; the block index is between 0 and gridDim - 1

blockDim — the block size dimensions

gridDim — the grid size dimensions

These variables are set by CUDA automatically, based on the kernel launch parameters (the block and grid dimensions). The higher dimensions are automatically set to 1 when launching a lower-dimensional computation, so when launching a 1D grid, blockDim.y and blockDim.z are set to 1.

In the most common 1D case, you use threadIdx.x + blockIdx.x * blockDim.x to find the global offset; in 2D, you use threadIdx.x + blockIdx.x * blockDim.x and threadIdx.y + blockIdx.y * blockDim.y to find the global position.

Since threads are launched in multiples of the block dimension, the user needs to make sure not to write past the boundaries when the length of the input is not a multiple of the block dimension. The line if (index >= len) return; ensures that.

The line arry[index] += 2; is the operation applied to each element in the list, adding 2 to the input in this case.

Second Kernel

The second kernel implements a finite difference method (forward difference).
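A minimal sketch of such a forward-difference kernel (the name diff_kernel, the Real_t type described under "Using Real Types" below, and the argument layout are illustrative):

__global__ void diff_kernel(Real_t * in, Real_t * out, mint len) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index >= len - 1) return;
    out[index] = in[index + 1] - in[index];  /* forward difference */
}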

Loading a CUDA Kernel into the Wolfram Language

This loads the CUDAFunction. Pass the kernel code as the first argument, the name of the kernel to be executed ("addTwo_kernel") as the second argument, the function parameters as the third argument, and the block size as the fourth argument. The result is a function that can be applied directly to Wolfram Language lists.
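A sketch of the full call, using the addTwo_kernel source from above stored in a string (the variable names code and addTwo, and the block size of 256, are illustrative):

code = "
__global__ void addTwo_kernel(mint * arry, mint len) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index >= len) return;
    arry[index] += 2;
}";

addTwo = CUDAFunctionLoad[code, "addTwo_kernel", {{_Integer, "InputOutput"}, _Integer}, 256];

(* apply the loaded function to a list of ten integers *)
addTwo[Range[10], 10]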

C Sequential to CUDA Programming

The following details the progression of a program from the serial CPU version to a CUDA version.

The program implements a moving average with radius 3, calculating the average value of each element of an array based on its neighboring elements.

Generally, you start with the serial implementation. This allows you to gauge the problem and find possible avenues of parallelization. The serial implementation can also serve as the reference implementation and help determine whether CUDA is fit for the task; a reference sketch follows.
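A reference sketch of a radius-3 moving average in the Wolfram Language, clamping at the array boundaries (a hedged illustration, not CUDALink's implementation; the name movingAverage is illustrative):

movingAverage[list_List, radius_: 3] :=
  Table[
    Mean[list[[Max[i - radius, 1] ;; Min[i + radius, Length[list]]]]],
    {i, Length[list]}
  ]

movingAverage[N[Range[10]]]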

Compiling for CUDA

CUDALink provides a portable compilation mechanism for CUDA code. This is done through the NVCCCompiler, which registers the NVCC compiler found on the system with CCompilers so that it can be used to compile DLLs and executables.

Before loading the NVCCCompiler in CUDALink, notice the available compilers on the system.
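A minimal sketch, assuming that loading CUDALink registers NVCCCompiler with the compiler driver (the source string and the library name "myLib" are illustrative):

Needs["CCompilerDriver`"]
CCompilers[]  (* compilers available before CUDALink is loaded *)

Needs["CUDALink`"]
CCompilers[]  (* the list now includes NVCCCompiler *)

(* compile a CUDA source string into a library using NVCC *)
code = "__global__ void empty_kernel() {}";
lib = CreateLibrary[code, "myLib", "Compiler" -> NVCCCompiler]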

Writing Kernels for CUDALink

Users familiar with CUDA programming might consider the following strategies when programming for CUDALink. These strategies help make CUDA code more portable, correct, efficient, and easier to debug.

Using Real Types

Because of the lack of double-precision support on most CUDA cards, a CUDA C programmer is forced to use floats to represent floating-point numbers. With CUDALink, you can use the Real_t type to define a floating-point number that uses the maximum precision found on the CUDA card.

This would ensure maximum compatibility and use the maximum precision available.
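For example, a kernel written with Real_t (paired with the _Real type specification when loading) compiles to float on cards without double-precision support and to double otherwise; a hedged sketch (the kernel name scale_kernel is illustrative):

code = "
__global__ void scale_kernel(Real_t * arr, mint len) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index >= len) return;
    arr[index] *= (Real_t)0.5;  /* Real_t is float or double, depending on the card */
}";

scale = CUDAFunctionLoad[code, "scale_kernel", {{_Real, "InputOutput"}, _Integer}, 256];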

Optimization

The compiler can optimize some loops if the user makes the loop parameters constant. The following is a slightly more optimized version than passing channels as a variable, since a constant loop bound lets the compiler perform more code optimization.
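A hedged sketch of the idea (the kernel name brighten_kernel is illustrative): the channel count is spliced into the kernel source as a literal at load time instead of being passed as a kernel argument, so the compiler can unroll the inner loop:

channels = 3;
code = "
__global__ void brighten_kernel(mint * img, mint width, mint height) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index >= width * height) return;
    for (int c = 0; c < " <> ToString[channels] <> "; c++)  /* constant loop bound */
        img[index * " <> ToString[channels] <> " + c] += 10;
}";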

Porting OpenCL to CUDA

Since both CUDALink and OpenCLLink handle all the underlying bookkeeping for CUDA and OpenCL, the user need only concentrate on the kernel code. This section covers an OpenCL-to-CUDA translation. It is assumed that the user has used OpenCLLink to develop the function.

The following table summarizes some of the common changes in the kernel code needed to port OpenCL programs to CUDA.
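A few standard correspondences give the flavor of the changes:

OpenCL                           CUDA
__kernel                         __global__
__global (pointer qualifier)     no qualifier needed
__local                          __shared__
get_global_id(0)                 threadIdx.x + blockIdx.x*blockDim.x
get_local_id(0)                  threadIdx.x
barrier(CLK_LOCAL_MEM_FENCE)     __syncthreads()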

This is a simple OpenCL function that takes an input image and color negates it.
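A minimal sketch of such a kernel and a corresponding CUDA port (the name negate_kernel and the argument layout are assumptions):

(* OpenCL version *)
openCLCode = "
__kernel void negate_kernel(__global mint * img, mint len) {
    int index = get_global_id(0);
    if (index >= len) return;
    img[index] = 255 - img[index];  /* color negate */
}";

(* CUDA port: only the qualifiers and the index computation change *)
cudaCode = "
__global__ void negate_kernel(mint * img, mint len) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index >= len) return;
    img[index] = 255 - img[index];  /* color negate */
}";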

In many cases, the above are the only required changes for OpenCL-to-CUDA porting in the Wolfram Language. There are a handful of other one-to-one function translations that are readily accessible in an OpenCL or CUDA programming guide.

An alternative to the find-and-replace mechanism shown above is to generate the kernel code symbolically and manipulate it in the Wolfram Language. The following section covers that approach.

Symbolic Code Generation

The Wolfram Language provides SymbolicC, which represents C code hierarchically as Wolfram Language expressions. This makes it well suited to creating, manipulating, and optimizing C code.

In conjunction with this capability, users can generate CUDA kernel code for several different targets, allowing greater portability, less platform dependency, and better code optimization.

Advantages of using SymbolicC are that you can ensure the generated C code contains no syntax errors, you can generate CUDA programs using SymbolicC and run them using CUDAFunctionLoad, and you can ease the porting of CUDA to OpenCL (or vice versa).

This section uses CUDALink's symbolic code-generation capabilities to write an "RGB"-to-"HSV" and "RGB"-to-"HSB" color converter. This conversion is a fairly common operation for image and video processing.

Your first function will take an integer array and set all its elements to 0. You first need to define the function prototype: a function accepting an input list and the list size. Use the utility functions as follows.
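A hedged SymbolicC sketch of such a prototype (the kernel name setZero_kernel is illustrative; the body zeroes each in-range element):

Needs["SymbolicC`"]

cudaKernel = CFunction["__global__ void", "setZero_kernel",
   {{"mint *", "arr"}, {"mint", "len"}},
   CBlock[{
     CDeclare["int", "index"],
     CAssign["index",
       COperator[Plus, {"threadIdx.x",
         COperator[Times, {"blockIdx.x", "blockDim.x"}]}]],
     CIf[COperator[Less, {"index", "len"}],
       CAssign[CArray["arr", {"index"}], 0]]
   }]];

(* render the symbolic representation as C source code *)
ToCCodeString[cudaKernel]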

Running the Program

Before running the code on the GPU, you need to define some required parameters. This defines the input and output image. Notice that the output is of type Real, since HSV and HSB are real valued.
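A hedged sketch of such definitions, using an example image and assuming CUDAMemoryAllocate for the real-valued output buffer (the variable names are illustrative):

Needs["CUDALink`"]

img = ExampleData[{"TestImage", "Mandrill"}];
{width, height} = ImageDimensions[img];

(* copy the input pixel data to the GPU *)
input = CUDAMemoryLoad[ImageData[img, "Byte"]];

(* allocate a real-valued output buffer, since HSV and HSB are real valued *)
output = CUDAMemoryAllocate[Real, {height, width, 3}];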