Introduction to Game Programming with CUDA

Intro to CUDA

Modern game engines have a lot going on. With so many subsystems competing for resources, multi-threading is a way of life. As multi-core CPUs have become cheaper, game developers have been able to take advantage of parallelism more easily. While Intel and AMD fight to bring more and more cores to the CPU, GPUs have long surpassed them in raw parallel ability. Modern GPUs contain thousands of cores, allowing tens of thousands of threads to execute code simultaneously. This presents game developers with yet another opportunity for parallelism. In separate threads, an engine may want to search or sort a large amount of data, pre-process trees, generate large amounts of random data, process an image, or perform calculations for a transformation or collision detection. Any parallel computational task can be a good candidate for offloading to the GPU. This article aims to show you one possible way of harnessing that ability in a game using NVIDIA's CUDA.

CUDA is both a parallel computing platform and a programming model that allows code to run directly on the processing cores that make up modern GPUs. It was created by NVIDIA and is currently supported only on NVIDIA hardware. It is similar to OpenCL in concept but different in execution. Using CUDA is as simple as having a recent NVIDIA graphics card and downloading the free toolkit; installers for Windows, Linux, and Mac OS X are available from NVIDIA's developer site. While it is proprietary to NVIDIA, the programming model is easy to use, is supported by many languages such as C/C++, Java, and Python, and is even seeing support on ARMv7 architectures. The CUDA syntax itself is based on C and so pairs well with games written in C or C++. The CUDA code you write is compiled to object code with NVIDIA's nvcc compiler and then linked with standard C code using gcc or Visual Studio to produce the final program. For simple programs, the same file can contain both your entry point and your CUDA function(s). After downloading and installing the toolkit, you can compile CUDA code from the command line with nvcc or through Visual Studio using the CUDA Runtime template, which makes it easy to combine standard C/C++ and CUDA code files in one project.

To demonstrate CUDA with C, we can start with a simple addition function. All samples shown in this article were compiled with the CUDA 5.5 toolkit:

__global__ void cudaAdd(int a, int b, int *c)
{
    *c = a + b;
}

This program adds two numbers and stores the result in c. The __global__ identifier marks this function as an entry point for the CUDA program. Now we will see an example of how to call the above program. This can be placed in the same file to create one complete program:
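A minimal host-side sketch of that call might look like the following, placed below cudaAdd in the same .cu file (the operand values 2 and 7 are arbitrary choices):

```cuda
#include <stdio.h>

int main(void)
{
    int c;
    int *dev_c;

    // Allocate space on the GPU for the result
    cudaMalloc((void**)&dev_c, sizeof(int));

    // Launch the kernel on one block with one thread
    cudaAdd<<<1, 1>>>(2, 7, dev_c);

    // Copy the result back to system memory and display it
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("2 + 7 = %d\n", c);

    cudaFree(dev_c);
    return 0;
}
```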

Programs on CUDA are executed as kernels, with one kernel executing at a time. A kernel can be run by just one thread or by thousands of threads at the same time. Since we are retrieving a result from the GPU, we first use CUDA to allocate memory for it. Next we execute our program, using the <<< >>> syntax to specify how many blocks and threads we want the kernel to use. The number of threads that can run in a block depends on the specific architecture of your GPU; on Fermi GPUs you can execute up to 1024 threads per block. For this simple example we are just executing one thread on one block. Once we have the data in our c variable, we need to copy it back to system memory using cudaMemcpy. Finally we can display the result.

Performing a Reduce

With a simple example out of the way, we can look at a more common operation. A reduce is a parallel operation where data that exists across many threads is combined over a series of steps until a single value is held by one thread. A common example is computing a sum, where each step adds the values of two different threads. After each step, fewer and fewer threads are used until only the final thread adds the last two values remaining and holds the sum. For this sample, we will demonstrate a program that has separate threads count the number of 5's in parts of an array and then performs a reduce to get the final total. This sample can be run over any number of blocks and threads:
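A kernel along those lines could look like this sketch; the names countFives and counts are my own, and it assumes the block size is a power of two:

```cuda
__global__ void countFives(int *data, int size, int *total)
{
    // One int of shared memory per thread, sized at launch time
    extern __shared__ int counts[];

    int tid = threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Step 1: each thread counts the 5's in its own part of the array
    int count = 0;
    for (int i = blockIdx.x * blockDim.x + tid; i < size; i += stride)
        if (data[i] == 5)
            count++;
    counts[tid] = count;
    __syncthreads();

    // Step 2: add reduction across this block's threads; each pass
    // halves the number of active threads (blockDim.x must be a power of two)
    for (int s = blockDim.x / 2; s > 0; s /= 2)
    {
        if (tid < s)
            counts[tid] += counts[tid + s];
        __syncthreads();
    }

    // Step 3: the first thread of each block adds its block's count
    // to the global total; atomicAdd avoids contention between blocks
    if (tid == 0)
        atomicAdd(total, counts[0]);
}
```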

This program has three basic steps. First we break up the array into chunks and have each thread look for 5's in its own chunk. Then we perform a simple add reduction across the threads on each block, storing the result in the first element of each block's shared memory. For the last step we use an atomic add to update the global total across the different blocks; the atomic add prevents contention issues between threads. The __syncthreads function shown here provides a synchronization point: all threads must reach it before the program can continue. The example as a whole is inefficient, as it only uses about half the total threads for the reduction and has potential contention when accessing global memory, but it hopefully demonstrates the basic concept of a reduction. The following allocates memory for the array and calls the function:
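A host-side sketch of that allocation and call might look like this (the array size, fill pattern, and launch configuration are arbitrary choices):

```cuda
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int size = 4096;
    const int threads = 256, blocks = 4;
    int total = 0;

    // Fill a host array with 0-9 so roughly a tenth of the values are 5
    int *host_data = (int*)malloc(size * sizeof(int));
    for (int i = 0; i < size; i++)
        host_data[i] = i % 10;

    int *dev_data, *dev_total;
    cudaMalloc((void**)&dev_data, size * sizeof(int));
    cudaMalloc((void**)&dev_total, sizeof(int));
    cudaMemcpy(dev_data, host_data, size * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_total, &total, sizeof(int), cudaMemcpyHostToDevice);

    // Third launch parameter sizes the shared memory: one int per thread
    countFives<<<blocks, threads, threads * sizeof(int)>>>(dev_data, size, dev_total);

    cudaMemcpy(&total, dev_total, sizeof(int), cudaMemcpyDeviceToHost);
    printf("Found %d fives\n", total);

    cudaFree(dev_data);
    cudaFree(dev_total);
    free(host_data);
    return 0;
}
```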

CUDA Thrust

A really great library for common CUDA tasks is Thrust, a template library that provides STL-like syntax to increase developer productivity. The CUDA SDK ships with a version of Thrust that can easily be used from C++ code. The following demonstrates a sum reduction and a count of 5's using the same array as above:
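A sketch of what those two Thrust calls could look like, assuming the same host_data array and size from the reduction sample:

```cuda
#include <stdio.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/count.h>
#include <thrust/functional.h>

// ... inside main, after filling host_data as before ...

// Copy the array to the device
thrust::device_vector<int> dev_vec(host_data, host_data + size);

// Sum reduction over the whole array
int sum = thrust::reduce(dev_vec.begin(), dev_vec.end(), 0, thrust::plus<int>());

// Count the number of 5's
int fives = thrust::count(dev_vec.begin(), dev_vec.end(), 5);

printf("Sum: %d, fives: %d\n", sum, fives);
```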

As you can see, the syntax is very similar to the Standard Template Library and makes it very easy to call common functions, saving you a lot of coding time. It also integrates well with STL vectors. The Thrust project site has many useful examples of what the library can do.

Integrating with OpenGL

A great feature of CUDA is its built-in ability to work with OpenGL directly. This gives a CUDA program easy access to data such as textures, pixel buffers, or vertex buffers so it can operate on them quickly. Here we will see how we can use CUDA to alter the data in a vertex buffer in parallel. The buffer shown here will be small and simple for demonstration purposes. I won't show all of the basic OpenGL setup or program layout; this sample will work with code from any basic OpenGL tutorial. I placed all my OpenGL code and the main game loop in one C file, and the CUDA kernel function and a wrapper to call it in a separate file with a .cu extension.

To get started, first we need to define our simple data structures to use to create the vertex buffer:
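One possible layout for those structures; the exact fields and names here are assumptions made for the sake of the sample:

```cuda
// Position and color for each vertex of the quad
struct Position
{
    float x, y, z;
};

struct Color
{
    float r, g, b, a;
};

struct Vertex
{
    Position pos;
    Color color;
};
```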

Next we want to allocate an array of the above structures for our vertex data and then generate a buffer. For this sample we will just allocate an array of four vertices to store a quad. We also need a global variable to store the ID of our vertex buffer:
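A sketch of that setup might look like the following; cudaGraphicsGLRegisterBuffer is the call that registers the OpenGL buffer as a CUDA resource (the vertex positions and colors are arbitrary):

```cuda
GLuint vbo;                               // ID of our vertex buffer
struct cudaGraphicsResource *cudaVBO;     // CUDA's handle to that buffer

void createVBO(void)
{
    // Four vertices forming a quad centered on the origin
    Vertex vertices[4] =
    {
        { { -0.5f, -0.5f, 0.0f }, { 1.0f, 0.0f, 0.0f, 1.0f } },
        { {  0.5f, -0.5f, 0.0f }, { 0.0f, 1.0f, 0.0f, 1.0f } },
        { {  0.5f,  0.5f, 0.0f }, { 0.0f, 0.0f, 1.0f, 1.0f } },
        { { -0.5f,  0.5f, 0.0f }, { 1.0f, 1.0f, 1.0f, 1.0f } }
    };

    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, sizeof(vertices), vertices, GL_DYNAMIC_DRAW);
    glBindBuffer(GL_ARRAY_BUFFER, 0);

    // Register the buffer with CUDA so kernels can read and write it
    cudaGraphicsGLRegisterBuffer(&cudaVBO, vbo, cudaGraphicsMapFlagsNone);
}
```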

The resource now has a pointer to the vertex buffer we created above, which allows us to retrieve and modify the vertices using CUDA. The actual program to modify the vertices is very simple. Since we want to stretch our quad in all directions, we first get a positive or negative one by dividing each vertex position component by its absolute value. Then we multiply it by the elapsed time in seconds and by our desired rate of movement of 0.05 units per second.
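A kernel along those lines could look like this (stretchVertices is a name I chose; the division assumes no position component is exactly zero, which holds for the quad above):

```cuda
__global__ void stretchVertices(Vertex *vertices, float elapsed)
{
    // One thread per vertex, one block
    int i = threadIdx.x;

    // pos / |pos| gives +1 or -1, so each vertex moves
    // away from the origin at 0.05 units per second
    vertices[i].pos.x += (vertices[i].pos.x / fabsf(vertices[i].pos.x)) * elapsed * 0.05f;
    vertices[i].pos.y += (vertices[i].pos.y / fabsf(vertices[i].pos.y)) * elapsed * 0.05f;
}
```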

I placed this code in a file separate from the main C file with the OpenGL code and gave it a .cu extension. Note that the program assumes each thread will act on only one vertex. It also assumes one block for simplicity, but you could easily execute it over multiple blocks if you had enough vertices. We use the index of the current thread to determine which vertex to operate on. We also use an elapsed-time variable to control how much change we want in each loop. This helps keep the movement constant when frame rates vary and our elapsed-time delta is constantly changing.

The last step now is to create a function to call our CUDA kernel. We can place this function in the same .cu file. The extern "C" declaration is used so that our main C program is able to find it when compiling and linking.
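The wrapper could be as simple as this sketch (launchStretch is an assumed name):

```cuda
extern "C" void launchStretch(Vertex *vertices, float elapsed)
{
    // 4 threads in one block: one thread per vertex of the quad
    stretchVertices<<<1, 4>>>(vertices, elapsed);
}
```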

All the wrapper needs to do is pass in the arguments and tell CUDA how many blocks and threads to run on. In this example we tell it to run 4 threads in one block, so each thread has its own vertex. With the function in place, we can call it from the main logic loop. If you are using multiple files, you will want to put the function's signature with the extern keyword in your main C file so it can be found when linking. This code is set to execute once per loop:
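A sketch of that per-loop code, assuming the cudaVBO resource handle from the buffer setup (updateVertices is an assumed name):

```c
// In the main C file, so the linker can resolve the wrapper in the .cu file
extern void launchStretch(Vertex *vertices, float elapsed);

void updateVertices(float elapsed)
{
    Vertex *vertices;
    size_t size;

    // Map the registered buffer and get a device pointer to the vertices
    cudaGraphicsMapResources(1, &cudaVBO, 0);
    cudaGraphicsResourceGetMappedPointer((void**)&vertices, &size, cudaVBO);

    // Modify the vertices on the GPU
    launchStretch(vertices, elapsed);

    // Unmap so OpenGL can use the buffer again
    cudaGraphicsUnmapResources(1, &cudaVBO, 0);
}
```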

The code works by mapping our CUDA resource and getting a pointer to the vertices in the vertex buffer. These are passed to the kernel wrapper to be modified, and the resource is then unmapped so the buffer is released back to OpenGL. This sample assumes there is some code for getting the elapsed-time delta between this loop and the previous one; QueryPerformanceCounter works well for this. After clearing buffers and setting our texture, our render code looks like this:
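A fixed-function-pipeline sketch of that draw call, assuming the Vertex layout defined earlier:

```c
glBindBuffer(GL_ARRAY_BUFFER, vbo);

// Positions first, then colors, using the Vertex struct as the stride
glVertexPointer(3, GL_FLOAT, sizeof(Vertex), (void*)0);
glColorPointer(4, GL_FLOAT, sizeof(Vertex), (void*)sizeof(Position));

glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_COLOR_ARRAY);
glDrawArrays(GL_QUADS, 0, 4);
glDisableClientState(GL_COLOR_ARRAY);
glDisableClientState(GL_VERTEX_ARRAY);

glBindBuffer(GL_ARRAY_BUFFER, 0);
```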

And that's it. OpenGL integration is fairly straightforward when dealing with buffers. This example can easily be extended to cover texture buffers, pixel buffers, or render buffers as well.

Integrating with Direct3D

Similar to its integration with OpenGL, CUDA provides the ability to tie in with Direct3D 9, 10, or 11. Here I will demonstrate the Direct3D 11 version of modifying a simple vertex buffer. Just like the OpenGL example, we will create a simple quad that we can resize in a game loop. We can use the same vertex structure from the OpenGL example, which allows us to reuse the same CUDA kernel function as earlier:
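A sketch of the Direct3D 11 buffer setup; cudaGraphicsD3D11RegisterResource plays the role that cudaGraphicsGLRegisterBuffer did in the OpenGL version (function and variable names are assumptions):

```cuda
ID3D11Buffer *d3dVBO;                     // the Direct3D 11 vertex buffer
struct cudaGraphicsResource *cudaVBO;     // CUDA's handle to it

void createD3DBuffer(ID3D11Device *device)
{
    // Same four quad vertices used in the OpenGL sample
    Vertex vertices[4] =
    {
        { { -0.5f, -0.5f, 0.0f }, { 1.0f, 0.0f, 0.0f, 1.0f } },
        { {  0.5f, -0.5f, 0.0f }, { 0.0f, 1.0f, 0.0f, 1.0f } },
        { {  0.5f,  0.5f, 0.0f }, { 0.0f, 0.0f, 1.0f, 1.0f } },
        { { -0.5f,  0.5f, 0.0f }, { 1.0f, 1.0f, 1.0f, 1.0f } }
    };

    D3D11_BUFFER_DESC desc = { 0 };
    desc.Usage = D3D11_USAGE_DEFAULT;
    desc.ByteWidth = sizeof(vertices);
    desc.BindFlags = D3D11_BIND_VERTEX_BUFFER;

    D3D11_SUBRESOURCE_DATA data = { 0 };
    data.pSysMem = vertices;
    device->CreateBuffer(&desc, &data, &d3dVBO);

    // Register the Direct3D buffer with CUDA, mirroring
    // cudaGraphicsGLRegisterBuffer in the OpenGL version
    cudaGraphicsD3D11RegisterResource(&cudaVBO, d3dVBO, cudaGraphicsRegisterFlagsNone);
}
```

From here, the map, kernel launch, and unmap steps use the same cudaGraphicsMapResources and cudaGraphicsResourceGetMappedPointer calls as the OpenGL loop.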

Now we have seen some basic examples of how to create CUDA programs and how they can directly interact with data from OpenGL or Direct3D. These examples are pretty basic but hopefully provide a springboard to more advanced concepts. The SDK is loaded with useful samples that demonstrate the power and flexibility of the toolkit.

About the Author(s)

I am a hobbyist game programmer who also dabbles in HPC and distributed computing.

Con: Only Nvidia cards
Pro: Easier to program with compared to OpenCL and easier to learn.

I actually found it easier to learn and code with OpenCL. There are other more important pros, like the ability to use inline assembly (PTX) in CUDA. Another important pro is the ability to use classes in CUDA (I am not sure if you can do that in OpenCL).

Not in OpenCL 1.2, though there is an AMD-only extension that lets you use C++-like classes and templates (subject to the usual restrictions, e.g. no virtual functions and so on). Perhaps there is something ready for the OpenCL 2.0 standard, though, which I have been told focuses on making better use of the GPU through its relatively generic interface (remember, OpenCL needs to work on many different types of hardware, not just NVIDIA cards, so some API concepts may not map 1:1 to how GPUs work under the hood, limiting the potential for optimization).

Thanks for the clarification :) I was mostly using 1.1 because of the target hardware of the application I was working on, but I can't wait for 2.0.

So how do you port a CUDA game to Android? lol, joking. How can I measure the time elapsed between a function's start and end? I want to test a parallel algorithm, try variations on it, and see which variation is faster.

Thanks to this article I've been looking through the Khronos Group website (CUDA seems fun but is limited to NVIDIA hardware only, so I began researching OpenCL) for some time and found the official C++ specification for OpenCL 1.2. It's just a C++ wrapper for the C API, but at least you get some OOP out of it. Sadly I don't see anything like that for OpenCL 2.0 for the time being. I'm posting a link below:

CUDA provides event classes that can be used to record starting and completion times -- in the docs, look for functions that start with cudaEvent...
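For example, a sketch of timing a kernel launch with the event API:

```cuda
cudaEvent_t start, stop;
float ms = 0.0f;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
// ... launch the kernel you want to time here ...
cudaEventRecord(stop, 0);

// Wait for everything before the stop event to finish,
// then read the elapsed time in milliseconds
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);
printf("Kernel took %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```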