Introducing OpenCL

Over the past decade, graphics cards have gone from being simple accelerators to being fast general-purpose computing engines. David Chisnall looks at OpenCL, a new API for running non-graphics applications on modern GPUs.

The first desktop graphics cards were frame buffers. They provided a region of memory where you could write color values, and a digital-to-analog converter that generated a signal for a CRT. A modern GPU is a very different beast. Something like the Radeon HD 5800 has 2.15 billion transistors, more than a six-core Xeon.

Whereas the older graphics hardware was only usable for graphics, a modern GPU is a powerful computing engine in its own right. It makes sense to consider using it for tasks other than handling graphics. This is where OpenCL fits in. Like OpenGL, it's an abstract API that hides the details of the GPU implementation. Unlike OpenGL, however, OpenCL is intended for arbitrary computations, rather than just graphics.

In this article, we'll take a look at the concepts behind OpenCL and create a simple program (implementing Conway's Game of Life) that uses OpenCL. Open the syntax-highlighted opencl.c.html file if you want to follow along as we go through the code.

Understanding the GPU

GPUs are often said to be special-purpose processors, with the implication that CPUs are general-purpose processors. This designation is misleading. A modern CPU and a modern GPU are both general-purpose processors; both can implement any algorithm. But they're not built with the same design goals.

In theory, you could run any program on a GPU, just as you can run a complete OpenGL stack on the CPU. It might not run fast, however, because both the CPU and GPU are heavily optimized toward a certain kind of instruction stream. CPUs expect a lot of integer computations and a lot of branches. The code that runs on a CPU expects a branch around every seven instructions, on average. In contrast, a GPU expects very few branches and a lot of floating-point arithmetic.

The CPU and the GPU also have very different memory models: A GPU typically doesn't have memory protection and is designed for streaming. A CPU expects a lot of locality of reference, so it has a big cache for storing the program's working set. The GPU expects you to read from one blob of memory and write to another; for example, reading from vertex lists and textures and writing a picture. Therefore, you can't just recompile your application for a GPU and expect it to run faster. You must select parts that will run efficiently on a GPU and design them carefully.

The OpenCL Model

OpenCL really describes two things:

A C99-derived language, called OpenCL C, which is designed to be easy to compile for the GPU

A library designed for compiling this code and running it
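On the library side, compiling and loading a kernel looks roughly like the fragment below. This is a sketch, not the article's code: it assumes a `cl_context` and `cl_device_id` have already been created, the kernel name `my_kernel` is hypothetical, and error handling is omitted.

```c
/* Fragment: compiling OpenCL C source at run time.  Assumes `context`
 * and `device` already exist; error checks omitted for brevity. */
cl_int err;
const char *source = "... OpenCL C kernel source ...";
cl_program program =
    clCreateProgramWithSource(context, 1, &source, NULL, &err);
clBuildProgram(program, 1, &device, NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(program, "my_kernel", &err);
```

Note that the OpenCL C source is handed over as an ordinary string and compiled at run time, which is what lets the same program target whatever device happens to be present.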

The most important unit of code in OpenCL is the kernel. A kernel is a self-contained program that runs on some data. Typically, an implementation will run several copies of the kernel in parallel. Kernels are equivalent to public library functions, with some extra constraints on their concurrency support.

Our example for this article will implement Conway's Game of Life, with a kernel that calculates what the next value for a cell should be. Conceptually, this kernel runs in parallel on every single cell in the input. In practice, the OpenCL stack may run it entirely on one CPU core, run 128 copies of it concurrently on a GPU, and so on.

When you write your kernel, you need to remember this fact. A number of instances of the kernel might run on the same input concurrently, so you have to make sure that they don't interact with each other. To make it easier to prevent interaction, you typically will separate input and output completely, which is what we'll do in this example. Our kernel will take pointers to input and output boards as arguments, read nine cells from the input, and write one cell in the output. The OpenCL stack can run one instance of this kernel at a time, one instance per cell in the board concurrently, or anything in between. This statement is a slight oversimplification, of course. The OpenCL memory model allows kernels to lock regions of memory and enforce synchronization, but with some quite complex limitations.
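A kernel matching that description might look something like the following OpenCL C sketch. The function name and the wrap-around edge handling are my own; each instance asks `get_global_id()` which cell it is responsible for.

```c
/* OpenCL C sketch: one instance of this kernel runs per cell. */
__kernel void next_generation(__global const char *in,
                              __global char *out,
                              int width, int height)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int neighbours = 0;
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++)
        {
            if (dx == 0 && dy == 0) continue;
            int nx = (x + dx + width) % width;
            int ny = (y + dy + height) % height;
            neighbours += in[ny * width + nx];
        }
    char alive = in[y * width + x];
    /* Reads nine input cells, writes exactly one output cell, so
     * concurrent instances never interact. */
    out[y * width + x] = (neighbours == 3) || (alive && neighbours == 2);
}
```

Because the kernel writes only its own output cell, it needs none of the locking machinery mentioned above.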

Commands are sent to OpenCL via queues, a form of implicit serialization. If you tell OpenCL to read some data, run a kernel on it, and then read back the results, all of these actions will happen in the background. Typically, you start them going and then wait for the last one (generating the output) to complete.
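In host code, that pattern looks roughly like the fragment below. This is a sketch under stated assumptions: `queue`, `kernel`, the two `cl_mem` buffers, and the sizes are presumed to exist already, and error checking is omitted.

```c
/* Fragment: enqueueing work.  Assumes queue, kernel, input_buffer,
 * and output_buffer already exist; error checks omitted. */
size_t global[2] = { width, height };
/* Non-blocking upload of the input board... */
clEnqueueWriteBuffer(queue, input_buffer, CL_FALSE, 0, size, host_in,
                     0, NULL, NULL);
/* ...then the kernel, one instance per cell... */
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL,
                       0, NULL, NULL);
/* ...and finally a blocking read (CL_TRUE): it returns only once
 * everything queued before it has finished. */
clEnqueueReadBuffer(queue, output_buffer, CL_TRUE, 0, size, host_out,
                    0, NULL, NULL);
```

Only the final read blocks; the write and the kernel launch return immediately, and the in-order queue guarantees they complete before the read begins.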