Graphics Processing Unit (GPU) Programming in CUDA and OpenCL

First, a brief word on GPU hardware--it's really not as weird as you
think. Basically, the 1000 or so "stream processors" in a modern
GPU typically consist of:

16 or 32-way SIMD execution units, driven by a weakly
superscalar scheduling unit, making up one core.
The GPU compiler hides the SIMD instructions, so they're never in your
face from a programmability standpoint. But for performance,
branch granularity is very important, since all the SIMD units will
take the same branches. Unlike the CPU, GPU SIMD width has
been shrinking lately (ATI just switched from 64 to 16).
NVIDIA/CUDA calls one SIMD block a "thread warp"; ATI calls the same
thing a "wavefront".

Between 2 and 48 cores. Nobody calls them cores (for
marketing reasons), but conceptually that's exactly what they
are. Again, the graphics driver spreads your code across the
cores for you.

Very deep hyperthreading, with each core running up to several hundred threads.

If you multiply these numbers, you get about a thousand arithmetic
units executing instructions chosen from 10K-100K threads, which as you
might expect can deliver serious performance, currently about a
teraflop.

But if you look at a modern CPU, like an Intel Sandy Bridge, it might have:

8-way AVX SIMD units, driven by a very aggressive scheduling unit. Sadly, as a programmer you get the SIMD right in your face.

4 or 8 cores. Sadly, these are again right in your face, unless you're using OpenMP or some similar decent tool.

2-way hyperthreading, doubling the number of software-visible cores.

Multiplying these out shows about 100-way parallelism is certainly
possible, although to get there the traditional programming interface
is nasty. Last year Intel and AMD both released very impressive
multicore/SIMD OpenCL drivers for their CPU hardware--thus using the
GPU programming model on the CPU!

So the modern contest is really not "ordinary serial C++" versus "crazy
GPU stuff". Instead, it's a three-way battle, between the
increasingly dated sequential C++, a wide variety of multicore/SIMD CPU
models, and a few new GPU programming languages.

I foresee increasingly little difference between a heavily SIMD+multithreaded CPU, and a GPU.

CUDA

CUDA is NVIDIA corporation's C++-style GPU programming language.
It only runs on NVIDIA hardware, which is annoying, but the language is
really quite good, so it's currently the most popular.

Define a funky GPU (__global__) function "set_array". It writes to a GPU array named "vals" at its own thread index.

Define an ordinary C++ CPU main function.

Allocate space on the GPU with cudaMalloc. You get a
pointer back, but it's a pointer into GPU space, so you can't
dereference it, only hand it back to the GPU!

Call your funky GPU function using the funky "kernel invocation
syntax". Inside the <<< >>> goes the number of
GPU "blocks" of threads, then the number of threads per block. The total parallelism is blocks*threads.

Copy the data back to the CPU using cudaMemcpy. Be sure to check for errors when you do this, unlike above!
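
Here's a rough sketch of that simple version (the array length is just a placeholder):

#include <cstdio>

/* GPU function: each thread writes its own index into the array "vals". */
__global__ void set_array(float *vals) {
	int i=threadIdx.x;   /* my thread number within the block */
	vals[i]=i;           /* write into GPU global memory */
}

int main() {
	int n=16;                      /* number of floats (placeholder) */
	float *vals=0;                 /* will point into GPU memory */
	cudaMalloc((void **)&vals, n*sizeof(float));  /* can't dereference this on the CPU! */

	set_array<<<1,n>>>(vals);      /* 1 block of n threads */

	float out[16];
	if (cudaMemcpy(out, vals, n*sizeof(float), cudaMemcpyDeviceToHost) != cudaSuccess)
		printf("cudaMemcpy failed!\n");    /* check errors here, as advised above */
	for (int i=0;i<n;i++) printf("out[%d] = %f\n", i, out[i]);
	return 0;
}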

In the version below, we actually check errors, and we're allocating
threads in 'blocks', which lets you use multiple blocks to work around
the hardware's thread count limit of 512 threads per block:
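
A sketch of that blocked, error-checked version (the CHECK macro is my own shorthand, not part of CUDA):

#include <cstdio>
#include <cstdlib>

/* Bail out with a message if a CUDA call fails. */
#define CHECK(call) { cudaError_t e=(call); if (e!=cudaSuccess) { \
	printf("CUDA error '%s' at line %d\n",cudaGetErrorString(e),__LINE__); exit(1); } }

__global__ void set_array(float *vals,int n) {
	int i=threadIdx.x+blockIdx.x*blockDim.x;   /* global thread index */
	if (i<n) vals[i]=i;   /* guard: the last block may have a few extra threads */
}

int main() {
	int n=100000;                      /* far more than 512, so we need multiple blocks */
	float *vals=0;
	CHECK(cudaMalloc((void **)&vals, n*sizeof(float)));

	int threadsPerBlock=256;
	int blocks=(n+threadsPerBlock-1)/threadsPerBlock;   /* round up */
	set_array<<<blocks,threadsPerBlock>>>(vals,n);
	CHECK(cudaGetLastError());         /* catch kernel launch errors */

	float *out=new float[n];
	CHECK(cudaMemcpy(out, vals, n*sizeof(float), cudaMemcpyDeviceToHost));
	printf("out[%d] = %f\n", n-1, out[n-1]);
	delete[] out;
	CHECK(cudaFree(vals));
	return 0;
}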

Several interesting architectural details crop up in the CUDA documentation:

The execution model is very unusual, especially compared to typical "what one thread should do" machine code.

A "thread" means one single execution of a kernel. This is
mostly conceptual, since the hardware operates on warps of threads.

A "warp" is a group of 32 threads that all take the same
branches. A warp is really a SIMD group: a bunch of floats sharing one
instruction decoder. The hardware does a good job with predication, so
warps aren't "in your face" like with SSE instructions.

A "block" is a group of a few hundred threads that have access to
the same "__shared__" memory. The block size is specified in software,
but limited by hardware to 256, 512, or 1024 threads maximum. More threads
per block is generally better, with serious slowdowns for less than 100
or so threads per block. The PTX manual calls blocks "CTAs" (Cooperative
Thread Arrays).

The entire kernel consists of a set of blocks of threads.
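
Putting those three levels together, the usual indexing idiom inside a kernel looks like this sketch:

__global__ void scale(float *dest,float factor,int n) {
	/* blockIdx.x  = which block I'm in (0..gridDim.x-1)
	   blockDim.x  = threads per block
	   threadIdx.x = my thread within my block */
	int i=blockIdx.x*blockDim.x+threadIdx.x;
	if (i<n) dest[i]*=factor;
}
/* launched as, say:  scale<<<nBlocks,256>>>(ptr,2.0f,n);  */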

The memory model is also highly segmented and specialized, unlike the flat memory of modern CPUs.

"registers" are unique to that thread. Early 8000-series
cards had 8192 registers available; GTX 200 series had 16K registers;
and the new Fermi GTX 400s have 32K registers. Registers are
divided up among threads, so the fewer registers each thread uses, the
more threads the machine can keep in flight, hiding latency.

"shared"
memory is declared with __shared__, and can be read or
written by all the threads in a block. This is handy for
neighborhood
communication where global memory would be too slow. There is at
least 16KB of shared memory available per thread block; Fermi cards can
expose up to 48KB with special config options. Use "__syncthreads()__"
to synchronize shared writes across a whole thread block.

"global"
memory is the central gigabyte or so of GPU RAM. All cudaMemcpy calls
go to global memory, which is considered "slow" at only 100GB/second!
Older hardware had very strict rules on "global memory coalescing", but
luckily newer (Fermi-era) hardware just prefers locality, if you can
manage it.

"param" is the PTX abstraction around the parameter-passing
protocol. They reserve the right to change this, as hardware and
software changes.

constants and compiled program code are stored in their own read-only memories.

"local" memory is unique to each thread, but paradoxically slower than shared memory. Don't use it!

CUDA Performance: Use Big Arrays!

One universal aspect of CUDA is that kernel calls (<<<),
mallocs, and memcpy all take quite a long time to get running.
Once they're running, they're fairly fast, but for maximum efficiency
you should plan on accessing about a megabyte each time! Here's
an example where I benchmark this length dependence:
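
A sketch of the round-trip timing (copy in, run a kernel, copy out) using CUDA events; the real benchmark loops this over many array lengths n:

__global__ void doubleVals(float *vals,int n) {
	int i=blockIdx.x*blockDim.x+threadIdx.x;
	if (i<n) vals[i]*=2.0f;
}

/* Return nanoseconds per float for one copy-in / kernel / copy-out round trip. */
float timeRoundTrip(int n) {
	float *cpu=new float[n];
	for (int i=0;i<n;i++) cpu[i]=i;
	float *gpu=0;
	cudaMalloc((void **)&gpu, n*sizeof(float));

	cudaEvent_t start,stop;
	cudaEventCreate(&start); cudaEventCreate(&stop);
	cudaEventRecord(start);
	cudaMemcpy(gpu,cpu,n*sizeof(float),cudaMemcpyHostToDevice);
	doubleVals<<<(n+255)/256,256>>>(gpu,n);
	cudaMemcpy(cpu,gpu,n*sizeof(float),cudaMemcpyDeviceToHost);
	cudaEventRecord(stop);
	cudaEventSynchronize(stop);

	float ms=0;
	cudaEventElapsedTime(&ms,start,stop);
	cudaFree(gpu); delete[] cpu;
	return ms*1.0e6f/n;   /* milliseconds -> nanoseconds, per float */
}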

Note that really big arrays, with millions of floats, take much less
time per float than smaller arrays. The basic trouble is that
going through the OS, across the PCIe bus, into the GPU, and back takes
like 10 microseconds (10us = 10000ns). If you go to all
that just to get one or two floats, you'll get terrible performance.

CUDA Application Performance: Mandelbrot Rendering

Here's a little Mandelbrot set rendering application in CUDA. It
includes benchmarking code, which shows some surprising results.
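
The heart of it is a kernel along these lines, one thread per pixel (a sketch only; the real application also times each step and writes the image out):

/* Sketch of the Mandelbrot kernel: each thread computes one pixel's iteration count. */
__global__ void mandelbrot(unsigned char *img,int wid,int ht) {
	int x=blockIdx.x*blockDim.x+threadIdx.x;
	int y=blockIdx.y*blockDim.y+threadIdx.y;
	if (x>=wid || y>=ht) return;

	float cr=-2.0f+3.0f*x/wid;   /* map this pixel into the complex plane */
	float ci=-1.5f+3.0f*y/ht;
	float zr=0.0f, zi=0.0f;
	int count=0;
	while (count<255 && zr*zr+zi*zi<4.0f) {   /* iterate z = z*z + c */
		float nzr=zr*zr-zi*zi+cr;
		zi=2.0f*zr*zi+ci;
		zr=nzr;
		count++;
	}
	img[y*wid+x]=(unsigned char)count;   /* "char" output: less data to write and copy back */
}
/* launched with a 2D grid, e.g.:
     dim3 threads(16,16), blocks((wid+15)/16,(ht+15)/16);
     mandelbrot<<<blocks,threads>>>(gpuImg,wid,ht);             */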

The biggest surprise is the cudaMalloc time, which is ridiculously huge
because this is the first CUDA call in the program, so the driver has
to locate
and set up the graphics card. Adding a "dummy" cudaMalloc
of four bytes reduces the allocation time to less than a nanosecond per
pixel.

Second, it takes longer to copy the data off the card than it does to
compute it! Switching to a "char" data type cuts both the
copy-out time (due to less data copied) and the rendering time
(due to less data written).

CUDA and Thrust

A nice library for more complex operations is called Thrust,
sort
of like STL or Boost for the GPU. As of CUDA 4.0, Thrust is an
official part of the CUDA distribution, and it's preinstalled on NetRun.

Here's how to use thrust::reduce, which can be used to add up all the
values in an array, or find their maximum or minimum. The 0
argument is the additive identity (the starting value); thrust::plus is the operation to perform.
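
For example (the vector name and contents are just for illustration):

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <iostream>

int main() {
	thrust::device_vector<float> vec(1000, 1.25f);   /* the data lives on the GPU */

	/* 0.0f is the additive identity (starting value);
	   thrust::plus<float>() is the operation to perform. */
	float sum=thrust::reduce(vec.begin(),vec.end(),0.0f,thrust::plus<float>());

	/* Same idea for maximum and minimum: */
	float mx=thrust::reduce(vec.begin(),vec.end(),-1.0e30f,thrust::maximum<float>());
	float mn=thrust::reduce(vec.begin(),vec.end(), 1.0e30f,thrust::minimum<float>());

	std::cout<<"sum="<<sum<<" max="<<mx<<" min="<<mn<<"\n";
	return 0;
}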

It's not very efficient (see below) to call thrust::reduce three
times on the same vector. It's more efficient to call it once,
and collect up the sum, min, and max all in one go. To do this,
we need to write a weird 'functor' to pass to thrust::reduce.
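
One way to do that, using thrust::transform_reduce and a combining functor (the struct and functor names here are my own):

#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <iostream>

/* Running sum, min, and max, carried along together. */
struct summary {
	float sum, mn, mx;
};

/* Turn one float into a one-element summary. */
struct make_summary {
	__host__ __device__ summary operator()(float v) const {
		summary s; s.sum=v; s.mn=v; s.mx=v; return s;
	}
};

/* Combine two summaries into one: this is the 'weird functor'. */
struct combine_summary {
	__host__ __device__ summary operator()(const summary &a,const summary &b) const {
		summary s;
		s.sum=a.sum+b.sum;
		s.mn =a.mn<b.mn ? a.mn : b.mn;
		s.mx =a.mx>b.mx ? a.mx : b.mx;
		return s;
	}
};

int main() {
	thrust::device_vector<float> vec(1000, 1.25f);
	summary init; init.sum=0.0f; init.mn=1.0e30f; init.mx=-1.0e30f;
	summary s=thrust::transform_reduce(vec.begin(),vec.end(),
	                                   make_summary(),init,combine_summary());
	std::cout<<"sum="<<s.sum<<" min="<<s.mn<<" max="<<s.mx<<"\n";
	return 0;
}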

Note how sorting 16 elements and 16 thousand elements takes nearly
exactly the same amount of time. This is common on the GPU--the
startup time to get a kernel running is measured in microseconds, not
nanoseconds like a function call. This means you need to do work
in huge batches, millions of data items at a time, to get good
efficiency.
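
For reference, the GPU sort being timed there is just one Thrust call, something like:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

int main() {
	thrust::host_vector<float> h(16000);
	for (int i=0;i<(int)h.size();i++) h[i]=(i*9177)%1000;  /* some unsorted data */
	thrust::device_vector<float> d=h;        /* copy to the GPU */
	thrust::sort(d.begin(),d.end());         /* sort on the GPU: one call */
	return 0;
}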

CUDA Near Peak Performance

This program hits nearly peak performance on a GTX 280, about 520 gigaflops:
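
The general shape of such a kernel is a long run of independent register-to-register multiply-adds with almost no memory traffic; a rough sketch (not the actual benchmark, and no particular flop rate is promised):

/* Sketch of an arithmetic-bound kernel: lots of register-to-register
   multiply-adds, with only one memory write at the very end. */
__global__ void flops(float *dest) {
	int i=blockIdx.x*blockDim.x+threadIdx.x;
	float a=0.5f+i*1.0e-6f, b=1.0001f, c=0.9999f;
	for (int k=0;k<1000;k++) {   /* thousands of flops per memory access */
		a=a*b+c;
		b=b*c+a;
		c=c*a+b;
	}
	dest[i]=a+b+c;   /* one write, so the compiler can't discard the math */
}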

Dr. Lawlor's EPGPU Language

My 2011 language EPGPU is built on top of OpenCL, a parallel
programming language that currently runs on NVIDIA GPUs, ATI GPUs, AMD
multicore CPUs, and Intel multicore/SSE/AVX CPUs, all with quite good
performance. The big drawback of OpenCL, the many function calls
needed to get anything done, is hidden inside of EPGPU.

"GPU_FILLKERNEL"
makes a kernel (like a function) that can be run on the GPU (or other
parallel hardware). "i" is effectively a keyword, giving the
array index for your float. "result" is a keyword, storing the value of
your float.
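
From memory, defining that kernel looks roughly like this; treat the exact macro syntax as approximate and check the EPGPU paper for the real thing:

// Approximate EPGPU syntax (see the paper for the exact macro form):
// a FILL kernel returning float, named simple_floats, taking one float parameter.
GPU_FILLKERNEL(float, simple_floats, (float v),
	result = v + i;   // "i" is my array index; "result" is my output value
)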

Floats are stored in a "gpu_vector", which is basically my GPU version
of a std::vector (and similar to a thrust::device_vector). It's
easy to run a FILL kernel into an array:

arr=simple_floats(1000.23);

This
runs the simple_floats function on every array element, splitting into
threads, blocks, or SIMD units as needed by the current hardware.

For more complex functions, where functions and array elements aren't
in a one-to-one mapping, you use a non-FILL kernel. Now you need
to manually pass in the array address, and explicitly run however many
copies of the function you need.

FILL
and non-FILL kernels are enough to write quite a few simple
applications, as I explore in the technical paper on the EPGPU download
page here: http://www.cs.uaf.edu/sw/EPGPU/

Parallel Summation Example

One common problem is to total up the values in an array. Serially, the code is trivial:

total=0; for (int i=0;i<length;i++) total+=arr[i];

In
parallel, the problem is a lot harder--"total" is a sequential
bottleneck, since everybody has to add to it simultaneously. In
OpenMP, with a small number of threads, "reduction(+:total)" creates
private copies, and adds them up with atomic operations, but this
doesn't scale well to the many thousands of threads on a GPU, because
it takes time in O(p) for p processors.
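
For reference, the OpenMP version is just:

/* Each thread gets a private copy of "total"; the copies are combined at the end. */
total=0;
#pragma omp parallel for reduction(+:total)
for (int i=0;i<length;i++) total+=arr[i];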

Another algorithm that scales much better is a tree-based summation,
shown here for 16 elements. We begin by summing pairs of two
elements, then summing across groups of four elements, then eight, and
finally all sixteen.

/* Total up the n values in this array into arr[0].
   Does this recursively, using log(n) kernel invocations. */
void totalArray(gpu_array<float> &arr,int n)
{
	for (int step=2;step/2<n;step*=2) {
		totalFloats(arr,n,step).run((n+step-1)/step);
	}
}
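
Each call to totalFloats does one level of the tree; in plain CUDA the equivalent step would be a kernel like this sketch:

/* One tree level: each thread adds arr[i+step/2] into arr[i],
   for i = 0, step, 2*step, ...   (plain CUDA sketch of the same step) */
__global__ void totalStep(float *arr,int n,int step) {
	int i=(blockIdx.x*blockDim.x+threadIdx.x)*step;
	if (i+step/2<n) arr[i]+=arr[i+step/2];
}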

Note
that even with a small 8000-entry array, the sequential linear
summation is already much slower (O(n) versus O(lg n)) and less accurate
(due to roundoff) than the tree-based recursive summation. Both
effects just get worse as the array gets bigger.

You can do even better by putting the lower levels of the tree
summation into GPU "__local" memory (OpenCL's name for CUDA's "__shared__"
memory), which allows you to use local thread synchronization instead of
calling a second kernel, and by running the highest levels of the tree
on the CPU, where sequential summation is faster. But both of these
implementation adjustments just tweak the asymptotic constant--the big
speedup is using a good parallel algorithm!
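
For completeness, here's a plain-CUDA sketch of that lower-level trick: each 256-thread block totals its values in __shared__ memory with local synchronization, leaving one partial sum per block for a final (much smaller) pass or the CPU to finish.

__global__ void blockTotal(const float *src,float *partial,int n) {
	__shared__ float buf[256];                /* one slice of the array per block */
	int i=blockIdx.x*blockDim.x+threadIdx.x;
	buf[threadIdx.x]=(i<n)?src[i]:0.0f;
	__syncthreads();
	for (int half=blockDim.x/2;half>0;half/=2) {  /* tree summation inside the block */
		if (threadIdx.x<half)
			buf[threadIdx.x]+=buf[threadIdx.x+half];
		__syncthreads();
	}
	if (threadIdx.x==0) partial[blockIdx.x]=buf[0];   /* one partial sum per block */
}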