Introducing CLUtil

Update: the CLUtil links in this post point to a tagged version of the library compatible with the code presented here.

OpenCL is a cross-platform parallel programming standard with support for execution on both CPUs and GPUs. The OpenCL package on hackage provides a direct binding to the API with just enough Haskellosity to make invoking those API functions borderline pleasant. That said, a certain amount of boilerplate remains that is rather off-putting.

For context, we are trying to write program fragments that may be executed in parallel to improve efficiency. Multi-core CPUs are good at this, and GPUs can be even better. We can bend GLSL to our needs (see my modern OpenGL with GLSL in Haskell tutorial), but are then left to contend with various graphics and GUI considerations that are completely irrelevant to our goals. OpenCL represents a much more familiar-looking programming environment (a kind of C with support for small vectors), with less setup overhead than a GPGPU program using GLSL on the backend.

So let’s get started: How can we use OpenCL from a Haskell program?

First we’ll write a simple OpenCL kernel that adds two vectors, a and b, and stores the element-wise square of the result in c. To make it slightly interesting we’ll operate on four floating point numbers at a time.
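The original kernel listing did not survive in this copy of the post; a sketch consistent with the description (the kernel name vecAdd and the file name VecEZ.cl are my assumptions) might look like:

```c
// VecEZ.cl (file name assumed)
__kernel void vecAdd(__global float4 *a,
                     __global float4 *b,
                     __global float4 *c) {
  int i = get_global_id(0);
  float4 s = a[i] + b[i];   // add the two input vectors...
  c[i] = s * s;             // ...and store the element-wise square
}
```

Note that float4 arithmetic is component-wise, so each work item handles four floats at once.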

To recap: we have an OpenCL kernel that takes two read-only input vectors and a write-only output vector. This sort of configuration fits rather well into a functional setting, so it is where we will focus our efforts as we endeavour to make the common things easy. Now we’re ready for some Haskell:
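The import block from the original post is missing here; the examples below presumably began with something like the following (the module name is my assumption about CLUtil's layout):

```haskell
import Control.Parallel.CLUtil          -- CLUtil's main module (name assumed)
import qualified Data.Vector.Storable as V
```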

I’ve imported my CLUtil library that provides some simple helpers for common Haskell-OpenCL operations. Now let’s write a program that initializes OpenCL, loads and compiles an OpenCL kernel stored in a separate file, and runs that kernel on some data.
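The program itself is missing from this copy; a self-contained sketch matching the description (module and file names are assumptions) could be:

```haskell
import Control.Parallel.CLUtil          -- CLUtil's main module (name assumed)
import qualified Data.Vector.Storable as V

test1 :: IO ()
test1 = do s <- ezInit CL_DEVICE_TYPE_ALL            -- initialize a device, context, and queue
           k <- kernelFromFile s "VecEZ.cl" "vecAdd" -- build the named kernel from source
           let v1 = V.fromList [1,2,3,4 :: Float]
               v2 = V.fromList [5,6,7,8 :: Float]
           -- two inputs, a request for a 4-element output, one work item
           v3 <- runKernel s k v1 v2 (Out 4) (Work1D 1)
                 :: IO (V.Vector Float)
           print v3
```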

And that’s it! The ezInit function initializes an OpenCL device and sets up a context and queue, all of which are returned in a record. The kernelFromFile function reads OpenCL source code from a file, then builds the named kernel using the previously initialized OpenCL state record. Finally, we produce some data in Haskell and call runKernel.

The runKernel action is variadic in an attempt to meet most of your kernel-running needs. Its arguments are an OpenCL state record, a kernel to run, the parameters to be passed to the OpenCL kernel, and the number of global work items (i.e. how many times to invoke the kernel). Recall that our kernel has three parameters and operates on four elements at once. So we pass the kernel the two input vectors, a note requesting an output vector of length four, and a specification that our kernel should be run exactly once. We bind the output to another vector, and provide a type annotation on the next line to let runKernel figure out how much memory to allocate for the output vector. This is potentially confusing; hopefully more examples will clarify usage.

Next, let’s consider another kernel that demonstrates a bit more variety in the types.
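The second kernel is also missing from this copy; a sketch matching the description below (the kernel name funnyBusiness is an assumption, as is keeping it in the same VecEZ.cl file) might be:

```c
// VecEZ.cl, continued (names assumed); doubles require this extension
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void funnyBusiness(__global double4 *a,
                            __global double4 *b,
                            __global double4 *c,
                            __global int *n) {
  int i = get_global_id(0);
  double4 s = a[i] + b[i];
  c[i] = s * s;             // element-wise square of the sum, as before
  // horizontal sum of the result vector, cast to int
  n[i] = (int)(c[i].x + c[i].y + c[i].z + c[i].w);
}
```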

This kernel is similar to the first, but uses the double scalar type, and also takes an int pointer to which it writes the horizontal sum of the result vector, cast to an int. The Haskell side looks very similar to last time,
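a sketch of which (again with assumed module, file, and kernel names) could be:

```haskell
import Control.Parallel.CLUtil          -- module name assumed
import qualified Data.Vector.Storable as V
import Foreign.C.Types (CInt)

test2 :: IO ()
test2 = do s <- ezInit CL_DEVICE_TYPE_ALL
           k <- kernelFromFile s "VecEZ.cl" "funnyBusiness"
           let v1 = V.fromList [1 .. 12 :: Double]
               v2 = V.fromList [13 .. 24 :: Double]
           -- 12 doubles out, one int per work item, three work items
           (v3, v4) <- runKernel s k v1 v2 (Out 12) (Out 3) (Work1D 3)
                       :: IO (V.Vector Double, V.Vector CInt)
           print v3
           print v4
```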

v3 and v4 have different types. The runKernel action always returns Vectors, be they individual as in test1, or tupled as here. The CLUtil library currently supports up to three output vectors, but this is an arbitrary limit that will be raised as the library matures.

Once again we operate on four input elements at a time. Given that our source vectors have twelve elements each, we must run three work items.

The type annotation on v4 is CInt rather than Int. This is important as the size of Haskell’s Int may be 32 or 64 bits, while OpenCL’s int is always 32 bits.

One last bit of mystery that remains is the Work1D specification. OpenCL supports up to three dimensions of work items, which can ease the handling of multi-dimensional data. If we were adding matrices, for example, we might have their elements stored in Vectors, but those elements are intuitively addressed by a pair of coordinates. A naive OpenCL kernel for adding matrices might go something like this,
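(the kernel itself is missing from this copy; a sketch, with the file and kernel names matAdd and MatAdd.cl being my assumptions, follows)

```c
// MatAdd.cl (file name assumed); floats only
__kernel void matAdd(__global float *a,
                     __global float *b,
                     __global float *c) {
  int w = get_global_size(0);                    // matrix width
  int i = get_global_id(1) * w + get_global_id(0); // row-major flat index
  c[i] = a[i] + b[i];
}
```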

Note that here we explicitly request a GPU to run our program. The GPU in my laptop supports floats, but not doubles, so I put this kernel in a separate file from the others.
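The host-side call is also missing here; a sketch of it, assuming CLUtil offers a Work2D analogue of Work1D (and with the file and kernel names assumed as before), might be:

```haskell
import Control.Parallel.CLUtil          -- module name assumed
import qualified Data.Vector.Storable as V

testMat :: IO ()
testMat = do s <- ezInit CL_DEVICE_TYPE_GPU      -- explicitly request a GPU
             k <- kernelFromFile s "MatAdd.cl" "matAdd"
             let a = V.fromList [1 .. 6  :: Float]  -- a 3x2 matrix, row-major
                 b = V.fromList [7 .. 12 :: Float]
             -- one work item per matrix element, addressed in 2D
             c <- runKernel s k a b (Out 6) (Work2D 3 2)
                  :: IO (V.Vector Float)
             print c
```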

Hopefully usage of CLUtil is becoming clear. So let’s crank things up and take advantage of the Haskell ecosystem.

A blog post describing quasicrystal figures serves as an interesting test of OpenCL performance. As a first pass, we can port the pixel computations from that post to OpenCL without thinking about optimization…
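That port is not reproduced in this copy of the post. As a rough illustration only (every name here is hypothetical, and the wave arithmetic follows the quasicrystal post's sum-of-plane-waves recipe rather than any particular CLUtil example), the pixel kernel might be sketched as:

```c
// Quasicrystal.cl (a hypothetical sketch of the pixel computation)
__kernel void quasicrystal(float scale, float phase,
                           __global float *pixels) {
  int x = get_global_id(0);
  int y = get_global_id(1);
  int w = get_global_size(0);
  float u = scale * ((float)x / w - 0.5f);
  float v = scale * ((float)y / get_global_size(1) - 0.5f);
  float acc = 0.0f;
  for (int k = 0; k < 7; ++k) {          // seven plane waves at evenly spaced angles
    float t = k * M_PI_F / 7.0f;
    // each wave is a cosine ramped into [0,1]
    acc += (cos(cos(t) * u + sin(t) * v + phase) + 1.0f) * 0.5f;
  }
  float s = fmod(acc, 2.0f);             // wrap the summed waves back into [0,1]
  pixels[y * w + x] = s <= 1.0f ? s : 2.0f - s;
}
```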

On my laptop, running on the CPU consumes 300% of a dual-core i5 (with hyperthreading), while running on the GPU consumes 85% CPU. While this example is somewhat more complicated, the ability to switch between CPU and GPU computation driving a visualization, all from an interactive interpreter, is highly compelling. The animation runs at a reasonable clip on a dual-core CPU, so try it out yourself! It should look something like this,

P.S. When building executables on your system, you may need to supply GHC with an -lOpenCL option. This is not necessary on Mac OS X, where we use the installed OpenCL framework, but is likely needed elsewhere.
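For example, on Linux the invocation might look like this (the source file name is hypothetical):

```shell
# link against the system OpenCL library when building
ghc --make -lOpenCL Quasicrystal.hs
```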

Visit the gloss homepage to get the latest version via darcs. You may need to manually relax version constraints on OpenGL and GLUT to install gloss with the latest versions of those packages. Note also that on Mac OS X, installing gloss with the GLFW-b backend (cabal install gloss --flags="-GLUT GLFW") allows you to start and stop the visualization from an interactive GHCi session (see the FAQ on the gloss page for further instructions).