This is one of the heterogeneous programming scenarios supported by our new meta-par packages. A draft paper can be found here, which explains the mechanism for building parallel schedulers out of "mix-in" components. In this post, however, we will skip over that and take a look at CPU/GPU programming specifically.

Getting Started

If you have CUDA 4.1 installed, then just remove the "-f-cuda" (or remove the entire middle line). And then we import the following module:

importControl.Monad.Par.Meta.AccSMP

This provides a scheduler for (Accelerate GPU-EDSL) + (monad-par multicore CPU) scheduling: It also reexports the ParAccelerate type class which provides the ability to launch GPU computations from within a Par computation.

Next, we also import Accelerate itself to so that we can express Acc computations that can run on the GPU:

importData.Array.AccelerateimportData.Array.Accelerate.CUDAasCUDA

(By the way, this blog post is an executable literate Haskell file that can be found here.)
Now we are ready to create a trivial Accelerate computation:

This creates a ten element one-dimensional array, and then sums up the elements. We could run this directly using CUDA, which would print out Array (Z) [33.0]. That’s just Accelerate’s fancy way of saying "33.0"—a scalar is a zero-dimensional array. Here’s how we invoke Accelerate/CUDA:

runDirect =print (CUDA.run triv)

If we are inside a Par computation, on the other hand, we simply use runACC or runAccWith:

The former uses the default Accelerate implementation. The latter specifies which Accelerate implementation to use. After all, there should ultimately be several: OpenCL, CUDA, plus various CPU backends. (In the future, we plan to add the ability to change the default Accelerate backend either at the runPar site, or statically. Stay tuned for that. But for now just use runAccWith.)

The reader might protest that it is possible to use CUDA.run directly within a Par computation, so why go to the extra trouble? The advantage of using runAcc is that it informs the Par scheduler of what’s going on. The scheduler can therefore execute other work on the CPU core that would otherwise be waiting for the GPU. An application could achieve the same effect by creating a dedicated thread to talk to the GPU, but that wouldn’t jive well with a pure computation (forkIO), and it’s easier to let meta-par handle it anyway.

The second benefit of integrated scheduling is that the scheduler can automatically divide work between the CPU and GPU. Eventually, when there are full-featured, efficient CPU-backends for Accelerate, this will happen transparently. For now you need to use unsafeHybrid described in the next section. Finally, our soon-forthcoming CPU/GPU/Distributed schedulers can make more intelligent decisions if they know where all the calls to GPU computations occur.

Hybrid CPU/GPU workloads.

The meta-par and meta-par-accelerate packages, as currently released, include a generalized work-stealing infrastructure. The relevant point for our purposes here, is that the CPU and GPU can each steal work from one another. Work-stealing is by no means the most sophisticated CPU/GPU partitioning on the scene. Much literature has been written on the subject, and it can get quite sophisticated (for example, modeling memory transfer time). However, as on regular multicores, work-stealing provides an admirable combination of simplicity and efficacy. For example, if a given program runs much better on the CPU or the GPU, respectively, then that device will end up doing more of the work.

In the current release, we use unsafeHybridWith, documented here, to spawn a task with two separate implementations—one CPU and one GPU—leaving the scheduler to choose between them. Here’s a silly example:

The call to unsafeHybridWith is passed a task that consists of a separate CPU (return 33.0) and GPU (triv) component. We must guarantee that the two computations yield the same result (hence the "unsafe" moniker). The indexArray bit is a conversion function that converts the result from the GPU computation to a type that matches the result from the CPU alternative.

The user of this API still has to make some tough decisions. Setting the threshold determines the granularity at which work will be farmed out between the CPU and GPU. To achieve effective load balance, the problem must be subdivided into sufficiently small pieces to deal with the slower rate at which the CPU may complete work items. Further, in this particular example the GPU may not have enough memory to process the entire array, meaning that the problem must be subdivided at least until it fits in memory.

On the other hand, subdividing the problem further means more round trips through the GPU, more driver calls, synchronization. In this particular example the total volume of data transferred is the same (it’s linear), but for other algorithms that may not be the case.

Notes for Hackers

If you want to work with the github repositories, you need to have GHC 7.4 and the latest cabal-install (0.14.0). You can check everything out here:

git clone git://github.com/simonmar/monad-par.git --recursive

Try make mega-install-gpu if you already have CUDA installed on your machine. Explore the README’s here and here for more information.