This week Gary Frost from AMD unveiled an alpha release of Aparapi (A PARallel API), an API that allows programmers to write logic in Java to be executed on a GPU. GPUs are massively parallel hardware accelerators, originally installed in PCs to boost graphics rendering performance, that are now increasingly applied to compute-intensive tasks that have nothing to do with graphics.

We caught up with Mr. Frost over email to get the highlights from his talk and to understand what Aparapi does. Mr. Frost explained that Aparapi:

allows the Java developer to express his/her data-parallel algorithms using pure Java. Therefore, there’s no need to learn CUDA/OpenCL or to take on the challenge of JNI. Essentially, the Java developer extends a base class called ‘Kernel’ and overrides the run() method. At runtime, Aparapi will determine whether OpenCL is available. If so, it will attempt to convert your data-parallel workload to OpenCL. If that transformation is successful, Aparapi will execute the OpenCL on the GPU and will take care of all the data transfers for you. If for any reason we can’t convert to OpenCL (or OpenCL is not available), Aparapi will execute the original code using a thread pool. To be clear, we are not converting Java source to OpenCL; at runtime we can’t expect access to the source. Instead, we analyze the bytecode of the methods reachable from the run() method in the user's Kernel implementation, and we create OpenCL from that bytecode. In a way, we have a small library which behaves similarly to decompilers such as Jad, Jode or Mocha, *but* which creates OpenCL code from bytecode.
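For readers who have not seen this programming model, the sketch below imitates the pattern described in the quote: user code extends a base class, overrides run(), and asks the runtime which element index it should process. It relies on nothing beyond the JDK. The names MiniKernel, SquareKernel, getGlobalId() and execute(int) are hypothetical stand-ins rather than Aparapi's actual API, and only the thread-pool fallback path is modeled; the bytecode-to-OpenCL translation is not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a Kernel base class. NOT Aparapi's API: it only mimics
// "extend a base class, override run()" plus a thread-pool fallback, and it never
// attempts to translate bytecode to OpenCL.
abstract class MiniKernel {
    // Each pool thread records which global index it is currently processing.
    private final ThreadLocal<Integer> globalId = new ThreadLocal<>();

    // Index of the element the current invocation of run() should work on.
    protected int getGlobalId() {
        return globalId.get();
    }

    // The data-parallel body, invoked once per index in [0, range).
    public abstract void run();

    // Fallback path: spread the indices across a fixed-size thread pool.
    public void execute(int range) {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> parts = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int first = t;
            parts.add(pool.submit(() -> {
                // Strided split: thread t handles indices t, t + threads, t + 2 * threads, ...
                for (int id = first; id < range; id += threads) {
                    globalId.set(id);
                    run();
                }
            }));
        }
        try {
            // Waiting on every Future also makes the workers' array writes visible here.
            for (Future<?> part : parts) {
                part.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}

// User code: extend the base class and override run().
class SquareKernel extends MiniKernel {
    private final float[] in;
    private final float[] out;

    SquareKernel(float[] in, float[] out) {
        this.in = in;
        this.out = out;
    }

    @Override
    public void run() {
        int i = getGlobalId();
        out[i] = in[i] * in[i];
    }
}
```

Because each index is processed independently, the same run() body could in principle be dispatched either to a thread pool or to a GPU; as the quote explains, that choice is made at runtime rather than by the programmer.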

Mr. Frost went on to explain that not just any Java code can be executed by Aparapi; there are restrictions on the kind of Java that can be executed. According to the Aparapi README, these restrictions mean that developers are essentially limited to C-style code operating on arrays of primitives. No Objects are allowed, no try-catch-finally blocks are allowed, and no allocation or de-allocation of arrays (i.e. no “new”) is allowed. These restrictions map closely to the inherent capabilities of GPUs. Additionally, Aparapi will only make use of GPUs on Windows platforms; on other platforms it will fall back to using Java thread pools.
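The "arrays of primitives" restriction usually means restructuring data. As a minimal sketch, assuming a computation that would naturally be written over an array of point objects, the same work can be expressed over parallel float[] arrays, with no allocation or exception handling inside the per-index body. DistanceBody, runAt() and runAll() are hypothetical names, and the sequential runAll() loop merely stands in for the per-index dispatch a data-parallel runtime would perform; whether a given body actually converts to OpenCL is decided by Aparapi's bytecode analysis.

```java
// Object-oriented form that the README restrictions rule out inside a kernel,
// because it dereferences objects:
//   record Point(float x, float y) {}
//   dist2[i] = points[i].x() * points[i].x() + points[i].y() * points[i].y();

// Restriction-friendly form: the same data flattened into parallel primitive arrays
// ("structure of arrays"). Hypothetical class, not part of Aparapi.
class DistanceBody {
    private final float[] xs;    // x coordinate of point i
    private final float[] ys;    // y coordinate of point i
    private final float[] dist2; // output: squared distance of point i from the origin

    // All arrays are allocated by the caller, outside the data-parallel body.
    DistanceBody(float[] xs, float[] ys, float[] dist2) {
        this.xs = xs;
        this.ys = ys;
        this.dist2 = dist2;
    }

    // Work for a single index: primitive locals and array reads/writes only;
    // no objects, no "new", no try/catch/finally.
    void runAt(int i) {
        float x = xs[i];
        float y = ys[i];
        dist2[i] = x * x + y * y;
    }

    // Sequential stand-in for the per-index dispatch a data-parallel runtime would perform.
    void runAll() {
        for (int i = 0; i < dist2.length; i++) {
            runAt(i);
        }
    }
}
```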

Mr. Frost explained that his team has seen speedups of 20x "fairly easily" when using GPUs from Java. Alternatively, developers can write native OpenCL or CUDA code and call it through JNI bindings to see more "dramatic" speedups. OpenCL is a language standard for "general-purpose parallel programming of heterogeneous systems", which includes programs targeted at GPUs, while CUDA is a similar platform tied specifically to NVIDIA's GPUs. We asked about the experience of using JNI-based libraries and Mr. Frost responded:

The early Java pioneers leveraged JNI and implemented the ‘host programming’ code in C. By host programming I refer to the ‘housekeeping’ of offloading compute to the GPU, specifically compiling the actual workload (from OpenCL/CUDA) for some device and then coordinating the movement of data to the device, executing the code and then moving the data back. It turns out that this ‘host’ programming can be verbose and picky, and for Java developers the ‘joy of JNI’ can quickly wear off. Thankfully, we can avoid JNI altogether by taking advantage of some of the Java bindings for CUDA and OpenCL that have been made available. I have used JOpenCL and JOCL, both open source, and in both cases the developers of these bindings have done all the JNI work for you by essentially wrapping the low-level OpenCL/CUDA calls in Java classes.

We asked Mr. Frost what GPU sweet-spot applications he has seen in industry and he responded:

Looking at the OpenCL side, I have seen everything from financial services to energy exploration (I used to work in that field). I also know folks who are looking at GPUs for seismic data analysis. Applications in the big-data analytics domain also have the potential to see major benefits from using the GPU for computation.

Mr. Frost encouraged Java programmers wanting to learn more about programming on GPUs to visit AMD's OpenCL Zone as well as the home pages of JOpenCL and Marco Hutter's "excellent" JOCL.

I took this for a test drive today and ported a Black-Scholes algorithm to it. Fairly neat. A couple of questions right now:

1) Can a Mersenne twister be ported as a random number function inside the Kernel class? Without one, we would be unable to run Monte Carlo simulations, which would leave most financial application areas unable to use Aparapi.
2) Can something like this be created for .NET as well? That would mean really rapid adoption by .NET teams, shifting quickly from CUDA to AparApi.Net, with app dev times on the order of days.
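On the first question: the restrictions described earlier (primitive arrays only, no "new" inside the kernel) suggest keeping per-work-item generator state in a preallocated primitive array. The sketch below is an illustration under those assumptions, not a statement of what Aparapi supports, and it swaps in Marsaglia's xorshift32 generator for the Mersenne twister, since MT19937 needs 624 words of state per stream versus a single int here. MonteCarloBody and its methods are hypothetical; distinct seeds do not by themselves guarantee statistically independent streams, and the sequential loop in estimatePi() stands in for parallel dispatch.

```java
// Hypothetical per-work-item random number generation using only primitive arrays.
// xorshift32 (Marsaglia) replaces the Mersenne twister because its state is one int
// per stream; this is a swapped-in technique, not MT19937.
class MonteCarloBody {
    private final int[] state; // one non-zero xorshift32 state word per work item
    private final int[] hits;  // per-item count of samples inside the unit quarter circle
    private final int samples; // samples drawn by each work item

    MonteCarloBody(int items, int samples, int seed) {
        this.state = new int[items];
        this.hits = new int[items];
        this.samples = samples;
        // Seeding happens outside the data-parallel body, so the body never allocates.
        for (int i = 0; i < items; i++) {
            int s = seed + 0x9E3779B9 * (i + 1);
            state[i] = (s == 0) ? 1 : s; // xorshift32 must never hold a zero state
        }
    }

    // Advance work item i's generator and return a float uniformly distributed in [0, 1).
    private float nextFloat(int i) {
        int x = state[i];
        x ^= x << 13;
        x ^= x >>> 17; // logical shift, matching the unsigned 32-bit original
        x ^= x << 5;
        state[i] = x;
        return (x >>> 8) * (1.0f / (1 << 24)); // top 24 bits scaled into [0, 1)
    }

    // Work for one index: draw points in the unit square and count those inside the circle.
    void runAt(int i) {
        int count = 0;
        for (int n = 0; n < samples; n++) {
            float u = nextFloat(i);
            float v = nextFloat(i);
            if (u * u + v * v <= 1.0f) {
                count++;
            }
        }
        hits[i] = count;
    }

    // Sequential stand-in for parallel dispatch, followed by a reduction that estimates pi.
    double estimatePi() {
        long total = 0;
        for (int i = 0; i < hits.length; i++) {
            runAt(i);
            total += hits[i];
        }
        return 4.0 * total / ((long) hits.length * samples);
    }
}
```

Each work item touches only its own state[i] and hits[i] slots, so the items can run in any order or in parallel; the final summation is a separate reduction step.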

The future is parallel programming; we can't have tools that only uber-geeks/Jedi-Master programmers can use. Anything that brings parallel programming to the not-so-uber programmer can only be a good thing.