Efficiently Evaluating Mathematical Expressions with OpenCL Code

By Peter Eastman, PhD

OpenCL is a cross-platform language for doing general purpose computation on graphics processing units (GPUs) and other massively parallel architectures. One of its most interesting features is the fact that the compiler is built into the runtime. This means that while a program is running, it can programmatically generate the source code for new computational kernels, compile them, and execute them at full speed on the GPU.

Because of this feature, OpenCL provides a unique opportunity to build both flexibility and high performance into a piece of software. We use this ability in OpenMM, a library for running molecular simulations on high performance architectures, including those that support OpenCL. Users can specify arbitrary mathematical expressions for calculating the interaction energy between particles in their simulation and have them transformed on the fly into OpenCL kernels that can be run on a GPU. To get optimal performance, this transformation must be done carefully.

Consider the following expression which describes a Lennard-Jones nonbonded interaction between two particles:

where r is the distance between the two particles and ε and σ are parameters of the force field describing the interaction. The simplest approach to transforming this expression into source code is a one-to-one translation of mathematical operations to OpenCL instructions:

energy = 4*epsilon*(pow(sigma/r, 12)-pow(sigma/r, 6));

This is clearly an inefficient way to perform the computation. The first thing we notice is that the ratiosigma/r is being calculated twice. We should identify common subexpressions, compute them only once, and assign them to temporary variables so they can be reused. In fact, the easiest way of doing this is to create a new temporary variable for every subexpression. For each piece of the expression we translate, we first check whether an identical one has already been processed, and if so, simply use the existing temporary variable. This produces the following OpenCL source code:

This may look much wordier and harder to understand, but that isn’t important. We’re generating it to be read by a compiler, not a human!

The next problem we notice is the use of the pow() function, which is a slow way to calculate small integer

powers. Building up the power through repeated multiplication is much faster. The trick is to decompose the power into a sum of powers of 2, such as 12=8+4, then compute the powers of 2 by repeatedly squaring a multiplier:

We are using only four multiplications to calculate a 12th power, which is much faster than the pow() function. Similarly, we can calculate the 6th power with three multiplications. But we can do even better by combining both of them into a single evaluation:

We are now calculating both powers at once with only five multiplications!

The final important optimization is to translate all expressions at once as a single unit. The above example shows only the expression for the energy, but in OpenMM we need to calculate the derivative of the energy as well. The two expressions share many subexpressions. For example, the derivative includes (σ/ε )11 and (σ/ε)5, so by translating both expressions together, we can compute four different powers at the same time.

In practice, we find these techniques work extraordinarily well for generating optimized OpenCL code to evaluate mathematical expressions. Our preliminary benchmarks with OpenMM show that the automatically generated GPU kernels are only a few percent slower than hand-tuned versions. At the same time, the user gains enormous flexibility to select the precise interactions they want in their simulations.