Since OpenCL 2.0, the OpenCL C device programming language includes a
set of work-group parallel reduction and scan built-in functions. These
functions allow developers to execute local reductions and scans for
the most common operations (addition, minimum and maximum), and allow
vendors to implement them very efficiently using hardware intrinsics
that are not normally exposed in OpenCL C.
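
As a concrete example, a partial-sums kernel using these built-ins could
look like the following minimal OpenCL C 2.0 sketch (kernel and argument
names are illustrative):

    /* Every work-group reduces its chunk of `in` with the built-in
     * work_group_reduce_add and writes a single partial sum. */
    kernel void partial_sums(global const float *in, global float *partial)
    {
        /* all work-items in the work-group must reach this call */
        float wg_sum = work_group_reduce_add(in[get_global_id(0)]);

        /* one result per work-group */
        if (get_local_id(0) == 0)
            partial[get_group_id(0)] = wg_sum;
    }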

In this article I will argue that exposing such high-level functions, but
not the lower-level intrinsics on which their efficient implementation
relies, results in lower flexibility and less efficient OpenCL programs,
and is ultimately detrimental to the quality of the standard itself.

While my arguments will focus specifically on the parallel reductions
and scans offered by OpenCL C since OpenCL 2.0, the fundamental idea
applies in a much more general context: it is more important for a
language or library to provide the building blocks from which certain
high-level features can be built than to expose the high-level features
themselves while hiding the underlying building blocks.

For example, the same kind of argument applies to a language or library
that aims at providing support for Interval Analysis (IA). A fundamental
computational requirement for proper IA support is directed rounding:
exposing directed rounding alone would be enough to allow efficient
(custom) implementations of IA, and would also enable other numerical
feats (as discussed here); conversely, while it is possible to provide
support for IA without exposing the underlying directed-rounding
features, doing so results in an inefficient, inflexible standard¹.
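
For concreteness, here is a minimal host-side C99 sketch of interval
addition built directly on directed rounding; it assumes the compiler
honors the FENV_ACCESS pragma, and omits the refinements a real IA
library would need:

    #include <fenv.h>
    #pragma STDC FENV_ACCESS ON

    typedef struct { double lo, hi; } interval;

    /* [a.lo, a.hi] + [b.lo, b.hi]: the lower bound is rounded toward
     * -infinity and the upper bound toward +infinity, so the result
     * encloses every exact sum of values taken from the two inputs. */
    interval interval_add(interval a, interval b)
    {
        interval r;
        fesetround(FE_DOWNWARD);
        r.lo = a.lo + b.lo;
        fesetround(FE_UPWARD);
        r.hi = a.hi + b.hi;
        fesetround(FE_TONEAREST); /* restore the default rounding mode */
        return r;
    }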

The case against high-level reduction operations

To clarify, I'm not actually against the presence of high-level
reduction and scan functions in OpenCL. They are definitely a very
practical and useful set of functions, with the potential for very
efficient vendor implementations; in fact, more efficient than any
programmer could achieve, not just because they can be tuned (by the
vendor) for the specific hardware, but also because they can be
implemented using hardware capabilities that are exposed neither in the
standard nor via extensions.

The problem is that the set of available functions is very limited (and
must necessarily be so), and as soon as a developer needs a reduction or
scan that is even slightly different from the ones offered by the
language (say, a compensated sum, or a reduction that also tracks the
index of the extremum), it becomes impossible to implement it with the
same efficiency as the built-in ones, simply because the underlying
hardware capabilities (necessary for the optimal implementation) are not
available to the developer.

Thrust and Kahan summation

Interestingly enough, I've hit a similar issue while working on a
different code base, which makes use of CUDA rather than OpenCL, and
for which we rely on the thrust library for the most common reduction
operations.

The thrust library is a C++ template library that provides efficient
CUDA implementations of a variety of common parallel programming
paradigms, and is flexible enough to allow such paradigms to make use of
user-defined operators, allowing for example reductions and scans with
operators other than summation, minimum and maximum. Despite this
flexibility, however, even the thrust library cannot (easily) move
beyond stateless reduction operators: one cannot, for example, trivially
implement a parallel reduction with Kahan summation using only the
high-level features offered by thrust.
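
To see why, it helps to recall the serial form of Kahan summation (a
plain C sketch): the compensation term c is state carried from one
addition to the next, which is exactly what a stateless binary reduction
operator cannot express.

    #include <stddef.h>

    /* Serial Kahan (compensated) summation: `c` accumulates the
     * low-order bits lost at each addition and is carried across
     * iterations. */
    float kahan_sum(const float *x, size_t n)
    {
        float sum = 0.0f, c = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            float y = x[i] - c;   /* re-inject the previous error */
            float t = sum + y;
            c = (t - sum) - y;    /* error introduced by this addition */
            sum = t;
        }
        return sum;
    }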

Of course, this is not a problem per se, since ultimately thrust
just compiles to plain CUDA code, and it is possible to write such code
by hand, thus achieving a Kahan summation parallel reduction, as
efficiently as the developer's prowess allows. (And since CUDA exposes
most if not all hardware intrinsics, such a hand-made implementation
can in fact be as efficient as possible on any given CUDA-capable
hardware.)

Local parallel reductions in OpenCL 2.0

The situation in OpenCL is sadly much worse, and not so much due to the
lack of a high-level library such as thrust (to which end one may
consider the Bolt library instead), but because the language itself is
missing the fundamental building blocks needed to produce the most
efficient reductions: while it does offer built-ins for the most common
operations, anything beyond that must be implemented by hand, and cannot
be implemented as efficiently as the hardware allows.
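
As a reference point, the following sketch shows the kind of hand-coded
work-group reduction developers have to fall back to: it computes, for
each work-group, the maximum value together with its index, an operation
no built-in covers. It assumes a power-of-two work-group size; kernel
and argument names are illustrative.

    kernel void argmax_per_group(global const float *in,
                                 global float *out_val,
                                 global uint  *out_idx,
                                 local  float *sval,  /* one per work-item */
                                 local  uint  *sidx)  /* one per work-item */
    {
        const uint lid = get_local_id(0);

        sval[lid] = in[get_global_id(0)];
        sidx[lid] = (uint)get_global_id(0);
        barrier(CLK_LOCAL_MEM_FENCE);

        /* classic tree reduction: halve the number of active work-items
         * at each step, exchanging data through local memory */
        for (uint stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
            if (lid < stride && sval[lid + stride] > sval[lid]) {
                sval[lid] = sval[lid + stride];
                sidx[lid] = sidx[lid + stride];
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        if (lid == 0) {
            out_val[get_group_id(0)] = sval[0];
            out_idx[get_group_id(0)] = sidx[0];
        }
    }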

One could be led to think that (at least for something like my specific
use case) it would be “sufficient” to provide more built-ins for a wider
range of reduction operations, but such an approach would completely
miss the point: there will always be variations of reductions that are
not provided by the language, and such variations will always be
inefficient.

Implementor laziness

There is also another point to consider, and it has to do with the sad
state of the OpenCL ecosystem. Developers who want to use OpenCL for
their software, be it in academia, gaming, medicine or any other industry,
must face the reality of the quality of existing OpenCL implementations.
And while for custom solutions one can focus on a specific vendor, and
in fact choose the one with the best implementations, software vendors
have to deal with the idiosyncrasies of all OpenCL implementations,
and the best they can expect is for their customers to be up to date
with the latest drivers.

What this implies in this context is that developers cannot, in fact,
rely on high-level functions being implemented efficiently, nor can
they sit idle waiting for the vendors to provide more efficient
implementations: more often than not, developers will find themselves
working around the limitations of this or that implementation,
rewriting code that should amount to one-liners in order to provide
custom, faster implementations.

This is already the case for some functions, such as the asynchronous
work-group memory copies (from/to global/local memory), which are
dramatically inefficient in some vendor implementations, so that
developers are more likely to write their own loading functions instead;
these generally end up being just as efficient as the built-ins on the
platforms where the built-ins are properly implemented, and much faster
on the lazy platforms.
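
In OpenCL C terms, the contrast is between the built-in asynchronous
copy and a manual loading loop, as in the following minimal sketch (TILE
and the buffer names are illustrative; real code would use one approach
or the other, not both):

    #define TILE 256

    kernel void stage_tile(global const float *src, local float *tile)
    {
        const size_t base = get_group_id(0) * TILE;

        /* the built-in way */
        event_t ev = async_work_group_copy(tile, src + base, TILE, 0);
        wait_group_events(1, &ev);

        /* the hand-written loop many developers end up writing instead */
        for (size_t i = get_local_id(0); i < TILE; i += get_local_size(0))
            tile[i] = src[base + i];
        barrier(CLK_LOCAL_MEM_FENCE);

        /* ... work on tile[] ... */
    }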

Therefore, can we actually expect vendors to implement the work-group
reduction and scan operations as efficiently as their hardware allows? I
doubt it. However, while for the memory copies an efficient workaround
was available in the form of simple loads, no such workaround is
possible in OpenCL 2.0, since the building blocks of the efficient
work-group reductions are missing.

Warp shuffles: the work-group reduction building block

Before version 2.0 of the standard, OpenCL offered only one way for
work-items within a work-group to exchange information: local memory.
The feature reflected the capability of GPUs when the standard was first
proposed, and could be trivially emulated on other hardware by making
use of global memory (generally resulting in a performance hit).

With version 2.0, OpenCL exposes a new set of functions that allow data
exchange between work-items in a work-group, which doesn't
(necessarily) depend on local memory: such functions are the work-group
vote functions, and the work-group reduction and scan functions. These
functions can be implemented via local memory, but most modern hardware
can implement them using lower-level intrinsics that do not depend on
local memory at all, or only depend on local memory in smaller amounts
than would be needed by a hand-coded implementation.

On GPUs, work-groups are executed in smaller groups of work-items
called warps (in NVIDIA parlance) or wave-fronts (in AMD parlance), and
most modern GPUs can in fact exchange data between work-items in the
same warp using specific shuffle intrinsics (which have nothing to do
with the OpenCL C shuffle function): these intrinsics allow work-items
to access the private registers of other work-items in the same warp.
While different warps in the same work-group still have to communicate
through local memory, a simple reduction algorithm can thus be
implemented using warp shuffle instructions while requiring only one
word of local memory per warp, rather than one per work-item, which can
lead to better hardware utilization (e.g. by allowing more work-groups
per compute unit thanks to the reduced use of local memory).
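
To fix ideas, this is roughly what such a reduction would look like if
OpenCL C exposed a warp shuffle built-in. The shuffle_down() used below
is hypothetical (it mirrors the intrinsics that the hardware exposes,
e.g. CUDA's __shfl_down), and WARP stands for the warp/wave-front size,
assumed here to be a power of two:

    #define WARP 32   /* warp/wave-front size: an assumption of this sketch */

    /* Work-group sum built on a *hypothetical* shuffle_down() built-in:
     * register-to-register exchange inside each warp, and only one word
     * of local memory per warp (instead of one per work-item). */
    float wg_sum_via_shuffles(float v, local float *per_warp)
    {
        const uint lane   = get_local_id(0) % WARP; /* position within the warp */
        const uint warp   = get_local_id(0) / WARP; /* warp index in the group  */
        const uint nwarps = get_local_size(0) / WARP;

        /* intra-warp reduction, no local memory involved */
        for (uint delta = WARP / 2; delta > 0; delta /= 2)
            v += shuffle_down(v, delta);            /* hypothetical built-in */

        if (lane == 0)
            per_warp[warp] = v;                     /* one word per warp */
        barrier(CLK_LOCAL_MEM_FENCE);

        /* the first warp combines the per-warp partial sums */
        if (warp == 0) {
            v = (lane < nwarps) ? per_warp[lane] : 0.0f;
            for (uint delta = WARP / 2; delta > 0; delta /= 2)
                v += shuffle_down(v, delta);
        }
        return v;   /* the full sum is valid in work-item 0 */
    }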

Warp shuffle instructions are available on NVIDIA GPUs with compute
capability 3.0 or higher, as well as on AMD GPUs since Graphics Core
Next. Additionally, vectorizing CPU platforms such as Intel's can
trivially implement them in the form of vector component swizzling.
Finally, all other hardware can still emulate them via local memory
(which in turn might be inefficiently emulated via global memory, but
still): and as inefficient as such an emulation might be, it would
scarcely be worse than hand-coded use of local memory (which would
remain a fall-back option available to developers).

In practice, this means that all OpenCL hardware can support work-group
shuffle instructions (some more efficiently than others), and parallel
reductions of any kind could then be implemented through work-group
shuffles, achieving much better performance than standard local-memory
reductions wherever shuffles are natively supported, while being no less
efficient than local-memory reductions wherever shuffles have to be
emulated.

Conclusions

Finally, it should be obvious by now that the choice of exposing
work-group reduction and scan functions, but not work-group shuffle
functions, in OpenCL 2.0 results in a crippled standard:

- it does not represent the actual capabilities of current massively
parallel computational hardware, let alone the hardware we may expect
in the future;

- to top it all, we can scarcely expect such high-level functions to be
implemented efficiently, making them effectively useless.

The obvious solution would be to provide work-group shuffle instructions
at the language level. This could in fact be a core feature, since it
can be supported on all hardware, just like local memory, and the device
could be queried to determine whether the instructions are supported in
hardware or emulated (pretty much like devices can be queried to
determine whether local memory is physical or emulated).

Optionally, it would be nice to have some introspection to allow the
developer to programmatically find the warp size (i.e. the work-item
concurrency granularity) used for the kernel², and potentially improve
on the use of the instructions by limiting the strides used in the
shuffles.
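
For reference, the introspection that is already available today can be
queried from the host with plain OpenCL API calls, as in this sketch
(error checking omitted):

    #include <CL/cl.h>

    void query_granularity_hints(cl_device_id dev, cl_kernel krn)
    {
        /* is local memory physical (CL_LOCAL) or emulated in global
         * memory (CL_GLOBAL)? */
        cl_device_local_mem_type lmem;
        clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_TYPE,
                        sizeof(lmem), &lmem, NULL);

        /* on GPUs this already matches the warp/wave-front size */
        size_t wg_multiple;
        clGetKernelWorkGroupInfo(krn, dev,
                CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                sizeof(wg_multiple), &wg_multiple, NULL);
    }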

¹ since IA intrinsically depends on directed rounding, even if support
for IA were provided without explicitly exposing directed rounding, it
would in fact still be possible to emulate directed rounding of scalar
operations by operating on interval types and then discarding the
unneeded parts of the computation; of course, this would be dramatically
inefficient.

² in practice, the existing
CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE kernel property, which can
be programmatically queried, already corresponds to the warp/wave-front
size on GPUs, so there might be no need for another property if it could
be guaranteed that this is the work-item dispatch granularity.