Part 9: OpenCL Extensions and Device Fission

This article discusses OpenCL extensions that provide programmers with additional capabilities such as double-precision arithmetic and Device Fission.

Part 8, the previous article in this series on portable
parallelism with OpenCL™, demonstrated how to incorporate OpenCL into
heterogeneous workflows via a general-purpose "click-together tools" framework that
can stream arbitrary messages (vectors, arrays, and complex nested structures)
within a single workstation, across a network of machines, or within a cloud
computing framework. The ability to create scalable workflows is important
because data handling and transformation can be as complex and time consuming as
the computation that generates the desired result.

This article discusses OpenCL extensions that provide programmers
with additional capabilities such as double-precision arithmetic and Device Fission.
(Device Fission provides an interface to subdivide a single OpenCL device into
multiple devices, each with its own asynchronous command queue.)

OpenCL extensions can be defined by a vendor, by a subset of
the OpenCL working group, or by the entire OpenCL working group. The most
portable are the KHR extensions, which are formally approved by the entire
OpenCL working group; vendor extensions are the least portable and are likely
to be tied to a particular device or product line. Regardless of who provides
the definition, there is no guarantee that an extension will be available on
any given platform.

Following are the three types of OpenCL extensions and
naming conventions:

KHR extension: A KHR extension is formally ratified by the OpenCL
working group and comes with a set of conformance tests to help ensure
consistent behavior. KHR extensions are provided to support capabilities
available on some but not all OpenCL devices. The Microsoft DirectX sharing extension
is one example of an important capability that is only available on platforms
that support Microsoft Windows. A KHR extension has a unique name of the form cl_khr_<name>.

EXT extension: An EXT extension is developed by one or more OpenCL
working group members. No conformance tests are required. It is
reasonable to think of these as "work in progress" extensions used to assess usability,
value, and portability prior to formal approval as a KHR extension. An EXT
extension has a unique name of the form cl_ext_<name>.

Vendor extension: These extensions are provided by a vendor to
expose features specific to a vendor device or product line. Vendor extensions
should be considered highly non-portable. The AMD device attribute query is
one example that provides additional information about AMD devices. Vendor
extensions are assigned a unique name of the form cl_<vendor>_<name>.
Thus an AMD extension would have a name string cl_amd_<name>.

The #pragma OPENCL EXTENSION directive controls the
behavior of the OpenCL compiler to allow or disallow extensions. For example,
part
4 of this series enabled double-precision computation on AMD devices with
the following line. (Note: as of the AMD SDK 2.6 release, cl_amd_fp64
can be replaced with the ratified cl_khr_fp64.)

#pragma OPENCL EXTENSION cl_amd_fp64 : enable

Example 1: Pragma to enable double-precision from part 4 of this series

The syntax of the extension pragma is:

#pragma OPENCL EXTENSION <extension_name> : <behavior>

Example 2: Form of an OpenCL extension pragma

The <behavior>
token can be one of the following:

enable: the extension is enabled if it is supported; an error is
reported if the specified extension is not supported or if the extension name
"all" is used.

disable: the OpenCL implementation/compiler behaves as if the
specified extension does not exist; a warning is issued if the specified
extension is not supported.

all: used in place of an extension name, and only with the disable
behavior, in which case only core OpenCL functionality is used and supported
and all extensions are ignored.
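In kernel code, availability can also be tested at compile time because the OpenCL C compiler defines a preprocessor macro with the same name as each supported extension. The following fragment is a sketch (the kernel name and typedef are illustrative) that falls back to single precision when cl_khr_fp64 is unavailable:

```
#ifdef cl_khr_fp64
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
typedef double real_t;   /* double precision available */
#else
typedef float real_t;    /* fall back to single precision */
#endif

__kernel void scale(__global real_t* v, real_t a)
{
    size_t i = get_global_id(0);
    v[i] *= a;
}
```

This pattern lets one kernel source file build on both double-capable and single-precision-only devices without host-side string editing.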

By default, the compiler requires that all extensions be
explicitly enabled, as if it had been provided with the following pragma:

#pragma OPENCL EXTENSION all : disable

The OpenCL 1.1 specification defines a number of KHR extensions, including:

cl_khr_local_int32_base_atomics: basic atomic operations
on 32-bit integers in local memory.

cl_khr_local_int32_extended_atomics: extended atomic
operations on 32-bit integers in local memory.

cl_khr_int64_base_atomics: basic atomic operations on
64-bit integers in both global and local memory.

cl_khr_int64_extended_atomics: extended atomic operations
on 64-bit integers in both global and local memory.

cl_khr_3d_image_writes: supports kernel writes to 3D
images.

cl_khr_byte_addressable_store: eliminates the restriction
against writes to pointers (or array elements) of types less than 32 bits wide
in a kernel program.

cl_khr_gl_sharing: allows association of an OpenGL context or
share group with a CL context for interoperability.

cl_khr_icd: the OpenCL Installable Client Driver (ICD),
which lets developers select from multiple OpenCL runtimes that may be
installed on a system. (This extension is automatically enabled as of SDK v2 for
AMD Accelerated Parallel Processing.)

cl_khr_d3d10_sharing: allows association of a D3D10 context
or share group with a CL context for interoperability.

Device Fission

By default, each OpenCL kernel attempts to use all the computing
resources on a device according to a data-parallel computing model. In
other words, the same kernel is used to process data on all the computational
resources in a device. In contrast, a task-parallel model uses the
available computing resources to run one or more independent kernels on the
same device. Both task- and data-parallelism are valid ways to structure code
to accelerate application performance, given that some problems are better
solved with task parallelism while others are more amenable to data
parallelism. In general, task parallelism is more complicated to implement
efficiently from both a software and hardware perspective. The default OpenCL
behavior of using all available computing resources according to a
data-parallel model is a good one because it provides the greatest speedup
for individual kernels.

On AMD platforms, there are two methods to limit the number
of cores utilized when running a kernel on a multi-core processor.

The first method relies on the AMD OpenCL runtime checking the environment
variable CPU_MAX_COMPUTE_UNITS.
If it is defined, the AMD runtime limits the number of processor cores used by an
OpenCL application to the number specified by this variable. Simply set this
environment variable to a number from 1 to the total number of processor cores
in the system. Note: this variable does not affect other devices such as GPUs, nor
is it guaranteed to work with all vendor runtimes.
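For example (the application name below is hypothetical), limiting an OpenCL CPU application to two cores is just a matter of exporting the variable before launching the program:

```shell
# Limit the AMD OpenCL CPU runtime to two compute units for this session.
export CPU_MAX_COMPUTE_UNITS=2
echo "CPU_MAX_COMPUTE_UNITS=$CPU_MAX_COMPUTE_UNITS"
# ./dynOCL    # hypothetical OpenCL application; now sees at most two cores
```

Because the variable is read by the runtime at startup, it affects every OpenCL application launched from this shell until it is unset.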

The second method uses the EXT Device Fission extension, cl_ext_device_fission, which provides an
interface within OpenCL to subdivide a device into multiple sub-devices. The
programmer can then create a command queue on each sub-device and enqueue
kernels that run only on the resources (e.g. processor cores) within that
sub-device. Each sub-device runs asynchronously with respect to the other sub-devices.
Currently, Device Fission only works for multi-core processors (both AMD and
Intel) and the Cell Broadband Engine; GPUs are not supported.
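As a concrete sketch of the interface (assuming an OpenCL 1.1 runtime whose CL/cl_ext.h declares clCreateSubDevicesEXT; error checking is abbreviated and the sub-device array size of 16 is an arbitrary upper bound), the following host code partitions a CPU device into sub-devices of two compute units each and creates a command queue on the first one:

```c
/* A sketch, not a full program: partition a CPU device into sub-devices of
 * two compute units each, then create a context and command queue on the
 * first sub-device.  Assumes a runtime exposing cl_ext_device_fission. */
#include <CL/cl.h>
#include <CL/cl_ext.h>

cl_command_queue queueOnFirstSubDevice(cl_device_id cpuDevice)
{
    cl_device_partition_property_ext props[] = {
        CL_DEVICE_PARTITION_EQUALLY_EXT, 2,   /* two compute units apiece */
        CL_PROPERTIES_LIST_END_EXT
    };
    cl_device_id sub[16];                     /* arbitrary upper bound    */
    cl_uint nSub = 0;
    clCreateSubDevicesEXT(cpuDevice, props, 16, sub, &nSub);

    /* Kernels enqueued on this queue run only on sub[0]'s cores; the
     * other sub-devices can host their own queues and run asynchronously. */
    cl_context ctx = clCreateContext(NULL, 1, &sub[0], NULL, NULL, NULL);
    return clCreateCommandQueue(ctx, sub[0], 0, NULL);
}
```

The extension also defines CL_DEVICE_PARTITION_BY_COUNTS_EXT and CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN_EXT for uneven and cache-aware partitionings, respectively.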

(Note: it is possible
to restrict the number of work-groups and work-items so that an OpenCL kernel uses
only a few cores of a multi-core processor, and then rely on the operating system
to schedule multiple applications efficiently. This method is not
recommended for many reasons, including the fact that it effectively hardcodes a
purposely wasteful usage of resources. Further, this trick depends on external
factors like the operating system to achieve efficient operation, and it
will not work on GPUs.)

The webinar slides "Device
Fission Extension for OpenCL" by Ben Gaster discuss Device Fission in the
context of parallel pipelines for containers. He notes there are three general
use cases in which a user would like to subdivide a device:

To reserve part of the device for high-priority / latency-sensitive tasks.

To more directly control the assignment of work to individual compute units.

To subdivide compute devices along some shared hardware feature, such as a cache.

Typically these are use cases where some level of additional
control is required to get optimal performance beyond that provided by standard
OpenCL 1.1 APIs. Proper use of this interface assumes some detailed knowledge
of the devices.
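The third use case maps directly onto the extension's partition-by-affinity-domain property. As a sketch (token names are from CL/cl_ext.h), a property list such as the following asks the runtime to split a device into sub-devices whose compute units each share an L2 cache:

```c
/* Partition so that each sub-device's compute units share an L2 cache. */
cl_device_partition_property_ext props[] = {
    CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN_EXT,
    CL_AFFINITY_DOMAIN_L2_CACHE_EXT,
    CL_PROPERTIES_LIST_END_EXT
};
```

Other affinity domains defined by the extension include L1, L3, and L4 caches as well as NUMA nodes, which is where the "detailed knowledge of the devices" mentioned above comes in.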

The AMD SDK samples provide an example that uses Device
Fission on multi-core processors. In a standard install, this sample can be
found in /opt/AMDAPP/samples/cl/app/DeviceFission. Ben Gaster also has some
nice slides discussing the basics required to utilize Device Fission in his March
2011 presentation to the Khronos Group, "OpenCL
Device Fission".

Device Fission in an OpenCL click-together tool framework

As noted in part 8 of this
tutorial series, preprocessing the data can be as complicated and time
consuming as the actual computation that generates the desired results. A
"click-together" framework (illustrated below and discussed in greater detail
in part 8) naturally exploits the parallelism of multi-core processors because
each element in the pipeline is a separate application. The operating system
scheduler ensures that any applications that have the data necessary to perform
work will run – generally on separate processor cores. Under some
circumstances, it is desirable to partition the workflow so that multiple
click-together OpenCL applications can run on separate cores without
interfering with each other. Perhaps the tasks are latency sensitive or the
developer wishes to use a command like numactl under UNIX to
bind an application to specific processing cores for better cache utilization.
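As a sketch of such a binding (the binary name and core numbers are placeholders), a pipeline stage could be pinned to two cores and their local memory like this:

```shell
# Bind a hypothetical click-together stage to cores 0-1 and NUMA node 0.
CMD="numactl --physcpubind=0,1 --membind=0 ./dynOCL"
echo "$CMD"
# Uncomment on a system with numactl installed:
# $CMD
```

Combined with Device Fission inside the process, this gives the developer control over both which cores the OS schedules the process on and which compute units the OpenCL runtime uses.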

Figure 1: Example click-together workflow

The source code for dynOCL.cc from part 8
has been modified to use Device Fission: the command queue is created on a
single sub-device rather than on the full device, which limits the kernel to
the cores within that sub-device.

Following is the graphical output of the system monitor for a
6-core AMD Phenom™ II X6 1055T processor running Ubuntu 10.10, which demonstrates
the default OpenCL behavior of running on all the cores. As noted previously, the longAdd.cl
source code was substituted for simpleAdd.cl in the part 8 script.
Notice that the processor utilization jumps for all six processors when the
application starts running.

Example 6: Default OpenCL behavior is to use all processing cores

Utilizing the Device Fission version of dynOCL.cc
from this tutorial, we see that only one processing core (in this case the
orange line) achieves high utilization.
Example 7: Only a single core is used by the Device Fission code

Summary

OpenCL extensions provide programmers with additional capabilities
such as double-precision arithmetic and Device Fission. Vendor extensions are
the least portable but they do provide an important path to expose an API to
exploit device capabilities. The KHR extensions are the most general as they
require formal ratification and a test suite to define a standard behavior. The
EXT extensions can be viewed as a "work in progress" API that might eventually
achieve the formal status of a KHR extension.

With the Device Fission extension, programmers have an API
to subdivide multi-core processors to better exploit system capabilities. The
Google protobuf streaming framework introduced in part 8 was easily extended to
utilize Device Fission. Through operating systems commands such as numactl, programmers can
even bind OpenCL applications in this streaming framework to specific
processing cores. By extension, OpenCL application programmers can use Device
Fission to further optimize OpenCL plugins and generic workflows discussed in
parts 7 and 8 of this tutorial series.


About the Author

Rob Farber is a senior scientist and research consultant at the Irish Center for High-End Computing in Dublin, Ireland and Chief Scientist for BlackDog Endeavors, LLC in the US.

Rob has been on staff at several US national laboratories including Los Alamos National Laboratory, Lawrence Berkeley National Laboratory, and Pacific Northwest National Laboratory. He also served as an external faculty member at the Santa Fe Institute, co-founded two successful start-ups, and has been a consultant to Fortune 100 companies. His articles have appeared in Dr. Dobb's Journal and Scientific Computing, among others. He recently completed a book on massively parallel computing.