HSA provides an execution model similar to OpenCL. Instructions are executed
in parallel by a group of hardware threads. In some ways, this is similar to
the single-instruction, multiple-data (SIMD) model, but with the convenience
that the fine-grained scheduling is hidden from the programmer, rather than
requiring you to program with SIMD vectors as a data structure. In HSA, the
code you write will be executed by multiple threads at once (often hundreds
or thousands). Your solution is modeled by defining a thread hierarchy of
grid, workgroup and workitem.

Numba’s HSA support exposes facilities to declare and manage this
hierarchy of threads.

The HSA execution model is similar to CUDA's. The main difference is the
shared memory model employed by HSA: there is no separate device memory. The
GPU hardware uses the machine's main memory (or host memory, in CUDA terms)
directly. Therefore, you do not need to_device() and copy_to_host() calls
in HSA programming.

A kernel function is a GPU function that is meant to be called from CPU
code. This gives it two fundamental characteristics:

kernels cannot explicitly return a value; all result data must be written
to an array passed to the function (if computing a scalar, you will
probably pass a one-element array);

kernels explicitly declare their thread hierarchy when called: i.e.
the number of workgroups and the number of workitems per workgroup
(note that while a kernel is compiled once, it can be called multiple
times with different workgroup sizes or grid sizes).

At first sight, writing a HSA kernel with Numba looks very much like
writing a JIT function for the CPU:

```python
@hsa.jit
def increment_by_one(an_array):
    """
    Increment all array elements by one.
    """
    # code elided here; read further for different implementations
```

Instantiate the kernel proper, by specifying a number of workgroups
(or "workgroups per grid"), and a number of workitems per workgroup. The
product of the two gives the total number of workitems launched. Kernel
instantiation is done by taking the compiled kernel function
(here increment_by_one) and indexing it with a tuple of integers.

Running the kernel, by passing it the input array (and any separate
output arrays if necessary). By default, running a kernel is synchronous:
the function returns when the kernel has finished executing and the
data is synchronized back.
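Putting the two steps together, a sketch of the host-side arithmetic (the array size and workgroup size here are assumed values; the rounding-up division ensures every element is covered by at least one workitem):

```python
import numpy as np

an_array = np.arange(1000, dtype=np.float64)  # assumed input size

itempergroup = 32  # workitems per workgroup (an assumed, common choice)
# Round up so that groupperrange * itempergroup >= an_array.size
groupperrange = (an_array.size + (itempergroup - 1)) // itempergroup

# Kernel instantiation and launch require HSA hardware, so the call
# is shown as a comment here:
# increment_by_one[groupperrange, itempergroup](an_array)
```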

To help deal with multi-dimensional arrays, HSA allows you to specify
multi-dimensional workgroups and grids. In the example above, you could
make itempergroup and groupperrange tuples of one, two
or three integers. Compared to 1D declarations of equivalent sizes,
this changes nothing about the efficiency or behaviour of the generated
code, but can help you write your algorithms in a more natural way.
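For example, for a hypothetical 400x300 2D array, the per-dimension group counts can be computed with the same rounding-up arithmetic (the shapes below are assumed values):

```python
import math

# Hypothetical 2D array shape and workgroup shape (assumed values).
array_shape = (400, 300)
itempergroup = (16, 16)  # 256 workitems per workgroup
# One rounded-up group count per dimension.
groupperrange = tuple(math.ceil(n / g)
                      for n, g in zip(array_shape, itempergroup))
# 25 * 16 = 400 covers the first axis; 19 * 16 = 304 >= 300 covers the second.
# a_2d_kernel[groupperrange, itempergroup](a_2d_array)  # launch sketch
```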

When running a kernel, the kernel function’s code is executed by every
thread once. It therefore has to know which thread it is in, in order
to know which array element(s) it is responsible for (complex algorithms
may define more complex responsibilities, but the underlying principle
is the same).

One way is for the thread to determine its position in the grid and
workgroup, and manually compute the corresponding array position:
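The arithmetic each workitem performs can be sketched in pure Python; in an actual kernel, the three inputs would come from the hsa.get_local_id(0), hsa.get_group_id(0) and hsa.get_local_size(0) intrinsics:

```python
# Pure-Python sketch of the flattened-index computation; inside a real
# @hsa.jit kernel these values come from the hsa.get_* intrinsics.
def flat_position(local_id, group_id, local_size):
    # local_id:   workitem index within its workgroup
    # group_id:   workgroup index within the grid
    # local_size: number of workitems per workgroup
    return local_id + group_id * local_size

# Workitem 3 of workgroup 2, with 32 workitems per workgroup,
# is responsible for array element 2 * 32 + 3 = 67.
pos = flat_position(3, 2, 32)
```

In the kernel itself, pos would then be bounds-checked against the array size before indexing, since the last workgroup may extend past the end of the array.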

get_local_size(dim)
Returns the size of the workgroup at the given dimension.
The value is declared when instantiating the kernel.
This value is the same for all workitems in a given kernel,
even if they belong to different workgroups (i.e. each workgroup is "full").