
Examples

This guide was created for versions: v0.1.0 - Latest

In this chapter we show different SYCL and CUDA examples and demonstrate the
similarities and differences between them.

Depending on how the code has been written, there are three approaches to maintaining it.

In the first approach, we maintain the CUDA/SYCL application by encapsulating the SYCL and CUDA code behind C++ abstractions. A developer can then keep a single source file that can be compiled with a CUDA compiler for NVIDIA GPUs, or with ComputeCpp for SYCL. A pre-processor macro can be used to specify which code is used for SYCL and which is used for CUDA, as shown below:
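
The following is a minimal sketch of this idea. The USE_SYCL macro, the function and kernel names, and the block size of 256 are illustrative choices for this sketch, not part of any particular library:

```cpp
// Sketch only: USE_SYCL is a hypothetical macro set by the build system
// when the file is compiled with a SYCL compiler such as ComputeCpp.
#include <cstddef>

#if defined(USE_SYCL)
  #include <CL/sycl.hpp>
#else
  #include <cuda_runtime.h>
#endif

#if !defined(USE_SYCL)
// CUDA kernel: only visible to the CUDA compiler.
__global__ void add_kernel(const float* a, const float* b, float* c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) c[i] = a[i] + b[i];
}
#endif

void vector_add(const float* a, const float* b, float* c, int n) {
#if defined(USE_SYCL)
  // SYCL path: buffers plus a parallel_for submitted to a queue.
  cl::sycl::queue q;
  cl::sycl::buffer<float, 1> bufA(a, cl::sycl::range<1>(n));
  cl::sycl::buffer<float, 1> bufB(b, cl::sycl::range<1>(n));
  cl::sycl::buffer<float, 1> bufC(c, cl::sycl::range<1>(n));
  q.submit([&](cl::sycl::handler& cgh) {
    auto A = bufA.get_access<cl::sycl::access::mode::read>(cgh);
    auto B = bufB.get_access<cl::sycl::access::mode::read>(cgh);
    auto C = bufC.get_access<cl::sycl::access::mode::write>(cgh);
    cgh.parallel_for<class add_k>(cl::sycl::range<1>(n),
        [=](cl::sycl::id<1> i) { C[i] = A[i] + B[i]; });
  });
#else
  // CUDA path: explicit device allocations, copies and a <<<...>>> launch.
  float *dA, *dB, *dC;
  size_t bytes = n * sizeof(float);
  cudaMalloc((void**)&dA, bytes);
  cudaMalloc((void**)&dB, bytes);
  cudaMalloc((void**)&dC, bytes);
  cudaMemcpy(dA, a, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(dB, b, bytes, cudaMemcpyHostToDevice);
  int threads = 256;
  int blocks = (n + threads - 1) / threads;
  add_kernel<<<blocks, threads>>>(dA, dB, dC, n);
  cudaMemcpy(c, dC, bytes, cudaMemcpyDeviceToHost);
  cudaFree(dA); cudaFree(dB); cudaFree(dC);
#endif
}
```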

In the second approach, we use template specialization: the CUDA and SYCL versions implement their own back-ends, and at compile time, based on the chosen compiler, either the CUDA or the SYCL back-end is selected. In the example below, different back-ends implement different functionality for the library, but the main functionality and the interface are the same. This is the approach followed by the standard C++ Parallel STL library, among others.
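
A minimal structural sketch of this pattern follows. The backend tags and the vector_add_impl template are illustrative, not taken from an existing library; each specialization would be implemented in its own translation unit and compiled by the matching compiler:

```cpp
// Sketch only: names are placeholders for this guide.
#include <cstddef>

struct sycl_backend {};
struct cuda_backend {};

// Primary template: every back-end must provide its own specialization.
template <typename Backend>
struct vector_add_impl;

// SYCL back-end: the body (omitted here) would submit a SYCL kernel.
template <>
struct vector_add_impl<sycl_backend> {
  static void run(const float* a, const float* b, float* c, std::size_t n);
};

// CUDA back-end: the body (omitted here) would launch a __global__ kernel.
template <>
struct vector_add_impl<cuda_backend> {
  static void run(const float* a, const float* b, float* c, std::size_t n);
};

// The interface stays the same; the back-end is chosen at compile time.
#if defined(USE_SYCL)
using default_backend = sycl_backend;
#else
using default_backend = cuda_backend;
#endif

inline void vector_add(const float* a, const float* b, float* c, std::size_t n) {
  vector_add_impl<default_backend>::run(a, b, c, n);
}
```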

A third possible approach is simply to port all the CUDA code to SYCL, and use a SYCL implementation capable of running on NVIDIA devices via a PTX back-end. This is the simplest to maintain: once the code is in C++ and SYCL, there is only a single source to maintain. However, it may not offer all the features of the latest NVIDIA hardware, since those may not yet be available in the SYCL implementation.

The first two approaches require the SYCL and CUDA code to be separated in the source using pre-processor macros, and require different build systems.
In the following subsections we describe how to convert existing CUDA code
into its SYCL equivalent, which can be used for any of the options above.

In the last case, there is a single source that can still be executed on NVIDIA
GPU devices.

This section presents a step-by-step CUDA and SYCL example for adding two
vectors together. The purpose of this example is to compare the CUDA and SYCL
programming models, demonstrating how to map both the API and the concepts from the
former to the latter.

The completed runnable code samples for CUDA and SYCL are available at the end of this
section.

In SYCL, a buffer represents the memory storage; however, accessing that memory
is handled via an accessor object. The following code snippet from the SYCL vector
addition code represents the buffer creation.
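
A minimal sketch of this step, assuming <CL/sycl.hpp> and <vector> are included; the size N = 1024, the initial values, and the variable names are illustrative, not taken from the guide's full sample:

```cpp
constexpr size_t N = 1024;
std::vector<float> VA(N, 1.0f), VB(N, 2.0f), VC(N);

// Each buffer wraps existing host memory; the runtime manages data movement.
cl::sycl::buffer<float, 1> bufferA(VA.data(), cl::sycl::range<1>(N));
cl::sycl::buffer<float, 1> bufferB(VB.data(), cl::sycl::range<1>(N));
cl::sycl::buffer<float, 1> bufferC(VC.data(), cl::sycl::range<1>(N));
```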

A SYCL accessor can be either
a host accessor or a device accessor. A device accessor is created within a
queue submit scope and is only usable inside the kernel lambda. For the
different types of accessors, see section 4.7.6 of the SYCL 1.2.1 specification.
The following code snippet represents the device accessor creation for the SYCL vector
addition code.
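
A sketch of this step, building on the buffers created above; deviceQueue and the kernel name are illustrative, and the simple range form of parallel_for is used here (the nd_range form with an explicit work-group size is shown further below):

```cpp
cl::sycl::queue deviceQueue;  // default-selected device

deviceQueue.submit([&](cl::sycl::handler& cgh) {
  // Device accessors: read access to the inputs, write access to the output.
  auto accessorA = bufferA.get_access<cl::sycl::access::mode::read>(cgh);
  auto accessorB = bufferB.get_access<cl::sycl::access::mode::read>(cgh);
  auto accessorC = bufferC.get_access<cl::sycl::access::mode::write>(cgh);

  // The accessors are only valid inside the kernel lambda below.
  cgh.parallel_for<class vector_add>(cl::sycl::range<1>(N),
      [=](cl::sycl::id<1> i) { accessorC[i] = accessorA[i] + accessorB[i]; });
});
```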

CUDA, in contrast, exposes device memory through raw pointers: device memory is
allocated with the cudaMalloc function and referenced through an ordinary
pointer. The following code snippet from the CUDA code for vector addition
represents the device memory allocation for CUDA.
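
A sketch of this step; h_A and h_B are assumed to be the host input arrays of length N (names are illustrative):

```cpp
// Allocate the device vectors; each allocation is an ordinary float pointer.
float *d_A = nullptr, *d_B = nullptr, *d_C = nullptr;
cudaMalloc((void**)&d_A, N * sizeof(float));
cudaMalloc((void**)&d_B, N * sizeof(float));
cudaMalloc((void**)&d_C, N * sizeof(float));

// Copy the input vectors from the host to the device.
cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);
```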

In CUDA, we need to set the number of blocks per grid and the block size in
order to distribute the workload among the threads. In SYCL, we need to set the
total number of threads executing the kernel (the global size) and the work-group
size. We use the same approach in CUDA and SYCL for launching the kernel. The
following code snippet from the SYCL vector addition sample represents the
calculation of the number of threads for SYCL.
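
A sketch of this calculation, reusing N and deviceQueue from the earlier snippets; choosing the device's maximum work-group size as the local size is an illustrative choice:

```cpp
// Pick a work-group (local) size, here the device maximum, and round the
// global size up to the next multiple of it.
auto device = deviceQueue.get_device();
size_t localSize =
    device.get_info<cl::sycl::info::device::max_work_group_size>();
size_t globalSize = ((N + localSize - 1) / localSize) * localSize;

cl::sycl::nd_range<1> launchRange{cl::sycl::range<1>(globalSize),
                                  cl::sycl::range<1>(localSize)};
```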

In CUDA, we choose the maximum number of threads per block, which is equivalent to the
SYCL local size (work-group size). SYCL uses the concept of a global size
(the total number of threads), whereas CUDA uses the number of blocks per
grid, where the number of blocks per grid is the total number of elements
divided by the threads per block, rounded up.
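
The CUDA side of the same calculation looks like the following sketch; the block size of 256 is an illustrative choice:

```cpp
// threadsPerBlock corresponds to the SYCL local size; blocksPerGrid *
// threadsPerBlock corresponds to the SYCL global size.
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;  // round up
```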

In SYCL, dispatching the kernel is handled via a queue submit command, while in
CUDA dispatching the kernel is performed via the <<<...>>> launch syntax. In both
SYCL and CUDA the kernel execution is handled automatically by the runtime system.
The following code snippet from the CUDA vector addition sample code represents
dispatching the kernel in CUDA.
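
A sketch of the dispatch, assuming a __global__ kernel named vectorAdd, a host result array h_C, and the device pointers and launch configuration from the previous snippets (all names are illustrative):

```cpp
// Launch one thread per element, arranged in blocksPerGrid blocks of
// threadsPerBlock threads each.
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

// Copying the result back implicitly waits for the kernel to finish.
cudaMemcpy(h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);
```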

By leveraging SYCL C++ scopes, the data is automatically
returned to the host, so there is no need for an explicit copy. The following
code snippet from the SYCL vector addition sample code uses C++ scopes (marked by { and }) to trigger memory synchronization points.
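
A sketch that pulls the earlier SYCL snippets together; launchRange is the nd_range computed above, and the guard against the rounded-up global size is part of this sketch:

```cpp
// The buffers live inside a C++ scope; when the scope ends they are destroyed
// and their contents are written back to the host vectors.
{
  cl::sycl::buffer<float, 1> bufferA(VA.data(), cl::sycl::range<1>(N));
  cl::sycl::buffer<float, 1> bufferB(VB.data(), cl::sycl::range<1>(N));
  cl::sycl::buffer<float, 1> bufferC(VC.data(), cl::sycl::range<1>(N));

  deviceQueue.submit([&](cl::sycl::handler& cgh) {
    auto accessorA = bufferA.get_access<cl::sycl::access::mode::read>(cgh);
    auto accessorB = bufferB.get_access<cl::sycl::access::mode::read>(cgh);
    auto accessorC = bufferC.get_access<cl::sycl::access::mode::write>(cgh);
    cgh.parallel_for<class vector_add_nd>(launchRange,
        [=](cl::sycl::nd_item<1> item) {
          size_t i = item.get_global_id(0);
          if (i < N) {  // the global size was rounded up, so guard the tail
            accessorC[i] = accessorA[i] + accessorB[i];
          }
        });
  });
}  // <- bufferC's destructor waits for the kernel and writes the result into VC
```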

This section complements the vector addition section, which introduces the SYCL
concepts of a queue, a buffer, and a kernel. It covers applying a step-by-step
optimization strategy to a SYCL reduction algorithm. The CUDA equivalent of the
step-by-step optimization of a reduction algorithm is located in the CUDA Toolkit
Documentation under the NVIDIA license. Please refer to the CUDA Code Samples
for further information about downloading and installing the CUDA samples on
different platforms.

The reduction algorithm is one of the most common data-parallel algorithms. It uses a
collective communication primitive to combine multiple elements of a vector
into a single one, using an associative binary operator. The reduction
requires local memory to communicate the partial results calculated by
different work-items within a work-group. We use a tree-based algorithm
to calculate the reduction. Figure @fig:8 illustrates the different concepts
behind the reduction. The optimization steps for the parallel reduction algorithm
are based on the CUDA parallel reduction.

For simplicity, we use an input size that is a power of two. However, it is
easy to support non-power-of-two sizes by checking the boundaries in the
kernel before executing.

In the naive algorithm, we simply use interleaved thread access: at each step,
half of the remaining threads are disabled, until only one thread is left.

The following picture represents the architectural view of the naive algorithm.

The following code represents the naive reduction algorithm in SYCL. Following
the tree shown in figure @fig:8, we use n threads and, in log n
steps, reduce the input to a single element.
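
A sketch of the naive kernel, assuming deviceQueue is a cl::sycl::queue, in_buf holds the N input floats, out_buf has room for one partial result per work-group, N is a power of two, and wgroup_size divides N (all names are illustrative):

```cpp
deviceQueue.submit([&](cl::sycl::handler& cgh) {
  auto in = in_buf.get_access<cl::sycl::access::mode::read>(cgh);
  auto out = out_buf.get_access<cl::sycl::access::mode::write>(cgh);
  // Local (work-group) memory used to exchange partial results.
  cl::sycl::accessor<float, 1, cl::sycl::access::mode::read_write,
                     cl::sycl::access::target::local>
      scratch(cl::sycl::range<1>(wgroup_size), cgh);

  cgh.parallel_for<class naive_reduce>(
      cl::sycl::nd_range<1>(cl::sycl::range<1>(N),
                            cl::sycl::range<1>(wgroup_size)),
      [=](cl::sycl::nd_item<1> item) {
        size_t lid = item.get_local_id(0);
        scratch[lid] = in[item.get_global_id(0)];
        item.barrier(cl::sycl::access::fence_space::local_space);

        // Interleaved addressing: at step s only every (2*s)-th work item is
        // active, so half of the remaining threads are disabled each step.
        for (size_t s = 1; s < wgroup_size; s *= 2) {
          if (lid % (2 * s) == 0) {
            scratch[lid] += scratch[lid + s];
          }
          item.barrier(cl::sycl::access::fence_space::local_space);
        }
        // The first work item writes the work-group's partial result.
        if (lid == 0) {
          out[item.get_group(0)] = scratch[0];
        }
      });
});
```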

This kernel uses interleaved thread access, which is costly, especially on SIMD
(Single Instruction, Multiple Data) architectures: in each step, half of the
threads that could be grouped together and executed in lockstep are disabled.
Also, we use the modulo operator, which is expensive to compute.

However, the memory access is still not coalesced, as each thread accesses every
i-th element, where i is a power of two that doubles on each iteration of the for
loop. Also, the way each thread accesses the data can result in bank conflicts.
For example, suppose that we have 32 threads and the local memory on a device has
16 banks. For each work-group, the first 16 threads are used to reduce the data.
The first 8 threads will use the same 8 banks as the second 8 threads, leaving 8
banks unused. Therefore, the access to the local memory for the second 8 threads
is serialized. This can reduce the level of parallelism by a factor of 2.

Applying Brent’s theorem, each thread sequentially reduces k elements and sums
them in private memory. Then, each thread writes its result into the local
memory. This increases the workload per thread by a factor of k. For
simplicity, we set k to a power of two.
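
A sketch of this version. The factor k, the buffers, and wgroup_size are illustrative; n is assumed to be a power of two with k * wgroup_size dividing n, and out_buf has room for one partial result per work-group:

```cpp
#include <CL/sycl.hpp>

// Each work item first sums k grid-strided elements in private memory, then a
// tree reduction combines the per-item sums in local memory.
void launch_reduction_k(cl::sycl::queue& q, cl::sycl::buffer<float, 1>& in_buf,
                        cl::sycl::buffer<float, 1>& out_buf, size_t n,
                        size_t wgroup_size, size_t k) {
  size_t global_size = n / k;  // k elements per work item
  q.submit([&](cl::sycl::handler& cgh) {
    auto in = in_buf.get_access<cl::sycl::access::mode::read>(cgh);
    auto out = out_buf.get_access<cl::sycl::access::mode::write>(cgh);
    cl::sycl::accessor<float, 1, cl::sycl::access::mode::read_write,
                       cl::sycl::access::target::local>
        scratch(cl::sycl::range<1>(wgroup_size), cgh);
    cgh.parallel_for<class reduction_k>(
        cl::sycl::nd_range<1>(cl::sycl::range<1>(global_size),
                              cl::sycl::range<1>(wgroup_size)),
        [=](cl::sycl::nd_item<1> item) {
          size_t lid = item.get_local_id(0);
          size_t gid = item.get_global_id(0);
          // Sequential part: k strided loads summed in a private variable.
          float private_sum = 0.0f;
          for (size_t e = 0; e < k; ++e) {
            private_sum += in[gid + e * global_size];
          }
          scratch[lid] = private_sum;
          item.barrier(cl::sycl::access::fence_space::local_space);
          // Tree part: reduce the work-group's sums in local memory.
          for (size_t s = wgroup_size / 2; s > 0; s >>= 1) {
            if (lid < s) {
              scratch[lid] += scratch[lid + s];
            }
            item.barrier(cl::sycl::access::fence_space::local_space);
          }
          if (lid == 0) {
            out[item.get_group(0)] = scratch[0];
          }
        });
  });
}
```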

By converting the local size to a compile-time value, the start, end, and
step of the for loop are defined at compile time. Therefore, the compiler
is able to automatically unroll the for loop. The following kernel receives
the local size as a template parameter.
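
A sketch of this step; for clarity it omits the per-thread sequential part shown above, and the names are again illustrative:

```cpp
#include <CL/sycl.hpp>

// Forward declaration of the kernel name so it can carry the template parameter.
template <size_t WgroupSize>
class block_reduce;

// WgroupSize is a compile-time constant, so the bounds and step of the
// reduction loop are known to the compiler and the loop can be unrolled.
template <size_t WgroupSize>
void launch_reduction(cl::sycl::queue& q, cl::sycl::buffer<float, 1>& in_buf,
                      cl::sycl::buffer<float, 1>& out_buf, size_t n) {
  q.submit([&](cl::sycl::handler& cgh) {
    auto in = in_buf.get_access<cl::sycl::access::mode::read>(cgh);
    auto out = out_buf.get_access<cl::sycl::access::mode::write>(cgh);
    cl::sycl::accessor<float, 1, cl::sycl::access::mode::read_write,
                       cl::sycl::access::target::local>
        scratch(cl::sycl::range<1>(WgroupSize), cgh);
    cgh.parallel_for<block_reduce<WgroupSize>>(
        cl::sycl::nd_range<1>(cl::sycl::range<1>(n),
                              cl::sycl::range<1>(WgroupSize)),
        [=](cl::sycl::nd_item<1> item) {
          size_t lid = item.get_local_id(0);
          scratch[lid] = in[item.get_global_id(0)];
          item.barrier(cl::sycl::access::fence_space::local_space);
          // Compile-time bounds: the compiler can fully unroll this loop.
          for (size_t s = WgroupSize / 2; s > 0; s >>= 1) {
            if (lid < s) {
              scratch[lid] += scratch[lid + s];
            }
            item.barrier(cl::sycl::access::fence_space::local_space);
          }
          if (lid == 0) {
            out[item.get_group(0)] = scratch[0];
          }
        });
  });
}
```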

When the number of work-groups is greater than one, the partial result of each
work-group must be written to a temporary global buffer, because there is no
global barrier among work-groups in OpenCL. Therefore, we need more than one
step to reduce the data. In this case, we use an n-step reduction, where the
result of each reduction step i is stored in a temporary buffer temp_buff. This
temporary buffer is passed as the input buffer to reduction step i+1. However,
launching too many kernels can be expensive. In order to optimize the number of
kernels in the n-step reduction algorithm, we set a threshold value, called
work_group_load. If the size of temp_buff is bigger than work_group_load,
we launch a new kernel and pass temp_buff as its input. Otherwise,
we bring the data back to the host and reduce it in a for loop.

The following picture represents the architectural view of launching reduction kernels recursively.

The following code demonstrates launching the n-step reduction kernels recursively
with the help of the threshold value work_group_load.
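
A sketch of such a driver, assuming the templated launch_reduction helper sketched above, an input size that is a power of two and a multiple of wgroup_size whenever the device path is taken, and work_group_load >= wgroup_size; the fixed work-group size of 256 is an illustrative choice:

```cpp
#include <CL/sycl.hpp>

float reduce(cl::sycl::queue& q, cl::sycl::buffer<float, 1>& in_buf,
             size_t n, size_t work_group_load) {
  constexpr size_t wgroup_size = 256;
  if (n > work_group_load) {
    // One more device-side reduction step: one partial result per work-group
    // is written into a fresh temporary buffer, which becomes the next input.
    size_t n_groups = n / wgroup_size;
    cl::sycl::buffer<float, 1> temp_buff{cl::sycl::range<1>(n_groups)};
    launch_reduction<wgroup_size>(q, in_buf, temp_buff, n);
    return reduce(q, temp_buff, n_groups, work_group_load);
  }
  // Few enough elements remain: bring them to the host and finish in a for loop.
  auto host_acc = in_buf.get_access<cl::sycl::access::mode::read>();
  float result = 0.0f;
  for (size_t i = 0; i < n; ++i) {
    result += host_acc[i];
  }
  return result;
}
```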