To support GPU programming, the NVPTX back-end supports a subset of LLVM IR
along with a defined set of conventions used to represent GPU programming
concepts. This document provides an overview of the general usage of the back-
end, including a description of the conventions used and the set of accepted
LLVM IR.

Note

This document assumes a basic familiarity with CUDA and the PTX
assembly language. Information about the CUDA Driver API and the PTX assembly
language can be found in the CUDA documentation.

In PTX, there are two types of functions: device functions, which are only
callable by device code, and kernel functions, which are callable by host
code. By default, the back-end will emit device functions. Metadata is used to
declare a function as a kernel function. This metadata is attached to the
nvvm.annotations named metadata object, and has the following format:

!0 = !{<function-ref>, metadata !"kernel", i32 1}

The first parameter is a reference to the kernel function. The following
example shows a kernel function calling a device function in LLVM IR. The
function @my_kernel is callable from host code, but @my_fmad is not.
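A minimal sketch of such a module follows (the function bodies and exact
pointer types are illustrative):

define float @my_fmad(float %x, float %y, float %z) {
  %mul = fmul float %x, %y
  %add = fadd float %mul, %z
  ret float %add
}

define void @my_kernel(float* %ptr) {
  %val = load float, float* %ptr
  %ret = call float @my_fmad(float %val, float %val, float %val)
  store float %ret, float* %ptr
  ret void
}

!nvvm.annotations = !{!1}
!1 = !{void (float*)* @my_kernel, !"kernel", i32 1}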

Every global variable and pointer type is assigned to one of these address
spaces, with 0 being the default (generic) address space; global memory is
address space 1, shared is 3, constant is 4, and local is 5. Intrinsics are
provided which can be used to convert pointers between the generic and
non-generic address spaces.

As an example, the following IR will define an array @g that resides in
global device memory.

@g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ]

LLVM IR functions can read and write to this array, and host-side code can
copy data to it by name with the CUDA Driver API.

Note that since address space 0 is the generic space, it is illegal to have
global variables in address space 0. Address space 0 is the default address
space in LLVM, so the addrspace(N) annotation is required for global
variables.

The NVPTX target uses the module triple to select between 32- and 64-bit code
generation and to determine which driver-compiler interface to use. The triple architecture
can be one of nvptx (32-bit PTX) or nvptx64 (64-bit PTX). The
operating system should be one of cuda or nvcl, which determines the
interface used by the generated code to communicate with the driver. Most
users will want to use cuda as the operating system, which makes the
generated PTX compatible with the CUDA Driver API.
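For example, a module targeting 64-bit PTX through the CUDA driver interface
would typically carry the triple:

target triple = "nvptx64-nvidia-cuda"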

The ‘llvm.nvvm.ptr.gen.to.*’ intrinsics convert a pointer in the generic
address space to a pointer in the target address space. Note that these
intrinsics are only useful if the address space that the pointer actually
refers to is known. It is not legal to use address space conversion
intrinsics to convert a pointer from one non-generic address space to another
non-generic address space.
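As a sketch, converting a generic pointer that is known to refer to global
memory might look like the following; the overload suffix in the intrinsic
name (here p1i32.p0i32) encodes the result and operand pointer types and is
shown for illustration only:

declare i32 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i32.p0i32(i32*)

define void @zero_global(i32* %gp) {
  ; %gp is assumed to actually point into global memory
  %g = call i32 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i32.p0i32(i32* %gp)
  store i32 0, i32 addrspace(1)* %g
  ret void
}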

The CUDA Toolkit comes with an LLVM bitcode library called libdevice that
implements many common mathematical functions. This library can be used as a
high-performance math library for any compilers using the LLVM NVPTX target.
The library can be found under nvvm/libdevice/ in the CUDA Toolkit and
there is a separate version for each compute architecture.

To accommodate various math-related compiler flags that can affect code
generation of libdevice code, the library code depends on a special LLVM IR
pass (NVVMReflect) to handle conditional compilation within LLVM IR. This
pass looks for calls to the @__nvvm_reflect function and replaces them
with constants based on the defined reflection parameters. Such conditional
code often follows a pattern:
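A sketch of the shape this takes in IR (@example, the block names, and the
placeholder arithmetic are illustrative rather than actual libdevice code):

@.str = private unnamed_addr constant [11 x i8] c"__CUDA_FTZ\00"

declare i32 @__nvvm_reflect(i8*)

define float @example(float %a, float %b) {
entry:
  ; Query the value of the __CUDA_FTZ reflection parameter
  %flag = call i32 @__nvvm_reflect(i8* getelementptr inbounds ([11 x i8], [11 x i8]* @.str, i32 0, i32 0))
  %use_ftz = icmp ne i32 %flag, 0
  br i1 %use_ftz, label %ftz, label %noftz

ftz:                                    ; placeholder for the flush-to-zero code path
  %r0 = fmul float %a, %b
  ret float %r0

noftz:                                  ; placeholder for the default code path
  %r1 = fmul float %a, %b
  ret float %r1
}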

The NVVMReflect pass should be executed early in the optimization
pipeline, immediately after the link stage. The internalize pass is also
recommended to remove unused math functions from the resulting PTX. For an
input IR module module.bc, the following compilation flow is recommended:

1. Save list of external functions in module.bc

2. Link module.bc with libdevice.compute_XX.YY.bc

3. Internalize all functions not in list from (1)

4. Eliminate all unused internal functions

5. Run NVVMReflect pass

6. Run standard optimization pipeline

Note

linkonce and linkonce_odr linkage types are not suitable for the
libdevice functions, because it is possible to link two IR modules that have
been linked against libdevice using different reflection variables; in that
case the two copies of a given function may differ and must not be merged.

Since the NVVMReflect pass replaces conditionals with constants, it will
often leave behind dead code of the form:

entry:
  ..
  br i1 true, label %foo, label %bar
foo:
  ..
bar:
  ; Dead code
  ..

Therefore, it is recommended that NVVMReflect is executed early in the
optimization pipeline before dead-code elimination.

The NVPTX TargetMachine knows how to schedule NVVMReflect at the beginning
of your pass manager; simply give the target machine the opportunity to
adjust the pass pipeline when setting up your pass manager, and the pass will
be added automatically.

The most common way to execute PTX assembly on a GPU device is to use the CUDA
Driver API. This API is a low-level interface to the GPU driver and allows for
JIT compilation of PTX code to native GPU machine code.

Initializing the Driver API:

CUdevice device;
CUcontext context;

// Initialize the driver API
cuInit(0);

// Get a handle to the first compute device
cuDeviceGet(&device, 0);

// Create a compute device context
cuCtxCreate(&context, 0, device);

JIT compiling a PTX string to a device binary:

CUmodule module;
CUfunction function;

// JIT compile a null-terminated PTX string
cuModuleLoadData(&module, (void*)PTXString);

// Get a handle to the "myfunction" kernel function
cuModuleGetFunction(&function, module, "myfunction");

For full examples of executing PTX assembly, please see the CUDA Samples distribution.

To start, let us take a look at a simple compute kernel written directly in
LLVM IR. The kernel implements vector addition, where each thread computes one
element of the output vector C from the input vectors A and B. To make this
easier, we also assume that only a single CTA (thread block) will be launched,
and that it will be one dimensional.
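A sketch of such a kernel is shown below; the name @kernel and incidental
details are illustrative, but the structure (reading %tid.x, indexing into
address-space-1 pointers, and the nvvm.annotations metadata) follows the
conventions described in this document.

target triple = "nvptx64-nvidia-cuda"

; Intrinsic that reads the X component of the thread ID (%tid.x in PTX)
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()

define void @kernel(float addrspace(1)* %A,
                    float addrspace(1)* %B,
                    float addrspace(1)* %C) {
entry:
  ; What is my thread ID?
  %id = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()

  ; Compute pointers into A, B, and C
  %ptrA = getelementptr float, float addrspace(1)* %A, i32 %id
  %ptrB = getelementptr float, float addrspace(1)* %B, i32 %id
  %ptrC = getelementptr float, float addrspace(1)* %C, i32 %id

  ; Read one element each from A and B
  %valA = load float, float addrspace(1)* %ptrA, align 4
  %valB = load float, float addrspace(1)* %ptrB, align 4

  ; C = A + B
  %valC = fadd float %valA, %valB
  store float %valC, float addrspace(1)* %ptrC, align 4

  ret void
}

; Mark @kernel as a PTX kernel function
!nvvm.annotations = !{!0}
!0 = !{void (float addrspace(1)*, float addrspace(1)*,
             float addrspace(1)*)* @kernel, !"kernel", i32 1}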

In this example, we use the @llvm.nvvm.read.ptx.sreg.tid.x intrinsic to
read the X component of the current thread’s ID, which corresponds to a read
of register %tid.x in PTX. The NVPTX back-end supports a large set of
intrinsics. A short list is shown below; please see
include/llvm/IR/IntrinsicsNVVM.td for the full list.
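A few commonly used intrinsics, written here as declarations (reads of thread
and CTA IDs and sizes, plus the CTA-wide barrier); this sample is not
exhaustive:

declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()    ; thread ID within the CTA
declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()   ; CTA size
declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()  ; CTA ID within the grid
declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x() ; grid size
declare void @llvm.nvvm.barrier0()              ; CTA-wide barrier (bar.sync 0)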

You may have noticed that all of the pointer types in the LLVM IR example had
an explicit address space specifier. What is address space 1? NVIDIA GPU
devices (generally) have four types of memory:

Global: Large, off-chip memory

Shared: Small, on-chip memory shared among all threads in a CTA

Local: Per-thread, private memory

Constant: Read-only memory shared across all threads

These different types of memory are represented in LLVM IR as address spaces.
There is also a fifth address space used by the NVPTX code generator that
corresponds to the “generic” address space. This address space can represent
addresses in any other address space (with a few exceptions). This allows
users to write IR functions that can load/store memory using the same
instructions. Intrinsics are provided to convert pointers between the generic
and non-generic address spaces.
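As a sketch, module-level variables placed in these spaces might be written as
follows (the variable names are illustrative; the address-space numbers follow
the NVPTX mapping of 1 for global, 3 for shared, and 4 for constant):

@array_g = internal addrspace(1) global [4 x float] zeroinitializer  ; global memory
@array_s = internal addrspace(3) global [4 x float] undef            ; shared memory, per CTA
@scale_c = internal addrspace(4) constant float 2.0                  ; constant memory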

In PTX, a function can be either a kernel function (callable from the host
program), or a device function (callable only from GPU code). You can think
of kernel functions as entry-points in the GPU program. To mark an LLVM IR
function as a kernel function, we make use of special LLVM metadata. The
NVPTX back-end will look for a named metadata node called
nvvm.annotations. This named metadata must contain a list of metadata that
describe the IR. For our purposes, we need to declare a metadata node that
assigns the “kernel” attribute to the LLVM IR function that should be emitted
as a PTX kernel function. These metadata nodes take the form:
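!{<function-ref>, metadata !"kernel", i32 1}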

Generating PTX from LLVM IR is all well and good, but how do we execute it on
a real GPU device? The CUDA Driver API provides a convenient mechanism for
loading PTX, JIT compiling it to native GPU machine code, and launching a
kernel. The API is similar to OpenCL. Loading and executing our vector
addition code follows the same Driver API steps shown earlier: initialize the
driver, JIT compile the PTX string, look up the kernel function, and launch
it. Note that for brevity, such example code typically performs little error
checking!

Note

You can also use the ptxas tool provided by the CUDA Toolkit to offline
compile PTX to machine code (SASS) for a specific GPU architecture. Such
binaries can be loaded by the CUDA Driver API in the same way as PTX. This
can be useful for reducing startup time by precompiling the PTX kernels.

In this tutorial, we show a simple example of linking LLVM IR with the
libdevice library. We will use the same kernel as the previous tutorial,
except that we will compute C=pow(A,B) instead of C=A+B.
Libdevice provides an __nv_powf function that we will use.
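The relevant change in the kernel is to declare and call this function; a
minimal sketch follows (the wrapper @pow_elem is illustrative, and in the real
kernel the call simply replaces the fadd):

declare float @__nv_powf(float, float)

define float @pow_elem(float %a, float %b) {
  ; C = pow(A, B) for one element
  %p = call float @__nv_powf(float %a, float %b)
  ret float %p
}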

These steps (following the libdevice compilation flow described earlier) can
be performed with the LLVM llvm-link, opt, and llc
tools. In a complete compiler, these steps can also be performed entirely
programmatically by setting up an appropriate pass configuration (see
Linking with Libdevice).