Driven by the insatiable market demand for real-time, high-definition 3D graphics, the programmable Graphics Processing Unit,
or GPU, has evolved into a highly parallel, multithreaded, manycore processor with tremendous computational horsepower and
very high memory bandwidth, as illustrated by Figure 1 and Figure 2.

Figure 1. Floating-Point Operations per Second for the CPU and GPU

Figure 2. Memory Bandwidth for the CPU and GPU

The reason behind the discrepancy in floating-point capability between the CPU and the GPU is that the GPU is specialized
for compute-intensive, highly parallel computation - exactly what graphics rendering is about - and therefore designed such
that more transistors are devoted to data processing rather than data caching and flow control, as schematically illustrated
by Figure 3.

Figure 3. The GPU Devotes More Transistors to Data Processing

More specifically, the GPU is especially well-suited to address problems that can be expressed as data-parallel computations
- the same program is executed on many data elements in parallel - with high arithmetic intensity - the ratio of arithmetic
operations to memory operations. Because the same program is executed for each data element, there is a lower requirement
for sophisticated flow control, and because it is executed on many data elements and has high arithmetic intensity, the memory
access latency can be hidden with calculations instead of big data caches.

Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets
can use a data-parallel programming model to speed up the computations. In 3D rendering, large sets of pixels and vertices
are mapped to parallel threads. Similarly, image and media processing applications such as post-processing of rendered images,
video encoding and decoding, image scaling, stereo vision, and pattern recognition can map image blocks and pixels to parallel
processing threads. In fact, many algorithms outside the field of image rendering and processing are accelerated by data-parallel
processing, from general signal processing or physics simulation to computational finance or computational biology.

In November 2006, NVIDIA introduced CUDA®, a general-purpose
parallel computing platform and programming model that leverages
the parallel compute engine in NVIDIA GPUs to solve many complex
computational problems in a more efficient way than on a CPU.

CUDA comes with a software environment that allows developers to use C
as a high-level programming language. As illustrated by Figure 4,
other languages, application programming interfaces, or directives-based
approaches are supported, such as FORTRAN, DirectCompute, OpenACC.

Figure 4. GPU Computing Applications. CUDA is designed to support various languages and application
programming interfaces.

The advent of multicore CPUs and manycore GPUs means that mainstream
processor chips are now parallel systems. Furthermore, their parallelism
continues to scale with Moore's law. The challenge is to develop
application software that transparently scales its parallelism to
leverage the increasing number of processor cores, much as 3D graphics
applications transparently scale their parallelism to manycore GPUs with
widely varying numbers of cores.

The CUDA parallel programming model is designed to overcome this
challenge while maintaining a low learning curve for programmers familiar
with standard programming languages such as C.

At its core are three key abstractions - a hierarchy of thread groups,
shared memories, and barrier synchronization - that are simply exposed to
the programmer as a minimal set of language extensions.

These abstractions provide fine-grained data parallelism and thread
parallelism, nested within coarse-grained data parallelism and task
parallelism. They guide the programmer to partition the problem into
coarse sub-problems that can be solved independently in parallel by
blocks of threads, and each sub-problem into finer pieces that can be
solved cooperatively in parallel by all threads within the block.

This decomposition preserves language expressivity by allowing threads
to cooperate when solving each sub-problem, and at the same time enables
automatic scalability. Indeed, each block of threads can be scheduled on
any of the available multiprocessors within a GPU, in any order,
concurrently or sequentially, so that a compiled CUDA program can execute
on any number of multiprocessors as illustrated by Figure 5, and only
the runtime system needs to know the physical multiprocessor count.

This scalable programming model allows the GPU architecture to span a
wide market range by simply scaling the number of multiprocessors and
memory partitions: from the high-performance enthusiast GeForce GPUs and
professional Quadro and Tesla computing products to a variety of
inexpensive, mainstream GeForce GPUs (see CUDA-Enabled GPUs for a list of all CUDA-enabled GPUs).

Figure 5. Automatic Scalability

Note: A GPU is built around an array of Streaming
Multiprocessors (SMs) (see Hardware Implementation for
more details). A multithreaded program is partitioned into blocks of
threads that execute independently from each other, so that a GPU with
more multiprocessors will automatically execute the program in less
time than a GPU with fewer multiprocessors.

CUDA C extends C by allowing the programmer to define C functions,
called kernels, that, when called, are executed N times in
parallel by N different CUDA threads, as opposed to only
once like regular C functions.

A kernel is defined using the __global__ declaration
specifier and the number of CUDA threads that execute that kernel for a
given kernel call is specified using a new
<<<...>>> execution
configuration syntax (see C Language Extensions). Each thread that executes the kernel
is given a unique thread ID that is accessible within the
kernel through the built-in threadIdx variable.

As an illustration, the following sample code adds two vectors
A and B of size N and stores the
result into vector C:
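
A minimal sketch of such a kernel and its invocation might look as follows (the kernel name VecAdd and the launch configuration of one block of N threads are illustrative):

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads in a single thread block
    VecAdd<<<1, N>>>(A, B, C);
    ...
}

Here, each of the N threads that execute VecAdd() performs one pair-wise addition.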

For convenience, threadIdx is a 3-component vector, so
that threads can be identified using a one-dimensional, two-dimensional,
or three-dimensional thread index, forming a
one-dimensional, two-dimensional, or three-dimensional thread block. This
provides a natural way to invoke computation across the elements in a
domain such as a vector, matrix, or volume.

The index of a thread and its thread ID relate to each other in a
straightforward way: For a one-dimensional block, they are the same; for
a two-dimensional block of size (Dx, Dy), the thread ID of a thread of
index (x, y) is (x + y*Dx); for a three-dimensional block of size
(Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is
(x + y*Dx + z*Dx*Dy).

As an example, the following code adds two matrices A and
B of size NxN and stores the result into matrix
C:
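
A sketch of such a kernel (the kernel name MatAdd and the single-block launch configuration are illustrative):

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation with one block of N * N * 1 threads
    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}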

There is a limit to the number of threads per block, since all threads
of a block are expected to reside on the same processor core and must
share the limited memory resources of that core. On current GPUs, a
thread block may contain up to 1024 threads.

However, a kernel can be executed by multiple equally-shaped thread
blocks, so that the total number of threads is equal to the number of
threads per block times the number of blocks.

Blocks are organized into a one-dimensional, two-dimensional, or
three-dimensional grid of thread blocks as illustrated by
Figure 6. The number of
thread blocks in a grid is usually dictated by the size of the data being
processed or the number of processors in the system, which it can greatly
exceed.

Figure 6. Grid of Thread Blocks

The number of threads per block and the number of blocks per grid
specified in the <<<...>>> syntax can be of
type int or dim3. Two-dimensional
blocks or grids can be specified as in the example above.

Each block within the grid can be identified by a one-dimensional,
two-dimensional, or three-dimensional index accessible within the kernel
through the built-in blockIdx variable. The dimension of
the thread block is accessible within the kernel through the built-in
blockDim variable.
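
Extending the MatAdd() sketch above to use multiple blocks might look as follows (the bounds check and the 16x16 block size are illustrative):

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}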

A thread block size of 16x16 (256 threads), although arbitrary in this
case, is a common choice. The grid is created with enough blocks to have
one thread per matrix element as before. For simplicity, this example
assumes that the number of threads per grid in each dimension is evenly
divisible by the number of threads per block in that dimension, although
that need not be the case.

Thread blocks are required to execute independently: It must be possible
to execute them in any order, in parallel or in series. This independence
requirement allows thread blocks to be scheduled in any order across any
number of cores as illustrated by Figure 5, enabling programmers to
write code that scales with the number of cores.

Threads within a block can cooperate by sharing data through some
shared memory and by synchronizing their execution to
coordinate memory accesses. More precisely, one can specify
synchronization points in the kernel by calling the
__syncthreads() intrinsic function;
__syncthreads() acts as a barrier at which all threads in the
block must wait before any is allowed to proceed. Shared Memory gives an example of using shared memory.

For efficient cooperation, the shared memory is expected to be a
low-latency memory near each processor core (much like an L1 cache) and
__syncthreads() is expected to be lightweight.

CUDA threads may access data from multiple memory spaces during their execution as illustrated by Figure 7. Each thread has private local memory. Each thread block has shared memory visible to all threads of the block and with the
same lifetime as the block. All threads have access to the same global memory.

There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces. The
global, constant, and texture memory spaces are optimized for different memory usages (see Device Memory Accesses). Texture memory also offers different addressing modes, as well as data filtering, for some specific data formats (see Texture and Surface Memory).

The global, constant, and texture memory spaces are persistent across kernel launches by the same application.

As illustrated by Figure 8, the CUDA programming model assumes that the CUDA threads execute on a physically separate device that operates as a coprocessor to the host running the C program. This is the case, for example, when the kernels execute
on a GPU and the rest of the C program executes on a CPU.

The CUDA programming model also assumes that both the host and the device maintain their own separate memory spaces in DRAM, referred to as host memory and device memory, respectively. Therefore, a program manages the global, constant, and texture memory spaces visible to kernels through calls
to the CUDA runtime (described in Programming Interface). This includes device memory allocation and deallocation as well as data transfer between host and device memory.

Figure 8. Heterogeneous Programming

Note: Serial code executes on the host while parallel code executes on the device.

The compute capability of a device is represented by a
version number, also sometimes called its "SM version". This version
number identifies the features supported by the GPU hardware and is
used by applications at runtime to determine which hardware features
and/or instructions are available on the present GPU.

The compute capability version comprises a major and a minor
version number (x.y):

Devices with the same major revision number are
of the same core architecture. The major revision number is 5 for
devices based on the Maxwell architecture, 3 for devices
based on the Kepler architecture, 2 for devices based on
the Fermi architecture, and 1 for devices based on the
Tesla architecture.

The minor revision number corresponds to an incremental improvement
to the core architecture, possibly including new features.

Note: The compute capability version of a particular GPU should not be
confused with the CUDA version (e.g., CUDA 5.5, CUDA 6, CUDA 6.5),
which is the version of the CUDA software platform. The CUDA
platform is used by application developers to create applications that
run on many generations of GPU architectures, including future GPU
architectures yet to be invented. While new versions of the CUDA
platform often add native support for a new GPU architecture by
supporting the compute capability version of that architecture, new
versions of the CUDA platform typically also include software features
that are independent of hardware generation.

CUDA C provides a simple path for users familiar with the C programming
language to easily write programs for execution by the device.

It consists of a minimal set of extensions to the C language and a
runtime library.

The core language extensions have been introduced in Programming Model. They allow programmers to define a kernel
as a C function and use some new syntax to specify the grid and block
dimension each time the function is called. A complete description of all
extensions can be found in C Language Extensions. Any
source file that contains some of these extensions must be compiled with
nvcc as outlined in Compilation with NVCC.

The runtime is introduced in Compilation Workflow. It
provides C functions that execute on the host to allocate and deallocate
device memory, transfer data between host memory and device memory,
manage systems with multiple devices, etc. A complete description of the
runtime can be found in the CUDA reference manual.

The runtime is built on top of a lower-level C API, the CUDA driver API,
which is also accessible by the application. The driver API provides an
additional level of control by exposing lower-level concepts such as CUDA
contexts - the analogue of host processes for the device - and CUDA
modules - the analogue of dynamically loaded libraries for the device.
Most applications do not use the driver API as they do not need this
additional level of control and when using the runtime, context and
module management are implicit, resulting in more concise code. The
driver API is introduced in Driver API and fully
described in the reference manual.

Kernels can be written using the CUDA instruction set architecture, called PTX, which is described in the PTX reference manual. It is however usually more effective to use a high-level programming language
such as C. In both cases, kernels must be compiled into binary code by nvcc to execute on the device.

nvcc is a compiler driver that simplifies the process of compiling C or PTX code: It provides simple and familiar command line options and executes them by invoking the collection of tools that implement
the different compilation stages. This section gives an overview of nvcc workflow and command options. A complete description can be found in the nvcc user manual.

Source files compiled with nvcc can include a mix of host code (i.e., code that executes on the host) and device code (i.e., code that executes on the device).
nvcc's basic workflow consists in separating device code from host code and then:

compiling the device code into an assembly form (PTX code) and/or binary form (cubin object),

and modifying the host code by replacing the <<<...>>> syntax introduced in Kernels (and described in more detail in Execution Configuration) by the necessary CUDA C runtime function calls to load and launch each compiled kernel from the PTX code and/or cubin object.

The modified host code is output either as C code that is left to be compiled using another tool or as object code directly
by letting nvcc invoke the host compiler during the last compilation stage.

Applications can then:

Either link to the compiled host code (this is the most common case),

Or ignore the modified host code (if any) and use the CUDA driver API (see Driver API) to load and execute the PTX code or cubin object.

Any PTX code loaded by an application at runtime is
compiled further to binary code by the device driver. This is called
just-in-time compilation. Just-in-time compilation
increases application load time, but allows the application to benefit
from any new compiler improvements coming with each new device driver. It
is also the only way for applications to run on devices that did not
exist at the time the application was compiled, as detailed in Application Compatibility.

When the device driver just-in-time compiles some PTX code
for some application, it automatically caches a copy of the generated
binary code in order to avoid repeating the compilation in subsequent
invocations of the application. The cache - referred to as compute
cache - is automatically invalidated when the device driver is
upgraded, so that applications can benefit from the improvements in the
new just-in-time compiler built into the device driver.

Binary code is architecture-specific. A cubin object is
generated using the compiler option
-code that specifies the targeted
architecture: For example, compiling with
-code=sm_35 produces binary code for
devices of compute capability
3.5. Binary compatibility is guaranteed from one minor revision
to the next one, but not from one minor revision to the previous one or
across major revisions. In other words, a cubin object
generated for compute capability X.y will only execute
on devices of compute capability X.z where z≥y.

Some PTX instructions are only supported on devices of higher compute capabilities. For example, warp shuffle instructions are only
supported on devices of compute capability 3.0 and above. The -arch compiler option specifies the compute capability that is assumed when compiling C to PTX code. So, code that contains warp shuffle, for example, must be compiled with -arch=sm_30 (or higher).

PTX code produced for some specific compute capability can always be compiled to binary code of greater or equal compute capability.

To execute code on devices of specific compute capability, an
application must load binary or PTX code that is compatible
with this compute capability as described in Binary Compatibility and PTX Compatibility.
In particular, to be able to execute code on future architectures with
higher compute capability (for which no binary code can be generated
yet), an application must load PTX code that will be
just-in-time compiled for these devices (see Just-in-Time Compilation).

Which PTX and binary code gets embedded in a CUDA C
application is controlled by the -arch and
-code compiler options or the
-gencode compiler option as detailed in
the nvcc user manual. For example,
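
A sketch of such a command line (the particular compute capabilities chosen here are illustrative):

nvcc x.cu
        -gencode arch=compute_20,code=sm_20
        -gencode arch=compute_30,code=sm_30
        -gencode arch=compute_35,code=\'compute_35,sm_35\'

embeds binary code compatible with compute capability 2.0 and 3.0 (first and second -gencode options) and PTX and binary code compatible with compute capability 3.5 (third -gencode option).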

x.cu can have an optimized code path that uses warp shuffle
operations, for example, which are only supported in devices of compute
capability 3.0 and higher. The __CUDA_ARCH__ macro can
be used to differentiate various code paths based on compute capability.
It is only defined for device code. When compiling with
-arch=compute_35 for example,
__CUDA_ARCH__ is equal to 350.

Applications using the driver API must compile code to separate files
and explicitly load and execute the most appropriate file at runtime.

The nvcc user manual lists various shorthand for the
-arch,
-code, and
-gencode compiler options. For example,
-arch=sm_35 is a shorthand for
-arch=compute_35 -code=compute_35,sm_35 (which is the same as
-gencode arch=compute_35,code=\'compute_35,sm_35\').

The front end of the compiler processes CUDA source files according to C++ syntax
rules. Full C++ is supported for the host code. However, only a subset of C++ is fully
supported for the device code as described in C/C++ Language Support.

The runtime is implemented in the cudart library,
which is linked to the application, either statically via cudart.lib or libcudart.a,
or dynamically via cudart.dll or libcudart.so.
Applications that require cudart.dll and/or cudart.so for dynamic linking
typically include them as part of the application installation package.

All its entry points are prefixed with cuda.

As mentioned in Heterogeneous Programming, the CUDA
programming model assumes a system composed of a host and a device, each
with their own separate memory. Device Memory gives an
overview of the runtime functions used to manage device memory.

There is no explicit initialization function for the runtime; it
initializes the first time a runtime function is called (more
specifically any function other than functions from the device and
version management sections of the reference manual). One needs to keep
this in mind when timing runtime function calls and when interpreting the
error code from the first call into the runtime.

During initialization, the runtime creates a CUDA context for each
device in the system (see Context for more details on
CUDA contexts). This context is the primary context for this
device and it is shared among all the host threads of the application.
As part of this context creation, the device code is just-in-time compiled if necessary (see Just-in-Time Compilation) and loaded into device memory.
This all happens under the hood and the runtime does not expose the
primary context to the application.

When a host thread calls cudaDeviceReset(), this
destroys the primary context of the device the host thread currently
operates on (i.e., the current device as defined in Device Selection). The next runtime function call made by
any host thread that has this device as current will create a new primary
context for this device.

As mentioned in Heterogeneous Programming, the CUDA
programming model assumes a system composed of a host and a device, each
with their own separate memory. Kernels operate out of device memory, so
the runtime provides functions to allocate, deallocate, and copy device
memory, as well as transfer data between host memory and device memory.

Device memory can be allocated either as linear memory or
as CUDA arrays.

Linear memory exists on the device in a 40-bit address space, so separately allocated entities can reference one
another via pointers, for example, in a binary tree.

Linear memory is typically allocated using cudaMalloc()
and freed using cudaFree(), and data transfer between
host memory and device memory is typically done using
cudaMemcpy(). In the vector addition code sample of
Kernels, the vectors need to be copied from
host memory to device memory:
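
A sketch of the corresponding host code (VecAdd here is the kernel from the earlier sketch, extended with a block index and an element count; the 256-thread block size is illustrative):

// Device code
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

// Host code
int main()
{
    int N = ...;
    size_t size = N * sizeof(float);

    // Allocate input vectors h_A and h_B in host memory
    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);
    float* h_C = (float*)malloc(size);

    // Initialize input vectors
    ...

    // Allocate vectors in device memory
    float* d_A;
    cudaMalloc(&d_A, size);
    float* d_B;
    cudaMalloc(&d_B, size);
    float* d_C;
    cudaMalloc(&d_C, size);

    // Copy vectors from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Invoke kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy result from device memory to host memory
    // h_C contains the result in host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    // Free host memory
    ...
}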

Linear memory can also be allocated through
cudaMallocPitch() and cudaMalloc3D().
These functions are recommended for allocations of 2D or 3D arrays as they
make sure that the allocation is appropriately padded to meet the
alignment requirements described in Device Memory Accesses, therefore ensuring best performance
when accessing the row addresses or performing copies between 2D arrays
and other regions of device memory (using the
cudaMemcpy2D() and cudaMemcpy3D()
functions). The returned pitch (or stride) must be used to access array
elements. The following code sample allocates a width x
height 2D array of floating-point values and shows how
to loop over the array elements in device code:
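
A sketch (MyKernel and the 64x64 size are illustrative; note that the width passed to cudaMallocPitch() and the pitch it returns are in bytes):

// Device code
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        // pitch is in bytes, so compute each row's start address on a byte basis
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}

// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);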

The reference manual lists all the various functions used to copy memory
between linear memory allocated with cudaMalloc(),
linear memory allocated with cudaMallocPitch() or
cudaMalloc3D(), CUDA arrays, and memory allocated for
variables declared in global or constant memory space.

The following code sample illustrates various ways of accessing global
variables via the runtime API:
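
A sketch of such accesses (the variable names are illustrative):

__constant__ float constData[256];
float data[256];
cudaMemcpyToSymbol(constData, data, sizeof(data));
cudaMemcpyFromSymbol(data, constData, sizeof(data));

__device__ float devData;
float value = 3.14f;
cudaMemcpyToSymbol(devData, &value, sizeof(float));

__device__ float* devPointer;
float* ptr;
cudaMalloc(&ptr, 256 * sizeof(float));
cudaMemcpyToSymbol(devPointer, &ptr, sizeof(ptr));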

cudaGetSymbolAddress() is used to retrieve the address
pointing to the memory allocated for a variable declared in global memory
space. The size of the allocated memory is obtained through
cudaGetSymbolSize().

Shared memory is expected to be much faster than global memory as
mentioned in Thread Hierarchy and detailed in Shared Memory. Any opportunity to replace global memory
accesses by shared memory accesses should therefore be exploited as
illustrated by the following matrix multiplication example.

The following code sample is a straightforward implementation of matrix
multiplication that does not take advantage of shared memory. Each thread
reads one row of A and one column of B and computes the
corresponding element of C as illustrated in Figure 9.
A is therefore read B.width times from global memory and
B is read A.height times.
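
A sketch of such an implementation (the Matrix type, MatMulKernel, and the 16x16 block size are illustrative; matrix dimensions are assumed to be multiples of the block size):

// Matrices are stored in row-major order:
//   M(row, col) = M.elements[row * M.width + col]
typedef struct {
    int width;
    int height;
    float* elements;
} Matrix;

// Thread block size
#define BLOCK_SIZE 16

// Each thread computes one element of C
// by accumulating results into Cvalue
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    float Cvalue = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e]
                * B.elements[e * B.width + col];
    C.elements[row * C.width + col] = Cvalue;
}

// Host-side launch; d_A, d_B, d_C are device-resident copies of A, B, C
// (allocated with cudaMalloc() and filled with cudaMemcpy() as in Device Memory)
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(d_B.width / dimBlock.x, d_A.height / dimBlock.y);
MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);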

The following code sample is an implementation of matrix multiplication
that does take advantage of shared memory. In this implementation, each
thread block is responsible for computing one square sub-matrix
Csub of C and each thread within the block is
responsible for computing one element of Csub. As
illustrated in Figure 10,
Csub is equal to the product of two rectangular
matrices: the sub-matrix of A of dimension (A.width,
block_size) that has the same row indices as
Csub, and the sub-matrix of B of dimension
(block_size, A.width) that has the same column indices as
Csub. In order to fit into the device's resources,
these two rectangular matrices are divided into as many square matrices
of dimension block_size as necessary and Csub is
computed as the sum of the products of these square matrices. Each of
these products is performed by first loading the two corresponding square
matrices from global memory to shared memory with one thread loading one
element of each matrix, and then by having each thread compute one
element of the product. Each thread accumulates the result of each of
these products into a register and once done writes the result to global
memory.

By blocking the computation this way, we take advantage of fast shared
memory and save a lot of global memory bandwidth since A is only
read (B.width / block_size) times from global memory and B
is read (A.height / block_size) times.

The Matrix type from the previous code sample is augmented
with a stride field, so that sub-matrices can be efficiently
represented with the same type. __device__
functions are used to get and set
elements and build any sub-matrix from a matrix.
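
A sketch of the device-side code for this blocked implementation (host code is analogous to the previous sketch, except that it also sets the stride field of each device matrix; dimensions are again assumed to be multiples of BLOCK_SIZE):

// Matrices are stored in row-major order:
//   M(row, col) = M.elements[row * M.stride + col]
typedef struct {
    int width;
    int height;
    int stride;
    float* elements;
} Matrix;

// Thread block size
#define BLOCK_SIZE 16

// Get a matrix element
__device__ float GetElement(const Matrix A, int row, int col)
{
    return A.elements[row * A.stride + col];
}

// Set a matrix element
__device__ void SetElement(Matrix A, int row, int col, float value)
{
    A.elements[row * A.stride + col] = value;
}

// Get the BLOCK_SIZE x BLOCK_SIZE sub-matrix of A that is located
// col sub-matrices to the right and row sub-matrices down from the
// upper-left corner of A
__device__ Matrix GetSubMatrix(Matrix A, int row, int col)
{
    Matrix Asub;
    Asub.width = BLOCK_SIZE;
    Asub.height = BLOCK_SIZE;
    Asub.stride = A.stride;
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row + BLOCK_SIZE * col];
    return Asub;
}

// Each thread block computes one sub-matrix Csub of C;
// each thread computes one element of Csub
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;

    // The sub-matrix of C this block is responsible for
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);

    float Cvalue = 0;
    int row = threadIdx.y;
    int col = threadIdx.x;

    // Loop over the sub-matrices of A and B required to compute Csub
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {
        Matrix Asub = GetSubMatrix(A, blockRow, m);
        Matrix Bsub = GetSubMatrix(B, m, blockCol);

        // Shared memory used to store the current sub-matrices of A and B
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Each thread loads one element of each sub-matrix
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);

        // Make sure the sub-matrices are loaded before starting the computation
        __syncthreads();

        // Multiply the two sub-matrices together
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];

        // Make sure the computation is done before loading the next
        // sub-matrices in the following iteration
        __syncthreads();
    }

    // Write Csub to global memory; each thread writes one element
    SetElement(Csub, row, col, Cvalue);
}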

Copies between page-locked host memory and device memory can be performed concurrently with
kernel execution for some devices as mentioned in Asynchronous Concurrent Execution.

On some devices, page-locked host memory can be mapped into the address space of the device,
eliminating the need to copy it to or from device memory as detailed in Mapped Memory.

On systems with a front-side bus, bandwidth between host memory and device memory is higher
if host memory is allocated as page-locked and even higher if in addition it is
allocated as write-combining as described in Write-Combining Memory.

Page-locked host memory is a scarce resource however, so allocations in page-locked memory will start failing long before
allocations in pageable memory. In addition, by reducing the amount of physical memory available to the operating system for
paging, consuming too much page-locked memory reduces overall system performance.

The simple zero-copy CUDA sample comes with a detailed document on the page-locked memory APIs.

A block of page-locked memory can be used in conjunction with any device in the system (see Multi-Device System for more details on multi-device systems), but by default, the benefits of using page-locked memory described above are only
available in conjunction with the device that was current when the block was allocated (and with all devices sharing the same
unified address space, if any, as described in Unified Virtual Address Space). To make these advantages available to all devices, the block needs to be allocated by passing the flag cudaHostAllocPortable to cudaHostAlloc() or page-locked by passing the flag cudaHostRegisterPortable to cudaHostRegister().

By default page-locked host memory is allocated as cacheable. It can
optionally be allocated as write-combining instead by
passing flag cudaHostAllocWriteCombined to
cudaHostAlloc(). Write-combining memory frees up the
host's L1 and L2 cache resources, making more cache available to the rest
of the application. In addition, write-combining memory is not snooped
during transfers across the PCI Express bus, which can improve transfer
performance by up to 40%.

Reading from write-combining memory from the host is prohibitively slow,
so write-combining memory should in general be used for memory that the
host only writes to.

A block of page-locked host memory can also be mapped into the address space of the device by passing flag cudaHostAllocMapped to cudaHostAlloc() or by passing flag cudaHostRegisterMapped to cudaHostRegister(). Such a block has therefore in general two addresses: one in host memory that is returned by cudaHostAlloc() or malloc(), and one in device memory that can be retrieved using cudaHostGetDevicePointer() and then used to access the block from within a kernel. The only exception is for pointers allocated with cudaHostAlloc() and when a unified address space is used for the host and the device as mentioned in Unified Virtual Address Space.

Accessing host memory directly from within a kernel has several advantages:

There is no need to allocate a block in device memory and copy data between this block and the block in host memory; data
transfers are implicitly performed as needed by the kernel;

There is no need to use streams (see Concurrent Data Transfers) to overlap data transfers with kernel execution; the kernel-originated data transfers automatically overlap with kernel
execution.

Since mapped page-locked memory is shared between host and device however, the application must synchronize memory accesses
using streams or events (see Asynchronous Concurrent Execution) to avoid any potential read-after-write, write-after-read, or write-after-write hazards.

To be able to retrieve the device pointer to any mapped page-locked memory, page-locked memory mapping must be enabled by
calling cudaSetDeviceFlags() with the cudaDeviceMapHost flag before any other CUDA call is performed. Otherwise, cudaHostGetDevicePointer() will return an error.

cudaHostGetDevicePointer() also returns an error if the device does not support mapped page-locked host memory. Applications may query this capability
by checking the canMapHostMemory device property (see Device Enumeration), which is equal to 1 for devices that support mapped page-locked host memory.

Note that atomic functions (see Atomic Functions) operating on mapped page-locked memory are not atomic from the point of view of the host or other devices.

In order to facilitate concurrent execution between host and device, some function calls are asynchronous: Control is returned
to the host thread before the device has completed the requested task. These are:

Kernel launches;

Memory copies between two addresses to the same device memory;

Memory copies from host to device of a memory block of 64 KB or less;

Memory copies performed by functions that are suffixed with Async;

Memory set function calls.

Programmers can globally disable asynchronous kernel launches for all CUDA applications
running on a system by setting the CUDA_LAUNCH_BLOCKING environment
variable to 1. This feature is provided for debugging purposes only and should never be
used as a way to make production software run reliably.

Some devices can perform copies between page-locked host memory and device memory concurrently with kernel execution.
Applications may query this capability by checking the asyncEngineCount device property (see Device Enumeration), which is greater than zero for devices that support it.

Some devices of compute capability 2.x and higher can execute multiple kernels concurrently. Applications may query this capability
by checking the concurrentKernels device property (see Device Enumeration), which is equal to 1 for devices that support it.

The maximum number of kernel launches that a device can execute concurrently is 16 on
devices of compute capability 2.0 through 3.0; the maximum is 32 concurrent kernel launches
on devices of compute capability 3.5 and higher. Devices of compute capability 3.2 are
limited to 4 concurrent kernel launches.

A kernel from one CUDA context cannot execute concurrently with a kernel from another CUDA context.

Kernels that use many textures or a large amount of local memory are less likely to execute concurrently with other kernels.

Applications manage concurrency through streams. A stream is a sequence of commands (possibly issued by different host threads) that execute in order. Different streams,
on the other hand, may execute their commands out of order with respect to one another or concurrently; this behavior is not
guaranteed and should therefore not be relied upon for correctness (e.g., inter-kernel communication is undefined).

A stream is defined by creating a stream object and specifying it as the stream parameter to a sequence of kernel launches
and host <-> device memory copies. The following code sample creates two streams and allocates an array hostPtr of float in page-locked memory.
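
A sketch (size here is the number of float elements processed per stream; inputDevPtr, outputDevPtr, and MyKernel are assumed to be a device input buffer, a device output buffer, and a kernel defined elsewhere):

cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);
float* hostPtr;
cudaMallocHost(&hostPtr, 2 * size * sizeof(float));

// Each stream: copy its half of hostPtr to the device,
// process it with MyKernel(), and copy the result back
for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                    size * sizeof(float), cudaMemcpyHostToDevice, stream[i]);
    MyKernel<<<100, 512, 0, stream[i]>>>(outputDevPtr + i * size,
                                         inputDevPtr + i * size, size);
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                    size * sizeof(float), cudaMemcpyDeviceToHost, stream[i]);
}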

Each stream copies its portion of input array hostPtr to array inputDevPtr in device memory, processes inputDevPtr on the device by calling MyKernel(), and copies the result outputDevPtr back to the same portion of hostPtr. Overlapping Behavior describes how the streams overlap in this example depending on the capability of the device. Note that hostPtr must point to page-locked host memory for any overlap to occur.

Streams are released by calling cudaStreamDestroy().

for (int i = 0; i < 2; ++i)
    cudaStreamDestroy(stream[i]);

cudaStreamDestroy() waits for all preceding commands in the given stream to complete before destroying the stream and returning control to the
host thread.

Kernel launches and host <-> device memory
copies that do not specify any stream parameter, or equivalently that set
the stream parameter to zero, are issued to the default stream. They are
therefore executed in order.

For code that is compiled using the --default-stream per-thread compilation flag (or that defines the CUDA_API_PER_THREAD_DEFAULT_STREAM macro before including CUDA headers (cuda.h and cuda_runtime.h)), the default stream is a regular stream and
each host thread has its own default stream.

For code that is compiled using the --default-stream null compilation flag, the default stream is a special stream called the NULL stream
and each device has a single NULL stream used for all host threads. The NULL stream is special as it causes implicit synchronization
as described in
Implicit Synchronization.

There are various ways to explicitly synchronize streams with each other.

cudaDeviceSynchronize() waits until all preceding
commands in all streams of all host threads have completed.

cudaStreamSynchronize() takes a stream as a parameter
and waits until all preceding commands in the given stream have
completed. It can be used to synchronize the host with a specific stream,
allowing other streams to continue executing on the device.

cudaStreamWaitEvent() takes a stream and an event as
parameters (see Events for a description of events) and
makes all the commands added to the given stream after the call to
cudaStreamWaitEvent() delay their execution until the
given event has completed. The stream can be 0, in which case all the
commands added to any stream after the call to
cudaStreamWaitEvent() wait on the event.

cudaStreamQuery() provides applications with a way to
know if all preceding commands in a stream have completed.

To avoid unnecessary slowdowns, all these synchronization functions are
usually best used for timing purposes or to isolate a launch or memory
copy that is failing.

For devices that support concurrent kernel execution and are of compute capability 3.0 or
lower, any operation that requires a dependency check to see if a streamed kernel launch
is complete:

Can start executing only when all thread blocks of all prior kernel launches from any stream
in the CUDA context have started executing;

Blocks all later kernel launches from any stream in the CUDA context until the kernel launch
being checked is complete.

Operations that require a dependency check include any other commands within the same stream as
the launch being checked and any call to cudaStreamQuery() on that
stream. Therefore, applications should follow these guidelines to improve their potential for
concurrent kernel execution:

All independent operations should be issued before dependent operations,

Synchronization of any kind should be delayed as long as possible.

For example, on devices that do not support concurrent data transfers, the two streams of the code sample of Creation and Destruction do not overlap at all because the memory copy from host to device is issued to stream[1] after the memory copy from device
to host is issued to stream[0], so it can only start once the memory copy from device to host issued to stream[0] has completed.
If the code is rewritten the following way (and assuming the device supports overlap of data transfer and kernel execution)
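
A sketch of such a rewrite, issuing all host-to-device copies first, then all kernel launches, then all device-to-host copies (same assumed hostPtr, inputDevPtr, outputDevPtr, size, MyKernel, and stream[] as in the stream sketch above):

for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                    size * sizeof(float), cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < 2; ++i)
    MyKernel<<<100, 512, 0, stream[i]>>>(outputDevPtr + i * size,
                                         inputDevPtr + i * size, size);
for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                    size * sizeof(float), cudaMemcpyDeviceToHost, stream[i]);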

then the memory copy from host to device issued to stream[1] overlaps with the kernel launch issued to stream[0].

On devices that do support concurrent data transfers, the two streams of the code sample
of Creation and Destruction do overlap: The memory copy from
host to device issued to stream[1] overlaps with the memory copy from device to host
issued to stream[0] and even with the kernel launch issued to stream[0] (assuming the
device supports overlap of data transfer and kernel execution). However, for devices of
compute capability 3.0 or lower, the kernel executions cannot possibly overlap because
the second kernel launch is issued to stream[1] after the memory copy from device to host
is issued to stream[0], so it is blocked until the first kernel launch issued to stream[0]
is complete as per Implicit Synchronization. If the code is rewritten
as above, the kernel executions overlap (assuming the device supports concurrent kernel
execution) since the second kernel launch is issued to stream[1] before the memory copy
from device to host is issued to stream[0]. In that case however, the memory copy from
device to host issued to stream[0] only overlaps with the last thread blocks of the
kernel launch issued to stream[1] as per Implicit Synchronization, which
can represent only a small portion of the total execution time of the kernel.

The runtime provides a way to insert a callback at any point into a stream via cudaStreamAddCallback(). A callback is a function that is executed on the host once all commands issued to the stream before the callback have completed.
Callbacks in stream 0 are executed once all preceding tasks and commands issued in all streams before the callback have completed.

The following code sample adds the callback function
MyCallback to each of two streams
after issuing a host-to-device memory copy, a kernel launch and a
device-to-host memory copy into each stream. The callback will
begin execution on the host after each of the device-to-host memory
copies completes.
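
A sketch (devPtrIn, devPtrOut, and hostPtr are assumed to be arrays of per-stream device input, device output, and page-locked host buffers, size is their size in bytes, and MyKernel and stream[] are as before; the callback body is illustrative):

// Callback executed on the host once the preceding stream work has completed
// (printf requires <stdio.h>)
void CUDART_CB MyCallback(cudaStream_t stream, cudaError_t status, void *data)
{
    printf("Inside callback %d\n", (int)(size_t)data);
}

...

for (size_t i = 0; i < 2; ++i) {
    cudaMemcpyAsync(devPtrIn[i], hostPtr[i], size, cudaMemcpyHostToDevice, stream[i]);
    MyKernel<<<100, 512, 0, stream[i]>>>(devPtrOut[i], devPtrIn[i], size);
    cudaMemcpyAsync(hostPtr[i], devPtrOut[i], size, cudaMemcpyDeviceToHost, stream[i]);
    cudaStreamAddCallback(stream[i], MyCallback, (void*)i, 0);
}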

The commands that are issued in a stream (or all commands issued to any stream if the callback is issued to stream 0) after
a callback do not start executing before the callback has completed.
The last parameter of cudaStreamAddCallback() is reserved for future use.

A callback must not make CUDA API calls (directly or indirectly), as it might end up waiting on itself if it makes such a
call leading to a deadlock.

The relative priorities of streams can be specified at creation using cudaStreamCreateWithPriority(). The range of allowable priorities, ordered as [ highest priority, lowest priority ], can be obtained using the cudaDeviceGetStreamPriorityRange() function. At runtime, as blocks in low-priority streams finish, waiting blocks in higher-priority streams are scheduled in
their place.

The following code sample obtains the allowable range of priorities for the current device, and creates streams with the
highest and lowest available priorities:
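
A sketch (the cudaStreamNonBlocking flag is an illustrative choice):

// Get the range of stream priorities for this device
int priority_high, priority_low;
cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);

// Create streams with highest and lowest available priorities
cudaStream_t st_high, st_low;
cudaStreamCreateWithPriority(&st_high, cudaStreamNonBlocking, priority_high);
cudaStreamCreateWithPriority(&st_low, cudaStreamNonBlocking, priority_low);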

The runtime also provides a way to closely monitor the device's
progress, as well as perform accurate timing, by letting the
application asynchronously record events at any point in
the program and query when these events are completed. An event has
completed when all tasks - or optionally, all commands in a given
stream - preceding the event have completed. Events in stream zero are
completed after all preceding tasks and commands in all streams are
completed.

When a synchronous function is called, control is not returned to the host thread before the
device has completed the requested task. Whether the host thread will then yield, block, or spin
can be specified by calling cudaSetDeviceFlags() with some specific flags (see
reference manual for details) before any other CUDA call is performed by the host thread.

A host thread can set the device it operates on at any time by calling cudaSetDevice(). Device memory allocations and kernel launches are made on the currently set device; streams and events are created in association
with the currently set device. If no call to cudaSetDevice() is made, the current device is device 0.

The following code sample illustrates how setting the current device affects memory allocation and kernel execution.
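
A sketch (MyKernel is an assumed kernel taking a single device pointer):

size_t size = 1024 * sizeof(float);
cudaSetDevice(0);               // Set device 0 as current
float* p0;
cudaMalloc(&p0, size);          // Allocate memory on device 0
MyKernel<<<1000, 128>>>(p0);    // Launch kernel on device 0
cudaSetDevice(1);               // Set device 1 as current
float* p1;
cudaMalloc(&p1, size);          // Allocate memory on device 1
MyKernel<<<1000, 128>>>(p1);    // Launch kernel on device 1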

A memory copy will succeed even if it is issued to a stream that is not associated to the current device.

cudaEventRecord() will fail if the input event and input stream are associated to different devices.

cudaEventElapsedTime() will fail if the two input events are associated to different devices.

cudaEventSynchronize() and cudaEventQuery() will succeed even if the input event is associated to a device that is different from the current device.

cudaStreamWaitEvent() will succeed even if the input stream and input event are associated to different devices. cudaStreamWaitEvent() can therefore be used to synchronize multiple devices with each other.

Each device has its own default stream (see Default Stream), so commands issued to the default stream of a device may execute out of order or concurrently with respect to commands
issued to the default stream of any other device.

When the application is run as a 64-bit process, devices of compute capability 2.0 and higher from the Tesla series may address
each other's memory (i.e., a kernel executing on one device can dereference a pointer to the memory of the other device).
This peer-to-peer memory access feature is supported between two devices if cudaDeviceCanAccessPeer() returns true for these two devices.

Peer-to-peer memory access must be enabled between two devices by calling cudaDeviceEnablePeerAccess() as illustrated in the following code sample.

A unified address space is used for both devices (see Unified Virtual Address Space), so the same pointer can be used to address memory from both devices as shown in the code sample below.
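
A sketch covering both points (MyKernel is an assumed kernel taking a single device pointer):

cudaSetDevice(0);                   // Set device 0 as current
float* p0;
size_t size = 1024 * sizeof(float);
cudaMalloc(&p0, size);              // Allocate memory on device 0
MyKernel<<<1000, 128>>>(p0);        // Launch kernel on device 0
cudaSetDevice(1);                   // Set device 1 as current
cudaDeviceEnablePeerAccess(0, 0);   // Enable peer-to-peer access with device 0

// Launch kernel on device 1:
// this kernel can dereference p0, which points to device 0 memory
MyKernel<<<1000, 128>>>(p0);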

Consistent with the normal behavior of streams, an asynchronous copy between the memories of two devices may overlap with
copies or kernels in another stream.

Note that if peer-to-peer access is enabled between two devices via cudaDeviceEnablePeerAccess() as described in Peer-to-Peer Memory Access, peer-to-peer memory copy between these two devices no longer needs to be staged through the host and is therefore faster.

When the application is run as a 64-bit process, a single address space
is used for the host and all the devices of compute capability 2.0 and
higher. All host memory allocations made via CUDA API calls and all
device memory allocations on supported devices are within this virtual
address range. As a consequence:

The location of any memory on the host allocated through CUDA, or
on any of the devices which use the unified address space, can be
determined from the value of the pointer using
cudaPointerGetAttributes().

When copying to or from the memory of any device which uses
the unified address space, the cudaMemcpyKind
parameter of cudaMemcpy*() can be
set to cudaMemcpyDefault to determine locations
from the pointers. This also works for host pointers not allocated
through CUDA, as long as the current device uses unified addressing.

Allocations via cudaHostAlloc() are automatically
portable (see Portable Memory) across all the devices
for which the unified address space is used, and pointers returned by
cudaHostAlloc() can be used directly from within
kernels running on these devices (i.e., there is no need to obtain a
device pointer via cudaHostGetDevicePointer() as
described in Mapped Memory).

Applications may query if the unified address space is used for a
particular device by checking that the unifiedAddressing
device property (see Device Enumeration) is equal to
1.

Any device memory pointer or event handle created by a host thread can
be directly referenced by any other thread within the same process. It is
not valid outside this process however, and therefore cannot be directly
referenced by threads belonging to a different process.

To share device memory pointers and events across processes, an
application must use the Inter Process Communication API, which is
described in detail in the reference manual. The IPC API is only
supported for 64-bit processes on Linux and for devices of compute
capability 2.0 and higher.

Using this API, an application can get the IPC handle for a given device
memory pointer using cudaIpcGetMemHandle(), pass it to
another process using standard IPC mechanisms (e.g., interprocess shared
memory or files), and use cudaIpcOpenMemHandle() to
retrieve a device pointer from the IPC handle that is a valid pointer
within this other process. Event handles can be shared using similar
entry points.

An example of using the IPC API is where a single master process
generates a batch of input data, making the data available to multiple
slave processes without requiring regeneration or copying.

All runtime functions return an error code, but for an asynchronous function (see Asynchronous Concurrent Execution), this error code cannot possibly report any of the asynchronous errors that could occur on the device since the function
returns before the device has completed the task; the error code only reports errors that occur on the host prior to executing
the task, typically related to parameter validation; if an asynchronous error occurs, it will be reported by some subsequent
unrelated runtime function call.

The only way to check for asynchronous errors just after some asynchronous function call is therefore to synchronize just
after the call by calling cudaDeviceSynchronize() (or by using any other synchronization mechanisms described in Asynchronous Concurrent Execution) and checking the error code returned by cudaDeviceSynchronize().

The runtime maintains an error variable for each host thread that is initialized to cudaSuccess and is overwritten by the error code every time an error occurs (be it a parameter validation error or an asynchronous error).
cudaPeekAtLastError() returns this variable. cudaGetLastError() returns this variable and resets it to cudaSuccess.

Kernel launches do not return any error code, so cudaPeekAtLastError() or cudaGetLastError() must be called just after the kernel launch to retrieve any pre-launch errors. To ensure that any error returned by cudaPeekAtLastError() or cudaGetLastError() does not originate from calls prior to the kernel launch, one has to make sure that the runtime error variable is set to
cudaSuccess just before the kernel launch, for example, by calling cudaGetLastError() just before the kernel launch. Kernel launches are asynchronous, so to check for asynchronous errors, the application must
synchronize in-between the kernel launch and the call to cudaPeekAtLastError() or cudaGetLastError().
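
A sketch of this pattern around a kernel launch (MyKernel, its arguments, and its launch configuration are illustrative):

cudaGetLastError();   // reset the error variable to cudaSuccess

MyKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data);

// Errors detected at launch time (e.g., an invalid execution configuration)
cudaError_t launchError = cudaGetLastError();

// Errors that occurred asynchronously during kernel execution
cudaError_t asyncError = cudaDeviceSynchronize();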

Note that cudaErrorNotReady that may be returned by cudaStreamQuery() and cudaEventQuery() is not considered an error and is therefore not reported by cudaPeekAtLastError() or cudaGetLastError().

CUDA supports a subset of the texturing hardware that the GPU uses for
graphics to access texture and surface memory. Reading data from texture
or surface memory instead of global memory can have several performance
benefits as described in Device Memory Accesses.

There are two different APIs to access texture and surface memory:

The texture reference API that is supported on all devices,

The texture object API that is only supported on devices of compute
capability 3.x.

The texture reference API has limitations that the texture object API
does not have. They are mentioned in Texture Reference API.

Texture memory is read from kernels using the device functions described
in Texture Functions. The process of reading a texture by
calling one of these functions is called a texture fetch.
Each texture fetch specifies a parameter called a texture
object for the texture object API or a texture
reference for the texture reference API.

The texture object or the texture reference specifies:

The texture, which is the piece of texture memory that
is fetched. Texture objects are created at runtime and the texture is
specified when creating the texture object as described in Texture Object API. Texture references are created at
compile time and the texture is specified at runtime by binding the
texture reference to the texture through runtime functions as described
in Texture Reference API; several distinct texture
references might be bound to the same texture or to textures that
overlap in memory. A texture can be any region of linear memory or a
CUDA array (described in CUDA Arrays).

Its dimensionality that specifies whether the texture is
addressed as a one dimensional array using one texture coordinate, a
two-dimensional array using two texture coordinates, or a
three-dimensional array using three texture coordinates. Elements of
the array are called texels, short for texture
elements. The texture width, height,
and depth refer to the size of the array in each
dimension. Table 12
lists the maximum texture width, height, and depth depending on the
compute capability of the device.

The type of a texel, which is restricted to the basic integer and
single-precision floating-point types and any of the 1-, 2-, and
4-component vector types defined in char, short, int, long, longlong, float, double that
are derived from the basic integer and single-precision floating-point
types.

The read mode, which is equal to
cudaReadModeNormalizedFloat or
cudaReadModeElementType. If it is
cudaReadModeNormalizedFloat and the type of the texel
is a 16-bit or 8-bit integer type, the value returned by the texture
fetch is actually returned as floating-point type and the full range of
the integer type is mapped to [0.0, 1.0] for unsigned integer type and
[-1.0, 1.0] for signed integer type; for example, an unsigned 8-bit
texture element with the value 0xff reads as 1. If it is
cudaReadModeElementType, no conversion is
performed.

Whether texture coordinates are normalized or not. By default,
textures are referenced (by the functions of Texture Functions) using floating-point coordinates in the
range [0, N-1] where N is the size of the texture in the dimension
corresponding to the coordinate. For example, a texture that is 64x32
in size will be referenced with coordinates in the range [0, 63] and
[0, 31] for the x and y dimensions, respectively. Normalized texture
coordinates cause the coordinates to be specified in the range [0.0,
1.0-1/N] instead of [0, N-1], so the same 64x32 texture would be
addressed by normalized coordinates in the range [0, 1-1/N] in both the
x and y dimensions. Normalized texture coordinates are a natural fit to
some applications' requirements, if it is preferable for the texture
coordinates to be independent of the texture size.

The addressing mode. It is valid to call the device
functions of Section B.8 with coordinates that are out of range. The
addressing mode defines what happens in that case. The default
addressing mode is to clamp the coordinates to the valid range: [0, N)
for non-normalized coordinates and [0.0, 1.0) for normalized
coordinates. If the border mode is specified instead, texture fetches
with out-of-range texture coordinates return zero. For normalized
coordinates, the wrap mode and the mirror mode are also available.
When using the wrap mode, each coordinate x is converted to
frac(x) = x - floor(x), where floor(x) is the
largest integer not greater than x. When using the mirror mode,
each coordinate x is converted to frac(x) if
floor(x) is even and 1-frac(x) if floor(x) is odd.
The addressing mode is specified as an array of size three whose first,
second, and third elements specify the addressing mode for the first,
second, and third texture coordinates, respectively; the addressing
modes are cudaAddressModeBorder,
cudaAddressModeClamp,
cudaAddressModeWrap, and
cudaAddressModeMirror;
cudaAddressModeWrap and
cudaAddressModeMirror are only supported for
normalized texture coordinates.

The filtering mode which specifies how the value
returned when fetching the texture is computed based on the input
texture coordinates. Linear texture filtering may be done only for
textures that are configured to return floating-point data. It performs
low-precision interpolation between neighboring texels. When enabled,
the texels surrounding a texture fetch location are read and the return
value of the texture fetch is interpolated based on where the texture
coordinates fell between the texels. Simple linear interpolation is
performed for one-dimensional textures, bilinear interpolation for
two-dimensional textures, and trilinear interpolation for
three-dimensional textures. Texture Fetching gives
more details on texture fetching. The filtering mode is equal to
cudaFilterModePoint or
cudaFilterModeLinear. If it is
cudaFilterModePoint, the returned value is the texel
whose texture coordinates are the closest to the input texture
coordinates. If it is cudaFilterModeLinear, the
returned value is the linear interpolation of the two (for a
one-dimensional texture), four (for a two dimensional texture), or
eight (for a three dimensional texture) texels whose texture
coordinates are the closest to the input texture coordinates.
cudaFilterModeLinear is only valid for returned values
of floating-point type.

Some of the attributes of a texture reference are immutable and must be
known at compile time; they are specified when declaring the texture
reference. A texture reference is declared at file scope as a variable of
type texture:

texture<DataType, Type, ReadMode> texRef;

where:

DataType specifies the type of the texel;

Type specifies the type of the texture reference and
is equal to cudaTextureType1D,
cudaTextureType2D, or
cudaTextureType3D, for a one-dimensional,
two-dimensional, or three-dimensional texture, respectively, or
cudaTextureType1DLayered or
cudaTextureType2DLayered for a one-dimensional or
two-dimensional layered texture respectively; Type is an optional
argument which defaults to cudaTextureType1D;

ReadMode specifies the read mode; it is an optional
argument which defaults to
cudaReadModeElementType.

A texture reference can only be declared as a static global variable and
cannot be passed as an argument to a function.

The other attributes of a texture reference are mutable and can be
changed at runtime through the host runtime. As explained in the
reference manual, the runtime API has a low-level C-style
interface and a high-level C++-style interface. The
texture type is defined in the high-level API as a
structure publicly derived from the textureReference
type defined in the low-level API as such:
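
A sketch of the relevant part of this structure (later CUDA versions add further members, which are omitted here):

struct textureReference {
    int                          normalized;
    enum cudaTextureFilterMode   filterMode;
    enum cudaTextureAddressMode  addressMode[3];
    struct cudaChannelFormatDesc channelDesc;
    // ... additional, version-dependent members omitted
};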

normalized, addressMode, and filterMode may be directly modified in host code.

Before a kernel can use a texture reference to read from texture memory,
the texture reference must be bound to a texture using
cudaBindTexture() or
cudaBindTexture2D() for linear memory, or
cudaBindTextureToArray() for CUDA arrays.
cudaUnbindTexture() is used to unbind a texture
reference. Once a texture reference has been unbound, it can be safely
rebound to another array, even if kernels that use the previously bound
texture have not completed. It is recommended to allocate
two-dimensional textures in linear memory using
cudaMallocPitch() and use the pitch returned by
cudaMallocPitch() as input parameter to
cudaBindTexture2D().

The following code samples bind a 2D texture reference to linear memory
pointed to by devPtr:
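
A sketch using the high-level API (the float texel type and the variables width, height, and pitch, the latter typically obtained from cudaMallocPitch(), are illustrative):

texture<float, cudaTextureType2D, cudaReadModeElementType> texRef;

// In host code: describe the texel format and bind the texture reference
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
size_t offset;
cudaBindTexture2D(&offset, texRef, devPtr, channelDesc, width, height, pitch);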

The 16-bit floating-point or half format supported by CUDA arrays is the same as the IEEE 754-2008 binary16 format.

CUDA C does not support a matching data type, but provides intrinsic functions to convert to and from the 32-bit floating-point
format via the unsigned short type: __float2half_rn(float) and __half2float(unsigned short). These functions are only supported in device code. Equivalent functions for the host code can be found in the OpenEXR library,
for example.

16-bit floating-point components are promoted to 32 bit float during texture fetching before any filtering is performed.

A channel description for the 16-bit floating-point format can be created by calling one of the cudaCreateChannelDescHalf*() functions.
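
As an illustration, a small sketch combining the channel description with the device-side conversion intrinsic (the kernel and variable names are hypothetical):

// Channel description for a one-component 16-bit floating-point CUDA array
cudaChannelFormatDesc halfDesc = cudaCreateChannelDescHalf();

// Device-side packing of 32-bit floats into the 16-bit storage format
__global__ void packHalf(unsigned short* out, const float* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __float2half_rn(in[i]);  // round-to-nearest-even conversion
}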

A one-dimensional or two-dimensional layered texture (also known as texture array in Direct3D and array texture in OpenGL) is a texture made up of a sequence of layers, all of which are regular textures of same dimensionality, size,
and data type.

A one-dimensional layered texture is addressed using an integer index and a floating-point texture coordinate; the index denotes
a layer within the sequence and the coordinate addresses a texel within that layer. A two-dimensional layered texture is addressed
using an integer index and two floating-point texture coordinates; the index denotes a layer within the sequence and the coordinates
address a texel within that layer.

A layered texture can only be a CUDA array, created by calling cudaMalloc3DArray() with the cudaArrayLayered flag (and a height of zero for a one-dimensional layered texture).
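
For example, a two-dimensional layered array of width x height float elements with numLayers layers might be allocated as follows (a sketch; width, height, and numLayers are assumed to be defined by the application):

void allocateLayeredArray(int width, int height, int numLayers, cudaArray** layeredArray)
{
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
    // The depth of the extent holds the number of layers; pass a height of
    // zero instead for a one-dimensional layered texture.
    cudaExtent extent = make_cudaExtent(width, height, numLayers);
    cudaMalloc3DArray(layeredArray, &channelDesc, extent, cudaArrayLayered);
}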

A cubemap texture is a special type of two-dimensional layered texture that has six layers representing the faces of a cube:

The width of a layer is equal to its height.

The cubemap is addressed using three texture coordinates x, y, and z that are interpreted as a direction vector emanating from the center of the cube and pointing to one face of the cube and
a texel within the layer corresponding to that face. More specifically, the face is selected by the coordinate with largest
magnitude m and the corresponding layer is addressed using coordinates (s/m+1)/2 and (t/m+1)/2 where s and t are defined in Table 1.

Table 1. Cubemap Fetch

                              face    m     s     t
 |x| > |y| and |x| > |z|
     x > 0                     0      x    -z    -y
     x < 0                     1     -x     z    -y
 |y| > |x| and |y| > |z|
     y > 0                     2      y     x     z
     y < 0                     3     -y     x    -z
 |z| > |x| and |z| > |y|
     z > 0                     4      z     x    -y
     z < 0                     5     -z    -x    -y

A cubemap texture can only be a CUDA array, created by calling cudaMalloc3DArray() with the cudaArrayCubemap flag.

A cubemap layered texture is a layered texture whose layers are cubemaps of same dimension.

A cubemap layered texture is addressed using an integer index and three floating-point texture coordinates; the index denotes
a cubemap within the sequence and the coordinates address a texel within that cubemap.

A cubemap layered texture can only be a CUDA array, created by calling cudaMalloc3DArray() with the cudaArrayLayered and cudaArrayCubemap flags.

Texture gather is a special texture fetch that is available for
two-dimensional textures only. It is performed by the
tex2Dgather() function, which has the same parameters as
tex2D(), plus an additional comp
parameter equal to 0, 1, 2, or 3 (see tex2Dgather()). It
returns four 32-bit numbers that correspond to the value of the component
comp of each of the four texels that would have been used for bilinear
filtering during a regular texture fetch. For example, if these texels
are of values (253, 20, 31, 255), (250, 25, 29, 254), (249, 16, 37, 253),
(251, 22, 30, 250), and comp is 2,
tex2Dgather() returns (31, 29, 37, 30).

Texture gather is only supported for CUDA arrays created with the
cudaArrayTextureGather flag and of width and height less
than the maximum specified in Table 12
for texture gather, which is smaller than for regular texture fetch.

Texture gather is only supported on devices of compute capability 2.0
and higher.

For devices of compute capability 2.0 and higher, a CUDA array
(described in CUDA Arrays), created with the
cudaArraySurfaceLoadStore flag, can be read and written
via a surface object or surface reference using
the functions described in Surface Functions.

Table 12
lists the maximum surface width, height, and depth depending on the
compute capability of the device.

A surface reference is declared at file scope as a variable of type
surface:

surface<void, Type> surfRef;

where Type specifies the type of the surface reference
and is equal to cudaSurfaceType1D,
cudaSurfaceType2D, cudaSurfaceType3D,
cudaSurfaceTypeCubemap,
cudaSurfaceType1DLayered,
cudaSurfaceType2DLayered, or
cudaSurfaceTypeCubemapLayered; Type is
an optional argument which defaults to cudaSurfaceType1D. A surface
reference can only be declared as a static global variable and cannot be
passed as an argument to a function.

Before a kernel can use a surface reference to access a CUDA array, the
surface reference must be bound to the CUDA array using
cudaBindSurfaceToArray().

The following code samples bind a surface reference to a CUDA array
cuArray:
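
A minimal sketch using the high-level API (cuArray is assumed to have been created with the cudaArraySurfaceLoadStore flag):

surface<void, cudaSurfaceType2D> surfRef;

void bindSurface(cudaArray* cuArray)
{
    // Bind the CUDA array to the surface reference declared above
    cudaBindSurfaceToArray(surfRef, cuArray);
}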

A CUDA array must be read and written using surface functions of
matching dimensionality and type and via a surface reference of matching
dimensionality; otherwise, the results of reading and writing the CUDA
array are undefined.

Unlike texture memory, surface memory uses byte addressing. This means
that the x-coordinate used to access a texture element via texture
functions needs to be multiplied by the byte size of the element to
access the same element via a surface function. For example, the element
at texture coordinate x of a one-dimensional floating-point CUDA array
bound to a texture reference texRef and a surface
reference surfRef is read using tex1D(texRef,
x) via texRef, but
surf1Dread(surfRef, 4*x) via surfRef.
Similarly, the element at texture coordinate x and y of a
two-dimensional floating-point CUDA array bound to a texture reference
texRef and a surface reference surfRef
is accessed using tex2D(texRef, x, y) via
texRef, but surf2Dread(surfRef, 4*x, y)
via surfRef (the byte offset of the y-coordinate is
internally calculated from the underlying line pitch of the CUDA
array).

The following code sample applies some simple transformation kernel to a
texture.
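
The listing below is a sketch of such a kernel: it copies a width x height array of uchar4 elements from an input surface to an output surface, using byte addressing for the x-coordinate as described above (the surface names and element type are illustrative).

surface<void, cudaSurfaceType2D> inputSurfRef;
surface<void, cudaSurfaceType2D> outputSurfRef;

__global__ void copyKernel(int width, int height)
{
    // Compute surface coordinates
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        uchar4 data;
        // x is multiplied by the element size because surfaces use byte addressing
        surf2Dread(&data, inputSurfRef, x * 4, y);
        surf2Dwrite(data, outputSurfRef, x * 4, y);
    }
}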

Cubemap surfaces are accessed using surfCubemapread() and surfCubemapwrite() as a two-dimensional layered surface, i.e., using an integer index denoting a face and two floating-point texture coordinates
addressing a texel within the layer corresponding to this face. Faces are ordered as indicated in Table 1.

Cubemap layered surfaces are accessed using surfCubemapLayeredread() and surfCubemapLayeredwrite() as a two-dimensional layered surface, i.e., using an integer index denoting a face of one of the cubemaps and two floating-point
texture coordinates addressing a texel within the layer corresponding to this face. Faces are ordered as indicated in Table 1, so index ((2 * 6) + 3), for example, accesses the fourth face of the third cubemap.

CUDA arrays are opaque memory layouts optimized for texture
fetching. They are one dimensional, two dimensional, or
three-dimensional and composed of elements, each of which has 1, 2 or 4
components that may be signed or unsigned 8-, 16-, or 32-bit integers,
16-bit floats, or 32-bit floats. CUDA arrays are only accessible by
kernels through texture fetching as described in Texture Memory or surface reading and writing as described
in Surface Memory.

The texture and surface memory is cached (see Device Memory Accesses) and within the same kernel call, the cache is not kept coherent with respect to global memory writes and surface memory
writes, so any texture fetch or surface read to an address that has been written to via a global write or a surface write
in the same kernel call returns undefined data. In other words, a thread can safely read some texture or surface memory location
only if this memory location has been updated by a previous kernel call or memory copy, but not if it has been previously
updated by the same thread or another thread from the same kernel call.

Some resources from OpenGL and Direct3D may be mapped into the address space of CUDA, either to enable CUDA to read data written
by OpenGL or Direct3D, or to enable CUDA to write data for consumption by OpenGL or Direct3D.

A resource must be registered to CUDA before it can be mapped using the functions mentioned in OpenGL Interoperability and Direct3D Interoperability. These functions return a pointer to a CUDA graphics resource of type struct cudaGraphicsResource. Registering a resource is potentially high-overhead and therefore typically called only once per resource. A CUDA graphics
resource is unregistered using cudaGraphicsUnregisterResource(). Each CUDA context which intends to use the resource is required to register it separately.

Once a resource is registered to CUDA, it can be mapped and unmapped as many times as necessary using cudaGraphicsMapResources() and cudaGraphicsUnmapResources(). cudaGraphicsResourceSetMapFlags() can be called to specify usage hints (write-only, read-only) that the CUDA driver can use to optimize resource management.

A mapped resource can be read from or written to by kernels using the device memory address returned by cudaGraphicsResourceGetMappedPointer() for buffers and cudaGraphicsSubResourceGetMappedArray() for CUDA arrays.

The OpenGL resources that may be mapped into the address space of CUDA are OpenGL buffer, texture, and renderbuffer objects.

A buffer object is registered using cudaGraphicsGLRegisterBuffer(). In CUDA, it appears as a device pointer and can therefore be read and written by kernels or via cudaMemcpy() calls.

A texture or renderbuffer object is registered using cudaGraphicsGLRegisterImage(). In CUDA, it appears as a CUDA array. Kernels can read from the array by binding it to a texture or surface reference. They
can also write to it via the surface write functions if the resource has been registered with the cudaGraphicsRegisterFlagsSurfaceLoadStore flag. The array can also be read and written via cudaMemcpy2D() calls. cudaGraphicsGLRegisterImage() supports all texture formats with 1, 2, or 4 components and an internal type of float (e.g., GL_RGBA_FLOAT32), normalized integer (e.g., GL_RGBA8, GL_INTENSITY16), and unnormalized integer (e.g., GL_RGBA8UI) (please note that since unnormalized integer formats require OpenGL 3.0, they can only be written by shaders, not the fixed
function pipeline).

The OpenGL context whose resources are being shared has to be current to the host thread making any OpenGL interoperability
API calls.

Please note: When an OpenGL texture is made bindless (say for example by requesting an image or texture handle using the glGetTextureHandle*/glGetImageHandle*
APIs) it cannot be registered with CUDA. The application needs to register the texture for interop before requesting an image
or texture handle.

The following code sample uses a kernel to dynamically modify a 2D width x height grid of vertices stored in a vertex buffer object:
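
An abbreviated sketch of the per-frame flow, assuming the vertex buffer object positionsVBO has already been created in OpenGL (the OpenGL headers are required for GLuint) and registered once with cudaGraphicsGLRegisterBuffer(), and that time, width, and height are host variables:

GLuint positionsVBO;
struct cudaGraphicsResource* positionsVBO_CUDA;

__global__ void createVertices(float4* positions, float time,
                               unsigned int width, unsigned int height)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Calculate uv coordinates in [-1, 1]
    float u = x / (float)width * 2.0f - 1.0f;
    float v = y / (float)height * 2.0f - 1.0f;

    // Simple sine wave pattern
    float freq = 4.0f;
    float w = sinf(u * freq + time) * cosf(v * freq + time) * 0.5f;

    // Write the vertex position
    positions[y * width + x] = make_float4(u, w, v, 1.0f);
}

void display(float time, unsigned int width, unsigned int height)
{
    // Map the buffer object for writing from CUDA
    float4* positions;
    size_t num_bytes;
    cudaGraphicsMapResources(1, &positionsVBO_CUDA, 0);
    cudaGraphicsResourceGetMappedPointer((void**)&positions, &num_bytes,
                                         positionsVBO_CUDA);

    // Execute the kernel
    dim3 dimBlock(16, 16, 1);
    dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);
    createVertices<<<dimGrid, dimBlock>>>(positions, time, width, height);

    // Unmap the buffer object so OpenGL can render from it
    cudaGraphicsUnmapResources(1, &positionsVBO_CUDA, 0);
}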

On Windows and for Quadro GPUs, cudaWGLGetDevice() can be used to retrieve the CUDA device associated to the handle returned by wglEnumGpusNV(). Quadro GPUs offer higher performance OpenGL interoperability than GeForce and Tesla GPUs in a multi-GPU configuration where
OpenGL rendering is performed on the Quadro GPU and CUDA computations are performed on other GPUs in the system.

A CUDA context may interoperate only with Direct3D devices that fulfill the following criteria: Direct3D 9Ex devices must
be created with DeviceType set to D3DDEVTYPE_HAL and BehaviorFlags with the D3DCREATE_HARDWARE_VERTEXPROCESSING flag; Direct3D 10 and Direct3D 11 devices must be created with DriverType set to D3D_DRIVER_TYPE_HARDWARE.

The Direct3D resources that may be mapped into the address space of CUDA are Direct3D buffers, textures, and surfaces. These
resources are registered using cudaGraphicsD3D9RegisterResource(), cudaGraphicsD3D10RegisterResource(), and cudaGraphicsD3D11RegisterResource().

The following code sample uses a kernel to dynamically modify a 2D width x height grid of vertices stored in a vertex buffer object.
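
A condensed sketch of the CUDA side only, assuming positionsVB is a Direct3D 11 vertex buffer created by the application (d3d11.h is required) and createVertices is the same kernel as in the OpenGL sketch above:

ID3D11Buffer* positionsVB;
struct cudaGraphicsResource* positionsVB_CUDA;

void registerOnce()
{
    // Registration is typically done once per resource
    cudaGraphicsD3D11RegisterResource(&positionsVB_CUDA, positionsVB,
                                      cudaGraphicsRegisterFlagsNone);
    cudaGraphicsResourceSetMapFlags(positionsVB_CUDA,
                                    cudaGraphicsMapFlagsWriteDiscard);
}

void renderFrame(float time, unsigned int width, unsigned int height)
{
    float4* positions;
    size_t num_bytes;
    cudaGraphicsMapResources(1, &positionsVB_CUDA, 0);
    cudaGraphicsResourceGetMappedPointer((void**)&positions, &num_bytes,
                                         positionsVB_CUDA);

    dim3 dimBlock(16, 16, 1);
    dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);
    createVertices<<<dimGrid, dimBlock>>>(positions, time, width, height);

    cudaGraphicsUnmapResources(1, &positionsVB_CUDA, 0);
    // ... Direct3D draw calls using positionsVB follow ...
}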

In a system with multiple GPUs, all CUDA-enabled GPUs are accessible via the CUDA driver and runtime as separate devices.
There are however special considerations as described below when the system is in SLI mode.

First, an allocation in one CUDA device on one GPU will consume memory on other GPUs that are part of the SLI configuration
of the Direct3D or OpenGL device. Because of this, allocations may fail earlier than otherwise expected.

Second, applications should create multiple CUDA contexts, one for each GPU in the SLI configuration. While this is not a
strict requirement, it avoids unnecessary data transfers between devices. The application can use the cudaD3D[9|10|11]GetDevices() for Direct3D and cudaGLGetDevices() for OpenGL set of calls to identify the CUDA device handle(s) for the device(s) that are performing the rendering in the
current and next frame. Given this information the application will typically choose the appropriate device and map Direct3D
or OpenGL resources to the CUDA device returned by cudaD3D[9|10|11]GetDevices() or cudaGLGetDevices() when the deviceList parameter is set to cudaD3D[9|10|11]DeviceListCurrentFrame or cudaGLDeviceListCurrentFrame.

Please note that resources returned from cudaGraphicsD3D[9|10|11]RegisterResource and cudaGraphicsGLRegister[Buffer|Image] must only be used on the device on which the registration happened. Therefore, on SLI configurations, when data for different frames is
computed on different CUDA devices, it is necessary to register the resources for each device separately.

There are two version numbers that developers should care about when
developing a CUDA application: The compute capability that describes the
general specifications and features of the compute device (see Compute Capability) and the version of the CUDA driver API
that describes the features supported by the driver API and runtime.

The version of the driver API is defined in the driver header file as
CUDA_VERSION. It allows developers to check whether
their application requires a newer device driver than the one currently
installed. This is important, because the driver API is backward
compatible, meaning that applications, plug-ins, and libraries
(including the C runtime) compiled against a particular version of the
driver API will continue to work on subsequent device driver releases as
illustrated in Figure 11.
The driver API is not forward compatible, which means that
applications, plug-ins, and libraries (including the C runtime) compiled
against a particular version of the driver API will not work on previous
versions of the device driver.
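
For example, an application can compare the two versions at runtime (a small sketch using the runtime API):

void checkVersions()
{
    int driverVersion = 0, runtimeVersion = 0;
    cudaDriverGetVersion(&driverVersion);    // highest CUDA version supported by the installed driver
    cudaRuntimeGetVersion(&runtimeVersion);  // version of the CUDA Runtime used by the application
    if (driverVersion < runtimeVersion) {
        // The installed driver is older than the runtime this application was
        // built against; runtime calls typically fail with cudaErrorInsufficientDriver.
    }
}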

It is important to note that there are limitations on the mixing and
matching of versions that is supported:

Since only one version of the CUDA Driver can be installed at a time on a
system, the installed driver must be of the same or higher version than the
maximum Driver API version against which any application, plug-ins, or
libraries that must run on that system were built.

All plug-ins and libraries used by an application must use the same
version of the CUDA Runtime unless they statically link to the Runtime,
in which case multiple versions of the runtime can coexist in the same
process space. Note that if nvcc is used to link the
application, the static version of the CUDA Runtime library will be used
by default, and all CUDA Toolkit libraries are statically linked against
the CUDA Runtime.

All plug-ins and libraries used by an application must use the same
version of any libraries that use the runtime (such as cuFFT, cuBLAS,
...) unless statically linking to those libraries.

On Tesla solutions running Windows Server 2008 and later or Linux, one can set any device in a system in one of the four
following modes using NVIDIA's System Management Interface (nvidia-smi), which is a tool distributed as part of the driver:

Default compute mode: Multiple host threads can use the device (by calling cudaSetDevice() on this device, when using the runtime API, or by making current a context associated to the device, when using the driver
API) at the same time.

Exclusive-process compute mode: Only one CUDA context may be created on the device across all processes in the system and that context may
be current to as many threads as desired within the process that created that context.

Exclusive-process-and-thread compute mode: Only one CUDA context may be created on the device across all processes in the system and that context may
only be current to one thread at a time.

Prohibited compute mode: No CUDA context can be created on the device.

This means, in particular, that a host thread using the runtime API without explicitly calling cudaSetDevice() might be associated with a device other than device 0 if device 0 turns out to be in the exclusive-process mode and used
by another process, or in the exclusive-process-and-thread mode and used by another thread, or in prohibited mode. cudaSetValidDevices() can be used to set a device from a prioritized list of devices.

Applications may query the compute mode of a device by checking the computeMode device property (see Device Enumeration).
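
For example (a sketch; the enum values are those defined in the runtime API headers):

void printComputeMode(int device)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    // prop.computeMode is one of cudaComputeModeDefault, cudaComputeModeProhibited,
    // cudaComputeModeExclusiveProcess, or cudaComputeModeExclusive (the
    // thread-exclusive variant).
    if (prop.computeMode == cudaComputeModeProhibited) {
        // No CUDA context can be created on this device
    }
}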

GPUs that have a display output dedicate some DRAM memory to the so-called primary surface, which is used to refresh the display device whose output is viewed by the user. When users initiate a mode switch of the display by changing the resolution or bit depth of the display (using NVIDIA control panel or the Display control
panel on Windows), the amount of memory needed for the primary surface changes. For example, if the user changes the display
resolution from 1280x1024x32-bit to 1600x1200x32-bit, the system must dedicate 7.68 MB to the primary surface rather than
5.24 MB. (Full-screen graphics applications running with anti-aliasing enabled may require much more display memory for the
primary surface.) On Windows, other events that may initiate display mode switches include launching a full-screen DirectX
application, hitting Alt+Tab to task switch away from a full-screen DirectX application, or hitting Ctrl+Alt+Del to lock the
computer.

If a mode switch increases the amount of memory needed for the primary surface, the system may have to cannibalize memory
allocations dedicated to CUDA applications. Therefore, a mode switch causes any call to the CUDA runtime to fail and return
an invalid context error.

Using NVIDIA's System Management Interface (nvidia-smi),
the Windows device driver can be put in TCC (Tesla Compute Cluster) mode
for devices of the Tesla and Quadro Series of compute capability 2.0 and
higher.

This mode has the following primary benefits:

It makes it possible to use these GPUs in cluster nodes with
non-NVIDIA integrated graphics;

It makes these GPUs available via Remote Desktop, both directly and
via cluster management systems that rely on Remote Desktop;

It makes these GPUs available to applications running as a Windows
service (i.e., in Session 0).

The NVIDIA GPU architecture is built around a scalable array of multithreaded
Streaming Multiprocessors (SMs). When a
CUDA program on the host CPU invokes a kernel grid, the blocks of the grid
are enumerated and distributed to multiprocessors with available execution
capacity. The threads of a thread block execute concurrently on one
multiprocessor, and multiple thread blocks can execute concurrently on one
multiprocessor. As thread blocks terminate, new blocks are launched on the
vacated multiprocessors.

A multiprocessor is designed to execute hundreds of threads concurrently. To manage such a large amount of threads, it employs
a unique architecture called SIMT (Single-Instruction, Multiple-Thread) that is described in SIMT Architecture. The instructions are pipelined to leverage instruction-level parallelism within a single thread, as well as thread-level
parallelism extensively through simultaneous hardware multithreading as detailed in Hardware Multithreading. Unlike CPU cores, however, instructions are issued in order; there is no branch prediction and no speculative execution.

The multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel
threads called warps. Individual threads composing a warp start together at the
same program address, but they have their own instruction address counter and register
state and are therefore free to branch and execute independently. The term warp
originates from weaving, the first parallel thread technology. A half-warp is
either the first or second half of a warp. A quarter-warp is either the first,
second, third, or fourth quarter of a warp.

When a multiprocessor is given one or more thread blocks to execute, it partitions them into
warps and each warp gets scheduled by a warp scheduler for execution. The way a
block is partitioned into warps is always the same; each warp contains threads of
consecutive, increasing thread IDs with the first warp containing thread 0. Thread Hierarchy
describes how thread IDs relate to thread indices in the block.

A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their
execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch
path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same
execution path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they
are executing common or disjoint code paths.

The SIMT architecture is akin to SIMD (Single Instruction, Multiple Data) vector organizations in that a single instruction
controls multiple processing elements. A key difference is that SIMD vector organizations expose the SIMD width to the software,
whereas SIMT instructions specify the execution and branching behavior of a single thread. In contrast with SIMD vector machines,
SIMT enables programmers to write thread-level parallel code for independent, scalar threads, as well as data-parallel code
for coordinated threads. For the purposes of correctness, the programmer can essentially ignore the SIMT behavior; however,
substantial performance improvements can be realized by taking care that the code seldom requires threads in a warp to diverge.
In practice, this is analogous to the role of cache lines in traditional code: Cache line size can be safely ignored when
designing for correctness but must be considered in the code structure when designing for peak performance. Vector architectures,
on the other hand, require the software to coalesce loads into vectors and manage divergence manually.

Notes

The threads of a warp that are on that warp's current execution path are
called the active threads, whereas threads not on the
current path are inactive (disabled). Threads can be inactive
because they have exited earlier than other threads of their warp,
or because they are on a different branch path than the branch path currently executed by the warp,
or because they are the last threads of a block whose number of threads is not a multiple of the warp size.

If a non-atomic instruction executed by a warp writes to the same location in global or shared memory for more than one of
the threads of the warp, the number of serialized writes that occur to that location varies depending on the compute capability
of the device (see Compute Capability 2.x, Compute Capability 3.x, and Compute Capability 5.x), and which thread performs the final write is undefined.

If an atomic instruction executed by a warp reads, modifies, and writes to the same location in global memory for more than one of the
threads of the warp, each read/modify/write to that location occurs and they are all serialized, but the order in which they
occur is undefined.

The execution context (program counters, registers, etc.) for each warp processed by a multiprocessor is maintained on-chip
during the entire lifetime of the warp. Therefore, switching from one execution context to another has no cost, and at every
instruction issue time, a warp scheduler selects a warp that has threads ready to execute its next instruction (the active threads of the warp) and issues the instruction to those threads.

In particular, each multiprocessor has a set of 32-bit registers that are partitioned among the warps, and a parallel data cache or shared memory that is partitioned among the thread blocks.

The number of blocks and warps that can reside and be processed together on the multiprocessor for a given kernel depends
on the amount of registers and shared memory used by the kernel and the amount of registers and shared memory available on
the multiprocessor. There are also a maximum number of resident blocks and a maximum number of resident warps per multiprocessor.
These limits as well the amount of registers and shared memory available on the multiprocessor are a function of the compute
capability of the device and are given in Appendix Compute Capabilities. If there are not enough registers or shared memory available per multiprocessor to process at least one block, the kernel
will fail to launch.

The total number of warps in a block is as follows:

ceil(T / Wsize, 1)

where:

T is the number of threads per block,

Wsize is the warp size, which is equal to 32,

ceil(x, y) is equal to x rounded up to the nearest multiple of y.

The total number of registers and total amount of shared memory allocated for a block are documented in the CUDA Occupancy
Calculator provided in the CUDA Toolkit.

Which strategies will yield the best performance gain for a particular
portion of an application depends on the performance limiters for that
portion; optimizing instruction usage of a kernel that is mostly limited
by memory accesses will not yield any significant performance gain, for
example. Optimization efforts should therefore be constantly directed by
measuring and monitoring the performance limiters, for example using the
CUDA profiler. Also, comparing the floating-point operation throughput or
memory throughput - whichever makes more sense - of a particular kernel
to the corresponding peak theoretical throughput of the device indicates
how much room for improvement there is for the kernel.

To maximize utilization the application should be structured in a way that it exposes as much parallelism as possible and
efficiently maps this parallelism to the various components of the system to keep them busy most of the time.

At a high level, the application should maximize parallel execution between the host, the devices, and the bus connecting
the host to the devices, by using asynchronous functions calls and streams as described in Asynchronous Concurrent Execution. It should assign to each processor the type of work it does best: serial workloads to the host; parallel workloads to the
devices.

For the parallel workloads, at points in the algorithm where parallelism is broken because some
threads need to synchronize in order to share data with each other, there are two cases:
Either these threads belong to the same block, in which case they should use
__syncthreads() and share data through shared memory within the
same kernel invocation, or they belong to different blocks, in which case they must
share data through global memory using two separate kernel invocations, one for writing
to and one for reading from global memory. The second case is much less optimal since it
adds the overhead of extra kernel invocations and global memory traffic. Its occurrence
should therefore be minimized by mapping the algorithm to the CUDA programming model in
such a way that the computations that require inter-thread communication are performed
within a single thread block as much as possible.

At a lower level, the application should maximize parallel execution between the multiprocessors of a device.

Multiple kernels can execute concurrently on a device, so maximum utilization can also be achieved by using streams to enable
enough kernels to execute concurrently as described in Asynchronous Concurrent Execution.

At an even lower level, the application should maximize parallel execution between the various functional units within a multiprocessor.

As described in Hardware Multithreading, a GPU multiprocessor relies on thread-level parallelism to maximize utilization of its functional units. Utilization is
therefore directly linked to the number of resident warps. At every instruction issue time, a warp scheduler selects a warp
that is ready to execute its next instruction, if any, and issues the instruction to the active threads of the warp. The number of clock cycles it takes for a warp to be ready to execute its next instruction is called
the latency, and full utilization is achieved when all warp schedulers always have some instruction to issue for some warp at every clock
cycle during that latency period, or in other words, when latency is completely "hidden". The number of instructions required
to hide a latency of L clock cycles depends on the respective throughputs of these instructions (see Arithmetic Instructions for the throughputs of various arithmetic instructions); assuming maximum throughput for all instructions, it is:

L for devices of compute capability 2.0 since a multiprocessor issues one instruction per warp over two clock cycles for two
warps at a time, as mentioned in Compute Capability 2.x,

2L for devices of compute capability 2.1 since a multiprocessor issues a pair of instructions per warp over two clock cycles
for two warps at a time, as mentioned in Compute Capability 2.x,

8L for devices of compute capability 3.x since a multiprocessor issues a pair
of instructions per warp over one clock cycle for four warps at a time, as mentioned
in Compute Capability 3.x.

For devices of compute capability 2.0, the two instructions issued every other cycle are for two different warps. For devices
of compute capability 2.1, the four instructions issued every other cycle are two pairs for two different warps, each pair
being for the same warp.

For devices of compute capability 3.x, the eight instructions issued every cycle are four
pairs for four different warps, each pair being for the same warp.

The most common reason a warp is not ready to execute its next instruction is that the instruction's input operands are not
available yet.

If all input operands are registers, latency is caused by register dependencies, i.e.,
some of the input operands are written by some previous instruction(s) whose execution
has not completed yet. In the case of a back-to-back register dependency (i.e., some
input operand is written by the previous instruction), the latency is equal to the
execution time of the previous instruction and the warp schedulers must schedule
instructions for different warps during that time. Execution time varies depending on
the instruction, but it is typically about 22 clock cycles for devices of compute
capability 2.x and about 11 clock cycles for devices of compute capability 3.x,
which translates to 22 warps for
devices of compute capability 2.x and 44 warps for devices of compute capability 3.x and higher (still assuming that warps
execute
instructions with maximum throughput, otherwise fewer warps are needed). For devices of
compute capability 2.1 and higher, this is also assuming enough instruction-level
parallelism so that schedulers are always able to issue pairs of instructions for each
warp.

If some input operand resides in off-chip memory, the latency is much higher: 400 to 800 clock cycles for devices of compute
capability 2.x and about 200 to 400 clock cycles for devices of compute capability 3.x. The number of warps required to keep
the warp schedulers busy during such high latency periods depends on the kernel code and its degree of instruction-level parallelism.
In general, more warps are required if the ratio of the number of instructions with no off-chip memory operands (i.e., arithmetic
instructions most of the time) to the number of instructions with off-chip memory operands is low (this ratio is commonly
called the arithmetic intensity of the program). For example, assume this ratio is 30, also assume the latencies are 600 cycles
on devices of compute capability 2.x and 300 cycles on devices of compute capability 3.x. Then about 20 warps are required
for devices of compute capability 2.x and about 40 for devices of compute capability 3.x (with the same assumptions as in
the previous paragraph).

Another reason a warp is not ready to execute its next instruction is that it is waiting at some memory fence (Memory Fence Functions) or synchronization point (Synchronization Functions). A synchronization point can force the multiprocessor to idle as more and more warps wait for other warps in the same block
to complete execution of instructions prior to the synchronization point. Having multiple resident blocks per multiprocessor
can help reduce idling in this case, as warps from different blocks do not need to wait for each other at synchronization
points.

The number of blocks and warps residing on each multiprocessor for a given kernel call depends on the execution configuration
of the call (Execution Configuration), the memory resources of the multiprocessor, and the resource requirements of the kernel as described in Hardware Multithreading. Register and shared memory usage are reported by the compiler when compiling with the --ptxas-options=-v option.

The total amount of shared memory required for a block is equal to the sum of the amount of statically allocated shared memory
and the amount of dynamically allocated shared memory.

The number of registers used by a kernel can have a significant impact on the number of resident warps. For example, for devices
of compute capability 2.x, if a kernel uses 32 registers and each block has 512 threads and requires very little shared memory,
then two blocks (i.e., 32 warps) can reside on the multiprocessor since they require 2x512x32 registers, which exactly matches
the number of registers available on the multiprocessor. But as soon as the kernel uses one more register, only one block
(i.e., 16 warps) can be resident since two blocks would require 2x512x33 registers, which are more registers than are available
on the multiprocessor. Therefore, the compiler attempts to minimize register usage while keeping register spilling (see Device Memory Accesses) and the number of instructions to a minimum. Register usage can be controlled using the maxrregcount compiler option or launch bounds as described in Launch Bounds.

Each double variable and each long long variable uses two registers.

The effect of execution configuration on performance for a given kernel call generally depends on the kernel code. Experimentation
is therefore recommended. Applications can also parameterize execution configurations based on register file size and shared
memory size, which depends on the compute capability of the device, as well as on the number of multiprocessors and memory
bandwidth of the device, all of which can be queried using the runtime (see reference manual).

The number of threads per block should be chosen as a multiple of the warp size to avoid wasting computing resources with
under-populated warps as much as possible.

Several API functions exist to assist programmers in choosing thread block size based on register and shared memory requirements.

The occupancy calculator API, cudaOccupancyMaxActiveBlocksPerMultiprocessor, can provide an occupancy prediction based on the block size and shared memory usage of a kernel. This function reports occupancy
in terms of the number of concurrent thread blocks per multiprocessor.

Note that this value can be converted to other metrics. Multiplying by the number of warps per block yields the number of
concurrent warps per multiprocessor; further dividing concurrent warps by max warps per multiprocessor gives the occupancy
as a percentage.
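
As an illustration, the sketch below queries the predicted number of concurrent blocks for a hypothetical kernel MyKernel and converts it to an occupancy percentage:

#include <stdio.h>

__global__ void MyKernel(int* d, const int* a, const int* b)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    d[idx] = a[idx] * b[idx];
}

void reportOccupancy()
{
    int numBlocks;        // Predicted concurrent blocks per multiprocessor
    int blockSize = 256;  // Candidate block size

    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, MyKernel,
                                                  blockSize, 0 /* dynamic smem */);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Convert blocks per multiprocessor to occupancy as a percentage of
    // the maximum number of resident warps
    int activeWarps = numBlocks * blockSize / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Occupancy: %.1f%%\n", 100.0 * activeWarps / maxWarps);
}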

The CUDA Toolkit also provides a self-documenting, standalone occupancy calculator and launch configurator implementation
in <CUDA_Toolkit_Path>/include/cuda_occupancy.h for any use cases that cannot depend on the CUDA software stack. A spreadsheet version of the occupancy calculator is also
provided. The spreadsheet version is particularly useful as a learning tool that visualizes the impact of changes to the parameters
that affect occupancy (block size, registers per thread, and shared memory per thread).

The first step in maximizing overall memory throughput for the application is to minimize data transfers with low bandwidth.

That means minimizing data transfers between the host and the device, as detailed in Data Transfer between Host and Device, since these have much lower bandwidth than data transfers between global memory and the device.

That also means minimizing data transfers between global memory and the device by maximizing use of on-chip memory: shared
memory and caches (i.e., L1 cache available on devices of compute capability 2.x and 3.x, L2 cache available on devices of
compute capability 2.x and higher, texture cache and constant cache available on all devices).

Shared memory is equivalent to a user-managed cache: The application explicitly allocates and accesses it. As illustrated
in CUDA C Runtime, a typical programming pattern is to stage data coming from device memory into shared memory; in other words, to have each
thread of a block:

Load data from device memory to shared memory,

Synchronize with all the other threads of the block so that each thread can safely read shared memory locations that were
populated by different threads,

Process the data in shared memory,

Synchronize again if necessary to make sure that shared memory has been updated with the results,

Write the results back to device memory.

For some applications (e.g., for which global memory access patterns are data-dependent), a traditional hardware-managed cache
is more appropriate to exploit data locality. As mentioned in Compute Capability 2.x and Compute Capability 3.x, for devices of compute capability 2.x and 3.x, the same on-chip memory is used for both L1 and shared memory, and how much
of it is dedicated to L1 versus shared memory is configurable for each kernel call.

The throughput of memory accesses by a kernel can vary by an order of magnitude depending on access pattern for each type
of memory. The next step in maximizing memory throughput is therefore to organize memory accesses as optimally as possible
based on the optimal memory access patterns described in Device Memory Accesses. This optimization is especially important for global memory accesses as global memory bandwidth is low, so non-optimal global
memory accesses have a higher impact on performance.

Applications should strive to minimize data transfer between the host and the device. One way to accomplish this is to move
more code from the host to the device, even if that means running kernels with low parallelism computations. Intermediate
data structures may be created in device memory, operated on by the device, and destroyed without ever being mapped by the
host or copied to host memory.

Also, because of the overhead associated with each transfer, batching many small transfers into a single large transfer always
performs better than making each transfer separately.

On systems with a front-side bus, higher performance for data transfers between host and device is achieved by using page-locked
host memory as described in Page-Locked Host Memory.

In addition, when using mapped page-locked memory (Mapped Memory), there is no need to allocate any device memory and explicitly copy data between device and host memory. Data transfers
are implicitly performed each time the kernel accesses the mapped memory. For maximum performance, these memory accesses must
be coalesced as with accesses to global memory (see Device Memory Accesses). Assuming that they are and that the mapped memory is read or written only once, using mapped page-locked memory instead
of explicit copies between device and host memory can be a win for performance.

On integrated systems where device memory and host memory are physically the same, any copy between host and device memory
is superfluous and mapped page-locked memory should be used instead. Applications may query whether a device is integrated by checking that the integrated device property (see Device Enumeration) is equal to 1.

An instruction that accesses addressable memory (i.e., global, local,
shared, constant, or texture memory) might need to be re-issued multiple
times depending on the distribution of the memory addresses across the
threads within the warp. How the distribution affects the instruction
throughput this way is specific to each type of memory and described in
the following sections. For example, for global memory, as a general
rule, the more scattered the addresses are, the more reduced the
throughput is.

Global Memory

Global memory resides in device memory and device memory is accessed
via 32-, 64-, or 128-byte memory transactions. These memory
transactions must be naturally aligned: Only the 32-, 64-, or 128-byte
segments of device memory that are aligned to their size (i.e., whose
first address is a multiple of their size) can be read or written by
memory transactions.

When a warp executes an instruction that accesses global memory, it
coalesces the memory accesses of the threads within the warp into one
or more of these memory transactions depending on the size of the word
accessed by each thread and the distribution of the memory addresses
across the threads. In general, the more transactions are necessary,
the more unused words are transferred in addition to the words accessed
by the threads, reducing the instruction throughput accordingly. For
example, if a 32-byte memory transaction is generated for each thread's
4-byte access, throughput is divided by 8.

Coalescing can also be improved by padding data in some cases, for example, when accessing a
two-dimensional array as described in Device Memory Accesses.

Size and Alignment Requirement

Global memory instructions support reading or writing words of size
equal to 1, 2, 4, 8, or 16 bytes. Any access (via a variable or a
pointer) to data residing in global memory compiles to a single global
memory instruction if and only if the size of the data type is 1, 2, 4,
8, or 16 bytes and the data is naturally aligned (i.e., its address is
a multiple of that size).

If this size and alignment requirement is not fulfilled, the access
compiles to multiple instructions with interleaved access patterns that
prevent these instructions from fully coalescing. It is therefore
recommended to use types that meet this requirement for data that
resides in global memory.

For structures, the size and alignment requirements can be enforced by
the compiler using the alignment specifiers __align__(8) or
__align__(16), such as

struct __align__(8) {
float x;
float y;
};

or

struct __align__(16) {
float x;
float y;
float z;
};

Any address of a variable residing in global memory or returned by one
of the memory allocation routines from the driver or runtime API is
always aligned to at least 256 bytes.

Reading non-naturally aligned 8-byte or 16-byte words produces
incorrect results (off by a few words), so special care must be taken
to maintain alignment of the starting address of any value or array of
values of these types. A typical case where this might be easily
overlooked is when using some custom global memory allocation scheme,
whereby the allocations of multiple arrays (with multiple calls to
cudaMalloc() or cuMemAlloc()) is
replaced by the allocation of a single large block of memory
partitioned into multiple arrays, in which case the starting address of
each array is offset from the block's starting address.

Two-Dimensional Arrays

A common global memory access pattern is when each thread of index
(tx,ty) uses the following address to access one element of a 2D array
of width width, located at address BaseAddress of type
type* (where type meets the requirement described in
Size and Alignment Requirement):

BaseAddress + width * ty + tx

For these accesses to be fully coalesced, both the width of the thread
block and the width of the array must be a multiple of the warp size.

In particular, this means that an array whose width is not a multiple
of this size will be accessed much more efficiently if it is actually
allocated with a width rounded up to the closest multiple of this size
and its rows padded accordingly. The cudaMallocPitch() and
cuMemAllocPitch() functions and associated memory copy functions
described in the reference manual enable programmers to write
non-hardware-dependent code to allocate arrays that conform to these
constraints.
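
A brief sketch of this pattern, assuming a kernel named scale2D and the pitch in bytes returned by cudaMallocPitch():

__global__ void scale2D(float* baseAddress, size_t pitch, int width, int height)
{
    int tx = blockIdx.x * blockDim.x + threadIdx.x;
    int ty = blockIdx.y * blockDim.y + threadIdx.y;
    if (tx < width && ty < height) {
        // pitch is the padded row size in bytes, so rows keep the alignment
        // required for coalesced accesses
        float* row = (float*)((char*)baseAddress + ty * pitch);
        row[tx] *= 2.0f;
    }
}

void allocateAndLaunch(int width, int height)
{
    float* devPtr;
    size_t pitch;
    cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
    dim3 block(32, 8);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    scale2D<<<grid, block>>>(devPtr, pitch, width, height);
}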

Local Memory

Local memory accesses only occur for some automatic variables as
mentioned in Variable Type Qualifiers. Automatic
variables that the compiler is likely to place in local memory are:

Arrays for which it cannot determine that they are indexed with
constant quantities,

Large structures or arrays that would consume too much register
space,

Any variable if the kernel uses more registers than available (this
is also known as register spilling).

Inspection of the PTX assembly code (obtained by
compiling with the -ptx or -keep
option) will tell if a variable has been placed in local memory during
the first compilation phases as it will be declared using the
.local mnemonic and accessed using the
ld.local and st.local mnemonics. Even
if it has not, subsequent compilation phases might still decide
otherwise though if they find it consumes too much register space for
the targeted architecture: Inspection of the cubin object using
cuobjdump will tell if this is the case. Also, the
compiler reports total local memory usage per kernel
(lmem) when compiling with the
--ptxas-options=-v option. Note that some mathematical
functions have implementation paths that might access local memory.

The local memory space resides in device memory, so local memory
accesses have the same high latency and low bandwidth as global memory
accesses and are subject to the same requirements for memory coalescing
as described in Device Memory Accesses. Local memory
is however organized such that consecutive 32-bit words are accessed by
consecutive thread IDs. Accesses are therefore fully coalesced as long
as all threads in a warp access the same relative address (e.g., same
index in an array variable, same member in a structure variable).

On devices of compute capability 5.x, local memory accesses
are always cached in L2 in the same way as global memory
accesses (see Compute Capability 5.x).

Shared Memory

Because it is on-chip, shared memory has much higher bandwidth and
much lower latency than local or global memory.

To achieve high bandwidth, shared memory is divided into equally-sized
memory modules, called banks, which can be accessed simultaneously. Any
memory read or write request made of n addresses that fall in
n distinct memory banks can therefore be serviced
simultaneously, yielding an overall bandwidth that is n times as
high as the bandwidth of a single module.

However, if two addresses of a memory request fall in the same memory
bank, there is a bank conflict and the access has to be serialized. The
hardware splits a memory request with bank conflicts into as many
separate conflict-free requests as necessary, decreasing throughput by
a factor equal to the number of separate memory requests. If the number
of separate memory requests is n, the initial memory request is
said to cause n-way bank conflicts.

To get maximum performance, it is therefore important to understand
how memory addresses map to memory banks in order to schedule the
memory requests so as to minimize bank conflicts. This is described in
Compute Capability 2.x, Compute Capability 3.x, and Compute Capability 5.x for devices of compute capability
2.x, 3.x, and 5.x, respectively.
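
A small sketch contrasting a conflict-free access pattern with one that causes two-way bank conflicts, assuming 32 banks of consecutive 32-bit words as on devices of compute capability 2.x and 3.x (in 4-byte bank mode), launched with 256 threads per block:

__global__ void bankConflictExample(const float* in, float* out)
{
    __shared__ float tile[256];
    tile[threadIdx.x] = in[threadIdx.x];
    __syncthreads();

    // Stride-1: the 32 threads of a warp read 32 consecutive words, which fall
    // in 32 distinct banks, so the request is conflict-free.
    float conflictFree = tile[threadIdx.x];

    // Stride-2: threads i and i+16 of a warp read words that map to the same
    // bank, causing two-way bank conflicts.
    float twoWay = tile[(threadIdx.x * 2) % 256];

    out[threadIdx.x] = conflictFree + twoWay;
}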

Constant Memory

The constant memory space resides in device memory and is cached in
the constant cache mentioned in Compute Capability 2.x.

A request is then split into as many separate requests as there are
different memory addresses in the initial request, decreasing
throughput by a factor equal to the number of separate requests.

The resulting requests are then serviced at the throughput of the
constant cache in case of a cache hit, or at the throughput of device
memory otherwise.

Texture and Surface Memory

The texture and surface memory spaces reside in device memory and are
cached in texture cache, so a texture fetch or surface read costs one
memory read from device memory only on a cache miss, otherwise it just
costs one read from texture cache. The texture cache is optimized for
2D spatial locality, so threads of the same warp that read texture or
surface addresses that are close together in 2D will achieve best
performance. Also, it is designed for streaming fetches with a constant
latency; a cache hit reduces DRAM bandwidth demand but not fetch
latency.

Reading device memory through texture or surface fetching presents some
benefits that can make it an advantageous alternative to reading device
memory from global or constant memory:

If the memory reads do not follow the access patterns that global
or constant memory reads must follow to get good performance, higher
bandwidth can be achieved providing that there is locality in the
texture fetches or surface reads;

Addressing calculations are performed outside the kernel by
dedicated units;

Packed data may be broadcast to separate variables in a single
operation;

Minimize the use of arithmetic instructions with low throughput; this includes trading precision for speed when it does not
affect the end result, such as using intrinsic instead of regular functions (intrinsic functions are listed in Intrinsic Functions), single-precision instead of double-precision, or flushing denormalized numbers to zero;

Reduce the number of instructions, for example, by optimizing out synchronization points whenever possible as described in
Synchronization Instruction or by using restricted pointers as described in __restrict__.

In this section, throughputs are given in number of operations per clock cycle per multiprocessor. For a warp size of 32,
one instruction corresponds to 32 operations, so if N is the number of operations per clock cycle, the instruction throughput
is N/32 instructions per clock cycle.

All throughputs are for one multiprocessor. They must be multiplied by the number of multiprocessors in the device to get
throughput for the whole device.

Other instructions and functions are implemented on top of the native
instructions. The implementation may be different for devices of
different compute capabilities, and the number of native instructions
after compilation may fluctuate with every compiler version. For
complicated functions, there can be multiple code paths depending on
input. cuobjdump can be used to inspect a particular
implementation in a cubin object.

The implementations of some functions are readily available in the CUDA
header files (math_functions.h,
device_functions.h, ...).

In general, code compiled with -ftz=true (denormalized
numbers are flushed to zero) tends to have higher performance than code
compiled with -ftz=false. Similarly, code compiled with
-prec-div=false (less precise division) tends to have
higher performance than code compiled with
-prec-div=true, and code compiled with
-prec-sqrt=false (less precise square root) tends to
have higher performance than code compiled with
-prec-sqrt=true. The nvcc user manual describes these
compilation flags in more details.

Single-Precision Floating-Point Division

Single-Precision Floating-Point Reciprocal Square Root

To preserve IEEE-754 semantics the compiler can optimize
1.0/sqrtf() into rsqrtf() only when
both reciprocal and square root are approximate, (i.e., with
-prec-div=false and
-prec-sqrt=false). It is therefore recommended to
invoke rsqrtf() directly where desired.

Single-Precision Floating-Point Square Root

Single-precision floating-point square root is implemented as a
reciprocal square root followed by a reciprocal instead of a reciprocal
square root followed by a multiplication so that it gives correct
results for 0 and infinity.

Sine and Cosine

sinf(x), cosf(x),
tanf(x), sincosf(x), and
corresponding double-precision instructions are much more expensive and
even more so if the argument x is large in magnitude.

More precisely, the argument reduction code (see Mathematical Functions for implementation) comprises two
code paths referred to as the fast path and the slow path,
respectively.

The fast path is used for arguments sufficiently small in magnitude
and essentially consists of a few multiply-add operations. The slow
path is used for arguments large in magnitude and consists of lengthy
computations required to achieve correct results over the entire
argument range.

At present, the argument reduction code for the trigonometric
functions selects the fast path for arguments whose magnitude is less
than 105615.0f for the single-precision functions, and
less than 2147483648.0 for the double-precision
functions.

As the slow path requires more registers than the fast path, an
attempt has been made to reduce register pressure in the slow path by
storing some intermediate variables in local memory, which may affect
performance because of local memory high latency and bandwidth (see
Device Memory Accesses). At present, 28 bytes of local
memory are used by single-precision functions, and 44 bytes are used by
double-precision functions. However, the exact amount is subject to
change.

Due to the lengthy computations and use of local memory in the slow
path, the throughput of these trigonometric functions is lower by one
order of magnitude when the slow path reduction is required as opposed
to the fast path reduction.

Integer Arithmetic

Integer division and modulo operation are costly as they compile to up to 20 instructions.
They can be replaced with
bitwise operations in some cases: If n is a power of
2, (i/n) is equivalent to
(i>>log2(n)) and (i%n) is
equivalent to (i&(n-1)); the compiler will perform
these conversions if n is literal.

__brev and __popc map to a single
instruction and
__brevll and __popcll to a
few instructions.

__[u]mul24 are legacy intrinsic functions that no longer have any reason to be used.

Type Conversion

Sometimes, the compiler must insert conversion instructions,
introducing additional execution cycles. This is the case for:

Functions operating on variables of type char or
short whose operands generally need to be converted
to int,

Double-precision floating-point constants (i.e., those constants
defined without any type suffix) used as input to single-precision
floating-point computations (as mandated by C/C++ standards).

This last case can be avoided by using single-precision floating-point
constants, defined with an f suffix such as
3.141592653589793f, 1.0f,
0.5f.

Any flow control instruction (if, switch, do, for, while) can
significantly impact the effective instruction throughput by causing threads of the same warp to diverge (i.e., to follow
different execution paths). If this happens, the different execution paths have to be serialized, increasing the total number
of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to
the same execution path.

To obtain best performance in cases where the control flow depends on the thread ID, the controlling condition should be written
so as to minimize the number of divergent warps. This is possible because the distribution of the warps across the block is
deterministic as mentioned in SIMT Architecture. A trivial example is when the controlling condition only depends on (threadIdx / warpSize) where warpSize is the warp size. In this case, no warp diverges since the controlling condition is perfectly aligned with the warps.
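
For example (a sketch), the following condition depends only on (threadIdx.x / warpSize), so every thread of a given warp takes the same path and no warp diverges:

__global__ void warpAlignedBranch(float* data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // All 32 threads of a warp share the same value of threadIdx.x / warpSize,
    // so this branch never diverges within a warp.
    if ((threadIdx.x / warpSize) % 2 == 0)
        data[tid] *= 2.0f;
    else
        data[tid] += 1.0f;
}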

Sometimes, the compiler may unroll loops or it may optimize out if or switch statements by using branch predication instead, as detailed below. In these cases, no warp can ever diverge. The programmer
can also control loop unrolling using the #pragma unroll directive (see #pragma unroll).

When using branch predication none of the instructions whose execution depends on the controlling condition gets skipped.
Instead, each of them is associated with a per-thread condition code or predicate that is set to true or false based on the
controlling condition and although each of these instructions gets scheduled for execution, only the instructions with a true
predicate are actually executed. Instructions with a false predicate do not write results, and also do not evaluate addresses
or read operands.

The compiler replaces a branch instruction with predicated instructions only if the number of instructions controlled by the
branch condition is less or equal to a certain threshold: If the compiler determines that the condition is likely to produce
many divergent warps, this threshold is 7, otherwise it is 4.

Declaring a function with only the __host__ qualifier is equivalent to declaring it without any of the __host__, __device__, or __global__ qualifiers; in either case the function is compiled for the host only.

The __global__ and __host__ qualifiers cannot be used together.

The __device__ and __host__ qualifiers can be used together however, in which case the function is compiled for both the host and the device. The __CUDA_ARCH__ macro introduced in Application Compatibility can be used to differentiate code paths between host and device:
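
A minimal sketch of such differentiation (func is an illustrative name):

__host__ __device__ void func()
{
#if __CUDA_ARCH__ >= 300
    // Device code path for devices of compute capability 3.x
#elif __CUDA_ARCH__ >= 200
    // Device code path for devices of compute capability 2.x
#elif !defined(__CUDA_ARCH__)
    // Host code path
#endif
}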

Variable type qualifiers specify the memory location on the device of a variable.

An automatic variable declared in device code without any of the __device__,
__shared__ and __constant__ qualifiers described
in this section generally resides in a register. However in some cases the compiler
might choose to place it in local memory, which can have adverse performance
consequences as detailed in Device Memory Accesses.

The __device__ qualifier declares a variable that resides on the device.

At most one of the other type qualifiers defined in the next two sections may be used together with __device__ to further specify which memory space the variable belongs to. If none of them is present, the variable:

Resides in global memory space.

Has the lifetime of an application.

Is accessible from all the threads within the grid and from the host through the runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() / cudaMemcpyToSymbol() / cudaMemcpyFromSymbol()).

May be additionally qualified with the __managed__ qualifier. Such a
variable can be directly referenced from host code, e.g., its address can
be taken or it can be read or written directly from a host function. As a
convenience, the __device__ qualifier is implicit when the __managed__
qualifier is specified.

The __shared__ qualifier, optionally used together with __device__, declares a variable that:

Resides in the shared memory space of a thread block,

Has the lifetime of the block,

Is only accessible from all the threads within the block.

When declaring a variable in shared memory as an external array such as

extern __shared__ float shared[];

the size of the array is determined at launch time (see Execution Configuration). All variables declared in this fashion start at the same address in memory, so the layout of the variables in the
array must be explicitly managed through offsets. For example, if one wants the equivalent of

short array0[128];
float array1[64];
int array2[256];

in dynamically allocated shared memory, one could declare and initialize the arrays the following way:
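
One possible sketch (func is an illustrative enclosing function; note that each pointer must respect the alignment requirement of its type, which the offsets below satisfy):

extern __shared__ float array[];
__device__ void func()      // __device__ or __global__ function
{
    short* array0 = (short*)array;
    float* array1 = (float*)&array0[128];
    int*   array2 = (int*)&array1[64];
}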

Restricted pointers were introduced in C99 to alleviate the aliasing problem that exists in C-type languages, and which inhibits
all kinds of optimization, from code reordering to common sub-expression elimination.

Here is an example subject to the aliasing issue, where use of restricted pointers can help the compiler to reduce the number
of instructions:
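
A sketch of such a function (foo and the particular assignments are illustrative):

void foo(const float* a,
         const float* b,
         float* c)
{
    c[0] = a[0] * b[0];
    c[1] = a[0] * b[0];
    c[2] = a[0] * b[0] * a[1];
    c[3] = a[0] * a[1];
    c[4] = a[0] * b[0];
    c[5] = b[0];
}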

In C-type languages, the pointers a, b, and c may be aliased, so any write through c could modify elements of a or b. This means that to guarantee functional correctness, the compiler cannot load a[0] and b[0] into registers, multiply them, and store the result to both c[0] and c[1], because the results would differ from the abstract execution model if, say, a[0] is really the same location as c[0]. So the compiler cannot take advantage of the common sub-expression. Likewise, the compiler cannot just reorder the computation
of c[4] into the proximity of the computation of c[0] and c[1] because the preceding write to c[3] could change the inputs to the computation of c[4].

By making a, b, and c restricted pointers, the programmer asserts to the compiler that the pointers are in fact not aliased, which in this case
means writes through c would never overwrite elements of a or b. This changes the function prototype as follows:
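
A sketch of the restricted prototype:

void foo(const float* __restrict__ a,
         const float* __restrict__ b,
         float* __restrict__ c);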

Note that all pointer arguments need to be made restricted for the compiler optimizer to derive any benefit. With the __restrict__ keywords added, the compiler can now reorder and do common sub-expression elimination at will, while retaining functionality
identical with the abstract execution model:
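
A sketch of what the compiler can then effectively do, written out by hand here for illustration (common sub-expressions are cached in temporaries):

void foo(const float* __restrict__ a,
         const float* __restrict__ b,
         float* __restrict__ c)
{
    float t0 = a[0];
    float t1 = b[0];
    float t2 = t0 * t1;   // common sub-expression computed once
    float t3 = a[1];
    c[0] = t2;
    c[1] = t2;
    c[4] = t2;
    c[2] = t2 * t3;
    c[3] = t0 * t3;
    c[5] = t1;
}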

These are vector types derived from the basic integer and floating-point types. They are structures and the 1st, 2nd, 3rd,
and 4th components are accessible through the fields x, y, z, and w, respectively. They all come with a constructor function of the form make_<type name>; for example,

int2 make_int2(int x, int y);

which creates a vector of type int2 with value (x, y).

In host code, the alignment requirement of a vector type is equal to the alignment requirement of its base type. This is not
always the case in device code as detailed in Table 3.

The CUDA programming model assumes a device with a weakly-ordered memory model,
that is:

The order in which a CUDA thread writes data to shared memory, global memory, page-locked host memory, or the memory of a
peer device
is not necessarily the order in which the data is observed being written by another CUDA or host thread;

The order in which a CUDA thread reads data from shared memory, global memory, page-locked host memory, or the memory of a
peer device
is not necessarily the order in which the read instructions appear in the program for instructions that are independent of
each other.

For example, suppose that thread 0 executes writeXY() and thread 1 executes readXY() as defined in the following code sample.
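
A minimal sketch consistent with this description (the initial values of X and Y follow the possibilities discussed below):

__device__ int X = 1, Y = 2;

__device__ void writeXY()
{
    X = 10;
    Y = 20;
}

__device__ void readXY()
{
    int A = X;
    int B = Y;
}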

With these definitions, it is possible that A ends up equal to 1 and B equal to 20 for thread 1:

either because at the time thread 1 reads X and Y,
thread 0's write to Y has happened from thread 1's perspective, but thread 0's write to X has not,

or because thread 1 reads Y before X and
thread 0's writes to X and Y happen after thread 1's read of Y and before thread 1's read of X.

In a strongly-ordered memory model,
the only possibilities would be:

A equal to 1 and B equal to 2
(thread 0's writes to X and Y happen after thread 1's read of X and Y),

A equal to 10 and B equal to 2
(thread 0's write to X happens before thread 1's read of X and
thread 0's write to Y happens after thread 1's read of Y),

A equal to 10 and B equal to 20
(thread 0's writes to X and Y happen before thread 1's read of X and Y),

Memory fence functions can be used to enforce some ordering:

void __threadfence_block();

ensures that:

All writes to shared and global memory made by the calling thread before the call to
__threadfence_block() are observed by all threads in the block of the calling thread
as occurring before all writes to shared memory and global memory made by the calling thread after the call to
__threadfence_block();

All reads from shared memory and global memory made by the calling thread before the call to
__threadfence_block() are performed before all reads from shared memory and global memory made by the calling thread after the call to
__threadfence_block().

void __threadfence();

acts as __threadfence_block() for all threads in the block of the calling thread and
also ensures that all writes to global memory made by the calling thread before the call to
__threadfence() are observed by all threads in the device as occurring before all writes to global memory made by the calling thread after
the call to
__threadfence().

void __threadfence_system();

acts as __threadfence_block() for all threads in the block of the calling thread and
also ensures that:

All writes to global memory, page-locked host memory, and the memory of a peer device made by the calling thread before the
call to
__threadfence_system() are observed by all threads in the device, host threads, and all threads in peer devices
as occurring before all writes to global memory, page-locked host memory, and the memory of a peer device
made by the calling thread after the call to __threadfence_system().

All reads from shared memory, global memory, page-locked host memory, and the memory of a peer device made by the calling
thread before the call to
__threadfence_system() are performed before all reads from shared memory, global memory, page-locked host memory, and the memory of a peer device
made by the calling thread after the call to
__threadfence_system().

__threadfence_system() is only supported by devices of compute capability
2.x and higher.

In the previous code sample, inserting a fence function call between X = 10; and Y = 20; and between int A = X; and int B = Y; would ensure that for thread 1,
A will always be equal to 10 if B is equal to 20.
If thread 0 and 1 belong to the same block, it is enough to use __threadfence_block().
If thread 0 and 1 do not belong to the same block, __threadfence() must be used if they are CUDA threads from the same device
and __threadfence_system() must be used if they are CUDA threads from two different devices.

A common use case is when threads consume some data produced by other threads as illustrated by
the following code sample of a kernel that computes the sum of an array of N numbers in
one call. Each block first sums a subset of the array and stores the result in global
memory. When all blocks are done, the last block done reads each of these partial sums
from global memory and sums them to obtain the final result. In order to determine which
block is finished last, each block atomically increments a counter to signal that it is
done with computing and storing its partial sum (see Atomic Functions about atomic
functions). The last block is the one that receives the counter value equal to
gridDim.x-1. If no fence is placed between storing the partial sum
and incrementing the counter, the counter might increment before the partial sum is
stored and therefore, might reach gridDim.x-1 and let the last block
start reading partial sums before they have been actually updated in memory.

Memory fence functions only affect the ordering of memory operations by a thread; they do not ensure that these memory operations
are visible to other threads
(like __syncthreads() does for threads within a block (see Synchronization Functions)).
In the code sample below, the visibility of memory operations on the result variable is ensured by declaring it as volatile (see Volatile Qualifier).

__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;
__global__ void sum(const float* array, unsigned int N,
                    volatile float* result)
{
    // Each block sums a subset of the input array.
    float partialSum = calculatePartialSum(array, N);

    if (threadIdx.x == 0) {

        // Thread 0 of each block stores the partial sum
        // to global memory. The compiler will use
        // a store operation that bypasses the L1 cache
        // since the "result" variable is declared as
        // volatile. This ensures that the threads of
        // the last block will read the correct partial
        // sums computed by all other blocks.
        result[blockIdx.x] = partialSum;

        // Thread 0 makes sure that the incrementation
        // of the "count" variable is only performed after
        // the partial sum has been written to global memory.
        __threadfence();

        // Thread 0 signals that it is done.
        unsigned int value = atomicInc(&count, gridDim.x);

        // Thread 0 determines if its block is the last
        // block to be done.
        isLastBlockDone = (value == (gridDim.x - 1));
    }

    // Synchronize to make sure that each thread reads
    // the correct value of isLastBlockDone.
    __syncthreads();

    if (isLastBlockDone) {

        // The last block sums the partial sums
        // stored in result[0 .. gridDim.x-1].
        float totalSum = calculateTotalSum(result);

        if (threadIdx.x == 0) {

            // Thread 0 of last block stores the total sum
            // to global memory and resets the count
            // variable, so that the next kernel call
            // works properly.
            result[0] = totalSum;
            count = 0;
        }
    }
}

void __syncthreads();

waits until all threads in the thread block have reached this point and all global and shared
memory accesses made by these threads prior to __syncthreads() are
visible to all threads in the block.

__syncthreads() is used to coordinate communication between the threads of
the same block. When some threads within a block access the same addresses in shared or
global memory, there are potential read-after-write, write-after-read, or
write-after-write hazards for some of these memory accesses. These data hazards can be
avoided by synchronizing threads in-between these accesses.

__syncthreads() is allowed in conditional code but only if the conditional
evaluates identically across the entire thread block, otherwise the code execution is
likely to hang or produce unintended side effects.

Devices of compute capability 2.x and higher support three variations of
__syncthreads() described below.

int __syncthreads_count(int predicate);

is identical to __syncthreads() with the additional feature that it
evaluates predicate for all threads of the block and returns the number of threads for
which predicate evaluates to non-zero.

int __syncthreads_and(int predicate);

is identical to __syncthreads() with the additional feature that it
evaluates predicate for all threads of the block and returns non-zero if and only if
predicate evaluates to non-zero for all of them.

int __syncthreads_or(int predicate);

is identical to __syncthreads() with the additional feature that it
evaluates predicate for all threads of the block and returns non-zero if and only if
predicate evaluates to non-zero for any of them.

fetches the region of linear memory specified by the one-dimensional texture object texObj using integer texture coordinate x.
tex1Dfetch() only works with non-normalized coordinates, so only the border and clamp addressing modes are supported.
It does not perform any texture filtering. For integer types, it may optionally promote the integer to single-precision
floating point.

fetches the region of linear memory bound to the one-dimensional texture reference texRef using integer texture coordinate x. tex1Dfetch() only works with non-normalized coordinates, so only the border and clamp addressing modes are supported. It does not perform
any texture filtering. For integer types, it may optionally promote the integer to single-precision floating point.

Besides the functions shown above, 2- and 4-tuples are supported; for example:
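
One such overload, sketched here for the texture reference flavor (it reads uchar4 texels promoted to float4):

float4 tex1Dfetch(
            texture<uchar4, cudaTextureType1D, cudaReadModeNormalizedFloat> texRef,
            int x);

which fetches the region of linear memory bound to texture reference texRef using texture coordinate x, returning the uchar4 texel as a float4.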

fetches the CUDA array bound to the one-dimensional texture reference texRef
using texture coordinate x. Type is equal to DataType except when readMode is equal to cudaReadModeNormalizedFloat (see Texture Reference API),
in which case Type is equal to the matching floating-point type.

fetches the CUDA array or the region of linear memory bound to the two-dimensional texture
reference texRef using texture coordinates x and
y. Type is equal to DataType except when readMode is equal to cudaReadModeNormalizedFloat (see Texture Reference API),
in which case Type is equal to the matching floating-point type.

fetches the CUDA array bound to the three-dimensional texture reference texRef
using texture coordinates x, y, and
z. Type is equal to DataType except when readMode is equal to cudaReadModeNormalizedFloat (see Texture Reference API),
in which case Type is equal to the matching floating-point type.

fetches the CUDA array bound to the one-dimensional layered texture reference
texRef using texture coordinate x and
index layer, as described in Layered Textures. Type is equal to DataType except when readMode is equal to cudaReadModeNormalizedFloat (see Texture Reference API),
in which case Type is equal to the matching floating-point type.

fetches the CUDA array bound to the two-dimensional layered texture
reference texRef using texture coordinates
x and y, and index
layer, as described in Texture Memory. Type is equal to DataType except when readMode is equal to cudaReadModeNormalizedFloat (see Texture Reference API),
in which case Type is equal to the matching floating-point type.

fetches the CUDA array bound to the cubemap texture reference texRef using
texture coordinates x, y, and
z, as described in Cubemap Textures.
Type is equal to DataType except when readMode is equal to cudaReadModeNormalizedFloat (see Texture Reference API),
in which case Type is equal to the matching floating-point type.

fetches the CUDA array bound to the cubemap layered texture reference texRef
using texture coordinates x, y,
and z, and index layer, as described in Cubemap Layered Textures. Type is equal to DataType except when readMode is equal to cudaReadModeNormalizedFloat (see Texture Reference API),
in which case Type is equal to the matching floating-point type.

fetches the CUDA array bound to the 2D texture reference texRef using
texture coordinates x and y and the comp parameter as
described in Texture Gather. Type is a 4-component vector type. It is based on the base type of DataType except when readMode is equal to cudaReadModeNormalizedFloat (see Texture Reference API),
in which case it is always float4.

In the sections below, boundaryMode specifies the
boundary mode, that is how out-of-range surface coordinates are handled;
it is equal to either cudaBoundaryModeClamp, in which
case out-of-range coordinates are clamped to the valid range, or
cudaBoundaryModeZero, in which case out-of-range reads
return zero and out-of-range writes are ignored, or
cudaBoundaryModeTrap, in which case out-of-range
accesses cause the kernel execution to fail.

clock_t clock();

when executed in device code, returns the value of a per-multiprocessor counter that is incremented every clock cycle. Sampling
this counter at the beginning and at the end of a kernel, taking the difference of the two samples, and recording the result
per thread provides a measure for each thread of the number of clock cycles taken by the device to completely execute the
thread, but not of the number of clock cycles the device actually spent executing thread instructions. The former number is
greater than the latter since threads are time sliced.

An atomic function performs a read-modify-write atomic operation on one 32-bit or 64-bit word residing in global or shared
memory. For example, atomicAdd() reads a word at some address in global or shared memory, adds a number to it, and writes the result back to the same address.
The operation is atomic in the sense that it is guaranteed to be performed without interference from other threads. In other
words, no other thread can access this address until the operation is complete. Atomic functions can only be used in device
functions and atomic functions operating on mapped page-locked memory (Mapped Memory) are not atomic from the point of view of the host or other devices.

Note that any atomic operation can be implemented based on atomicCAS() (Compare And Swap). For example, atomicAdd() for double-precision floating-point numbers can be implemented as follows:
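
A sketch of this well-known compare-and-swap loop (it compares the raw 64-bit patterns rather than the doubles to avoid hanging on NaN values):

__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // Attempt to swap in (assumed + val); retry if another thread
        // modified the word in the meantime.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}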

reads the 32-bit or 64-bit word old located at the address
address in global or shared memory, computes (old + val),
and stores the result back to memory at the same address. These three operations are
performed in one atomic transaction. The function returns old.

The floating-point version of atomicAdd() is only supported by devices of
compute capability 2.x and higher.

reads the 32-bit word old located at the address address in global or shared memory, computes (old - val), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction.
The function returns old.

reads the 32-bit or 64-bit word old located at the address
address in global or shared memory and stores val
back to memory at the same address. These two operations are performed in one atomic
transaction. The function returns old.

reads the 32-bit or 64-bit word old located at the address address in
global or shared memory, computes the minimum of old and val,
and stores the result back to memory at the same address. These three operations are performed in
one atomic transaction. The function returns old.

The 64-bit version of atomicMin() is only supported by devices of
compute capability 3.5 and higher.

reads the 32-bit or 64-bit word old located at the address address in
global or shared memory, computes the maximum of old and
val, and stores the result back to memory at the same address.
These three operations are performed in one atomic transaction. The function returns
old.

The 64-bit version of atomicMax() is only supported by devices of
compute capability 3.5 and higher.

reads the 32-bit word old located at the address
address in global or shared memory, computes ((old >=
val) ? 0 : (old+1)), and stores the result back to memory at the same
address. These three operations are performed in one atomic transaction. The
function returns old.

reads the 32-bit word old located at the address
address in global or shared memory, computes (((old ==
0) | (old > val)) ? val : (old-1) ), and stores the result back to
memory at the same address. These three operations are performed in one atomic
transaction. The function returns old.

reads the 32-bit or 64-bit word old located at the address
address in global or shared memory, computes (old ==
compare ? val : old) , and stores the result back to memory at the same
address. These three operations are performed in one atomic transaction. The
function returns old (Compare And Swap).

reads the 32-bit or 64-bit word old located at the address
address in global or shared memory, computes (old
& val), and stores the result back to memory at the same
address. These three operations are performed in one atomic transaction. The
function returns old.

The 64-bit version of atomicAnd() is only supported by devices of
compute capability 3.5 and higher.

reads the 32-bit or 64-bit word old located at the address
address in global or shared memory, computes (old |
val), and stores the result back to memory at the same address. These
three operations are performed in one atomic transaction. The function returns
old.

The 64-bit version of atomicOr() is only supported by devices of
compute capability 3.5 and higher.

reads the 32-bit or 64-bit word old located at the address
address in global or shared memory, computes (old ^
val), and stores the result back to memory at the same address. These
three operations are performed in one atomic transaction. The function returns
old.

The 64-bit version of atomicXor() is only supported by devices of
compute capability 3.5 and higher.

The warp vote functions allow the threads of a given warp to perform a
reduction-and-broadcast operation. These functions take as input an
integer predicate from each thread in the warp and
compare those values with zero. The results of the comparisons are
combined (reduced) across the active threads of the warp
in one of the following ways, broadcasting a single return value to
each participating thread:

__all(predicate):

Evaluate predicate for all active threads of the
warp and return non-zero if and only if predicate
evaluates to non-zero for all of them.

__any(predicate):

Evaluate predicate for all active threads of the
warp and return non-zero if and only if predicate
evaluates to non-zero for any of them.

__ballot(predicate):

Evaluate predicate for all active threads of the
warp and return an integer whose Nth bit is set if and only if
predicate evaluates to non-zero for the Nth thread
of the warp and the Nth thread is active.

Notes

For each of these warp vote operations, the result excludes
threads that are inactive (e.g., due to
warp divergence). Inactive threads are represented by 0 bits in the value
returned by __ballot() and are not considered in
the reductions performed by __all() and
__any().

The __shfl() intrinsics permit exchanging of a
variable between threads within a warp without use of shared memory.
The exchange occurs simultaneously for all active threads within the
warp, moving 4 bytes of data per thread. Exchange of 8-byte quantities
must be broken into two separate invocations of
__shfl().

Threads within a warp are referred to as lanes, and for
devices of compute capability 3.x may have an index between 0 and
warpSize-1 (inclusive). Four source-lane addressing
modes are supported:

__shfl()

Direct copy from indexed lane

__shfl_up()

Copy from a lane with lower ID relative to caller

__shfl_down()

Copy from a lane with higher ID relative to caller

__shfl_xor()

Copy from a lane based on bitwise XOR of own lane ID

Threads may only read data from another thread which is actively
participating in the __shfl() command. If the target
thread is inactive, the
retrieved value is undefined.

All the __shfl() intrinsics take an optional width
parameter which permits sub-division of the warp into segments - for
example to exchange data between 4 groups of 8 lanes in a SIMD manner. If
width is less than warpSize then each subsection of the
warp behaves as a separate entity with a starting logical lane ID of 0. A
thread may only exchange data with others in its own subsection. width
must have a value which is a power of 2 so that the warp can be
subdivided equally; results are undefined if width is not a power of 2,
or is a number greater than warpSize.

__shfl() returns the value of var held
by the thread whose ID is given by srcLane. If
srcLane is outside the range
[0:width-1], then the thread's own value of var is
returned.

__shfl_up() calculates a source lane ID by subtracting
delta from the caller's lane ID. The value of
var held by the resulting lane ID is returned: in
effect, var is shifted up the warp by
delta lanes. The source lane index will not wrap around
the value of width, so effectively the lower
delta lanes will be unchanged.

__shfl_down() calculates a source lane ID by adding
delta to the caller's lane ID. The value of
var held by the resulting lane ID is returned: this has
the effect of shifting var down the warp by
delta lanes. As for __shfl_up(), the ID
number of the source lane will not wrap around the value of width and so
the upper delta lanes will remain unchanged.

__shfl_xor() calculates a source lane ID by performing
a bitwise XOR of the caller's lane ID with laneMask: the
value of var held by the resulting lane ID is returned.
If the resulting lane ID falls outside the range permitted by
width, the thread's own value of var is
returned. This mode implements a butterfly addressing pattern such as is
used in tree reduction and broadcast.

All __shfl() intrinsics return the 4-byte word referenced by var from the source lane ID as an unsigned integer. If the source lane ID
is out of range or the source thread has exited, the calling thread's own var is returned.
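
As an illustrative sketch of how these intrinsics compose, here is a warp-level sum reduction (warpReduceSum is not a CUDA function; it assumes all 32 lanes of the warp are active):

__inline__ __device__ int warpReduceSum(int val)
{
    // Each iteration halves the number of lanes holding partial sums;
    // after log2(warpSize) steps, lane 0 holds the sum of the whole warp.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);
    return val;
}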

Each multiprocessor has a set of sixteen hardware counters that an
application can increment with a single instruction by calling the
__prof_trigger() function.

void __prof_trigger(int counter);

increments by one per warp the per-multiprocessor hardware counter of
index counter. Counters 8 to 15 are reserved and
should not be used by applications.

The value of counters 0, 1, ..., 7 can be obtained via nvprof by
nvprof --events prof_trigger_0x where x
is 0, 1, ..., 7. The value of those counters for the first multiprocessor
can also be obtained via the old CUDA command-line profiler by listing
prof_trigger_00, prof_trigger_01, ..., prof_trigger_07
, etc. in the profiler.conf file (see the
profiler manual for more details). All counters are reset before each
kernel launch (note that when collecting counters, kernel launches are
synchronous as mentioned in Concurrent Execution between Host and Device).

Assertion is only supported by devices of compute capability 2.x and higher. It is not supported on Mac OS, regardless of the device,
and loading a module that references the assert function on Mac OS will fail.

void assert(int expression);

stops the kernel execution if expression is equal to zero. If the
program is run within a debugger, this triggers a breakpoint and the debugger can be
used to inspect the current state of the device. Otherwise, each thread for which
expression is equal to zero prints a message to stderr
after synchronization with the host via cudaDeviceSynchronize(),
cudaStreamSynchronize(), or
cudaEventSynchronize(). The format of this message is as
follows:

Any subsequent host-side synchronization calls made for the same device will return
cudaErrorAssert. No more commands can be sent to this device until
cudaDeviceReset() is called to reinitialize the device.

If expression is different from zero, the kernel execution is
unaffected.

Assertions are for debugging purposes. They can affect performance and it is therefore
recommended to disable them in production code. They can be disabled at compile time by
defining the NDEBUG preprocessor macro before including
assert.h. Note that expression should not be an
expression with side effects (something like (++i > 0), for example),
otherwise disabling the assertion will affect the functionality of the code.
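
A sketch of how assert() might be used (testAssert and the launch configuration are illustrative):

#include <assert.h>

__global__ void testAssert(void)
{
    int is_one = 1;
    int should_be_one = 0;

    // This assertion has no effect since the expression is non-zero.
    assert(is_one);

    // This assertion stops kernel execution.
    assert(should_be_one);
}

int main()
{
    testAssert<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}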

Formatted output is only supported by devices of compute capability 2.x and higher.

int printf(const char* format[, arg, ...]);

prints formatted output from a kernel to a host-side output stream.

The in-kernel printf() function behaves in a similar way to the standard
C-library printf() function, and the user is referred to the host system's manual
pages for a complete description of printf() behavior. In essence,
the string passed in as format is output to a stream on the host,
with substitutions made from the argument list wherever a format specifier is
encountered. Supported format specifiers are listed below.

The printf() command is executed as any other device-side function:
per-thread, and in the context of the calling thread. From a multi-threaded kernel,
this means that a straightforward call to printf() will be executed
by every thread, using that thread's data as specified. Multiple versions of the
output string will then appear at the host stream, once for each thread which
encountered the printf().

It is up to the programmer to limit the output to a single thread if only a single output string is desired (see Examples for an illustrative example).

Unlike the C-standard printf(), which returns the number of characters
printed, CUDA's printf() returns the number of arguments parsed. If
no arguments follow the format string, 0 is returned. If the format string is NULL,
-1 is returned. If an internal error occurs, -2 is returned.

As for standard printf(), format specifiers take the
form: %[flags][width][.precision][size]type

The following fields are supported (see widely-available documentation
for a complete description of all behaviors):

Flags: `#' ` ' `0' `+' `-'

Width: `*' `0-9'

Precision: `0-9'

Size: `h' `l' `ll'

Type: `%cdiouxXpeEfgGaAs'

Note that CUDA's printf() will accept any combination
of flag, width, precision, size and type, whether or not overall they
form a valid format specifier. In other words, "%hd"
will be accepted and printf will expect a double-precision variable in
the corresponding location in the argument list.

Final formatting of the printf() output takes place
on the host system. This means that the format string must be
understood by the host-system's compiler and C library. Every effort
has been made to ensure that the format specifiers supported by CUDA's
printf function form a universal subset from the most common host
compilers, but exact behavior will be host-OS-dependent.

As described in Format Specifiers,
printf() will accept all combinations of valid
flags and types. This is because it cannot determine what will and will
not be valid on the host system where the final output is formatted.
The effect of this is that output may be undefined if the program emits
a format string which contains invalid combinations.

The printf() command can accept at most 32 arguments
in addition to the format string. Additional arguments beyond this will
be ignored, and the format specifier output as-is.

Owing to the differing size of the long type across platforms (four bytes
on 64-bit Windows platforms, eight bytes on other 64-bit platforms), a kernel that is
compiled on a non-Windows 64-bit machine but then run on a win64 machine will see
corrupted output for all format strings that include
"%ld". It is recommended that the compilation platform
match the execution platform to ensure safety.

The output buffer for printf() is set to a fixed size
before kernel launch (see Associated Host-Side API).
It is circular and if more output is produced during kernel execution
than can fit in the buffer, older output is overwritten. It is flushed
only when one of these actions is performed:

Kernel launch via <<<>>> or
cuLaunchKernel() (at the start of the launch, and if
the CUDA_LAUNCH_BLOCKING environment variable is set to 1, at the end
of the launch as well),

Memory copies via any blocking version of
cudaMemcpy*() or cuMemcpy*(),

Module loading/unloading via cuModuleLoad() or
cuModuleUnload(),

Context destruction via cudaDeviceReset() or
cuCtxDestroy().

Prior to executing a stream callback added by cudaStreamAddCallback
or cuStreamAddCallback.

Note that the buffer is not flushed automatically when the program
exits. The user must call cudaDeviceReset() or
cuCtxDestroy() explicitly, as shown in the examples
below.

Internally printf() uses a shared data structure and so it is possible that calling printf() might change the order of execution of threads.
In particular, a thread which calls printf() might take a longer execution path than one which does not call printf(), and that path length is dependent upon the parameters of the printf().
Note, however, that CUDA makes no guarantees of thread execution order except at explicit __syncthreads() barriers, so it is impossible to tell whether execution order has been modified by printf() or by other scheduling behaviour in the hardware.
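
A sketch of a kernel that exercises the behaviors discussed in this section (helloCUDA and its launch configuration are illustrative):

#include <stdio.h>

__global__ void helloCUDA(float f)
{
    // Every thread of the grid executes this printf() with its own threadIdx.x.
    printf("Hello thread %d, f=%f\n", threadIdx.x, f);
}

int main()
{
    helloCUDA<<<1, 5>>>(1.2345f);
    // Synchronizing flushes the device-side printf() buffer to the host stream.
    cudaDeviceSynchronize();
    return 0;
}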

Notice how each thread encounters the printf() command, so there are as
many lines of output as there were threads launched in the grid. As expected, global
values (i.e., float f) are common between all threads, and local values
(i.e., threadIdx.x) are distinct per-thread.

Dynamic global memory allocation and operations are only supported by devices of compute capability 2.x and higher.

void* malloc(size_t size);
void free(void* ptr);

allocate and free memory dynamically from a fixed-size heap in global memory.

void* memcpy(void* dest, const void* src, size_t size);

copies size bytes from the memory location pointed to by src to the memory location pointed to by dest.

void* memset(void* ptr, int value, size_t size);

sets size bytes of the memory block pointed to by ptr to value (interpreted as an unsigned char).

The CUDA in-kernel malloc() function allocates at least
size bytes from the device heap and returns a pointer to the
allocated memory or NULL if insufficient memory exists to fulfill the request. The
returned pointer is guaranteed to be aligned to a 16-byte boundary.

The CUDA in-kernel free() function deallocates the memory pointed to by
ptr, which must have been returned by a previous call to
malloc(). If ptr is NULL, the call to
free() is ignored. Repeated calls to free()
with the same ptr have undefined behavior.

The memory allocated by a given CUDA thread via malloc() remains
allocated for the lifetime of the CUDA context, or until it is explicitly released
by a call to free(). It can be used by any other CUDA threads even
from subsequent kernel launches. Any CUDA thread may free memory allocated by
another thread, but care should be taken to ensure that the same pointer is not
freed more than once.

The device memory heap has a fixed size that must be specified before any program using
malloc() or free() is loaded into the context. A
default heap of eight megabytes is allocated if any program uses
malloc() without explicitly specifying the heap size.

The following API functions get and set the heap size:

cudaDeviceGetLimit(size_t* size, cudaLimitMallocHeapSize)

cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t size)

The heap size granted will be at least size bytes.
cuCtxGetLimit() and cudaDeviceGetLimit() return the
currently requested heap size.

The actual memory allocation for the heap occurs when a module is loaded into the context, either
explicitly via the CUDA driver API (see Module), or implicitly via the CUDA runtime
API (see CUDA C Runtime). If the memory allocation fails, the module load will generate a
CUDA_ERROR_SHARED_OBJECT_INIT_FAILED error.

Heap size cannot be changed once a module load has occurred and it does not resize dynamically according to need.

Memory reserved for the device heap is in addition to memory allocated through host-side CUDA API
calls such as cudaMalloc().
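
A sketch of per-thread allocation tying these pieces together (mallocTest and the chosen heap size are illustrative):

#include <stdio.h>
#include <stdlib.h>

__global__ void mallocTest()
{
    // Each thread allocates and initializes its own buffer from the device heap.
    size_t size = 123;
    char* ptr = (char*)malloc(size);
    memset(ptr, 0, size);
    printf("Thread %d got pointer: %p\n", threadIdx.x, ptr);
    free(ptr);
}

int main()
{
    // The heap size must be set before any kernel that uses malloc() is launched.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128 * 1024 * 1024);
    mallocTest<<<1, 5>>>();
    cudaDeviceSynchronize();
    return 0;
}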

Any call to a __global__ function must specify the
execution configuration for that call. The execution
configuration defines the dimension of the grid and blocks that will be
used to execute the function on the device, as well as the associated
stream (see CUDA C Runtime for a description of
streams).

The execution configuration is specified by inserting an expression of
the form <<< Dg, Db, Ns, S >>> between the
function name and the parenthesized argument list, where:

Dg is of type dim3 (see dim3) and specifies the dimension and size of the grid,
such that Dg.x * Dg.y * Dg.z equals the number of
blocks being launched;

Db is of type dim3 (see dim3) and specifies the dimension and size of each
block, such that Db.x * Db.y * Db.z equals the
number of threads per block;

Ns is of type size_t and
specifies the number of bytes in shared memory that is dynamically
allocated per block for this call in addition to the statically
allocated memory; this dynamically allocated memory is used by any of
the variables declared as an external array as mentioned in __shared__; Ns is an optional argument
which defaults to 0;

S is of type cudaStream_t and
specifies the associated stream; S is an optional
argument which defaults to 0.

As an example, a function declared as

__global__void Func(float* parameter);

must be called like this:

Func<<< Dg, Db, Ns >>>(parameter);

The arguments to the execution configuration are evaluated before the
actual function arguments.

The function call will fail if Dg or
Db are greater than the maximum sizes allowed for the
device as specified in Compute Capabilities, or if
Ns is greater than the maximum amount of shared memory
available on the device, minus the amount of shared memory required for
static allocation.

As discussed in detail in Multiprocessor Level, the fewer registers a kernel uses, the more threads and thread blocks are likely to reside on a multiprocessor, which can
improve performance.

Therefore, the compiler uses heuristics to minimize register usage while keeping register spilling (see Device Memory Accesses) and instruction count to a minimum. An application can optionally aid these heuristics by providing additional information
to the compiler in the form of launch bounds that are specified using the __launch_bounds__() qualifier in the definition of a __global__ function:
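
A sketch of the syntax (the kernel signature is illustrative; maxThreadsPerBlock and minBlocksPerMultiprocessor stand for compile-time constants supplied by the application):

__global__ void
__launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
MyKernel(float* data)
{
    // kernel body
}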

maxThreadsPerBlock specifies the maximum number of threads per block with which the application will ever launch MyKernel(); it compiles to the .maxntid PTX directive;

minBlocksPerMultiprocessor is optional and specifies the desired minimum number of resident blocks per multiprocessor; it compiles to the .minnctapersm PTX directive.

If launch bounds are specified, the compiler first derives from them the upper limit L on the number of registers the kernel should use to ensure that minBlocksPerMultiprocessor blocks (or a single block if minBlocksPerMultiprocessor is not specified) of maxThreadsPerBlock threads can reside on the multiprocessor (see Hardware Multithreading for the relationship between the number of registers used by a kernel and the number of registers allocated per block). The
compiler then optimizes register usage in the following way:

If the initial register usage is higher than L, the compiler reduces it further until it becomes less or equal to L, usually at the expense of more local memory usage and/or higher number of instructions;

If the initial register usage is lower than L:

If maxThreadsPerBlock is specified and minBlocksPerMultiprocessor is not, the compiler uses maxThreadsPerBlock to determine the register usage thresholds for the transitions between n and n+1 resident blocks (i.e., when using one less register makes room for an additional resident block as in the example of Multiprocessor Level) and then applies similar heuristics as when no launch bounds are specified;

If both minBlocksPerMultiprocessor and maxThreadsPerBlock are specified, the compiler may increase register usage as high as L to reduce the number of instructions and better hide single thread instruction latency.

A kernel will fail to launch if it is executed with more threads per block than its launch bound maxThreadsPerBlock.

Optimal launch bounds for a given kernel will usually differ across major architecture revisions. The sample code below shows
how this is typically handled in device code using the __CUDA_ARCH__ macro introduced in Application Compatibility:
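
A sketch of such handling (THREADS_PER_BLOCK, MY_KERNEL_MIN_BLOCKS, and the chosen values are illustrative assumptions; MY_KERNEL_MAX_THREADS is the macro referenced in the text below):

#define THREADS_PER_BLOCK          256
#if __CUDA_ARCH__ >= 200
    #define MY_KERNEL_MAX_THREADS  (2 * THREADS_PER_BLOCK)
    #define MY_KERNEL_MIN_BLOCKS   3
#else
    #define MY_KERNEL_MAX_THREADS  THREADS_PER_BLOCK
    #define MY_KERNEL_MIN_BLOCKS   2
#endif

// Device code
__global__ void
__launch_bounds__(MY_KERNEL_MAX_THREADS, MY_KERNEL_MIN_BLOCKS)
MyKernel(float* data)
{
    // kernel body
}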

In the common case where MyKernel is invoked with the maximum number of threads per block (specified as the first parameter of __launch_bounds__()), it is tempting to use MY_KERNEL_MAX_THREADS as the number of threads per block in the execution configuration:

// Host code
MyKernel<<<blocksPerGrid, MY_KERNEL_MAX_THREADS>>>(...);

This will not work however since __CUDA_ARCH__ is undefined in host code as mentioned in Application Compatibility, so MyKernel will launch with 256 threads per block even when __CUDA_ARCH__ is greater than or equal to 200. Instead the number of threads per block should be determined:

Either at compile time using a macro that does not depend on __CUDA_ARCH__, for example:
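
A sketch, assuming the illustrative THREADS_PER_BLOCK macro from above:

// Host code
MyKernel<<<blocksPerGrid, THREADS_PER_BLOCK>>>(...);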

Register usage is reported by the --ptxas-options=-v compiler option. The number of resident blocks can be derived from the occupancy reported by the CUDA profiler (see Device Memory Accesses for a definition of occupancy).

Register usage can also be controlled for all __global__ functions in a file using the maxrregcount compiler option. The value of maxrregcount is ignored for functions with launch bounds.

By default, the compiler unrolls small loops with a known trip count. The #pragma
unroll
directive however can be used to control unrolling of any given loop. It
must be placed immediately before the loop and only applies to that loop. It is optionally
followed by a number that specifies how many times the loop must be unrolled.

For example, in this code sample:

#pragma unroll 5
for (int i = 0; i < n; ++i)

the loop will be unrolled 5 times. The compiler will also insert code to ensure correctness (in
the example above, to ensure that there will only be n iterations if
n is less than 5, for example). It is up to the programmer to make sure
that the specified unroll number gives the best performance.

#pragma unroll 1 will prevent the compiler from ever unrolling a loop.

If no number is specified after #pragma unroll, the loop is completely unrolled
if its trip count is constant, otherwise it is not unrolled at all.

PTX ISA version 3.0 includes SIMD (Single Instruction, Multiple
Data) video instructions which operate on pairs of 16-bit
values and quads of 8-bit values. These are available on
devices of compute capability 3.0.

The SIMD video instructions are:

vadd2, vadd4

vsub2, vsub4

vavrg2, vavrg4

vabsdiff2, vabsdiff4

vmin2, vmin4

vmax2, vmax4

vset2, vset4

PTX instructions, such as the SIMD video instructions,
can be included in CUDA programs by way of the assembler
(asm()) statement.
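
For instance, a sketch using the vabsdiff4 instruction through inline PTX (the wrapper function sad4 is illustrative):

__device__ unsigned int sad4(unsigned int A, unsigned int B, unsigned int C)
{
    unsigned int result;
    // vabsdiff4 computes the per-byte absolute differences of A and B and,
    // with the optional .add, accumulates them into C.
    asm("vabsdiff4.u32.u32.u32.add %0, %1, %2, %3;"
        : "=r"(result)
        : "r"(A), "r"(B), "r"(C));
    return result;
}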

This uses the vabsdiff4 instruction to compute
an integer quad byte SIMD sum of absolute differences. The
absolute difference value is computed for each byte of the
unsigned integers A and B in SIMD fashion. The optional
accumulate operation (.add) is specified to
sum these differences.

Refer to the document "Using Inline PTX Assembly in CUDA"
for details on using the assembly statement in your code.
Refer to the PTX ISA documentation ("Parallel Thread
Execution ISA Version 3.0" for example) for details on the
PTX instructions for the version of PTX that you are using.

Dynamic Parallelism is an extension to the CUDA programming model enabling
a CUDA kernel to create and synchronize with new work directly on the GPU. The creation
of parallelism dynamically at whichever point in a program that it is needed offers
exciting new capabilities.

The ability to create work directly from the GPU can reduce the need to transfer
execution control and data between host and device, as launch configuration decisions
can now be made at runtime by threads executing on the device. Additionally,
data-dependent parallel work can be generated inline within a kernel at run-time, taking
advantage of the GPU's hardware schedulers and load balancers dynamically and adapting
in response to data-driven decisions or workloads. Algorithms and programming patterns
that had previously required modifications to eliminate recursion, irregular loop
structure, or other constructs that do not fit a flat, single-level of parallelism may
more transparently be expressed.

This document describes the extended capabilities of CUDA which enable Dynamic
Parallelism, including the modifications and additions to the CUDA programming model
necessary to take advantage of these, as well as guidelines and best practices for
exploiting this added capacity.

Dynamic Parallelism is only supported by devices of compute capability 3.5 and
higher.

A Grid is a collection of Threads. Threads
in a Grid execute a Kernel Function and
are divided into Thread Blocks.

Thread Block

A Thread Block is a group of threads which execute on
the same multiprocessor (SMX). Threads
within a Thread Block have access to shared memory
and can be explicitly synchronized.

Kernel Function

A Kernel Function is an implicitly parallel subroutine
that executes under the CUDA execution and memory
model for every Thread in a Grid.

Host

The Host refers to the execution environment that
initially invoked CUDA. Typically the thread running
on a system's CPU processor.

Parent

A Parent Thread, Thread Block, or Grid is
one that has launched new grid(s), the
Child Grid(s). The Parent is not
considered completed until all of its launched Child
Grids have also completed.

Child

A Child thread, block, or grid is one that has been
launched by a Parent grid. A Child grid must
complete before the Parent Thread, Thread Block, or
Grid are considered complete.

Thread Block Scope

Objects with Thread Block Scope have the lifetime of a
single Thread Block. They only have defined behavior
when operated on by Threads in the Thread Block that
created the object and are destroyed when the Thread
Block that created them is complete.

Device Runtime

The Device Runtime refers to the runtime system and APIs
available to enable Kernel Functions to use Dynamic
Parallelism.

The CUDA execution model is based on primitives of threads, thread blocks, and grids,
with kernel functions defining the program executed by individual threads within a
thread block and grid. When a kernel function is invoked the grid's properties are
described by an execution configuration, which has a special syntax in CUDA. Support for
dynamic parallelism in CUDA extends the ability to configure, launch, and synchronize
upon new grids to threads that are running on the device.

A device thread that configures and launches a new grid belongs to the
parent grid, and the grid created by the invocation is a child grid.

The invocation and completion of child grids is properly nested, meaning
that the parent grid is not considered complete until all child grids
created by its threads have completed. Even if the invoking threads do
not explicitly synchronize on the child grids launched, the runtime
guarantees an implicit synchronization between the parent and child.

On both host and device, the CUDA runtime offers an API for launching kernels, for
waiting for launched work to complete, and for tracking dependencies between launches
via streams and events. On the host system, the state of launches and the CUDA
primitives referencing streams and events are shared by all threads within a process;
however processes execute independently and may not share CUDA objects.

A similar hierarchy exists on the device: launched kernels and CUDA objects are visible
to all threads in a thread block, but are independent between thread blocks. This means
for example that a stream may be created by one thread and used by any other thread in
the same thread block, but may not be shared with threads in any other thread block.

CUDA runtime operations from any thread, including kernel launches, are visible across a
thread block. This means that an invoking thread in the parent grid may perform
synchronization on the grids launched by that thread, by other threads in the thread
block, or on streams created within the same thread block. Execution of a thread block
is not considered complete until all launches by all threads in the block have
completed. If all threads in a block exit before all child launches have completed, a
synchronization operation will automatically be triggered.

CUDA Streams and Events allow control over
dependencies between grid launches: grids launched into the same stream
execute in-order, and events may be used to create dependencies between
streams. Streams and events created on the device serve this exact same
purpose.

Streams and events created within a grid exist within thread block
scope but have undefined behavior when used outside of the thread block
where they were created. As described above, all work launched by a
thread block is implicitly synchronized when the block exits; work
launched into streams is included in this, with all dependencies
resolved appropriately. The behavior of operations on a stream that has
been modified outside of thread block scope is undefined.

Streams and events created on the host have undefined behavior when
used within any kernel, just as streams and events created by a parent
grid have undefined behavior if used within a child grid.

The ordering of kernel launches from the device runtime follows CUDA Stream ordering
semantics. Within a thread block, all kernel launches into the same stream are executed
in-order. With multiple threads in the same thread block launching into the same stream,
the ordering within the stream is dependent on the thread scheduling within the block,
which may be controlled with synchronization primitives such as
__syncthreads().

Note that because streams are shared by all threads within a thread block, the implicit
NULL stream is also shared. If multiple threads in a thread block
launch into the implicit stream, then these launches will be executed in-order. If
concurrency is desired, explicit named streams should be used.

Dynamic Parallelism enables concurrency to be expressed more easily within a
program; however, the device runtime introduces no new concurrency guarantees within the
CUDA execution model. There is no guarantee of concurrent execution between any number
of different thread blocks on a device.

The lack of concurrency guarantee extends to parent thread blocks and their child grids.
When a parent thread block launches a child grid, the child is not guaranteed to begin
execution until the parent thread block reaches an explicit synchronization point (e.g.
cudaDeviceSynchronize()).

While concurrency will often easily be achieved, it may vary as a function of
device configuration, application workload, and runtime scheduling. It is therefore
unsafe to depend upon any concurrency between different thread blocks.

There is no multi-GPU support from the device runtime; the device runtime is only capable
of operating on the device upon which it is currently executing. It is permitted,
however, to query properties for any CUDA capable device in the system.

Parent and child grids have coherent access to global memory, with weak
consistency guarantees between child and parent. There are two points in
the execution of a child grid when its view of memory is fully consistent
with the parent thread: when the child grid is invoked by the parent, and
when the child grid completes as signaled by a synchronization API
invocation in the parent thread.

All global memory operations in the parent thread prior to the child
grid's invocation are visible to the child grid. All memory operations of
the child grid are visible to the parent after the parent has
synchronized on the child grid's completion.

In the following example, the child grid executing
child_launch is only guaranteed to see the modifications
to data made before the child grid was launched. Since
thread 0 of the parent is performing the launch, the child will be
consistent with the memory seen by thread 0 of the parent. Due to the
first __syncthreads() call, the child will see
data[0]=0, data[1]=1, ...,
data[255]=255 (without the
__syncthreads() call, only data[0]
would be guaranteed to be seen by the child). When the child grid
returns, thread 0 is guaranteed to see modifications made by the threads
in its child grid. Those modifications become available to the other
threads of the parent grid only after the second
__syncthreads() call:
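
A sketch matching this description (parent_launch and host_launch are illustrative names; child_launch is the child kernel referred to above):

__global__ void child_launch(int* data)
{
    data[threadIdx.x] = data[threadIdx.x] + 1;
}

__global__ void parent_launch(int* data)
{
    data[threadIdx.x] = threadIdx.x;

    // Makes the writes of all 256 parent threads visible before the launch.
    __syncthreads();

    if (threadIdx.x == 0) {
        child_launch<<<1, 256>>>(data);
        // Parent thread 0 synchronizes on the child's completion.
        cudaDeviceSynchronize();
    }

    // Makes the child's modifications visible to the other parent threads.
    __syncthreads();
}

void host_launch(int* data)
{
    parent_launch<<<1, 256>>>(data);
}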

Zero-copy system memory has identical coherence and consistency guarantees to global
memory, and follows the semantics detailed above. A kernel may not allocate or free
zero-copy memory, but may use pointers to zero-copy passed in from the host program.

Constants are immutable and may not be modified from the device, even between parent and
child launches. That is to say, the value of all __constant__ variables
must be set from the host prior to launch. Constant memory is inherited automatically by
all child kernels from their respective parents.

Taking the address of a constant memory object from within a kernel thread has the same
semantics as for all CUDA programs, and passing that pointer from parent to child or
from a child to parent is naturally supported.

Shared and Local memory is private to a thread block or thread, respectively, and is not
visible or coherent between parent and child. Behavior is undefined when an object in
one of these locations is referenced outside of the scope within which it belongs, and
may cause an error.

The NVIDIA compiler will attempt to warn if it can detect that a pointer to local or
shared memory is being passed as an argument to a kernel launch. At runtime, the
programmer may use the __isGlobal() intrinsic to determine whether a
pointer references global memory and so may safely be passed to a child launch.

Note that calls to cudaMemcpy*Async() or
cudaMemset*Async() may invoke new child kernels on the device in
order to preserve stream semantics. As such, passing shared or local memory pointers to
these APIs is illegal and will return an error.

Local memory is private storage for an executing thread, and is not
visible outside of that thread. It is illegal to pass a pointer to local
memory as a launch argument when launching a child kernel. The result of
dereferencing such a local memory pointer from a child will be
undefined.

For example the following is illegal, with undefined behavior if
x_array is accessed by
child_launch:
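
A sketch of such an illegal launch (the enclosing kernel name is illustrative):

__global__ void parent_kernel(void)
{
    int x_array[10];                  // Creates x_array in the parent thread's local memory
    child_launch<<<1, 1>>>(x_array);  // Illegal: passes a local-memory pointer to the child
}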

It is sometimes difficult for a programmer to be aware of when a
variable is placed into local memory by the compiler. As a general rule,
all storage passed to a child kernel should be allocated explicitly from
the global-memory heap, either with cudaMalloc(),
new() or by declaring __device__
storage at global scope. For example:
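
A sketch contrasting valid global storage with invalid local storage (child is an assumed child kernel):

// Correct - "value" is global storage
__device__ int value;
__device__ void x()
{
    value = 5;
    child<<<1, 1>>>(&value);
}

// Invalid - "value" is local storage
__device__ void y()
{
    int value = 5;
    child<<<1, 1>>>(&value);
}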

Writes to the global memory region over which a texture is mapped are
incoherent with respect to texture accesses. Coherence for texture memory
is enforced at the invocation of a child grid and when a child grid
completes. This means that writes to memory prior to a child kernel
launch are reflected in texture memory accesses of the child. Similarly,
writes to memory by a child will be reflected in the texture memory
accesses by a parent, but only after the parent synchronizes on the
child's completion. Concurrent accesses by parent and child may result
in inconsistent data.

This section describes changes and additions to the CUDA C/C++ language extensions for
supporting Dynamic Parallelism.

The language interface and API available to CUDA kernels using CUDA C/C++ for Dynamic
Parallelism, referred to as the Device Runtime, is substantially like that
of the CUDA Runtime API available on the host. Where possible the syntax and semantics
of the CUDA Runtime API have been retained in order to facilitate ease of code reuse for
routines that may run in either the host or device environments.

As with all code in CUDA C/C++, the APIs and code outlined here are per-thread code. This
enables each thread to make unique, dynamic decisions regarding what kernel or operation
to execute next. There are no synchronization requirements between threads within a
block to execute any of the provided device runtime APIs, which enables the device
runtime API functions to be called in arbitrarily divergent kernel code without
deadlock.

Kernels may be launched from the device using the standard CUDA
<<< >>> syntax:

kernel_name<<< Dg, Db, Ns, S >>>([kernel arguments]);

Dg is of type dim3 and specifies
the dimensions and size of the grid

Db is of type dim3 and specifies
the dimensions and size of each thread block

Ns is of type size_t and specifies
the number of bytes of shared memory that is dynamically allocated per
thread block for this call, in addition to the statically allocated memory.
Ns is an optional argument that defaults to 0.

S is of type cudaStream_t and
specifies the stream associated with this call. The stream must have
been allocated in the same thread block where the call is being made.
S is an optional argument that defaults to 0.

Identical to host-side launches, all device-side kernel launches are
asynchronous with respect to the launching thread. That is to say, the
<<<>>> launch command will return
immediately and the launching thread will continue to execute until it
hits an explicit launch-synchronization point such as
cudaDeviceSynchronize(). The grid launch is posted to
the device and will execute independently of the parent thread. The child
grid may begin execution at any time after launch, but is not guaranteed
to begin execution until the launching thread reaches an explicit
launch-synchronization point.

All global device configuration settings (e.g., shared memory and L1
cache size as returned from cudaDeviceGetCacheConfig(),
and device limits returned from cudaDeviceGetLimit())
will be inherited from the parent. That is to say if, when the parent is
launched, execution is configured globally for 16k of shared memory and
48k of L1 cache, then the child's execution state will be configured
identically. Likewise, device limits such as stack size will remain
as-configured.

For host-launched kernels, per-kernel configurations set from the host
will take precedence over the global setting. These configurations will
be used when the kernel is launched from the device as well. It is not
possible to reconfigure a kernel's environment from the device.

Although the device runtime enables kernel launches from either the host
or device, kernel launches from __host__ __device__
functions are unsupported. Compilation will fail if a
__host__ __device__ function is used to launch a
kernel.

Both named and unnamed (NULL) streams are available from the device runtime. Named
streams may be used by any thread within a thread-block, but stream handles may not be
passed to other blocks or child/parent kernels. In other words, a stream should be
treated as private to the block in which it is created. Stream handles are not
guaranteed to be unique between blocks, so using a stream handle within a block that did
not allocate it will result in undefined behavior.

Similar to host-side launch, work launched into separate streams may run concurrently,
but actual concurrency is not guaranteed. Programs that depend upon concurrency between
child kernels are not supported by the CUDA programming model and will have undefined
behavior.

The host-side NULL stream's cross-stream barrier semantic is not supported on the device
(see below for details). In order to retain semantic compatibility with the host
runtime, all device streams must be created using the
cudaStreamCreateWithFlags() API, passing the
cudaStreamNonBlocking flag. The cudaStreamCreate()
call is a host-runtime-only API and will fail to compile for the device.

As cudaStreamSynchronize() and cudaStreamQuery() are
unsupported by the device runtime, cudaDeviceSynchronize() should be
used instead when the application needs to know that stream-launched child kernels have
completed.
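A minimal sketch of device-side stream usage, assuming a hypothetical child kernel, might
look like:

    __global__ void child(int *data);

    __global__ void parent(int *data) {
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);  // cudaStreamCreate() would not compile here

        child<<< 1, 256, 0, s >>>(data);    // launch into the block-private stream

        cudaDeviceSynchronize();            // cudaStreamSynchronize() is unavailable on the device
        cudaStreamDestroy(s);
    }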

Within a host program, the unnamed (NULL) stream has additional barrier
synchronization semantics with other streams (see Default Stream for details). The device runtime offers a
single implicit, unnamed stream shared between all threads in a block,
but as all named streams must be created with the
cudaStreamNonBlocking flag, work launched into the NULL
stream will not insert an implicit dependency on pending work in any
other streams.

Only the inter-stream synchronization capabilities of CUDA events are supported. This
means that cudaStreamWaitEvent() is supported, but
cudaEventSynchronize(), cudaEventElapsedTime(),
and cudaEventQuery() are not. As
cudaEventElapsedTime() is not supported, cudaEvents must be created
via cudaEventCreateWithFlags(), passing the
cudaEventDisableTiming flag.

As for all device runtime objects, event objects may be shared between all threads
within the thread-block which created them, but are local to that block and may not be
passed to other kernels, or between blocks within the same kernel. Event handles are not
guaranteed to be unique between blocks, so using an event handle within a block that did
not create it will result in undefined behavior.
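A sketch of device-side event usage for ordering work between two streams, assuming
hypothetical producer and consumer kernels, might look like:

    __global__ void producer(int *data);
    __global__ void consumer(int *data);

    __global__ void parent(int *data) {
        if (threadIdx.x == 0) {
            cudaStream_t s1, s2;
            cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
            cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);

            cudaEvent_t done;
            cudaEventCreateWithFlags(&done, cudaEventDisableTiming);  // timing must be disabled

            producer<<< 1, 64, 0, s1 >>>(data);
            cudaEventRecord(done, s1);           // fires once the producer grid has completed
            cudaStreamWaitEvent(s2, done, 0);    // the consumer will not start before then
            consumer<<< 1, 64, 0, s2 >>>(data);

            cudaEventDestroy(done);
            cudaStreamDestroy(s1);
            cudaStreamDestroy(s2);
        }
    }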

The cudaDeviceSynchronize() function will synchronize on all work
launched by any thread in the thread-block up to the point where cudaDeviceSynchronize()
was called. Note that cudaDeviceSynchronize() may be called from within
divergent code (see Block Wide Synchronization).

It is up to the program to perform sufficient additional inter-thread synchronization,
for example via a call to __syncthreads(), if the calling thread is
intended to synchronize with child grids invoked from other threads.

The cudaDeviceSynchronize() function does not imply intra-block
synchronization. In particular, without explicit synchronization via a
__syncthreads() directive the calling thread can make no
assumptions about what work has been launched by any thread other than itself. For
example if multiple threads within a block are each launching work and synchronization
is desired for all this work at once (perhaps because of event-based dependencies), it
is up to the program to guarantee that this work is submitted by all threads before
calling cudaDeviceSynchronize().
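For example (a sketch, with a hypothetical worker kernel), a block in which every thread
launches work and all of that work must be synchronized at once could be written as:

    __global__ void worker(int *data);

    __global__ void parent(int *data) {
        worker<<< 1, 32 >>>(data + threadIdx.x * 32);  // every thread launches a child grid

        __syncthreads();              // guarantee all launches have been submitted first

        if (threadIdx.x == 0) {
            cudaDeviceSynchronize();  // waits on work launched by the whole block
        }
        __syncthreads();              // all threads may now rely on the children's results
    }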

Because the implementation is permitted to synchronize on launches from any thread in the
block, it is quite possible that simultaneous calls to
cudaDeviceSynchronize() by multiple threads will drain all work in
the first call and then have no effect for the later calls.

Only the device on which a kernel is running will be controllable from
that kernel. This means that device APIs such as
cudaSetDevice() are not supported by the device runtime.
The active device as seen from the GPU (returned from
cudaGetDevice()) will have the same device number as
seen from the host system. The cudaGetDeviceProperty()
call may request information about another device as this API allows
specification of a device ID as a parameter of the call. Note that the
catch-all cudaGetDeviceProperties() API is not offered
by the device runtime - properties must be queried individually.

Memory declared at file scope with the __device__ or
__constant__ qualifier behaves identically when using
the device runtime. All kernels may read or write device variables,
whether the kernel was initially launched by the host or device runtime.
Equivalently, all kernels will have the same view of
__constant__ variables declared at module scope.

CUDA supports dynamically created texture and surface objects, where a texture
reference may be created on the host, passed to a kernel, used by that
kernel, and then destroyed from the host. The device runtime does not
allow creation or destruction of texture or surface objects from within
device code, but texture and surface objects created from the host may be
used and passed around freely on the device. Regardless of where they are
created, dynamically created texture objects are always valid and may be
passed to child kernels from a parent.

Note: The device runtime does not support legacy module-scope (i.e.,
Fermi-style) textures and surfaces within a kernel launched from the
device. Module-scope (legacy) textures may be created from the host and
used in device code as for any kernel, but may only be used by a
top-level kernel (i.e., the one which is launched from the host).

In CUDA C/C++ shared memory can be declared either as a statically sized
file-scope or function-scope variable, or as an extern
variable with the size determined at runtime by the kernel's caller via a
launch configuration argument. Both types of declaration are valid under
the device runtime.

Device-side symbols (i.e., those marked __device__) may
be referenced from within a kernel simply via the &
operator, as all global-scope device variables are in the kernel's
visible address space. This also applies to __constant__
symbols, although in this case the pointer will reference read-only
data.

Given that device-side symbols can be referenced directly, those CUDA
runtime APIs which reference symbols (e.g.,
cudaMemcpyToSymbol() or
cudaGetSymbolAddress()) are redundant and hence not
supported by the device runtime. Note this implies that constant data
cannot be altered from within a running kernel, even ahead of a child
kernel launch, as references to __constant__ space are
read-only.
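For example (a sketch, assuming a child kernel with the signature shown), symbols may
simply be referenced by address:

    __device__   int   result;        // global-scope device variable
    __constant__ float coeffs[16];    // must be set from the host before launch

    __global__ void child(int *out, const float *c);

    __global__ void parent() {
        // Addresses of __device__ and __constant__ symbols are taken directly;
        // cudaGetSymbolAddress()/cudaMemcpyToSymbol() are neither needed nor available here.
        child<<< 1, 32 >>>(&result, coeffs);
    }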

As usual for the CUDA runtime, any function may return an error code.
The last error code returned is recorded and may be retrieved via the
cudaGetLastError() call. Errors are recorded per-thread,
so that each thread can identify the most recent error that it has
generated. The error code is of type cudaError_t.

Similar to a host-side launch, device-side launches may fail for many
reasons (invalid arguments, etc.). The user must call
cudaGetLastError() to determine if a launch generated an
error; however, the lack of an error after launch does not imply the child
kernel completed successfully.

For device-side exceptions, e.g., access to an invalid address, an error
in a child grid will be returned to the host instead of being returned by
the parent's call to cudaDeviceSynchronize().

Kernel launch is a system-level mechanism exposed through the device
runtime library, and as such is available directly from PTX via the
underlying cudaGetParameterBuffer() and
cudaLaunchDevice() APIs. It is permitted for a CUDA
application to call these APIs itself, with the same requirements as for
PTX. In both cases, the user is then responsible for correctly populating
all necessary data structures in the correct format according to
specification. Backwards compatibility is guaranteed in these data
structures.

As with host-side launch, the device-side operator
<<<>>> maps to underlying kernel launch
APIs. This is so that users targeting PTX will be able to enact a launch,
and so that the compiler front-end can translate
<<<>>> into these calls.

Table 4. New Device-only Launch Implementation Functions

Runtime API Launch Functions    Description of Difference From Host Runtime Behaviour
                                (behaviour is identical if no description)

cudaGetParameterBuffer          Generated automatically from <<<>>>. Note different API to
                                host equivalent.

cudaLaunchDevice                Generated automatically from <<<>>>. Note different API to
                                host equivalent.

The APIs for these launch functions are different to those of the CUDA
Runtime API, and are defined as follows:
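The declarations referred to above are not reproduced in this text; the prototypes below are
a sketch of the forms found in cuda_device_runtime_api.h in recent CUDA releases (exact
signatures may vary between versions):

    extern "C" __device__
    void *cudaGetParameterBuffer(size_t alignment, size_t size);

    extern "C" __device__
    cudaError_t cudaLaunchDevice(void *func, void *parameterBuffer,
                                 dim3 gridDimension, dim3 blockDimension,
                                 unsigned int sharedMemSize,
                                 cudaStream_t stream);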

The portions of the CUDA Runtime API supported in the device runtime are detailed here.
Host and device runtime APIs have identical syntax; semantics are the same except where
indicated. The table below provides an overview of the API relative to the version
available from the host.

Table 5. Supported API Functions

Runtime API Functions           Details

cudaDeviceSynchronize           Synchronizes on work launched from thread's own block only
cudaDeviceGetCacheConfig
cudaDeviceGetLimit
cudaGetLastError                Last error is per-thread state, not per-block state
cudaPeekAtLastError
cudaGetErrorString
cudaGetDeviceCount
cudaGetDeviceProperty           Will return properties for any device
cudaGetDevice                   Always returns current device ID as would be seen from host
cudaStreamCreateWithFlags       Must pass cudaStreamNonBlocking flag
cudaStreamDestroy
cudaStreamWaitEvent
cudaEventCreateWithFlags        Must pass cudaEventDisableTiming flag
cudaEventRecord
cudaEventDestroy
cudaFuncGetAttributes
cudaMemcpyAsync                 Notes about all memcpy/memset functions: only async
cudaMemcpy2DAsync               memcpy/set functions are supported; only device-to-device
cudaMemcpy3DAsync               memcpy is permitted; may not pass in local or shared
cudaMemsetAsync                 memory pointers
cudaMemset2DAsync
cudaMemset3DAsync
cudaRuntimeGetVersion
cudaMalloc                      May not call cudaFree on the device on a pointer created
                                on the host, and vice-versa

This section is for the programming language and compiler implementers who target
Parallel Thread Execution (PTX) and plan to support Dynamic
Parallelism in their language. It provides the low-level details related to
supporting kernel launches at the PTX level.

Device-side kernel launches can be implemented using the following two APIs accessible
from PTX: cudaLaunchDevice() and
cudaGetParameterBuffer(). cudaLaunchDevice()
launches the specified kernel with the parameter buffer that is obtained by calling
cudaGetParameterBuffer() and filled with the parameters to the
launched kernel. The parameter buffer can be NULL, i.e., no need to invoke
cudaGetParameterBuffer(), if the launched kernel does not take any
parameters.

The CUDA-level declaration below is mapped to one of the aforementioned
PTX-level declarations and is found in the system header file
cuda_device_runtime_api.h. The function is defined in
the cudadevrt system library, which must be linked with
a program in order to use device-side kernel launch functionality.
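The declaration itself is not reproduced in this text; for cudaLaunchDevice() it takes a form
along the lines of (the exact signature may vary between CUDA versions):

    extern "C" __device__
    cudaError_t cudaLaunchDevice(void *func, void *parameterBuffer,
                                 dim3 gridDimension, dim3 blockDimension,
                                 unsigned int sharedMemSize,
                                 cudaStream_t stream);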

The first parameter is a pointer to the kernel to be launched, and
the second parameter is the parameter buffer that holds the actual
parameters to the launched kernel. The layout of the parameter buffer is
explained in Parameter Buffer Layout, below. Other
parameters specify the launch configuration, i.e., grid dimension,
block dimension, shared memory size, and the stream associated with the
launch (please refer to Execution Configuration for the
detailed description of the launch configuration).
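The corresponding declaration of cudaGetParameterBuffer() takes a form along the lines of
(again, the exact signature may vary between CUDA versions):

    extern "C" __device__
    void *cudaGetParameterBuffer(size_t alignment, size_t size);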

The first parameter specifies the alignment requirement of the parameter
buffer and the second parameter the size requirement in bytes. In the
current implementation, the parameter buffer returned by
cudaGetParameterBuffer() is always guaranteed to be 64-byte
aligned, and the alignment requirement parameter is ignored.
However, it is recommended to pass the correct alignment requirement
value - which is the largest alignment of any parameter to be placed in
the parameter buffer - to cudaGetParameterBuffer() to
ensure portability in the future.

Parameter reordering in the parameter buffer is prohibited, and each individual
parameter placed in the parameter buffer is required to be aligned. That is, each
parameter must be placed at the nth byte in the parameter buffer,
where n is the smallest multiple of the parameter size that is greater than the
offset of the last byte taken by the preceding parameter. The maximum size of the
parameter buffer is 4KB.
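As an illustration of these rules (a hypothetical example, not taken from the specification),
consider manually packing the arguments of a child kernel declared as
__global__ void child(char a, double b, int c):

    __global__ void child(char a, double b, int c);

    __device__ void launch_child(char a, double b, int c, dim3 grid, dim3 block) {
        // Layout: a (1 byte) at offset 0; b (8 bytes) at offset 8, the smallest
        // multiple of 8 past byte 0; c (4 bytes) at offset 16, the smallest
        // multiple of 4 past byte 15. Total: 20 bytes, well under the 4KB limit.
        void *buf = cudaGetParameterBuffer(8, 20);   // alignment of the largest parameter
        *(char   *)((char *)buf +  0) = a;
        *(double *)((char *)buf +  8) = b;
        *(int    *)((char *)buf + 16) = c;
        cudaLaunchDevice((void *)child, buf, grid, block, 0, 0);
    }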

For a more detailed description of PTX code generated by the CUDA compiler, please refer
to the PTX-3.5 specification.

Similar to the host-side runtime API, prototypes for the CUDA device runtime API are
included automatically during program compilation. There is no need to include
cuda_device_runtime_api.h explicitly.

CUDA programs are automatically linked with the host runtime library
when compiled with nvcc, but the device runtime is shipped
as a static library which must explicitly be linked with a program which
wishes to use it.

The device runtime is offered as a static library
(cudadevrt.lib on Windows,
libcudadevrt.a under Linux and MacOS), against which a
GPU application that uses the device runtime must be linked. Linking of
device libraries can be accomplished through nvcc and/or
nvlink. Two simple examples are shown below.

A device runtime program may be compiled and linked in a single step, if all required source files can be specified from the
command line:

$ nvcc -arch=sm_35 -rdc=true hello_world.cu -o hello -lcudadevrt

It is also possible to compile CUDA .cu source files first to object
files, and then link these together in a two-stage process:
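For example (assuming the same hello_world.cu source as in the single-step case above):

$ nvcc -arch=sm_35 -dc hello_world.cu -o hello_world.o
$ nvcc -arch=sm_35 -rdc=true hello_world.o -o hello -lcudadevrt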

The device runtime is a functional subset of the host runtime. API level
device management, kernel launching, device memcpy, stream management,
and event management are exposed from the device runtime.

Programming for the device runtime should be familiar to someone who
already has experience with CUDA. Device runtime syntax and semantics are
largely the same as that of the host API, with any exceptions detailed
earlier in this document.

The following example shows a simple Hello World program
incorporating dynamic parallelism:
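The listing itself is not reproduced in this text; a sketch along these lines (compiled as
shown in the previous section) illustrates the idea:

    #include <stdio.h>

    __global__ void childKernel()
    {
        printf("Hello ");
    }

    __global__ void parentKernel()
    {
        childKernel<<< 1, 1 >>>();                     // launch child
        if (cudaSuccess != cudaGetLastError()) {
            return;
        }
        if (cudaSuccess != cudaDeviceSynchronize()) {  // wait for child to complete
            return;
        }
        printf("World!\n");
    }

    int main(int argc, char *argv[])
    {
        parentKernel<<< 1, 1 >>>();                    // launch parent
        if (cudaSuccess != cudaGetLastError()) {
            return 1;
        }
        if (cudaSuccess != cudaDeviceSynchronize()) {  // wait for parent to complete
            return 2;
        }
        return 0;
    }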

Synchronization by one thread may impact the performance of other threads in the same
Thread Block, even when those other threads do not call
cudaDeviceSynchronize() themselves. This impact will depend upon
the underlying implementation.

System software which is active when controlling dynamic launches may
impose an overhead on any kernel which is running at the time, whether or
not it invokes kernel launches of its own. This overhead arises from the
device runtime's execution tracking and management software and may
result in decreased performance, e.g., for library calls when made from the
device compared to from the host side. This overhead is, in general,
incurred for applications that link against the device runtime
library.

Dynamic Parallelism guarantees all semantics described in this document;
however, certain hardware and software resources are implementation-dependent and limit
the scale, performance, and other properties of a program which uses the device
runtime.

The device runtime system software reserves memory for various management purposes, in
particular one reservation which is used for saving parent-grid state during
synchronization, and a second reservation for tracking pending grid launches.
Configuration controls are available to reduce the size of these reservations in
exchange for certain launch limitations. See Configuration Options, below, for details.

The majority of reserved memory is allocated as backing-store for parent kernel state,
for use when synchronizing on a child launch. Conservatively, this memory must support
storing of state for the maximum number of live threads possible on the device. This
means that each parent generation at which cudaDeviceSynchronize() is
callable may require up to 150MB of device memory, depending on the device
configuration, which will be unavailable for program use even if it is not all consumed.

Using the device runtime, one kernel may launch another kernel, and that
kernel may launch another, and so on. Each subordinate launch is
considered a new nesting level, and the total number of
levels is the nesting depth of the program. The
synchronization depth is defined as the deepest level at
which the program will explicitly synchronize on a child launch.
Typically this is one less than the nesting depth of the program, but if
the program does not need to call
cudaDeviceSynchronize() at all levels then the
synchronization depth might be substantially different to the nesting
depth.

The overall maximum nesting depth is limited to 24, but practically
speaking the real limit will be the amount of memory required by the
system for each new level (see Memory Footprint above). Any launch
which would result in a kernel at a deeper level than the maximum will
fail. Note that this may also apply to
cudaMemcpyAsync(), which might itself generate a kernel
launch. See Configuration Options for details.

By default, sufficient storage is reserved for two levels of
synchronization. This maximum synchronization depth (and hence reserved
storage) may be controlled by calling
cudaDeviceSetLimit() and specifying
cudaLimitDevRuntimeSyncDepth. The number of levels to be
supported must be configured before the top-level kernel is launched from
the host, in order to guarantee successful execution of a nested program.
Calling cudaDeviceSynchronize() at a depth greater than
the specified maximum synchronization depth will return an error.

An optimization is permitted where the system detects that it need not
reserve space for the parent's state in cases where the parent kernel
never calls cudaDeviceSynchronize(). In this case,
because explicit parent/child synchronization never occurs, the memory
footprint required for a program will be much less than the conservative
maximum. Such a program could specify a shallower maximum synchronization
depth to avoid over-allocation of backing store.

When a kernel is launched, all associated configuration and parameter data is tracked
until the kernel completes. This data is stored within a system-managed launch pool.

The launch pool is divided into a fixed-size pool and a virtualized pool with lower
performance. The device runtime system software will try to track launch data in the
fixed-size pool first. The virtualized pool will be used to track new launches when the
fixed-size pool is full.

The size of the fixed-size launch pool is configurable by calling
cudaDeviceSetLimit() from the host and specifying
cudaLimitDevRuntimePendingLaunchCount.

Resource allocation for the device runtime system software is controlled
via the cudaDeviceSetLimit() API from the host program.
Limits must be set before any kernel is launched, and may not be changed
while the GPU is actively running programs.

The following named limits may be set:

Limit                                       Behavior

cudaLimitDevRuntimeSyncDepth                Sets the maximum depth at which
                                            cudaDeviceSynchronize() may be called. Launches may
                                            be performed deeper than this, but explicit
                                            synchronization deeper than this limit will return
                                            cudaErrorLaunchMaxDepthExceeded. The default maximum
                                            sync depth is 2.

cudaLimitDevRuntimePendingLaunchCount       Controls the amount of memory set aside for buffering
                                            kernel launches which have not yet begun to execute,
                                            due either to unresolved dependencies or lack of
                                            execution resources. When the buffer is full, the
                                            device runtime system software will attempt to track
                                            new pending launches in a lower-performance
                                            virtualized buffer. If the virtualized buffer is also
                                            full, i.e., when all available heap space is consumed,
                                            launches will not occur, and the thread's last error
                                            will be set to cudaErrorLaunchPendingCountExceeded.
                                            The default pending launch count is 2048 launches.
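For example, a host program might configure both limits before launching the top-level
kernel (a sketch; the kernel name and the chosen values are illustrative):

    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 4);             // allow cudaDeviceSynchronize() 4 levels deep
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 4096); // enlarge the fixed-size launch pool
    topLevelKernel<<< grid, block >>>(args);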

cudaMalloc() and cudaFree() have
distinct semantics between the host and device environments. When invoked
from the host, cudaMalloc() allocates a new region from
unused device memory. When invoked from the device runtime these
functions map to device-side malloc() and
free(). This implies that within the device environment
the total allocatable memory is limited to the device
malloc() heap size, which may be smaller than the
available unused device memory. Also, it is an error to invoke
cudaFree() from the host program on a pointer which was
allocated by cudaMalloc() on the device or
vice-versa.
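A sketch of device-side allocation, assuming a hypothetical child kernel:

    __global__ void child(int *data);

    __global__ void parent() {
        int *buf;
        cudaMalloc((void **)&buf, 256 * sizeof(int));  // maps to device-side malloc()
        child<<< 1, 256 >>>(buf);
        cudaDeviceSynchronize();
        cudaFree(buf);                                 // must be freed on the device, not the host
    }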

Note that in PTX %smid and %warpid are
defined as volatile values. The device runtime may reschedule thread
blocks onto different SMs in order to more efficiently manage resources.
As such, it is unsafe to rely upon %smid or
%warpid remaining unchanged across the lifetime of a
thread or thread block.

No notification of ECC errors is available to code within a CUDA kernel. ECC errors are
reported at the host side once the entire launch tree has completed. Any ECC errors
which arise during execution of a nested program will either generate an exception or
continue execution (depending upon error and configuration).

The reference manual lists, along with their descriptions, all the C/C++ standard library mathematical functions
that are supported in device code, as well as all intrinsic functions (which are only supported in device code).

This appendix provides accuracy information for some of these functions when applicable.

The functions from this section can be used in both host and device
code.

This section specifies the error bounds of each function when executed
on the device and also when executed on the host in the case where the
host does not supply the function.

The error bounds are generated from extensive but not exhaustive tests,
so they are not guaranteed bounds.

Single-Precision Floating-Point Functions

Addition and multiplication are IEEE-compliant, so have a maximum
error of 0.5 ulp.

The recommended way to round a single-precision floating-point operand
to an integer, with the result being a single-precision floating-point
number, is rintf(), not roundf(). The
reason is that roundf() maps to an 8-instruction
sequence on the device, whereas rintf() maps to a
single instruction. truncf(),
ceilf(), and floorf() each map to a
single instruction as well.

Table 6. Single-Precision Mathematical Standard Library Functions with
Maximum ULP Error. The maximum error is stated as the absolute value of the
difference in ulps between a correctly rounded single-precision
result and the result returned by the CUDA library function.

Function                Maximum ulp error

x+y                     0 (IEEE-754 round-to-nearest-even)
x*y                     0 (IEEE-754 round-to-nearest-even)
x/y                     0 for compute capability ≥ 2 when compiled with -prec-div=true;
                        2 (full range), otherwise
1/x                     0 for compute capability ≥ 2 when compiled with -prec-div=true;
                        1 (full range), otherwise
rsqrtf(x)               2 (full range)
1/sqrtf(x)              (applies to 1/sqrtf(x) only when it is converted to rsqrtf(x)
                        by the compiler)
sqrtf(x)                0 for compute capability ≥ 2 when compiled with -prec-sqrt=true;
                        3 (full range), otherwise
cbrtf(x)                1 (full range)
rcbrtf(x)               1 (full range)
hypotf(x,y)             3 (full range)
rhypotf(x,y)            2 (full range)
norm3df(x,y,z)          3 (full range)
rnorm3df(x,y,z)         2 (full range)
norm4df(x,y,z,t)        3 (full range)
expf(x)                 2 (full range)
exp2f(x)                2 (full range)
exp10f(x)               2 (full range)
expm1f(x)               1 (full range)
logf(x)                 1 (full range)
log2f(x)                2 (full range)
log10f(x)               2 (full range)
log1pf(x)               2 (full range)
sinf(x)                 2 (full range)
cosf(x)                 2 (full range)
tanf(x)                 4 (full range)
sincosf(x,sptr,cptr)    2 (full range)
sinpif(x)               2 (full range)
cospif(x)               2 (full range)
sincospif(x,sptr,cptr)  2 (full range)
asinf(x)                4 (full range)
acosf(x)                3 (full range)
atanf(x)                2 (full range)
atan2f(y,x)             3 (full range)
sinhf(x)                3 (full range)
coshf(x)                2 (full range)
tanhf(x)                2 (full range)
asinhf(x)               3 (full range)
acoshf(x)               4 (full range)
atanhf(x)               3 (full range)
powf(x,y)               8 (full range)
erff(x)                 2 (full range)
erfcf(x)                4 (full range)
erfinvf(x)              2 (full range)
erfcinvf(x)             2 (full range)
erfcxf(x)               4 (full range)
normcdff(x)             5 (full range)
normcdfinvf(x)          5 (full range)
lgammaf(x)              6 (outside interval -10.001 ... -2.264; larger inside)
tgammaf(x)              11 (full range)
fmaf(x,y,z)             0 (full range)
frexpf(x,exp)           0 (full range)
ldexpf(x,exp)           0 (full range)
scalbnf(x,n)            0 (full range)
scalblnf(x,l)           0 (full range)
logbf(x)                0 (full range)
ilogbf(x)               0 (full range)
j0f(x)                  9 for |x| < 8; otherwise, the maximum absolute error is 2.2 x 10^-6
j1f(x)                  9 for |x| < 8; otherwise, the maximum absolute error is 2.2 x 10^-6
jnf(x)                  For n = 128, the maximum absolute error is 2.2 x 10^-6
y0f(x)                  9 for |x| < 8; otherwise, the maximum absolute error is 2.2 x 10^-6
y1f(x)                  9 for |x| < 8; otherwise, the maximum absolute error is 2.2 x 10^-6
ynf(x)                  ceil(2 + 2.5n) for |x| < n; otherwise, the maximum absolute error
                        is 2.2 x 10^-6
cyl_bessel_i0f(x)       6 (full range)
cyl_bessel_i1f(x)       6 (full range)
fmodf(x,y)              0 (full range)
remainderf(x,y)         0 (full range)
remquof(x,y,iptr)       0 (full range)
modff(x,iptr)           0 (full range)
fdimf(x,y)              0 (full range)
truncf(x)               0 (full range)
roundf(x)               0 (full range)
rintf(x)                0 (full range)
nearbyintf(x)           0 (full range)
ceilf(x)                0 (full range)
floorf(x)               0 (full range)
lrintf(x)               0 (full range)
lroundf(x)              0 (full range)
llrintf(x)              0 (full range)
llroundf(x)             0 (full range)

Double-Precision Floating-Point Functions

The recommended way to round a double-precision floating-point operand
to an integer, with the result being a double-precision floating-point
number, is rint(), not round(). The
reason is that round() maps to an 8-instruction
sequence on the device, whereas rint() maps to a
single instruction. trunc(), ceil(),
and floor() each map to a single instruction as
well.

Table 7. Double-Precision Mathematical Standard Library Functions with
Maximum ULP Error. The maximum error is stated as the absolute value of the
difference in ulps between a correctly rounded double-precision
result and the result returned by the CUDA library function.

Among these functions are the less accurate but faster versions of some
of the functions listed in Standard Functions. They have the
same name prefixed with __ (such as
__sinf(x)). They are faster as they map to fewer native
instructions. The compiler has an option
(-use_fast_math) that forces each function in Table 8 to
compile to its intrinsic counterpart. In addition to reducing the
accuracy of the affected functions, it may also cause some differences in
special case handling. A more robust approach is to selectively replace
mathematical function calls by calls to intrinsic functions only where it
is merited by the performance gains and where changed properties such as
reduced accuracy and different special case handling can be
tolerated.