Programming Guides

This guide provides a detailed discussion of
the CUDA programming model and programming interface. It then describes
the hardware implementation and provides guidance on how to achieve
maximum performance. The appendices include a list of all CUDA-enabled
devices, a detailed description of all extensions to the C language,
listings of supported mathematical functions, C++ features supported in
host and device code, details on texture fetching, and technical
specifications of various devices, concluding with an introduction to the
low-level driver API.
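The core of the programming model the guide covers is launching C functions (kernels) that run in parallel across many GPU threads. A minimal sketch, with hypothetical names (`vecAdd` is not from the guide), might look like this:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Each thread handles one element; the if-guard covers the case
// where the grid is larger than n.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // Unified memory keeps the sketch short; explicit cudaMalloc +
    // cudaMemcpy is the other common pattern.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Compile with `nvcc` on a CUDA-capable system; the guide itself covers the launch configuration and memory hierarchy in depth.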

This guide presents established
parallelization and optimization techniques and explains coding
metaphors and idioms that can greatly simplify programming for
CUDA-capable GPU architectures. The intent is to provide guidelines for
obtaining the best performance from NVIDIA GPUs using the CUDA
Toolkit.

This application note helps
developers ensure that their NVIDIA CUDA applications run
properly on GPUs based on the NVIDIA Maxwell architecture, and
provides guidance on keeping software applications
compatible with Maxwell.

Kepler is NVIDIA's 3rd-generation
architecture for CUDA compute applications. Applications that follow
the best practices for the Fermi architecture should typically
see speedups on the Kepler architecture without any code changes. This
guide summarizes the ways that applications can be fine-tuned to gain
additional speedups by leveraging Kepler architectural features.

Maxwell is NVIDIA's 4th-generation
architecture for CUDA compute applications. Applications that follow
the best practices for the Kepler architecture should typically see
speedups on the Maxwell architecture without any code changes. This
guide summarizes the ways that applications can be fine-tuned to gain
additional speedups by leveraging Maxwell architectural features.

This guide provides detailed instructions on the
use of PTX, a low-level parallel thread execution virtual machine and
instruction set architecture (ISA). PTX exposes the GPU as a
data-parallel computing device.

This document shows how to inline PTX (parallel
thread execution) assembly language statements into CUDA code. It
describes the available assembler statement parameters and constraints,
and lists some pitfalls you may encounter.
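As a minimal sketch of the mechanism that document describes: a PTX instruction is embedded with `asm()`, with `"=r"`/`"r"` constraining the operands to 32-bit registers (the function names here are hypothetical, not from the document):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// %0 is the output operand, %1 the input; the third operand of
// mul.lo.s32 is an immediate constant.
__device__ int mul_by_5(int x) {
    int result;
    asm("mul.lo.s32 %0, %1, 5;" : "=r"(result) : "r"(x));
    return result;
}

__global__ void demo(int *out) {
    *out = mul_by_5(7);
}

int main(void) {
    int *out;
    cudaMallocManaged(&out, sizeof(int));
    demo<<<1, 1>>>(out);
    cudaDeviceSynchronize();
    printf("%d\n", *out);  // 35
    cudaFree(out);
    return 0;
}
```

The document covers the full set of constraint letters and the pitfalls (e.g. the need for `volatile` on `asm` statements with side effects).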

CUDA API References

The cuBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA CUDA runtime. It allows
the user to access the computational resources of NVIDIA graphics processing units (GPUs), but does not auto-parallelize across
multiple GPUs.
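A minimal single-GPU sketch of the library's usage pattern, using the Level-1 SAXPY routine (`y = alpha * x + y`); link with `-lcublas`:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    const int n = 4;
    float hx[] = {1, 2, 3, 4};
    float hy[] = {10, 20, 30, 40};
    const float alpha = 2.0f;

    // cuBLAS operates on device memory; copy the vectors over first.
    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);                        // one handle per host thread is typical
    cublasSaxpy(handle, n, &alpha, dx, 1, dy, 1); // incx = incy = 1: contiguous vectors
    cublasDestroy(handle);

    cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("%g ", hy[i]);  // 12 24 36 48
    printf("\n");
    cudaFree(dx); cudaFree(dy);
    return 0;
}
```

Production code would also check the `cublasStatus_t` and `cudaError_t` return values, omitted here for brevity.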

NVIDIA NPP is a library of functions for performing CUDA-accelerated
processing. The initial set of functionality in the library focuses on
imaging and video processing and is widely applicable for developers in
these areas. NPP will evolve over time to encompass more of the compute
heavy tasks in a variety of problem domains. The NPP library is written
to maximize flexibility, while maintaining high performance.

This document contains a complete listing of the code samples that are
included with the NVIDIA CUDA Toolkit. It describes each code sample,
lists the minimum GPU specification, and provides links to the source
code and white papers if available.

Tools

This document is a reference guide on the use of the CUDA compiler driver nvcc. Rather than exposing a CUDA-specific compilation
interface, nvcc mimics the behavior of the GNU compiler gcc, accepting a range of conventional compiler options, such as those for
defining macros and include/library paths, and for steering the compilation process.

CUDA-GDB is the NVIDIA tool for debugging CUDA applications running on Linux and Mac, providing developers with a mechanism for
debugging CUDA applications on actual hardware. It is an extension to the x86-64 port of GDB, the GNU Project debugger.

White Papers

A number of issues related to floating point accuracy and compliance are
a frequent source of confusion on both CPUs and GPUs. The purpose of this
white paper is to discuss the most common issues related to NVIDIA GPUs
and to supplement the documentation in the CUDA C Programming Guide.

In this white paper we show how to use the
cuSPARSE and cuBLAS libraries to achieve a 2x speedup over CPU in the
incomplete-LU and Cholesky preconditioned iterative methods. We focus on
the Bi-Conjugate Gradient Stabilized and Conjugate Gradient iterative
methods, which can be used to solve large sparse nonsymmetric and
symmetric positive definite linear systems, respectively. Also, we
comment on the parallel sparse triangular solve, which is an essential
building block in these algorithms.

Miscellaneous

GPUDirect RDMA is a technology introduced with Kepler-class GPUs and CUDA 5.0
that enables a direct path for communication between the GPU and a third-party peer
device on the PCI Express bus when the devices share the same upstream
root complex, using standard features of PCI Express. This document
introduces the technology and describes the steps necessary to enable a
GPUDirect RDMA connection to NVIDIA GPUs within the Linux device
driver model.