Added 4_Finance/binomialOptions_nvrtc. Demonstrates runtime compilation, using libNVRTC, of a CUDA kernel that evaluates the fair call price for a given set of European
options under the binomial model.

Added 4_Finance/BlackScholes_nvrtc. Demonstrates runtime compilation, using libNVRTC, of a CUDA kernel that evaluates fair call and put prices for a given set
of European options using the Black-Scholes formula.

Added 7_CUDALibraries/cuHook. Demonstrates how to build and use an intercept library with CUDA.

Added 7_CUDALibraries/simpleCUFFT_callback. Demonstrates how to compute a 1D convolution of a signal with a filter using a user-supplied CUFFT callback routine
rather than a separate kernel call.

Added 7_CUDALibraries/simpleCUFFT_MGPU. Demonstrates how to compute a 1D convolution of a signal with a filter by transforming both into the frequency domain,
multiplying them together, and transforming the signal back to the time domain, on multiple GPUs.

Added 7_CUDALibraries/simpleCUFFT_2d_MGPU. Demonstrates how to compute a 2D convolution of a signal with a filter by transforming both into the frequency domain,
multiplying them together, and transforming the signal back to the time domain, on multiple GPUs.

Removed 3_Imaging/cudaEncode. Support for the CUDA Video Encoder (NVCUVENC) has been removed.

Removed 4_Finance/ExcelCUDA2007. The topic will be covered in a blog post at Parallel Forall.

Removed 4_Finance/ExcelCUDA2010. The topic will be covered in a blog post at Parallel Forall.

The 4_Finance/binomialOptions sample is now restricted to running on GPUs with SM architecture 2.0 or greater.

The 4_Finance/quasirandomGenerator sample is now restricted to running on GPUs with SM architecture 2.0 or greater.

The 7_CUDALibraries/boxFilterNPP sample now demonstrates how to use the static NPP libraries on Linux and Mac.

The 7_CUDALibraries/conjugateGradient sample now demonstrates how to use the static CUBLAS and CUSPARSE libraries on Linux and Mac.

The 7_CUDALibraries/MersenneTwisterGP11213 sample now demonstrates how to use the static CURAND library on Linux and Mac.

Linux makefiles have been updated to generate code for the ARMv7
architecture. Only the ARM hard-float floating-point ABI is supported.
Both native ARMv7 compilation and cross-compilation from
x86 are supported.

Added 0_Simple/simpleIPC - a very basic CUDA Runtime API sample that demonstrates Inter Process Communication, with one process per GPU performing
computation. Requires Compute Capability 2.0 or higher and a Linux operating system.

Added 0_Simple/simpleSeparateCompilation - demonstrates a CUDA 5.0 feature, the ability to create a GPU device static library and use it within another CUDA kernel.
This example demonstrates how to pass in a GPU device function (from the GPU device static library) as a function pointer
to be called. Requires Compute Capability 2.0 or higher.

Added 2_Graphics/bindlessTexture - demonstrates use of cudaSurfaceObject, cudaTextureObject, and MipMap support in CUDA. Requires Compute Capability 3.0 or higher.

Added HSOpticalFlow - when working with image sequences or video, it is often useful to have information about object movement. Optical flow describes the
apparent motion of objects in an image sequence. This sample implements the Horn-Schunck method for optical flow in CUDA.

The Windows samples are built using the Visual Studio IDE. Solution files (.sln) are provided for each supported version of
Visual Studio, using the format:

*_vs<version>.sln - for Visual Studio <version>

Complete samples solution files exist at:

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.0\

Each individual sample has its own set of solution files at:

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.0\<sample_dir>\

To build/examine all the samples at once, the complete solution files should be used. To build/examine a single sample, the
individual sample solution files should be used.

Note:
Some samples require that the Microsoft DirectX SDK (June 2010 or newer) be installed and that the VC++ directory paths are
properly set up (Tools > Options...). Check DirectX Dependencies section for details.

SMS="A B ..." - override the SM architectures for which the sample will be built, where "A B ..." is a space-delimited list of SM architectures. For example, to generate SASS for SM 20 and SM 30, use SMS="20 30".

$ make SMS="20 30"

HOST_COMPILER=<host_compiler> - override the default g++ host compiler. See the Linux Getting Started Guide for a list of supported host compilers.

The Mac samples are built using makefiles. To use the makefiles, change directory into the sample directory you wish to build,
and run make:

$ cd <sample_dir>
$ make

The samples makefiles can take advantage of certain options:

dbg=1 - build with debug symbols

$ make dbg=1

SMS="A B ..." - override the SM architectures for which the sample will be built, where "A B ..." is a space-delimited list of SM architectures. For example, to generate SASS for SM 20 and SM 30, use SMS="20 30".

$ make SMS="20 30"

HOST_COMPILER=<host_compiler> - override the default clang host compiler. See the Mac Getting Started Guide for a list of supported host compilers.

This section describes the options used to build cross-platform samples. TARGET_ARCH=<arch> and TARGET_OS=<os> should be chosen based on the supported targets shown below. TARGET_FS=<path> can be used to point nvcc to libraries and headers used by the sample.

The most reliable method to cross-compile the CUDA Samples is to use the TARGET_FS variable. To do so, mount the target's
filesystem on the host, say at /mnt/target. This is typically done using exportfs. In cases where exportfs is unavailable, it is sufficient to copy the target's filesystem to /mnt/target. A sample can then be cross-compiled by running make with TARGET_ARCH, TARGET_OS, and TARGET_FS set appropriately.

If the TARGET_FS option is not available, the libraries used should be copied from the target system to the host system, say
to /opt/target/libs. If the sample uses GL, the GL headers must also be copied, say to /opt/target/include. The linker must then be told where the libraries are with the -rpath-link and/or -L options. To ignore unresolved symbols from some libraries, use the --unresolved-symbols option. SAMPLE_ENABLED should be used to force the sample to build. A sample which uses such libraries can then be cross-compiled by running make with these options set.

Creating a new CUDA Program using the CUDA Samples infrastructure is easy. We have provided a template and template_runtime project that you can copy and modify to suit your needs. Just follow these steps:

(<category> refers to one of the following folders: 0_Simple, 1_Utilities, 2_Graphics, 3_Imaging, 4_Finance, 5_Simulations, 6_Advanced, 7_CUDALibraries.)

Note: The default installation folder <SAMPLES_INSTALL_PATH> is NVIDIA_CUDA_7.0_Samples and <category> is one of the following: 0_Simple, 1_Utilities, 2_Graphics, 3_Imaging, 4_Finance, 5_Simulations, 6_Advanced, 7_CUDALibraries.


This document contains a complete listing of the code samples that are included with the NVIDIA CUDA Toolkit. It describes
each code sample, lists the minimum GPU specification, and provides links to the source code and white papers if available.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.


This example shows how to use the clock function, with runtime compilation via libNVRTC, to measure kernel performance accurately.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This example demonstrates how to integrate CUDA into an existing C++ application: the CUDA entry point on the host side is
just a function called from C++ code, and only the file containing this function is compiled with nvcc. It also demonstrates
that vector types can be used from C++.

This sample demonstrates how to use OpenMP API to write an application for multiple GPUs.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

A simple test application that demonstrates a new CUDA 4.0 ability to embed PTX in a CUDA kernel.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample implements matrix multiplication and is exactly the same as Chapter 6 of the programming guide. It has been written
for clarity of exposition to illustrate various CUDA programming principles, not with the goal of providing the most performant
generic kernel for matrix multiplication. To illustrate GPU performance for matrix multiply, this sample also shows how to
use the new CUDA 4.0 interface for CUBLAS to demonstrate high-performance matrix multiplication.


This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample implements matrix multiplication from Chapter 3 of the programming guide. To illustrate GPU performance for matrix
multiply, this sample also shows how to use the new CUDA 4.0 interface for CUBLAS to demonstrate high-performance matrix
multiplication.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample implements matrix multiplication and uses the new CUDA 4.0 kernel launch Driver API. It has been written for clarity
of exposition to illustrate various CUDA programming principles, not with the goal of providing the most performant generic
kernel for matrix multiplication. CUBLAS provides high-performance matrix multiplication.

This CUDA Runtime API sample is a very basic sample that demonstrates how to use the assert function in device code. Requires
Compute Capability 2.0.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.


This CUDA Runtime API sample is a very basic sample that demonstrates Inter Process Communication with one process per GPU
for computation. Requires Compute Capability 2.0 or higher and a Linux operating system.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.


On GPUs with Compute Capability 1.1, overlapping compute with one memcopy to or from the host system is possible. For
Quadro and Tesla GPUs with Compute Capability 2.0, a second overlapped copy operation in either direction at full speed is
possible (PCI-e is symmetric). This sample illustrates the usage of CUDA streams to overlap kernel execution
with data copies to and from the device.

This sample demonstrates the basic usage of the CUDA occupancy calculator and occupancy-based launch configurator APIs by
launching a kernel with the launch configurator and measuring the utilization difference against a manually configured launch.
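The idea behind the occupancy calculation can be sketched with a toy CPU model in Python (illustrative only: the per-SM limits below are assumed, fixed numbers chosen for the sketch, not values queried from a real device the way the CUDA occupancy APIs do):

```python
def occupancy(block_size, regs_per_thread,
              max_threads_per_sm=2048, max_blocks_per_sm=16,
              regs_per_sm=65536, warp_size=32):
    """Toy occupancy model: active warps per SM divided by the maximum
    warps per SM, limited by threads, resident blocks, and registers."""
    blocks_by_threads = max_threads_per_sm // block_size
    blocks_by_regs = regs_per_sm // (regs_per_thread * block_size)
    blocks = min(blocks_by_threads, blocks_by_regs, max_blocks_per_sm)
    active_warps = blocks * (block_size // warp_size)
    return active_warps / (max_threads_per_sm // warp_size)
```

With these assumed limits, a 256-thread block using 32 registers per thread reaches full occupancy, while doubling the register count halves it.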

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample demonstrates a CUDA 5.0 feature, the ability to create a GPU device static library and use it within another CUDA
kernel. This example demonstrates how to pass in a GPU device function (from the GPU device static library) as a function
pointer to be called. This sample requires devices with compute capability 2.0 or higher.

This sample uses CUDA streams to overlap kernel executions with memory copies between the host and a GPU device. This sample
uses a new CUDA 4.0 feature that supports pinning of generic host memory. Requires Compute Capability 2.0 or higher.

This sample is a templatized version of the template project. It also shows how to correctly templatize dynamically allocated
shared memory arrays.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

Simple program which demonstrates how to use the Vote (any, all) intrinsic instruction in a CUDA kernel with runtime compilation
using NVRTC APIs. Requires Compute Capability 2.0 or higher.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample demonstrates the use of OpenMP and streams with Unified Memory on a single GPU.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This CUDA Runtime API sample is a very basic sample that implements element by element vector addition. It is the same as
the sample illustrating Chapter 3 of the programming guide with some additions like error checking.

This CUDA Driver API sample uses NVRTC for runtime compilation of a vector addition kernel. The kernel demonstrated
is the same as the sample illustrating Chapter 3 of the programming guide.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This Vector Addition sample is a basic sample that is implemented element by element. It is the same as the sample illustrating
Chapter 3 of the programming guide with some additions like error checking. This sample also uses the new CUDA 4.0 kernel
launch Driver API.

This is a simple test program to measure the memcopy bandwidth of the GPU and memcpy bandwidth across PCI-e. This test application
is capable of measuring device to device copy bandwidth, host to device copy bandwidth for pageable and page-locked memory,
and device to host copy bandwidth for pageable and page-locked memory.

This example demonstrates use of cudaSurfaceObject, cudaTextureObject, and MipMap support in CUDA. A GPU with Compute Capability
SM 3.0 is required to run the sample.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample uses CUDA to compute and display the Mandelbrot or Julia sets interactively. It also illustrates the use of "double
single" arithmetic to improve precision when zooming a long way into the pattern. This sample uses double precision. Thanks
to Mark Granger of NewTek, who submitted this code sample.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample extracts a geometric isosurface from a volume dataset using the marching cubes algorithm. It uses the scan (prefix
sum) function from the Thrust library to perform stream compaction.
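The scan-based stream compaction the sample borrows from Thrust can be sketched in a few lines of Python (an illustrative CPU sketch, not the sample's CUDA code; function names are ours):

```python
def exclusive_scan(flags):
    """Exclusive prefix sum: out[i] = sum(flags[:i])."""
    out, total = [], 0
    for f in flags:
        out.append(total)
        total += f
    return out

def compact(items, keep):
    """Stream compaction: keep items[i] where keep(items[i]) is true.
    The exclusive scan of the keep-flags gives each surviving element
    its write index in the compacted output."""
    flags = [1 if keep(x) else 0 for x in items]
    idx = exclusive_scan(flags)
    out = [None] * sum(flags)
    for i, x in enumerate(items):
        if flags[i]:
            out[idx[i]] = x  # scatter to the scanned position
    return out
```

On the GPU, the same scatter runs in parallel because the scan has already assigned every surviving element a unique destination.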

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

Simple program which demonstrates interoperability between CUDA and Direct3D10. The program generates a vertex array with
CUDA and uses Direct3D10 to render the geometry. A Direct3D Capable device is required.

Simple program which demonstrates interoperability of render targets between Direct3D10 and CUDA. The program uses RenderTarget positions
with CUDA and generates a histogram with visualization. A Direct3D10-capable device is required.

Simple program which demonstrates how to interoperate CUDA with Direct3D10 Texture. The program creates a number of D3D10
Textures (2D, 3D, and CubeMap) which are generated from CUDA kernels. Direct3D then renders the results on the screen. A
Direct3D10 Capable device is required.

Simple program which demonstrates Direct3D11 Texture interoperability with CUDA. The program creates a number of D3D11 Textures
(2D, 3D, and CubeMap) which are written to from CUDA kernels. Direct3D then renders the results on the screen. A Direct3D
Capable device is required.

Simple program which demonstrates interoperability between CUDA and Direct3D9. The program generates a vertex array with CUDA
and uses Direct3D9 to render the geometry. A Direct3D capable device is required.

Simple program which demonstrates Direct3D9 Texture interoperability with CUDA. The program creates a number of D3D9 Textures
(2D, 3D, and CubeMap) which are written to from CUDA kernels. Direct3D then renders the results on the screen. A Direct3D
capable device is required.

Simple program which demonstrates interoperability between CUDA and OpenGL. The program modifies vertex positions with CUDA
and uses OpenGL to render the geometry.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

Demonstrates data exchange between CUDA and OpenGL ES (aka Graphics interop). The program modifies vertex positions with CUDA
and uses OpenGL ES to render the geometry.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.


Simple program which demonstrates SLI with Direct3D10 Texture interoperability with CUDA. The program creates a D3D10 Texture
which is written to from a CUDA kernel. Direct3D then renders the results on the screen. A Direct3D Capable device is required.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.


This sample demonstrates how to efficiently implement a Bicubic B-spline interpolation filter with CUDA texture.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

The bilateral filter is an edge-preserving, non-linear smoothing filter, implemented here in CUDA with OpenGL rendering. It
can be used in image recovery and denoising. Each pixel is weighted by considering both the spatial distance and the color distance
to its neighbors. Reference: C. Tomasi, R. Manduchi, "Bilateral Filtering for Gray and Color Images," Proceedings of the
ICCV, 1998, http://users.soe.ucsc.edu/~manduchi/Papers/ICCV98.pdf
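The per-pixel weighting described above can be sketched in 1D with plain Python (a CPU illustration of the weighting scheme, not the sample's CUDA code; names and parameters are ours):

```python
import math

def bilateral_1d(signal, radius, sigma_s, sigma_r):
    """1D bilateral filter: each sample becomes a normalized average of its
    neighbors, weighted by both spatial distance and value (range) distance,
    so large value jumps (edges) contribute almost nothing."""
    out = []
    for i, center in enumerate(signal):
        wsum = vsum = 0.0
        for j in range(max(0, i - radius), min(len(signal), i + radius + 1)):
            d_space = (i - j) ** 2
            d_range = (signal[j] - center) ** 2
            w = (math.exp(-d_space / (2 * sigma_s ** 2))
                 * math.exp(-d_range / (2 * sigma_r ** 2)))
            wsum += w
            vsum += w * signal[j]
        out.append(vsum / wsum)
    return out
```

Running this on a step signal shows the edge-preserving behavior: samples on each side of the step are smoothed among themselves but barely influenced by the other side.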

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.


This sample demonstrates how 2D convolutions with very large kernel sizes can be efficiently implemented using FFT transformations.
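The underlying idea, convolution via the convolution theorem, can be sketched in pure Python with a naive DFT standing in for the FFT (illustration only: a real FFT, like CUFFT's, is O(n log n), while this direct DFT is O(n^2) and only suitable for tiny inputs):

```python
import cmath

def dft(x, inverse=False):
    """Naive DFT (a stand-in for an FFT in this sketch)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * j * k / n)
               for k in range(n)) for j in range(n)]
    return [v / n for v in out] if inverse else out

def fft_convolve(a, b):
    """Circular convolution via the convolution theorem:
    transform both inputs, multiply pointwise, transform back."""
    fa, fb = dft(a), dft(b)
    return [v.real for v in dft([x * y for x, y in zip(fa, fb)], inverse=True)]

def direct_convolve(a, b):
    """Reference circular convolution, computed directly."""
    n = len(a)
    return [sum(a[k] * b[(i - k) % n] for k in range(n)) for i in range(n)]
```

Both routes give the same result; the frequency-domain route wins for large kernels because its cost does not grow with kernel width.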

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample demonstrates how to efficiently use the CUDA Video Decoder API to decode MPEG-2, VC-1, or H.264 sources. YUV-to-RGB
conversion of the video is accomplished with a CUDA kernel, and the output is rendered to a D3D9 surface. The decoded
video is not displayed on the screen by default, but can be displayed by passing -displayvideo on the command line.
Requires a Direct3D-capable device and Compute Capability 2.0 or higher.

This sample demonstrates how to efficiently use the CUDA Video Decoder API to decode video sources based on MPEG-2, VC-1,
and H.264. YUV-to-RGB conversion of the video is accomplished with a CUDA kernel, and the output is rendered to an OpenGL surface.
The decoded video is not displayed by default, but display can be enabled by adding -displayvideo to the command line. Requires Compute Capability
2.0 or higher.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample demonstrates how Discrete Cosine Transform (DCT) for blocks of 8 by 8 pixels can be performed using CUDA: a naive
implementation by definition and a more traditional approach used in many libraries. As opposed to implementing DCT in a fragment
shader, CUDA allows for an easier and more efficient implementation.
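The naive by-definition approach mentioned above can be sketched on the CPU in Python (an illustrative sketch of the 8x8 2D DCT-II, not the sample's CUDA code):

```python
import math

def dct2_8x8(block):
    """Naive 8x8 2D DCT-II, computed directly from the definition."""
    n = 8
    def alpha(u):
        return math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out
```

A quick sanity check: a constant block transforms to a single DC coefficient with all other coefficients at zero.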

High Quality DXT Compression using CUDA. This example shows how to implement an existing computationally-intensive CPU compression
algorithm in parallel on the GPU, and obtain an order of magnitude performance improvement.

This sample demonstrates two adaptive image denoising techniques, KNN and NLM, based on the computation of both geometric and
color distance between texels. While both techniques are implemented in the DirectX SDK using shaders, a massively sped-up
variation of the latter technique, taking advantage of shared memory, is implemented here in addition to the DirectX counterparts.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample shows how to post-process an image rendered in OpenGL using CUDA.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample implements a Gaussian blur using Deriche's recursive method. The advantage of this method is that the execution
time is independent of the filter width.
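The width-independence claim can be illustrated with a much simpler first-order recursive filter in Python (a stand-in for Deriche's higher-order recursive Gaussian, not the sample's algorithm: each pass is O(n) no matter how wide the effective smoothing is, because widening the smoothing only changes a coefficient, not the amount of work):

```python
def recursive_smooth(signal, alpha):
    """Forward + backward first-order recursive (IIR) smoothing.
    Smaller alpha means wider effective smoothing at identical cost."""
    fwd = []
    acc = signal[0]
    for x in signal:
        acc += alpha * (x - acc)   # y[i] = y[i-1] + alpha * (x[i] - y[i-1])
        fwd.append(acc)
    out = []
    acc = fwd[-1]
    for x in reversed(fwd):        # second pass removes the phase lag
        acc += alpha * (x - acc)
        out.append(acc)
    out.reverse()
    return out
```

A constant signal passes through unchanged, while an impulse is spread out and attenuated, as expected of a smoothing filter.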

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample shows how to copy a CUDA image back to OpenGL using the most efficient methods.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.


This sample evaluates the fair call price for a given set of European options under the binomial model. This sample makes use of NVRTC
for runtime compilation.
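For reference, the binomial valuation the kernel performs can be sketched on the CPU in Python (a Cox-Ross-Rubinstein sketch under assumed conventions, not the sample's CUDA code; function and parameter names are ours):

```python
import math

def binomial_call(s, k, t, r, sigma, steps):
    """Price a European call with a Cox-Ross-Rubinstein binomial tree."""
    dt = t / steps
    u = math.exp(sigma * math.sqrt(dt))    # up factor per step
    d = 1.0 / u                            # down factor per step
    p = (math.exp(r * dt) - d) / (u - d)   # risk-neutral up probability
    disc = math.exp(-r * dt)
    # Option values at expiry for every terminal node.
    values = [max(s * u**j * d**(steps - j) - k, 0.0) for j in range(steps + 1)]
    # Step backward through the tree, discounting expected values.
    for i in range(steps, 0, -1):
        values = [disc * (p * values[j + 1] + (1 - p) * values[j])
                  for j in range(i)]
    return values[0]
```

With enough steps the tree price converges toward the Black-Scholes closed-form price for the same inputs.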

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample evaluates fair call and put prices for a given set of European options using the Black-Scholes formula, compiling the
CUDA kernels involved at runtime using NVRTC.
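The closed-form prices the kernels compute can be sketched on the CPU in Python (illustrative only; function and parameter names are ours):

```python
import math

def _norm_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes(s, k, t, r, sigma):
    """Return (call, put) prices for a European option
    under the Black-Scholes formula."""
    d1 = (math.log(s / k) + (r + 0.5 * sigma * sigma) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    call = s * _norm_cdf(d1) - k * math.exp(-r * t) * _norm_cdf(d2)
    put = k * math.exp(-r * t) * _norm_cdf(-d2) - s * _norm_cdf(-d1)
    return call, put
```

The two prices satisfy put-call parity, C - P = S - K e^(-rT), which makes a convenient correctness check.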

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample evaluates the fair call price for a given set of European options using the Monte Carlo approach, taking advantage
of all CUDA-capable GPUs installed in the system. This sample uses double-precision hardware if a GTX 200 class GPU is present.
The sample also takes advantage of the CUDA 4.0 capability of using a single CPU thread to control multiple GPUs.
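The single-GPU core of the method, estimating the discounted expected payoff over simulated terminal prices, can be sketched on the CPU in Python (illustration only, assuming geometric Brownian motion for the underlying; names are ours):

```python
import math
import random

def mc_european_call(s, k, t, r, sigma, n_paths, seed=1234):
    """Estimate a European call price by simulating terminal prices
    under geometric Brownian motion and averaging discounted payoffs."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma * sigma) * t
    vol = sigma * math.sqrt(t)
    payoff_sum = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)
        s_t = s * math.exp(drift + vol * z)  # simulated terminal price
        payoff_sum += max(s_t - k, 0.0)
    return math.exp(-r * t) * payoff_sum / n_paths
```

The estimate converges to the Black-Scholes price as the path count grows; the GPU version parallelizes exactly this embarrassingly parallel path loop.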

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.


An example of fluid simulation using CUDA and CUFFT, with OpenGL rendering.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample demonstrates an efficient all-pairs gravitational n-body simulation in CUDA. This sample accompanies
the GPU Gems 3 chapter "Fast N-Body Simulation with CUDA". With CUDA 5.5, performance on Tesla K20c has increased to over
1.8 TFLOP/s single precision. Double-precision performance has also improved on all Kepler and Fermi GPU architectures. Starting
in CUDA 4.0, the nBody sample has been updated to take advantage of new features to easily scale the n-body simulation across
multiple GPUs in a single PC. Adding "-numbodies=<bodies>" to the command line allows users to set the number of bodies for simulation.
Adding "-numdevices=<N>" to the command line causes the sample to use N devices (if available) for simulation.
In this mode, the position and velocity data for all bodies are read from system memory using "zero copy" rather than from
device memory. For a small number of devices (4 or fewer) and a large enough number of bodies, bandwidth is not a bottleneck,
so we can achieve strong scaling across these devices.
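
The all-pairs scheme can be sketched in a few lines. This Python stand-in (the function name all_pairs_accel is illustrative,
not part of the sample) computes each body's acceleration by summing the softened contribution of every other body; this is
the O(N^2) inner loop the CUDA kernel tiles through shared memory.

```python
def all_pairs_accel(pos, mass, soft=1e-9):
    """Naive O(N^2) all-pairs gravitational acceleration (G = 1).

    pos:  list of (x, y, z) positions
    mass: list of body masses
    soft: softening term that avoids division by zero at r = 0
    """
    n = len(pos)
    acc = []
    for i in range(n):
        ax = ay = az = 0.0
        xi, yi, zi = pos[i]
        for j in range(n):
            if i == j:
                continue
            dx, dy, dz = pos[j][0] - xi, pos[j][1] - yi, pos[j][2] - zi
            # Softened inverse cube of the distance: 1 / (r^2 + eps)^(3/2)
            inv_r3 = (dx * dx + dy * dy + dz * dz + soft) ** -1.5
            ax += mass[j] * dx * inv_r3
            ay += mass[j] * dy * inv_r3
            az += mass[j] * dz * inv_r3
        acc.append((ax, ay, az))
    return acc
```

For two unit masses at x = -1 and x = +1 the accelerations are equal and opposite, a quick symmetry check on the kernel.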

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample demonstrates an efficient all-pairs gravitational n-body simulation in CUDA. Unlike the OpenGL nbody
sample, there is no user interaction.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample simulates an ocean height field using the CUFFT library and renders the result using OpenGL.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample uses CUDA to simulate and visualize a large set of particles and their physical interaction. Adding "-particles=<N>"
to the command line allows users to set the number of particles for simulation. This example implements a uniform grid data structure
using either atomic operations or a fast radix sort from the Thrust library.
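
The uniform grid idea can be sketched as follows. This illustrative Python version (the names build_grid and neighbors are
hypothetical, not from the sample) bins particles into cells and restricts each neighbor search to the 27 surrounding cells,
standing in for the sorted cell list the CUDA sample builds with atomics or a radix sort.

```python
from collections import defaultdict

def build_grid(points, cell):
    """Map each integer cell coordinate to the indices of particles inside it."""
    grid = defaultdict(list)
    for i, (x, y, z) in enumerate(points):
        grid[(int(x // cell), int(y // cell), int(z // cell))].append(i)
    return grid

def neighbors(points, i, cell, radius, grid):
    """Find particles within `radius` of particle i, scanning only 27 cells."""
    x, y, z = points[i]
    cx, cy, cz = int(x // cell), int(y // cell), int(z // cell)
    r2 = radius * radius
    out = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy, cz + dz), ()):
                    if j != i and (points[j][0] - x) ** 2 + \
                       (points[j][1] - y) ** 2 + (points[j][2] - z) ** 2 <= r2:
                        out.append(j)
    return out
```

Choosing the cell size equal to the interaction radius keeps the 27-cell scan sufficient, which is the same design choice
the CUDA sample makes.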

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

The sample models formation of V-shaped flocks by big birds, such as geese and cranes. The algorithms of such flocking are
borrowed from the paper "V-like formations in flocks of artificial birds" from Artificial Life, Vol. 14, No. 2, 2008. The
sample has CPU- and GPU-based implementations. Press 'g' to toggle between them. The GPU-based simulation works many times
faster than the CPU-based one. The printout in the console window reports the simulation time per step. Press 'r' to reset
the initial distribution of birds.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample demonstrates the use of CUDA streams for concurrent execution of several kernels on devices of compute capability
2.0 or higher. Devices of compute capability 1.x will run the kernels sequentially. It also illustrates how to introduce
dependencies between CUDA streams with the new cudaStreamWaitEvent function introduced in CUDA 3.2.

The computation of all or a subset of all eigenvalues is an important problem in Linear Algebra, statistics, physics, and
many other fields. This sample demonstrates a parallel implementation of a bisection algorithm for the computation of all
eigenvalues of a tridiagonal symmetric matrix of arbitrary size with CUDA.
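
As a rough sketch of the bisection approach, the following Python code (illustrative names, not taken from the sample) uses
the classic Sturm-count recurrence for a symmetric tridiagonal matrix to count eigenvalues below a shift, then bisects until
the k-th eigenvalue is isolated. The CUDA sample parallelizes many such intervals at once.

```python
def count_eigs_below(diag, off, x):
    """Sturm count: number of eigenvalues of the symmetric tridiagonal
    matrix with diagonal `diag` and off-diagonal `off` that lie below x."""
    count, d = 0, 1.0
    for i in range(len(diag)):
        b2 = off[i - 1] ** 2 if i > 0 else 0.0
        # LDL^T pivot recurrence; guard against a zero pivot
        d = diag[i] - x - (b2 / d if d != 0.0 else b2 / 1e-30)
        if d < 0:
            count += 1
    return count

def eig_by_bisection(diag, off, k, lo, hi, tol=1e-10):
    """Isolate the k-th smallest eigenvalue (0-based) inside [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if count_eigs_below(diag, off, mid) > k:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

For the 2x2 matrix [[2, 1], [1, 2]] the eigenvalues are 1 and 3, which the bisection recovers.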

This sample illustrates how to use function pointers and implements the Sobel Edge Detection filter for 8-bit monochrome images.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample is an implementation of a simple line-of-sight algorithm: Given a height map and a ray originating at some observation
point, it computes all the points along the ray that are visible from the observation point. The implementation is based on
the Thrust library (http://code.google.com/p/thrust/).
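
A minimal sequential sketch of the algorithm, assuming unit spacing along the ray with the first sample one step from the
observer: a point is visible exactly when its elevation angle reaches the running maximum over all closer points, so
visibility reduces to a max-scan, which is the operation the sample maps onto Thrust.

```python
from itertools import accumulate
from math import atan2

def line_of_sight(heights, observer_h):
    """Visibility of each height sample along a ray from the observer.

    heights:    terrain heights at distances 1, 2, 3, ... from the observer
    observer_h: height of the observation point
    """
    # Elevation angle of each sample as seen from the observer.
    angles = [atan2(h - observer_h, i + 1) for i, h in enumerate(heights)]
    # Inclusive max-scan: the steepest angle seen so far along the ray.
    running_max = list(accumulate(angles, max))
    # A point is visible iff nothing closer subtends a larger angle.
    return [a >= m for a, m in zip(angles, running_max)]
```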

This sample revisits matrix multiplication using the CUDA driver API. It demonstrates how to link to CUDA driver at runtime
and how to use JIT (just-in-time) compilation from PTX code. It has been written for clarity of exposition to illustrate various
CUDA programming principles, not with the goal of providing the most performant generic kernel for matrix multiplication.
CUBLAS provides high-performance matrix multiplication.

This sample implements a merge sort (also known as Batcher's sort), an algorithm belonging to the class of sorting networks.
While generally subefficient on large sequences compared to algorithms with better asymptotic algorithmic complexity (i.e.
merge sort or radix sort), it may be the algorithm of choice for sorting batches of short- to mid-sized (key, value) array pairs.
Refer to the excellent tutorial by H. W. Lang http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/networks/indexen.htm

This sample uses the Driver API to just-in-time compile (JIT) a Kernel from PTX code. Additionally, this sample demonstrates
the seamless interoperability capability of the CUDA Runtime and CUDA Driver API calls. For CUDA 5.5, this sample shows how
to use cuLink* functions to link PTX assembly using the CUDA driver at runtime.

This sample demonstrates a very fast and efficient parallel radix sort that uses the Thrust library (http://code.google.com/p/thrust/).
The included RadixSort class can sort either key-value pairs (with float or unsigned integer keys) or keys only. The optimized
code in this sample (and also in reduction and scan) uses a technique known as warp-synchronous programming, which relies
on the fact that within a warp of threads running on a CUDA GPU, all threads execute instructions synchronously. The code
uses this to avoid __syncthreads() when threads within a warp are sharing data via __shared__ memory. It is important to note
that for this to work correctly without race conditions on all GPUs, the shared memory used in these warp-synchronous expressions
must be declared volatile. If it is not declared volatile, then in the absence of __syncthreads(), the compiler is free to
delay stores to __shared__ memory and keep the data in registers (an optimization technique), which will result in incorrect
execution. So please heed the use of volatile in these samples and use it in the same way in any code you derive from them.
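
For reference, the underlying key-value radix sort can be sketched sequentially. This Python version (illustrative, not the
warp-synchronous CUDA code) performs one stable counting-sort pass per digit, the same least-significant-digit strategy the
GPU implementation parallelizes.

```python
def radix_sort_pairs(keys, values, bits=8):
    """LSD radix sort of unsigned-integer keys with carried values.

    Each pass is a stable bucket sort on one `bits`-wide digit, so
    earlier passes' ordering is preserved for equal digits.
    """
    mask = (1 << bits) - 1
    pairs = list(zip(keys, values))
    max_key = max(keys, default=0)
    shift = 0
    while (max_key >> shift) > 0 or shift == 0:
        buckets = [[] for _ in range(mask + 1)]
        for k, v in pairs:
            buckets[(k >> shift) & mask].append((k, v))
        # Stable concatenation in digit order.
        pairs = [p for b in buckets for p in b]
        shift += bits
    return pairs
```

Note that equal keys keep their original relative order (stability), which is what makes a (key, value) radix sort usable
as a building block for larger sorts.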

This example demonstrates an efficient CUDA implementation of parallel prefix sum, also known as "scan". Given an array of
numbers, scan computes a new array in which each element is the sum of all the elements before it in the input array.
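
The work-efficient parallel formulation (Blelloch's up-sweep/down-sweep) can be sketched sequentially. This Python version
mirrors the per-block tree traversal of the CUDA kernel and, like the in-block kernel, assumes a power-of-two input length.

```python
def blelloch_scan(xs):
    """Work-efficient exclusive prefix sum (Blelloch scan).

    Up-sweep builds partial sums in a binary tree; down-sweep pushes
    them back so each slot holds the sum of everything before it.
    """
    a = list(xs)
    n = len(a)
    assert n & (n - 1) == 0, "power-of-two length, as in the in-block kernel"
    stride = 2
    while stride <= n:                      # up-sweep (tree reduction)
        for i in range(0, n, stride):
            a[i + stride - 1] += a[i + stride // 2 - 1]
        stride *= 2
    a[n - 1] = 0                            # clear the root
    stride = n
    while stride >= 2:                      # down-sweep (distribute sums)
        for i in range(0, n, stride):
            t = a[i + stride // 2 - 1]
            a[i + stride // 2 - 1] = a[i + stride - 1]
            a[i + stride - 1] += t
        stride //= 2
    return a
```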

This sample demonstrates the use of CUDA streams for concurrent execution of several kernels on devices which provide HyperQ
(SM 3.5). Devices without HyperQ (SM 2.0 and SM 3.0) will run a maximum of two kernels concurrently.

This sample implements bitonic sort and odd-even merge sort (also known as Batcher's sort), algorithms belonging to the class
of sorting networks. While generally subefficient on large sequences compared to algorithms with better asymptotic algorithmic
complexity (i.e. merge sort or radix sort), these may be the algorithms of choice for sorting batches of short- to mid-sized
(key, value) array pairs. Refer to the excellent tutorial by H. W. Lang: http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/networks/indexen.htm
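
A data-independent compare-exchange schedule is what makes these networks GPU-friendly. The following Python sketch of
bitonic sort, one of the two networks the sample implements (illustrative, not the sample's kernel), runs the identical
sequence of comparisons regardless of the input order.

```python
def bitonic_sort(xs):
    """Bitonic sorting network for a power-of-two-length sequence.

    The comparison pattern depends only on the indices (i, i ^ j),
    never on the data, so every 'run' performs the same work.
    """
    a = list(xs)
    n = len(a)
    assert n & (n - 1) == 0, "network is defined for power-of-two lengths"
    k = 2
    while k <= n:              # size of the bitonic subsequences being merged
        j = k // 2
        while j >= 1:          # compare-exchange distance within each merge
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```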

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample shows how to perform a reduction operation on an array of values using the __threadfence() intrinsic to produce a
single value in a single kernel (as opposed to two or more kernel calls as shown in the "reduction" CUDA Sample). Single-pass
reduction requires global atomic instructions (compute capability 2.0 or later) and the __threadfence() intrinsic (CUDA 2.2
or later).

A simple program illustrating how to use the CUDA Context Management API, along with the new CUDA 4.0 parameter-passing and
kernel launch APIs. CUDA contexts can be created separately and attached independently to different threads.

A CUDA Sample that demonstrates how to use batched CUBLAS API calls to improve overall performance.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

An NPP CUDA Sample that demonstrates how to use the NPP FilterBox function to perform a box filter.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample implements a conjugate gradient solver on the GPU using the CUBLAS and CUSPARSE libraries.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample implements a preconditioned conjugate gradient solver on the GPU using the CUBLAS and CUSPARSE libraries.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample implements a conjugate gradient solver on the GPU using the CUBLAS and CUSPARSE libraries, using Unified Memory.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

A simple CUDA Sample that demonstrates how to use the FreeImage library with NPP.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This CUDA Sample demonstrates how to use NPP for histogram equalization for image data.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample demonstrates how to perform image segmentation using the NPP GraphCut function.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample demonstrates a simple image processing pipeline. First, a JPEG file is Huffman decoded, inverse DCT transformed,
and dequantized. Then the different planes are resized. Finally, the resized image is quantized, forward DCT transformed,
and Huffman encoded.
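
The DCT stage at the heart of this pipeline can be sketched in one dimension. This Python pair (illustrative names, not the
sample's API) applies the unnormalized DCT-II that JPEG uses per 8-sample row or column, and the matching DCT-III inverse,
so a transform followed by its inverse reproduces the input.

```python
import math

def dct2(xs):
    """1D DCT-II: X_k = sum_i x_i * cos(pi * k * (i + 0.5) / n)."""
    n = len(xs)
    return [sum(x * math.cos(math.pi * k * (i + 0.5) / n)
                for i, x in enumerate(xs)) for k in range(n)]

def idct2(cs):
    """Inverse of dct2 (DCT-III with the matching 2/n scaling)."""
    n = len(cs)
    return [(cs[0] / 2 + sum(cs[k] * math.cos(math.pi * k * (i + 0.5) / n)
                             for k in range(1, n))) * 2 / n
            for i in range(n)]
```

In the full pipeline the 2D transform is just this 1D transform applied first to rows and then to columns of each 8x8 block.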

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample uses Monte Carlo simulation to price single Asian options using the NVIDIA CURAND library.
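
A minimal Monte Carlo sketch of arithmetic-average Asian call pricing, assuming the underlying follows geometric Brownian
motion. This Python stand-in (the function name and all parameter values below are illustrative, not the sample's defaults)
uses the standard library RNG in place of CURAND.

```python
import math
import random

def asian_call_mc(s0, strike, r, sigma, t, steps, paths, seed=7):
    """Price an arithmetic-average Asian call by Monte Carlo.

    Each path simulates the asset at `steps` intervals, averages it,
    and pays max(average - strike, 0); the mean discounted payoff
    over `paths` paths estimates the price.
    """
    rng = random.Random(seed)
    dt = t / steps
    drift = (r - 0.5 * sigma * sigma) * dt
    vol = sigma * math.sqrt(dt)
    payoff_sum = 0.0
    for _ in range(paths):
        s, total = s0, 0.0
        for _ in range(steps):
            s *= math.exp(drift + vol * rng.gauss(0.0, 1.0))
            total += s
        payoff_sum += max(total / steps - strike, 0.0)
    return math.exp(-r * t) * payoff_sum / paths
```

The GPU version differs mainly in generating the normal variates with CURAND and accumulating payoffs with a parallel reduction.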

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample demonstrates the Mersenne Twister random number generator GP11213 in cuRAND.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

This sample illustrates pseudorandom and quasirandom numbers produced by CURAND.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

Example of using CUBLAS with the new CUBLAS API interface available in CUDA 4.0.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

Example of using CUFFT. In this example, CUFFT is used to compute the 1D-convolution of some signal with some filter by transforming
both into frequency domain, multiplying them together, and transforming the signal back to time domain.
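
The convolution theorem underlying all of these CUFFT examples can be sketched with a naive DFT standing in for CUFFT
(function names are illustrative): transform signal and filter, multiply pointwise, and inverse-transform to obtain the
circular convolution.

```python
import cmath

def dft(xs, inverse=False):
    """Naive O(n^2) discrete Fourier transform; stands in for CUFFT."""
    n = len(xs)
    sign = 2j if inverse else -2j
    out = [sum(x * cmath.exp(sign * cmath.pi * k * i / n)
               for i, x in enumerate(xs)) for k in range(n)]
    return [v / n for v in out] if inverse else out

def fft_convolve(signal, filt):
    """Circular convolution via the convolution theorem:
    transform both inputs, multiply pointwise, transform back."""
    fs, ff = dft(signal), dft(filt)
    return [v.real for v in dft([a * b for a, b in zip(fs, ff)],
                                inverse=True)]
```

Convolving with a unit impulse returns the signal unchanged, a handy sanity check.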

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

Example of using CUFFT. In this example, CUFFT is used to compute the 2D-convolution of some signal with some filter by transforming
both into frequency domain, multiplying them together, and transforming the signal back to time domain on multiple GPUs.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

Example of using CUFFT. In this example, CUFFT is used to compute the 1D-convolution of some signal with some filter by transforming
both into frequency domain, multiplying them together, and transforming the signal back to time domain. The difference between
this example and the Simple CUFFT example is that the multiplication step is done by the CUFFT kernel with a user-supplied
CUFFT callback routine, rather than by a separate kernel call.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

Example of using CUFFT. In this example, CUFFT is used to compute the 1D-convolution of some signal with some filter by transforming
both into frequency domain, multiplying them together, and transforming the signal back to time domain on multiple GPUs.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies
are not available on the system, the sample will not be installed. If these dependencies are available, but not installed,
the sample will waive itself at build time.

Some CUDA Samples rely on third-party applications and/or libraries, or features provided by the CUDA Toolkit and Driver,
to either build or execute. These dependencies are listed below.

If a sample has a dependency that is not available on the system, the sample will not be installed. If a sample has a third-party
dependency that is available on the system, but is not installed, the sample will waive itself at build time.

These third-party dependencies are required by some CUDA samples. If available, these dependencies are either installed on
your system automatically, or are installable via your system's package manager (Linux) or a third-party website.

FreeImage is an open source imaging library. FreeImage can usually be installed on Linux using your distribution's package
manager system. FreeImage can also be downloaded from the FreeImage website. FreeImage is also redistributed with the CUDA Samples.

MPI (Message Passing Interface) is an API for communicating data between distributed processes. An MPI compiler can be installed
using your Linux distribution's package manager system. It is also available from some online resources, such as Open MPI.

DirectX is a collection of APIs designed to allow development of multimedia applications on Microsoft platforms. For Microsoft
platforms, NVIDIA's CUDA Driver supports DirectX. Several CUDA Samples for Windows demonstrate CUDA-DirectX interoperability.
Building such samples requires the DirectX SDK (June 2010 or newer); it needs to be installed only on Windows 7 and Windows
Server 2008, as other Windows operating systems do not need to explicitly install the DirectX SDK.

OpenMP is an API for multiprocessing programming. OpenMP can be installed using your Linux distribution's package manager
system. It usually comes preinstalled with GCC. It can also be found at the OpenMP website.