In the previous CUDACasts episode, we saw how to flash your Jetson TK1 to the latest release of Linux4Tegra, and install both the CUDA toolkit and OpenCV SDK. We’ll continue exploring the power efficiency the Jetson TK1 Kepler-based GPU brings to computer vision by porting a simple OpenCV sample to run on the GPU. We’ll explore computer vision further in a future CUDACast when we look at the VisionWorks toolkit from NVIDIA.

The Jetson TK1 development kit has fast become a must-have for mobile and embedded parallel computing due the amazing level of performance packed into such a low-power board. In this and the following CUDACast, you’ll learn how to get started building computer vision applications on your Jetson TK1 using CUDA and the OpenCV library.

One of the main reasons for accelerating code on an NVIDIA GPU is for an increase in application performance. This is why it’s important to use the best tools available to help you get the performance you’re looking for. CUDA 6 includes great improvements to the guided analysis tool in the NVIDIA Visual Profiler. Watch today’s CUDACast to see how to use guided analysis to locate potential optimizations for your GPU code.

CUDA 6 introduces Unified Memory, which dramatically simplifies memory management for GPU computing. Now you can focus on writing parallel kernels when porting code to the GPU, and memory management becomes an optimization.

The CUDA 6 Release Candidate is now publicly available. In today’s CUDACast, I will show you some simple examples showing how easy it is to accelerate code on the GPU using Unified Memory in CUDA 6, and how powerful Unified Memory is for sharing C++ data structures between host and device code. If you’re interested in looking at the code in detail, you can find it in the Parallel Forall repository on GitHub. You can also check out the great Unified Memory post by Mark Harris.

The OpenACC 2.0 specification focuses on increasing programmer productivity by addressing limitations of OpenACC 1.0. Previously, programmers were required to use structured code blocks to control when to transfer data to or from the device, which limited the applications that could quickly be accelerated without major code restructuring. It also prevented adding OpenACC directives to handle data movement in the constructors and destructors of C++ classes.

OpenACC 2.0 provides unstructured data lifetime pragmas to make it easier to instruct the compiler to transfer data most efficiently. In today’s CUDACast, I will cover three unstructured data lifetime methods within a single piece of code. Because the example code is fairly long, I’ve uploaded the source to GitHub for you to look at.

Continuing the Thrust mini-series (see Part 1), today’s episode of CUDACasts focuses on a few of the algorithms that make Thrust a flexible and powerful parallel programming library. You’ll also learn how to use functors, or C++ “function objects”, to customize how Thrust algorithms process data.

In the next CUDACast in this Thrust mini-series, we’ll take a look at how fancy iterators increase the flexibility Thrust has for expressing parallel algorithms in C++.

Whenever I hear about a developer interested in accelerating his or her C++ application on a GPU, I make sure to tell them about Thrust. Thrust is a parallel algorithms library loosely based on the C++ Standard Template Library. Thrust provides a number of building blocks, such as sort, scans, transforms, and reductions, to enable developers to quickly embrace the power of parallel computing. In addition to targeting the massive parallelism of NVIDIA GPUs, Thrust supports multiple system back-ends such as OpenMP and Intel’s Threading Building Blocks. This means that it’s possible to compile your code for different parallel processors with a simple flick of a compiler switch.

For this first in a mini-series of screencasts about Thrust, we’ll write a simple sorting program and execute it on both a GPU and a multi-core CPU. In upcoming episodes, we’ll explore more capabilities of Thrust which really show its flexibility and power. For more examples of using Thrust, read the post Expressive Algorithmic Programming with Thrust, and check out the Thrust Quick Start Guide.

The key to the power of GPUs is their 1000’s of parallel processors that execute threads. Anyone who has worked with even a handful of threads know how easy it can be to introduce race conditions, and how difficult it can be to debug and fix these errors. Because a modern GPU can have thousands of simultaneously executing threads, NVIDIA engineers felt it was imperative to create an incredibly powerful tool for detecting and debugging race conditions.

This racecheck tool comes as part of the cuda-memcheck command-line utility. In CUDA 5.5 a new racecheck analysis mode presents much more human-readable analysis of your code, even reporting which source lines conflict with other lines. In this episode of CUDACasts we use a simple version of Conway’s Game of Life to show the new racecheck features cuda-memcheck. We’ll start with a few race condition bugs, and then use the analysis tool to find and fix them.

In the world of high-performance computing, it is important to understand how your code affects the operating characteristics of your HW. For example, if your program executes inefficient code, it may cause the GPU to work harder than it needs to, leading to higher power consumption, and a potential slow-down due to throttling.

A new profiling feature in CUDA 5.5 allows you to profile the clocks, power, and thermal characteristics of the GPU as it executes your code. This feature is available in the NVIDIA Visual Profiler on Linux and 64-bit Windows 7/8 and NSight Eclipse Edition on Linux. Learn how to activate and use this feature by watching CUDACasts Episode 13.

So far in the CUDA Python mini-series on CUDACasts, I introduced you to using the @vectorize decorator and CUDA libraries, two different methods for accelerating code using NVIDIA GPUs. In today’s CUDACast, I’ll be demonstrating how to use the NumbaPro compiler from Continuum Analytics to write CUDA Python code which runs on the GPU.

In CUDACast #12, we’ll continue using the Monte Carlo options pricing example, and I’ll show how to write the step function in CUDA Python rather than using the @vectorize decorator. In addition, by using the nvprof command-line profiler, we’ll be able to see the speed-up we’re able to achieve by writing the code explicitly in CUDA.