Guest Lectures: Case Studies

Graphics Processing Units (GPUs) have been growing in
popularity due to their impressive processing capabilities and,
equipped with general-purpose programming languages such as NVIDIA's
CUDA, are becoming the platform of choice in the scientific computing community.

This talk will focus on two topics: how to utilize GPUs to accelerate critical medical image reconstruction algorithms, and how best to
utilize multiple GPUs to accelerate a range of applications. Previous studies that used GPUs focused on obtaining significant performance
gains from execution on a single GPU. These studies employed
low-level, architecture-specific tuning in order to achieve sizeable benefits over multicore CPU execution.

In this talk, we consider the benefits of running on multiple
(parallel) GPUs to provide further orders of magnitude of speedup. Our
approach attempts to reduce or eliminate the need to apply low-level fine
tuning to extract performance from a GPU. Our methodology allows developers to accurately predict execution time for GPU applications
while varying the number and configuration of the GPUs, and the size
of the input data set.
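
The methodology itself is not reproduced here, but a hedged sketch of the kind of first-order model such a prediction might rest on is shown below; all constants, the even data split, and the function name predict_runtime are illustrative placeholders, not figures or code from the study.

/* Hypothetical first-order multi-GPU timing model (illustrative only). */
#include <stdio.h>

double predict_runtime(double bytes, double flops, int num_gpus,
                       double pcie_gbps, double mem_gbps, double gflops)
{
    double per_gpu_bytes = bytes / num_gpus;              /* assume an even data split */
    double per_gpu_flops = flops / num_gpus;
    double transfer = per_gpu_bytes / (pcie_gbps * 1e9);  /* host-to-device copy */
    double memory   = per_gpu_bytes / (mem_gbps  * 1e9);  /* device memory traffic */
    double compute  = per_gpu_flops / (gflops    * 1e9);  /* arithmetic work */
    /* Kernel is either bandwidth- or compute-bound, plus the transfer cost. */
    return transfer + (memory > compute ? memory : compute);
}

int main(void)
{
    for (int gpus = 1; gpus <= 4; gpus *= 2)
        printf("%d GPU(s): %.3f s\n", gpus,
               predict_runtime(4e9, 2e12, gpus, 6.0, 140.0, 900.0));
    return 0;
}

Varying num_gpus and the data-size arguments in a model of this kind is what lets a developer compare candidate configurations before buying hardware.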

We believe this is a natural next step in GPU computing because it
allows researchers to determine the most appropriate GPU configuration
for an application without having to purchase hardware, or write
customized code for a multiple-GPU implementation.

CUDA Tricks and High-Performance Computational Physics

In this
talk I will discuss advanced tricks to maximize CUDA performance,
taking examples from my physics research. The topics to be covered
include: how to maximize device bandwidth, the (unofficial) CUDA
disassembler decuda, and a discussion of the optimizations performed by
the CUDA compiler. I will also mention various gotchas, pitfalls,
tips, and tricks that I have encountered. An interactive discussion
format is encouraged.
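
As a concrete illustration of the bandwidth topic (a generic example, not code from the talk), the pair of kernels below contrasts coalesced and strided global-memory access; adjacent threads touching adjacent addresses let the hardware merge reads into wide transactions, while a large stride defeats that.

// Illustrative only: coalesced versus strided global-memory reads.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];                 // thread k reads element k: coalesced
}

__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i * stride) % n];  // scattered addresses: poor bandwidth
}

Timing the two kernels on the same data size makes the effective-bandwidth gap easy to measure.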

Out-of-Core Programming with NVIDIA's CUDA

The word "core" in this title has a double meaning.
The older term core refers to magnetic-core memory, an early implementation of RAM. The
newer term core refers to a CPU or GPU core. For example, each NVIDIA
SM (streaming multiprocessor) currently has eight cores. The amount of
on-chip memory, or cache, on an SM is some small number of kilobytes.
We will abuse the term out-of-core to refer to data that lies off-chip (outside the SM).
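
For readers who want to see these per-SM numbers on their own card, the CUDA runtime reports them directly; a minimal query (not part of the talk) looks like this:

// Query the on-chip resources discussed above.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Registers per block: %d\n", prop.regsPerBlock);
    return 0;
}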

The
key to efficiency in many CUDA algorithms is to efficiently move data
between on-chip cache (for in-core programming), and off-chip global
memory on the video board (for out-of-core programming). As the dual
use of the term core implies, CUDA programming is not the first example
in which skill in out-of-core programming has been important. This talk
will clarify matters by abstracting the issue of out-of-core
programming. It will then discuss some principles that we have found
useful in our own lab, and their application both to CUDA programming and to disk-based programming.
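
A minimal sketch of that on-chip/off-chip movement (illustrative, not code from the lab): each block stages a tile of global-memory data, plus a one-element halo, into shared memory, then computes a 3-point average entirely from on-chip data. It assumes the kernel is launched with blockDim.x equal to TILE.

#define TILE 256

__global__ void smooth3(const float *in, float *out, int n)
{
    __shared__ float tile[TILE + 2];
    int g = blockIdx.x * TILE + threadIdx.x;        // global ("out-of-core") index
    int s = threadIdx.x + 1;                        // shared-memory ("in-core") index

    if (g < n)
        tile[s] = in[g];                            // off-chip -> on-chip
    if (threadIdx.x == 0)
        tile[0] = (g > 0) ? in[g - 1] : 0.0f;       // left halo
    if (threadIdx.x == TILE - 1 || g == n - 1)
        tile[s + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;   // right halo
    __syncthreads();

    if (g < n)
        out[g] = (tile[s - 1] + tile[s] + tile[s + 1]) / 3.0f;  // on-chip compute
}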

Radar Pulse Compression using Modern NVIDIA GPUs

Over the past several years, graphics processing units (GPUs) have
gained interest as general purpose highly parallel coprocessors. Early
adopters were forced to use traditional 3D graphics application
programming interfaces (APIs) in order to access the computational
power of the GPU. This process of recasting general purpose problems
into graphical terms can be time consuming and create obscure code.
NVIDIA's Compute Unified Device Architecture (CUDA) framework, a
C-language development environment for NVIDIA GPUs, was introduced to
ease the burden placed on the general-purpose GPU programmer. In
parallel with the CUDA release, NVIDIA also released
implementations of the BLAS and FFT libraries for the GPU under the
names CUBLAS and CUFFT, respectively.

Previous research has shown the vast computational power of GPUs for
signal processing. Modern radar signal processing is a data parallel
operation that benefits from parallel processing architectures. This
investigation will focus on the real-world benefit of GPUs for radar
pulse compression. First, the performance of 1D and 2D FFTs on a GPU
via CUFFT will be compared to a modern multi-core CPU
implementation using FFTW. Subsequently, these performance results
will inform the implementation of two surrogate radar pulse compression
chains of differing processing complexity, which will in turn be
benchmarked in the same fashion as the FFTs.
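
The benchmarks themselves are not shown here, but the core of a frequency-domain pulse-compression step built on CUFFT typically reduces to something like the sketch below (error checking omitted; d_ref is assumed to already hold the forward FFT of the reference pulse, and the 1/n factor accounts for CUFFT's unnormalized inverse transform).

#include <cufft.h>

// Multiply the signal spectrum by the conjugate of the reference spectrum.
__global__ void matched_filter(cufftComplex *sig, const cufftComplex *ref, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        cufftComplex s = sig[i], r = ref[i], out;
        out.x = (s.x * r.x + s.y * r.y) / n;   // Re(s * conj(r)) / n
        out.y = (s.y * r.x - s.x * r.y) / n;   // Im(s * conj(r)) / n
        sig[i] = out;
    }
}

void pulse_compress(cufftComplex *d_sig, const cufftComplex *d_ref, int n)
{
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);
    cufftExecC2C(plan, d_sig, d_sig, CUFFT_FORWARD);     // to the frequency domain
    matched_filter<<<(n + 255) / 256, 256>>>(d_sig, d_ref, n);
    cufftExecC2C(plan, d_sig, d_sig, CUFFT_INVERSE);     // back to the time domain
    cufftDestroy(plan);
}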

The Present and Future of GPU Computing

My presentation will cover alternative GPU topics, focusing on hardware differences across vendors and future feature sets.

In this lecture I hope to give an overview of contemporary GPGPU topics
including a survey of architecture-agnostic languages (OpenCL, BGSP,
etc), a discussion of hand-coded algorithms (Folding@Home, for
instance), GPU overclocking hacks, vendor hardware options and
roadmaps, and publicly accessible multi-GPU systems to test out your
massively parallel codes.

High-Productivity Supercomputing: Metaprogramming GPUs

Tuning high-performance computational kernels relies on detailed
machine knowledge, is error-prone and often tedious. It is thus an
attractive target for automation. This is "metaprogramming": Programs write and tune other programs.

After a brief introduction to the ideas behind modern,
high-productivity scripting languages, I will discuss PyCuda, a toolkit
for making CUDA-based GPUs accessible from Python, one such language.
PyCuda allows the easy creation of high-performance script+GPU hybrid
computational codes. In addition, PyCuda provides a vehicle for metaprogramming of GPUs.
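
To make the metaprogramming idea concrete (a generic illustration, not PyCuda's own code): the kernel below leaves its tuning parameters as preprocessor symbols, so a driving script, for instance one using PyCuda's run-time compilation, can emit many variants such as -DBLOCK_SIZE=128 -DUNROLL=4, time each one, and keep the fastest.

#ifndef BLOCK_SIZE
#define BLOCK_SIZE 256   // candidate values swept by the metaprogram
#endif
#ifndef UNROLL
#define UNROLL 2
#endif

__global__ void saxpy_tuned(int n, float a, const float *x, float *y)
{
    int base = blockIdx.x * BLOCK_SIZE * UNROLL + threadIdx.x;
#pragma unroll
    for (int k = 0; k < UNROLL; ++k) {
        int i = base + k * BLOCK_SIZE;   // keeps accesses coalesced across threads
        if (i < n)
            y[i] = a * x[i] + y[i];
    }
}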

In the final part of the talk, I will outline how we used these tools
to implement a self-tuning GPU-based Discontinuous Galerkin solver. On
real-world 3D electromagnetic scattering problems, a single GPU with
this solver achieves speedups between 40 and 60 over a
current-generation CPU.

GPUs for Computer Vision: Overview, Examples & Opportunities

Flexibly programmable graphics processors usher in a new era for
computer vision and image processing. Many vision algorithms map
extremely well to the GPU's massively parallel architecture,
perhaps even as well as graphics algorithms themselves. Techniques
previously limited to off-line experimentation or expensive
supercomputers can now be deployed for real-time use in consumer
machines. This talk will introduce computer vision on the GPU,
and highlight some features especially well suited towards vision
tasks. We'll show examples of optical flow and stereo vision, two
computationally intensive algorithms enabled for real-time use on the
GPU. Finally we'll conclude with a discussion of interesting
future directions and exciting opportunities for research and products.

CUDA Optimization, an Image Processing Case Study

Graphics processors can be easily programmed to provide significant
acceleration in many common parallel tasks. However, with additional
architecture knowledge and understanding
of optimization strategies, a savvy programmer can unleash the full
potential of the GPU's massive memory bandwidth and ensure the
processing resources are utilized to their fullest extent. In this
talk, we'll explore several different approaches to a very
simple but ubiquitous image processing algorithm, the convolution. A
naive approach shows the detrimental impact of poorly written code, a
simple approach achieves decent results with little effort or code
complexity, and a few highly optimized techniques
realize the GPU's full power for the most demanding tasks. The
techniques explored in this simple but illustrative example will serve
as a base for understanding the optimization strategies to apply
towards more complex algorithms.
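
The talk's own code is not reproduced here, but the contrast it draws can be sketched roughly as follows: a naive kernel re-reads every neighborhood from global memory, while a tiled kernel first stages an image block plus its apron into shared memory. The filter radius, tile sizes, clamped borders, and the assumptions that the coefficients were copied into d_filter with cudaMemcpyToSymbol and that the tiled kernel is launched with a TILE_W x TILE_H block are all choices made for this illustration.

#define RADIUS 2
#define TILE_W 16
#define TILE_H 16

__constant__ float d_filter[(2 * RADIUS + 1) * (2 * RADIUS + 1)];

// Naive: every thread reads its whole neighborhood from global memory.
__global__ void convolve_naive(const float *in, float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float acc = 0.0f;
    for (int dy = -RADIUS; dy <= RADIUS; ++dy)
        for (int dx = -RADIUS; dx <= RADIUS; ++dx) {
            int xx = min(max(x + dx, 0), w - 1);   // clamp at the image border
            int yy = min(max(y + dy, 0), h - 1);
            acc += in[yy * w + xx] *
                   d_filter[(dy + RADIUS) * (2 * RADIUS + 1) + (dx + RADIUS)];
        }
    out[y * w + x] = acc;
}

// Tiled: stage the tile plus apron into shared memory, then convolve on-chip.
__global__ void convolve_tiled(const float *in, float *out, int w, int h)
{
    __shared__ float tile[TILE_H + 2 * RADIUS][TILE_W + 2 * RADIUS];
    int x = blockIdx.x * TILE_W + threadIdx.x;
    int y = blockIdx.y * TILE_H + threadIdx.y;

    for (int ty = threadIdx.y; ty < TILE_H + 2 * RADIUS; ty += TILE_H)
        for (int tx = threadIdx.x; tx < TILE_W + 2 * RADIUS; tx += TILE_W) {
            int gx = min(max((int)(blockIdx.x * TILE_W) + tx - RADIUS, 0), w - 1);
            int gy = min(max((int)(blockIdx.y * TILE_H) + ty - RADIUS, 0), h - 1);
            tile[ty][tx] = in[gy * w + gx];        // cooperative, mostly coalesced load
        }
    __syncthreads();

    if (x >= w || y >= h) return;
    float acc = 0.0f;
    for (int dy = -RADIUS; dy <= RADIUS; ++dy)
        for (int dx = -RADIUS; dx <= RADIUS; ++dx)
            acc += tile[threadIdx.y + dy + RADIUS][threadIdx.x + dx + RADIUS] *
                   d_filter[(dy + RADIUS) * (2 * RADIUS + 1) + (dx + RADIUS)];
    out[y * w + x] = acc;
}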

The study of biological vision and the
creation of artificial vision systems are naturally intertwined –
exploration of the neuronal substrates of visual processing provides
clues and inspiration for artificial systems, and artificial systems,
in turn, serve as important generators of new ideas and working
hypotheses. However, while systems neuroscience has provided
inspiration for some of the "broad-stroke" properties of the visual
system, much is still unknown. Even for those qualitative properties
that most biologically-inspired models share, experimental data
currently provide little constraint on their key parameters.
Consequently, it is difficult to truly evaluate a set of computational
ideas, since the performance of a model depends strongly on its
particular instantiation – the size of the pooling kernels, the number
of units per layer, exponents in normalization operations, etc.

To
pave a way forward, we have developed a high-throughput approach to
more expansively explore the possible range of biologically-inspired
models, including models of larger, more realistic scale, leveraging
recent advances in commodity stream processing hardware - particularly,
high-end NVIDIA GPUs. In analogy to high-throughput screening
approaches in molecular biology and genetics, we generated and trained
thousands of potential network architectures and parameter
instantiations, and "screened" the visual representations produced by
these models using an object recognition task. From these candidate
models, the most promising were selected for further analysis. We
have shown that this approach can yield significant, reproducible gains
in performance on a basic object recognition task, and that it can
offer insight into which computational ideas are most important for
achieving this performance.

In this talk, I'll also highlight how the
application of flexible programming tools, such as high-level scripting
and template metaprogramming, can enable large performance gains, while
managing complexity for the developer. As the scale of available
computational power continues to expand, our approach holds great
potential both for accelerating progress in artificial vision, and for
generating new, experimentally-testable hypotheses for the study of
biological vision.