Abstract
We will describe our project in 250 words or less:
We used a GPU to efficiently optimize the parameters to various
technical indicators often used on chart analysis for currency
trading. Issues encountered include: the limited amount of shared
memory; how to precompute some values using parallelism; how to
distribute tasks over the multiprocessors; and figuring out where the
GPU is needed and where it isn't.

Abstract
Automated graph drawing remains a difficult placement and layout problem. This problem is difficult in part due to the complexity of formulating good algorithms to draw graphs which are aesthetically pleasing for human visualization. In this project, we implemented a force-directed placement algorithm using CUDA that will solve the graph layout problem by using an energy minimization technique. This work aims to demonstrate the performance advantage of a GPU implementation as compared to a CPU implementation. Taking advantage of the inherent parallel computational power in the GPU and with some additional memory bandwidth optimizations for our algorithm, the GPU implementation was able to achieve up to a 58x speed-up as compared to the CPU.

Abstract
This project focuses on how GPUs can be used to take advantage of the inherent parallelism in neural networks to provide a better implementation in terms of performance. We focused on the Restricted Boltzmann machine, a popular type of neural network. The implemented algorithm resulted in a computational speed of 672 million connections-per-second and a speed-up of 66-fold over an optimized C++ program running on a 2.83GHz Intel processor.

Abstract
Speech recognition engines can be used in many different applications. An example of such an application is the execution of uttered commands by aircraft pilots to alleviate their workload. All the practical applications rely on the achievement of very high recognition accuracy. Two of the prominent methods include dynamic time warping and hidden Markov model. This paper examines an implementation of the dynamic time warping algorithm with mel scale frequency cepstral coefficients on a graphics processing unit (GPU) using CUDA. By offloading some of the work from the CPU to the GPU, it is shown that this approach is feasible and yields a speed increase of a factor of 5.8.

Abstract
Feature detection and matching is one of the fundamental problems in low-level computer vision. Given a series of images of the same object, feature detection and matching algorithms try to (1) repeatably detect the same point of interest in every image, regardless of the scale and orientation of the object, (2) match each point of interest from one image with the same point in another image. Feature detection and matching forms the basis for many higher-level algorithms such as object recognition, robot navigation, panorama stitching, and three-dimensional scene modeling. The Scale Invariant Feature Transform (SIFT) has become the standard algorithm for this task. However, while its performance is impressive on difficult matching tasks, it is too computationally expensive to use in online or real-time applications. In response to this, Bay et al. created the Speeded-Up Robust Feature (SURF), which uses an integral image (essentially a 2D prefix sum) to approximate many of the calculations in the SIFT algorithm. The SURF algorithm is much faster than SIFT but it is still not fast enough to use in real-time applications. Our interest is the use of feature detection an matching to enable autonomous robot navigation. To this end, we have implemented a pure CUDA version of the SURF algorithm. Our implementation exhibits a 3 to 40 times speedup (depending on the compute capability of the hardware) and we have verified its correctness through rigorous comparison with a parallel CPU implementation.

Abstract
I implemented an image processing algorithm that takes an image and partitions it into segments called superpixels. It was done using CUDA on an NVIDIA GTX280 and runs at 2-3 frames per second on a 640×480 image, with room for future performance enhancements.

Abstract
Peer-to-peer content distribution systems have become enormously popular on today's Internet. In this project, we explore the feasibility of using GPU to expdite parallel P2P simulation performed in rounds. We consider the problem of disseminating k data blocks from a single source to all the nodes in a network by letting each peer transmit a block to one its neighbors in each round. Two commonly used disseminating algorithms are considered: 1) Random Useful Block (RUB), where each peer chooses a random neighbor to transmit a random useful block in each round, and 2) Global Rarest First (GRF), where each peer tries to transmit the rarest block to a random neighbor. To optimize the performance, we devised 2 algorithms for GPU-based RUB simulation either with or without using shared memory, and 3 algorithms for GRF performing parallelization either at the peer level or the data block level. Using sequential CPU code as a benchmark, we can obtain up to 8x speedup for RUB simulation with a small error, and up to 25x speedup for GRF.

Abstract
I implemented a convolutional neural network with one layer of
convolution. I tested it on the CIFAR-10 dataset, which consists of
6000 32×32 colour images in each of 10 classes. The convolutional net
does well on the classification task and takes roughly 140x less time
to train than a CPU implementation.

Quantum computers promise immense improvements in computational power over existing classical computers. Due to quantum superposition of states, a quantum computer allows for the simultaneous manipulation of all possible combinations of a set of bits in a single operation, giving some algorithms an exponential speed-up as compared to their classical versions. Currently, the physical realization of a practical quantum computer has yet to be achieved. Until then, simulations are used to validate quantum algorithms.

Classical simulations of quantum algorithms are time-consuming because of the need to simulate an exponential number of states in parallel. This inherent parallelism makes the problem suitable for implementation on a graphics processor. In this project, we have selected the Quantum Fourier Transform, which lies at the heart of many other well-known quantum algorithms, for implementation with CUDA. We have achieved a 160-fold speedup over “libquantum”, an open-source CPU quantum computer simulation toolkit.

Abstract
Transient stability analysis examines the dynamic behavior of a power system for as much as several seconds following a power disturbance.The computational complexity of transient stability analysis has limited the ability to run this type of analysis to support decision making at the time of a disturbance and prevent cascading failures. This project investigates the feasibility of performing transient stability analysis on graphics multiprocessors, an emerging general purpose parallel platform.
The transient stability analysis involves solution of extremely large systems of differential and algebraic equations. Various integration techniques are available to transform the differential equations to a set of nonlinear equations. Typically, these equations are solved with iterative methods such as the Newton-Raphson algorithm, which in turn requires the solution of large sparse linear systems. It has been shown in prior research that direct methods (e.g., LU decomposition) perform poorly on parallel platforms, particularly on massively parallel processors akin to the GPU; therefore, iterative Conjugate Gradient method is implemented for the linear solver. The fact that iterative methods involve just matrix addition and multiplication, which are highly parallelizable, suggest significant speed-up potential.
In order to improve the convergence rate of the conjugate gradient method, Chebychev Polynomial Preconditionner is implemented on GPU. The preconditionner effects the condition number of jacobian matrix and reduce the total number of CG iterations. Number of iterations before and after preconditioning, execution time and speed up for CG and preconditionner process is reported.

Abstract
Analytical Placement (AP) is a CAD algorithm for ASIC placement. This algorithm generates a system of linear equations, which we solve using Gaussian elimination. We demonstrate how the specific sub-problem of Gaussian elimination on AP input matrices has the unique property where the pivot row is always in place, and show how this greatly enhances our ability to parallelize the algorithm. We demonstrate why a sparse-matrix implementation of the elimination algorithm would be difficult if not impossible to implement on a GPU. We then parallelize a full-matrix implementation of the placer on a GPU using CUDA, and show good speedups versus an optimized CPU implementation.

Abstract
As part of the greater effort to find a cure for cancer, the
Ontario Cancer Institute (OCI) is working to improve the throughput of
protein crystallography, which is used to study protein structure. Since
it's difficult to predict under which conditions crystals will form,
many automated experiments with hundreds or thousands of different
solutions are photographed over the course of days or weeks. OCI has
written software to extract image features that can be used to identify
successful experiments, but the task is computationally intensive.
Processing the current image backlog will require over 100,000 years of
CPU time, so even running on many thousands of cores, the CPU-based
software won't finish until 2015. In our project, we have implemented a
CUDA-enabled version of the image analysis software that runs 22 times
faster, which we hope will help cancer researchers determine the
structure of cancer-related proteins faster.

Abstract
The goal of this project was to explore the potential performance improvements that could be gained through the use GPU processing techniques within the CUDA architecture for JPEG compression algorithm. The choice of compression algorithms as the focus was based on examples of data level parallelism found within the algorithms and a desire to explore the effectiveness of cooperative algorithm management between the system CPU and an available GPU. In our project we ported JPEG Compression algorithm in CUDA and achieved xtimes performance.

Abstract
Graphics processors (GPU) are interesting for non-graphics parallel computation because of the potential for more than an order of magnitude of speedup over CPUs. Because the GPU is often presented as a C-like abstraction like Nvidia’s CUDA, little is known about the hardware architecture of the GPU beyond the high-level descriptions documented by the manufacturer.

We develop a suite of micro-benchmarks to measure the CUDA-visible architectural characteristics of the Nvidia GT200 (GTX280) GPU. We measure properties of the arithmetic pipelines, the stack-based handling of branch divergence, and the warp-granularity operation of the barrier synchronization instruction. We confirm that global memory is uncached with ~441 clock cycles of latency, and measure parameters of the three levels of instruction and constant caches and three levels of TLBs.

We succeed in revealing more detail about the GT200 architecture than previously disclosed.