barREUcuda

The CUDA programming language offers new opportunities to a field regarded by many to be fairly difficult — parallel programming. Through a simple framework that comparable to Google’s recent (over hyped) map reduce frame work, we can identify certain components in a serial program and parallelize repetitive function calls.

The concept behind CUDA is fairly simple. You break your algorithm into small pieces and hand each thread a little piece of computation. And, since most algorithms have a while loop embedded in them, the code inside the while loop can be the code handed to each thread. If that loop is repetitive, i.e. there is little difference between them, then we can gain a performance boost, since the subroutines are called all in one clock cycle (or warp).

CUDA accomplishes this by extending (only a bit) C. They add declarations that define whether a function runs on the device (the GPU) or the host (the CPU). They also add new types of memory which need to be studied more carefully.

Take, for example, quicksort, which is defined by the following python/functional code.

Recursion in an algorithm is a big hint that the algorithm can be parallelized — even though CUDA’s function cannot recursed. Most algorithms can be easily parallelized. This is specifically the case for programs that simulate a large quantity of relatively independent objects. It so happens that most physics can be reduced to simulating many particles’ interactions with the environment, and many abstractions exist to simulate this interaction. Take 1D rule based cellular automata, for example. In a serial program on would keep track of the previous row and would write the following C function to compute the next row

The above code will easily achieve more than 20x speedup, but with some optimization, it can run much much faster. A similar program can be sketched for 2D cellular automata such as the Game of Life, and a 3D cellular automata could also be implemented (when I find an example of 3D cellular automata).

The CUDA SDK provides a variety of examples exposing the power of parallel processing. CUDA’s N-Body problem, for example, simulates tens of thousands of particles using the direct method which calculates the force between two particles by a slight variation of Newton’s 2-Body formula

Such algorithm is unfeasible on commodity CPU’s, and most physicist have used a computationally feasible, albeit undesired, algorithm for the n-body problem. CUDA changed the landscape, making supercomputing more “democratic.”

Other examples in the SDK include fluidsGL which simulates fluid dynamics based on the idea of stable fluids. It too achieves speed that is unattained on a CPU. On the downside, however, the program is in 2D, and relatively uninteresting for 3D visualization.

In this project we would like to bring the power of the CUDA chip to the CAVE/CUBE. There are a few reasons to do that:

Price — CUDA enabled chips are reasonably priced and are within the reach of a consumer. The lower end chips are under 100 dollars, while the higher end ones go for over 1,000 dollars.

Abundance — NVIDIA claims to have sold 40 million cards that are CUDA enabled.

Power — While price and abundance are important to guarantee that this is not just a hype, the performance of the CUDA chip makes the case for its importance. A CUDA enabled GPU can easily outperform a CPU for half the price.

The CUDA chip is not reserved just for visualization. Two recent papers 12 implemented the AES cryptography algorithm with considerable speedup. Other people used the power provided by CUDA for more sinister purpose 3. By porting the md5crack program to CUDA, the author was able to crack all md5 passwords less than 8 characters by brute force in under 16 minutes.

Take, for example, the University of Antwerp in Belgium which recently built a super computer with four NVIDIA GPUs for under 4000 dollars. The system, which is contained in one desktop workstation, can outperform a cluster of hundreds of machines.