The combined computing power of all the cudaki now exceeds 10 TFLOPS.
Each box holds up to 4 GTX 480 or 580 GPUs, with 512 CUDA cores each.

Tuesday, July 10, 2012

Welcome to the GPU supercomputing blog

Here you will find descriptions of several experimental GPU installations with a total computing power of more than 10 TFLOPS. They are part of Pawel Artymowicz's computer lab at UofT.

Please also see the description of a summer project in 2010, in which a team of UofT students constructed and programmed GPU mini-supercomputers using Nvidia cards and the CUDA language.

At present, 5 cudak machines exist, and all but those in brackets are currently online and operational. Their motherboards are ASUS P6T6/P6T7 Workstation/Supercomputer boards, and the CPUs are Intel i7 920 to 960.
They are housed in Thermaltake high-tower cases and powered by Thermaltake and Silverstone 1.2-1.5 kW power supplies.
They run the following cards and achieve a rough FLOP count as follows, assuming a realistic, tested 1 TF/card (rather than Nvidia's higher advertised numbers):

There is one 24-port InfiniBand 20 Gbps interconnect, currently not in use,
as we are still learning the capabilities of communication via the PCIe bus.
We use CUDA version 4.2 under Fedora and other Linux operating systems.

It looks like we have ~12 TFLOPS available for computation.
Of course, Jeffrey Fung and I are not computing 24/7; most of the time is spent on code development. Jeffrey has mastered the art of CUDA C and recently ported a 3-D high-resolution hydrocode of the PPMLR family to our machines.
Previously all our CFD codes were 2-D. It runs about as fast as 40 CPU cores on cudak5/seti, at the St. George campus of UofT. We also dabble in CUDA Fortran (PGI).

We're waiting anxiously for the new Tesla-like Kepler GPUs meant for computation, which should have improved bandwidth (nominally 320 GB/s), since we are usually bandwidth-limited. Unfortunately, the consumer Kepler (GK104) GPUs are not much different from what we already have. We're waiting for the GK110 (to be shown in Q4 2012). It will have 2500+ CUDA cores, but only a 384-bit memory interface and too little bandwidth to satisfy us.
Oh well... the small N-body programs, for instance statistical simulations of many planetary systems at once, will run at high speed. CFD will see only a very moderate improvement.