Confessions of the World's Largest Switcher

It's a shame that Apple no longer runs the "Switch" campaign on
television. Dr. Srinidhi Varadarajan would make an excellent spokesperson for moving to the Mac. Just as Ellen Feiss's switcher story was the hit of Macworld Expo, Dr. Varadarajan's presentation was the hit of the O'Reilly Mac OS X Conference, where he received a standing ovation.

His ad might go something like this. "I was in the market for a new
machine. I was hoping to get ten teraflops by the end of the year. I'd
never used a Mac and had been looking at Dells and IBMs. Then Apple
released the G5 on June 23. A week later I bought 1,100 duals online at
the Apple Store. I'm Srinidhi Varadarajan and I build supercomputers at
Virginia Tech."

Goals in Building Virginia Tech's G5 Supercomputer

The timing was right to make a big move at Virginia Tech.
Varadarajan explained that they had a new dean and a new program in
Computational Sciences and Engineering (CSE). They also had experience
building a previous, smaller cluster. The goals were pretty straightforward: To build a world-class program, you need to
provide world-class resources. This included creating high-end
computing facilities and high-performance networking capabilities to
tie the computational facilities into national computational grids.

In addition, there were political goals to get communication and
cooperation across department lines. Most universities have subcultures
and pockets that don't speak to each other. Varadarajan asked how you
get people to talk across these different fiefdoms and explained that
another of the goals of the project was to get everyone on the same
team by providing support for both experimental and production
research. The cooperation was evident in the speed with which the
project was accepted and supported. Conference co-chair Derrick Story
asked how long the project took from start to finish. The answer was
surprising. The project was started in March and April of this year. Within
a month it had everyone's backing. Money was raised in April and May and
the cluster was launched in ramp-up mode in September.

In addition to these goals, there were also architectural ideals.
Varadarajan wanted a high-performance supercomputer based on a 64-bit
processor and never considered 32-bit options. He also felt that a
cluster needs more than gigabit Ethernet: a high-performance
interconnect with high bandwidth and ultra-low latency. Finally, he
wanted to offer the cluster as a service, which meant he needed
connections to Internet1, Internet2, and soon, the NLR (National
LambdaRail), a proposed high-speed network built to support research
institutions.

In addition, the project had usage goals--to provide easy
access for new investigators and exploratory research. The access
policy was open door. Varadarajan explained that he didn't want to
shut people out just because they didn't have a grant; often, you need
results in order to get a grant. He also wanted to support multi-site
research activities. Finally, for a premium, he wanted to support
on-demand access to computational cycles: an external customer might,
for example, request a given amount of computing power within a given
time frame. This required being able to checkpoint and store the
current state of the system so that currently running applications
wouldn't be lost.

Dude, You Need a Mac

A prime constraint in designing the supercomputer was cost.
Academia has small budgets, so the focus was on price-performance.
Competing installations include DOE (Department of Energy)
installations, which can afford to pay top dollar. Virginia Tech
wanted the same performance for bottom dollar. The cost was more than
just the machines: the existing facilities would need to be upgraded
with cooling systems and power distribution, and the team would have
to account for the cost of the cables, memory, and back-up power.
Varadarajan's team built one of the cheapest world-class
supercomputers. He laughed that "The fact that it's running is a big
deal in itself."

He looked at various architecture options and was in the process of
buying Dells when the deal fell through. He also worked with IBM and
AMD and couldn't get the price to match. The budgets were coming in at
$9 to $12 million. An IBM system based on the PowerPC 970 was the
first choice, but the earliest delivery date would have been January
2004. Varadarajan said that you can't design a supercomputer and wait
that long for delivery. "You wouldn't buy a car and leave it at the
dealer for a year and a half. We wanted a short three-month build
cycle and could not wait six months."

On June 23 Apple announced the G5. Varadarajan said that contrary
to rumors, that was the first they had heard of it as well. On
June 26 they told Apple they were interested in placing a "fairly large
order". A day later he flew to California and met with Apple. One of
their first questions was how long he'd been a Mac owner. Varadarajan
said he never had one. Twenty-four hours later Apple committed.
Starting on September 5, the G5s arrived in Virginia. An audience
member asked if he'd made the purchase through the Apple Store.
Varadarajan smiled and said that actually, yes, he had.

Performance and Power

Varadarajan said that a lot of people get the math wrong when
calculating the performance of the machines. Each G5 processor has
two double-precision floating-point units, and each unit can complete
a fused multiply-add operation per cycle, which counts as 2 flops.
That gives 4 flops per cycle per processor, so at 2GHz each processor
delivers 8 GFlops, and each dual G5 can deliver a peak of 16 GFlops of
double-precision performance. That is more than a modern Cray node.
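To make the arithmetic concrete, here is the peak-performance
calculation as a short Python sketch. The 1,100-node count comes from
earlier in the article; the rest follows from the figures above:

    # Peak double-precision performance of the G5 cluster,
    # following the article's figures.
    CLOCK_GHZ = 2.0       # G5 clock speed
    FPUS_PER_CPU = 2      # double-precision floating-point units per chip
    FLOPS_PER_FMA = 2     # a fused multiply-add counts as two operations
    CPUS_PER_NODE = 2     # dual-processor G5s
    NODES = 1100          # machines ordered from the Apple Store

    gflops_per_cpu = CLOCK_GHZ * FPUS_PER_CPU * FLOPS_PER_FMA  # 8 GFlops
    gflops_per_node = gflops_per_cpu * CPUS_PER_NODE           # 16 GFlops
    peak_tflops = gflops_per_node * NODES / 1000               # 17.6 TFlops

    print(f"per processor: {gflops_per_cpu:g} GFlops")
    print(f"per node:      {gflops_per_node:g} GFlops")
    print(f"cluster peak:  {peak_tflops:g} TFlops")

Measured performance always comes in below this theoretical ceiling;
the 9.555 teraflops Varadarajan reports later in the talk works out to
roughly 54 percent of the 17.6-teraflop peak.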

The primary communications architecture is built on InfiniBand: each
node has a two-port card connecting it into the network with 20 Gbps
of full-duplex bandwidth. Each node holds a connection open to every
other node, and each card has the capacity for 150K connections. This
translates into very low latency--less than 10 microseconds.
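The scale of that full mesh is easy to underestimate. A quick
back-of-the-envelope calculation, using the 1,100-node count from
earlier in the article, shows how the per-node figures relate:

    # Full-mesh connection arithmetic, using the article's figures.
    NODES = 1100
    CAPACITY_PER_NODE = 150_000    # stated connection capacity per node

    per_node = NODES - 1                    # each node talks to every other
    total_pairs = NODES * (NODES - 1) // 2  # distinct node-to-node links

    print(f"connections per node:   {per_node}")        # 1,099
    print(f"node pairs in the mesh: {total_pairs:,}")   # 604,450
    print(f"per-node headroom:      {CAPACITY_PER_NODE // per_node}x")

So each card holds about 1,100 live connections, with two orders of
magnitude of capacity to spare.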

The computers and cables are just one piece of the infrastructure.
Varadarajan also needed a building large enough to house the cluster,
with a raised floor, environmental controls, fire suppression, and
round-the-clock controlled access. In addition, the facility draws
1.5 MW of power from two substations, with back-up UPS and, finally, a
back-up diesel generator.

If you've ever sat with a TiBook in your lap, you understand that
there is a further significant issue. As hot as a G4 runs, a G5 runs
hotter. With a traditional air-conditioning setup, the calculations
showed that instead of exchanging the room's air three times an hour,
as would be typical, they would need to exchange it three times per
minute. Each computer cools itself front to back, so the plan was to
arrange the computers in rows, back to back, and pull the hot air out
of the hot aisle. This would have required wind velocity under the
floor of more than 60 miles per hour and still would have resulted in
some hot spots. They decided instead to use a refrigerator-like
system: chillers cool water to between 40 and 50 degrees Fahrenheit,
the water in turn chills refrigerant, and the refrigerant is piped
through a matrix of copper pipes. Effectively, you have a distributed
refrigerator.
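To see why conventional air handling breaks down, here is a rough
sketch of the airflow arithmetic. The room volume is a hypothetical
figure chosen purely for illustration; the talk did not give the
room's actual dimensions:

    # Airflow comparison based on the talk's air-change figures.
    # ROOM_VOLUME_CUFT is a hypothetical value for illustration only.
    ROOM_VOLUME_CUFT = 30_000       # e.g., 3,000 sq ft with a 10 ft ceiling

    typical_changes_per_hour = 3        # conventional machine-room A/C
    required_changes_per_hour = 3 * 60  # three complete changes per minute

    typical_cfm = ROOM_VOLUME_CUFT * typical_changes_per_hour / 60
    required_cfm = ROOM_VOLUME_CUFT * required_changes_per_hour / 60

    print(f"typical:  {typical_cfm:,.0f} cubic feet per minute")
    print(f"required: {required_cfm:,.0f} cubic feet per minute")

Whatever the room's actual volume, the ratio is fixed: three changes
per minute means moving sixty times as much air as three changes per
hour.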

Tuning

The computers ran with few customizations. The volunteers started each
computer, connected the InfiniBand card, restarted the machine, and
cabled it up. The machines are currently running stock Mac OS X
10.2.7. An audience member asked if they used Software Update.
Varadarajan said no, but that there are plans to Pantherize the system
in the next few weeks. This will require an install and a recompile of
some of the code. Custom code included InfiniBand drivers and a
parallel communication library known as MVAPICH, developed in Dr.
Dhabaleswar Panda's lab at Ohio State University. The library had
to be ported from Linux to Mac OS X. The PCI-X timing was changed to
increase InfiniBand performance to 870 MBps.
Also, message caching and dynamic memory management were added for
improved scientific application performance.
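The talk didn't show what using such a library looks like, but the
standard way to verify interconnect latency under any MPI
implementation (MVAPICH included) is a ping-pong microbenchmark. Here
is a minimal sketch using the mpi4py bindings; it illustrates the
technique and is not the team's actual test code:

    # Minimal MPI ping-pong latency microbenchmark.
    # Run on two nodes with, e.g.: mpirun -np 2 python pingpong.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    ITERS = 10_000
    buf = np.zeros(1, dtype=np.uint8)  # 1-byte message exposes pure latency

    comm.Barrier()
    start = MPI.Wtime()
    for _ in range(ITERS):
        if rank == 0:
            comm.Send(buf, dest=1)     # ping
            comm.Recv(buf, source=1)   # pong
        else:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    elapsed = MPI.Wtime() - start

    if rank == 0:
        one_way_us = elapsed / ITERS / 2 * 1e6  # half the round trip
        print(f"one-way latency: {one_way_us:.2f} microseconds")

A test along these lines is how a sub-10-microsecond figure would
typically be measured.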

The LINPACK benchmark solves a very large system of linear
equations, involving dense matrix operations. The main phase is LU decomposition
(Gaussian elimination with partial row pivoting). The G5 cluster solved
a system of equations at N=500K. The team realized that the only way to
improve the benchmark score was to improve the numerical libraries,
which boils down to the BLAS libraries. The core routine--matrix
multiply (GEMM)--was optimized by Kazushige Goto. The current
impressive benchmark results are due to a mix of Goto's libraries and
Apple's vecLib framework.
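For readers who want to see what LINPACK measures, here is the same
idea in miniature: time a dense LU-based solve and convert the elapsed
time into flops using the standard operation count. This is a
scaled-down sketch (N here is tiny next to the cluster's N=500K), not
the team's benchmark harness:

    # LINPACK in miniature: time a dense solve and compute achieved
    # flops using the standard 2/3*N^3 + 2*N^2 operation count.
    import time
    import numpy as np

    N = 2000  # toy size; the cluster ran N = 500,000
    rng = np.random.default_rng(42)
    A = rng.standard_normal((N, N))
    b = rng.standard_normal(N)

    start = time.perf_counter()
    x = np.linalg.solve(A, b)  # LU with partial pivoting under the hood
    elapsed = time.perf_counter() - start

    flops = 2 / 3 * N**3 + 2 * N**2
    print(f"residual: {np.linalg.norm(A @ x - b):.2e}")
    print(f"achieved: {flops / elapsed / 1e9:.2f} GFlops")

How fast this runs on any one machine depends almost entirely on the
BLAS library NumPy links against, which is exactly why the team's
tuning effort centered on GEMM.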

Varadarajan reported that "our latest numbers are 9.555 tera and we
still have more tricks left. We are hoping for another 10 percent boost to
become the first academic machine to cross 10 tera. The last ratings
put us at number three worldwide." During the question-and-answer period at
the end, an audience member from the Lawrence Livermore National Laboratory
introduced himself as coming from the institution whose supercomputer
the Virginia Tech cluster had just passed. He asked whether the
details of the supercomputer would be published. The reply was that in
addition to documentation and papers, the plan is to contribute the
changes to MVAPICH back to the open source project so that they will
be freely available. There are also plans to open source the caching
code, and Varadarajan expects that Mellanox's code will be
available.

Varadarajan said that they are getting requests for clones. "Expect
to see a lot more G5 clusters."

Daniel H. Steinberg
is the editor for the new series of Mac Developer titles for the Pragmatic Programmers. He writes feature articles for Apple's ADC web site and is a regular contributor to Mac DevCenter. He has presented at Apple's Worldwide Developer Conference, Macworld, MacHack, and other Mac developer conferences.