The TOP500 list for June 2010
has just been published. A second supercomputer built on GPUs, Nebulae,
appeared in the top 10, the first one being
Tianhe-1.
Designed by the supercomputer manufacturer Dawning Information Industry
and installed at the
National Supercomputing Center in
Shenzhen (NSCS),
Nebulae features Intel Xeon X5650 2.66 GHz 6-core "Westmere"
processors and Nvidia Tesla C2050 448-ALU 1150 MHz "Fermi"
graphics processing units. TOP500 describes it as having a total of
"120640 cores", a vague figure that I will explain later.
These are the only pieces of information available right now.
There are stories and anecdotes that sometimes leak to the public and
indicate that multiple large companies around the world are experimenting with
GPU-based supercomputers that are not in the TOP500 list, but the fact that the
2 fastest public ones, Nebulae and Tianhe-1, are designed and operated by
Chinese companies and universities shows that China is becoming a
leader in the domain of GPU supercomputing.

Nebulae

The information page for
Nebulae gives no information about the interesting part: how many C2050
GPUs Nebulae might have.
The only technical detail available is the X5650 performance, listed
as 10.64 GFLOPS (double precision), which by the way
is misleading and slightly inaccurate because it is the approximate
performance of only 1 core out of the 6: 4 double precision floating point
operations per clock (with a combined multiply and add 128-bit SSE instruction)
* 2.666 GHz = 10.664 GFLOPS.
As to the C2050, its double precision floating point performance is:
2 (one multiply and add instruction) /
2 (a double precision instruction can only be
executed every other cycle) * 448 ALUs (or shaders) * 1150 MHz = 515.2 GFLOPS.
The X5650 is a processor designed for dual-socket systems, so it is logical
to assume that the supercomputer is built on dual-socket nodes and has a
certain number of C2050 GPUs per node. Based on all this information, it is
almost certain that Nebulae is built on 4640 nodes, where each node has
two X5650 processors and one C2050 GPU, for a total of 9280 processors
and 4640 GPUs:

4640 nodes * (12 processor cores * 10.664 GFLOPS + 515.2 GPU GFLOPS) = 2984.30 TFLOPS

This 2984.30 TFLOPS number matches exactly the Rpeak value published in
the TOP500 list —note that the more correct number is 2984.45 TFLOPS
because the unrounded X5650 performance is 10.6666... GFLOPS per core,
but who cares about a discrepancy of 0.15 TFLOPS? This also agrees
with the rumored "4700 nodes" according to this
article from The Register.
[2010-05-31 update: the EETimes confirms the figure of
4640 GPUs, which also supports the rest of my numbers.]
Finally, this also explains the
"120640 cores" figure which combines processors and GPUs.
The NSCS defined one SIMD unit (in Nvidia's terminology: streaming
multiprocessor) as one core. The C2050 GPU has 14 SIMD units, and the
two X5650 processors provide 12 cores:

4640 nodes * (12 processor cores + 14 SIMD units) = 120640 cores
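
Both published figures can be re-derived in a few lines of Python (variable names are mine):

```python
nodes = 4640
cpu_cores_per_node = 2 * 6             # two 6-core X5650s
simd_units_per_node = 14               # one C2050
core_gflops = 4 * 2.666                # ~10.664 DP GFLOPS per Westmere core
c2050_gflops = 448 * 1.150             # 515.2 DP GFLOPS per GPU

rpeak_tflops = nodes * (cpu_cores_per_node * core_gflops + c2050_gflops) / 1000
cores = nodes * (cpu_cores_per_node + simd_units_per_node)

print(cores, round(rpeak_tflops, 2))   # 120640 cores, ~2984.30 TFLOPS
```
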

So there you go, all the numbers published by TOP500 are now explained,
despite a lack of public detailed specs.
From a micro-architectural point of view, it makes sense to count a SIMD
unit as a core, because each of them contains 32 ALUs (shaders) executing
the same instructions, just from different thread contexts.
However, from a computing point of view, a SIMD unit provides more
theoretical computing power than a traditional processor core: 36.80 GFLOPS
(C2050) compared to 10.664 GFLOPS (X5650).

Given that Nvidia was 2 or 3 quarters late in delivering Fermi, there is no
doubt that the delay had a direct impact on this supercomputer. The NSCS
probably had to wait months before
taking delivery of their 4.6 thousand C2050 GPUs. I would wager that
the fraction of Nvidia's initial C2050 production allocated to this supercomputer
alone was pretty large, given that there were rumors that up until last month,
Nvidia was only able to manufacture "thousands" of Fermi GPUs due to low yields.
They are also in
dire need of communicating good news about Fermi —which has been bashed
by the press lately— and this supercomputer is an opportunity for them to
do so. Expect a big press release soon about how Nebulae represents a success
of Fermi in the GPGPU world.

Tianhe-1

By contrast, the previously fastest GPU-based supercomputer
operated by the National SuperComputer Center in Tianjin (NSCC-TJ),
Tianhe-1, is built on 3072 nodes, where each node has
either two Xeon E5450 or two E5540 processors, and 2560 of the nodes are
equipped with an AMD HD 4870 X2 dual-GPU card, for a total of 6144 processors
and 5120 GPUs (2560 dual-GPU cards):

2560 compute nodes provide an average 10.507 GFLOPS
per processor core (2048 of them are based on the E5540
—10.133 GFLOPS— and 512 of them are based on the E5450
—12 GFLOPS—), plus 368 GFLOPS with a downclocked
HD 4870 X2:
2560 compute nodes * (10.507 GFLOPS per core * 4 cores * 2 sockets +
368 GFLOPS per GPU) = 1157.26 TFLOPS
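
The same breakdown in Python form (per-core figures are the ones quoted above; expect hair-thin differences from rounding):

```python
e5540_core_gflops = 10.133             # 2048 of the compute nodes
e5450_core_gflops = 12.0               # 512 of the compute nodes
avg_core = (2048 * e5540_core_gflops + 512 * e5450_core_gflops) / 2560
hd4870x2_gflops = 368.0                # downclocked to 575 MHz

rpeak_tflops = 2560 * (8 * avg_core + hd4870x2_gflops) / 1000
print(round(avg_core, 3), round(rpeak_tflops, 2))
```
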

At the nominal clock of 750 MHz, an HD 4870 X2 provides 2400 single precision
GFLOPS, because a single VLIW unit (in AMD's terminology: thread processor,
contains 5 ALUs) can execute 5 single precision multiply-add instructions
(2 flops each) per clock.
But such a VLIW unit can only execute 1 double precision multiply-add per clock,
so an HD 4870 X2 provides only 480 double precision GFLOPS, and downclocking
it from 750 to 575 MHz brings this down to 368 GFLOPS.
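
These HD 4870 X2 numbers can be re-derived as well; I am assuming 160 VLIW units per RV770 GPU (800 ALUs / 5), i.e. 320 on the dual-GPU card, with a multiply-add counting as 2 flops:

```python
vliw_units = 2 * 160                   # two RV770 GPUs on an HD 4870 X2

def x2_gflops(ops_per_vliw_per_clock, clock_ghz):
    # each operation is a multiply-add = 2 flops
    return vliw_units * ops_per_vliw_per_clock * 2 * clock_ghz

sp_750 = x2_gflops(5, 0.750)   # 5 SP ops per VLIW unit per clock -> 2400
dp_750 = x2_gflops(1, 0.750)   # 1 DP op per VLIW unit per clock  -> 480
dp_575 = x2_gflops(1, 0.575)   # downclocked to 575 MHz           -> 368
print(sp_750, dp_750, dp_575)
```
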
When I first tried to break down Tianhe-1's
FLOPS numbers, I could not come up with anything that made sense, unless the
GPUs were downclocked to a nice round 575 MHz number... which I then confirmed
after finding this TOP500
article.

As for the number of cores, for some reason NSCC-TJ is not consistent with
how Rpeak is calculated: Rpeak includes the 512 operation nodes, but the
published core count does not. Also, similarly to Nebulae, one SIMD unit (in AMD's
terminology: SIMD core or engine) is counted as one core. The HD 4870 X2 has 20
of them, and the two Xeon processors provide 8 cores, therefore:

2560 compute nodes * (8 processor cores + 20 SIMD units) = 71680 cores

The NSCC-TJ should be consistent IMHO and should include the operation nodes'
4096 cores (512 nodes * 8 cores) in the total, which would be 75776 cores.
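
In code form, both core counts:

```python
compute_nodes, operation_nodes = 2560, 512
cores_per_compute_node = 8 + 20        # 8 CPU cores + 20 SIMD engines

top500_count = compute_nodes * cores_per_compute_node
consistent_count = top500_count + operation_nodes * 8

print(top500_count, consistent_count)  # 71680, 75776
```
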

Single Precision and Integer Workloads

The TOP500 list focuses on double precision floating point
LINPACK benchmarks only. But how would these 2 supercomputers fare on
single precision and integer workloads? AMD's architecture is very strong
on these workloads, whereas Nvidia's is better at double precision.
Despite Nebulae being roughly twice as fast as
Tianhe-1 in theoretical and practical double precision performance,
despite running a generation of Nvidia GPU ahead of AMD (Tianhe-1 is
running R700 GPUs instead of the more recent R800 GPUs), despite having
twice as many GPU cards (4640 vs. 2560), Nebulae would still be unable
to surpass Tianhe-1 on these workloads. In fact the theoretical single
precision computing performance provided by the GPUs of these 2
supercomputers is, surprisingly, about the same, around 4750 TFLOPS:

Nebulae: 4640 GPUs * 1030.4 GFLOPS = 4781.06 TFLOPS
Tianhe-1: 2560 GPU cards * 1840 GFLOPS = 4710.4 TFLOPS

(HD 4870 X2's theoretical 2400 GFLOPS scaled down to 1840 GFLOPS to account
for the downclocking from 750 MHz to 575 MHz).

Theoretical Nebulae with HD 5970 GPUs

Given that AMD GPUs are more powerful per Watt and per unit of price, looking at
how powerful Nebulae could have been with them is interesting...
Had Nebulae been built on AMD HD 5970 GPUs (4640 single precision GFLOPS,
928 double precision GFLOPS each), with each single-GPU C2050 card replaced
with a dual-GPU HD 5970, it would have been 1.6x faster in double precision
and 3.8x faster in single precision, providing respectively a bewildering
4900 double precision TFLOPS and 22717 single precision TFLOPS of
theoretical performance (including CPUs). It would be technically feasible
to use these AMD GPUs, as the power envelope of the HD 5970 is only slightly
higher than that of the C2050 (294W vs. 247W). Note that Nebulae's "4640 nodes"
figure coincides strangely with the 4640 single precision GFLOPS performance
number of an HD 5970.
Coincidence? My guess as to why NSCS chose Nvidia instead of AMD: perhaps
because of a large CUDA code base or set of CUDA applications that they already
had and wanted to run on the supercomputer, or
because they wanted ECC GDDR RAM (AMD GPUs do not support ECC), or
perhaps Nvidia, in need of proving that Fermi is a solid GPGPU choice despite
its technical flaws, decided to practically give them those C2050 cards for
free to effectively buy the number 2 spot on the TOP500 list... Sleazy but
possible.
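
For the curious, here is how the 1.6x/3.8x factors fall out in Python; I am assuming 8 single precision flops per cycle per Westmere core (128-bit SSE multiply + add on 4 floats):

```python
nodes = 4640
cpu_dp = 12 * 4 * 2.666        # per node, GFLOPS (12 cores, 4 DP flops/cycle)
cpu_sp = 12 * 8 * 2.666        # 8 SP flops/cycle per core (assumption)

actual_dp = nodes * (cpu_dp + 515.2) / 1000    # one C2050 per node
actual_sp = nodes * (cpu_sp + 1030.4) / 1000
hypo_dp   = nodes * (cpu_dp + 928.0) / 1000    # one HD 5970 per node instead
hypo_sp   = nodes * (cpu_sp + 4640.0) / 1000

print(round(hypo_dp), round(hypo_sp))          # ~4900 and ~22717 TFLOPS
print(round(hypo_dp / actual_dp, 1), round(hypo_sp / actual_sp, 1))  # 1.6, 3.8
```
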

My guess why they used NVIDIA instead of Ati would be a combination of more onboard memory on the Fermi cards (PCIe transfers are a really bad bottleneck for many GPU compute tasks) and the software ecosystem. ATI has a long way to go to provide a development platform as mature as what CUDA constitutes for NVIDIA GPUs. And this is obviously the reason why more software exists (and is in development) for NVIDIA GPUs than for ATI cards. On top of that, Fermi offers some GPGPU-specific features which might increase its usability compared to previous GPU generations.

Ceearem, - 01-06-’10 12:34

Good point. Compute tasks on the C2050 can access 3GB, but only 2GB on the HD 5970 Eyefinity's 2×2GB (split across 2 GPUs). That said, if the tradeoff is 1 extra GB (Nvidia) versus 1.6x/3.8x higher double/single precision performance (AMD), is Nvidia decidedly the right choice for the very diverse set of GPGPU workloads that this research supercomputer is likely to execute?

mrb, - 01-06-’10 13:33

True, but with dual GPUs there are other tradeoffs. I.e. you should really count the 5970 as two GPUs with 1GB each. That's how a GPGPU program sees it. So basically each instance of the program only has access to 1GB of memory, and the GPU has to share the PCIe bottleneck with a second one. I make this distinction because you often have to split a workload according to available memory. And communication costs can become a dominant factor. One example: 3D FFT. Splitting a 3D FFT over multiple GPUs often does not decrease walltime, since you have to communicate the full set of data after doing each dimension. So in that case, if my dataset fits into the 3GB of the Tesla card but would need to be spread over both halves of the 5970, the Tesla would probably outperform the Ati card by a wide margin.

Note that when you say only 1GB is usable on the HD 5970, you are referring to the original HD 5970 edition. I was specifically referring to the HD 5970 Eyefinity which has 2×2GB total, so 2GB per GPU.

mrb, - 01-06-’10 16:21

I’m not sure it’s possible to use the Tianhe system for any real computing, because it doesn’t have ECC memory, which is pretty much required in big HPC. Otherwise, you don’t know if you have a soft memory error during a large computation, and the entire simulation is potentially invalid.

AMD doesn’t have ECC on any of their GPUs. The Tesla C2050 does.

RecessionCone, - 03-06-’10 18:40

@mrb:
Those on the other hand do have a “slightly” higher TDP.

@RecessionCone
It’s not desirable, but can’t you do ECC in software? It costs performance, true, but at least you’d ensure your calculations aren’t wasted.

Carsten, - 04-06-’10 00:20

@RecessionCone
Neither can you be certain when you have errors on Fermi, because Nvidia implemented SECDED ECC, which is unable to detect many multi-bit errors. ECC is certainly desirable, but at a certain scale, with or without ECC, errors are inevitable and applications have to deal with them.

mrb, - 04-06-’10 02:38

The problem is that people who ported their hydro-codes to ATI cards in OpenCL and to Nvidia cards in CUDA see 3x faster execution on Nvidia, belying the theoretical calculations of equal or higher FLOPS from ATI.
Memory organization inside the GPU seems to be better,
I don’t know about bandwidth to RAM... for me it’s not even worth checking since no matter what causes it, ATI is a bad idea in 2010. AMD also produces buggy drivers.

One more thing of course is programming – much easier in CUDA, and all the great parallel libraries for free. I wish AMD offered anything remotely as useful for scientific applications. I think AMD is easily 2-3 years behind Nvidia in that department and the distance grows. Just my opinion.
10 years from now it might be the opposite. Until then…