Because GK110 is such a unique GPU from NVIDIA when it comes to compute, we’re going to shake things up a bit and take a look at compute performance first before jumping into our look at gaming performance.

On a personal note, one of the great things about working at AnandTech is all the people you get to work with. Anand himself is nothing short of fantastic, but what other review site also has a Brian Klug or a Jarred Walton? We have experts in a number of fields, and as a computer technology site that of course includes experts in computer science.

What I’m trying to say is that for the last week I’ve been having to fend off our CS guys, who upon hearing I had a GK110 card wanted one of their own. If you’ve ever wanted proof of just how big a deal GK110 is – and by extension Titan – you really don’t have to look too much farther than that.

Titan, its compute performance, and the possibilities it unlocks are a very big deal for researchers and other professionals who need every last drop of compute performance they can get, as cheaply as they can get it. This is why on the compute front Titan stands alone; in NVIDIA’s consumer product lineup there’s nothing like it, and even AMD’s Tahiti-based cards (7970, etc.), while potent, are very different from GK110/Kepler in a number of ways. Titan essentially writes its own ticket here.

In any case, as this is the first GK110 product that we have had access to, we couldn’t help but run it through a battery of tests. The Tesla K20 series may have been out for a couple of months now, but at $3500 for the base K20 card, Titan is the first GK110 card many compute junkies are going to have real access to.

To that end I'd like to introduce our newest writer, Rahul Garg, who will be leading our look at Titan/GK110’s compute performance. Rahul is a Ph.D student specializing in the field of parallel computing and GPGPU technology, making him a prime candidate for taking a critical but nuanced look at what GK110 can do. You will be seeing more of Rahul in the future, but first and foremost he has a 7.1B transistor GPU to analyze. So let’s dive right in.

By: Rahul Garg

For compute performance, we first looked at two common benchmarks: GEMM (which measures the performance of dense matrix multiplication) and FFT (Fast Fourier Transform). These numerical operations are important in a variety of scientific fields. GEMM is highly parallel and typically compute heavy, and it is one of the first tests of performance and efficiency on any parallel architecture geared towards HPC workloads. FFT is typically memory bandwidth bound but, depending upon the architecture, can be influenced by inter-core communication bandwidth. Vendors and third parties typically supply optimized libraries for these operations. For example, Intel supplies MKL for Intel processors (including Xeon Phi), and AMD supplies ACML and OpenCL-based libraries for their CPUs and GPUs respectively. Thus, these benchmarks measure the performance of the combination of the hardware and the software stack.

For GEMM, we tested the performance of NVIDIA's CUBLAS library supplied with CUDA SDK 5.0, on SGEMM (single-precision/fp32 GEMM) and DGEMM (double precision/fp64 GEMM) on square matrices of size 5k by 5k. For SGEMM on Titan, the data reported here was collected with boost disabled. We also conducted the experiments with boost enabled on Titan, but found that the performance was effectively equal to the non-boost case. We assume that it is because our test ran for a very short period of time and perhaps did not trigger boost. Therefore, for the sake of simpler analysis, we report the data with boost disabled on the Titan. If time permits, we may return to the boost issue in a future article for this benchmark.
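As a rough illustration of how a GEMM timing turns into a FLOPS figure, the sketch below uses the usual 2·m·n·k operation count for a dense matrix multiply. It assumes "5k by 5k" means n = 5000, and the timing value is a hypothetical placeholder, not a measurement from this article.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """Floating-point operations for C = A @ B with A (m x k) and B (k x n).

    Each of the m*n outputs needs k multiplies and k adds, so 2*m*n*k total.
    """
    return 2 * m * n * k


def gemm_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved GFLOPS for one GEMM call that took `seconds` of wall time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return gemm_flops(m, n, k) / seconds / 1e9


# Assumption: "5k by 5k" is taken to mean n = 5000.
N = 5000
# Hypothetical timing, used only to exercise the function; not a measured result.
EXAMPLE_SECONDS = 0.1

print(f"{gemm_flops(N, N, N):.3e} FLOPs per multiplication")
print(f"{gemm_gflops(N, N, N, EXAMPLE_SECONDS):.1f} GFLOPS at {EXAMPLE_SECONDS} s")
```

The same formula applies to SGEMM and DGEMM; only the element width and the hardware peak differ.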

Apart from the results we collected for GTX Titan, GTX 680 and GTX 580, we refer to experiments conducted by Matsumoto, Nakasato and Sedukin, reported in a technical report filed at the University of Aizu, about GEMM on Radeon 7970. Their exact parameters and testbed differ from ours, and we include their results for illustrative purposes, as a ballpark estimate only. The results are below.

Titan rules the roost amongst the three listed cards in both SGEMM and DGEMM by a wide margin. We have not included Intel's Xeon Phi in this test, but the Titan's achieved performance is higher than the theoretical peak FLOPS of the current crop of Xeon Phi. Sharp-eyed readers will have observed that the Titan achieves about 1.3 teraflops on DGEMM, while the listed fp64 theoretical peak is also 1.3 TFlops; we were not expecting 100% of peak on the Titan in DGEMM. NVIDIA clarified that the fp64 rating for the Titan is a conservative estimate. At 837MHz, the calculated fp64 peak of Titan is 1.5 TFlops. However, under heavy load in fp64 mode, the card may underclock below the listed 837MHz to remain within the power and thermal specifications. Thus, the fp64 ALU peak can vary between 1.3 TFlops and 1.5 TFlops, and our DGEMM results are within expectations.
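The 1.5 TFlops figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes Titan has 14 active SMX units with 64 fp64 units each, and that each unit retires one fused multiply-add (2 FLOPs) per clock. Those unit counts come from public GK110 specifications, not from the text above, so treat them as assumptions.

```python
def peak_tflops(units: int, clock_mhz: float, flops_per_unit_per_clock: int = 2) -> float:
    """Theoretical peak in TFLOPS; one FMA counts as 2 FLOPs per unit per clock."""
    return units * flops_per_unit_per_clock * clock_mhz * 1e6 / 1e12


def clock_mhz_for_tflops(units: int, tflops: float, flops_per_unit_per_clock: int = 2) -> float:
    """Clock (MHz) at which `units` execution units would deliver `tflops`."""
    return tflops * 1e12 / (units * flops_per_unit_per_clock) / 1e6


# Assumption (not stated in the article text): 14 SMX x 64 fp64 units per SMX.
FP64_UNITS = 14 * 64

print(f"fp64 peak at 837 MHz: {peak_tflops(FP64_UNITS, 837):.2f} TFLOPS")
print(f"clock implied by a 1.3 TFLOPS rating: {clock_mhz_for_tflops(FP64_UNITS, 1.3):.0f} MHz")
```

Under these assumptions, the conservative 1.3 TFlops rating corresponds to a sustained clock noticeably below 837MHz, which is consistent with the underclocking behavior NVIDIA described.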

Next, we consider the percentage of fp32 peak achieved by the respective SGEMM implementations. These are plotted below.

Titan achieves about 71% of its peak, while GTX 680 only achieves about 40% of its peak. It is clear that while both GTX 680 and Titan are said to be Kepler architecture chips, Titan is not just a bigger GTX 680. Architectural tweaks have been made that enable it to reach much higher efficiency than the GTX 680 on at least some compute workloads. The GCN-based Radeon 7970 obtains about 63% of peak on SGEMM using the algorithm of Matsumoto et al., and the Fermi-based GTX 580 also obtains about 63% of peak using CUBLAS.
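To put the percentages in absolute terms, the sketch below multiplies each card's fraction of peak by a theoretical fp32 peak. The core counts and base clocks (2688 cores at 837MHz for Titan, 1536 cores at 1006MHz for GTX 680) are assumptions taken from public spec sheets, and the resulting throughput numbers are estimates implied by the percentages, not measured results.

```python
def peak_fp32_tflops(cuda_cores: int, clock_mhz: float) -> float:
    """Theoretical fp32 peak, counting one FMA (2 FLOPs) per core per clock."""
    return cuda_cores * 2 * clock_mhz * 1e6 / 1e12


def implied_tflops(fraction_of_peak: float, peak: float) -> float:
    """Throughput implied by an efficiency figure and a theoretical peak."""
    return fraction_of_peak * peak


# Assumptions: (core count, base clock in MHz) from public specifications.
# The fraction of peak is the approximate value quoted in the text.
CARDS = {
    "GTX Titan": (2688, 837.0, 0.71),
    "GTX 680": (1536, 1006.0, 0.40),
}

estimates = {}
for name, (cores, clock, fraction) in CARDS.items():
    peak = peak_fp32_tflops(cores, clock)
    estimates[name] = implied_tflops(fraction, peak)
    print(f"{name}: peak {peak:.2f} TFLOPS, implied SGEMM {estimates[name]:.2f} TFLOPS")

ratio = estimates["GTX Titan"] / estimates["GTX 680"]
print(f"Implied Titan / GTX 680 SGEMM ratio: {ratio:.2f}x")
```

The point of the exercise is that the gap in delivered SGEMM throughput is larger than the gap in theoretical peak, because Titan combines a higher peak with higher efficiency.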

For FFT, we tested the performance of 1D complex-to-complex in-place transforms of size 2^25 using the CUFFT library. Results are given below.

Titan outperforms the GTX 680 in FFT by about 50% in single precision. We suspect this is primarily due to the increased memory bandwidth on Titan compared to the GTX 680, but we have not verified this hypothesis. GTX 580 has a slight lead over the GTX 680. Again, if time permits, we may return to the benchmark for a deeper analysis. In double precision, Titan achieves about 3.4x the performance of GTX 680, but this is not surprising given the poor fp64 execution resources on the GTX 680.
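For a sense of scale, the sketch below estimates the working-set size and nominal operation count of such a transform, assuming the transform length is 2^25 complex points. The 5·N·log2(N) operation count is the common benchmarking convention for complex FFTs and is an assumption here; the article does not say how its FFT figures were normalized.

```python
def fft_buffer_bytes(n_points: int, bytes_per_real: int) -> int:
    """Bytes for an in-place complex buffer: one real and one imaginary part per point."""
    return n_points * 2 * bytes_per_real


def nominal_fft_flops(n_points: int) -> int:
    """Conventional 5*N*log2(N) FLOP count for a power-of-two complex FFT."""
    if n_points < 1 or n_points & (n_points - 1):
        raise ValueError("n_points must be a positive power of two")
    log2_n = n_points.bit_length() - 1
    return 5 * n_points * log2_n


# Assumption: the transform length is 2^25 complex points.
N_POINTS = 1 << 25
MIB = 1024 ** 2

print(f"fp32 buffer: {fft_buffer_bytes(N_POINTS, 4) // MIB} MiB")
print(f"fp64 buffer: {fft_buffer_bytes(N_POINTS, 8) // MIB} MiB")
print(f"nominal FLOPs per transform: {nominal_fft_flops(N_POINTS):,}")
```

A buffer of this size is far larger than any on-chip cache, which fits the suspicion that memory bandwidth drives the results.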

We then looked at an in-house benchmark called SystemCompute, developed by our own Ian Cutress. The benchmark tests the performance on a variety of sample kernels that are representative of some scientific computing applications. Ian described the CPU version of these benchmarks in a previous article. Ian wrote the GPU version of the benchmarks in C++ AMP, which is a relatively new GPGPU API introduced by Microsoft in VS2012.

Microsoft's implementation of AMP compiles down to DirectCompute shaders. These are all single-precision benchmarks and should run on any DX11-capable GPU. The benchmarks include 2D and 3D finite difference solvers, 3D particle movement, an n-body benchmark and a simple matrix multiplication algorithm. Boost is enabled on both the Titan and GTX 680 for this benchmark. We give the score reported by the benchmark for both cards, and report the speedup of the Titan over the GTX 680. A speedup greater than 1 implies Titan is faster, while less than 1 implies a slowdown.

SystemCompute scores (higher is better)

| Benchmark | GTX 580 | GTX 680 | GTX Titan | Speedup of Titan over GTX 680 |
| --- | --- | --- | --- | --- |
| 2D FD | 9053 | 8445 | 12461 | 1.47 |
| 3D FD | 3133 | 3827 | 5263 | 1.37 |
| 3DPmo | 41722 | 26955 | 40397 | 1.49 |
| MatMul | 172 | 197 | 229 | 1.16 |
| nbody | 918 | 1517 | 2418 | 1.59 |

The benchmarks show between 16% and 59% improvement for Titan over the GTX 680, with the most improvement coming from the relatively FLOP-heavy n-body benchmark. Interestingly, GTX 580 wins over the Titan in 3DPmo and wins over the GTX 680 in both 3DPmo and 2D FD.
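The speedup column and the GTX 580 observations can be checked directly from the reported scores. The sketch below is only a verification aid; it assumes the speedups in the table were truncated (not rounded) to two decimals, which is what matches the listed values.

```python
# Scores as reported in the table: (GTX 580, GTX 680, GTX Titan).
SCORES = {
    "2D FD": (9053, 8445, 12461),
    "3D FD": (3133, 3827, 5263),
    "3DPmo": (41722, 26955, 40397),
    "MatMul": (172, 197, 229),
    "nbody": (918, 1517, 2418),
}


def truncated_speedup(new_score: float, old_score: float) -> float:
    """Ratio of scores, truncated to two decimals (assumed table convention)."""
    return int(new_score / old_score * 100) / 100


speedups = {name: truncated_speedup(titan, gtx680)
            for name, (_, gtx680, titan) in SCORES.items()}

# Benchmarks where the older GTX 580 scores higher than a newer card.
gtx580_beats_titan = [name for name, (gtx580, _, titan) in SCORES.items() if gtx580 > titan]
gtx580_beats_680 = [name for name, (gtx580, gtx680, _) in SCORES.items() if gtx580 > gtx680]

for name, value in speedups.items():
    print(f"{name}: Titan is {value:.2f}x the GTX 680")
print("GTX 580 ahead of Titan in:", gtx580_beats_titan)
print("GTX 580 ahead of GTX 680 in:", gtx580_beats_680)
```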

Overall, GTX Titan is an impressive accelerator from a compute perspective and posts large gains over its predecessors.

1. Does compute capability really takes that much more transistors to build? as in 2x trans. only yield ~140% improvement on gaming. I think this was a conscious decision by nVidia to focus on compute and the required profit margin to sustain R&D.

2. despite the die size shrink, I'm guessing it would be harder to have functional silicon as the process shrinks. i.e. finding 100mm^2 of functional silicon @ 40nm is easier than @28nm, from the standpoint that more transistors are packed to the same area. Which I think why they have 15SMXs designed. Thus it'd be more expensive for nVidia to build same area at 28 vs. 40 nm... at least until the process matures, but at 7B I doubt it will ever be attainable.

3. The AMD statement on no updates to 7970 essentially sealed the $1000 price for titan. I would bet if AMD announced 8970, Titan would be priced at $700 today, with 3GB memory.

Luxury GPU is no more silly than Extreme CPUs that cost $1000 each. And yet, Intel continues to sell those, and what's more the performance offered by Titan is a far better deal than the performance offered by a $1000 CPU vs. a $500 CPU. Then there's the Tesla argument: it's a $3500 card for the K20 and this is less than a third that price, with the only drawbacks being no ECC and no scalability beyond three cards. For the Quadro crowd, this might be a bargain at $1000 (though I suspect Titan won't get the enhanced Quadro drivers, so it's mostly a compute Tesla alternative).

The problem with this analogy, which I'm sure was floated around Nvidia's Marketing board room in formulating the plan for Titan, is that Intel offers viable alternative SKUs based on the same ASIC. Sure there are the few who will buy the Intel EE CPU (3970K) for $1K, but the overwhelming majority in that high-end market would rather opt for the $500 option (3930K) or $300 option (3820).

Extend this to the GPU market and you see Nvidia clearly withheld GK100/GK110 as the flagship part for over a year, and instead of offering a viable SKU for traditional high-end market segments based on this ASIC, they created a NEW ultra-premium market. That's the ONLY reason Titan looks better compared to GK104 than Intel's $1K and $500 options, because Nvidia's offerings are truly different classes while Intel's differences are minor binning and multiplier locked parts with a bigger black box.

You are assuming that the Intel EE parts are nothing more than a marketing ploy, which is wrong, while at the same time assuming that the Titan is orders of magnitude beyond the 680 which is also wrong.

You're seeing it from the point of view of someone who buys the cheapest Intel CPU, overclocks it to the point of melting, and then feels they have a solution "just as good if not better" than the Intel EE.

Because the Titan has unlocked stream procs that the 680 lacks, and there is no way to "overclock" your way around missing SPs, you feel that NVidia has committed some great sin.

The reality is that the EE procs give out of box performance that is superior to out of box performance of the lesser SKUs by a small, but appreciable, margin. In addition, they are unlocked, and come from a better bin, which means they will overclock *even better* than the lesser SKUs. Budget buyers never want to admit this, but it is reality in most cases. Yes you can get a "lucky part" from the lesser SKU that achieves a 100% overclock, but this is an anomaly. Most who criticize the EE SKUs have never even come close to owning one.

Similarly, the Titan offers a small, but appreciable, margin of performance over the 680. It allows you to wait longer before going SLI. The only difference is you don't get the "roll of the dice" shot at a 680 that *might* be able to appear to match a Titan since the SP's arent there.

The analogy is fine, it's just that biased perspective prevents some from seeing it.

Well you obviously have trouble comprehending analogies if you think 3.6B difference in transistors and ~40% difference in performance is analogous to 3MB L3 cache, an unlocked multiplier and 5% difference in performance.

But I guess that's the only way you could draw such an asinine parallel as this:

"Similarly, the Titan offers a small, but appreciable, margin of performance over the 680."

It's the only way your ridiculous analogy to Intel's EE could possibly hold true, when in reality, it couldn't be further from the truth. Titan holds a huge advantage over GTX 680, but that's expected, its a completely different class of GPU whereas the 3930K and 3960X are cut from the exact same wafer.

There was no manufacturing capacity you IDIOT LIAR. The 680 came out 6 months late, and amd BARELY had 79xx's on the shelves till a day before that.

Articles were everywhere pointing out nVidia did not have reserve die space as the crunch was extreme, and the ONLY factory was in the process of doing a multi-billion dollar build out to try to keep up with bare minimum demand.

Now we've got a giant GPU core with perhaps 100 attempted dies per wafer, with a not high yield, YET YOU'RE A LIAR NONETHELESS.

It has nothing to do with manufacturing capacity, it had everything to do with 7970's lackluster performance and high price tag.

GTX 680 was only late (by 3, not 6 months) because Nvidia was too busy re-formulating their high-end strategy after seeing 7970 outperform GTX 580 by only 15-20% but asking 10% higher price. Horrible price:performance metric for a new generation GPU on a new process node.

This gave Nvidia the opportunity to:

1) Position mid-range ASIC GK104 as flagship GTX 680 and still beat the 7970.
2) Push back and most importantly, re-spin GK100 and refine it to be GK110.
3) Screw their long-time customers and AMD/AMD fans in the process.
4) Profit.

So instead of launching and mass-producing their flagship ASIC first (GK100) as they've done in every single previous generation and product launch, they shifted their production allocation at TSMC to their mid-range ASIC, GK104 instead.

Once GK110 was ready, they've had no problem churning them out, even the mfg date of these TITAN prove this point as week 31 chips are somewhere in the July-August time frame. They were able to deliver some 19,000 K20X units to ORNL for the real TITAN in October 2012. Coupled with the fact they're using ASICs with the same number of functional units for GTX Titanic, it goes to show yields are pretty good.

But the real conclusion to be drawn for this is that other SKUs based on GK110 are coming. There's no way GK110 wafer yields are anywhere close to 100% for 15 SMX ASICs. I fully expect a reduced SMX unit, maybe 13 with 2304SP as originally rumored show it's face as the GTX 780 with a bunch of GK114 refreshes behind it to fill out the line-up.

The sooner people stop overpaying for TITAN, the sooner we'll see the GTX 700 series, imo, but with no new AMD GPUs on the horizon we may be waiting awhile.

you're a brainwashed lying sack of idiocy, so maybe i'll waste my time reading your idiotic lies, and maybe not, since your first line is the big fat frikkin LIE you HAVE TO BELIEVE that you made up in your frikkin head, in order to take your absolutely FALSE STANCE for the past frikkin nearly year now.