NVIDIA Unveils Teraflop GPU Computing

By Michael Feldman

June 16, 2008

NVIDIA has announced two new Tesla-branded GPU computing products at ISC’08, continuing the company’s push into the HPC market. The new products are based on NVIDIA’s next-generation 10-series GPU architecture. The T10P processor unveiled today offers double precision floating point support, more local memory, and much higher overall performance. NVIDIA is touting the new 10-series chip as the second-generation processor for CUDA, the company’s GPU computing development platform.

The T10P, which is built on 55nm process technology, doubles the capability of the previous generation Tesla offerings, which were based on the 8-series NVIDIA architecture. The new GPU doubles both the floating point precision (from 32-bit to 64-bit) and the raw compute performance (from 500 gigaflops to 1 teraflop). It’s important to note that the teraflop figure is single precision performance; double precision performance comes in at a much more modest 100 gigaflops.

The T10P also nearly doubles the number of cores, from 128 to 240. The new processor is an evolution of the 8- and 9-series GPUs, and like those older processors, allows NVIDIA to share the same componentry across the Quadro and GeForce product lines. Because of the common architecture, CUDA is able to maintain backward and cross compatibility for applications, and user software remains independent of the number of cores on the chip. The CUDA driver queues up the application threads, and the hardware does the fine-grained mapping of threads to processing cores at runtime. So the same CUDA app can run on a cluster, a workstation or a notebook, as long as they contain recent-vintage NVIDIA hardware.
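That core-count independence falls out of CUDA’s execution model: a kernel is launched as many independent thread blocks, and the hardware schedules those blocks onto however many cores the chip actually has. A minimal sketch of the idea (the kernel and its names are illustrative, not taken from NVIDIA’s announcement):

```cuda
#include <cuda_runtime.h>

// Scales a vector in place. The launch below creates thousands of
// independent thread blocks; the CUDA runtime and hardware map them onto
// whatever cores exist -- 128 on an 8-series part, 240 on a T10P --
// with no change to the source code or the launch parameters.
__global__ void scale(float *data, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= alpha;
}

int main(void)
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // 4096 blocks of 256 threads: far more blocks than any chip has
    // multiprocessors, so the scheduler keeps every core busy.
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```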

Each of the 240 cores in the T10P is implemented as a “thread processor” with an integer unit, a floating point unit, and a register file. Eight thread processors are arranged in a thread processor array, which shares a special functions unit (for transcendental and other functions), a double precision (DP) floating point unit, and 16KB of shared memory that works at cache speed. Except for the DP unit, the design is the same as NVIDIA’s 8-series GPU architecture.
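Unlike a conventional cache, that 16KB of shared memory is managed explicitly by the programmer. The usual pattern, sketched below with a hypothetical stencil kernel (not NVIDIA’s code), is to stage a tile of global memory into shared memory, synchronize the block, then reuse the tile at near-register speed:

```cuda
#include <cuda_runtime.h>

#define TILE 256

// Each block stages a tile of the input into fast shared memory,
// synchronizes, then computes on it. Reusing data from shared memory
// avoids repeated trips to the far slower off-chip global memory.
__global__ void blur3(const float *in, float *out, int n)
{
    __shared__ float tile[TILE + 2];        // tile plus one halo cell per side

    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + 1;                        // local index past the halo

    if (g < n) tile[l] = in[g];
    if (threadIdx.x == 0)                   // edge threads load the halo
        tile[0] = (g > 0) ? in[g - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;

    __syncthreads();                        // make all loads visible block-wide

    if (g < n)
        out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
}
```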

In addition to the performance and memory bumps, the T10P also benefits from a wider memory interface (512 bits), faster memory I/O (102 GB/sec), and an upgraded I/O interface (PCIe x16 Gen2). But it’s the DP capability that will make HPC users take notice, especially now that the latest IBM Cell processor (PowerXCell 8i) and the AMD FireStream GPU boast DP capability. The absence of double precision FP support has limited Tesla’s potential market, especially in certain financial and scientific realms where applications need 64-bit floating point math.

The disparity between single and double floating point performance on the T10P reflects a trade-off that NVIDIA made between cost and capability. It also reflects the fact that a lot of HPC users can use 32-bit floating point to eke out more performance, jumping into the slower double precision calculations only when necessary. Nonetheless, the T10P’s 100 DP gigaflops is in the same ballpark as IBM’s PowerXCell 8i, which achieves nearly 109 DP gigaflops, and the brand new ClearSpeed CSX700 processor at 96 gigaflops. However, the new AMD FireStream 9250 GPU breaks out of the pack at 200 DP gigaflops.
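That split encourages a mixed-precision style of programming: do the bulk of the arithmetic in fast 32-bit math and reserve 64-bit for the accuracy-critical step. As a hedged illustration (a hypothetical kernel, not NVIDIA’s code), a dot product can multiply in single precision while accumulating in double to limit rounding error:

```cuda
#include <cuda_runtime.h>

// Mixed precision: inputs and multiplies are 32-bit (fast on all 240
// cores), while each thread's running sum is 64-bit, routing only the
// accumulation through the single DP unit per thread processor array.
__global__ void dot_mixed(const float *x, const float *y,
                          double *partial, int n)
{
    double sum = 0.0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)       // grid-stride loop over the data
        sum += (double)(x[i] * y[i]);       // SP multiply, DP accumulate

    // One partial result per thread; a second reduction pass would
    // combine these into the final scalar.
    partial[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}
```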

The T10P will end up in two new Tesla products: the S1070, a 1U box to be hooked up to HPC servers; and the C1060, an accelerator card for high performance desktop systems. They are being priced aggressively: MSRP for the S1070 is $7,995, a couple of thousand dollars less than the first generation Tesla S870, while MSRP for the C1060 is $1,699, $400 less than the previous desktop offering.

The S1070 puts four 1.5 GHz T10P devices in a standard 1U chassis, yielding 4 teraflops of single precision performance plus 16 GB of on-board memory. If the host has a couple of free PCIe 2.0 slots, two S1070 boxes can be attached, producing an 8 teraflop computer node in a 3U space. The large on-board 16 GB of memory (4 GB per T10P) will help minimize the number of host memory transfers, which slow down application performance when data sets are large.
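Keeping working sets resident in that on-board memory is what cuts the transfers down. A hedged host-side sketch (illustrative code with a placeholder kernel, not from NVIDIA’s materials): copy the data across PCIe once, run many kernels against it on the card, and copy back only the final result.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Placeholder update kernel standing in for a real simulation step.
__global__ void step(float *field, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) field[i] = 0.5f * field[i] + 0.25f;
}

int main(void)
{
    const int n = 1 << 24;                  // ~64 MB, well under 4 GB per GPU
    size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);
    float *d;
    cudaMalloc(&d, bytes);

    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // one transfer in

    // Iterate entirely on the card: no PCIe traffic inside the loop,
    // which is where large data sets would otherwise bleed performance.
    for (int iter = 0; iter < 1000; ++iter)
        step<<<(n + 255) / 256, 256>>>(d, n);

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  // one transfer out

    cudaFree(d);
    free(h);
    return 0;
}
```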

A single S1070 draws 700 watts when heavily loaded, compared to about 550 watts for the previous generation S870 offering. But since NVIDIA has doubled the FLOPS, that represents much better performance per watt. At 700 watts, the company is pushing the upper end of the power envelope for a 1U box — most Xeon or Opteron servers are in the 400W-500W range. But NVIDIA believes most of the users it’s going after are more concerned with compute density and FLOPS/watt than with their electric bill.

The C1060 card is for technical workstations and packs a single T10P GPU. With a slightly slower clock (1.33 GHz) than the server offering, peak performance tops out at around 887 single precision gigaflops, with double precision proportionately less. The slower clock was necessary to keep the device inside 160 watts, a more reasonable thermal envelope for a desktop.

NVIDIA hopes to parlay the new products into an expanded footprint in the HPC market. Although the company isn’t sharing unit sales of the first generation Tesla boxes, Geoff Ballew, product manager for the Tesla Server group, did say they have around 250 HPC customers on CUDA platforms spread across the usual suspects of HPC verticals: oil & gas, finance, medical, digital content, and research.

“Oil and gas is an area where we’ve had tremendous success,” says Ballew, “one, because the price of a barrel of oil keeps going up, so they’re very motivated to use new tools to find more oil. But it’s also been one where their problem is nicely aligned with our [solution], and they’ve been scratching their heads on how to get the performance they want out of traditional clusters.”

Examples of some of the larger Tesla installations include Hess, NCSA, JFCOM, SAIC, University of Illinois, University of North Carolina, Max Planck Institute, Rice University, University of Maryland, GusGus, Eötvös University, University of Wuppertal, IPE/Chinese Academy of Sciences, and a number of unnamed cell phone manufacturers. Ballew assured me that he had a lot more customers he couldn’t talk about yet.

NVIDIA has an even broader base of users that could drive future Tesla sales. The company estimates they have 70 million CUDA-capable GPUs — Tesla, GeForce, and Quadro — deployed and more than 60 thousand CUDA downloads. If the company can move some percentage of these grassroots customers onto Tesla platforms, they’ll have a steady supply of new customers.

The Tesla products announced today won’t go into production until August, so we’ll see only demo systems at ISC this week. But NVIDIA is hinting that Tesla-equipped supercomputers could appear on the November TOP500 list, with perhaps even a system that breaks into the top 20.
