NVIDIA Volta is being prepped for launch in the next-generation supercomputers known as Summit and Sierra. Little is known about Volta GPU specifications, but an analysis done by The Next Platform of the Summit supercomputer's details reveals that it can be an insanely fast chip, capable of delivering multi-teraflop compute power in the HPC market.

When NVIDIA announced their Pascal GP100 GPU at GTC 2016, they called it the largest chip endeavor in the history of humanity. With an R&D budget of several billion dollars, Pascal GP100 was indeed the great chip of 2016, aimed at powering the HPC and datacenter markets with performance never before seen in the graphics industry. NVIDIA also utilized Pascal GP100 GPUs inside their own DGX SaturnV supercomputer, which is designed to help them build smarter cars and next-generation GPUs (GPUs designing GPUs).

Just a year after their successful Pascal launch in the HPC market, NVIDIA is planning to introduce their next grand chip for the HPC market, codenamed Volta. Details of the chip first emerged back at GTC 2015, where NVIDIA showcased what they predicted to be the estimated performance output of their upcoming chips. Do note that Pascal had not launched at that time. According to the slides presented that day, Volta would have twice of everything that Pascal has: double the memory capacity, double the compute, higher efficiency and faster bandwidth.

We aren't sure how much of that may end up being true, but what NVIDIA estimated for Pascal was close to the final product (if not entirely the same). The only thing Pascal currently lacks is the promised 32 GB capacity, but that's mostly an issue of HBM2 production, which has already ramped up, so we can expect a full GP100 configuration with 32 GB since that is entirely possible with the chip design. In short, the VRAM limitation is due to production, not the chip design.

The latest details for the Summit supercomputer have been confirmed, and they are incredible from an HPC perspective. Summit promises a 5-10x improvement in application performance over the Titan supercomputer, which featured the Kepler GK110 GPU architecture. Titan comprised 18,688 nodes rated at 1.4 TF each, while Summit features around 4,600 nodes with a rated compute output of over 40 TF per node.
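As a rough sanity check, multiplying the quoted node counts by the per-node ratings gets us close to the published peak figures. This is a back-of-the-envelope sketch using only the numbers quoted above; the gap between ~184 PF and the official 200 PF peak suggests the per-node rating is somewhat above the "over 40 TF" floor.

```python
# Peak-performance sanity check from the node counts and per-node ratings
# quoted in the article (not official spec math).
titan_nodes, titan_tf_per_node = 18_688, 1.4
summit_nodes, summit_tf_per_node = 4_608, 40.0  # "over 40 TF" per node

titan_peak_pf = titan_nodes * titan_tf_per_node / 1000    # ~26 PF (quoted as 27 PF)
summit_peak_pf = summit_nodes * summit_tf_per_node / 1000  # ~184 PF at the 40 TF floor

print(round(titan_peak_pf, 1), round(summit_peak_pf, 1))
```

Hitting the quoted 200 PF peak would require roughly 200,000 / 4,608 ≈ 43.4 TF per node.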

Specifications comparison of Titan and Summit Supercomputer. (Image Credits: The Next Platform)

There's 512 GB of DDR4 plus additional HBM2 memory on each node; Titan, in comparison, had just 38 GB of DDR3 plus 6 GB of GDDR5 (per GPU) on each node. There's also a total of 800 GB of non-volatile memory per node. In total, the memory on the Titan supercomputer was 710 TB, while Summit peaks at over 6 petabytes (all DDR4 + HBM2 + non-volatile combined).
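The 6+ petabyte figure checks out with simple per-node arithmetic. The sketch below assumes 16 GB of HBM2 per V100 (six GPUs per node), per the spec table; the HBM2 capacity per GPU is the only assumed value.

```python
# Back-of-the-envelope check of Summit's system-wide memory capacity.
nodes = 4_608
ddr4_gb = 512             # DDR4 per node
hbm2_gb = 6 * 16          # assumption: 16 GB HBM2 per V100, 6 GPUs per node
nv_gb = 800               # non-volatile (flash-based) memory per node

per_node_gb = ddr4_gb + hbm2_gb + nv_gb
total_pb = nodes * per_node_gb / 1_000_000
print(per_node_gb, round(total_pb, 2))  # ~1408 GB/node, ~6.5 PB system-wide
```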

The Power9 chips will have 48 lanes of PCI-Express 4.0 peripheral I/O per socket, for an aggregate of 192 GB/sec of duplex bandwidth, as well as 48 lanes of 25 Gb/sec "Bluelink" connectivity, with an aggregate bandwidth of 300 GB/sec for linking various kinds of accelerators. These Bluelink ports are used to run the NVLink 2.0 protocol that will be supported on the Volta GPUs from Nvidia, and which have about 56 percent more bandwidth than the PCI-Express ports. IBM could support a lot of the SXM2-style, on-motherboard Tesla cards in a system, given all of these Bluelink ports, but remember it needs to allow the Volta accelerators to link to each other over NVLink so they can share memory as well as using NVLink to share memory back with the two Power9 chips. — via The Next Platform

Each node will house 2 IBM Power9 CPUs and 6 NVIDIA Volta V100 GPUs, with NVIDIA's NVLink 2.0 interconnect fully integrated between them. The system would consume 13 MW of peak power, just 4 MW more than the Titan supercomputer (9 MW), for over 10x the performance.
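Put another way, the power and performance figures above imply roughly a 5x jump in energy efficiency. A quick sketch, using only the article's quoted peak figures:

```python
# Performance-per-megawatt implied by the quoted peak figures.
titan_pf, titan_mw = 27, 9
summit_pf, summit_mw = 200, 13

titan_eff = titan_pf / titan_mw      # ~3 PF per MW
summit_eff = summit_pf / summit_mw   # ~15.4 PF per MW

print(round(summit_eff / titan_eff, 1))  # ~5x better efficiency per watt
```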

NVIDIA Volta Tesla V100 – The Next-Generation Compute Powerhouse

NVIDIA previously stated through their roadmaps that Volta GV100 GPUs will deliver SGEMM (single-precision general matrix multiply) efficiency of 72 GFLOPs/Watt, compared to 42 GFLOPs/Watt on Pascal GP100. Using that ratio, a Volta GV100 based GPU with a TDP of 300W can theoretically deliver 9.5 TFLOPs of double-precision performance, almost twice that of the current generation GP100 GPU. NVIDIA's Tesla P100 cards also ship at 300W, but the nodes are expected to feature around 40 TFLOPs of compute performance, so it is possible that NVIDIA may use TDP-configured variants for the Summit supercomputer.

Six Volta V100 GPUs at the rated 300W TDP would go beyond the 40 TF node barrier, delivering around 57.2 TFLOPs, which isn't what the Summit spec sheet claims. A geared-down version running at a TDP of around 200W would manage 20-25% lower performance, delivering 7.6 TFLOPs per GPU at 38.2 GFLOPs/Watt, which aligns with the Summit node specs.

Six of these geared-down Volta Tesla V100 GPUs would deliver around 45 TFLOPs of compute, which sounds more plausible. There's a possibility that the final double-precision compute of Volta V100 may end up near 8-9 TFLOPs, which would be an impressive feat for the graphics manufacturer.
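The node-level math behind the two TDP scenarios above can be sketched as follows. Both per-GPU FP64 estimates come from the article's own extrapolation, not from official specs:

```python
# Node-level arithmetic for the 300W vs ~200W Volta V100 scenarios.
gpus_per_node = 6
full_tdp_dp_tf = 9.5    # estimated FP64 TFLOPs per V100 at 300W (article's estimate)
geared_dp_tf = 7.6      # ~20% lower at a ~200W TDP

print(gpus_per_node * full_tdp_dp_tf)  # 57.0 TF -- overshoots the quoted node spec
print(gpus_per_node * geared_dp_tf)    # ~45.6 TF -- close to the quoted node rating
```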

Summit Supercomputer Specifications:

| Supercomputer | Titan | Summit |
| --- | --- | --- |
| Number of Nodes | 18,688 | 4,608 |
| Processors (Per Node) | 1 Opteron + 1 Kepler K20X | 2 IBM Power9 + 6 NVIDIA Tesla V100 |
| GPUs | 18,688 NVIDIA Tesla K20X | 27,648 NVIDIA Tesla V100 |
| CPUs | 18,688 Opteron CPUs | 9,216 Power9 CPUs |
| Node Performance | 1.44 TF | 49 TF |
| Peak Performance | 27 PF | 200 PF |
| Peak OPs (Tensor) | N/A | 3.3 ExaOps |
| Memory Per Node | 38 GB DDR3 + 6 GB GDDR5 | 512 GB DDR4 + HBM2 (16/32 GB) + NVDIMM |
| NV Memory Per Node | 0 | 800 GB (flash-based) |
| Total System Memory | 710 TB | 10 PB |
| System Interconnect | Gemini (6.4 GB/s), PCIe (8 GB/s) | Dual Rail EDR-IB (23 GB/s) / Dual Rail HDR-IB (48 GB/s), NVLink (300 GB/s) |
| Interconnect Topology | 3D Torus | Non-Blocking Fat Tree |
| File System | 32 PB, 1 TB/s, Lustre | 250 PB, 2.5 TB/s, GPFS |
| Peak Power Input | 9 MW | 13 MW |

Furthermore, Volta GV100 may ship with or exceed the promised 32 GB HBM2 capacity of Pascal GPUs and have bandwidth tuned to around 1 TB/s. NVIDIA slides from GTC 2015 claim bandwidths of ~900 GB/s, while Pascal currently operates at 732 GB/s.

The Looming Memory Crisis With HBM2

Explaining next-generation GPU architectures and efficiency further, Stephen W. Keckler (Senior Director of GPU Architecture at NVIDIA) pointed out that HBM is a great memory architecture which will be implemented across Pascal and Volta chips, but those chips top out at 1.2 TB/s of bandwidth (on the Volta GPU). Moving forward, there exists a looming memory power crisis: HBM2 at 1.2 TB/s sure is great, but it adds 60W to the power envelope of a standard GPU.

The current implementation of HBM1 on Fiji chips adds around 25W to the chip. Moving onwards, chips with in excess of 2 TB/s of bandwidth will push the overall power limit from bad to breaking point. A chip with 2.5 TB/s of HBM (2nd generation) memory would reach a 120W TDP for the memory architecture alone, while an HBM2 architecture 1.5 times as efficient that outputs over 3 TB/s of bandwidth would still need 160W to feed the memory alone.
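The scaling problem comes from memory power growing linearly with bandwidth at a fixed energy cost per bit moved. Here is an illustrative model; the ~6.25 pJ/bit value is derived from the article's "1.2 TB/s adds 60W" figure and is an assumption, not an official spec:

```python
# Illustrative linear model: memory power = bandwidth x energy-per-bit.
# The 6.25 pJ/bit figure is back-solved from "HBM2 at 1.2 TB/s adds 60W".
def hbm_power_w(bandwidth_tbps, pj_per_bit=6.25):
    bits_per_s = bandwidth_tbps * 1e12 * 8      # TB/s -> bits/s
    return bits_per_s * pj_per_bit * 1e-12      # pJ/bit -> joules/bit

print(round(hbm_power_w(1.2)))  # ~60 W, matching the HBM2 figure above
print(round(hbm_power_w(2.5)))  # ~125 W, near the quoted ~120 W for 2.5 TB/s
```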

To be clear, this is not the power draw of the whole chip, just the memory subsystem. Typically, such chips would be considered inefficient for the consumer and HPC sectors, but NVIDIA is trying to change that and is exploring new means to solve the memory power crisis that lies ahead with HBM and higher bandwidths. In the near term, Pascal and Volta don't see a major consumption increase from HBM, but moving onward to 2020, when NVIDIA's next-gen architecture is expected to arrive, we will probably see a new memory architecture introduced to solve the increased power needs.

| GPU Family | AMD Vega | AMD Navi | NVIDIA Pascal | NVIDIA Volta |
| --- | --- | --- | --- | --- |
| Flagship GPU | Vega 10 | Navi 10 | NVIDIA GP100 | NVIDIA GV100 |
| GPU Process | 14nm FinFET | 7nm FinFET | TSMC 16nm FinFET | TSMC 12nm FinFET |
| GPU Transistors | 15-18 Billion | TBC | 15.3 Billion | 21.1 Billion |
| GPU Cores (Max) | 4096 SPs | TBC | 3840 CUDA Cores | 5376 CUDA Cores |
| Peak FP32 Compute | 13.0 TFLOPs | TBC | 12.0 TFLOPs | >15.0 TFLOPs (Full Die) |
| Peak FP16 Compute | 25.0 TFLOPs | TBC | 24.0 TFLOPs | 120 Tensor TFLOPs |
| VRAM | 16 GB HBM2 | TBC | 16 GB HBM2 | 16 GB HBM2 |
| Memory (Consumer Cards) | HBM2 | HBM3 | GDDR5X | GDDR6 |
| Memory (Dual-Chip Professional/HPC) | HBM2 | HBM3 | HBM2 | HBM2 |
| HBM2 Bandwidth | 484 GB/s (Frontier Edition) | >1 TB/s? | 732 GB/s (Peak) | 900 GB/s |
| Graphics Architecture | Next Compute Unit (Vega) | Next Compute Unit (Navi) | 5th Gen Pascal CUDA | 6th Gen Volta CUDA |
| Successor of (GPU) | Radeon RX 500 Series | Radeon RX 600 Series | GM200 (Maxwell) | GP100 (Pascal) |
| Launch | 2017 | 2019 | 2016 | 2017 |

With the final configuration of Volta V100 GPUs and IBM Power9 CPUs in place, the Summit supercomputer would rank as the top-performing machine in the world, with peak performance crossing the 200 Petaflops mark.