Today data scientists are applying new machine learning techniques to solve complex challenges. In applied data science, speed is critical, and the Tesla V100 accelerator is built for speed. Data scientists constantly trade off accuracy against time, so more powerful compute systems are required to crunch and analyze diverse data sets and to train exponentially more complex deep learning models in a practical amount of time.

Based on the new Volta GV100 GPU, the Tesla V100 accelerator is, NVIDIA claims, the workhorse of a new advanced HPC platform engineered for the convergence of HPC and AI: applied data science with machine learning, simulation, and traditional computational science research.

Volta has 640 Tensor Cores to break the 100 teraflops (TFLOPS) barrier of deep learning performance (over a 5X increase compared to the prior-generation Pascal architecture) and connects multiple V100 GPUs at up to 300 GB/s. This allows a significant reduction in training time: models that once took weeks now take days or hours. Volta also sports the next-generation NVLink interconnect, providing 2X the throughput of the previous generation of NVLink and enabling improved model- and data-parallel approaches for strong scaling and application performance.

Volta is designed to be easier to program than prior GPUs. It supports independent thread scheduling for finer-grain synchronization and cooperation between parallel threads in a program, reducing the time and effort required to get GPU programs running and allowing data scientists to spend their time and brainpower on higher-value work.
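
As a concrete illustration of the new model, here is a minimal CUDA sketch using the synchronized warp primitives introduced alongside it in CUDA 9; because Volta no longer guarantees lockstep execution within a warp, each shuffle names its participating lanes with an explicit mask:

    // Warp-level sum reduction using CUDA 9's synchronized shuffle
    // primitives. Volta's independent thread scheduling means a warp
    // may not execute in lockstep, so each exchange names its
    // participating lanes explicitly (0xffffffff = all 32 lanes).
    __inline__ __device__ float warpReduceSum(float val) {
        for (int offset = 16; offset > 0; offset /= 2)
            val += __shfl_down_sync(0xffffffff, val, offset);
        return val;  // lane 0 ends up holding the warp's total
    }

    __global__ void sumKernel(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float v = (i < n) ? in[i] : 0.0f;
        v = warpReduceSum(v);
        if ((threadIdx.x & 31) == 0) atomicAdd(out, v);  // one add per warp
    }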

New Streaming Multiprocessor (SM) Architecture Optimized for Deep Learning

Volta features a major redesign of the SM processor architecture at the center of the GPU. The new Volta SM is 50% more energy efficient than the previous-generation Pascal design, enabling major boosts in FP32 and FP64 performance in the same power envelope. New Tensor Cores designed specifically for deep learning deliver up to 12X higher peak TFLOPS for training. With independent, parallel integer and floating-point datapaths, the Volta SM is also much more efficient on workloads that mix computation and addressing calculations. Volta's new independent thread scheduling capability enables finer-grain synchronization and cooperation between parallel threads. Finally, a new combined L1 data cache and shared memory subsystem significantly improves performance while also simplifying programming.
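
The Tensor Cores are exposed through the WMMA (warp matrix multiply-accumulate) API in CUDA 9. A minimal sketch, in which one warp computes a single 16x16x16 tile with FP16 inputs and FP32 accumulation:

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // One warp computes a single 16x16x16 matrix multiply-accumulate
    // on the Tensor Cores (FP16 inputs, FP32 accumulation).
    // Launch as wmma_tile<<<1, 32>>>(...) and compile with -arch=sm_70.
    __global__ void wmma_tile(const half *a, const half *b, float *c) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

        wmma::fill_fragment(c_frag, 0.0f);       // C = 0
        wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension 16
        wmma::load_matrix_sync(b_frag, b, 16);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C += A * B
        wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
    }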

Second-Generation NVLink™

The second generation of NVIDIA's NVLink high-speed interconnect delivers higher bandwidth, more links, and improved scalability for multi-GPU and multi-GPU/CPU system configurations. GV100 supports up to six NVLink links at 25 GB/s per direction each, for a total of 300 GB/s. NVLink now supports CPU mastering and cache coherence with IBM POWER9 CPU-based servers. The new NVIDIA DGX-1 with V100 AI supercomputer uses NVLink to deliver greater scalability for ultra-fast deep learning training.
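
On the software side, NVLink traffic flows through CUDA's ordinary peer-to-peer machinery; when two GPUs are directly linked, peer copies travel over NVLink rather than PCIe. A minimal sketch, assuming devices 0 and 1 are peer-capable:

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, 0, 1);
        if (!canAccess) { printf("GPUs 0 and 1 are not peer-capable\n"); return 1; }

        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);   // flags argument must be 0

        size_t bytes = 1 << 20;
        float *buf0, *buf1;
        cudaMalloc(&buf0, bytes);
        cudaSetDevice(1);
        cudaMalloc(&buf1, bytes);

        // Direct device-to-device copy, no staging through host memory.
        cudaMemcpyPeer(buf0, 0, buf1, 1, bytes);
        return 0;
    }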

Volta Multi-Process Service

Volta Multi-Process Service (MPS) is a new feature of the Volta GV100 architecture that provides hardware acceleration of critical components of the CUDA MPS server, enabling improved performance, isolation, and quality of service (QoS) for multiple compute applications sharing the GPU. Volta MPS also triples the maximum number of MPS clients, from 16 on Pascal to 48 on Volta.

Enhanced Unified Memory and Address Translation Services

Unified Memory in Volta GV100 adds new access counters that allow more accurate migration of memory pages to the processor that accesses them most frequently, improving efficiency for memory ranges shared between processors. On IBM Power platforms, new Address Translation Services (ATS) support allows the GPU to access the CPU's page tables directly.
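
To the programmer this remains ordinary CUDA Unified Memory; the access counters and ATS do their work underneath. A minimal sketch in which a single cudaMallocManaged pointer is touched by both CPU and GPU:

    #include <cuda_runtime.h>

    __global__ void scale(float *x, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= s;
    }

    int main() {
        const int n = 1 << 20;
        float *x;
        // A single allocation visible to both CPU and GPU; the driver
        // migrates pages on demand (guided on Volta by access counters).
        cudaMallocManaged(&x, n * sizeof(float));
        for (int i = 0; i < n; i++) x[i] = 1.0f;      // pages touched on the CPU
        scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n);  // pages migrate to the GPU
        cudaDeviceSynchronize();                      // results readable on the CPU
        cudaFree(x);
        return 0;
    }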

Cooperative Groups and New Cooperative Launch APIs

Cooperative Groups is a new programming model introduced in CUDA 9 for organizing groups of communicating threads. It lets developers express the granularity at which threads communicate, helping them write richer, more efficient parallel decompositions. Basic Cooperative Groups functionality is supported on all NVIDIA GPUs since Kepler. Pascal and Volta include support for new Cooperative Launch APIs that enable synchronization amongst CUDA thread blocks, and Volta adds support for new synchronization patterns.
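
A minimal CUDA 9 sketch of the model, partitioning a thread block into 32-thread tiles and reducing within each tile (the Cooperative Groups counterpart of the raw warp primitives shown earlier):

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    // Each 32-thread tile computes a partial sum with tile-scoped
    // synchronization instead of a whole-block __syncthreads().
    __global__ void tileSum(const float *in, float *out, int n) {
        cg::thread_block block = cg::this_thread_block();
        cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float v = (i < n) ? in[i] : 0.0f;

        for (int offset = tile.size() / 2; offset > 0; offset /= 2)
            v += tile.shfl_down(v, offset);           // reduction within the tile

        if (tile.thread_rank() == 0) atomicAdd(out, v);
    }

Grid-wide synchronization across thread blocks additionally requires launching through cudaLaunchCooperativeKernel rather than the usual <<<...>>> syntax.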

Maximum Performance and Maximum Efficiency Modes

In Maximum Performance mode, the Tesla V100 accelerator operates unconstrained up to its TDP (Thermal Design Power) level of 300 W to accelerate applications that require the fastest computational speed and highest data throughput. Maximum Efficiency mode allows data center managers to tune the power usage of their Tesla V100 accelerators to operate with optimal performance per watt. A not-to-exceed power cap can be set across all GPUs in a rack, reducing power consumption dramatically while still obtaining excellent rack-level performance.
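
Such caps are typically set through NVML, the management library underneath the nvidia-smi tool. A minimal sketch, assuming a hypothetical 250 W target:

    #include <nvml.h>
    #include <cstdio>

    // Cap GPU 0's power draw via NVML (the library beneath nvidia-smi).
    // Requires administrative privileges; 250000 mW (250 W) is just an
    // example target, not a recommended setting.
    int main() {
        nvmlInit();
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(0, &dev);
        nvmlReturn_t rc = nvmlDeviceSetPowerManagementLimit(dev, 250000);
        if (rc != NVML_SUCCESS)
            printf("failed to set power cap: %s\n", nvmlErrorString(rc));
        nvmlShutdown();
        return 0;
    }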

Volta Optimized Software

New versions of deep learning frameworks such as Caffe2, MXNet, CNTK, TensorFlow, and others harness the performance of Volta to deliver dramatically faster training times and higher multi-node training performance. Volta-optimized versions of GPU-accelerated libraries such as cuDNN, cuBLAS, and TensorRT leverage the new features of the Volta GV100 architecture to deliver higher performance for both deep learning and High Performance Computing (HPC) applications. The NVIDIA CUDA Toolkit version 9.0 includes new APIs and support for Volta features to provide even easier programmability.
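
As an example of how a Volta-optimized library surfaces the new hardware, cuBLAS in CUDA 9 lets a GEMM opt into Tensor Core execution through its math-mode setting. A sketch, assuming the FP16 input matrices already live on the device:

    #include <cublas_v2.h>
    #include <cuda_fp16.h>

    // Run a GEMM on the Tensor Cores: FP16 inputs, FP32 accumulation.
    // d_A (m x k), d_B (k x n), d_C (m x n) are column-major device
    // arrays; dimensions should be multiples of 8 for the Tensor Core path.
    void gemmTensorOps(cublasHandle_t handle, int m, int n, int k,
                       const half *d_A, const half *d_B, float *d_C) {
        cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);  // opt in to Tensor Cores
        const float alpha = 1.0f, beta = 0.0f;
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                     &alpha,
                     d_A, CUDA_R_16F, m,
                     d_B, CUDA_R_16F, k,
                     &beta,
                     d_C, CUDA_R_32F, m,
                     CUDA_R_32F,                    // accumulate in FP32
                     CUBLAS_GEMM_DFALT_TENSOR_OP);  // CUDA 9 algorithm selector
    }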

The Radeon Open Compute Platform (ROCm) is an open-source, language-independent platform for GPU computing that brings modular software development to the field. It provides a genuinely cheaper alternative to Nvidia's CUDA and helps developers write compute-oriented software for AMD Radeon GPUs as well as convert existing CUDA software to run on GCN hardware.
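
The conversion path is ROCm's HIP layer, whose hipify tools mechanically rename CUDA runtime calls into their HIP equivalents. A minimal sketch of how little usually has to change, with the HIP renames noted in comments (illustrative, not exhaustive):

    // A trivial CUDA allocation + kernel launch; the comments show the
    // renames ROCm's hipify tools apply.
    __global__ void axpy(float a, const float *x, float *y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // unchanged under HIP
        if (i < n) y[i] += a * x[i];
    }

    void runAxpy(float a, const float *x, float *y, int n) {
        float *dx, *dy;
        size_t bytes = n * sizeof(float);
        cudaMalloc(&dx, bytes);                           // HIP: hipMalloc
        cudaMalloc(&dy, bytes);
        cudaMemcpy(dx, x, bytes, cudaMemcpyHostToDevice); // HIP: hipMemcpy / hipMemcpyHostToDevice
        cudaMemcpy(dy, y, bytes, cudaMemcpyHostToDevice);
        axpy<<<(n + 255) / 256, 256>>>(a, dx, dy, n);     // HIP: hipLaunchKernelGGL(axpy, ...)
        cudaMemcpy(y, dy, bytes, cudaMemcpyDeviceToHost);
        cudaFree(dx);                                     // HIP: hipFree
        cudaFree(dy);
    }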

In the past you almost had to purchase the Intel / Nvidia combo for serious GPU computing (Nvidia doesn't have an x86 license). Now you have AMD Zen as an option on the CPU side and AMD Radeon on the GPU side of things.

For now the compute cards from AMD are the S-series, like the S9300 x2 (2x 4096 GCN cores). The fastest APU is stuck at 384 cores. AMD will soon release server-based APUs, like Raven Ridge, which will be a much lower-cost solution than buying a server CPU and a discrete server GPU separately. You should be able to port over code written for previous hardware with little effort.

With Zen, AMD could offer a nice HPC ecosystem with good communication between x86 CPU and GPU, large memory bandwidth, and excellent I/O. While Intel may have faster x86 CPUs and a faster storage medium in Optane, and Nvidia may have faster GPUs along with a more accessible language in CUDA, AMD could have a better-integrated HPC system where the sum exceeds the performance of the individual Intel / Nvidia combo parts.

Of course, such a system could serve deep learning and neural network applications more efficiently and at lower cost.

The Nvidia DGX-1 is a new HPC system (not just a server) built around Tesla P100 accelerators for GPU computing. It includes 2x Intel Xeon E5-2698 v3 (16 core, Haswell-EP) and 8 P100s, for 28,672 CUDA cores and 128GB of combined VRAM. The DGX-1 is rated to hit 170 FP16 TFLOPS of performance (or 85 FP32 TFLOPS) in a 3U chassis.

The P100 has a new form factor and connector, requiring completely new infrastructure to run. The 8 P100s are installed in a hybrid mesh cube configuration, making full use of the NVLink interconnect to offer a significant amount of memory bandwidth between the GPUs. Each NVLink provides 20GB/s in each direction, and with 4 links per GP100 GPU the aggregate bandwidth is 80GB/s up and another 80GB/s down.

The DGX-1 system runs Canonical's Ubuntu Server with Pascal GPU drivers created by Nvidia. Note that most hyperscalers deploying large CPU / GPU clusters to train their neural networks use Ubuntu. The system also includes Nvidia's Deep Learning SDK and its DIGITS GPU training system, as well as the CUDA programming environment and a set of machine learning frameworks, all bundled and tuned for the Pascal GPUs. Nvidia invested heavily in NVLink, its high-speed interconnect, to enable fast memory access between GPUs and unified memory between GPU and CPU.

The downside is that you are locked into the Intel / Nvidia combo, with comparatively inefficient x86 CPU / GPU integration (Nvidia doesn't have an x86 license, so CPU-to-GPU traffic crosses PCIe rather than NVLink). The lack of competition in this space is disconcerting.

The upside is that Intel has fast x86 CPUs and fast storage in Optane, and Nvidia has a nice, accessible language in CUDA.

The DGX-1 delivers high performance for deep learning and neural network applications. Features include:

2x Intel Xeon E5-2698 v3 (16 core, Haswell-EP)

8 P100s for 28,672 CUDA cores and 128GB of combined VRAM

High speed, high bandwidth interconnect for maximum application scalability