InfiniBand: When State-of-the-Art becomes State-of-the-Smart

In this special guest post from the Print ‘n Fly Guide to SC16, Scot Schultz from Mellanox writes that the company is moving the industry forward to a world-class off-load network architecture that will pave the way to Exascale.

Scot Schultz, Director of HPC/Technical Computing at Mellanox

Bringing together the most respected minds in high performance computing, networking, storage and acceleration technologies is always a key goal for the annual Supercomputing Conference. The goal is not only to showcase innovations that will drive new scientific research and life-changing discovery, but also to demonstrate how HPC plays a fundamental role in driving social and economic opportunities across the globe. For the past few years, since SC14 in New Orleans, the core theme and tag line of the conference has been ‘HPC matters’, which captures the essence of this event. In this spirit, I would like to focus on one of the hottest topics in HPC today: why the ‘interconnect matters’, and specifically a smarter network, a new era of in-network computing, and the importance of offload networking architectures. Now more than ever, the network is a key enabler in solving our most challenging problems, and it has in fact revealed itself as the critical element in reaching Exascale. As many of us return to Salt Lake City for SC16 to discover the latest HPC achievements and technology unveilings, I would like to review this exciting time for an industry that maintains no boundaries. In particular, I’m going to focus on the modern-day network, which is fundamental to achieving understanding and discovery in science and data.

Network architectures

Today, there are fundamentally two network architectures. One architecture leverages the microprocessor to perform network operations and communication. This is known as on-load architecture. On-load interconnect technology is much cheaper and easier to implement, but its essential challenge is CPU utilization: because the CPU must manage and execute network operations, it has less time available for running applications, which is its primary purpose.

Off-load architecture, on the other hand, strives to overcome the performance bottlenecks of the CPU. Off-load networking performs all of the network functions in the network hardware rather than on the CPU, including complex communication operations such as collective operations or data aggregation. This has the added advantage of increasing the availability of the CPU for compute functions and improving the overall efficiency of the system.
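The difference between the two architectures can be illustrated with a deliberately idealized model. The sketch below assumes that an on-load CPU must finish its communication work before or after its compute work, while an off-load adapter handles communication fully in parallel with the CPU. The function names and the 8-second and 2-second figures in the usage note are hypothetical and chosen only for illustration; real systems overlap imperfectly.

```python
def step_time(compute_s: float, comm_s: float, offload: bool) -> float:
    """Wall-clock time of one compute-plus-communicate step (idealized).

    Assumption: with on-load networking the CPU does both jobs one after
    the other; with off-load networking the adapter does the communication
    while the CPU computes, so the two overlap completely.
    """
    if compute_s < 0 or comm_s < 0:
        raise ValueError("durations must be non-negative")
    if offload:
        return max(compute_s, comm_s)
    return compute_s + comm_s


def application_share(compute_s: float, comm_s: float, offload: bool) -> float:
    """Fraction of wall-clock time the CPU spends on application work."""
    total = step_time(compute_s, comm_s, offload)
    if total == 0:
        return 0.0
    return compute_s / total
```

With a hypothetical 8 seconds of compute and 2 seconds of communication per step, this model gives a 10-second step and an 80 percent application share for on-load, against an 8-second step and a 100 percent share for off-load.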

Mellanox is renowned for dedicating development and engineering resources to a more effective and efficient interconnect, and today we’re weaving intelligence into the interconnect fabric, improving our proven network acceleration engines, and adding capabilities that further remove communication tasks from the CPU and dramatically increase system efficiency.

The benefits of off-load network architecture

In the transition from large SMP systems to the more modern, cluster-based Beowulf approach that began in the 1990s, maximizing performance was standardized on an off-load network architecture from the earliest generations of InfiniBand. InfiniBand was not only an order of magnitude faster than Ethernet networks; its core benefit, RDMA, was notably the first offload capability of modern-day HPC systems. Removing the CPU and operating system semantics from the network data path produced a significant performance uplift for applications, and RDMA was quickly adopted as the de facto standard for high performance computing. Together, the industry rose to the challenge and standardized on the clear advantages of RDMA for high performance computing.

Today, native RDMA continues to be the most efficient method for scaling and reducing the overall runtime of the most complex applications in technical computing. Mellanox continues evolving the industry standards of InfiniBand and native RDMA while investing in complementary acceleration engines, such as Core-Direct, GPUDirect RDMA, and even techniques that allow bare-metal, line-rate network performance to be achieved in virtualized environments.

Additional features to improve fabric scalability and resilience, beyond the base InfiniBand specifications, have been fundamental to achieving maximum application scalability and overall system performance. These improvements include support for adaptive routing, dynamically connected transport, InfiniBand routing capabilities, in-network computing, and several other enhanced reliability features.

What is even more compelling is that it is not just Mellanox moving the industry forward to a world-class off-load network architecture. In fact, we are just a leading voice in the HPC choir. A new age of collaboration has taken hold to advance the concepts of off-load network architecture.

The era of Co-Design is here

Historically, increases in performance have been achieved around a CPU-centric mindset, with the individual hardware devices, drivers, middleware, and software applications each developed separately in order to improve scalability and maximize throughput. This archaic model is becoming short-lived. As a new era of Co-Design moves the industry toward Exascale-class computing, the creation of synergies between all system elements is the next approach that can lead to significant performance improvements.

Mellanox, alongside many industry thought-leaders, is a leader in advancing the Co-Design approach. The key value and core goal is to strive for more CPU offload capabilities and acceleration techniques while maintaining forward and backward compatibility of new and existing infrastructures. The result is nothing less than the world’s most advanced interconnect, which continues to yield the most powerful and efficient supercomputers ever deployed.

Unprecedented scalability and efficiency

Mellanox InfiniBand solutions are being used today in 50 percent of the published petaflop systems according to the June 2016 TOP500 supercomputing list. Our interconnect solutions are also deployed in the world’s fastest supercomputer at the National Supercomputing Center in Wuxi, China, which is comprised of 41,000 nodes and delivers 93 petaflops of performance, utilizing more than 10,649,600 cores. Mellanox InfiniBand is the interconnect fabric of choice for the most efficient system in the TOP500, listed with a calculated system efficiency of 99.8 percent. As an industry-standards based interconnect technology, adoption continues to broaden with more than 41 percent of the overall TOP500 systems.

Mellanox is notably the only industry standards-based interconnect technology provider that has proven forwards and backwards interoperability across generations of technology. For example, “Pleiades” is one of the world’s most powerful supercomputers and represents NASA’s state-of-the-art technology for meeting the agency’s most demanding supercomputing requirements. The Pleiades system enables NASA scientists and engineers to conduct modeling and simulation and it depends upon multiple generations of Mellanox InfiniBand technology across 20,000+ compute nodes, including accelerated GPU-enhanced compute resources.

In-network co-processing; serious network capabilities

Mellanox EDR 100Gb/s InfiniBand technology builds on more than 15 years of experience in designing and innovating high-performance networks.

The addition of elements in the second generation 100Gb/s EDR ConnectX®-5 introduced more powerful capabilities for supporting in-network computing. These elements included MPI tag matching in hardware, advanced adaptive routing capabilities, support for out-of-order packets, and in-network memory. These efforts now also extend to switch hardware and to software enhancements, including network management software, user interfaces, and enhanced communication libraries.

Switch-IB 2 is likewise a member of the second generation EDR 100Gb/s family, and also includes features to improve fabric scalability and resilience. Above and beyond its improvements to ultra-low latency and unmatched performance, it also introduced our “SHArP” (Scalable Hierarchical Aggregation Protocol) technology, which improves the performance of collective operations by processing the data as it traverses the network. We have now eliminated the need to send data multiple times between endpoints. This new in-network co-processing paradigm is the latest in a Co-Design approach to Exascale, which decreases the amount of data traversing the network. Processing collective communication algorithms directly in the network fabric gives additional scalability benefits, often more than 10X. SHArP frees up valuable CPU resources for application computation and allows applications to scale. On-load networking, which taxes the CPU for every aspect of network operations, will never be able to leverage in-network co-processing or to match the performance, low latency, and scalability that off-load network architecture can provide.
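To see why aggregating inside the fabric reduces the data traversing the network, consider a toy model of a sum reduction over a tree of switches. This is a sketch under stated assumptions and not the SHArP protocol itself: the function name, the regular tree shape, and the one-value-per-message accounting are all hypothetical simplifications. It counts how many times a message crosses a link when switches combine their children's values, compared with forwarding every value unchanged to the top.

```python
from typing import List, Tuple


def reduce_in_tree(values: List[int], radix: int, aggregate: bool) -> Tuple[int, int]:
    """Sum host values at the root of a switch tree (toy model).

    values    -- one value per host; hosts are the leaves of the tree
    radix     -- number of children attached to each switch (assumed >= 2)
    aggregate -- True: each switch sums what it receives and forwards one
                 message; False: each switch forwards every message unchanged
    Returns (sum seen at the root, number of link traversals).
    """
    if not values or radix < 2:
        raise ValueError("need at least one host and radix >= 2")

    level = [[v] for v in values]  # messages leaving each node at this level
    traversals = 0
    while True:
        # Every outgoing message crosses one link to reach the parent switch.
        traversals += sum(len(msgs) for msgs in level)
        parents = []
        for i in range(0, len(level), radix):
            incoming = [m for child in level[i:i + radix] for m in child]
            parents.append([sum(incoming)] if aggregate else incoming)
        level = parents
        if len(level) == 1:  # reached the root switch
            break
    return sum(level[0]), traversals
```

For 16 hosts holding the values 1 through 16 and a radix of 4, both modes produce the same sum of 136, but the aggregating tree needs 20 link traversals while the forwarding tree needs 32. The gap widens as the tree gets deeper, since without aggregation every value must cross every level.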

What to expect next at SC16 in Salt Lake City?

At Mellanox, we have always maintained our commitment to leadership in high performance computing. This year, expect no different. We will demonstrate new capabilities that unlock new potential of HPC systems to address the most challenging workloads in science and data and drive new industries of tomorrow.

Expect nothing less than exceptional from Mellanox Technologies — Always a generation ahead.

###

This Industry Perspective is just one of the great features in the new Print ‘n Fly Guide to SC16 in Salt Lake City. Inside this guide you will find technical features on supercomputing, HPC interconnects, and the latest developments on the road to exascale. It also has great recommendations on food, entertainment, and transportation in SLC.
