Category Archives: High Performance Computing (HPC)

InfiniBand and Ethernet are the leading interconnect solutions for connecting servers and storage systems in high-performance computing and in enterprise (virtualized or not) data centers. Recently, the HPC Advisory Council has put together the most comprehensive database for high-performance computing applications to help users understand the performance, productivity, efficiency and scalability differences between InfiniBand and 10 Gigabit Ethernet.

In summary, there are a large number of HPC applications that need the lowest possible latency or the highest bandwidth for best performance (for example, Oil & Gas applications as well as weather-related applications). There are some HPC applications that are not latency sensitive. For example, gene sequencing and some bioinformatics applications are not sensitive to latency and scale well on TCP-based networks, including GigE and 10GigE. For HPC converged networks, putting HPC message-passing traffic and storage traffic on a single TCP network may not provide enough data throughput for either. Finally, there are a number of examples showing that 10GigE has limited scalability for HPC applications and that InfiniBand proves to be a better performance, price/performance, and power solution than 10GigE.

High-performance computing plays an invaluable role in research, product development and education. It helps accelerate time to market, and it provides significant cost reductions and tremendous flexibility in product development. One strength of high-performance computing is the ability to achieve the best sustained performance by driving CPU performance towards its limits. Over the past decade, high-performance computing has migrated from supercomputers to commodity clusters. More than eighty percent of the world’s Top500 compute system installations in June 2009 were clusters. The driver for this move appears to be a combination of Moore’s Law (enabling higher-performance computers at lower costs) and the drive for the best cost/performance and power/performance. Cluster productivity and flexibility are the most important factors for a cluster’s hardware and software configuration.

A deeper examination of the world’s Top500 systems based on commodity clusters shows two main interconnect solutions being used to connect the servers that make up those powerful compute systems – InfiniBand and Ethernet. If we divide the systems according to the interconnect family, we see that the same CPUs, memory speeds and other settings are common to both groups. The only difference between the two groups, besides the interconnect, is the system efficiency: how many CPU cycles can be dedicated to the application work, and how many of them are wasted. The graph below lists the systems according to their interconnect and their measured efficiency.

Top500 Interconnect Efficiency

As seen, systems connected with Ethernet achieve 50% efficiency on average, which means that 50% of the CPU cycles are wasted on non-application work or sit idle, waiting for data to arrive. Systems connected with InfiniBand achieve above 80% efficiency on average, which means that less than 20% of the CPU cycles are wasted. Moreover, the latest InfiniBand-based systems have demonstrated up to 94% efficiency (the best Ethernet-connected systems demonstrated 63% efficiency).
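For readers who want to reproduce these numbers, the efficiency figure is simply the ratio of measured Linpack performance (Rmax) to theoretical peak (Rpeak) reported for each Top500 entry. The sketch below shows the calculation with made-up placeholder numbers rather than actual list entries.

```c
/* Minimal sketch: how Top500 "efficiency" is derived.
 * Efficiency = Rmax (measured Linpack performance) / Rpeak (theoretical peak).
 * The figures below are illustrative placeholders, not data from the list. */
#include <stdio.h>

int main(void)
{
    double rpeak_tflops = 100.0;   /* hypothetical theoretical peak        */
    double rmax_tflops  = 82.0;    /* hypothetical measured Linpack result */

    double efficiency = rmax_tflops / rpeak_tflops;   /* fraction used     */
    double wasted     = 1.0 - efficiency;             /* fraction lost     */

    printf("Efficiency: %.0f%%, wasted/idle cycles: %.0f%%\n",
           efficiency * 100.0, wasted * 100.0);
    return 0;
}
```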

One might argue that the Linpack benchmark is not the best benchmark for measuring parallel application efficiency and that it does not fully utilize the network. The graph results are a clear indication that, even for Linpack, the network makes a difference, and for more communication-intensive parallel applications the gap will be much larger.

When choosing the system configuration, with the goal of maximizing return on investment, one needs to make sure no artificial bottlenecks are created. Multi-core platforms, parallel applications, large databases, etc. require fast data exchange, and lots of it. Ethernet can become the system bottleneck due to its latency and bandwidth limits and the CPU overhead of TCP/UDP processing (TOE solutions introduce other issues, sometimes more complicated, but this is a topic for another blog), reducing system efficiency to 50%. This means that half of the compute system is wasted and just consumes power and cooling. The same performance capability could have been achieved with half the servers had they been connected with InfiniBand. More data on application performance, productivity and ROI can be found at the HPC Advisory Council web site, under content/best practices.

While InfiniBand will demonstrate higher efficiency and productivity, there are several ways to increase Ethernet efficiency. One of them is optimizing the transport layer to provide zero copy and lower CPU overhead (not by using TOE solutions, as those introduce single points of failure in the system). This capability is known as LLE (low latency Ethernet). More on LLE will be discussed in future blogs.

On Tuesday, May 26, Research Center Jülich reached a significant milestone in German and European supercomputing with the inauguration of two new supercomputers: the supercomputer JUROPA and the fusion machine HPC-FF. The symbolic start of the systems was triggered by the German Federal Minister for Education and Research, Prof. Dr. Annette Schavan, the Prime Minister of North Rhine-Westphalia, Dr. Jürgen Rüttgers, and Prof. Dr. Achim Bachem, Chairman of the Board of Directors at Research Center Jülich, as well as high-ranking international guests from academia, industry and politics.

JUROPA (which stands for Juelich Research on Petaflop Architectures) will be used Europe-wide by more than 200 research groups to run their data-intensive applications. JUROPA is based on a cluster configuration of Sun Blade servers, Intel Nehalem processors, Mellanox 40Gb/s InfiniBand and the ParaStation cluster operating software from ParTec Cluster Competence Center GmbH. The system was jointly developed by experts of the Jülich Supercomputing Centre and implemented with the partner companies Bull, Sun, Intel, Mellanox and ParTec. It consists of 2,208 compute nodes with a total computing power of 207 teraflop/s and was sponsored by the Helmholtz Association. Prof. Dr. Dr. Thomas Lippert, Head of the Jülich Supercomputing Centre, explains the HPC installation in Jülich in the video below.

HPC-FF (High Performance Computing – for Fusion), drawn up by the team headed by Dr. Thomas Lippert, director of the Jülich Supercomputing Centre, was optimized and implemented together with the partner companies Bull, Sun, Intel, Mellanox and ParTec. This new best-of-breed system, one of Europe’s most powerful, will support advanced research in many areas such as health, information, environment, and energy. It consists of 1,080 compute nodes, each equipped with two quad-core Nehalem EP processors from Intel. Their total computing power of 101 teraflop/s corresponds, at the present moment, to 30th place in the list of the world’s fastest supercomputers. The combined cluster will achieve 300 teraflop/s of computing power and will be included in the Top500 list, published this month at ISC’09 in Hamburg, Germany.
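As a rough sanity check of the quoted peak figure, one can multiply cores by clock speed and floating-point operations per cycle. The sketch below assumes 2.93 GHz quad-core Nehalem EP parts and 4 double-precision flops per cycle per core; those two figures are assumptions on my part, not numbers from the announcement.

```c
/* Back-of-the-envelope check of the quoted 101 teraflop/s peak for HPC-FF.
 * The clock speed (2.93 GHz) and 4 double-precision flops/cycle/core are
 * assumptions about the Nehalem EP parts, not figures from the announcement. */
#include <stdio.h>

int main(void)
{
    int    nodes            = 1080;  /* compute nodes (from the post)  */
    int    sockets_per_node = 2;     /* two Nehalem EP CPUs per node   */
    int    cores_per_socket = 4;     /* quad-core                      */
    double ghz              = 2.93;  /* assumed clock speed            */
    double flops_per_cycle  = 4.0;   /* assumed DP flops/cycle (SSE)   */

    double cores = (double)nodes * sockets_per_node * cores_per_socket;
    double peak_tflops = cores * ghz * flops_per_cycle / 1000.0; /* GF -> TF */

    printf("%.0f cores, ~%.1f teraflop/s peak\n", cores, peak_tflops);
    return 0;
}
```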

40Gb/s InfiniBand from Mellanox is used as the system interconnect. The administrative infrastructure is based on NovaScale R422-E2 servers from the French supercomputer manufacturer Bull, who supplied the compute hardware and the Sun ZFS/Lustre file system. The cluster operating system, ParaStation V5, is supplied by the Munich software company ParTec. HPC-FF is being funded by the European Commission (EURATOM), the member institutes of EFDA, and Forschungszentrum Jülich.

This week I presented at the LS-DYNA user conference. LS-DYNA is one of the most widely used applications for automotive computer simulations – simulations that are used throughout the vehicle design process and decrease the need to build expensive physical prototypes. Computer simulation has shortened the vehicle design cycle from years to months and is responsible for cost reductions throughout the process. Almost every part of the vehicle is designed with computer-aided simulations – from crash/safety simulation to engine and gasoline flow, from air conditioning to water pumps.

Today’s challenges in vehicle simulation are driven by the motivation to build more economical and ecological designs: how to design lighter vehicles (using less material) while meeting increasingly demanding safety regulations. For example, national and international standards have been put in place that define structural crashworthiness requirements for railway vehicle bodies.

In order to meet all of those requirements and demands, higher compute simulation capability is required. It is no surprise that LS-DYNA is mostly used in high-performance clustering environments, as they provide the flexibility, scalability and efficiency needed for such simulations. Increasing high-performance clustering productivity and the capability to handle more complex simulations are the most important factors for automotive makers today. This requires a balanced cluster design (hardware – CPU, memory, interconnect, GPU – and software), enhanced messaging techniques and the knowledge of how to increase the productivity of a given design.

For LS-DYNA, InfiniBand interconnect-based solutions have proven to provide the highest productivity compared to Ethernet (GigE, 10GigE, iWARP). With InfiniBand, LS-DYNA demonstrates high parallelism and scalability, which enable it to take full advantage of multi-core high-performance computing clusters. In the case of Ethernet, the better choice among GigE, 10GigE and iWARP is 10GigE. While iWARP aims to provide better performance, typical high-performance applications use send-receive semantics, for which iWARP provides no added value; worse, it only increases complexity and CPU overhead/power consumption.
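For readers less familiar with the terminology, “send-receive semantics” refers to the two-sided MPI model in which every send must be matched by a receive on the peer. A minimal, purely illustrative sketch (not LS-DYNA code) is shown below.

```c
/* Minimal sketch of the two-sided (send-receive) semantics most HPC codes,
 * including MPI-based solvers, rely on. Illustrative only; run with 2+ ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double payload[1024] = {0};

    if (rank == 0) {
        /* Rank 0 posts a send; it completes only against a matching receive. */
        MPI_Send(payload, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(payload, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d doubles\n", 1024);
    }

    MPI_Finalize();
    return 0;
}
```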

If you want a copy of a paper that presents how to increase simulation productivity while decreasing power consumption, don’t hesitate to send me a note (hpc@mellanox.com).

High-performance clusters bring many advantages to the end user, including flexibility and efficiency. With the increasing number of applications being served by high-performance systems, new systems need to serve multiple users and applications. Traditional high-performance systems typically served a single application at a given time; to maintain maximum flexibility, a new concept of “HPC as a Service” (HPCaaS) has been developed. HPCaaS includes the capability to use clustered servers and storage as resource pools, a web interface for users to submit job requests, and a smart scheduling mechanism that can schedule multiple different applications simultaneously on a given cluster, taking into consideration each application’s characteristics for maximum overall productivity.

HPC as a Service enables greater system flexibility since it eliminates the need for dedicated hardware resources per application and allows dynamic allocation of resources per given task while maximizing productivity. It is also the key component in bringing high-performance computing into cloud computing. Effective HPCaaS, though, needs to take into consideration each application’s demands and provide the minimum hardware resources required per application. Scheduling multiple applications to run at once requires balancing resources across applications in proportion to their demands.
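To make the proportional-allocation idea concrete, here is a toy sketch of a scheduler splitting a fixed node pool across pending applications according to their declared demands. The application names and demand figures are hypothetical, and a real HPCaaS scheduler would of course consider far more than a single demand number.

```c
/* Toy sketch of the proportional-allocation idea behind HPCaaS scheduling:
 * split a fixed pool of nodes across pending applications in proportion to
 * their declared demand. Names and demand figures are hypothetical. */
#include <stdio.h>

struct job { const char *name; double demand; };

int main(void)
{
    struct job jobs[] = {
        { "cfd_solver",   120.0 },   /* hypothetical node demand */
        { "bio_pipeline",  40.0 },
        { "render_farm",   40.0 },
    };
    int njobs = 3;
    int pool  = 100;                 /* nodes available in the cluster */

    double total = 0.0;
    for (int i = 0; i < njobs; i++)
        total += jobs[i].demand;

    for (int i = 0; i < njobs; i++) {
        int granted = (int)(pool * jobs[i].demand / total);
        printf("%-14s granted %d nodes\n", jobs[i].name, granted);
    }
    return 0;
}
```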

Research activities on HPCaaS are being performed at the HPC Advisory Council (http://hpcadvisorycouncil.mellanox.com/). The results show the need for high-performance interconnects, such as 40Gb/s InfiniBand, to maintain high productivity levels. It was also shown that scheduling mechanisms can be set up to guarantee the same levels of productivity in HPCaaS as in the “native” dedicated-hardware approach. HPCaaS is not only critical to the way we will perform high-performance computing in the future; as more HPC elements are brought into the data center, it will also become an important factor in building the most efficient enterprise data centers.

The industry has been talking about it for a long time, but on March 30th it was officially announced. The new Xeon 5500 “Nehalem” platform from Intel introduces a totally new server architecture for Intel-based platforms. The memory has moved from being connected to the chipset to being connected directly to the CPU, and the memory speed has increased. More importantly, PCI-Express (PCIe) Gen2 can now be fully utilized to unleash new performance and efficiency levels from Intel-based platforms. PCIe Gen2 is the interface between the CPU and memory on one side and, on the other, the networking that connects servers together to form compute clusters. With PCIe Gen2 now integrated in compute platforms from the majority of OEMs, more data can be sent and received by a single server or blade. This means that applications can exchange data faster and complete simulations much sooner, bringing a competitive advantage to end-users. In order to feed PCIe Gen2, one needs a big pipe in the networking solution, and this is what 40Gb/s InfiniBand brings to the table. It is no surprise that multiple server OEMs (for example HP and Dell) announced the availability of 40Gb/s InfiniBand in conjunction with Intel’s announcement.
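The pipe matching is easy to see with a little arithmetic: assuming an x8 PCIe Gen2 slot and a 4x QDR InfiniBand link, both carry roughly 32Gb/s of payload per direction once 8b/10b encoding overhead is taken into account. The sketch below works through the numbers; the lane count and encoding assumptions are mine, not from the announcement.

```c
/* Rough pipe-matching arithmetic: a PCIe Gen2 x8 slot and a 40Gb/s (QDR 4x)
 * InfiniBand port both carry ~32 Gb/s of data after 8b/10b encoding.
 * Lane count and encoding overhead are stated assumptions. */
#include <stdio.h>

int main(void)
{
    double pcie_lanes   = 8.0;          /* assumed x8 slot                */
    double pcie_gt_s    = 5.0;          /* PCIe Gen2 signaling rate, GT/s */
    double ib_signaling = 40.0;         /* QDR 4x link, Gb/s              */
    double encoding     = 8.0 / 10.0;   /* 8b/10b overhead                */

    double pcie_gbps = pcie_lanes * pcie_gt_s * encoding;  /* per direction */
    double ib_gbps   = ib_signaling * encoding;

    printf("PCIe Gen2 x8: ~%.0f Gb/s, 40Gb/s InfiniBand: ~%.0f Gb/s\n",
           pcie_gbps, ib_gbps);
    return 0;
}
```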

With a digital media rendering application – Direct Transport Compositor – we have seen a 100% increase in frames-per-second delivery while increasing the screen anti-aliasing at the same time. Other applications have shown a similar level of performance and productivity boost as well.

The reasons for the new performance levels are the decrease in latency (1usec) and the huge increase in throughput (more than 3.2GB/s uni-directional and more than 6.5GB/s bi-directional on a single InfiniBand port). With the increase in the number of CPU cores and the new server architecture, bigger pipes in and out of the servers are required in order to keep the system balanced and to avoid creating artificial bottlenecks. Another advantage of InfiniBand is its ability to use RDMA and transfer data directly to and from host memory, without involving the CPU in the data transfer. This means one thing only – more CPU cycles can be dedicated to the applications!
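One way to see what RDMA-style, one-sided data movement looks like from the application side is MPI’s one-sided interface, where an origin rank writes directly into a memory window exposed by the target with no matching receive call. The sketch below is purely illustrative and says nothing about how any particular application uses RDMA.

```c
/* Sketch of one-sided (RDMA-style) data movement using MPI_Put: the origin
 * rank writes directly into a window exposed by the target, with no matching
 * receive call on the target side. Illustrative only; run with 2+ ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local[256] = {0};
    double target_buf[256] = {0};
    MPI_Win win;

    /* Every rank exposes target_buf as a window others may write into. */
    MPI_Win_create(target_buf, sizeof(target_buf), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        for (int i = 0; i < 256; i++) local[i] = (double)i;
        /* Write directly into rank 1's window. */
        MPI_Put(local, 256, MPI_DOUBLE, 1, 0, 256, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 1)
        printf("rank 1: target_buf[255] = %.0f\n", target_buf[255]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```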

You may be asking yourself, how does this address my cluster computing needs? Does the Windows OFED stack released by Mellanox provide the same performance seen on the Linux OFED stack release?

Well, the Windows networking stack is optimized to address the needs of various HPC vertical segments. In our benchmark tests with MPI applications that require low latency and high performance, the latency is in the low 1us range with a bandwidth of 3GByte/sec uni-directional, using the Microsoft MS-MPI protocol.
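For readers who want to measure latency on their own clusters, numbers of this kind typically come from a simple ping-pong microbenchmark. The sketch below uses only standard MPI calls, so it should build against MS-MPI as well as Linux MPI stacks; the iteration count is arbitrary.

```c
/* Minimal ping-pong sketch of the kind of latency microbenchmark behind the
 * numbers above. Uses only standard MPI calls; run with exactly 2 ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char byte = 0;
    const int iters = 10000;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("one-way latency: %.2f us\n",
               (t1 - t0) / (2.0 * iters) * 1e6);

    MPI_Finalize();
    return 0;
}
```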

Mellanox’s 40Gb/s InfiniBand adapters (ConnectX) and switches (InfiniScale IV), with their proven performance, efficiency and scalability, allow data centers to scale up to tens of thousands of nodes with no drop in performance. Our drivers and Upper Level Protocols (ULPs) allow end-users to take advantage of the RDMA networking available in Windows® HPC Server 2008.