The Network IS the Cluster: Infiniband and Ethernet Network Fabric Solutions for HPC (Part Two)

In part one, I introduced the two most popular HPC networking technologies — Ethernet (GigE and 10GigE) and Infiniband. We also compared latency, bandwidth, and the N/2 performance of these technologies. While these numbers give a general feel for performance, there is no easy way to determine the actual performance of your application. Benchmarking your application suite is really the only way to know for sure if a networking technology works best for your needs. Other issues, such as drivers, manageability, infrastructure, etc. may also be important in your choice.

In order to look at actual performance, I chose two popular commercial applications with published benchmark sets and results. As with most things HPC, the results are never quite what you would expect.

Application Performance

Now that we know the performance of the various interconnects (using micro-benchmarks) and the approximate per port or per node costs, let’s take a look at real application performance and see how well the interconnects perform.

I’m going to look at two Computer Aided Engineering (CAE) applications: Fluent and LS-Dyna. Fluent is a Computational Fluid Dynamics (CFD) application used in many industry sectors. LS-Dyna is what some people call a crash or impact code because it can model objects impacting each other using finite elements. It has other capabilities as well, but the impact analysis seems to have gained it the most notoriety. Who doesn’t like to see things smash together?

In the case of Fluent, the parent company, ANSYS, provides benchmark examples and allows people to run them and post their results to Fluent’s Web site. For LS-Dyna, which is also sold by ANSYS, there is a public Web site called TopCrunch that hosts many example impact problems. People are free to download the input data, run the cases, and report the results back to TopCrunch.

Fluent Results

Fluent problem sizes can vary, so the benchmark suite is broken into groups: small, medium, and large. I’m going to look at the larger Fluent problems because they are closest in size to what people actually run; the smaller benchmarks complete quickly on today’s processors. Table One outlines the tests we will examine.

Table One – Fluent Example Problems

| Problem | Number of Cells | Cell Type  | Models                   | Solver              |
|---------|-----------------|------------|--------------------------|---------------------|
| FL5L1   | 847,764         | hexahedral | RNG k-epsilon turbulence | coupled explicit    |
| FL5L2   | 3,618,080       | hybrid     | k-epsilon turbulence     | segregated implicit |
| FL5L3   | 9,792,512       | hexahedral | RSM turbulence           | segregated implicit |

Problem size is measured by the number of cells that must be solved. Notice that the problems start fairly small and grow to almost 10 million cells. Problems with fewer than a million cells can usually be solved on a desktop workstation. A general rule of thumb for Fluent is that it takes about 1 GB of memory per 1,000,000 cells.
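That rule of thumb makes for a quick sizing script. A minimal sketch (the 1 GB per million cells figure is only the approximation above, not an exact Fluent requirement):

```python
def fluent_memory_gb(cells, gb_per_million=1.0):
    """Rough Fluent memory estimate: ~1 GB per million cells."""
    return cells / 1_000_000 * gb_per_million

# The three benchmark cases from Table One
for name, cells in [("FL5L1", 847_764), ("FL5L2", 3_618_080), ("FL5L3", 9_792_512)]:
    print(f"{name}: ~{fluent_memory_gb(cells):.1f} GB of memory")
```

Even the largest case, FL5L3, fits comfortably in the aggregate memory of a small cluster.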

In order to compare “apples to apples”, we need benchmark results that use the same node hardware, the same OS, and the same version of Fluent, with both GigE and Infiniband. Fortunately, there is one set of results that contains both (sorry, no 10GigE results). Hewlett-Packard has posted results for Fluent 6.2.16 on the same node configuration with GigE and with Infiniband. The hardware is a bit on the older side, but the comparison is still valid. The hardware is:

Node Type: HP DL360

Processor: Intel EM64T

Processor Speed: 3.4 GHz

OS: Red Hat Enterprise Linux 3 (RHEL 3)

Interconnects: GigE and Voltaire IB (SDR)

Table Two contains the results for the FL5L1, FL5L2, and FL5L3 benchmarks for both GigE and IB over a range of core counts. The numbers are performance “ratings” of the run, so the larger the number, the better the performance.

Table Two – Fluent Example Problems: Results

| Problem | Number of Cores | Infiniband | GigE   | % Improvement by IB |
|---------|-----------------|------------|--------|---------------------|
| FL5L1   | 1               | 128        | NA     | NA                  |
| FL5L1   | 2               | 228.6      | 223.7  | 2.19%               |
| FL5L1   | 4               | 441.8      | 420.3  | 5.12%               |
| FL5L1   | 8               | 833.6      | 765.8  | 8.85%               |
| FL5L1   | 16              | 1441.8     | 1222.5 | 17.94%              |
| FL5L1   | 32              | 2333.6     | 1818.9 | 28.30%              |
| FL5L2   | 1               | 94.2       | NA     | NA                  |
| FL5L2   | 2               | 167.6      | 160.6  | 4.36%               |
| FL5L2   | 4               | 337.8      | 320    | 5.56%               |
| FL5L2   | 8               | 658        | 612.7  | 7.39%               |
| FL5L2   | 16              | 1249.4     | 1125.7 | 10.99%              |
| FL5L2   | 32              | 2300.9     | 1746.3 | 31.76%              |
| FL5L3   | 4               | 62.2       | 58.5   | 6.32%               |
| FL5L3   | 8               | 121.3      | 116.1  | 4.48%               |
| FL5L3   | 16              | 237.8      | 221.6  | 7.31%               |
| FL5L3   | 32              | 448.1      | 400.2  | 11.97%              |

For the smaller problems (FL5L1 and FL5L2), Infiniband can make a large difference in performance as the number of cores increases. For example, for the FL5L2 problem at 32 cores, IB is 31.76% faster than GigE. But for the larger problem (FL5L3), Infiniband doesn’t give you that much of a performance boost even at 32 cores (~12%). A simple extrapolation suggests that for larger problems you have to go to large core counts before you see Infiniband really pull away from GigE.
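For the rating-based Fluent results, the improvement column appears to be the ratio of the IB rating to the GigE rating, minus one. A quick sketch to reproduce a couple of entries from Table Two:

```python
def rating_improvement(ib, gige):
    """Percent improvement of IB over GigE for a 'rating' metric
    (a higher rating means better performance)."""
    return (ib / gige - 1.0) * 100.0

# FL5L2 and FL5L1 at 32 cores, from Table Two
print(round(rating_improvement(2300.9, 1746.3), 2))  # 31.76
print(round(rating_improvement(2333.6, 1818.9), 1))  # 28.3
```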

The results also make sense when you consider the local communication pattern of a parallel CFD problem. The larger the problem, the bigger the chunk of work for each core, and hence the larger the ratio of compute time to communication time. As the problem gets smaller, that ratio shrinks and the influence of the interconnect becomes more apparent.
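A toy domain-decomposition model illustrates why: assuming each core works on a roughly cubic block of cells, compute work scales with the block’s volume while halo exchange scales with its surface area, so larger per-core blocks spend relatively less time communicating. This is purely illustrative and not Fluent’s actual partitioning scheme:

```python
def compute_to_comm_ratio(cells_per_core):
    """Toy model: compute work ~ s^3 (cells in a cubic block),
    communication ~ 6*s^2 (halo cells on the six faces)."""
    s = cells_per_core ** (1.0 / 3.0)  # block edge length, in cells
    volume = s ** 3
    surface = 6 * s ** 2
    return volume / surface

# A larger per-core problem has a higher compute/communication ratio
small = compute_to_comm_ratio(847_764 // 32)    # FL5L1 split over 32 cores
large = compute_to_comm_ratio(9_792_512 // 32)  # FL5L3 split over 32 cores
print(small < large)  # True
```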

LS-Dyna Results

The LS-Dyna results from TopCrunch use somewhat more recent hardware. The results were posted by Penguin Computing, using this configuration:

Node Type: Penguin Computing Relion 1600

Processor: Intel Xeon 5160 (Woodcrest)

Processor Speed: 3.0 GHz

OS: NA

Interconnects: GigE and Silverstorm IB

The Xeon 5160 also has a larger cache (4 MB) than most x86_64 CPUs have had in the past.

The test used two TopCrunch benchmarks. The first, “neon_refined”, is a model of a Dodge Neon crashing into a rigid wall. It has 532,077 elements and is the smallest of the three cases on TopCrunch.org. The second, “3car”, models three cars traveling close together: the first car impacts a rigid wall, the second car impacts the first, and the third car impacts the second. The model has 785,022 elements.

Table Three presents the results for the two LS-Dyna problems using GigE and Infiniband. In all cases version mpp971.7600 of LS-Dyna was used. The results are actual run times, so the smaller the numbers, the faster the code runs.

Table Three – LS-Dyna Example Problems: Results

| Problem      | Number of Cores | Infiniband | GigE   | % Improvement by IB |
|--------------|-----------------|------------|--------|---------------------|
| neon_refined | 4               | 2,886      | 3,276  | 11.90%              |
| neon_refined | 8               | 1,662      | 1,776  | 6.42%               |
| neon_refined | 16              | 894        | 1,090  | 17.98%              |
| 3car         | 4               | 42,656     | NA     | NA                  |
| 3car         | 8               | 16,600     | 23,705 | 29.96%              |
| 3car         | 16              | 11,350     | 13,866 | 18.14%              |

Notice that there is some improvement for IB over GigE. For the larger problem (3car) at eight cores, IB is about 30% faster than GigE. But recall that with this configuration, eight cores is only two nodes, so there isn’t that much MPI traffic on the network (the rest of the MPI traffic is handled via shared memory). For the same problem, as the number of cores increases, the gap between IB and GigE actually narrows.
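Since Table Three reports run times rather than ratings, the improvement column is computed from the time saved relative to GigE. A small sketch of that calculation (lower time is better):

```python
def runtime_improvement(ib_time, gige_time):
    """Percent improvement of IB over GigE for run times:
    the fraction of the GigE time that IB saves."""
    return (gige_time - ib_time) / gige_time * 100.0

# neon_refined at 4 and 16 cores, from Table Three
print(round(runtime_improvement(2886, 3276), 2))  # 11.9
print(round(runtime_improvement(894, 1090), 2))   # 17.98
```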

IB Doesn’t Look that Great – Why Buy It?

At this point you are probably saying (or at least thinking), “It doesn’t look like IB gives you a huge boost in performance for these two codes with these data sets, so why should I buy Infiniband, particularly when it’s not cheap?” That’s an excellent question, given that IB has 12-50 times lower latency than GigE, and eight times (SDR) to 13 times (DDR) the bandwidth. Also, the N/2 for IB is 16 times better than GigE’s. So one would assume that IB should handily beat GigE in terms of performance, or at the very least start to pull away from GigE as the number of cores increases. Similar performance can be expected from 10GigE as well. Recall that the per node costs of IB were $961-$1,482 (not exactly dirt cheap), while GigE is present on most motherboards and its per node cost is quite a bit less ($258-$944). In spite of the cost differences, let me explain why many people are choosing to buy Infiniband and 10GigE.

NIC Contention

Let’s review the hardware configurations for each application. In the case of Fluent, there are only two cores per node; that is, two cores sharing one GigE NIC or one IB HCA. In the case of LS-Dyna, there are four cores sharing one GigE NIC or one IB HCA. In both cases, all of the cores compete for access to the interconnect (usually one interface per node). Because many codes run one MPI process per core, each core is likely to try to communicate at almost exactly the same time. So the application running on each core is moving data, both sending and receiving, over the same interface. This isn’t too important for IB, because it can send and receive data at a very high rate with low latency (i.e., it is quick to get data on and off the network). But it can matter a great deal for GigE because, while some configurations might have decent latency, it has limited bandwidth.

In the case of the LS-Dyna hardware, there are four cores per node (dual-socket, dual-core), so the contention for the single GigE NIC has gone up. Quad-socket dual-core and dual-socket quad-core solutions (eight cores per node) are also available today, and soon we will have 16 cores per node. Let’s consider how this will affect our interconnect.

In the case of server motherboards, we started with two cores per GigE NIC, then moved to dual-core CPUs (four cores per GigE NIC), and are quickly moving to quad-core CPUs (eight cores per GigE NIC). Soon we will have eight-core CPUs, which will give us 16 cores per GigE NIC; with quad-socket boards, we could have up to 32 cores per node. During this rise in the number of cores, the performance of GigE has stayed exactly the same. Table Four shows the impact that increasing the number of cores has on per-core GigE bandwidth.

Table Four – Per Core Bandwidth with a Single GigE NIC

| Number of Cores | GigE Bandwidth (MB/s) | Per-Core Bandwidth (MB/s) |
|-----------------|-----------------------|---------------------------|
| 2               | 120                   | 60                        |
| 4               | 120                   | 30                        |
| 8               | 120                   | 15                        |
| 16              | 120                   | 7.5                       |
| 32              | 120                   | 3.75                      |

So even for today’s nodes, each core gets only a small fraction of the bandwidth of the GigE NIC (15-30 MB/s). If the cores try to communicate (send/receive) at the same time, each gets only this very small share. Quad-core is already here, so you are getting from 30 MB/s (single-socket) down to 7.5 MB/s (quad-socket) per core. In the quad-socket case, the per-core bandwidth is less than Fast Ethernet! This trend kind of feels like we are going backwards, doesn’t it? The only way out is a faster interconnect, which is why Infiniband and 10GigE are the preferred interconnects on newer multi-core systems.
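Table Four’s arithmetic is simple enough to script. A short sketch, using the article’s 120 MB/s figure for achievable GigE throughput:

```python
GIGE_BANDWIDTH_MB_S = 120  # achievable GigE throughput, from the article

def per_core_bandwidth(cores, nic_bandwidth=GIGE_BANDWIDTH_MB_S):
    """Bandwidth each core gets when all cores share one NIC."""
    return nic_bandwidth / cores

for cores in (2, 4, 8, 16, 32):
    print(f"{cores:2d} cores: {per_core_bandwidth(cores):6.2f} MB/s per core")
```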

ISV Application Costs

The second reason people are moving to IB (and 10GigE) from GigE is the cost of ISV applications. The per-core cost of an ISV application such as Fluent or LS-Dyna is much higher than you might think. For example, a dual-socket, dual-core node with a fair amount of memory will cost $5K-$8K. This node has four cores, which works out to about $1,250-$2K per core. Most ISV applications charge per core, and the cost per core can be two to eight times the hardware cost! So if a node costs you $5K, the ISV applications can easily cost $10K-$40K per node! In addition, there are yearly support and upgrade fees. If we assume that support costs are 10% of the application purchase price per year, then over a three-year period you pay $5K for the hardware and $13K-$52K per node for the application and support. This means that a single application is 2.6-10.4 times the cost of the hardware!

Because the hardware is a small part of the overall cost of the hardware plus applications, it seems logical that adding some hardware to a node to improve performance might be a winner in terms of price/performance. This is where IB (and 10GigE) come into play. Let’s do a “what if” and determine whether it’s worth including IB in a smaller cluster.

Let’s assume that we have 24 nodes with four cores per node. Let’s also assume the per node cost is $8K. So the total system cost so far is $192,000 with a basic GigE network. Let’s also choose one ISV application that costs $20K per node ($5K per core). This adds $480,000 to the system total which is now $672,000 (note that the ISV software costs are more than the hardware and I deliberately chose expensive per node prices).

If we add IB to the hardware at about $1,000 per node (a little high), the hardware cost jumps to $216,000 and the total system cost is now $696,000. Let’s also assume that the performance of the application jumps by 20% when we switch to IB. So we add $24,000 to the system cost for 20% better performance. Table Five summarizes these numbers and looks at the price/performance for two configurations: one with only GigE and one with IB. I normalized the performance of the GigE system to 1.0, so the IB performance is 1.2. Both configurations include the cost of the ISV application.

Table Five – Price/Performance Comparison with GigE and IB

| Item                 | Per Node Cost | Total    |
|----------------------|---------------|----------|
| 24 nodes (with GigE) | $8,000        | $192,000 |
| 24 ports of IB       | $1,000        | $24,000  |
| ISV Application      | $20,000       | $480,000 |

| System          | Total Cost | Normalized Performance | Total Normalized Cost (price/performance) |
|-----------------|------------|------------------------|-------------------------------------------|
| System 1 – GigE | $672,000   | 1.0                    | $672,000                                  |
| System 2 – IB   | $696,000   | 1.2                    | $580,000                                  |

From the above table, the normalized cost for the GigE system is $672,000 while for the IB system it is $580,000, giving the IB system about 14% better price/performance. So even though the network is more expensive, it allows the hardware to be used more efficiently.
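The Table Five arithmetic can be reproduced in a few lines. All of the inputs below are the article’s assumed numbers, not vendor quotes:

```python
# "What if" scenario from the article: 24 nodes, four cores per node
nodes = 24
node_cost = 8_000    # per-node hardware cost (GigE on the motherboard)
ib_cost = 1_000      # assumed per-node cost of adding IB
isv_cost = 20_000    # per-node ISV application cost ($5K per core)
ib_speedup = 1.2     # assumed 20% performance gain with IB

gige_total = nodes * (node_cost + isv_cost)          # hardware + application
ib_total = nodes * (node_cost + ib_cost + isv_cost)  # hardware + IB + application

# Normalized cost = total system cost / normalized performance
gige_norm_cost = gige_total / 1.0
ib_norm_cost = round(ib_total / ib_speedup)

print(gige_total, ib_total)          # 672000 696000
print(gige_norm_cost, ib_norm_cost)  # 672000.0 580000
```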

What do You do With Leftover Bandwidth?

One final question you may ask: “So I paid extra to get IB for a better price/performance system, but now I have this leftover bandwidth. What do I do with it?” I think the answer is “make a stew”, but then I’ve been watching The Food Network too much. As I mentioned before, most applications don’t use the full bandwidth. With 10GigE or SDR IB, 10 Gbps is plenty for almost all applications, and in many cases more than 50% of the bandwidth is left over to do something with. Hmm….

The obvious answer is to use it for a file system; even better, for a parallel file system. File systems, especially parallel ones, can use lots of bandwidth moving data between the nodes and the storage. In the past, people have dedicated an extra network just for storage. This makes a lot of sense with GigE because it has such limited bandwidth: one or two GigE networks in the cluster for computational (MPI) traffic and one or two GigE networks for storage traffic. With this design, each node could theoretically move 120-240 MB/s (this is just the client side, not the server side). That rate may have been enough a while ago, but some applications now require much more IO bandwidth.

So users were faced with two problems: (1) they needed more computational bandwidth, and (2) they needed more IO bandwidth. Because IB is such a price/performance winner for systems that run ISV applications, and there is often leftover bandwidth, why not use it for IO?

For example, in the case of DDR IB, one-fourth of the bandwidth could be used for computation (about 5 Gbps), which is likely to be more than enough, leaving the rest (about 15 Gbps) for IO, which should satisfy most applications. This creates a win-win scenario for cluster designers: you put IB or 10GigE in the system because it’s a better price/performance solution than GigE, and then you use the extra bandwidth for IO (most likely a parallel file system). This design has the added benefit of reducing the cabling to a single cable (for the most part). We’ve gone from two to four cables per node down to one. Not a bad trade-off.
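As a rough sketch of that split, assuming DDR IB’s 20 Gbps signaling rate (the data rate is 16 Gbps):

```python
DDR_IB_GBPS = 20.0  # DDR InfiniBand signaling rate

mpi_share = DDR_IB_GBPS / 4          # one-fourth reserved for computation (MPI)
io_share = DDR_IB_GBPS - mpi_share   # three-fourths left over for IO / file system

print(mpi_share, io_share)  # 5.0 15.0
```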

Summary

A recent study published by IDC predicted that Infiniband HCA revenue will grow at a compound annual growth rate of 29.3%, from $62.3 million in 2006 to $224.7 million in 2011. On the switch side, IDC predicted growth of 45.2%, from $94.9 million in 2006 to more than $612 million by 2011. Equally important, IDC predicts that Infiniband will take over the I/O side of HPC.

I hope this has demonstrated that GigE is still a good interconnect for the applications examined here. But as CPUs gain more cores at an accelerated rate, the bandwidth each core receives keeps decreasing. When we hit eight or 16 cores per node, a single GigE NIC will just not cut it; Infiniband and 10GigE will be a necessity. The second reason people are moving to Infiniband or 10GigE is that they are often price/performance winners once the added costs of ISV applications are included.

Plus, when you add IB to a system, you will have leftover bandwidth you can use. Many people apply this leftover bandwidth to a file system (usually a parallel one). Based on these three observations, you can see why IB is such a fast-growing HPC interconnect, with 10GigE not far behind. Enjoy your bandwidth.
