Experiments With Switchless 10GigE (GbE): The Bonded Loop

Note: Although this hack seems to work for point to point communications, when used
with many simultaneous messages, like MPI program, there are some stalls that reduce
performance. NAS benchmarks and HPL results for other similar switch-less designs will be posted soon.
Good news, for small numbers of nodes performance is quite good.

Modern Ethernet technology is based on network adapters and switches. Using Ethernet without a switch only happens in rare situations where a small number (e.g. two) systems need to be directly connected together. Such a connection is often called a "cross over" connection because a special cable may be needed.

The cost of adapters and switches follow a very predictable commodity pricing trend. At first the cost of systems is quite high and decreases as the sales volumes increase. Currently Gigabit Ethernet (GigE or GbE) enjoys low cost and wide availability from multiple vendors. Ten Gigabit Ethernet (10GigE or 10GbE) is now experiencing greater acceptance and thus decreased costs. Although volumes are growing, 10GigE still commands a high per port price (Adapter/Switch) and thus can be an expensive option for many small projects.

There are areas where small scale 10GigE networks can be useful. The Limulus Project has produced a true desk-side HPC cluster workstation that maximizes performance, heat, noise, and power to deliver 200+ CPU-GFLOPS. The current version uses low cost GigE and a small embedded switch to connect the four single processor compute nodes. As the nodes gain more cores and become faster, the need for a faster interconnect is desirable. Placing InfiniBand or 10GigE inside such system presents a problem due to the size and the high cost of even the smallest switches.

A solution that is commodity based (i.e. low cost), requires no switch, and fits within
the power/heat envelope has been elusive. The price reductions in dual port 10GigE cards, mentioned above, does offer an incentive to incorporate 10GigE in such a system. While there are several methodologies available, the lowest cost option consists of four two-port 10GigE adapters without a switch. This four node design can be pictured in Figure One below.

Figure One: A Unworkable Ethernet Loop

Using the Linux Ethernet Bridging capability, it is possible to turn each node into a "mini switch" forwarding any packets that are not intended for the node. This ring creates two 1-hop routes and four 0-hop routes. The problem with this design is that Ethernet does not tolerate loops. Thus, a fully connected system will not work. To solve the Ethernet loop problem, the Spanning Tree Protocol (STP) can be used on one of the nodes. Figure Two shows the result of turning STP on one for one of the nodes.

Figure Two: An Ethernet Loop with STP

While the Ethernet network will now route packets, the use of STP effectively cuts one of the links between nodes and introduces one 2-hop route, two 1-hop routes, and three 0-hop routes. The 2-hop route could cause a bottleneck and negate some of the advantages offered by adding 10GigE technology.

An interesting alternative to STP is to remove the bridging from one node and create a "bonded node" where the two 10GigE ports act like one interface. To the other nodes, the bonded node looks like an endpoint and breaks the Ethernet loop (i.e. the bonded interface does not forward packets) When bonded in mode 0, the interface can receive packets from either port and when transmitting, will alternate packets between the two ports (i.e. "round robin").
This situation is shown in Figure Three

Figure Three: A Bonded Ethernet Loop

The bonded loop removes the 2-hop issue created by STP and "effectively" introduces four 1-hop routes and two 0-hop routes. The word "effectively" is important because there can still be a 2 hop path, but for only half the data, the other half ends up taking a 0-hop route due to the round robin mode. Thus, the round robin path averages out to a 1-hop path. The tests below confirm this conclusion.

Test System

The head node is Sandy Bridge, i5-2400S with 4 GB RAM and the worker nodes are single dual core E6550, with 4 GB RAM. The 10Gig Adapters are Chelsio T420-SO-CR. All adapters are connected by SFP+ cable. (Thanks to Chelsio for donating the adapters).

In order to get a good test of the low level system capabilities, Open-MX was used as a transport layer. Open-MX is a freely available version of the Myricom MX library for user space communication between nodes. Both Open-MPI and MPICH2 support the Open-MX transport layer. Note, the low cost 10GigE Adapters often do not support advanced features like iWarp and Open-MX is good solution to achieve "iWarp performance" from low cost 10GigE adapters.

Also note, these tests have not been optimized. A standard 1500 Byte MTU was used for Open-MX and the node hardware is older CPU technology. Further tuning and newer hardware is expected to improve the overall performance.

Results

The omx_perf test program was used to measure the performance over all six possible routes. The best case is where a 0-hop communication occurs (headnode-n0, headnode-n2). The worst case includes any route with n1 because data will be sent from either port,thus creating an effective 1-hop path. (i.e. with n1-n0 or n1-n2 half the data is 0-hop and the other is 2-hop, n1-headnode is always two 1-hop paths, n0-n2 is a single 1-hop).

Figure Four shows the standard throughput vs block size for best and worst cases. The single Byte latency is 6.5 µs for the best case and 14.2 µs for the worst case (as expected the extra hop doubles the latency.)

Figure Four: Throughput vs Blocksize

The network signature graph is shown below in Figure Five. (This style of graph was popularized by the NetPipe benchmark).

Figure Five: Network signature graph (Throughput vs Latency)

Raw data for both cases can be found below.

Conclusion

As expected, the results show impressive performance for the 0-hop cases. The effective 1-hop case has doubled the latency, however, it is still much better than the best GigE performance using Open_MX. The throughput for 0-hop is close to wire speed, while the 1-hop case is noticeably less. Again, while not optimal the 1-hop case is much better than standard GigE. In addition, without the need to buy a 10GigE switch, the network performance gain over standard GigE for approximately $1200 is hard to argue. Further high level MPI benchmarks are planed.

Finally, it is possible to create a slightly higher cost star topology using a quad-port 10GigE adapter and single port adapters on the worker nodes. In this case, all communication from the hostnode to worker nodes would be 0-hops, but all worker to worker nodes would be 1 hop (i.e. three 0-hop and three 1-hop vs two 0-hop and four effective 1-hop for the ring topology described herein). This topology will be tested in the future as well.

Ultimately, how well the ring or star topology work depend on the application mix used on such a small system. At sixteen total cores, the Limulus personal cluster has enough processing power to do real work and also act as an HPC sandbox for education and development. The inclusion of high speed low latency 10-GigE is a welcome addition to this technology.