10GbE is Ready for Your Cluster. Or is it?

Linux Magazine has an article by Dan Tuchler detailing why he thinks 10-gigabit Ethernet should be more widely considered for vanilla HPC cluster installations. Considering that the vast majority of cluster installations fall outside the realm of the Top500 list, many of us tend to forget that the average HPC user doesn’t have terabits of interconnect bandwidth. They’re simply using gigabit Ethernet. Tuchler argues that this high comfort level with Ethernet technologies, coupled with the sinking cost of 10GbE, makes the technology ripe as an interconnect platform.

As a widely-used standard, Ethernet is a known environment for IT executives, network administrators, server vendors, and managed service providers around the world. They have the tools to manage it and the knowledge to maintain it. Broad vendor support is also a plus – almost all vendors support Ethernet.

I somewhat agree with Tuchler’s point of view. Five years ago, 10GbE prices were so far out in the stratosphere that you would rarely have the funds to purchase a switch. The prices *are* finally coming down to reasonable levels. However, so are the prices of other common cluster interconnects such as Myrinet and InfiniBand. Tuchler quotes $500 per port for 10GbE, which is very close to the current InfiniBand cost basis. So why go 10GbE when you can buy InfiniBand with native RDMA capabilities and an integrated IP stack? [This is really a question, folks; I’m not being sarcastic.]

Feel free to leave your comments on this one. I’m interested to hear what the audience feels about this debate. For more info, read Dan’s article here.

In short, DDR IB seems to come in slightly less than 10GE per port, at twice the data rate, while SDR IB is less than half the cost for the same data rate. I suspect the per-packet CPU utilization is also better on the IB side, though the software complexity is greater.

Disclaimer: I work for an HPC vendor (SiCortex) which has little interest in 10GE/MX/IB as a cluster interconnect since we have our own built in.

Thanks to the Jeffs for a great series of comments! Like I said, it’s been several years since I had the pleasure of quantifying the costs of various interconnect technologies [when I did this, 10GbE was $1200+ per port].

The IB switch pricing for 24 port DDR switches is now sitting around $4k +/- some. So it is roughly (today) $167/port. Add in a 2m CX4 cable at $60-ish, and a DDR NIC at $500-ish, and connecting 1 server to one port on a 24 port switch will run you under $750/port in total.

Do the same analysis for 10 GbE. NICs about $500-ish, same CX4 cable. But the switches are still not cheap. The best price we have seen/heard anywhere per port (not CX4, so you have added transceiver costs, not a wise move IMO) is about $750. Rumors abound of $500/port switches somewhere.
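The per-port arithmetic in the two comparisons above can be written out as a quick sketch. All figures are this thread’s rough ballpark estimates, not vendor quotes:

```python
# Rough per-node attach cost = switch port share + cable + NIC.
# Prices are the thread's ballpark figures; treat them as estimates.

def cost_per_node(switch_port, cable, nic):
    """Total cost to attach one server to one switch port."""
    return switch_port + cable + nic

# DDR IB: ~$4,000 for a 24-port switch, ~$60 CX4 cable, ~$500 HCA
ib = cost_per_node(4000 / 24, 60, 500)

# 10 GbE: best-seen ~$750/port switch, same cable, ~$500 NIC
tengbe = cost_per_node(750, 60, 500)

print(round(ib))      # ~727, i.e. "under $750/port in total"
print(round(tengbe))  # ~1310
```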

So from a pure price play, 10 GbE still costs more, and will until inexpensive switches start coming out. Once that happens, we would expect that they would start taking over for IB. Until that happens, I don’t expect to see much change.

The 10 GbE stack is much easier to deal with than IB. Building OFED has been, up until very recently, a crap-shoot on anything but a small range of specific distro kernels. This was an unfortunate outgrowth of how OFED developed, but the situation has been improving. Unfortunately, you won’t get good things like NFS-over-RDMA without the modern kernels, which are not officially supported by OFED (c.f. http://www.openfabrics.org/downloads/OFED/ofed-1.4/OFED-1.4-docs/README.txt).

What we have found in testing is that single-thread 10 GbE performance is OK, though multi-thread is quite good. Latency in our testing is comparable with normal IB latencies. But it is Ethernet, so NFS just works, without any appeal to RDMA to get it going. And TCP/IP just works, and works well, again without suffering the (major) performance degradation of doing IP over IB.

But is all this worth the significantly higher price?

That decision we must leave to the consumer. 10 GbE price is a problem for HPCC systems, and hopefully someone is working on a way to lower the costs to something reasonable. The price of IB is reasonable, and without a good reason to switch, a switch likely won’t happen (the stack pain is annoying, but livable, and the entire support process is automatable).

Just my thoughts. We support everything. The Delta-V we showed in Pervasive Software’s booth was running iSCSI over 10 GbE as a target. We got a sustained 500 MB/s and 1,800 IOPS out of it for their use. It works, it just costs more.

Hmmm …. I must be missing something here. $167/port (IB) is higher than $400/port (10GbE)? Motherboards also have IB HCAs (we are working on bids with units like this now). 10 GbE on motherboards is relatively recent as compared to IB on motherboards.

Also, the Arista switches need a transceiver, which the on-board motherboard NICs (10 GbE/IB HCA) don’t, nor do the CX4-based IB switches. The SFPs do add to the cost per port.

Again, transceivers aren’t needed on IB. So the cost is higher. If I am wrong, please, by all means, show the analysis.

Yeah, the cost is still an issue for some, and this is an interesting take on alternatives. My company actually has a product for expanding ports on testing equipment only (SPANs and TAPs). Seems we need a similar product for throughput traffic though, and at a lower cost than these.

1. 10GbE and SDR IB are *NOT* the same data rate! This is a common marketing misconception. With 10GbE, you can actually push darn close to a rate of 10Gb on the wire for large messages. IB uses 8b/10b encoding, so you automatically lose 20% of the bits on the wire to protocol overhead — you’re down to 8Gb. Similarly, DDR is really only 16Gb of delivered data performance; QDR is really 32Gb. I believe that this point was also made in the original article.

2. RDMA actually gets you very little in terms of MPI (regardless of whether it’s IB or iWARP or …). What an MPI implementation really wants is hardware offload/assistance for message-passing progress, particularly of large messages. If that hardware assistance comes in the form of RDMA, ok, fine. But to be blunt, MPI’s semantics are better matched to other forms of hardware offload.

3. Indeed, with today’s OpenFabrics MPI implementations (including Open MPI), RDMA’ing an entire large message all at once can be quite expensive in terms of resource usage. Open MPI is capable of sending large messages either as a single large RDMA or as a number of smaller sends and/or RDMAs. Which way works best for you is likely application-specific: it depends on factors such as (but not limited to):

– how much registered memory your application is using in other pending communications
– what the frequency of your communication is
– how many peers you’re sending to
– what communication/computation overlap you need
– how often you invoke MPI functions that trip the internal progression engine
– …etc.

So don’t get hung up on specific technologies like RDMA. RDMA is not the be-all/end-all technology for HPC. In some cases, it’s not even a very good technology (!). Hardware offload is what is key (IMNSHO), and there are many different flavors to choose from.

But at the end of the day, what you want to know is what will perform well *for your application*. For example, here’s a very, very coarse-grained set of questions that may start you down an analysis path for your needs: for your application(s)…

…and be sure to multiply that out if you plan to put more than one active network port in each server — especially as core counts go up! It’s insane (IMNSHO) to have one network port for 16 cores and assume that you won’t drop off overall network performance when all 16 MPI processes (or even 8… or possibly even 4!) are simultaneously pushing either large messages or large numbers of small messages. Also make sure you get the math and server topology right such that you can actually push (N x one_port_bandwidth) with the desired latency into your fabric, yadda yadda yadda…
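The "multiply it out" advice above is simple division; here is a hedged back-of-envelope sketch with illustrative numbers (the rank counts are just examples):

```python
# Fair-share bandwidth per MPI rank when all ranks on a node push
# traffic through the same NIC(s) at once.

def per_rank_gbps(port_gbps, ranks_per_node, ports=1):
    """Each rank's fair share of node network bandwidth, in Gb/s."""
    return port_gbps * ports / ranks_per_node

print(per_rank_gbps(10, 16))           # one 10Gb port, 16 ranks -> 0.625 each
print(per_rank_gbps(10, 4))            # one 10Gb port,  4 ranks -> 2.5 each
print(per_rank_gbps(10, 16, ports=2))  # two ports, 16 ranks     -> 1.25 each
```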

My point of this long ramble: it’s not about RDMA. It’s not even [entirely] about price. Look into exactly what you’re going to use your HPC resources for — what problems are you going to solve and how *EXACTLY* you are going to solve them. Which MPI will you use? What application(s)? What communication pattern(s) do they use? What network topology fits that? Do you *need* low latency? Do you *need* high bandwidth? In short: find the best technologies that fit your needs, not the coolest/hottest/bigger-than-your-rival’s technologies. Spend a little time on a quantitative analysis of your needs; you’ll save lots of money over the long run because you’ll get a solution that works best for exactly what you’re trying to do.

GAMMA is neat, but the best-kept secret in HPC is Open-MX (www.open-mx.org). It’s a software implementation of the MX protocol that Myricom’s hardware NICs speak — it uses the Linux Ethernet driver, so it works with whatever Ethernet NIC you have (1Gb or 10Gb).

Specifically, MX is just frames over ethernet, regardless of whether they are being pushed via software or hardware. In a datacenter (i.e., HPC cluster), frames over ethernet is all you need — you don’t need the huge/complex TCP stack (or other network stacks).

I *STRONGLY* encourage everyone to give Open-MX a whirl; let’s get the bugs shaken out and get people using it. I saw some *very* promising latency numbers out of Open-MX and “reasonable” 10Gb NICs (I’m not going to quote numbers because I’m a vendor and I don’t want my ran-it-in-the-lab numbers to be taken authoritatively); I’ve even heard anecdotal stories of real-world MPI apps getting nice speedup over *1* (yes, *one*) GbE!

Note that Open-MX and MX are both API- and wire-line compatible, so you can have Open-MX on one side and MX on the other (nifty!). Therefore, Open MPI natively supports Open-MX because — well, it’s just MX, and we’ve supported that for a long time.

Arista Networks switches can be used with SFP+ 1X twinax copper cables, which are less expensive than Infiniband 4X CX4 cables. The SFP+ twinax copper cables have an SFP+ connector on each end, so no additional transceiver is required.

(1) I think Jeff Squyres makes some great points. It’s not always about price; it’s about performance and what you are doing with the fabric. There are lots of things that go into a good solution.

(2) Pricing out small switches for 10GigE shows that 10GigE is approaching IB in price. Where it gets really fun, and Joe alluded to this, is when you start talking about larger fabrics. In my experience, when you get to larger fabrics, the price per port goes up faster than the port count (i.e., it gets pretty darn expensive). Plus, if you’re running TCP, you have to start worrying about spanning-tree latencies if you go for multi-tier switching (I know Woven says they have a solution for this, but to be honest I don’t know much about it; I think there are others that claim to have fixed this problem — I just haven’t seen much on this yet). So building multi-tier TCP fabrics is not really pretty from a micro-benchmark perspective.

(3) One other quick point – the GAMMA charts are from Doug Eadline. He and I were working on a project over at ClusterMonkey (shameless plug) and Doug was testing GAMMA. The results are pretty cool (IMHO).

Doug is now testing Open-MX. I agree with Jeff S. that Open-MX is pretty nifty and an under-appreciated possibility for people. GAMMA doesn’t allow you to mix TCP and GAMMA traffic on the same port; however, Open-MX does allow you to mix traffic. Doug is still testing, and there were a few little weird things happening in performance testing, but I think Doug has most of those ironed out. He’s working on an article for ClusterMonkey in the near future to present his results.

(4) I personally think using something other than TCP gives Ethernet some new life. Open-MX or GAMMA over GigE allows applications to scale a bit further and run faster. Running non-TCP over 10GigE is also something to seriously consider. There are still some issues about fabric configuration, but you can drop the latency for 10GigE to some pretty low levels.

(5) IB is still running strong within HPC, even for smaller systems. SDR, while not 10Gb/s (it’s 8Gb/s), is priced pretty low and is very attractive for smaller systems. DDR (at 16Gb/s, as Jeff reminded all of us) might perhaps come down in price with QDR (32Gb/s, as Jeff pointed out) coming into the market more and more now. Pretty amazing performance with IB.

(6) IB is also great for smaller systems with ScaleMP. With ScaleMP you can take the cluster nodes, connected with IB, and they appear to the OS as one large SMP system. You don’t have to install IB drivers or anything like that — ScaleMP takes care of that for you. You just run your MPI codes with something like shmem as the device (no need for IB) and they run just fine. Pretty cool stuff.

There’s some really great information on this thread – I’m enjoying the exchange, and learning a lot.

Just to clarify on the cost topic – we see several major server vendors beginning to incorporate 10 Gig Ethernet chips on the servers. Whether this makes them “free” or not is a matter of opinion, but certainly this will help drive 10 GE chip volumes up and push the costs way down, as happened with 1GE. The cable, whether CX4 or passive SFP+, is in the range of $50 +/- and coming down. And switches now list for under $500 a port: my company, BLADE Network Technologies, makes both blade-server resident and Top of Rack switches that list for under $500 a port (yes, that’s why we are so interested in observing 10GE adoption). Certainly there are cases for other interconnects as well – it’s good that users have choices.

