25GbE

As technology marches forward new challenges arise that were not previously an issue. Consider as mankind moved from walking to horseback we cleared trails where there was once brush covered paths. As we transitioned from horseback to carriages those paths needed to become dirt roads, and the carriages added suspension systems. With the move from carriages to automobiles, we further smoothed the surface traveled by adding gravel. As the automobiles moved faster, we added an adhesive to the gravel creating paved roads. With the introduction of highways, we required engineered roads with multi-layered surfaces. Each generation reduced the variability in the road surface by utilizing new techniques that enabled greater speed and performance. The same holds true for computer networks.

Over the past three decades as we transitioned from 10Mbps to 25Gbps Ethernet we’ve required many innovations to support these greater speeds. The latest of these being Forward Error Correction (FEC). The intent of FEC is to reduce the bit error rate (BER) as the cable length increases. In 2017 we saw the ratification of the IEEE 25GbE specification which provides two unique methods of FEC. There is BASE-R FEC (also known as Firecode) and RS-FEC (known also as Reed Solomon). Both of these FEC algorithms introduce additional network latency as the signal is decoded, BASE-R is about 80 nanoseconds while RS-FEC is about 250 nanoseconds. The complexities don’t end here though, it turns out there are three different Direct Attach (DA) cable types with varying levels of quality, from good, to best we have:

CA-25G-L: up to 5m, requires RS-FEC

CA-25G-S: up to 3m, lower loss, requires either RS-FEC or BASE-R FEC

CA-25G-N: up to 3m, even lower loss, can work with RS-FEC, BASE-R FEC, or no FEC

But wait there’s more, if you order now we’ll throw in auto-negotiation (AN) and link training (LT) as both are required by the 25GbE IEEE standard (10GbE didn’t need these tricks). So what does AN actually negotiate? Two things, link speed and which type, if any, FEC will be utilized. It should be noted that existing 25GbE NICs that have been on the market likely only support one type of FEC. As for LT, it helps to improve the quality of the 25GbE link itself. It turns out though that the current generation of 25GbE switches came out prior to AN being worked out so support is at best poor to mixed. Often manual switch and adapter configuration are required. Oh, and did I mention that optical modules don’t support AN/LT? Well, they don’t, but some will support short links with no FEC.

So where does this leave people who want to deploy 25GbE? You need to be careful that both your network switch and server NICs will work well together. We strongly advise that you do a proof of concept prior to a full deployment. Not all 25G server NICs do both AN/LT because their chips (ASICs) were designed and fabricated prior to the completion of the IEEE specification for 25GbE last year. Solarflare’s 25GbE X2522 server NICs which debut next month include support for all the above, in fact, when initially powered up they will begin by:

First looking at cable, is it SFP or SFP28?

If it’s SFP28 it will attempt AN/LT, then 25G no AN/LT, then 10G

If it’s a 25G link, then it will try and detect which FEC is being used by the switch

Additionally, the server administrator can manually override the defaults and select AN/LT and the FEC type and setting (auto, on, off).

I grew up in New York, and remember listening to Sy Sims on TV say “an educated consumer is our best customer…”

P.S. I’d like to give a special thanks to Martin Porter, Solarflare’s VP of Engineering, for pulling all this together into a few slides.

So Mellanox’s Connect-X 4 line of adapters are hitting the street, and as always tall tales are being told or rather blogged about concerning the amazing performance of these adapters. As is Mellanox’s strategy they intentionally position Infiniband’s numbers to imply that they are the same on Ethernet, which they’re not. Claims of 700 nanoseconds latency, 100Gbps & 150M messages per second. Wow, a triple threat low latency, high bandwidth, and an awesome message rate. So where does this come from? How about the second paragraph of Mellanox’s own press release for this new product: “Mellanox’s ConnectX-4 VPI adapter delivers 10, 20, 25, 40, 50, 56 and 100Gb/s throughput supporting both the InfiniBand and the Ethernet standard protocols, and the flexibility to connect any CPU architecture – x86, GPU, POWER, ARM, FPGA and more. With world-class performance at 150 million messages per second, a latency of 0.7usec, and smart acceleration engines such as RDMA, GPUDirect, and SR-IOV, ConnectX-4 will enable the most efficient compute and storage platforms.” It’s easy to understand how one might actually think that all the above numbers also pertain to Ethernet, and by extension UDP & TCP. Nothing could be further from the truth.

From Mellanox’s own website on February 14, 2015: “Mellanox MTNIC Ethernet driver support for Linux, Microsoft Windows, and VMware ESXi are based on the ConnectX® EN 10GbE and 40GbE NIC only.” So clearly all the above numbers are INFINIBAND ONLY, today three months after the above press release still the fastest Ethernet Mellanox supports is 40GbE, and this is done with their own standard OS driver only. This by design will always limit things like packet rate to 3-4Mpps, and latency to somewhere around 10,000 nanoseconds, not 700. Bandwidth could be directly OS limited, but I’ve yet to see that so on these 100Gbps adapters Mellanox might support something approaching 40Gbps/port.

So let’s imagine that someday in the distant future the gang at Mellanox delivers an OS-bypass driver for the Connect-X 4 and that it does support 100Gbps. What we’ll see is that like the prior versions of Connect-X, this is Mellanox’s answer to doing both Infiniband & Ethernet on the same adapter, a trick they picked up from now defunct Myricom who achieved this back in 2005 delivering both Myrinet & 10G Ethernet on the same Layer-1 media. This trick allows Mellanox to ship a single adapter that can be used with two totally different driver stacks to deliver Infiniband traffic over an Infiniband hardware fabric or Ethernet over traditional switches directly to applications or the OS kernel. This simplifies things for Mellanox, OEMs, and distributors, but not for customers.

Suppose I told you I had a car that could reach 330MPH in 1,000 feet, pretty impressive. Would you expect that same car to work on the highway, probably not, how about on a NASCAR track? No, because those that really know auto racing immediately realize I’m talking about a beast that burns five gallons of Nitromethane in four seconds, yes a 0.04MPG, top-fuel dragster. This class of racing is analogous to High-Performance Computing (HPC), where Infiniband is king and the problem domain is extremely well defined. In HPC we measure latency using zero byte packets and often attach adapters back to back without a switch to measure percieved network system latency. So while 700 nanoseconds of latency sounds impressive it should be noted that no end user data is passed during this test at this speed, just empty packets to prove the performance of the transport layer. In production, you can’t actually use zero byte packets because they’re simply the digital equivalent of sealed empty envelopes. Also to see this 700 nanoseconds you’ll need to be running Infiniband on both ends, along with an Infiniband supported driver stack that bypasses the operating system, note this DOES NOT support traditional UDP or TCP communications. Also to get anything near 700 nanoseconds you have to be using Infiniband RDMA functions, back to back between two systems without a network switch, and with no real data transferred, it is a synthetic measurement of the fabric’s performance.

The world of performance Ethernet is more like NASCAR, where cars typically do 200MPH and run races measured in the hundreds of miles around closed loop tracks. Here the cars have to shift gears, brake, run for extended periods of time, refuel, handle rapid tire changes, and maintenance during the race, etc… This is not the same as running a top-fuel drag racer once down a straight 1,000-foot track. The problem is Mellanox is notorious for stating their top-fuel dragster Infiniband HPC numbers to potential NASCAR class high-performance ethernet customers, believing many will NEVER know the difference. Several years ago Mellanox had their own high-performance OS-Bypass Ethernet stack that supported UDP & TCP called VMA (Voltaire Messaging Accelerator), but it was so fraught with problems that they spun it off as an open source project in the fall of 2013. They had hoped that the community might fix its problems, but since it’s seen little if any development (15 posts in as many months). So the likelihood you’ll see 700 nanosecond class 1/2 round trip UDP or TCP latency with Mellanox anytime in the near future would be very surprising.

Let’s attack misrepresentation number two, an actual ethernet throughput of 100Gbps. This one is going to be a bit harder without an actual adapter in my hand to test, so just looking at the data sheet, several things do jump out. First ConnectX 4 uses a 16-lane PCIe Gen3 bus which typically should have an effective unidirectional PCIe data throughput of 104Gbps. On the surface, this looks good. There may be an issue under the covers though because when this adapter is plugged into a state of the art Intel Haswell server the PCIe slot maps to a single processor. You can send traffic from this adapter to the other CPU, but it first must go through the CPU it’s connected to. So sticking to one CPU, the best Haswell processor has two 20 lane QPIs with an effective combined unidirectional transfer speed of 25.6GB/sec. Now note that this is all 40 PCIe lanes combined, the ConnectX 4 only has 16 lanes so proportionally about 10.2GB/sec is available, that’s only 82Gbps. Maybe they could sustain 100Gbps, but this number on the surface appears somewhat dubious. These numbers should also limit Infiniband’s top end performance for this adapter.

Finally, we have my favorite misrepresentation, 150M messages per second. Messages is an HPC term and most people that think ethernet will translate this to 150M packets per second. A 10GbE link has a theoretical maximum packet rate of 14.88Mpps. There is no way their ethernet driver for the ConnectX 4 could ever support this packet rate, even if they had a really great OS-bypass driver I’d be highly skeptical. This is analogous to saying you have an adapter capable of providing lossless ethernet packet capture on ten 10GbE (14.88Mpps/link) links at the same time. Nobody today, even the best FPGA NICs that cost 10X this price, will claim this.

Let’s humor Mellanox though, and buy into the fantasy, here is the reality that will creep back in. On Ethernet, we often say the smallest packet is 64 bytes so 150Mpps * 64 bytes/packet * 8 bits/byte is 76.8Gbps, that is less than the 82Gbps we mentioned above so that’s good. There are a number of clever tricks that can be used to bring this many packets into the host CPU into user space while optimizing the use of the PCIe bus, but more often than not these require that the NIC firmware is tuned for packet capture, not generic TCP/UDP traffic flow. Let’s return to the Intel Haswell E5-2699 with 18 cores at 2.3Ghz. Again for performance, we’ll steer all 150Mpps into the single Intel socket supporting this Mellanox adapter. Now for peak performance, we want to ensure that packets are going to extremely quiet cores because we know that both the OS & BIOS settings can create system jitter which kills performance and determinism. So we profile this CPU and find the 15 least busy cores, those with NOTHING going on. Now if we assume Mellanox was to have an OS Bypass UDP/TCP stack that supported a round-robin method for doling out a flood of 64-byte packets this would mean 10Mpps/core or 100 nanoseconds/packet to do something useful with each packet. That’s 250 clock ticks on Intel’s best processor. Unless you’re hand coding in assembler it’s going to be very hard to get that much done.

So when Mellanox begins talking about supporting 25GbE, 50GbE or 100GbE you need only remember one quote from their website “Mellanox MTNIC Ethernet driver support for Linux, Microsoft Windows and VMware ESXi are based on the ConnectX® EN 10GbE and 40GbE NIC only.” So please don’t fall for the low latency, high bandwidth or packet rate Mellanox Ethernet hype, it’s just hog wash.

Update, on March 2, 2015, Mellanox posted an Ethernet only press release that claimed this adapter supported 100GbE, and using the DPDK interface in testing they could achieve 90Gbps with 75Mpps over the 100G link (roughly wire-rate 128 byte packets).

Before diving into each of these options we should set some groundwork. Most performance I/O adapters these days are inserted into the motherboard in a third generation PCI Express (PCIe Gen3) slot that is 8 lanes wide. The theoretical performance of this slot is 64 gigabits/second (Gbps), but after encoding & overhead the effective data rate is more like 52Gbps. Also, it should be noted that on Intel systems PCIe slots have a preference to CPU sockets. So data coming from a PCIe slot that is “wired” to “Socket 0” but is destined for a core on the CPU in “Socket 1” will see a measurable degradation in performance. Most applications will likely not care, but if performance is your specialty you should look into this. You see those bits have to travel a much longer path to reach that distant core. If you’re really interested in achieving the optimum performance you should evenly split your I/O across slots mapped to each CPU socket.

Beyond 10GbE the two currently approved standards which are 40GbE and 100GbE. Many of the NIC companies are already shipping products that support 40GbE, and most of the performance switch vendors support both 40GbE & 100GbE connections. The reluctance of the NIC companies to go beyond 40GbE is bound to the common 8 lane PCIe Gen3 slots that most NIC cards are installed into. As mentioned above the slots these cards go into supports roughly 52Gbps in each direction. So while a dual port 40G NIC can deliver up to 80Gbps by definition, the card can only bring data into the motherboard at 52Gbps so the card by definition is roughly 35% over subscribed. This is why we’re not going to see any 100GbE NICs in existing servers. For 100GbE NIC companies will require a 16-lane PCIe Gen3 slot or a future 8-lane PCIe Gen4 slot, as both should sustain roughly 104Gbps. So you’ll have to wait for Intel’s next tock ( a major step forward) and the delivery of Skylake, the successor to Broadwell, for real 100GbE NIC systems to appear.

So what about 20GbE, is this something to consider? Well, 20GbE is something HP cooked up working with QLogic that they’d delivered as a product for their blade system. It never really gained any traction outside of that platform. Normally 20Gbps is simply achieved by bonding both ports on a dual port 10GbE adapter together. This can be done several ways and is very common place. This will likely go no further as a hardware option.

Now 25GbE is a horse of a different color, and it is seeing some adoption, but mostly at the top of rack switch level. To better understand this 100GbE is actually four 25GbE lanes, so fracturing this into 25GbE is actually somewhat logical. Arista Networks, Google, Microsoft, Broadcom & Mellanox are all working the switch side of this. In September of 2014, Broadcom announced their StrataXGS Tomahawk chip, which supports 128 ports of 25GbE, 50 ports of 50GbE or 32 ports of 100GbE. So these switches are really close, and we may even see them at SC14 this week. In October Emulex joined the 25GbE Consortium so clearly, there will soon be some NICs in this space. At this time no vendors have announced 25GbE NICs.