RoCE Horror, Volume 5: PFC Free

A colleague recently pointed me to “11 Myths of RoCE.” Previously, RoCE’s version proliferation had put me in mind of a certain boxing movie hexology. But this article’s remarkable assertions brought to mind a more cultish classic, the Rocky Horror franchise, which is many things: a parody, a tribute, a stage musical and a movie, often with live performance art. Its characters are never quite what they seem.

Not all the article’s 11 myths seem like myths (as in, “does anyone really think that?”) but, in fairness, once upon a time debunking 5 or 7 myths was plenty. Now, thanks to another cult parody tribute, you must go up to 11, spawning extra myths to debunk. For the sake of enjoyment, a willing suspension of disbelief comes in handy.

Also, to appreciate a good parody, familiarity with the original will help. I can only recap the species of RoCE I know of:

RoCE, the original. “RDMA over Converged Ethernet” premiered in 2010, produced and directed by the IBTA. A link layer protocol that runs directly over “lossless” Ethernet using PFC (Priority Flow Control), RoCEv1 is definitely not routable. Since iWARP does RDMA over TCP, RoCE is faster. Okay, check.

The first sequel, RoCEv2, a.k.a. “routable RoCE”, debuted in 2014, with UDP/IP added to the cast. V2 adds the optional “Congestion Notification Packet,” or CNP, to exploit the IETF’s Explicit Congestion Notification (ECN) end-to-end TCP flow control scheme. Using CNPs assumes RFC 3168 routers plus sender and receiver algorithms, but these apparently didn’t make the final editing cut.

Next up, another Microsoft (Azure) production, also presented in a SIGCOMM paper (that says RDMA needs lossless and PFC). The paper (which weirdly says the C in RoCE is “commodity”) seems pro-RoCE but also lists several RoCE-related problems, including “livelock” and “deadlock”. The unlikely solution involved plenty of special vendor code, plus setting Ethernet’s PFC field from IP’s DSCP field. It’s not a “layer violation”, it’s a “feature!” As an oft-referenced “large network,” it’s arguably a “de facto” standard. One competitor snarks that it’s “RoCEv4.” The IBTA disavows the term, but it’s a bit sticky, or at least tacky.
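For readers who haven't met the DSCP trick, here is a minimal sketch of the idea: derive the PFC priority class from the IP header instead of from a VLAN tag. The function names and the "top three bits" mapping are my own illustrative assumptions, not anything taken from the paper; on real gear the DSCP-to-priority table is whatever the switch and NIC are configured to use.

```python
# Hypothetical illustration: pick a PFC priority (0-7) from the IP DSCP value.
# The DSCP is the upper 6 bits of the old IPv4 TOS byte; the lower 2 bits are ECN.

def dscp_from_tos(tos: int) -> int:
    """Extract the 6-bit DSCP from the 8-bit TOS / Traffic Class byte."""
    return (tos >> 2) & 0x3F

def priority_from_dscp(dscp: int) -> int:
    """Assumed mapping: use the top 3 DSCP bits as the priority class.

    Real deployments configure this table explicitly; this is only one
    plausible default.
    """
    return (dscp >> 3) & 0x7

tos_byte = 26 << 2  # DSCP 26 with the ECN bits clear
print(priority_from_dscp(dscp_from_tos(tos_byte)))
```

The point of the sketch is the layering: a layer 3 field ends up deciding which layer 2 queue gets paused.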

Well, that’s where I thought things stood as I studied the article… 2 to 4 versions of RoCE, all needing lossless Ethernet for good performance. So I was boggled that the first “myth” of RoCE was that it needs a lossless network. Wait… what? Without lossless, RoCE needs to retransmit, just like iWARP. Then the article soon says that RoCE beats other Ethernet-based RDMA like iWARP. With my “parody bit” still not turned on, I shook my head, and slogged through a dismissal of “deployment difficulties”. As if. Dell EMC’s Erik Smith wrote of “the level of complexity required to properly configure it to avoid issues with congestion spreading.” (Erik’s blog isn’t official, but this related video is.) “Interoperability between vendors is unreliable” is another supposed myth, yet there’s no standard for CNP algorithms: vendors are free to choose their own, and somehow they must still interoperate?

As I grumbled about these pseudo-myths, I was startled to hear from another colleague about a quiet (art film?) new RoCE production, disavowing PFC. Whoa! Time to extend that earlier recap!

This new “implementation” uses new CNP sender and receiver algorithms (and format?) to enable UDP-based RoCE to do “selective retransmit” (like TCP in RFC 2018). This oxymoronic RoCE runs on vanilla Ethernet, and outperforms RoCE over PFC-based lossless Ethernet.
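Why would selective retransmit matter so much on a lossy wire? A toy comparison, assuming the alternative is a go-back-N style recovery where everything from the first lost packet onward is resent. Both functions and the numbers are hypothetical illustrations, not the actual RoCE algorithms (which, as noted, I haven't seen specified).

```python
# Toy model: which sequence numbers get resent after a burst with some losses?

def go_back_n_resend(sent, lost):
    """Go-back-N (assumed baseline): resend from the first lost packet onward."""
    if not lost:
        return []
    first_lost = min(lost)
    return [seq for seq in sent if seq >= first_lost]

def selective_resend(sent, lost):
    """Selective retransmit: resend only the packets reported missing."""
    return [seq for seq in sent if seq in lost]

sent = list(range(10))   # packets 0..9 in flight
lost = {3, 7}            # two drops on a best-effort network

print(len(go_back_n_resend(sent, lost)))   # 7 packets resent (3 through 9)
print(len(selective_resend(sent, lost)))   # 2 packets resent
```

In this toy case two drops cost seven retransmissions one way and two the other, which is roughly why a protocol that can resend selectively has less to fear from an occasional drop.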

Darn, they were right! RoCE has morphed again, and its need for lossless Ethernet is, bizarrely, now a myth. And I’d bought the myth! Hilarious! What a knee slapper!

In self-defense, the acronym itself says “converged Ethernet,” an old synonym for DCB, which uses PFC. This latest non-PFC RoCE is clearly a new version. I shall call him RoCEv5. (Side note: I asked a Broadcom contact, who told me that their RNICs cannot, ahem, interoperate with this new mode.) A brief Inigo Montoya moment is understandable among observers. What, exactly, does “RoCE” mean?

The “myth buster” article says:

“RoCE” started in 2010: v1, directly on converged PFC Ethernet and can’t scale

But each bullet is only true for one version, and they are all different! The ambiguous language glosses over RoCE’s lack of a stable, well-specified version and adds to confusion about the protocol.

In all honesty, this new PFC-free version is actually a good thing. I hope that the RoCE (re-)inventors can get the word out and make v5 sit still. Maybe they can even write a fully specified, interoperable standard and give it a less oxymoronic name!

It is worth recognizing, though, that when the RoCE crowd cries U.N.C.L.E. on lossless Ethernet, they are staking a claim for pretty good performance on non-deterministic, best-effort infrastructure. It’ll work great much of the time, but now and then it won’t work as well. That’s good stuff for a number of non-mission-critical applications. Mission-critical Enterprise Storage is just not one of those applications.

I have been thinking about new names for the (hopefully stable) RDMA over UDP "draft standard". Here are various options (though I doubt I have any influence):

RDMA over Vanilla Ethernet (RoVE)

RDMA over undifferentiated Ethernet (RouE)

RDMA over "standard" Ethernet (RosE)

RDMA over UDP over Ethernet (RouD or RoUDe, pronounced rowdy)

RoUDe (rowdy) seems like the best choice. RoVE isn't bad, but not a great connotation. RosE is a great acronym, but what is standard Ethernet? I thought at first that RouE meant roux... but I had the spelling wrong and either way there is a messy connotation.
