Lately I’ve been thinking about the potential applicability of OpenFlow to massively scalable data centers. A common building block of a massive cloud data center is a cluster, a grouping of racks and servers with a common profile of inter-rack bandwidth and latency characteristics. One of the primary challenges in building networks for a massive cluster of servers (600-1000 racks) is the scalability of the network switch control plane.


Simplistically speaking, the basic job of a network switch is to make a forwarding decision (control plane) and subsequently forward the data toward a destination (data plane). In the networking context, the phrase “control plane” might mean different things to different people. When I refer to “control plane” here I’m talking about the basic function of making a forwarding decision, the information required to facilitate that decision, and the processing required to build and maintain that information.

To make forwarding decisions very quickly, the network switch is equipped with very specialized memory resources (TCAM) to hold the forwarding information. The specialized nature of this memory makes it difficult and expensive for each switch to have large quantities of memory available for holding information about a large network.


Given that OpenFlow removes the control plane responsibilities from the switch, it also removes the scale problems along with it, shifting the burden upstream to a more scalable and centralized OpenFlow controller. When controlled via OpenFlow, the switch itself no longer makes forwarding decisions. Therefore the switch no longer needs to maintain the information required to facilitate those decisions, and no longer needs to run the processes required to build and maintain that information. All of that responsibility is now outsourced to the central, overarching OpenFlow controller. With that in mind, my curiosity is piqued about the impact this may have on scalability, for better or worse.

Making a forwarding decision requires having information. And the nature of that information is shaped by the processes used to build it. In networks today, each individual network switch independently performs all three functions: building the information, storing the information, and making a decision. OpenFlow seeks to assume all three responsibilities. Sidebar question: Is it necessary to assume all three functions to achieve the goal of massive scale?


In networks today, if you build a cluster based on a pervasive Layer 2 design (any VLAN in any rack), every top of rack (ToR) switch builds forwarding information by a well-known process called source MAC learning. This process forms the nature of the forwarding information stored by each ToR switch, which results in a complete table of all station MAC addresses (virtual and physical) in the entire cluster. With a limit of 16-32K entries in the MAC table of the typical ToR, this presents a real scalability problem. But what really created this problem? The process used to build the information (source MAC learning)? The amount of information exposed to the process (pervasive L2 design)? Or the limited capacity to hold information?
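The arithmetic behind that table pressure is easy to sketch. A toy Python model (all numbers illustrative, not from any particular switch datasheet):

```python
# Toy model of source MAC learning in a pervasive L2 cluster.
# Illustrative numbers only -- not from any specific switch datasheet.

def macs_learned_per_tor(racks, servers_per_rack, vms_per_server):
    """In an any-VLAN-anywhere design, every ToR eventually learns the
    MAC of every physical and virtual station in the entire cluster."""
    physical = racks * servers_per_rack
    virtual = physical * vms_per_server
    return physical + virtual

TOR_MAC_TABLE_LIMIT = 16_000  # low end of the 16-32K range cited above

# 1000 racks x 40 servers, each server hosting 10 VMs
needed = macs_learned_per_tor(1000, 40, 10)
print(needed)                         # 440000 entries needed per ToR
print(needed > TOR_MAC_TABLE_LIMIT)   # True -- overflows by a wide margin
```

Even a modest VM density blows well past the hardware table, which is exactly the scaling problem described above.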

The OpenFlow enabled ToR switch doesn’t have any of those problems. It doesn’t need to build forwarding information (source MAC learning) and it doesn’t need to store it. This sounds promising for scalability. But by taking such a wholesale ownership of the switch control plane, does OpenFlow create a new scalability challenge in the process? Consider that a rack with 40-80 multi-core servers could be processing thousands of new flows per second. With the ToR sending the first packet of each new flow to the OpenFlow controller, how do you build a message bus to support that? How does that scale over many hundreds of racks? Assuming 5,000 new flows per rack (per second) – across 1000 racks – that’s 5,000,000 flow inspections per second to be delivered to and from the OpenFlow controller(s) and the ToR switches. Additionally, if each server is running an OpenFlow controlled virtual switch, the first packet of the same 5,000,000 flows per second will also be perceived as new by the virtual switch and again sent to the OpenFlow controller(s). That’s 10,000,000 flow inspections per second, without yet factoring in the aggregation switch traffic.


In networks today, another cluster design option that provides better scalability is a mixed L2/L3 design, where the Layer 2 ToR switch is only exposed to a unique VLAN for its rack, with the L3 aggregation switches straddling all racks and all VLANs. Here the same MAC learning process is used, only now the ToR is exposed to much less information: just one VLAN unique to the 40 or 80 servers in its rack. Therefore, the resulting information set is vastly reduced, easily fitting into typical ToR table sizes and virtually eliminating the ToR switch as a scaling problem. OpenFlow controlling the ToR switch would not add much value here in terms of scalability. In fact, in this instance, OpenFlow may create a new scalability challenge that previously did not exist (10,000,000 flow inspections per second).

In either the pervasive L2 or mixed L2/L3 designs discussed above, the poor aggregation switches (straddling all VLANs, for all racks in the cluster) are faced with the burden of building and maintaining both Layer 2 and Layer 3 forwarding information for all hosts in the cluster (perhaps 50-100K physical machines, and many more virtual). While the typical aggregation switch (Nexus 7000) may have more forwarding information scalability than a typical ToR, these table sizes too are finite, creating scalability limits to the overall cluster size. Furthermore, the processing required to build and maintain the vast amounts of L2 & L3 forwarding information (e.g. ARP) can pose challenges to stability.

Here again, the OpenFlow controlled aggregation switch would outsource all three control plane responsibilities to the OpenFlow controller(s): building information, maintaining information, and making forwarding decisions. Given that the aggregation switch is now relieved of the responsibility of building and maintaining forwarding information, this opens the door to higher scalability, better stability, and larger cluster sizes. However, again, a larger cluster size means more flows per second. Because the aggregation switch doesn’t have the information to make a decision, it must send the first packet of each new flow to the OpenFlow controller(s) and wait for a decision. The 5,000,000 new flows per second are now inspected again at the aggregation switch, for a potential total of 15,000,000 inspections per second for 5,000,000 flows through the Agg, ToR, and vSwitch layers. How do you build a message bus to support such a load? Should the message bus be in-band? Out-of-band? If out-of-band, how does that affect cabling, and does it mean a separate switch infrastructure for the message bus?
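The back-of-envelope numbers above can be captured in a few lines, using the same assumed rate of 5,000 new flows per rack per second:

```python
# Back-of-envelope flow inspection load on the OpenFlow controller(s).
# The 5,000 flows/rack/sec figure is an assumption, as in the text above.

def inspections_per_second(racks, new_flows_per_rack, layers):
    """Each OpenFlow-controlled layer (vSwitch, ToR, Agg) perceives the
    first packet of every new flow as a miss and punts it upstream."""
    return racks * new_flows_per_rack * layers

RACKS, FLOWS = 1000, 5000
print(inspections_per_second(RACKS, FLOWS, layers=1))  # ToR only: 5000000
print(inspections_per_second(RACKS, FLOWS, layers=2))  # + vSwitch: 10000000
print(inspections_per_second(RACKS, FLOWS, layers=3))  # + Agg: 15000000
```

The point of the model is that the punt load multiplies with every OpenFlow-controlled layer the flow traverses, not just with cluster size.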

One answer to reducing the flow inspection load would be reducing the granularity of what constitutes a new flow. Rather than viewing each new TCP session as a new flow, perhaps anything destined to a specific server IP address is considered a flow. This would dramatically reduce our 5,000,000 new flow inspections per second. However, by reducing the flow granularity in OpenFlow we equally compromise forwarding control of each individual TCP session inside the main “flow”. This might be a step forward for scalability, but could be a step backward in traffic engineering. Switches today are capable of granular load balancing of individual TCP sessions across equal cost paths.
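To illustrate the trade-off, here is a small Python sketch comparing entry counts for a fine-grained flow key (the full tuple) versus a coarse one (destination IP only), using purely synthetic traffic:

```python
# Illustration of how flow-key granularity changes the number of "new
# flows" a controller must inspect. Traffic here is synthetic.

from itertools import product

def distinct_flows(packets, key):
    """Count distinct flow-table entries under a given flow key."""
    return len({key(p) for p in packets})

# (src_ip, src_port, dst_ip, dst_port): 100 clients x 50 ports -> 4 servers
packets = [(f"10.0.0.{c}", p, f"10.1.0.{s}", 80)
           for c, p, s in product(range(100), range(1024, 1074), range(4))]

fine   = distinct_flows(packets, key=lambda p: p)     # full tuple
coarse = distinct_flows(packets, key=lambda p: p[2])  # destination IP only

print(fine, coarse)  # 20000 vs 4 -- but all TE control inside each
                     # coarse "flow" is lost
```

The coarse key collapses four orders of magnitude of controller load in this toy case, at the cost of being unable to steer individual sessions.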


Finally, in networks today, there is the end-to-end L3 cluster design where the Agg, ToR, and vswitch are all Layer 3 switches. Given that the aggregation switch is no longer Layer 2 adjacent to all hosts within the cluster, the amount of processing required to maintain Layer 2 forwarding information is greatly reduced. For example, it is no longer necessary for the Agg switch to store a large MAC table, process ARP requests, and subsequently map destination MACs to /32 IP route table entries for all hosts within the cluster (a process called ‘ARP glean’). This certainly helps the stability and overall scalability of the cluster.

However, the end-to-end L3 cluster design introduces new scalability challenges. Rather than processing ARP requests, the Agg switch must now maintain L3 routing protocol sessions with each L3 ToR (hundreds). Assuming OSPF or IS-IS as the IGP, each ToR and vswitch must process LSA (link state advertisement) messages from every other switch in its area. And each Agg switch (straddling all areas) must process all LSA messages from all switches in all areas of the cluster. Furthermore, now that the Agg switch is no longer L2 adjacent to all hosts, this also means the Load Balancers may not be L2 adjacent to all hosts either. How does that impact the Load Balancer configuration? Will you need to configure L3 source NAT for all flows? The Load Balancer may not support L3 direct server return, etc.

The problem here is not so much the table sizes holding the information (IP route table) as it is the processes required to build and maintain that information (IP routing protocols). Here again, OpenFlow would take a wholesale approach of completely removing all of the table size and L3 processing requirements. With OpenFlow, the challenges of Layer 2 vs. Layer 3 switching and the associated scalability trade-offs are irrelevant and removed. That’s good for scalability. However, OpenFlow would also take responsibility for forwarding decisions, and in a huge cluster with millions of flows that could add new challenges to scalability (discussed earlier).

In considering the potential message bus and flow inspection scalability challenges of OpenFlow in a massive data center, one thought comes to mind: Is it really necessary to remove all control plane responsibilities from the switch to achieve the goal of scale? Rather than taking the wholesale all-or-nothing approach as OpenFlow does, what if only one function of the control plane was replaced, leaving the rest intact? Remember the three main functions we’ve discussed: building information, maintaining information, making forwarding decisions. What if the function of building information was replaced by a controller, while the switch continued to hold the information provided to it and therefore make forwarding decisions based on the provided information?


In such a model, the controller acts as a software defined network provisioning system, programming the switches via an open API with L2 and L3 forwarding information, and continually updating it upon changes. The controller knows the topology, L2 or L3, and knows the devices attached to the topology and their identities (IP/MAC addresses). The burdensome processes of ARP glean and managing hundreds of routing protocol adjacencies and messages are offloaded to the controller, while the millions of new flows per second are forwarded immediately by each switch, without delay. Traditional routing protocols and spanning tree could go away as we know them, with the controller having a universal view of the topology, much like OpenFlow. The message bus between switch and controller still exists; however, the number of flows and the flow granularity are of no concern. Only new and changed L2 or L3 forwarding information traverses the message bus (along with other management data, perhaps). This too is a form of Software Defined Networking (SDN), in my humble opinion.
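A rough sketch of that model (all class and method names here are hypothetical, not from any real controller API): the controller proactively programs forwarding state, and the switch makes every forwarding decision locally, with no per-flow punt to the controller:

```python
# Minimal sketch of the "controller provisions, switch decides" model
# described above. Class and method names are hypothetical.

class Switch:
    def __init__(self, name):
        self.name = name
        self.fib = {}                 # destination -> out port

    def program(self, dest, port):    # invoked only by the controller
        self.fib[dest] = port

    def forward(self, dest):
        """Local forwarding decision -- no round trip to the controller."""
        return self.fib.get(dest)     # None means drop, not punt

class Controller:
    """Holds the global topology view and pushes state proactively."""
    def __init__(self, switches):
        self.switches = switches

    def host_attached(self, dest, port_map):
        # On a topology or host change, update only the affected entries.
        for sw in self.switches:
            sw.program(dest, port_map[sw.name])

tor = Switch("tor1"); agg = Switch("agg1")
ctl = Controller([tor, agg])
ctl.host_attached("10.1.2.3", {"tor1": 5, "agg1": 12})

print(tor.forward("10.1.2.3"))   # 5 -- decided locally, at line rate
```

Only the `host_attached` update crosses the message bus; the millions of flows per second never do.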

Once a controller and message bus are introduced into the architecture for SDN, the foundation is there to take the next larger leap into full OpenFlow, if it makes sense. A smaller subset of flows could be exposed to OpenFlow control for testing and experimentation, gradually moving toward full OpenFlow control for all traffic if desired. Indeed we may find that granular OpenFlow processing and message bus scale is a non-issue in massive internet data centers.

At least for now, as you can see, I have a lot more questions than I have answers or hype. But one thing is for sure: things are changing fast, and it appears software defined networking, in one form or another, is the way forward.

Brad Hedlund is an Engineering Architect with VMware’s Networking and Security Business Unit (NSBU), focused on network & security virtualization (NSX) and the software-defined data center. Brad’s background in data center networking begins in the mid-1990s with a variety of experience in roles such as IT customer, systems integrator, architecture and technical strategy roles at Cisco and Dell, and speaker at industry conferences. CCIE Emeritus #5530.

Nice article Brad. Good insight into where things are headed. Doesn’t the concept of externally programmed switches seem like it would work really well with virtual switches? Virtual switches probably send out a significant amount of heterogeneous traffic, so a system where the vSwitch was authorized to set the handling instructions for its clients would hold promise.

Seems to me that OpenFlow is clearly going to be the emerging technology for traffic engineering in all aspects, though not so much in data centers as in WAN environments with all these MPLS-based services. About “new flow” characterization though, I fully agree that it will be challenging to handle each and every flow in very scalable environments.

OpenFlow could provide some marginal benefits because it could download TCAM entries on demand (for example, destination IP or MAC address would be downloaded into a switch TCAM only when a new flow would go toward that destination), but these things could have been done (and have been done – vSwitch, NX6K1 in host mode …) in existing architectures.

Hi Ivan,
You’re right, an OpenFlow controlled switch would also need TCAM resources too, to hold the flow table. I could have been more clear about that. You make some excellent points (as usual).

That said, the reasoning for OpenFlow or some other approach to network programmability still remains: the processes stressing control plane scalability (and stability), such as spanning tree, ARP, and IP routing protocols, go away as we know them. Rather than the switch programming itself with resource intense processes, the external controller with a global view of the network takes on that responsibility.

That, any way you do it, is good for scalability.

The question here is: is the OpenFlow approach of per-flow centralized forwarding decisions the most efficient way to accomplish that in a massively flow-dense scale-out data center?

It seems to me that network programmability could be achieved in a way that offloads control plane processing (as OpenFlow does), but also distributes the subsequent forwarding decision responsibility.

Thanks for making a trip from the Ivory Tower to visit my humble abode.

Actually, OpenFlow can do anything you want and more. It’s very limited in its scope (a TCAM downloading API), which is not necessarily a bad thing, but its TCAM data structures cover many forwarding paradigms (flow, dMAC, s+dMAC, IP prefix …). That richness is also one of my issues: it will take a while before we see a full OF 2.0 implementation in hardware. Another issue: the missing elephants like MAC-in-MAC (802.1ah) and IPv6 (yeah, they are no better than a certain 5-letter vendor I like to bash for lack of IPv6 support). You should really go through the OpenFlow standard – not hard to read and pretty enlightening.

As for scalability: let’s just say that central controller architectures have been proposed numerous times in the past and always failed. I have yet to become convinced that this time it will be any different. BTW, this has nothing to do with OF; OF is just a low-level API, how you use it is a totally different story.

-I’m not sure that I’ve heard anyone argue that the correct way to do SDN in a datacenter is to do flow setups on a per connection basis, particularly because as you have stated, the numbers are large. That isn’t to say that it can’t or shouldn’t be done, but I think the consensus right now is that switch TCAM space is more of the limiting factor than receiving, making and sending out the decisions from software when using this model.

-When using more coarse flow entries to ease the controller/TCAM burden, one can still perform traffic engineering by simply inserting more granular flow entries into switches with a higher priority than the coarse entries.

-I’m not sure I understand your suggestion at the end. You suggest letting the switch handle the forwarding decision, based on information provided by the controller. However for the switch to be able to make the forwarding decision it had to have its flow table programmed before the packet arrives. In OpenFlow terminology this is just considered proactive routing, rather than reactive. If you have the information already available, populate the flow table on the switch with it so you don’t have to make a decision per flow. With OpenFlow you can do reactive, proactive, or some combination of the two, while only requiring the switch to implement the OpenFlow spec.
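The priority-override idea from the second bullet can be sketched with a simplified OpenFlow-style lookup (the table layout here is illustrative, not the actual OF wire format): coarse low-priority entries carry the bulk of traffic, and targeted higher-priority entries override them for traffic engineering.

```python
# Sketch of coarse flow entries overridden by more-specific,
# higher-priority entries (simplified OpenFlow-style lookup).

def lookup(flow_table, pkt):
    """Return the action of the highest-priority matching entry,
    or 'punt' to the controller on a table miss."""
    def matches(entry):
        return all(pkt.get(k) == v for k, v in entry["match"].items())
    hits = [e for e in flow_table if matches(e)]
    return max(hits, key=lambda e: e["priority"])["action"] if hits else "punt"

table = [
    # coarse entry: everything to this server IP
    {"priority": 10,  "match": {"dst_ip": "10.1.0.5"},  "action": "port:1"},
    # TE override: HTTPS to the same server takes a different path
    {"priority": 100, "match": {"dst_ip": "10.1.0.5",
                                "dst_port": 443},       "action": "port:2"},
]

print(lookup(table, {"dst_ip": "10.1.0.5", "dst_port": 80}))   # port:1
print(lookup(table, {"dst_ip": "10.1.0.5", "dst_port": 443}))  # port:2
```

The coarse entry keeps the controller load low, while the override restores per-session control exactly where it is wanted.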

I’ll echo some of the earlier comments w.r.t. flow granularity. But my (somewhat jaded) view of OpenFlow is that it’s all about the economics. Building data center networks as you describe is pushing the boundaries of even today’s most sophisticated gear. IMO, OpenFlow’s impact won’t be so much about doing the bigger things better, it’s about doing the simpler things cheaper. White box switches with OpenFlow controllers at a fraction of the cost. Success!

“The OpenFlow enabled ToR switch doesn’t have any of those problems. It doesn’t need to build forwarding information (source mac learning) and it doesn’t need to store it. This sounds promising for scalability. … With the ToR sending the first packet of each new flow to the OpenFlow controller, how do you build a message bus to support that?”

What you are describing here sounds something like a throwback to fast switching, only distributed across boxes. The first packet of every flow is “punted” to OpenFlow. You aren’t just moving the control plane; you are also ripping the forwarding plane in half. However, this is not my understanding of OpenFlow at all.

Using 6500s as an analogy, as everyone seems to know 6500s: my understanding of OpenFlow is that it would effectively be a remote MSFC (or many, pooled and virtualized, etc.), but what you are describing is a remote MSFC and PFC, and the elimination of DFCs. All LCs would effectively become something like a Nexus 2K hanging off a FEX port to the OpenFlow master. You are not just moving control plane decision making but also data plane decision execution. But if this is what OpenFlow intended, then it would be an API designed around forwarding queries and responses, not descriptions of whole forwarding tables to be programmed on remote hardware.

While I can see some possible merits to the remote control plane design, there are a few things I either can’t get my head around or that just don’t seem to offer much of an advantage.

1) The control plane “bus”. To connect your OpenFlow servers to the LCs you would need some out-of-band network (I can’t see how doing it in-band would be anything other than insane). What would this OpenFlow bus/network look like? How many extra devices would be needed to operate this network? Would it combine with an out-of-band management network or become yet another network? How would you troubleshoot connectivity, etc., on the OpenFlow bus network? (As a former Cisco TAC engineer, I’m cringing at the thought of it.)

2) I’m pretty sure that the scale limit today for most boxes is the cost/heat/cooling, etc., of TCAM, CAM, and other data plane components, not the control plane. Moving the control plane to a remote box does not seem to help this in any way, as you still need to store the forwarding table on the LC.

3) Would two control plane instances sitting next to each other speak directly to each other, or still send control plane packets down to the LCs to then be forwarded over to the other LC and punted up to the other control plane? Wow, that’s convoluted. It would seem more efficient to speak directly to each other, but now that your control plane is out-of-band it no longer tells you if the data plane is up between two boxes, and you need something else à la BFD, CFM, etc. to do that. But now you have a mini control plane back on the forwarding boxes. Ugh. I think I’m making my head hurt.

After our chat I reviewed this post, interesting ideas thrown here. I still have trouble imagining this out-of-band network. I can only think of a separate “classical” network being in charge of transporting all the control plane information from the edge nodes to the central controller. Any thoughts on how this issue could be addressed?

I also have a separate issue with this vision. If forwarding decisions are taken from a central point, and we use a basic L3 flow-based approach, I have to understand that each single flow would always follow the same path across the network, right? Traffic from server A towards server Z would always flow across the same links. What kind of link load balancing could we see here? How about FIP snooping support for FCoE deployments?

Finally, it would be nice to know how convergence times would be affected by these ideas. Any move of a MAC address from one part to another of a DC (or even between DCs) would trigger a change in the controller information base, which would need to be redistributed to every switch in the network for them to take the proper forwarding decision. The loss of a link would mean a complete redesign of the network topology. Is STP still the proposed control plane for these environments? I’m not sure what the impact on convergence times would be.

The OpenFlow 1.1 spec has added support for ECMP. So perhaps the flow entry would point to a logical ECMP bundle and let the switch hash out which underlying physical link gets the flow. Just guessing, I have not looked into the details here yet.
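A guess at how that might behave in practice: the flow entry points at a group of equal-cost links, and the switch hashes each flow tuple onto one member locally, so a given flow always takes the same link (no packet reordering) while different flows spread out:

```python
# Sketch of ECMP group selection done locally by the switch: the
# controller installs one entry pointing at the group; the hash-based
# member choice below is a simplified guess, not the OF 1.1 algorithm.

import hashlib

def ecmp_member(five_tuple, links):
    """Deterministically map a flow onto one equal-cost link."""
    digest = hashlib.md5(repr(five_tuple).encode()).digest()
    return links[digest[0] % len(links)]

links = ["agg1", "agg2", "agg3", "agg4"]
flow_a = ("10.0.0.1", 33000, "10.1.0.5", 443, "tcp")
flow_b = ("10.0.0.2", 33001, "10.1.0.5", 443, "tcp")

# Same flow always hashes to the same link; different flows may differ.
print(ecmp_member(flow_a, links), ecmp_member(flow_b, links))
```

The controller never sees the per-flow link choice, which keeps the message bus out of the fast path.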

Convergence is another interesting topic. A switch with an affected link would need to report that to the controller over the message bus. To prevent sub optimal routing, the controller would then need to invalidate or change the flow entries in all other switches via the message bus.

STP goes away. As do IP routing protocols. Since the controller has a global view of the network, you don’t need switch-to-switch protocols anymore.

As you can see, the message bus becomes a critical part of the infrastructure. How does it scale? In band, or out of band? How does it cope with failures? All important questions…

Establishing the control path, whether in or out of band, is currently outside of the OpenFlow spec. In practice, traditional distributed protocols are used to establish the base connectivity, and they dictate the convergence properties of the network. For example, if a traditional IGP is used in-band (its sole purpose being to establish connectivity to the controller) and a link breaks affecting the control traffic, the following happens: (i) the IGP must converge, (ii) the topology change is signaled to the controller in some manner, and (iii) the controller must react. Clearly this is strictly slower than a traditional IGP with no additional mechanism.

For this reason, OpenFlow deployments often use an out-of-band L3 network (yup, more gear and cabling). The rationale is that it’s relatively easy to build a simple L3 network, but perhaps more difficult to build a network which does whatever the special-purpose OpenFlow network is being built to do.

I’d also like to echo Dave’s comments that forwarding the first packet of a flow to the controller is definitely a no-go in the data center. OpenFlow solutions in this space generally compute state at the controller on physical network change, and proactively push all state to the network (without ever seeing data traffic). So the question is, what is the best way to compute that state: with modern distributed system practices and tools (think BigTable, PAXOS and other distributed coordination primitives, etc.)? Or with a collection of network protocols? In both cases, the computation and state can be fully distributed. Regarding the latter, one might argue that coupling the distribution model (one control element per forwarding element), and worrying about low-level details like protocol headers, is awfully limiting in the day of modern distributed systems.

I agree that one-query-per-flow to the controller is not scalable for Data Center deployments. If I understand correctly, what you suggest is that the controller would have a global view of the network, build a Forwarding Information Base, and then proactively distribute this info to the node switches, which would program the received paths in their local TCAM, right?

This is similar to what Brad argued in his post on OpenFlow not needing to assume all functions of the Control Plane, and indeed is the approach that Cisco Express Forwarding has taken for ages in routers. It has never been taken further than, say, 18 cards in the same box, perhaps 22 if you consider two C6513s in VSS. How would this model scale in a hundreds of cards/nodes model? Has this ever been tested? The distribution of this FIB over the out-of-band network should also be lightning-fast in order to avoid inconsistencies: you don’t want out-of-sync TCAMs in your network. Would this imply that this OOB network would need to be a super-low-latency network, similar to those required in HFT scenarios?

Another challenge I’m seeing here is the integration of storage traffic in these networks. How could we take into account SAN fabric separation? Would this controller have another “brain” behind the curtain dedicated to the SAN architecture, with two halves, each in charge of one fabric, so we get rid of traditional FibreChannel? This last integration is what puzzles me most…

1. OpenFlow, even v1.0, allows implementations that go along the lines you talk about.

a. The switch can look at incoming flow rules and implement them any way it sees fit. Thus, an OpenFlow switch can look at rules of the form “if DestMAC=X, send to port Y” and implement them by putting entries in the (non-TCAM) FDB. Similarly, rules based on destination IP address or subnet can be implemented using the on-board L3 routing engine. This can create the situation you envision: “… In such a model, the controller acts as a software defined network provisioning system, programming the switches via an open API with L2 and L3 forwarding information …”

b. OpenFlow allows a “LOCAL” action that can be used (with a suitable switch) to build a hybrid implementation, with some control plane on the switch and some offloaded to the controller. For example, an OF rule can match incoming LACP control BPDU frames and apply the LOCAL action to them, allowing the switch to manage aggregated link membership locally, while the OF controller makes the decision of which frames will be sent to the LAG. This can be done even with OF 1.0. This can conceivably be used as the catch-all action for any unknown flows, allowing on-board logic to look at the flows and decide if they will be forwarded to the controller or handled locally.

c. Further, OF also allows a “NORMAL” action which allows the OF controller to explicitly send desired traffic to be handled by the “normal” switch forwarding path, which is configured using non-OF on-board control plane software. Again, this could be used to handle all unknown flows, thus making OF handle all the “special cases” while allowing routine stuff to be handled by the current mechanisms.

2. It seems to me the discussion here about forwarding decisions misses an important dimension. It is not just a decision of where to send a matching frame, but also a question of handling attributes along the way. So we need information and mechanisms to do (and impose) bandwidth allocation, sharing, and limits. Again we need to decide if this is done locally, by the OF controller, or by a hybrid of both. OpenFlow as it is now does not yet cover this well (it has a “send to queue X” action, but no facilities to either control per-queue bandwidth or do per-flow traffic measurement and shaping). Given that flows by necessity use shared HW resources (ports, queues, buffers, etc.), it follows that handling this aspect is not amenable to purely “per-flow logic” and it will not be enough to just add OpenFlow on-match actions.

Perhaps someone can shed some light. Every discussion/white paper I have read discussed OpenFlow solely from the perspective of how the control plane is completely separated from the data plane and how forwarding decisions are made. Much of the discussion is esoteric and very high-level. For example, we are told that the Controller will have a global view of the network which allows it to make forwarding decisions, but how that global view is achieved is never explained.

In short, the philosophical and scientific discussions regarding the Chi of Networking are interesting, but what I would like to see is a paper that climbs down from the ivory tower to address the more mundane aspects of building a data center network and how OpenFlow impacts that project.

What most engineers/administrators are curious to know is how exactly will OpenFlow change the existing paradigm of data center network deployments vis-a-vis VLANs, routing protocols, mCAST, spanning of L2 domains, Spanning Tree, etc. What does the total separation of the control plane and its centralization on a server portend for the future ?

Lastly, and perhaps this sounds silly, but why is this new approach characterized as “Software Defined”? Isn’t all networking “Software Defined”? The OS includes all the protocols that drive the implementation of L2/L3 unicast, multicast and broadcast forwarding; security; management; etc.

Joe,
The OpenFlow evangelists certainly have their work cut out in really laying out why this fundamental change in networking is so important, and needed now for the real world. There is a business case that needs to be made, there is a technical case that needs to be made, and I haven’t seen anything terribly convincing yet. I have my own ideas. I can imagine some pretty cool things you could do with OpenFlow in powering turn-key solutions.
Perhaps this is the key objective of the Open Networking Summit — which I have plans to attend — for the OpenFlow evangelists to make their case and rally the industry. My ears and mind are open, and I’m optimistic they will make some great strides. There are lots of really smart people behind this. Time will tell!

Hi Brad
We have been working on OpenFlow for our PhD project. I need some ideas:
1. How could this centralized system be made into a distributed system?
2. How much would each switch cost, including the hybrid switches?
3. If we need to build this network for roughly 2,000 hosts, how many switches (minimum) would the design need?

From what I understand of the OpenFlow implementation, the flow definition may not be like what we are counting above.
A frame is matched against one or more flow tables stored inside the switch. Matching can be based on VLAN, IP fields, MPLS header, Layer 4 port, Layer 2 MAC, etc., but entries can have “don’t care” fields. Conditional matching is also possible if multiple flow tables are used.

So unless we are interested in very granular traffic engineering that matches many fields, the flow table may not be so big. If the network is purely used for forwarding (without any manipulation or control), the table size is roughly equal to the total number of MAC addresses.

If security features are used, they may add flow entries for source and destination subnets with a drop action. If traffic engineering is used, normally only one or two additional fields are used, or the “metadata” field can be used internally, so it adds entries but not multiplicatively.

This is what I understand; I’m still new to OpenFlow, so correct me if I’m wrong.