Switch stacking for campus design: There’s a better way

We often receive the following campus design question: “do you support switch stacking?” This is a fair question, as many of the legacy vendors have promoted stacking designs for the past decade. It’s popular enough that people ask for it, so we must support it, right?

Well, the popular option isn’t always the best one, and switch stacking designs are a very good example of that philosophy. So when people ask if we support stacking, we think to ourselves “heck, no” before politely telling them that we do not because better options exist.

“Perfection is attained, not when there is nothing more to add, but when there is nothing more to take away.”

At Cumulus Networks, we believe that simplicity is the cornerstone of network design.

Or, to say it another way, complex designs fail in complex ways (shoutout to Eric Pulvino for that quote!). Our former Chief Scientist, Dinesh Dutt, gave an excellent explanation of the importance of simple building blocks in his Tech Field Day 9 presentation (6 minutes 50 seconds in).

Let’s review a little history of switch stacking, break down the major technical downfalls of a stacking design and the stacking protocol itself, and then compare and contrast that design with Cumulus’ preferred alternative.

Why do switch stacks exist?

For stacks to exist at all, there must have been a problem (or problems) that wasn’t being solved by existing technologies… some missing piece or pain point that drew the ire of customers and forced the invention of a new management technology. And indeed there were numerous problems that stacks were created to address. However, as we’ll see, these are old problems that are all solved today in much more appropriate ways that don’t require the use of a switch stack.

Too many boxes to manage

25-30 years ago, all networks were collections of small standalone switches. Customers’ networks were a bit smaller then, due to the relatively limited sprawl and high price of compute, but they were composed of a lot of simple switches. While functional, this was painful to operate. Each switch had to be managed individually, which required a whole lot of network admins and keyboard time. And that’s all you had back then: more switches always meant more keyboards and more network admins, with an essentially linear relationship (roughly 15:1) between switches and the people managing them. As a result, customers drove the big networking vendors of the day to build bigger boxes, so they could own fewer but denser switches and lessen their administrative burden and headcount.

ASIC hardware limitations — making switch-stack lemonade with lemons

When switch stacks and chassis systems were invented back in the ’90s, the availability and capability of switching ASICs were very different from what exists today. At that time, an entire Cisco Catalyst 6500 chassis with the Sup32 supervisor had 32Gbps of forwarding capacity. If you wanted lots of density, you HAD to have lots of ASICs. There simply was no single ASIC capable of handling aggregated density at the access layer. To address customer demands for density, vendors essentially had to link multiple ASICs together to build switch stacks and chassis systems.

What has changed?

Firstly, the software. Today we have new tools to handle the problem of configuration management: specifically, purpose-built automation and configuration management tools. These tools allow companies like Google to operate with efficiencies of one person managing 40,000 servers (40000:1). Because Cumulus uses the same Linux operating model as the server world, we see folks within our customer base achieving similarly high efficiencies today.

Secondly, the hardware. Today we have products like the Tomahawk 3 ASIC with 12.8Tbps of forwarding capacity in a single switch: a single ASIC that could terminate 128 100Gb ports, or 12,800 1Gb ports if you could somehow handle the breakouts. The hardware limitations that originally drove stacking are no longer relevant.
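The port math above is easy to verify. A minimal sketch (the ASIC capacity figure comes from the text; the helper function is ours, purely for illustration):

```python
# Back-of-the-envelope port math for a single modern switching ASIC.
# Capacity figure from the text: Tomahawk 3 forwards 12.8 Tbps.
ASIC_CAPACITY_GBPS = 12_800

def max_ports(port_speed_gbps: int) -> int:
    """How many ports of a given speed a single ASIC could terminate."""
    return ASIC_CAPACITY_GBPS // port_speed_gbps

print(max_ports(100))  # 128 x 100G ports
print(max_ports(1))    # 12,800 x 1G ports, breakouts permitting
```

One ASIC now covers the density that once required stitching many small ASICs together.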

The primary driver for switch stacks continues to be “simplified” management. But we do not believe the goal of simplified management has ever been consistently delivered, and a high price has been paid in its pursuit when you consider the tens of thousands of lines of extra code and software complexity that have been introduced into the network in search of it.

Stacking management plane

The management plane of a networking stack is where a lot of the complexity is hidden. A stacking protocol will span and synchronize the management planes of multiple switches into a single, logical switch.

Often, this requires that each switch must be running the same version of code. It also requires that each switch must be from the same vendor, and often the same model or same family, as the other members of the stack. This can limit your options on the speeds and feeds, and it sometimes involves forcing code upgrades that may not be warranted.

The upgrade process is where a lot of the stress of managing a stack comes into play. The underlying philosophy is that there will be a rolling upgrade across all of the stack members. This is accomplished by an ISSU (In-Service Software Upgrade) across the stack that gracefully reboots line cards, followed by routing-engine cards, without interrupting traffic on members that have already been upgraded or members waiting to be upgraded, all while never losing connectivity to the stack’s management IP address.

That’s what is supposed to happen. But if we’re honest, is that what we see happen in the real world? Of course not. Sadly, the reality is that this upgrade process is far more complicated and stress-inducing, and it rarely goes the way the marketing flyer told us it would.

In a past life, I accepted that this was just how things were done, folded my hands in prayer and often rebooted all of the stack members at the same time, hoping that I had enough console cables in case something went horribly wrong. Ask any seasoned engineer who has worked on stacking products whether their customers wanted them on-site during these upgrades. You will often get a shoulder shrug and an acknowledgement that this was just part of the job.

Surely, you think to yourself, there must be an easier way! And that’s exactly why Cumulus Networks chooses to support alternate designs over stacking. Why support a design that’s such a needlessly complicated hassle just because it’s popular, when more appropriate solutions exist?

So, what’s the alternative to switch stacking? We believe that the way around the complexity of a stacking protocol’s shared management plane, uniform code versions and ISSU is to jettison the complexity and run each switch as an independent actor.

Let’s take a deeper dive into what we recommend:

There are two designs that we would recommend for this physical topology (IDF access switches uplinked to a pair of MDF distribution switches). First, you could do a complete L3 design. In this design, each IDF would announce locally attached VLANs via OSPF or BGP. The IDFs would have routed, ECMP-capable connections back to the MDF. Each MDF would then receive a default route from an upstream connection and pass along the routing updates from the downstream IDF switches.
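To make the L3 design concrete, here is a sketch of what one IDF switch’s FRR routing configuration might look like. This is illustrative only: the router ID, prefixes and the choice of OSPF (rather than BGP) are assumptions, not a prescribed design.

```
! Illustrative frr.conf fragment for one IDF switch (addresses hypothetical)
router ospf
 ospf router-id 10.0.0.11
 ! announce the locally attached user VLAN
 network 10.10.20.0/24 area 0
 ! routed /31 uplinks to the two MDF switches; equal-cost paths give ECMP
 network 10.0.1.0/31 area 0
 network 10.0.1.2/31 area 0
```

Each IDF carries only its own small routing configuration, so any one switch can be changed or rebooted without touching the others.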

The other design would be a mix of L3 and L2. In this design, each of the IDF switches would be independent; they are connected with an LACP connection back to a pair of MDF switches that are running CLAG/Multi-Chassis LAG for L2 redundancy. The gateways are advertised from the MDF using VRR (Virtual Router Redundancy) for L3 redundancy.
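For the L2/L3 hybrid, an MDF-side sketch in Cumulus’ ifupdown2 style might look like the following. The interface names, clag-id, MAC and IP addresses are all hypothetical; this only shows the shape of an MLAG bond plus a VRR gateway, not a complete configuration.

```
# Illustrative /etc/network/interfaces fragment on one MDF switch
auto bond-idf1
iface bond-idf1
    bond-slaves swp1
    clag-id 1            # matching clag-id on both MDF peers forms the MLAG

auto vlan20
iface vlan20
    address 10.10.20.2/24
    vlan-id 20
    vlan-raw-device bridge
    # VRR: both MDF switches answer for the shared virtual gateway
    address-virtual 44:38:39:ff:00:20 10.10.20.1/24
```

The IDF sees one logical LACP bond toward the MDF pair, and hosts see one gateway address, even though either MDF can fail independently.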

In terms of managing both designs, we would recommend a network automation tool, such as Ansible, to configure all of the IDF and MDF ports. An Ansible playbook becomes the single point of management for all of the switches, so you don’t have to configure either design in a box-by-box manner.
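A playbook along these lines could serve as that single point of management. The group names, template path and handler are assumptions for illustration; any inventory layout and templating scheme of your own would work just as well.

```yaml
# Illustrative playbook: render one interfaces file per switch from templates
- name: Configure campus IDF and MDF switches
  hosts: idf_switches:mdf_switches
  become: true
  tasks:
    - name: Render /etc/network/interfaces from per-host variables
      ansible.builtin.template:
        src: templates/interfaces.j2
        dest: /etc/network/interfaces
      notify: reload interfaces

  handlers:
    - name: reload interfaces
      ansible.builtin.command: ifreload -a
```

One `ansible-playbook` run pushes consistent configuration to every IDF and MDF, which is the “single keyboard” that stacking promised, without the shared management plane.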

In the Cumulus designs, the failure of a single IDF switch will not bring down the other IDF connections. Each switch can run a different version of code, independent of the other switches, and be rebooted without adversely affecting the other IDFs or the MDF.

A hidden benefit of this design is that it not only provides stability, but also gives you options you did not have before. For example, you can run any vendor’s switch in each IDF, which gives you a choice of what speeds and feeds to provide to your end users or compute nodes.

Hopefully, if there’s one helpful bit of advice you take away from this, it’s to not go along with something just because it’s popular — always be on the lookout for innovation! Speaking of innovation, if this blog post has you interested in leveraging automation to eliminate box-by-box configuration, we’ve got just the podcast for you. Tune into episode one of Kernel of Truth, a Cumulus Networks podcast, for an in depth discussion of how incorporating automation into your skillset can optimize your network and bolster your career goals. Happy listening!


Eric Pulvino is a Senior Consulting Engineer on our Professional Services team. Before he became Cumulus Curious(TM) he worked for Cisco, consulting on large service provider networks for various household names. Today he works with customers in all stages of the open networking pipeline, from initial product training, on to architecture and design, as well as the deployment and operation phases. He is not sure whether he loves Linux or networking more, but is happy to work at Cumulus Networks where he doesn’t have to choose. When not on the clock, he is frequently annoying his family by writing all kinds of Python-based home automation.

One Comment

Hi Mike and Eric,
Thanks for the interesting article. You addressed three important points for campus networks: device management in general, and for core and distribution layers, the stretching of L2 domains and hardware redundancy. I am missing one consideration though: stacking (or working with chassis) allows for redundant uplinks for a lot of ports per closet in the access layer using only two fibers. In the design you suggest I might end up doing per-switch LACP from the access layer to the distribution layer, needing a lot of fibers for redundancy, and this will drive up the cost a lot. By using chassis switches I can stick to your design and use fewer fibers, but stacking is often cheaper to buy than chassis switches. As is often the case with design, no solution is perfect.
Kind regards, Jaap de Vos