A common theme in your talks is that L2 does not scale. Do you mean that Transparent (Learning) Bridging does not scale due to its flooding? Or is there something else that does not scale?

As is oft the case, I’m not precise enough in my statements, so let’s fix that first:

There are numerous layer-2 protocols, but when I talk about layer-2 (L2) scalability in a data center context, I always mean Ethernet bridging (also known under its marketing name, switching), or more precisely, transparent bridging that floods broadcast, unknown unicast, and multicast frames (I love the BUM acronym) to compensate for the lack of host-to-switch and routing (MAC reachability distribution) protocols.

The first problem is the dismal control plane protocol (Spanning Tree Protocol in its myriad incarnations), combined with broken implementations of STP kludges. The forward-before-you-think behavior of Cisco’s PortFast and the lack of CPU protection on some switches immediately come to mind.

TRILL (or a proprietary TRILL-like implementation like FabricPath) would solve most of the STP-related issues once implemented properly (ignoring STP does not count as a properly scalable implementation in my personal opinion). However, we still have limited operational experience, and some vendors implementing TRILL might still face a steep learning curve before all the loop detection/prevention and STP integration features work as expected.

Every broadcast frame flooded throughout an L2 domain must be processed by every host participating in that domain (where L2 domain means a transparently bridged Ethernet VLAN or equivalent). Ethernet NICs do perform some sort of multicast filtering, but it’s usually hash-based and not ideal (for more information, read the multicast-related blog posts written by Chris Marget).
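To see why hash-based multicast filtering is not ideal, here is a minimal sketch of such a filter. The 64-bucket table and the use of the top six bits of CRC-32 are assumptions made for illustration (actual NICs differ in hash function and table size), and `ipv4_group_mac` is a hypothetical helper.

```python
import zlib

BUCKETS = 64  # assumed size of the NIC's multicast hash table


def bucket(mac: str) -> int:
    """Map a MAC address to a hash bucket (top 6 bits of CRC-32, an assumption)."""
    raw = bytes.fromhex(mac.replace(":", ""))
    return zlib.crc32(raw) >> 26  # value in 0..63


class HashFilter:
    """Toy model of an imperfect (hash-based) multicast filter."""

    def __init__(self) -> None:
        self.bits = 0  # one bit per bucket

    def subscribe(self, mac: str) -> None:
        self.bits |= 1 << bucket(mac)

    def accepts(self, mac: str) -> bool:
        # The NIC only knows the bucket, not the exact address.
        return bool(self.bits >> bucket(mac) & 1)


def ipv4_group_mac(index: int) -> str:
    """Hypothetical helper: MAC address of IPv4 multicast group 239.0.x.y."""
    return "01:00:5e:00:%02x:%02x" % (index >> 8, index & 0xFF)


nic = HashFilter()
wanted = [ipv4_group_mac(i) for i in range(4)]
for mac in wanted:
    nic.subscribe(mac)

# Groups the host never joined; some of them share a bucket with a wanted group.
unwanted = [ipv4_group_mac(i) for i in range(4, 1004)]
leaked = sum(nic.accepts(mac) for mac in unwanted)
print(f"{leaked} of {len(unwanted)} unwanted groups pass the NIC filter")
```

Every frame that leaks through the filter has to be inspected and discarded by the host CPU, which is exactly the work the filter was supposed to avoid.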

While Ethernet NICs usually ignore flooded unicast frames (those frames still eat the bandwidth on every single link in the L2 domain, including host-to-switch links), servers running hypervisor software are not that fortunate. The hypervisor requirements (number of unicast MAC addresses within a single physical host) typically exceed the NIC capabilities, forcing hypervisors to put physical NICs in promiscuous mode. Every hypervisor host thus has to receive, process, and oft ignore every flooded frame. Some of those frames have to be propagated to one or more VMs running in that hypervisor and further processed by them (assuming the frame belongs to the proper VLAN).

In a typical every-VLAN-on-every-access-port design, every hypervisor host has to process every BUM frame generated anywhere in the L2 domain (regardless of whether its VMs belong to the VLAN generating the flood or not).
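A back-of-the-envelope model makes the cost visible. The VLAN names and frame rates below are made-up numbers, and `host_load` is a hypothetical helper; the only point is that a promiscuous-mode NIC sees the sum of BUM traffic across all VLANs on its trunk, not just the VLANs its VMs use.

```python
# Hypothetical BUM rates (frames per second) per VLAN; made-up numbers.
bum_fps = {"vlan10": 200, "vlan20": 50, "vlan30": 750}


def host_load(trunked_vlans, rates):
    """Frames/s a promiscuous-mode hypervisor NIC must receive and inspect."""
    return sum(rates[vlan] for vlan in trunked_vlans)


all_vlans = set(bum_fps)   # every VLAN configured on every access port
vm_vlans = {"vlan20"}      # VLANs that actually have VMs on this host

every_vlan_everywhere = host_load(all_vlans, bum_fps)
pruned = host_load(vm_vlans, bum_fps)

print(f"every VLAN on every port: {every_vlan_everywhere} frames/s")
print(f"only VLANs with local VMs: {pruned} frames/s")
```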

You might be able to make bridging scale better if you implemented a fully IP-aware L2 solution. Such a solution would have to include an ARP proxy (or central ARP servers), IGMP snooping, and a total ban on other BUM traffic. TRILL as initially envisioned by Radia Perlman was moving in that direction, but it got thoroughly crippled and force-fit into the ECMP bridging rathole by the IETF working group.
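The ARP proxy part of such a solution could look roughly like the sketch below. The `ArpProxy` class is purely illustrative and not modeled on any product; how the IP-to-MAC bindings get populated (snooping, orchestration system, central server) is left open, and dropping requests for unknown targets is an assumption that follows from the "total ban on other BUM traffic".

```python
from typing import Dict, Optional


class ArpProxy:
    """Toy edge-switch ARP proxy: answers locally instead of flooding."""

    def __init__(self) -> None:
        self.bindings: Dict[str, str] = {}  # IP address -> MAC address
        self.suppressed = 0                 # requests that were not flooded

    def learn(self, ip: str, mac: str) -> None:
        # Assumption: bindings arrive from some control-plane source.
        self.bindings[ip] = mac

    def handle_request(self, target_ip: str) -> Optional[str]:
        """Return the MAC address for a reply, or None if the request is dropped."""
        self.suppressed += 1  # in neither case is the broadcast flooded
        return self.bindings.get(target_ip)


proxy = ArpProxy()
proxy.learn("10.1.1.10", "02:00:00:00:00:0a")

print(proxy.handle_request("10.1.1.10"))  # known target: reply sent locally
print(proxy.handle_request("10.1.1.99"))  # unknown target: dropped, not flooded
```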

Lack of addressing hierarchy is another stumbling block. Modern data center switches (most of them using the same hardware) support up to 100K MAC addresses, so other problems will probably kill you way before you reach this milestone.
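To illustrate what addressing hierarchy buys you, the following sketch compares 256 hosts behind a single access switch as seen by a remote device. The IP and MAC addresses and the `uplink1` port name are made up; the summarization uses Python's standard `ipaddress` module.

```python
import ipaddress

# 256 hosts behind one access switch (made-up addresses).
host_routes = [ipaddress.ip_network(f"10.1.1.{i}/32") for i in range(256)]
macs = [f"02:00:00:00:01:{i:02x}" for i in range(256)]

# Routing: IP addresses are hierarchical, so a remote device can carry
# a single summary prefix instead of 256 host routes.
summary = list(ipaddress.collapse_addresses(host_routes))

# Bridging: MAC addresses carry no topology information, so a remote
# switch needs one forwarding entry per active MAC address.
mac_table = {mac: "uplink1" for mac in macs}

print(summary)
print(len(summary), "IP prefix vs", len(mac_table), "MAC entries")
```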

Finally, every L2 domain (VLAN) is a single failure domain (primarily due to BUM flooding). There are numerous knobs you can try to tweak (storm control, for example), but you cannot change two basic facts:

A software glitch in a switch that causes a forwarding (and thus flooding) loop involving core links will inevitably cause a network-wide meltdown (due to lack of TTL field in L2 headers);

Uncontrolled flooding started by any host or VM attached to a VLAN (due to a software glitch, virus, malware, you name it) will impact all other hosts (or VMs) attached to the same VLAN, as well as all core links. A bug resulting in broadcasts will also impact the CPU of all layer-3 (IP) switches with IP addresses configured in that VLAN.
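The first of these two facts (no TTL field in L2 headers) can be illustrated with a toy simulation. It assumes switches that flood a broadcast frame on every inter-switch link except the one it arrived on, with no spanning tree blocking any port; the topologies and the `flood` function are hypothetical, and the optional `ttl` parameter models what a hop count would do if the L2 header had one.

```python
def flood(links, source, hops, ttl=None):
    """Return the number of frame copies in flight after each hop."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Each copy is (sending switch, receiving switch, remaining TTL).
    in_flight = [(source, n, ttl) for n in sorted(neighbors[source])]
    history = [len(in_flight)]

    for _ in range(hops):
        next_round = []
        for sender, receiver, remaining in in_flight:
            if remaining is not None:
                remaining -= 1
                if remaining <= 0:
                    continue  # TTL expired: the copy is dropped
            # Flood on every link except the one the frame arrived on.
            for n in sorted(neighbors[receiver]):
                if n != sender:
                    next_round.append((receiver, n, remaining))
        in_flight = next_round
        history.append(len(in_flight))
    return history


triangle = [("A", "B"), ("B", "C"), ("C", "A")]
full_mesh = [("A", "B"), ("A", "C"), ("A", "D"),
             ("B", "C"), ("B", "D"), ("C", "D")]

print(flood(triangle, "A", 5))          # copies circulate forever
print(flood(full_mesh, "A", 4))         # copies multiply on every hop
print(flood(full_mesh, "A", 4, ttl=3))  # a hop count would end the storm
```

In the triangle a single broadcast keeps two copies circulating indefinitely; in the four-switch full mesh the number of copies doubles on every hop until the links or the switch CPUs give up.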

IPv6 uses L2 multicast instead of L2 broadcast. Multicast and broadcast frames are flooded in the same way (unless the switches use IGMP and/or MLD snooping, in which case they can limit IPv4 and/or IPv6 multicast flooding), and the impact on hypervisors would be the same.

Ivan Pepelnjak, CCIE#1354 Emeritus, is an independent network architect. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.
