I occasionally run on-site design workshops, and although I don’t keep track, my guess is that at least 25% of the companies I run workshops for experienced a more-or-less catastrophic bridging-caused meltdown in the not-so-distant past. Sometimes it stays within a data center and impacts the performance of all hosts attached to the affected VLAN; sometimes they manage to bring down two data centers (hooray for stretched VLANs).

It might be selection bias. Customers who engage me usually run complex environments that are almost by definition prone to weird failures. On the other hand, at least some of them run well-managed environments, and they got a bridging loop even though they did all the right things.

It might be confirmation bias. I keep telling people how dangerous large L2 environments are, so I might remember those workshops where they told me “yeah, that happened recently” better than others.

Or it could be that the vendors are truly peddling broken technology (of course only because the customers ask for it, right?) and we’re paying the price of CIOs or high-level architects making decisions based on glitzy PowerPoints and “impartial” advice from $vendor consultants.


10 comments:

I think you have an infatuation with layer 2. The problem is that noobs hear this and then disable spanning tree. Of course it is going to crash. If you disable spanning tree, you still need a path protection protocol. The biggest problem in data centre meltdowns is poor facilities management. People don't do the work right: https://www.iotforall.com/doing-data-center-work-right-checklist/
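To make the commenter's point concrete: even with spanning tree enabled, edge ports usually need extra guardrails so a mispatched cable or a rogue switch can't melt the VLAN. A minimal sketch, assuming Cisco IOS-style syntax; the interface name and thresholds are illustrative, not from the original post:

```
! Illustrative edge-port protection (assumed IOS-style syntax)
interface GigabitEthernet1/0/1
 description server-facing edge port
 spanning-tree portfast              ! skip listening/learning on an edge port
 spanning-tree bpduguard enable      ! err-disable the port if a BPDU arrives
 storm-control broadcast level 1.00  ! cap broadcast traffic at 1% of link bandwidth
```

BPDU guard turns "someone plugged a switch into a server port" from a potential forwarding loop into a single disabled port, and storm control limits the blast radius if a loop forms anyway.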

When a network grows beyond a certain size (I always maintained that the tipping point is somewhere between 500 and 750 nodes), something is broken somewhere all the time. That is normal.

However, human and automated responses to these events and states differ greatly, and are very dependent on the corporate and engineering culture.

In places that are quick to assign blame and look for culprits, this leads to design paralysis of always doing what vendors, consultants, and architects propose. Because that’s where the blame will inevitably end up.

In places that understand that if humans do work, humans will err, and there will be outages, the situation is different.

So, what does this have to do with meltdowns? Pretty much everything.

Large networks, operated by humans or by human-designed automation, will melt down. The real question to ask is not how common these meltdowns are (they are common), but rather how common repeated, visible meltdowns are.

In the case of the first environment I described, I’d be willing to bet it was common.

Obviously Type-A organizations will gladly continue failing and blame everyone else... but do you think that Type-B organizations eventually evolve toward a sane applications + infrastructure stack, or do they stay stuck in some well-managed local minimum like "yeah, we have to do long-distance VLANs, but at least they work reasonably well"?

If a Type-B organization determines that the root cause is an “unreasonably large L2 domain”, then it would address it.

The problem there, again, is really not the L2 domain but rather the reason it’s in place, and let’s be frank — it’s VMware, and its historically not-quite-optimal idea of how networks work.

Three years ago, I’d sit and argue that’s an unsolvable problem. These days, with things like Kubernetes, hybrid and on-premises clouds, and — dare I beat our own drum — Anthos, reasons to rely on the ancient concepts of vMotion and VMs are few and far between.

Marko - "reasons to rely on ancient concepts of vMotion and VMs are few and far between" — you'd be surprised... this is the norm in the "modern" enterprise. Actually, migrating from L2 to L3 leaf-spine with an NSX-T overlay is the state of the art...

The author

Ivan Pepelnjak (CCIE#1354 Emeritus), Independent Network Architect at ipSpace.net, has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced internetworking technologies since 1990.