Path diversity and network capacity

Imagine driving down highway 101 in the Bay Area from San Francisco down to San Jose. Suppose that you see brake lights somewhere just north of the airport, and your GPS is kind enough to tell you that severe traffic is ahead. So you hop off 101 and shoot over to 280 South. You know that 280 is actually a longer route in terms of mileage, but there is little doubt in the moment that you will get south faster. You eventually end up having to get onto highway 85. After one exit of stop-and-go traffic, you get off the freeway and decide to take surface roads the rest of the way to your destination.

Two things happen when you decide to forgo 101 and use 280 instead. First, your trip is made better because, while longer, the path you choose is actually faster. Second, everyone on 101 benefits (albeit very slightly) because there is one fewer car on 101. The not-so-subtle subtext here is that your experience depends less on the distance traveled and more on how long the trip takes.

This example plays itself out in networks every day. Except in networks, we have artificially limited the choices that are available.

In most networks, regardless of the routing protocols that are in use, the available paths are determined by the Shortest Path First (SPF) algorithm. This bit of computational math, Dijkstra’s algorithm, dates back to the 1950s. It underpins virtually everything we do in networking today. And to be fair, for something conceived so long ago, it has held up surprisingly well. In fact, leaf-spine architectures are quite dependent on equal-cost multipath (ECMP), which is, in turn, directly dependent on Dijkstra’s work.

One of the issues with using SPF is that it essentially limits the possible paths between end points to those with the lowest total cost, which in practice usually means the fewest hops. Using my highway traffic analogy from earlier, this would basically mean that you don’t have the option to take 280 because it is not the shortest path.
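To make the restriction concrete, here is a minimal sketch in Python. The six-node topology, the node names, and the unit cost per hop are all assumptions made for illustration; real protocols use configured link metrics rather than plain hop counts. It compares the paths an SPF/ECMP-style selection would keep with every loop-free path that actually exists.

```python
from collections import deque

# Hypothetical topology (adjacency list). S is the source, T the destination.
# S-A-T and S-B-T are 2 hops each; S-C-D-T is a longer, 3-hop alternative.
GRAPH = {
    "S": ["A", "B", "C"],
    "A": ["S", "T"],
    "B": ["S", "T"],
    "C": ["S", "D"],
    "D": ["C", "T"],
    "T": ["A", "B", "D"],
}


def hop_counts(graph, src):
    """Breadth-first search: fewest hops from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def simple_paths(graph, src, dst, path=None):
    """Depth-first enumeration of every loop-free path from src to dst."""
    path = (path or []) + [src]
    if src == dst:
        return [path]
    found = []
    for nbr in graph[src]:
        if nbr not in path:  # never revisit a node, so paths stay loop-free
            found.extend(simple_paths(graph, nbr, dst, path))
    return found


def equal_cost_shortest_paths(graph, src, dst):
    """Keep only the paths whose hop count equals the minimum."""
    best = hop_counts(graph, src)[dst]
    return [p for p in simple_paths(graph, src, dst) if len(p) - 1 == best]


every_path = simple_paths(GRAPH, "S", "T")
ecmp_paths = equal_cost_shortest_paths(GRAPH, "S", "T")
print(len(every_path), "loop-free paths,", len(ecmp_paths), "usable under SPF/ECMP")
```

In this toy topology the S-C-D-T path plays the role of 280: it exists, it has capacity, and shortest-path selection never uses it.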

Aside from removing longer but better alternative paths, this has an impact on capacity for the entire system. By not allowing you to select a different path, your car is added to the thousands of other cars on 101. Not only do you not benefit from the alternate route, but all your fellow drivers must now deal with increased load on the path they have selected.

So how does this intersect network architecture?

When we architect a network, we do so with a couple of things in mind. First, we care a lot about the number of access ports (think of these as the freeway onramps in my running example). And then we care a lot about capacity between points in the architecture (the number of lanes in the freeways). If we do not know with any precision what traffic looks like (and any kind of application workload mobility makes this even harder), then we have to plan for an any-to-any traffic pattern. Essentially, we have to over-provision the network so that it stands up to worst-case loads.

The challenge here is that very few companies have the freedom to just indiscriminately build out their networks. Capacity comes with additional gear, and that equipment adds not only to capital costs but also to ongoing maintenance costs.

I have heard a few times that bandwidth is cheap, so architects should just throw bandwidth at all problems. But how can it be that equipment is simultaneously cheap and too expensive (see bare metal switching)?

At some point, people have to start to ask how they increase the utilization on their networks. The question is, if networks are only capable of using a small number of possible paths, how can utilization ever increase without creating unpalatable choke points in the network?

The answer to the capacity/utilization question lies in path diversity. Ultimately, networks need to be able to use as much of the available capacity as possible. This means that instead of being sprayed across a small number of equal-cost paths, traffic needs to be fanned out across a much larger set of non-equal-cost paths. By considering paths that are longer in terms of number of hops, networks at least have a fighting chance of delivering an improved experience.

But I would be lying if I said that all that was required was making use of non-equal-cost pathing algorithms. Having the option to fan traffic out across more paths helps, but once those paths are not equal, routers and switches cannot just spray traffic evenly across them. There needs to be some intelligence applied so that traffic is placed on paths that deliver whatever is required.
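One very simple form of that intelligence is to split traffic in proportion to what each path can carry rather than evenly. The sketch below is only an illustration of the idea: the path names, the capacity figures, and the hash-based flow placement are all assumptions, not a description of any particular product or protocol.

```python
import hashlib

# Hypothetical paths with illustrative bottleneck capacities in Gbps.
PATHS = {"S-A-T": 10, "S-B-T": 10, "S-C-D-T": 20}


def split_ratios(paths):
    """Share of traffic each path should receive, proportional to capacity."""
    total = sum(paths.values())
    return {name: cap / total for name, cap in paths.items()}


def pick_path(flow_id, paths):
    """Map a flow to a path with probability proportional to its capacity.

    Hashing the flow identifier keeps every packet of one flow on the same
    path, which avoids reordering. Capacities are assumed to be integers.
    """
    total = sum(paths.values())
    digest = hashlib.sha256(flow_id.encode()).digest()
    point = int.from_bytes(digest[:8], "big") % total
    for name, cap in paths.items():
        if point < cap:
            return name
        point -= cap


print(split_ratios(PATHS))
print(pick_path("10.0.0.1:443->10.0.0.2:51512", PATHS))
```

A capacity-weighted split like this still treats every flow the same way. It says nothing about what a given flow actually needs, which is the harder part of the problem.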

The temptation here is to think that everything will want to take the fastest possible path, but that is not always the only issue to consider. In my freeway analogy, you might conclude that every car should hop on 280 if they are headed all the way south. But what if I told you that my car is a Compressed Natural Gas car? Getting gas for me is an adventure because there are very few fill-up stations, and running out of gas is a tow. For me, I might really care about the shortest traveled distance. My colleague in a hurry to get to a meeting might optimize for time. A family with young children trying to get to dinner might care most about predictability of arrival time. There are all kinds of motivations.

So it is with applications and networking. Some applications really care about latency (database transactions, for example). Others might be quite noisy and need a lot of bandwidth (for instance, data replication in storage clusters). Some are more sensitive to loss or jitter (like voice applications). It could be that some tenants are more concerned with security and isolation of traffic (especially financial and healthcare transactions). The point is that whatever the consideration, whoever is responsible for fanning traffic out across the various non-equal-cost paths must be able to match the traffic to the appropriate path.
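As a rough sketch of what that matching could look like, the Python below picks a path per application priority. Everything in it is hypothetical: the `Path` fields, the candidate paths and their numbers, and the priority names are invented for illustration, and a real controller would work from measured, changing data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    name: str
    hops: int
    latency_ms: float
    free_gbps: float
    isolated: bool  # True if the path carries no other tenant's traffic


# Hypothetical candidate paths between the same two end points.
CANDIDATES = [
    Path("short-congested", hops=2, latency_ms=9.0, free_gbps=1.0, isolated=False),
    Path("long-fast", hops=3, latency_ms=2.0, free_gbps=4.0, isolated=False),
    Path("long-wide", hops=4, latency_ms=6.0, free_gbps=30.0, isolated=False),
    Path("dedicated", hops=4, latency_ms=7.0, free_gbps=5.0, isolated=True),
]

# What each priority tries to minimize (bandwidth is negated to maximize it).
PRIORITIES = {
    "latency": lambda p: p.latency_ms,
    "bandwidth": lambda p: -p.free_gbps,
    "hops": lambda p: p.hops,
}


def select_path(paths, priority, require_isolation=False):
    """Pick the best eligible path for the stated application priority."""
    eligible = [p for p in paths if p.isolated or not require_isolation]
    return min(eligible, key=PRIORITIES[priority])


print(select_path(CANDIDATES, "latency").name)    # database transactions
print(select_path(CANDIDATES, "bandwidth").name)  # storage replication
print(select_path(CANDIDATES, "latency", require_isolation=True).name)
```

Note that the only choice a shortest-path scheme would ever make here corresponds to the "hops" priority, which lands on the congested path.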

But how does the network know what is important?

This is why application abstractions matter. Without having a way to describe what is important to an application, the best that can happen is that traffic will be randomly sprayed across paths. This is, in part, why SPF has been suitable for so long. Minus the work required to describe applications, there was no need for doing anything with non-equal-cost paths. Now that SDN (and the abstractions it brings) is here, the question architects need to be asking is: how much more can I get out of my network?

[Today’s fun fact: In the White House, there are 13,092 knives, forks and spoons. But try to find just one spork…]

I didn’t mean to imply that the examples were mutually exclusive. My database/loss example had Hadoop in mind. If you are trying to design a system for fast transactions, you might care a lot about latency within the storage cluster. Certainly jitter and latency could impact voice. Loss might impact content streaming as well. I was a little bit loose in my description of potential impacts.