Sharing state between host and upstream network: LACP part 3

So far in the previous articles, we’ve covered the initial objections to LACP and taken a deep dive into the effect on traffic patterns in an MLAG environment without LACP/Static-LAG. In this article we’ll explore how LACP differs from all other available teaming techniques and then show how it could’ve solved a problem in one particular deployment.

Ships in the night

An important element to consider is that LACP is the only uplink protocol supported by VMware that directly exchanges any network state information between the host and its upstream switches. An ESXi host is sort of a host, but also sort of a network switch (insofar as it forwards packets locally and makes path decisions for north/south traffic). Herein lies the problem: we effectively have network devices forwarding packets between each other, but not exchanging much in the way of state. That can lead to some bad things…

This is also the reason I chose to write this post. I’ve seen many others describe in detail LBT vs etherchannel/LACP (nice articles @vcdxnz01, btw), but none that go into much detail on the implications of this particular point.

The main piece of information of interest is topology change. For example, if you remove a physical NIC from a (d)VirtualSwitch, how is the network notified of this change? If a switch loses all its uplinks or is otherwise degraded, how are the hosts notified?

The intent here is to give the host and switches sufficient information on the current topology so they can dynamically make the best path decisions, as close to the traffic source as possible.

Without LACP, the network will need to make link forwarding decisions independently based on:

1. link state (physical port up / down)
2. MAC learning

It also means that if the switch or the host wants to influence the other to use an alternative path, the only mechanism available is to bring the link administratively down.
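To make those two inputs concrete, here is a minimal Python sketch of a switch that has only link state and a MAC table to go on. The `Switch` class and the port names are hypothetical and not modeled on any vendor's implementation; MAC aging and VLANs are left out.

```python
class Switch:
    """Toy switch that knows only link state and learned MACs."""

    def __init__(self, port_names):
        self.link_up = {name: True for name in port_names}
        self.mac_table = {}  # MAC address -> port it was last seen on

    def learn(self, src_mac, port):
        self.mac_table[src_mac] = port

    def set_link(self, port, up):
        self.link_up[port] = up
        if not up:
            # Flush MACs learned on a port that went down.
            self.mac_table = {m: p for m, p in self.mac_table.items() if p != port}

    def egress_ports(self, dst_mac, ingress):
        port = self.mac_table.get(dst_mac)
        if port is not None and self.link_up[port]:
            return [port]
        # Unknown destination: flood out every other port that is up.
        return [p for p, up in self.link_up.items() if up and p != ingress]


sw = Switch(["swp1", "swp2", "swp3"])
sw.learn("vm1-mac", "swp1")
before = sw.egress_ports("vm1-mac", ingress="swp3")
print(before)  # ['swp1']

# The host stops using the NIC behind swp1 but the link stays up:
# the switch has no way of knowing, so its decision is unchanged.
unchanged = sw.egress_ports("vm1-mac", ingress="swp3")
print(unchanged)  # ['swp1']

# Only an administrative link-down changes the switch's view.
sw.set_link("swp1", up=False)
after_down = sw.egress_ports("vm1-mac", ingress="swp3")
print(after_down)  # ['swp2']
```

In this model, nothing the host does short of dropping the link alters what the switch forwards.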

How lack of topology change notification could cause problems

Consider the following scenario:
When one of the 10G VMNICs is removed from the vDS of the ESXi host (using vCenter), in some cases it takes a long time (on the order of minutes) for the traffic to switch over. That seems strange given that MLAG should switch over the traffic on the order of seconds (usually a lot less).

What could explain this behavior? Assume the switches are configured per the network vendor’s best practice (i.e. host-facing bonds) and the VMware consultant has made similar configurations/recommendations for the host config: in this case, LBT.

In this scenario, host-facing LACP bonds had been set up, with LACP bypass enabled. LACP bypass is effectively a “static bond” setup until the first LACP frame is received; from then on the port behaves as a normal LACP bond. This mode is normally used to allow PXE booting and initial configuration of the host, since LACP config can only be applied once the ESXi host is licensed, added to vCenter, then added to a DVS and configured.

Figure 1a: MLAG with Static bonds, ESX with LBT.

Figure 1b: Traffic path between VM1 and VM8

The ESXi host had not been configured with LACP or IP HASH. Figure 1a shows this base topology, assuming initial MAC learning has already occurred.

With both ESXi physical NICs / uplinks in the same vSwitch, VMs (and vmkernel interfaces) could be pinned to either link, but return traffic could still be received via either physical adapter. This is the default behavior of ESXi vSwitches. Figures 1b and 1c show this traffic path between VM1 and VM8.

Figure 1c: Traffic flow from VM8 to VM1

The problem comes when the topology changes, say by removing an uplink from the vSwitch. The switches are completely oblivious to this change, as the host hasn’t signaled it to them in any way.

Figure 1d: The failure scenario

The packet is sent out the configured uplink1 successfully to the destination VM, but the reply could come back via NIC2, which is no longer part of the vSwitch, so the packet will be dropped by ESX.
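The failure in figure 1d can be reproduced with a toy model. This is a sketch under assumptions: the names (`LOCAL_NIC`, `PEER`, `delivered`) are hypothetical, and the only MLAG behavior modeled is that a ToR uses its own bond member while it considers it usable, and otherwise crosses the peerlink.

```python
# Which host NIC each ToR switch is cabled to (hypothetical names).
LOCAL_NIC = {"switch1": "vmnic1", "switch2": "vmnic2"}
PEER = {"switch1": "switch2", "switch2": "switch1"}


def delivered(ingress_switch, member_active, vswitch_uplinks):
    """Return True if a frame for VM1 arriving at ingress_switch reaches the vSwitch."""
    switch = ingress_switch
    if not member_active[switch]:
        # Local bond member is unusable, so cross the peerlink to the other ToR.
        switch = PEER[switch]
        if not member_active[switch]:
            return False
    # The host drops frames that arrive on a NIC that is not a vSwitch uplink.
    return LOCAL_NIC[switch] in vswitch_uplinks


uplinks = {"vmnic1"}  # vmnic2 has been removed from the vSwitch

# Static bond: switch2 still sees its link as up and keeps using it.
static_result = delivered("switch2", {"switch1": True, "switch2": True}, uplinks)
print(static_result)  # False

# Once switch2 stops using the link (shut down or removed from the bond),
# the same frame crosses the peerlink and arrives on vmnic1.
fixed_result = delivered("switch2", {"switch1": True, "switch2": False}, uplinks)
print(fixed_result)  # True
```

The frame is only lost in the first case, where the switch and the host disagree about whether the link is in use.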

In my mind there are two ways of looking at the problem:

“That’s a configuration mistake”: the host config and switches don’t match, so of course there will be a problem, change the config of either the host or the switches!

Shouldn’t the host somehow message the switches that it’s no longer using this port as an uplink?

Change the config
Easy! There are a couple of options:

Fix the host-config: Add the uplink back to the vSwitch or shutdown the uplink.

Fix the switch-config: Remove the dual-connected config from the ToRs (and accept the consequences of orphan ports described in Part 2).

Message the topology change
This can be achieved in a couple of ways:

Manually shut down the physical uplink, so switch2 no longer uses that path.

Enable LACP and let the LACP driver take care of it (let’s explore that a little further)

How LACP enables topology exchange

The LACP driver on the switches and the driver on the ESX host are exchanging information / status using an LACP “Data Unit” frame.

The important part is that the other end of an LACP link can make forwarding decisions based on the information it receives in the LACP-DU. This provides a mechanism for an endpoint to signal a change in link state and have the other side do what’s appropriate.

If a DU is not received, or an incorrect/unexpected DU is received, the link will normally be removed from the bond and the receiver will immediately stop forwarding via that link. Let’s explore that in this particular scenario.
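As a rough illustration of the missing-DU case, here is a minimal Python sketch of a bond member that stops forwarding once DUs stop arriving. The `BondMember` class is hypothetical and far simpler than the real LACP state machines. The timeout values are the standard short and long LACP timeouts (3 seconds and 90 seconds, three missed DUs at the fast and slow transmit rates); which one applies depends on how the bond is configured.

```python
from dataclasses import dataclass

SHORT_TIMEOUT = 3.0   # seconds; partner sends a DU every 1 s (fast rate)
LONG_TIMEOUT = 90.0   # seconds; partner sends a DU every 30 s (slow rate)


@dataclass
class BondMember:
    """Toy bond member that tracks only when the last DU arrived."""
    name: str
    timeout: float = SHORT_TIMEOUT
    last_du_at: float = 0.0

    def receive_du(self, now: float) -> None:
        self.last_du_at = now

    def forwarding(self, now: float) -> bool:
        # Keep the link in the bond only while DUs are still fresh.
        return (now - self.last_du_at) <= self.timeout


swp1 = BondMember("swp1")  # hypothetical switch port facing the host NIC
for t in (0.0, 1.0, 2.0):
    swp1.receive_du(now=t)

# At t=2 the host removes the NIC from its LAG, so no further DUs arrive.
print(swp1.forwarding(now=4.0))  # True: still inside the 3 s window
print(swp1.forwarding(now=6.0))  # False: timed out, link leaves the bond
```

With the long timeout the same port would keep forwarding for up to 90 seconds after the last DU, so the timer choice directly bounds how long traffic can be blackholed.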

Figure 2a

In figure 2a (above), both ports are members of the same uplink group and LACP-DU’s flow to/from both ToR switches.

Figure 2b

Then a topology change happens at the host, which is described in figure 2b.

Vmnic2 is removed from the LACP uplink group at the host “ESX1”

Switch2 fails to receive a DU within the timeout window, so the port is forced “proto-down”

ESX1 is now treated as a singly connected host, vmnic2 is not used. Figure 2c shows the traffic flow in the forward direction.

Figure 2c

Figure 2d

Figure 2d shows the reverse traffic flow. Note that it will correctly use the peerlink.

Other scenarios
It should hopefully go without saying, but having a message protocol to advertise changes goes both ways: the network can also inform the host of any changes upstream.

For example, in the case of an MLAG daemon failure, or a split-brain scenario, you would not want the hosts to keep forwarding on the assumption that both links are active and working as normal. LACP allows the switches to advertise such a scenario, without necessarily having to tear down one of the local links itself (which each individual switch could easily get wrong).

Figure 3a

In a true split-brain scenario, one valid approach is to treat the two switches as independent again. Remember, an LACP bond can only form across links advertising the same system ID; different system IDs let the host know it is wired to two separate switches. The host LACP driver can then decide which of the links to disable. This is the ideal scenario because, in a true split brain, each switch may not be fully aware of whether its peer is up and forwarding. Having the host make the decision effectively adds a witness to the scenario to make the tiebreaker decision.

Figure 3b

According to the LACP spec (and our testing of several host bonding drivers confirms this), when a host receives an LACP-DU with a new system ID (222222), while the other link(s) are still up with the shared system ID (111111), the link with the changed system ID will be removed from the bundle and brought down. This is what would happen during a peerlink failure as shown above.
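The behavior described above can be sketched in a few lines of Python. This is an illustration only: `HostBond` is a hypothetical class, not any real bonding driver, and it assumes the bundle keeps the system ID it first formed with and that the ToR behind vmnic2 is the one that starts advertising a new ID.

```python
class HostBond:
    """Toy host-side bond that groups links by the partner's system ID."""

    def __init__(self):
        self.bundle_system_id = None  # ID the bundle originally formed with
        self.partner_ids = {}         # NIC name -> system ID in the latest DU

    def receive_du(self, nic, system_id):
        if self.bundle_system_id is None:
            self.bundle_system_id = system_id
        self.partner_ids[nic] = system_id

    def active_links(self):
        # Only links whose partner still advertises the bundle's ID stay in.
        return sorted(nic for nic, sid in self.partner_ids.items()
                      if sid == self.bundle_system_id)


bond = HostBond()
bond.receive_du("vmnic1", "111111")
bond.receive_du("vmnic2", "111111")
healthy = bond.active_links()
print(healthy)  # ['vmnic1', 'vmnic2']

# Peerlink fails; the switch behind vmnic2 now advertises its own system ID.
bond.receive_du("vmnic2", "222222")
split = bond.active_links()
print(split)  # ['vmnic1']
```

The host ends up forwarding on a single link without either switch having to guess the state of its peer.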

Wrapping up
Ok, well that was more of a novel than I originally planned to write. Hopefully I’ve done a little to bridge the gap between host networking and the implications with the upstream network.

The summary I’d like to present is this: In a fundamentally Active-Active network fabric, an Active-Active host connectivity option with a standard state exchange mechanism is the way to go.

Doug Youd is one of our veteran Systems Engineers; he’s been with Cumulus since soon after our launch in 2013. Prior to joining Cumulus, he worked for a range of organizations architecting and implementing cloud infrastructure. At Cumulus he’s an SE for our US-West region and helps deliver real-world web scale networking solutions to some of our most cutting-edge customers. His expertise with virtualization technology often sees him post about the intersections between networking and virtualized stacks. Outside of work, he can be found tinkering with cars and occasionally racing at local tracks.