Introducing Redistribute Neighbor

The existing landscape

Are we feeding an L2 addiction?

One of the fundamental challenges in any network is placement and management of the boundary between switched (L2) and routed (L3) fabrics. Very large L2 environments tend to be brittle, difficult to troubleshoot and difficult to scale. With the availability of modern commodity switching ASICs that can switch or route at similar speeds/latency, smaller L3 domains become easier to justify.

There is a recent strong trend towards reducing the scale of L2 in the data center and instead using routed fabrics, especially in very large scale environments.

However, L2 environments are typically well understood by network/server operations staff and application developers, which has slowed adoption of pure L3-based fabrics. L3 designs also have some other usability challenges that need to be mitigated.

This is why the L2 over L3 (AKA “overlay” SDN) techniques are drawing interest; they allow admins to keep provisioning how they’re used to. But maybe we’re just feeding an addiction?

Mark Burgess recently wrote a blog post exploring in depth how we got here and offering some longer term strategic visions. It’s a great read, I highly encourage taking a look.

Existing L3 Fabric Options

Option 1: Move the L3 boundary to the Rack/ToR Level with Subnets

Typically a routed fabric would segment the network such that each rack is assigned a subnet boundary; inter-rack connections would use a default gateway to cross this subnet boundary and be routed.

One trade-off with this approach is the additional complexity of managing IP subnets, and the fact that IP mobility is limited to a single rack. This is often too rigid, as service movement across rack boundaries is common (as with vMotion), and most applications will not survive an IP address change when moving racks, since that resets L4 state.

Additionally, hosts will typically have redundant L2 connections to a pair of Top-of-Rack (ToR) switches, so L2 tricks like MLAG or stacking gain a foothold and often expand out of the rack over time.

Option 2: Routing Configured Down to Host

Another approach is to run a routing protocol at the host to advertise prefixes (usually /32 host routes) directly into the L3 fabric. This allows IP mobility between racks, relaxing the subnet-per-rack limitation. However the trade-off is additional complexity managing a routing daemon on all hosts and the scalability of such a solution.

Routing at the host also allows multiple links to each host, without using something like MLAG. Hosts simply advertise prefixes via both ToRs; remote hosts see two routes and load balance across both paths using ECMP.

Where Does that Leave Us?

Ideally, there would be an option that has the configuration simplicity of option 1, but with the IP mobility, ECMP and other dynamic properties of option 2.

Introducing Redistribute Neighbor

Redistribute neighbor provides a mechanism that allows IP subnets to span racks without forcing the end hosts to run a routing protocol. Cumulus Linux uses the existing concept of redistributing one protocol into another to help simplify the transition to L3 fabrics.

The components are quite simple:

ARP: Get a list of local neighbors.

Redistribution: Push those neighbors into the routed fabric as /32 host routes.

Getting the Local Neighbor List (basically just ARP)

The first problem to solve at the L2/L3 boundary is compiling a list of IP addresses that are hosted in the southbound L2 domain. The challenge is to accurately compile and update this list of reachable hosts (or neighbors). Luckily, existing commonly-deployed protocols are available to solve this problem.

ARP is used by hosts to resolve MAC addresses when sending to an IPv4 address. Hosts build an ARP cache of known MAC-to-IPv4 pairs as they receive or respond to ARP requests. In Linux, this is stored in the kernel's IPv4 neighbor table. Similarly, IPv6 uses neighbor discovery (ndisc) to resolve IPv6 addresses to MAC addresses; those mappings are stored in an IPv6 neighbor table.
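Both tables can be inspected directly on a Linux box; the entry shown in the comment below is purely illustrative:

```
# IPv4 neighbor (ARP) table
ip -4 neighbor show
#   e.g. 10.1.0.1 dev swp1 lladdr 00:02:00:00:00:01 REACHABLE

# IPv6 neighbor (ndisc) table
ip -6 neighbor show
```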

If the L2/L3 boundary is moved to the ToR, with the ToR acting as the default gateway for the hosts within the rack, its ARP table will contain a list of all hosts that have ARPed for their default gateway. In many scenarios, this table contains all the L3 information you need; what's missing is a mechanism for formatting and syncing this table into a routing protocol. That is primarily what redistribute neighbor does.

The Cumulus routing team wrote a small Python module (python-rdnbrd) that takes the ARP table, applies some basic filtering and formatting, and installs the resulting /32 routes into an arbitrary Linux route table. That table can then be referenced by the routing protocol, which is where redistribution comes in (I'll get to that in a sec).
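As a rough illustration of that filter-and-format step, here is a minimal Python sketch. It assumes `ip -4 neighbor show`-style input lines and a dedicated kernel route table; the table number, the accepted states and the function name are all illustrative, and the real python-rdnbrd daemon does considerably more (netlink, interface tracking, and so on):

```python
# Sketch: turn "ip -4 neighbor show" style entries into /32 route
# commands destined for a dedicated kernel route table. Illustrative
# only -- not the actual python-rdnbrd implementation.

RDNBRD_TABLE = 10  # arbitrary kernel route table number (an assumption)

def neighbors_to_routes(neigh_lines, table=RDNBRD_TABLE):
    """Filter usable IPv4 neighbors and format /32 route commands."""
    routes = []
    for line in neigh_lines:
        fields = line.split()
        # Expected shape: <ip> dev <ifname> lladdr <mac> <state>
        if len(fields) < 6 or fields[-1] not in ("REACHABLE", "STALE", "PERMANENT"):
            continue  # skip FAILED/INCOMPLETE entries and short lines
        ip, dev = fields[0], fields[2]
        routes.append(f"ip route replace {ip}/32 dev {dev} table {table}")
    return routes

neigh = [
    "10.1.0.1 dev swp1 lladdr 00:02:00:00:00:01 REACHABLE",
    "10.1.0.9 dev swp2 FAILED",
]
print(neighbors_to_routes(neigh))
# → ['ip route replace 10.1.0.1/32 dev swp1 table 10']
```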

We also added a few other tricks. For example, the daemon tracks the physical interface each ARP entry was learned on; if that interface goes down, the entry is pulled immediately rather than waiting for a timeout (particularly useful when the failed interface is part of a bridge). This makes the solution react much more quickly to failures.

Redistribution

For those new to this (server admins/architects like myself, perhaps), in routed L3 land, prefixes or summarizations will often be redistributed into other routing domains or between protocols. For example, a common practice is:

Redistributing routes for locally hosted public IP addresses from an IGP (OSPF, for example) into an EGP (usually BGP).

Since we now have an accurate, up-to-date list of hosts, we just need to advertise reachability to those IP addresses into the routing fabric. Other hosts on the fabric can then use this new path to reach them; if multiple equal-cost paths are available, traffic load-balances across them natively (ECMP).

Cumulus Linux uses an enhanced Quagga build (and we regularly upstream our patches) as our routing suite. One of the enhancements is “import table”. This command imports the kernel table we populated previously into Quagga’s RIB and pushes it into another routing protocol.
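In Quagga terms, the configuration might look something like the following sketch; the table number and ASN are illustrative, and the exact syntax may vary between releases:

```
! zebra: import kernel route table 10 (where the /32 neighbor routes live)
ip import-table 10
!
! bgpd: push the imported routes into the fabric
router bgp 65001
 redistribute table 10
```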

So what about the hosts?

While most hosts should work just fine, so far we've only tested extensively with Linux-based host OSes. On Linux, the config is pretty trivial.

There are 3 key pieces to make this work most effectively:

/32 IPs on the links: This helps ensure traffic goes via the default gateway on the ToRs, not between local nodes on a rack-local L2 segment.

onlink: used to force installation of a gateway route without the kernel's consistency checking. This is needed because the gateway lies outside the IP range configured on the interface (a /32 in this case).

ifplugd: used to change the nexthops of the default route when the physical link goes down.
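Pulled together, the host side might look like the following interfaces(5) sketch. The loopback /32, the on-link gateway address and the metrics are all illustrative assumptions, not a prescribed layout:

```
# /etc/network/interfaces (sketch)
auto lo
iface lo inet loopback
    address 10.1.0.1/32        # the host's /32 identity address

auto eth0
iface eth0 inet static
    address 10.1.0.1/32        # same /32 on the uplink so it comes up normally
    # gateway lies outside the /32, so 'onlink' is required to skip
    # the kernel's reachability check
    post-up ip route add default via 169.254.0.1 dev eth0 onlink metric 100

auto eth1
iface eth1 inet static
    address 10.1.0.1/32
    post-up ip route add default via 169.254.0.1 dev eth1 onlink metric 200
```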

The hosts use a topology similar to the trick we use for OSPF unnumbered: a /32 loopback IP is defined and also provisioned on the physical interfaces to force them to come up normally.

The ifplugd package is used to withdraw routes, since Linux's default behavior is to leave routes in place even when the interface is down; that is obviously undesirable in this topology.
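As a sketch of the withdrawal piece: Debian's ifplugd invokes /etc/ifplugd/ifplugd.action with the interface name and "up"/"down" as arguments. A minimal hook for the down case might look like this (the route details are illustrative):

```
#!/bin/sh
# /etc/ifplugd/ifplugd.action (sketch): $1 = interface, $2 = up|down
# On link-down, withdraw the default route over the dead link;
# Linux would otherwise leave it in place.
if [ "$2" = "down" ]; then
    ip route del default dev "$1"
fi
```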

That’s all folks!

Well there you have it, one more (slightly creative) way to do networking. If you’re interested, try it out for yourself in the Cumulus Workbench; or talk to your Cumulus account team to set up a guided demo of it.


Doug Youd is one of our veteran Systems Engineers; he's been with Cumulus since soon after our launch in 2013. Prior to joining Cumulus, he worked for a range of organizations architecting and implementing cloud infrastructure. At Cumulus he's an SE for our US-West region and helps deliver real-world web scale networking solutions to some of our most cutting-edge customers. His expertise with virtualization technology often sees him post about the intersections between networking and virtualized stacks. Outside of work, he can be found tinkering with cars and occasionally racing at local tracks.

4 Comments

Is this dependent on the hosts to generate traffic in order for the ARP table on the ToR to get populated? If the host with IP 10.1.0.1/32 sends an ARP request for its gateway IP, the ToR should keep that IP/MAC mapping in its ARP cache. But, if the host has not generated any traffic within the ARP cache timeout of the ToR and the ToR has a /32 defined on its south-facing interface(s), e.g. 10.1.0.254/32, then the ToR would not know to send an ARP request for 10.1.0.1 out that interface, no?

Hmm, interesting…
So the host has to ARP outside of its subnet? That’s a pretty huge violation… And all the servers of your datacenter need to be hacked this way?
What would you do if you couldn’t hack the servers? (Windows servers, for example?)

How does this work for silent/multiple IP’ed hosts like VIPs on a LB/SQL cluster IPs/web server virtual IPs (even not behind a LB) that may not speak unless spoken to? Is there any approach for seeking out these hosts within the fabric or, even at worst case, manually tying them down somewhere? Kind of defeats the automated approach being taken, but curious what the recommended strategy is for these, since they are pretty common in typical networks.

Ryan,
Thanks for your comment and sorry for the length of time in replying.

In short, you are completely correct: silent hosts will be an issue. You could ‘search them out’ with an arp-scan or something along those lines. Most LBs will/can send a gratuitous ARP when first raising the IP, though, so it’s perhaps less common in my experience.

Would be very curious to discuss further, though. We have a public slack if you want to ping me there (@cnidus)