[RETIRED] Routing on the Host: An Introduction

To build more resilient data centers, many Cumulus Networks customers are leveraging the Linux ecosystem to run routing protocols directly to their servers. This is often referred to as routing on the host: running layer 3 protocols like OSPF (Open Shortest Path First) or BGP (Border Gateway Protocol) all the way down to the host level. It can be done in a variety of ways by running Quagga.


Why Route on the Host?

Why do customers do this? Why should you care?

Simplifying Troubleshooting

Troubleshooting layer 2 problems has been a persistent challenge in the data center. Expanding the layer 3 footprint further into the data center by routing on the host alleviates many of the issues described below.

Consider a network where layer 2 MLAG is configured between all devices. Although this is a common data center design, and can be deployed on Cumulus Linux, it suffers from a number of shortcomings.

Traceroute is not effective, since it only shows layer 3 hops in the network; this design uses layer 2 devices only. All traceroute outputs, regardless of the path taken, only show the layer 3 exit leafs. There is no way to determine which spine is forwarding traffic.

MAC address tables become the only way to track down hosts. For the diagram above, to hunt down a particular host you would need to run commands to show the MAC addresses on the exit leafs, the spine switches and the leaf switches. If a host or VM migrates while you are troubleshooting, or a loop occurs from a misconfiguration, you may have to show the addresses multiple times.

Duplicate MAC addresses and MAC flaps become frustratingly hard to track down. Orphan ports and dealing with MLAG and non-MLAG pairs increase network complexity. The fastest way to find a specific MAC address is to check the MAC address table of every single network switch in the data center.

Proving that load balancing works correctly can become cumbersome. With layer 2 solutions, LACP (Link Aggregation Control Protocol) is very prevalent, so you need multiple bonds/EtherChannels between the switches. A simple ping doesn't help, because a single flow always produces the same hash; layer 2 EtherChannels are most commonly hashed on SRC IP, DST IP, SRC port and DST port. In the end, you need multiple streams that hash evenly across the LACP bond, which often means buying test tools from companies like Spirent and Ixia.
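The hashing behavior described above can be illustrated with a minimal Python sketch. It is not how any particular switch computes its hash: real hardware uses vendor-specific hash functions, and SHA-256 is only a stand-in here. The function name `pick_link` and the addresses and ports are hypothetical.

```python
import hashlib

def pick_link(src_ip: str, dst_ip: str, src_port: int, dst_port: int, n_links: int) -> int:
    """Map a flow to one member link of a bond by hashing its 4-tuple.

    SHA-256 is an illustrative stand-in for a hardware hash function.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_links

# One flow always lands on the same link, however many packets are sent.
single_flow = {pick_link("10.0.0.1", "10.0.0.2", 40000, 80, 2) for _ in range(1000)}

# Varying the source port creates many distinct flows, which spread across the links.
many_flows = {pick_link("10.0.0.1", "10.0.0.2", port, 80, 2) for port in range(40000, 40064)}

print("links used by one flow:", sorted(single_flow))
print("links used by 64 flows:", sorted(many_flows))
```

This is why a single test stream cannot prove that both members of a bond carry traffic: only a set of flows with differing hash inputs exercises every link.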

With a layer 3 design, you can run ip route show and see all of the equal cost routes. Tools like mtr and scamper can also show all possible ECMP routes; that is, which switches traffic is being load balanced across.

Three or More Top of Rack Switches

With solutions like Cisco's vPC (virtual Port Channel), Juniper's MC-LAG (Multi-Chassis Link Aggregation) or Arista's MLAG (Multi-chassis Link Aggregation), you gain high availability by having two active connections. Cumulus Networks has feature parity with these solutions with its own MLAG implementation.

High availability means having two or more active connections. However, with high density servers, or hyper-converged infrastructure deployments, it is common to see more than two NICs per host. By routing on the host, three or more ToR (top of rack) switches can be configured, giving much more redundancy. If one ToR fails, you lose only 1/N of your bandwidth, where N is the total number of ToR switches, whereas with a layer 2 MLAG solution, you lose 50% of your bandwidth.

Clear Upgrade Strategy

By routing on the host, you gain two huge bonuses:

Ability to gracefully remove a ToR switch from the fabric for maintenance

More redundancy by having multiple ToRs (3+)

Let's expand on these two points. With layer 2 only (like MLAG), there is no way to influence routes without being disruptive (that is, some traffic loss must occur). With OSPF and BGP, there are multiple load balanced routes via ECMP (Equal Cost Multipath) routing. Since there is routing, it is possible to change these routes dynamically.

For OSPF, you can increase the cost of all the links, making the network node less preferable.

With BGP, there are multiple ways to change the routes, but the most common is prepending your BGP AS to make the switch less preferable.

Both techniques make the ToR switch less preferable, removing it as an ECMP choice. However, the link doesn't get turned off. Unlike layer 2, where the link must be shut down and all traffic currently being transmitted is lost, a routing solution notifies the rest of the network to no longer send traffic to this switch. By watching interface counters, you can determine when traffic is no longer being sent to the device under maintenance, so you can safely remove it from the network with no impact on traffic.
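As a rough illustration of how making one switch less preferable removes it from the ECMP set, consider the following Python sketch. It models only the idea that ECMP uses the next hops tied for the lowest cost, where "cost" stands in for either an OSPF metric or a BGP AS path length. It deliberately ignores the rest of each protocol's route selection, and the names `ecmp_next_hops`, `tor1`, `tor2` and `tor3` are hypothetical.

```python
def ecmp_next_hops(costs: dict[str, int]) -> list[str]:
    """Return the next hops that tie for the lowest cost (the ECMP set)."""
    best = min(costs.values())
    return sorted(hop for hop, cost in costs.items() if cost == best)

# Hypothetical host with three ToR uplinks at equal cost.
costs = {"tor1": 10, "tor2": 10, "tor3": 10}
before = ecmp_next_hops(costs)

# Maintenance on tor2: raise its cost (a higher OSPF link cost, or a longer
# AS path after BGP prepending). The link stays up; it is just not chosen.
costs["tor2"] = 100
after = ecmp_next_hops(costs)

print("before:", before)
print("after: ", after)
```

Once tor2 no longer appears in the ECMP set and its interface counters stop incrementing, it can be taken out of service.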

Because routing on the host can use three or more ToRs, the impact of a ToR being removed from service is reduced, whether the cause is expected maintenance or an unexpected network failure. So, instead of losing 50% of bandwidth in a two ToR MLAG deployment, the bandwidth loss can be reduced to 33% with three ToRs or 25% with four.
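The percentages above follow from simple arithmetic, sketched here in Python. The sketch assumes every host uplink has the same speed and that traffic is spread evenly across the ToRs; the function name `bandwidth_lost` is hypothetical.

```python
from fractions import Fraction

def bandwidth_lost(total_tors: int, failed_tors: int = 1) -> Fraction:
    """Fraction of host uplink bandwidth lost, assuming equal-speed links to each ToR."""
    if total_tors < 1 or not 0 <= failed_tors <= total_tors:
        raise ValueError("need total_tors >= 1 and 0 <= failed_tors <= total_tors")
    return Fraction(failed_tors, total_tors)

for n in (2, 3, 4):
    # Prints 50%, 33% and 25% for two, three and four ToRs respectively.
    print(f"{n} ToRs, one down: {float(bandwidth_lost(n)):.0%} lost")
```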

The redundancy with layer 3 networks is tremendous. In the image above, the network on the left can still operate even if 3 out of 4 ToR switches are down. That is 4N redundancy. The best case for the network on the right is 2N redundancy, no matter what vendor you choose. Layer 3 allows applications to have much more uptime with far less risk of outages.

Application Availability

Often when deploying a new application, server or service, there can be a delay between when the new device or service is available and when it is integrated with the network. This is typically a result of the additional configuration required to set up layer 2 high availability (HA) technologies on the upstream switches, which is often a manual process.

Using layer 3 and routing on the host eliminates this delay entirely. Tight prefix list control coupled with authentication can be leveraged on leaf and spine switches to protect the rest of the network from the downstream servers and to limit what they are allowed to advertise into the network. Server admins can be in control of getting their service on the network within the bounds of a safe framework set up by the network team. This is similar to how service providers treat their customers today.
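The kind of check a prefix list performs can be sketched in Python with the standard `ipaddress` module. This is only a model of the policy idea, not switch configuration. The allowed range `10.1.0.0/16`, the "host routes only" rule and the function name `accept_advertisement` are assumptions made for illustration.

```python
import ipaddress

# Hypothetical policy: servers may advertise only IPv4 host routes from this range.
ALLOWED_RANGE = ipaddress.ip_network("10.1.0.0/16")

def accept_advertisement(prefix: str) -> bool:
    """Accept a route only if it is an IPv4 /32 inside the allowed range."""
    net = ipaddress.ip_network(prefix)
    # The version check comes first so subnet_of() never compares IPv4 with IPv6.
    return net.version == 4 and net.prefixlen == 32 and net.subnet_of(ALLOWED_RANGE)

for candidate in ("10.1.5.7/32", "192.168.1.1/32", "0.0.0.0/0", "2001:db8::1/128"):
    print(candidate, "accepted" if accept_advertisement(candidate) else "rejected")
```

With a filter of this shape on the leaf, a misconfigured server cannot inject a default route or someone else's address space into the fabric.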

Similarly, when an application or service moves from one part of the network to another, the application team has the ability to advertise the newly moved application quickly to the rest of the network allowing for more agility in service location.

A service or application can be represented by a /32 IPv4 or /128 IPv6 host route. Since that application depends on that /32 or /128 being reachable, the application is dependent on the network. Usually this means the ToR or spine is advertising reachability. If the application is migrated or moved (for example, by VMware vMotion or KVM Migration), the network may need substantial reconfiguration to advertise it correctly. Usually this requires multiple steps:

Removing the host route from the previous ToR, spine or pair of ToRs or spines so it is no longer advertised to the wrong location.

Adding the host route to the new ToR, spine or pair of ToRs or spines so it is advertised into the routed fabric.

Checking connectivity from the host to make sure it has reachability.

These steps are often done by different teams, which can also cause problems. When routing on the host, this is done automatically by Quagga, which advertises the host routes no matter where the host is plugged in.

Multi-vendor Support

One problem with layer 2, especially in MLAG environments, is interoperability. If you have one Cisco device and one Juniper device, they can't act as an MLAG pair. This causes a problem known as vendor lock-in, where the customer is locked into a vendor because of proprietary requirements. One huge benefit of layer 3 is that by using OSPF or BGP, the network adheres to open standards that have been around a long time. OSPF and BGP interoperability is highly tested, very scalable and has a track record of success. Most networks are multi-vendor networks that peer at layer 3. By designing the network down to the host level with layer 3, it is now possible to have multiple vendors everywhere in your network. The following diagram is perfectly acceptable in a layer 3 environment:

Host, VM and Container Mobility

When routing on the host, all VMs, containers, subnets and so forth are advertised into the fabric automatically. This means only the subnet on the connection between the ToR and the router on the host needs to be configured on the ToR. This greatly increases host mobility by allowing minimal configuration on the ToR switch. All the ToR switch has to do is peer with the server.

If security is a concern, the host can be forced to authenticate before BGP or OSPF adjacencies can form. Consider the following diagram:

In the above diagram the Quagga configuration does not need to change, no matter what ToR you plug it into. The only configuration that needs to change is the subnet on swp1 and eth0 (configured under /etc/network/interfaces, which is not shown here). This greatly reduces configuration complexity and allows for easy host mobility.

BGP Unnumbered Interfaces

Cumulus Networks enhanced Quagga with the ability to implement RFC 5549. This means that you can configure BGP unnumbered interfaces on the host. In addition to the benefits of not having to configure every subnet described above, you do not have to configure anything specific on the ToR switch at all, so you don't have to configure an IPv4 address in /etc/network/interfaces for peering.

BGP unnumbered interfaces enable IPv6 link-local addresses to be used for IPv4 BGP adjacencies. Link-local addresses are automatically configured with SLAAC (StateLess Address AutoConfiguration). Each address is derived from an interface's MAC address and is unique to each layer 3 adjacency. DAD (Duplicate Address Detection) keeps duplicate addresses from being configured. This means the configuration remains the same no matter where the host resides. There is no specific subnet used on the Ethernet connection between the host and the switch.
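To make the MAC-to-link-local derivation concrete, here is a minimal Python sketch. It assumes the classic modified EUI-64 method; a given host may be configured to generate its link-local address differently, in which case the result will not match. The function name `link_local_from_mac` and the sample MAC address are hypothetical.

```python
import ipaddress

def link_local_from_mac(mac: str) -> ipaddress.IPv6Address:
    """Build an fe80::/64 address from a 48-bit MAC using modified EUI-64."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address such as 00:11:22:33:44:55")
    octets[0] ^= 0x02  # flip the universal/local bit
    # Insert ff:fe between the two halves of the MAC to form the 64-bit interface ID.
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    # fe80 followed by 48 zero bits is the link-local prefix for this address.
    return ipaddress.IPv6Address(b"\xfe\x80" + bytes(6) + interface_id)

print(link_local_from_mac("00:11:22:33:44:55"))  # fe80::211:22ff:fe33:4455
```

Because the address comes from the interface itself rather than from a per-link subnet plan, nothing about it has to be configured when the host moves to a different ToR.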

Along with the implementation of RFC 5549, Quagga has a simpler configuration, allowing novice users to quickly configure, understand and troubleshoot BGP configurations within the data center. The following illustration shows a single attached host using BGP unnumbered interfaces:

Why Have Networks Not Done This in the Past?

If routing on the host has a lot of benefits, why has this not happened in the past?

Lack of a Fully-featured Host Routing Application

In the past, there were no enterprise grade open routing applications that could be installed easily on hosts. Cumulus Networks and many other organizations have made these open source projects robust enough to run in production for hundreds of customers. Now that applications like Quagga have reached a high level of maturity, it is only natural for them to run directly on the host as well.

Cost of Layer 3 Licensing

Many vendors base their license costs on features. Unfortunately, vendors like Cisco, Arista and Juniper often charge more for layer 3 features. This means that designing a layer 3-capable network is not as simple as just turning it on; the customer is forced to pay for additional licenses to enable these features.

The licensing is often confusing (for example, "What is the upgrade path?" "Do I need additional licenses for BGP vs OSPF?" "Does scale affect my price?"), even when the cost is budgeted for. Routing should not cost customers additional money when they buy a layer 3-capable switch. At Cumulus Networks, our licensing model is simple, concise and publicly available.