Overview

In a modern data center, the physical router is essential to a workable network design, and virtual networking needs to provide similar functionality. Routing between IP subnets can be performed in logical space without traffic going out to a physical router. This routing is performed in the hypervisor kernel with minimal CPU and memory overhead, which gives an optimal data path for routing traffic within the virtual infrastructure. The distributed routing capability in the NSX-v platform provides an optimized and scalable way of handling East-West traffic, that is, communication between virtual machines within the data center. The amount of East-West traffic is growing: collaborative, distributed, and service-oriented application architectures demand higher bandwidth for server-to-server communication.

If these servers are virtual machines running on a hypervisor and they are connected to different subnets, the communication between them has to go through a router. If a physical router provides the routing service, the virtual machine traffic has to go out to the physical router and come back in to the server after the routing decision has been made. This is not an optimal traffic flow and is sometimes referred to as “hair-pinning”.

Distributed routing on the NSX-v platform prevents this “hair-pinning” by providing hypervisor-level routing. Each hypervisor has a routing kernel module that performs routing between the Logical Interfaces (LIFs) defined on that distributed router instance.

The distributed logical router owns and manages its logical interfaces (LIFs). A LIF is similar in concept to a VLAN interface on a physical router; on the distributed logical router these interfaces are simply called LIFs. Each LIF connects to a logical switch or a distributed port group. A single distributed logical router can have a maximum of 1,000 LIFs.

DLR Overview

DLR Interface Types

The DLR has three types of interfaces: Uplink, LIF and Management.

Uplink: This is used by the DLR Control VM to connect to the upstream router. Most of the documentation also refers to it as the “transit” interface, because it is the transit interface between the logical space and the physical space. The DLR supports both OSPF and BGP on its Uplink interface, but cannot run both at the same time. OSPF can be enabled on only a single Uplink interface.

LIFs: LIFs exist on the ESXi host at the kernel level. They are the Layer 3 interfaces that act as the default gateway for all VMs connected to logical switches.

Management: The DLR management interface can be used for several purposes. The first is remote access, such as SSH, to manage the DLR Control VM. The second is High Availability. The third is sending syslog information to a syslog server. The management interface is part of the routing table of the Control VM; there is no separate routing table. When we configure an IP address for the management interface, only devices on the same subnet as the management interface will be able to reach the DLR Control VM management IP; devices on remote subnets will not be able to contact this IP.

DLR Interface Type

Note: If we just need an IP address to manage the DLR remotely, we can SSH to the DLR “Protocol Address”, explained later in this chapter; there is no need to configure a new IP address for the management interface.
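The same-subnet restriction on the management interface described above can be sketched in a few lines. This is only an illustration, not NSX-v code: the function name `can_reach_mgmt` and all addresses are hypothetical and chosen for the example.

```python
import ipaddress

def can_reach_mgmt(client_ip: str, mgmt_interface: str) -> bool:
    """Return True if client_ip sits in the same subnet as the management interface.

    mgmt_interface is given in CIDR form, for example "192.168.110.30/24".
    """
    iface = ipaddress.ip_interface(mgmt_interface)
    return ipaddress.ip_address(client_ip) in iface.network

# Hypothetical management interface of a DLR Control VM.
MGMT = "192.168.110.30/24"

same_subnet = can_reach_mgmt("192.168.110.50", MGMT)  # client on the management subnet
remote = can_reach_mgmt("10.20.30.40", MGMT)          # client on a remote subnet
print(same_subnet, remote)
```

A client on a remote subnet fails the check, which is why the Protocol Address is the more practical target for remote SSH.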

Logical Interfaces, virtual MAC and physical MAC

Each Logical Interface (LIF) holds an IP address on the DLR kernel module inside the ESXi host. Each LIF has an associated MAC address called the virtual MAC (vMAC). The vMAC is the same across all the ESXi hosts and is never seen by the physical network, only by virtual machines, which use it as their default gateway MAC address. The physical MAC (pMAC) is the MAC address of the uplink through which traffic flows to the physical network: when the DLR needs to route traffic outside of the ESXi host, it is the pMAC that is used.

In the following figure, the ESXi host esxcomp-01a runs the DLR kernel module, and this DLR instance has two LIFs. Each LIF is associated with a logical switch: VXLAN 5001 and VXLAN 5002. From the perspective of VM1, the default gateway is LIF1 with IP address 172.16.10.1; VM2 has LIF2, 172.16.20.1, as its default gateway. The vMAC is the same MAC address for both LIFs.

The LIF IP addresses and the vMAC are the same across all NSX-v hosts for the same DLR instance.

DLR and vMotion

When VM2 is vMotioned from esxcomp-01a to esxcomp-01b, it keeps the same default gateway (LIF2), which is associated with the same vMAC, so from the perspective of VM2 nothing has changed.
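To make the LIF and vMotion behaviour concrete, here is a small sketch that models one DLR instance whose LIFs are copied identically to every host. It is a toy model under stated assumptions, not the NSX-v implementation: the `Lif` class, the helper names, and the vMAC value are placeholders for illustration, while the LIF IP addresses, VXLAN IDs and host names come from the example above.

```python
from dataclasses import dataclass

# Placeholder vMAC used only for this illustration.
VMAC = "02:50:56:56:44:52"

@dataclass(frozen=True)
class Lif:
    name: str
    vxlan: int
    ip: str
    mac: str = VMAC  # every LIF of the instance shares the same vMAC

def instantiate_on_hosts(lifs, hosts):
    """Give every host an identical copy of the DLR instance's LIFs."""
    return {host: tuple(lifs) for host in hosts}

lifs = [Lif("LIF1", 5001, "172.16.10.1"), Lif("LIF2", 5002, "172.16.20.1")]
view = instantiate_on_hosts(lifs, ["esxcomp-01a", "esxcomp-01b"])

def gateway_for(host, vxlan):
    """Return the LIF a VM on the given VXLAN uses as default gateway on this host."""
    return next(lif for lif in view[host] if lif.vxlan == vxlan)

# VM2 lives on VXLAN 5002; compare its gateway before and after vMotion.
before = gateway_for("esxcomp-01a", 5002)
after = gateway_for("esxcomp-01b", 5002)
print(before == after)
```

Because the gateway IP and MAC are equal on both hosts, the VM's ARP entry for its default gateway stays valid after the move.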

DLR Kernel module and ARP table

The DLR does not communicate with the NSX-v Controller to figure out the MAC address of VMs. Instead, it sends an ARP request to all ESXi host VTEP members on that logical switch. The VTEPs that receive this ARP request forward it to all VMs on that logical switch.

In the following figure, if VM1 needs to communicate with VM2, the traffic is routed inside the DLR kernel module at esxcomp-01a, so this DLR needs to know the MAC addresses of VM1 and VM2. The DLR sends an ARP request to all VTEP members on VXLAN 5002 to learn the MAC address of VM2. The DLR keeps each ARP table entry for 600 seconds, which is its aging time.

DLR Kernel module and ARP table

Note: The DLR instance may have different ARP entries between different ESXi hosts. Each DLR Kernel module maintains its own ARP table.
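The per-host ARP table with its 600-second aging time can be sketched as follows. This is a simplified model, not the kernel module's data structure: the `ArpTable` class, the VM IP and MAC values, and the choice to expire an entry at exactly 600 seconds are all assumptions made for the example.

```python
ARP_AGING_SECONDS = 600  # aging time of a DLR ARP entry

class ArpTable:
    """Toy ARP cache; each ESXi host's DLR kernel module keeps its own copy."""

    def __init__(self, aging=ARP_AGING_SECONDS):
        self.aging = aging
        self._entries = {}  # ip -> (mac, time the entry was learned)

    def learn(self, ip, mac, now):
        self._entries[ip] = (mac, now)

    def lookup(self, ip, now):
        entry = self._entries.get(ip)
        if entry is None:
            return None  # unknown: the DLR would have to send an ARP request
        mac, learned_at = entry
        if now - learned_at >= self.aging:
            del self._entries[ip]  # aged out
            return None
        return mac

# One independent table per host (VM IP and MAC are hypothetical).
tables = {"esxcomp-01a": ArpTable(), "esxcomp-01b": ArpTable()}
tables["esxcomp-01a"].learn("172.16.20.11", "00:50:56:aa:bb:02", now=0)

fresh = tables["esxcomp-01a"].lookup("172.16.20.11", now=599)
other_host = tables["esxcomp-01b"].lookup("172.16.20.11", now=599)
expired = tables["esxcomp-01a"].lookup("172.16.20.11", now=600)
print(fresh, other_host, expired)
```

The second lookup returns nothing because esxcomp-01b never learned the entry, which mirrors the note that ARP entries can differ between hosts.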

DLR and local routing

Since the DLR instance is distributed, each ESXi host has a route-instance that can route traffic. When VM1 needs to send traffic to VM2, theoretically the DLR in either esxcomp-01a or esxcomp-01b could route the traffic, as in the following figure. In NSX-v the DLR always performs local routing for VM traffic!

When VM1 sends a packet to VM2, the DLR in esxcomp-01a will route the traffic from VXLAN 5001 to VXLAN 5002 because VM1 has initiated the traffic.

DLR Local Routing

The following illustration shows that when VM2 replies to VM1, the DLR at esxcomp-01b routes the traffic because VM2 is local to the DLR at esxcomp-01b.

Note: the actual traffic between the ESXi hosts flows via the VTEPs.
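The local routing rule can be captured in a short sketch: the routing decision is always made on the host where the sending VM runs. The `Vm` tuple and the `route` function are hypothetical names for illustration; VM placement and VXLAN IDs follow the example above.

```python
from typing import NamedTuple

class Vm(NamedTuple):
    host: str
    vxlan: int

# Placement taken from the example: VM1 on esxcomp-01a, VM2 on esxcomp-01b.
VMS = {
    "VM1": Vm("esxcomp-01a", 5001),
    "VM2": Vm("esxcomp-01b", 5002),
}

def route(src, dst):
    """Describe where a packet from src to dst is routed (always on the source host)."""
    s, d = VMS[src], VMS[dst]
    return {
        "routed_on": s.host,               # local DLR of the sender
        "from_vxlan": s.vxlan,
        "to_vxlan": d.vxlan,
        "crosses_vteps": s.host != d.host,  # host-to-host leg uses the VTEPs
    }

request = route("VM1", "VM2")
reply = route("VM2", "VM1")
print(request["routed_on"], reply["routed_on"])
```

The request and the reply are routed on different hosts, so the forward and return paths are asymmetric at the routing step even though both cross the same VTEPs.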

DLR Local Routing


Multiple Route Instances

The Distributed Logical Router (DLR) has two components: the DLR Control VM, which is a virtual machine, and the DLR kernel module, which runs in every ESXi hypervisor. The DLR kernel module, called a route-instance, holds the same copy of the information on each ESXi host and works at the kernel level. An ESXi host has at least one route-instance of the DLR kernel module, but it is not limited to just one.

The following figure shows two DLR Control VMs, with DLR Control VM1 on the right and DLR Control VM2 on the left. Each Control VM has its own route-instance in the ESXi hosts. In esxcomp-01a we have route-instance 1, which is managed by DLR Control VM1, and route-instance 2, which is managed by DLR Control VM2; the same applies to esxcomp-01b. Each DLR instance has its own range of LIFs that it manages: DLR Control VM1 manages the LIFs in VXLAN 5001 and 5002, and DLR Control VM2 manages the LIFs in VXLAN 5003 and 5004.

Multiple Route Instances

Logical Router Port

Regardless of the number of route-instances inside an ESXi host, there is one special port called the “Logical Router Port” or “vdr port”.

This port works like a “router on a stick”: all routed traffic passes through it. We can think of a route-instance as similar to VRF-lite, because each route-instance has its own LIFs and routing table, and LIF IP addresses can even overlap between route-instances.

In the following figure we have an example of an ESXi host with two route-instances, where route-instance-1 has the same IP address as route-instance-2 but on a different VXLAN.

Note: Different DLRs cannot share the same VXLAN.
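The VRF-lite analogy and the VXLAN restriction can be illustrated with a small model in which each route-instance keeps its own LIF table. The `EsxiHost` class and its methods are hypothetical, and the overlapping IP address and the VXLAN assignments are example values only.

```python
class EsxiHost:
    """Toy model of a host holding several independent route-instances."""

    def __init__(self, name):
        self.name = name
        self.route_instances = {}  # instance name -> {VXLAN id: LIF IP}

    def add_lif(self, instance, vxlan, ip):
        # A VXLAN may belong to one DLR only; IP addresses are not checked
        # across instances, so they are free to overlap.
        for other, lifs in self.route_instances.items():
            if other != instance and vxlan in lifs:
                raise ValueError(f"VXLAN {vxlan} already belongs to {other}")
        self.route_instances.setdefault(instance, {})[vxlan] = ip

    def gateway(self, instance, vxlan):
        return self.route_instances[instance][vxlan]

host = EsxiHost("esxcomp-01a")
host.add_lif("route-instance-1", 5001, "172.16.10.1")
host.add_lif("route-instance-2", 5003, "172.16.10.1")  # same IP, different VXLAN

try:
    host.add_lif("route-instance-2", 5001, "172.16.30.1")  # VXLAN 5001 is taken
    shared_vxlan_rejected = False
except ValueError:
    shared_vxlan_rejected = True

print(host.gateway("route-instance-1", 5001), host.gateway("route-instance-2", 5003))
print(shared_vxlan_rejected)
```

Keying each table by route-instance first is what lets the same gateway IP exist twice on one host without ambiguity.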

DLR vdr port

Routing Information Control Plane Update Flow

We need to understand how a route is configured and pushed from the DLR control VM to the ESXi hosts. Let’s look at the following figure to understand the flow.

Step 1: An end user configures a new DLR Control VM. This DLR has LIFs (logical interfaces) and either static routing or a dynamic routing protocol peering with the NSX-v Edge Services Gateway (ESG).

Step 2: The DLR LIF configuration is pushed to all ESXi hosts in the cluster that have been prepared by the NSX-v platform. If more than one route-instance exists on a host, the LIF information is sent only to the route-instance that belongs to this DLR.

At this point, VMs in different VXLANs (East-West traffic) can communicate with each other.

Step 3: The NSX-v Edge Services Gateway (ESG) updates the DLR Control VM with new routes.

DLR Control VM communications

The DLR Control VM is a virtual machine that is typically deployed in the Management or Edge cluster. When an ESXi host is prepared by the NSX-v platform, one of the VIBs creates the control plane channel between the ESXi host and the NSX-v controllers. The service daemon inside the ESXi host that is responsible for this channel is called netcpad, more commonly referred to as the User World Agent (UWA).

netcpad is responsible for communication between the NSX-v controller and the ESXi host: it learns MAC/IP/VTEP address information and supports VXLAN communications. The communication with the NSX-v controller on the control plane is secured with SSL. The UWA can also connect to multiple NSX-v controller instances, and it maintains its logs at /var/log/netcpa.log

Another service daemon, vShield-Stateful-Firewall, is responsible for interacting with the NSX-v Manager. This daemon receives configuration information from the NSX-v Manager to create (or delete) the DLR Control VM and to create (or delete) the ESG. Besides that, it also performs NSX-v firewall tasks: it retrieves the DFW policy rules, gathers DFW statistics and sends them to the NSX-v Manager, and sends audit logs and information to the NSX-v Manager. As part of host preparation, it also processes SSL-related tasks from the NSX-v Manager.

The DLR Control VM opens two VMCI sockets to the user world agents (UWA) on the ESXi host where it resides. The first VMCI socket goes to the vShield-Stateful-Firewall service daemon, for receiving configuration updates from the NSX-v Manager for the DLR Control VM itself; the second goes to netcpad, for control plane access to the controllers.

The VMCI socket provides local communication: a guest virtual machine can communicate with the hypervisor where it resides, but not with other ESXi hosts.

On this basis the routing update happens in the following manner:

Step (1) The DLR Control VM learns new route information (from dynamic routing, for example) and needs to update the NSX-v controller.

Step (2) The DLR Control VM uses the internal channel inside the ESXi01 host called the “Virtual Machine Communication Interface” (VMCI). It opens a VMCI socket to transfer the learned routes, as Routing Information Base (RIB) information, to the netcpa service daemon.

Step (3) The netcpa service daemon sends the RIB information to the NSX-v controller. The routing information flows through the management VMkernel interface of the ESXi host, which means that the NSX-v controllers do not need a new interface to communicate with the DLR Control VM. The protocol and port used for this communication is TCP/1234.

Step (4) The NSX-v controller forwards the DLR RIB to the netcpa service daemons on all ESXi hosts.

Step (5) On each host, netcpa forwards the FIB entries to the DLR route-instance.
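The five steps above can be traced with a small simulation. This is a sketch of the message flow only, not of any NSX-v API: the function `propagate_routes`, the second host name `ESXi02`, and the example prefix and next hop are assumptions made for illustration. Only the VMCI channel and TCP/1234 labels come from the description above; the labels on steps 4 and 5 are generic.

```python
def propagate_routes(rib, control_vm_host, prepared_hosts):
    """Trace the routing update flow and return (trace, per-host FIB)."""
    trace = [
        # Step 2: Control VM hands the RIB to the local netcpa over VMCI.
        ("DLR Control VM", f"netcpa@{control_vm_host}", "VMCI socket"),
        # Step 3: local netcpa sends the RIB to the controller.
        (f"netcpa@{control_vm_host}", "NSX-v controller", "TCP/1234"),
    ]
    fib = {}
    for host in prepared_hosts:
        # Step 4: controller distributes the RIB to netcpa on every host.
        trace.append(("NSX-v controller", f"netcpa@{host}", "control plane channel"))
        # Step 5: netcpa programs the local DLR route-instance.
        trace.append((f"netcpa@{host}", f"route-instance@{host}", "local FIB update"))
        fib[host] = dict(rib)
    return trace, fib

# Step 1: a route learned by the Control VM (hypothetical prefix and next hop).
rib = {"192.168.100.0/24": "192.168.10.1"}
trace, fib = propagate_routes(rib, "ESXi01", ["ESXi01", "ESXi02"])

for src, dst, channel in trace:
    print(f"{src} -> {dst} via {channel}")
```

Note that the host running the Control VM also receives its FIB through the controller, in the same way as every other host.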

DLR Control VM communications

DLR High Availability

DLR Control VM High Availability (HA) provides redundancy at the VM level. The HA mode is Active/Passive: the active DLR Control VM holds the IP address, and if it fails, the passive DLR Control VM takes ownership of the IP address (a flip event). The DLR route-instance, its LIFs and their IP addresses exist on the ESXi host as a kernel module and are not part of this Active/Passive flip event.

The active DLR Control VM syncs its forwarding table to the secondary DLR Control VM. If the active unit fails, the forwarding table continues to be used on the secondary unit until the secondary DLR renews the adjacency with the upstream router.

The HA heartbeat message is sent out through the DLR management interface, so we must have L2 connectivity between the active and the secondary DLR Control VM. The IP addresses of the active and passive units are assigned automatically from a /30 subnet when we deploy HA. The default failover detection time is 15 seconds, but it can be lowered to 6 seconds. The heartbeat uses UDP port 694 for its communication.
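Two details of the HA description lend themselves to a short sketch: a /30 subnet provides exactly two usable addresses, one per Control VM, and a peer is declared failed once no heartbeat has been seen for the detection time. The function names and the example subnet are hypothetical, and treating the boundary (exactly 15 seconds of silence) as a failure is an assumption.

```python
import ipaddress

DEFAULT_DEAD_TIME = 15  # default failover detection time, in seconds
MIN_DEAD_TIME = 6       # lowest supported detection time, in seconds

def ha_pair_addresses(subnet):
    """Return the two usable host addresses of a /30 HA subnet."""
    net = ipaddress.ip_network(subnet)
    if net.prefixlen != 30:
        raise ValueError("HA pair addressing expects a /30 subnet")
    first, second = net.hosts()  # a /30 has exactly two usable hosts
    return str(first), str(second)

def peer_is_dead(last_heartbeat, now, dead_time=DEFAULT_DEAD_TIME):
    """Decide whether the peer should be considered failed."""
    if dead_time < MIN_DEAD_TIME:
        raise ValueError(f"dead time cannot be lower than {MIN_DEAD_TIME} seconds")
    return now - last_heartbeat >= dead_time

pair = ha_pair_addresses("169.254.1.0/30")  # hypothetical subnet
print(pair)
print(peer_is_dead(last_heartbeat=0, now=14))
print(peer_is_dead(last_heartbeat=0, now=15))
print(peer_is_dead(last_heartbeat=0, now=6, dead_time=6))
```

Lowering the detection time shortens the outage after a failure, at the cost of being more sensitive to a few lost heartbeats.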

DLR High Availability

You can also verify the HA status by running the following commands:

DLR HA verification command:

$ show service highavailability

$ show service highavailability connection-sync

$ show service highavailability link

Protocol Address and Forwarding Address

The Protocol Address is the IP address of the DLR Control VM. The Control VM is the control plane component that actually establishes the OSPF or BGP peering with the ESGs. The following figure shows OSPF as an example:

Protocol Address and Forwarding Address

The following figure shows the DLR Forwarding Address, which is the IP address that the ESGs use as their next hop.

Protocol Address and Forwarding Address

DLR Control VM Firewall

The DLR Control VM can protect its Management or Uplink interfaces with the built-in firewall. Any device that needs to communicate with the DLR Control VM itself needs a firewall rule that permits it.

For example, SSH to the DLR Control VM, or even OSPF adjacencies with the upstream router, need a firewall rule. We can enable or disable the DLR Control VM firewall globally.

Note: do not confuse DLR Control VM firewall rules with NSX-v distributed firewall rules. The following image shows the firewall rule for the DLR Control VM.

DLR Control VM Firewall

Creating DLR

The first step is to create the DLR Control VM.

Go to Network and Security -> NSX Edges and click on the green + button.

Here we need to select Logical (Distributed) Router:

Creating DLR

Specify the user name and password; we can also enable SSH access:

DLR CLI Credentials

We need to specify where we want to place the DLR Control VM:

place the DLR Control VM

We need to specify the Management interface and the Logical Interfaces (LIFs).

The Management interface is for SSH access to the Control VM.

The LIFs need to be configured in the second table, below “Configure Interfaces of this NSX Edge”.


Roie Ben Haim is a Senior Member of Technical Staff who specializes in Networking and Security at VMware and who is currently focused on implementing solutions, which incorporate VMware’s NSX platform as well as integrating with various Cloud platforms on VMware’s infrastructure.
Roie works in VMware’s Consulting (PSO) team, whose focus is on the delivery of Network Virtualization and Security solutions. In this role Roie provides technical leadership in all aspects, including the installation, configuration, and implementation of VMware’s products and services. This also includes being involved from the inception of these projects, through the requirements assessment, design and deployment phases, and then into production, which ensures continuity for VMware’s customers.
Roie has over 15 years of experience working on data center technologies and providing solutions for global enterprises, primarily focused on Network and Security.
A highly motivated and enthusiastic MSc graduate, Roie holds a wide range of industry-leading certifications, including his most recent, Network Virtualization (VCDX-NV), as well as Cisco CCIE x2 (DC/SEC) and Juniper JNCIE-SP.
Roie is not only a strong team member, but is also able to demonstrate his skills and experience working in various fields.
As a well known and respected blogger, Roie maintains an impressive blog at:
http://routetocloud.com