This post could also be titled “How to build a healthy, long-lasting relationship with your system administration team”. One of the most important (and overlooked) pieces of deploying VMware ESX in a network is handling an upstream network failure. Because larger organizations have segregated network and system administration teams, the switchport tends to be the demarcation of responsibility. Where this particularly fails is in the perceived reaction of a network component failure, be it an upstream switch or router.

With the increased push towards server consolidation and deployment of VMware, the “routed is better” mantra has become muted by the layer 2 requirements of virtual machine mobility. A virtualized server can also present cable density issues, with each server possibly needing 6 NICs (2 x Production, 2 x VMkernel, 1 x Backup, 1 x iLO). From a network design perspective, a VMware deployment screams for a top of rack switching model. Top of rack switching and VMware ESX physical NIC (pNIC) failure detection methods can present some interesting challenges.

VMware ESX allows for two options to detect an upstream network failure: Beacon Probing and Link Status. Here is an in-depth summary on both methods:

Basically, beacon probing is pretty awful if you’re a network admin. It sends broadcasts out of each physical interface of the ESX server for EACH VLAN configured (if you are using dot1q tagging, which you should be). So that is:

p number of physical servers x n number of pNICs per server x v number of vlans = broadcast storm
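To put a number on that product, here is a minimal sketch. The server, pNIC, and VLAN counts are made-up example values, not figures from any real deployment, and the function name is purely illustrative.

```python
def beacon_broadcasts(servers: int, pnics_per_server: int, vlans: int) -> int:
    """Broadcast frames generated by one round of beaconing: p x n x v."""
    return servers * pnics_per_server * vlans


# Hypothetical data center: 40 ESX hosts, 2 production pNICs each, 50 VLANs.
print(beacon_broadcasts(servers=40, pnics_per_server=2, vlans=50))  # 4000
```

Every one of those frames is a broadcast, so each lands on every port in its VLAN.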

Link status is the preferred failure detection method but it will only track the state of the local link (between the ESX server and the switch). This tells the ESX server nothing about the switch’s ability to forward frames. This is where link-state tracking comes in. Link-state tracking will convey the switch’s upstream link-state to the local link of the ESX server by creating a logic gate between upstream and downstream links.
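The “logic gate” idea can be sketched in a few lines. This is a toy model, not switch code: the class and method names are hypothetical, and it assumes the downstream ports in a group are only forced down once every upstream port in that group is down.

```python
from dataclasses import dataclass, field


@dataclass
class LinkStateGroup:
    """Toy model of one link-state tracking group on a switch."""
    upstream: dict = field(default_factory=dict)    # interface name -> link is up?
    downstream: list = field(default_factory=list)  # interface names facing servers

    def downstream_state(self) -> dict:
        # The "logic gate": downstream ports stay up only while at least
        # one upstream port in the group is still up.
        any_upstream_up = any(self.upstream.values())
        return {port: any_upstream_up for port in self.downstream}


group = LinkStateGroup(upstream={"Gi1/0/1": True}, downstream=["Gi1/0/2"])
print(group.downstream_state())    # uplink healthy, ESX-facing port stays up

group.upstream["Gi1/0/1"] = False  # the uplink fails
print(group.downstream_state())    # ESX-facing port is taken down with it
```

The model ignores the physical state of the downstream ports themselves; it only shows how the uplink state is propagated.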

Suppose you have the following loop-free network topology deployed in your data center:

The network failure detection method configured on the ESX server is link status. Most likely your ESX server is sending frames out both interfaces because of its load-balancing configuration, but in this case we are only interested in frames sent to the switch on the left. If the left switch’s uplink fails, some of the traffic leaving the ESX server will be black-holed:

With link status as the ESX failure detection method, the ESX server merely tracks physical link state at layer 1; the ability of the upstream switch to forward frames is not taken into account:

Link-state tracking configured on the switch will convey this uplink failure to the link directly connected to the ESX server. Let’s get our switch configured correctly (which is stupidly simple):

First, define your link state group globally:

Switch(config)#link state track 1

Then define your upstream links within the link state group:

interface GigabitEthernet1/0/1

link state group 1 upstream

Lastly, define your downstream links:

interface GigabitEthernet1/0/2

link state group 1 downstream

Now the upstream link state will be conveyed to the downstream links, which will cause the link to the ESX server to be shut down in the event the upstream switch link goes down. Interfaces are coupled in :

Once the upstream link failure occurs and the interface is marked as down, the resulting action created by link state tracking is to bring down all downstream interfaces:

By bringing down the physical state of the interfaces to the ESX servers, the action by ESX link status tracking will be to initiate a pNIC failover event:

This will in turn create a long and happy relationship between network and system administrators and eliminate another instance of finger pointing when redundancy fails to function correctly.

If you can tell me of a more understated topic on the CCIE Routing and Switching v4.0 lab blueprint than Optimized Edge Routing (OER), I’ll buy you a beer. This was quietly snuck into the blueprint in between policy-based routing and redistribution, both fairly straightforward topics. Should be no big deal right? False.

OER removes the rigidity of standard IP routing, where typical routing metrics are derived from physical layer measurements and in turn dictate a generic routing policy for all traffic. OER does this by gathering higher-level performance metrics through IP SLA and NetFlow information and using them to determine the optimal exit point for certain destination prefixes or traffic classes. Once the ideal exit point has been decided, routing policy is dynamically updated to influence the specific traffic class.
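As a rough illustration of the decision OER makes, the sketch below picks an exit per prefix from measured delay alone. It is not how IOS implements OER: the exit names, prefixes, and delay values are invented, and real OER policies can weigh other metrics besides delay.

```python
def best_exit(delay_by_exit: dict) -> str:
    """Return the exit link with the lowest measured delay."""
    return min(delay_by_exit, key=delay_by_exit.get)


# Hypothetical measurements: prefix -> {exit link: delay in milliseconds}
measured_delay_ms = {
    "10.1.0.0/16": {"ISP-A": 80.0, "ISP-B": 35.0},
    "10.2.0.0/16": {"ISP-A": 20.0, "ISP-B": 60.0},
}

# One exit decision per prefix, instead of one generic policy for all traffic.
policy = {prefix: best_exit(exits) for prefix, exits in measured_delay_ms.items()}
print(policy)
```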

Navigating through the configuration guide for OER can be daunting but configuration can be broken down into 5 steps:

1. Profile

The selection of a subset of traffic to optimize performance

Learns the flows passing through the router with the highest delay or throughput

Statically configure class of traffic to performance route

2. Measure

Once traffic has been profiled, metrics need to be generated against it. This is done through:

Passive monitoring – measuring performance of a traffic flow as the flow is traversing the data path

“Avoiding these types of problems is really quite simple: never announce the information originally received from routing process X back into routing process X.”

And it truly is that simple. Always mark/tag/color routes based on their source routing domain and when redistributing, select which routes to redistribute. After all, routes are merely destination information. It’s all about who needs to know and from whom they need to know it.
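The tag-and-filter rule can be modelled in a few lines. This is a sketch of the idea only, not router configuration: the `Route` type, the `redistribute` helper, and the tag values 110 and 90 are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Route:
    prefix: str
    tag: Optional[int] = None  # marks the routing domain the route came from


OSPF_TAG, EIGRP_TAG = 110, 90  # arbitrary example tag values


def redistribute(routes, source_tag, target_tag):
    """Redistribute routes from the source domain into the target domain."""
    result = []
    for route in routes:
        if route.tag == target_tag:
            continue  # originated in the target domain: never send it back
        # Untagged routes are native to the source domain, so color them now.
        result.append(route if route.tag is not None else replace(route, tag=source_tag))
    return result


# OSPF -> EIGRP: the native OSPF route gets colored with the OSPF tag.
into_eigrp = redistribute([Route("10.1.0.0/16")], OSPF_TAG, EIGRP_TAG)

# EIGRP -> OSPF: the OSPF-colored route is filtered, the native EIGRP route passes.
eigrp_table = [Route("172.16.0.0/16")] + into_eigrp
into_ospf = redistribute(eigrp_table, EIGRP_TAG, OSPF_TAG)
print(into_ospf)
```

Only 172.16.0.0/16 makes it into OSPF; 10.1.0.0/16 is dropped because it carries the tag of the domain it would be re-entering.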

IPv6 unique local unicast addresses are the equivalent of IP version 4 RFC 1918 space in most ways and are formatted in the following fashion:

7-bit Prefix – FC00::/7

1-bit Local bit (position 8) – Always set to “1”…for now

40-bit “kinda-almost-unique” Global ID

16-bit Subnet-ID

64-bit Interface ID
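To see how those five fields pack into 128 bits, here is a small sketch using Python's standard `ipaddress` module. The function name and the example Global ID, Subnet-ID, and Interface ID values are made up for illustration.

```python
import ipaddress


def ula_address(global_id: int, subnet_id: int, interface_id: int) -> ipaddress.IPv6Address:
    """Assemble a unique local address from its fields."""
    if not 0 <= global_id < 2**40:
        raise ValueError("Global ID must fit in 40 bits")
    if not 0 <= subnet_id < 2**16:
        raise ValueError("Subnet-ID must fit in 16 bits")
    if not 0 <= interface_id < 2**64:
        raise ValueError("Interface ID must fit in 64 bits")
    prefix = 0b1111110  # the 7-bit FC00::/7 prefix
    local_bit = 1       # currently always 1, which gives FD00::/8
    value = (
        (prefix << 121)
        | (local_bit << 120)
        | (global_id << 80)
        | (subnet_id << 64)
        | interface_id
    )
    return ipaddress.IPv6Address(value)


print(ula_address(global_id=0x123456789A, subnet_id=0x0001, interface_id=1))
# fd12:3456:789a:1::1
```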

The intention and scope of these addresses is for unicast-based intra/inter-site communication. The definition of a “site” within the plethora of IPv6 RFCs is slightly ambiguous but in the case of RFC 4193, the demarcation of a “site” is between ISP and customer. According to the RFC, unique local unicast addresses are permitted to be used between “sites” i.e. customer-to-customer VPN communication but the FC00::/7 prefix is to be filtered by default at any site-border router. This space is not intended to be advertised to any portion of the internet.

Now the interesting portion of this RFC is the recommended algorithm for generating a realistically unique yet theoretically common 40-bit Global ID for your local unicast addresses. Section 3.2.2 recommends the following:

Obtain the current time of day in 64-bit NTP format

e.g. reference time is C029789C.45564D4E

Obtain an EUI-64 identifier from the system running this algorithm

e.g. bia of C201.0DC8.0000

Concatenate the time of day with the system-specific identifier in order to create a key
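The steps above, plus the remaining ones from RFC 4193 section 3.2.2 (SHA-1 the key, keep the least significant 40 bits, prepend FC00::/7 with the Local bit set), can be sketched as follows. The function names are hypothetical, and the MAC-to-EUI-64 expansion assumes the modified EUI-64 rule from the IPv6 addressing architecture (insert FFFE, flip the universal/local bit).

```python
import hashlib
import ipaddress


def mac_to_eui64(mac: bytes) -> bytes:
    """Expand a 48-bit MAC into a modified EUI-64 (assumed expansion rule)."""
    if len(mac) != 6:
        raise ValueError("MAC address must be 6 bytes")
    return bytes([mac[0] ^ 0x02]) + mac[1:3] + b"\xff\xfe" + mac[3:]


def ula_prefix(ntp_time: bytes, eui64: bytes) -> ipaddress.IPv6Network:
    """Derive a /48 unique local prefix from a timestamp and an EUI-64."""
    if len(ntp_time) != 8 or len(eui64) != 8:
        raise ValueError("expected a 64-bit NTP time and a 64-bit EUI-64")
    key = ntp_time + eui64                  # concatenate time and identifier
    digest = hashlib.sha1(key).digest()     # 160-bit SHA-1 digest of the key
    global_id = digest[-5:]                 # least significant 40 bits
    packed = b"\xfd" + global_id + bytes(10)  # FC00::/7 + L bit = 0xFD
    return ipaddress.IPv6Network((ipaddress.IPv6Address(packed), 48))


ntp_time = bytes.fromhex("C029789C45564D4E")          # reference time from above
eui64 = mac_to_eui64(bytes.fromhex("C2010DC80000"))   # bia from above
print(ula_prefix(ntp_time, eui64))
```

The same inputs always produce the same prefix, so the uniqueness comes entirely from the timestamp and the hardware identifier.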

Also included in the RFC are sample probabilities of IPv6 address prefix uniqueness depending on the number of peer connections to a site. It’s safe to say if you experience an overlap using this method to assign Global IDs, play the damn lottery. While this method almost eliminates any overlap possibility between sites, the Global IDs generated with this method are hardly “pretty” numbers and there will undoubtedly be folks assigning Global IDs of ::1/40. If you have ever gone through a merger/acquisition with IPv4, do yourself a favor and follow academia for assigning your Global IDs.