Archive for the ‘Networking’ Category

If you’ve purchased this twinax cable: https://www.fs.com/products/48948.html – or something similar on fs.com, then you’ll likely plug it in and never get the link to come up, even though your device is listed in the compatibility matrix. There is a resolution: set your duplex and speed manually on both ends.

I have seen this question around the Internet, and a lot of the information is old and incorrect as of the date of this blog post. Thus, what I will link here is the official support matrix Cisco provides for the Cisco Nexus 9000 series switches and the supported Cisco Nexus FEX models. One thing to note: if you hover over the “YES” in the cell you choose, it will show you the supported connectivity options.

Now, to answer the question above: yes, you can connect the Cisco Nexus 2348UPQ to the Cisco Nexus 93180YC-EX using 40GbE from the FEX uplink ports to the 40GbE ports on the 93180YC-EX. If you are reading the guidelines and limitations and see something like “FEX is not supported on 40GbE ALE uplinks,” understand that the Cisco Nexus 93180YC-EX does not use the older-generation ALE ASIC; it uses the newer, second-generation Cloud Scale LSE (Leaf-and-Spine Engine) ASIC; thus, this limitation does not apply.

Never thought I would be writing about how to utilize IPv6 in 2017 because of all the excellent material on the Internet; however, I have discovered a few things:

There are still technologies which have horrible support for IPv6 (including new stuff)

There are people still resistant to implementing it

There is material on the Internet which shows up early in Google searches which references deprecated standards

Without any further delay, I am going to outline a few items you should keep in mind when deploying your IPv6 network:

Subnet mask size

In IPv6, barring a few exceptions like point-to-point links, you should always use a /64 for each deployed subnet. Why? If you want to use DHCPv6, you’ll find Microsoft’s implementation won’t even allow you to change from a /64, and while a DHCPv6 server on Linux will actually run with a prefix longer than a /64, it will only hand out /64s. You’ll also find that anything longer than a /64 breaks a lot of the auto-configuration mechanisms in the switch/router, namely around EUI-64, and just doesn’t make sense.
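To make the EUI-64 point concrete, here is a small Python sketch (the prefix and MAC address are made-up examples) showing how SLAAC-style autoconfiguration builds a host address, and why it needs the full 64 bits of interface ID that only a /64 leaves available:

```python
import ipaddress

def eui64_interface_id(mac: str) -> int:
    """Derive the 64-bit EUI-64 interface ID from a 48-bit MAC:
    insert ff:fe in the middle and flip the universal/local bit."""
    octets = bytes(int(b, 16) for b in mac.split(":"))
    eui = bytes([octets[0] ^ 0x02]) + octets[1:3] + b"\xff\xfe" + octets[3:6]
    return int.from_bytes(eui, "big")

def slaac_address(prefix: str, mac: str) -> ipaddress.IPv6Address:
    """Combine a /64 prefix with an EUI-64 interface ID, SLAAC-style."""
    net = ipaddress.IPv6Network(prefix)
    assert net.prefixlen == 64, "EUI-64 autoconfiguration expects a /64"
    return net[eui64_interface_id(mac)]
```

Anything longer than a /64 leaves fewer than 64 bits for the interface ID, and the EUI-64 value simply doesn’t fit, which is exactly why those auto-configuration mechanisms break.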

What subnet size should I get from ISP/provider/administrator?

If you’re not going to “own” your IPv6 network (that is, you’re not getting an assignment with an ASN to advertise), you’re either looking to obtain a public block of addresses for use, or you’re internal and need your network administrators to assign you a prefix which you can further subnet yourself. There is a standard most follow to assign prefixes to “customers”.

An ISP, for instance, may have numerous /32’s (or maybe a bit larger) assigned to them to distribute to customers. Let’s say you work for “company”, you’re in the internal IT organization, and “company” uses “ISP”. Your company would request an IPv6 block assignment from the ISP. From one of the ISP’s /32’s you’ll get, let’s say, a /48 just for the hell of it. This is how your company can break that /48 down internally for assignment:

65,536 = /64’s

32,768 = /63’s

16,384 = /62’s

8192 = /61’s

4096 = /60’s

2048 = /59’s

1024 = /58’s

512 = /57’s

256 = /56’s

128 = /55’s

64 = /54’s

32 = /53’s

16 = /52’s

8 = /51’s

4 = /50’s

2 = /49’s

How your company doles these out is up to them. However, almost no one is going to just directly carve individual /64’s out of the assigned /48 block; that is stupid. Generally, you’re looking to summarize and aggregate where possible throughout your network, so we’ll assume you’re in location “A” at “company”.

We’ll go ahead and assume the company has decided each location is assigned a /58, which gives each location a total of 64 available /64’s to use. As you see, this is no different than standard IPv4 in the sense of ensuring proper aggregation, except you’re no longer having to worry about the size of each VLAN’s subnet mask: you’ll always use a /64.
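A quick way to sanity-check the table and the per-location math above (the /48 and /58 figures are just the example values from this post, not a recommendation):

```python
def subnets_in(parent_len: int, child_len: int) -> int:
    """Number of child prefixes of length child_len inside one parent prefix."""
    assert child_len >= parent_len
    return 2 ** (child_len - parent_len)

# A /48 carved into per-location /58s gives 1024 locations...
locations = subnets_in(48, 58)
# ...each of which holds 64 /64 subnets:
per_location = subnets_in(58, 64)
```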

What about private IPv6 address space?

If you do not want a Globally Unique IPv6 address, you can instead use what is called a Unique Local IPv6 Address (ULA). There is a standard algorithm for properly generating these prefixes, which mixes in the time of day along with other factors to make a collision extremely unlikely.

Why does this matter with private address space? Have you ever been involved with a merger/acquisition, or had to join two offices together which use the same private IPv4 subnet range? I need not say any more, because this can be a PITA! Thus, ULA, when done right, ensures this will essentially never happen; however, there is absolutely nothing stopping you from selecting your own, basic, prefix.

IPv6 ULA uses the FC00::/7 prefix, divided into two groups:

fc00::/8 – The idea was for this prefix to be administered by some central authority, but no one can agree on it, so just forget about it

fd00::/8 – Defined for the generation of /48 prefixes only, using 40 random bits to generate a unique prefix according to the algorithm in RFC 4193

You will want to use the second option. You can use online generation tools like the one from SixXS, or a tool from another resource; either way, make sure it generates a proper /48 prefix for you and is, to some degree, RFC 4193 compliant.
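As a rough illustration, here is a Python sketch of the RFC 4193 idea. It is not a byte-for-byte implementation of the RFC (which hashes an NTP-format timestamp together with an EUI-64); it just shows the shape of the algorithm: hash the time plus a machine identifier and keep 40 bits for the Global ID:

```python
import hashlib
import time
import uuid

def generate_ula_prefix() -> str:
    """Sketch of RFC 4193-style ULA generation: fd00::/8 plus a
    pseudo-random 40-bit Global ID derived from a hash of the
    current time and this machine's MAC (uuid.getnode() stands in
    for the EUI-64 the RFC describes)."""
    seed = time.time_ns().to_bytes(8, "big") + uuid.getnode().to_bytes(6, "big")
    digest = hashlib.sha1(seed).digest()
    global_id = digest[-5:]                      # keep the low 40 bits
    prefix = bytes([0xFD]) + global_id           # fd00::/8 | Global ID
    groups = [int.from_bytes(prefix[i:i + 2], "big") for i in range(0, 6, 2)]
    return ":".join(f"{g:x}" for g in groups) + "::/48"
```

Because the Global ID is pseudo-random, two organizations generating ULA space independently are overwhelmingly unlikely to collide, which is the whole point during a merger.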

Finally, your company’s IT department is likely to have this /48 already and is very likely to have assigned you a prefix according to the same standards by which they dole out their Globally Unique IPv6 addresses; thus, no additional explanation is needed.

I won’t delve into this much more, other than to say you absolutely must make sure your DNS infrastructure is set up for IPv6 AAAA-record and IPv6 PTR-record resolution, or you WILL have issues!

One area to ponder is which hostnames will resolve when you’re in a dual-stack environment. Do you want the same hostname to return both an A-record and a AAAA-record? Some say no, some say yes. Me? I say you should discuss this with your vendor to ensure their solution doesn’t have a problem with it, especially in a dual-stack environment. I was told, by co-workers who know more about VMware vCenter than I do right now, that this is a problem there and the returned hostnames must be different in dual-stack environments.

Always research and question IPv6 support on your devices

This goes for hardware and software vendors, many have made claims their stuff works with IPv6; however, what, if any, testing was done isn’t known and there are a variety of scenarios to consider. For instance:

Does it support native IPv6 from installation-to-operation?

Does it support dual-stack, from installation-to-operation?

How does it handle DNS requests in dual stack?

Does the system start with IPv6 AAAA requests and then fail over to IPv4 A-record requests?

If so, what is the timeout if a AAAA record is not available and it must try for an IPv4 A-record?

Is the order of DNS resolution preference configurable? (Can you choose to have IPv4 A-records first?)

What forms of address configuration are available for IPv6? (SLAAC, static, DHCPv6?)

What IPv6 address types are supported? (Globally Unique and/or ULA?)

Are there specific “sections” of configuration which cannot support IPv6?

For instance, in Cisco NX-OS, you cannot reference an IPv6 address for use on a vPC peer keepalive link.
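When quizzing a vendor about resolution order, it helps to know what the OS resolver itself hands back. This minimal Python sketch just lists every address family getaddrinfo yields for a name; the ordering you see reflects the platform’s address-selection policy (RFC 6724 on most systems), which is exactly the behavior the questions above probe:

```python
import socket

def resolve_all(host: str, port: int = 443):
    """Return (family, address) pairs the resolver yields for host,
    in the order the OS prefers them (IPv6 typically sorts first
    on a dual-stack system)."""
    results = []
    for family, _stype, _proto, _canon, sockaddr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        label = "IPv6" if family == socket.AF_INET6 else "IPv4"
        results.append((label, sockaddr[0]))
    return results
```

Running this against a dual-stack hostname on the actual appliance (or its OS) is a fast way to see whether AAAA answers really come back first.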

More questions will come to mind, but these are from experience, and I can promise you there are a lot of reasons why most IPv6 implementations in the enterprise, and the data center, fail. Question all vendors!

This is it for now. I hope this clears up some things for those of you out there who are thinking about your IPv6 implementation.

dnsmasq is both a DNS and DHCP server that is quick and efficient to run on Linux systems and is likely already on your Linux box. If you’re in need of a quick DHCP server to serve multiple DHCP scopes for different subnets in your VLANs, and we all know the best practice is subnet == VLAN == broadcast domain, then dnsmasq is your go-to, and I prefer it over the ISC DHCPD server. This quick tutorial will go over the basics of how to get it set up and running, and assumes you’re not going to utilize the DNS service.
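As a taste of what the finished setup looks like, here is a minimal DHCP-only dnsmasq.conf sketch; the interface name, tag names, and address ranges are placeholders you would swap for your own environment:

```
# /etc/dnsmasq.conf - DHCP-only sketch; all names and addresses are examples
port=0                                   # disable the built-in DNS service
interface=eth0                           # listen on this interface
# one scope per VLAN subnet, tagged so options can differ per scope
dhcp-range=set:vlan10,10.0.10.100,10.0.10.200,255.255.255.0,12h
dhcp-range=set:vlan20,10.0.20.100,10.0.20.200,255.255.255.0,12h
dhcp-option=tag:vlan10,option:router,10.0.10.1
dhcp-option=tag:vlan20,option:router,10.0.20.1
```

dnsmasq picks the scope to serve from based on the subnet the request arrives on, which is what lets one daemon cover multiple VLANs.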

Once this is complete, enable your DHCP service to start automatically. You should also check your system’s firewall/iptables rules to ensure you allow UDP traffic on ports 67 and 68, or you can just flush your iptables and/or disable your firewall; your choice. This isn’t a security blog, so I’ll leave the decision to you, the person who knows their environment best.

First, allow me to say these do indeed exist: the RJ-45 based 10GBase-T SFP+ modules. A company called Methode Electronics manufactures both an SFP+ based module and an RJ-45 X2 module; however, we’ll only really talk about why an RJ-45 based 10GBase-T SFP+ transceiver still isn’t practical for lengths beyond 30m with present technology.

The issues
The number one issue, with the technology current in 2017, is the circuitry required for distances greater than 30m using 10GBase-T SFP+ modules. The sheer number of transistors consumes an enormous amount of energy per port, and the heat generated by the operation of such modules would be monumental, to say the least. At distances greater than 30m, that heat needs to be pulled away from the circuitry, requiring large heat sinks (which increase the bulk of the switch itself) or careful consideration of the airflow characteristics around the SFP+ ports, including faster, higher-volume fans, which in turn consume more energy themselves and further increase the power demands of a switch using 10GBase-T SFP+ modules. X2 modules are indeed out there, but X2 is a different form factor to begin with and I won’t be discussing it here.

Why do I reference 30 meters?
Why do I reference distances greater than 30 meters (30m)? Two reasons: 1. When people want to look at Cat6a/7 for long-haul connectivity (to somewhat approach the distance of multi-mode optics on OM4 fiber cables). 2. Technology at the time of this writing actually permits us to engineer a 10GBase-T SFP+ module for distances of up to 30m using about 2.5W of energy per port. Once again, please look up the company Methode Electronics and their white paper on 10GBase-T SFP+ modules; it’s pretty cool stuff.

Who wants this?
Now, what audience cares about running copper at distances near 100m? In the enterprise market you’ll likely never see anyone think about using copper to span distances close to 100m, especially in the data center, where the copper cross-connect is disappearing in favor of 10/25/40/50/100G fiber cross-connects because the cost of those optics is dropping fast. (When I say 40G here, I am also assuming the use of Cisco 40G BiDi transceivers, because they allow you to utilize existing LC-based fiber infrastructure.) However, service providers are still interested in copper backhaul connections at distances up to 100 meters: if the SFP+ modules are cheap enough, along with the cost of laying the copper, they’ll want to use this. You’ll likely see such connections in the last mile (rather, under a mile, a lot) or between offices or central offices. Once again, price almost always wins; thus, time will tell. So, now you know why you’re just not seeing mass-produced 10GBase-T SFP+ modules on the market.

If you’re looking to use command-line variables for scripting, you have some predefined variables in the NX-OS environment, and you can also create your own. For now, I’ll just show you how to use the most common one: the switch’s hostname. In some environments you’ll have to save the output of a show tech to a file and later upload it via SCP. If you’re doing this on two or more switches, you’ll want unique file names to make your life easier. Instead of going to each switch, you can just use the variable SWITCHNAME in the file name. So, if you’re using a script or something like cluster-ssh, this makes your job easier.
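For example, a sketch of the idea (the SCP server address, path, and username here are placeholders, not real infrastructure): expanding $(SWITCHNAME) gives every switch a unique file name, so the same two lines can be pasted to an entire fleet via cluster-ssh:

```
show tech-support > bootflash:$(SWITCHNAME)-showtech.txt
copy bootflash:$(SWITCHNAME)-showtech.txt scp://admin@192.0.2.10/techs/ vrf management
```

On a switch named leaf-101, the redirect produces bootflash:leaf-101-showtech.txt, and no two switches ever collide on the SCP server.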

If you have upgraded your Cisco Nexus switches to code level 7.0(3)I2(1) or higher and had flowcontrol enabled on an interface, you’ll likely find you’re not able to do a “no flowcontrol receive on” because the command was deprecated. The current recommendation is to default the switch configuration, but I have a solution you can implement one switch at a time, with a single reload, to fix this issue:

So, you’ve surely seen some interesting tidbits in the previous section, things you haven’t noticed in other configurations on the Internet. I will outline why these are present in this configuration based on the failure scenarios I present below:

Complete and total loss of spine connections on a single leaf switch – first, I’ll outline the ONLY reasons why a single leaf switch would lose all of its spine uplinks:

Total and absolute failure of the entire leaf switch

The 40GbE GEM card has failed, but the rest of the switch remains operational

An isolated ASIC failure affecting only the GEM module

Someone falls through a single cable tray in your data center, taking out all the connections you placed in a single tray

Total and complete failure of all 40GbE QSFP+ modules, at the same time

Total loss of power to either the leaf switch or to all spine switches

All three line cards, in three different spine switches, at the same time, suffer the same failure

Someone reloaded the spine switches at the same time

Someone made a configuration change and hosed your environment

OK, now, let’s make one thing clear: NO one, and I mean no one, can prevent any issue that starts with “Someone”; you can’t fix stupid. If you lose power to both of your 9396PX power supplies or to the 3+ PSUs in the 9508 spine switches, I think your problem is much larger than you care to believe. Let’s see, we now have just five scenarios left.

If your leaf switch just dies, well, you know. Down to four! Yes, a GEM card can fail, I’ve seen it, but this isn’t common and is usually related to an issue which will down the entire switch anyway; we’ll keep that in our hat. Failure of all the connected QSFP+ modules at the same time? I’ll call BS on this: if all of those QSFP+ modules have failed, your switch is on the train towards absolute failure anyway.

An isolated ASIC failure? So uncommon I feel stupid mentioning it. All three line cards in the spine failing at the same time? Yeah, right. So, in all, we’re looking to circumvent a GEM card failure that doesn’t also kill the switch, that being the only really valid reason; however, please note, I am only providing this as a proof of concept, and I don’t think anyone should allow their environment to operate in a degraded state. If your environment’s operating status is important to you, consider a different choice of leaf switch for greater redundancy, a cold or warm backup switch, or at least 24x7x4 Cisco SMARTnet.

When a leaf switch suffers a failure of all its spine uplinks, your best course of action, on a vPC-enabled VTEP, is to down the vPC itself on the leaf switch experiencing the failure. This is where the tracking objects against the IP routes, and the track list which groups them for use within the event manager, come into play. Once all the links have gone down (using the boolean AND, triggered by the removal of the BGP peer host addresses from the routing table), the event manager applet named “spine down” fires and shuts down the vPC, loopback0, and the NVE interface, respectively.

When all the links return to operation, there is a 12-second delay (configured for our environment to allow the BGP peers to reach the established state), and then the next event manager applet, named “spine up”, fires, basically just “un-shutting” the interfaces in the exact same order. The NVE source-interface hold-down timer brings the NVE interface UP but keeps the loopback0 interface down long enough to ensure EVPN updates have been received and the vPC port-channels have come to full UP/UP status. If the loopback0 and port-channels come up too long before the NVE interface, we’ll blackhole traffic from the hosts towards the fabric. If the NVE and loopback0 interfaces come up too long before the port-channels, you’ll blackhole traffic in the network-to-access direction; thus, timing is critical and will vary per environment, so testing is required.
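To make the moving parts concrete, here is a rough NX-OS sketch of the tracking objects, track list, and the two applets described above. The object numbers, route-reflector addresses, port-channel number, and the way the 12-second delay is expressed are all assumptions for illustration, not the exact production config:

```
track 1 ip route 10.255.0.1/32 reachability
track 2 ip route 10.255.0.2/32 reachability
track 3 ip route 10.255.0.3/32 reachability
track 10 list boolean and
  object 1
  object 2
  object 3

event manager applet SPINE-DOWN
  event track 10 state down
  action 1.0 cli conf t
  action 1.1 cli interface port-channel 11
  action 1.2 cli shutdown
  action 1.3 cli interface loopback0
  action 1.4 cli shutdown
  action 1.5 cli interface nve1
  action 1.6 cli shutdown

event manager applet SPINE-UP
  event track 10 state up
  action 1.0 cli sleep 12
  action 1.1 cli conf t
  action 1.2 cli interface port-channel 11
  action 1.3 cli no shutdown
  action 1.4 cli interface loopback0
  action 1.5 cli no shutdown
  action 1.6 cli interface nve1
  action 1.7 cli no shutdown
```

The shutdown/no-shutdown ordering in the actions mirrors the vPC, loopback0, NVE sequence the text describes; the NVE source-interface hold-down timer then staggers loopback0 behind nve1 on the way back up.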

A lot of stuff, right? This is all done to prevent the source interface of the NVE VTEP coming up before the port-channels towards the end hosts come up, that is, to prevent the VTEP from advertising itself into the EVPN database and blackholing INBOUND traffic.

You might be thinking: Why not just create a L3 link and form an OSPF adjacency between the two switches to allow the failed switch to continue to receive EVPN updates and prevent blackholing? Well, here are my reasons:

Switchport density and cost per port – if a single switch of 48 10GbE ports costs you $30,000, not including SMARTnet or professional services, you’re over $600/port, and you and I both know you’re not going to use just ONE link in the underlay; you’ll use at least two. A really expensive fix.

Suboptimal routing – let’s be real here, your traffic will now take an additional hop through a switch that is on its way out

Confusing information in the EVPN database for next-hop reachability – because the switch with the failed spine uplinks still has a path and is still receiving EVPN updates, you’ll see it show up as a route distinguisher in the database, creating confusion

It doesn’t serve appropriate justice to a compromised switch – come on, the switch has failed; while not completely, it is probably toast and should be downed to trigger immediate resolution of the issue, instead of using bubble gum to plug a leak in your infrastructure. The best solution is to bring down the vPC member completely, force an absolute failover to the remaining operational switch, prevent suboptimal routing, and prevent confusion while troubleshooting.

I can’t stress this enough: engineering anything other than just failing this non-border, vPC-enabled leaf switch, in the event it is the only switch that has lost all (at least three) spine connections, is an attempt at either designing a fix for stupid, or you’re far too focused on why your leaf switch has failed while ignoring the power outage in your entire data center because you lost main power and someone forgot to put diesel in the generator tanks. Part 3 will include more EVPN goodness, stay tuned!

Ooook, here is another configuration example of the Cisco implementation of VXLAN using BGP EVPN for distributed control-plane operation, anycast gateway, and unicast head-end replication. I am using Cisco 9396PX devices for leaf switches and Cisco 9508 chassis switches for the spine, using iBGP. We’ll explore the basic setup with the leaf switches being vPC-enabled, including the border leaf switches, while also going over a few scenarios which can blackhole traffic and how to avoid this without an OSPF adjacency between the leaf switches.

This blog will assume you understand the basic setup of BGP EVPN VXLAN from reading the great Cisco documentation already available; thus, I presume you’re coming here for a more in-depth, real-world deployment scenario and for better explanations, failure scenario testing, and outputs.

Below, this diagram shows the connectivity in the UNDERLAY network:

Cisco BGP EVPN UNDERLAY

You can see we have three spine switches, two configured as route reflectors for scalability. Below is the configuration of a single spine switch being used as a route reflector; the other route reflector is set up the same way, with different IP addresses and such. The third spine switch has no iBGP peering relationships; it just runs OSPF and forms adjacencies with all VTEPs to advertise VTEP IP reachability.

The above forms the basis of the underlay network on the spine and sets up the route reflectors. We have tuned this for protocol convergence speed; thus, timers are aggressive for BGP, and you’ll notice “link debounce time 0”, which disables link debounce. In a nutshell, the debounce time is how long after a switchport goes down the port waits before notifying the supervisor, 100 ms by default. Disabling it allows immediate notification of a link failure so protocol convergence can start. If you’re worried about an unstable interface, it is quite likely that, in the event of a link flapping, the link-flap detection mechanism will down the port anyway. Finally, we set BOTH the interface medium to p2p and the OSPF network type to point-to-point. Why? In the event someone misses the command to switch OSPF to point-to-point (this interface type is broadcast by default), the medium p2p command changes the port’s operating mode and OSPF will properly adjust to point-to-point; thus, this is just good extra redundancy.
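An underlay-facing spine interface carrying this tuning might look like the following sketch; the addresses, description, interface number, and OSPF process name are examples, not the exact production values:

```
interface Ethernet1/1
  description to-leaf-101
  medium p2p
  link debounce time 0
  ip address 10.1.1.0/31
  ip ospf network point-to-point
  ip router ospf UNDERLAY area 0.0.0.0
  no shutdown
```

The /31 addressing suits the point-to-point links, and setting both medium p2p and the OSPF network type gives the belt-and-suspenders redundancy described above.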

Now, here is the overlay view, which is only relevant to leaf switches, pretend this is an OVERLAY named “Tenant-01”:

A lot to see here, right? This is why I decided to break this into two parts: this is part 1, and my next post is part 2 for border leafs and failure scenarios! Let’s get this initial review over with!

I will just outline all the key points here:

policy-map type qos REST-YOUR-COS-FOR-UCS-FI – This is for those of you who utilize CoS in Cisco UCS and want to maintain your CoS value AFTER your packet is VXLAN de-encapsulated. With this EVPN VXLAN configuration, the original 802.1Q header is stripped at ingress; thus, no CoS value remains. However, any DSCP value set at the virtual switch level is maintained throughout, so we’re assuming you’re marking DSCP at your virtual switch along with CoS and that you have your own mapping from CoS to DSCP. So: create the classes I have above (this is all an example; your mappings may differ), then create a policy-map to match the DSCP value marked by your virtual switch and set the appropriate CoS value. You then apply this as an OUTBOUND QoS policy on the port-channel towards your Fabric Interconnects, but you will have to adjust your TCAM entries for this to work. The other policy, for the CoS-ignorant, is for devices which aren’t smart enough to set either the DSCP or CoS value; just apply it to the interface, inbound, and set your values as needed

fabric forwarding anycast-gateway-mac 0005.0005.0005 – This is for the anycast gateway mac address. You can get “funny” here, but I like to keep it simple, your choice.

fabric forwarding dup-host-ip-addr-detection 5 180 – I set the duplicate host IP detection to 5 moves in 180 seconds for my environment, tune to the values best suited for yours

track objects and object list – I set these to look for the BGP neighbor address of the route-reflectors in the routing table and then assign each of those to the track object list for later assignment to the VPC. Part 2 will show and explain why

hardware tcam entries – Follow these for success in this configuration, especially if you’re in need of using the outbound QOS service policies

VPC peer-keepalive and delay-restore timers – Set to our environment and for specific reasons we’ll explain in part 2

NVE source-interface hold-down – This timer is set to 120 seconds, tuned for our environment, from the default of 300 seconds. I will explain the use of this and why I use 120 seconds in part 2

Loopback0 – Used ONLY for the NVE VTEP interface

Loopback0 secondary address – for vPC enabled VTEPS only, this is the PROXY VTEP address used

Loopback1 – Used ONLY for BGP source-updates

BGP passwords – This is used for security in the underlay; you can also utilize OSPF authentication for extra security

VLAN and interface VLAN 950 – This is used strictly between the vPC switch pairs, in the underlay only. It allows reachability in the event a single switch in the vPC loses all spine links, because a suboptimal route will instantly be placed into the routing table upon failure, allowing continuous reachability for BGP. This is only to allow continuous forwarding and prevent blackholed traffic; you’re still meant to figure out what happened to your spine uplinks
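The backup SVI described in that last bullet can be sketched like this; the VLAN ID matches the post, while the addresses and OSPF process name are assumptions:

```
vlan 950
  name UNDERLAY-BACKUP

interface Vlan950
  no shutdown
  ip address 10.99.99.0/31
  ip ospf network point-to-point
  ip router ospf UNDERLAY area 0.0.0.0
```

VLAN 950 rides the vPC peer-link trunk, so it must be allowed there; because it participates in the OSPF underlay, it is simply a higher-cost path that only wins when the spine uplinks are gone.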

So, like Forrest Gump said to all his faithful followers, “I’m pretty tired… I think I’ll go home now.” See you in Part 2, where the FUN is!!!