LahNet#


Friday, May 23, 2014

I always love hearing about little IOS tips and tricks that are quick, easy, and provide lots of useful information. One that I use occasionally when I'm on a new Cisco IOS device is the command "show tech-support | include show".

What this gives you is all of the 'show' commands that are executed when you run 'show tech-support'. Pretty useful to see what commands Cisco TAC deem essential for troubleshooting, without having to scroll through hundreds of pages of output from the 'show tech-support' command.

Here's what commands are executed on a Cisco 3850 running IOS-XE 03.03.03SE with IP Services license:
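
The original list isn't reproduced here, but as a rough illustration of the format - 'show tech-support' prints a section header containing the word 'show' before each command's output, which is exactly what the filter catches - the first few lines look something like this (the exact command list varies by platform, license, and software version):

3850#show tech-support | include show
------------------ show clock ------------------
------------------ show version ------------------
------------------ show running-config ------------------
<illustrative sample only - dozens more 'show' section headers follow>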

Thursday, May 22, 2014

Well, I've been busy studying for the CCIE R/S lab for the past 6 months (off and on with work duties), which also means I haven't been posting here as much as I'd like to.

I reattempted on May 8th to sneak it in before the v4 to v5 change on June 4th, but unfortunately it was another failure. I had my first attempt in the fall of 2011, but was never able to reattempt quickly due to some major work projects I had taken on. This attempt, however, I was very close to passing; after some post-lab reflection, I believe my primary downfall was time management. It can be easy to forget during the lab that it's designed to be very difficult to complete 100% in the allotted time. That forces the test-taker to solve the lab as a whole, but to skip questions that don't affect the core route/switch configs if time becomes an issue - which it does, between referencing documentation and tracking down issues from misconfigs (or whatever else comes up).

I had wanted to attend a bootcamp class before this last attempt, but the timing just didn't work out for various reasons. Now I'm more determined than ever to get my #, so I'm off to an INE bootcamp in mid-June in Seattle! I have some work to do to get caught up on the new v5 topics, but it's a relief that many of the legacy technologies have been removed or moved to the written exam only. Hopefully this bootcamp will be the final push over the edge to victory!

Wednesday, January 29, 2014

I came across a unique issue a while ago that I thought would make a great blog topic: the Nexus 5500/2248 platforms and a server cluster attempting to sync/peer through the use of IP multicast. Strangely, the cluster would constantly drop adjacencies, and it was a bit of a mystery. Being an IT consultant who works with customers to design and implement data center infrastructure, most of the time we (the consultants) don't have any background info on lesser-known or custom applications. To compound that even further, many times sysadmins are not very network savvy and do not understand how the application operates from a low-level network perspective.

This particular issue all started late one night during a cutover to move server connections from an old Brocade switch to a new Nexus infrastructure (this cluster was just a fraction of the servers migrating). Initially all connections were migrated to Nexus 2248TPs hanging off of Nexus 5596UPs [FWIW, running NX-OS version 5.1(3)N1(1)], and all servers appeared to be working just as they were before. Once the sysadmin started looking deeper into this particular server cluster, it was found that cluster adjacencies would form, then fail for no apparent reason.

Based on that description alone, I immediately started checking for physical link errors, speed/duplex settings, and logs on the Nexus for any indication of problems. Of course, that would be too easy! The links were error free, the logs were clean, the links negotiated at 1000/full, and to top it off, the interface packet counters for the servers were incrementing like the servers were operating just fine. And they were, somewhat, since the sysadmin had no issue logging into them; it was just this application cluster operation that was failing. The cluster had been operating just fine on the Brocade switch - which was L2-only, and essentially a dumb switch.

Me: "Ok Sysadmin, what does the application cluster software need from the network in order to operate?"

Sysadmin: "Well I believe it's multicast."...after further digging in documentation... "Yes it is multicast and it's using multicast group 224.1.1.1."

Me: "Are the adjacencies just never forming, or are some partially up?"

Sysadmin: "It appears some of the servers form a adjacency, but then a few minutes later it drops. It appears to keep cycling through randomly"

Now I had more info with which to further isolate the problem, and extra details about the failure that was occurring. What did the server switchport configs look like?
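
The original config isn't reproduced here, but a minimal sketch of what these access ports looked like (the interface number is taken from the logs later in this post; the rest is assumed):

interface Ethernet141/1/22
  description server cluster member
  switchport mode access
  switchport access vlan 200
  spanning-tree port type edge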

Ok, that's a pretty plain-jane server config. Interesting to note that VLAN 200 in this case is a non-routed VLAN, meaning there is no SVI, router, or any other L3 gateway in that VLAN. The Nexus 5596UPs in this instance did not have the L3 module either. No L3 device on the VLAN - that could be a problem - let's investigate IGMP, which is what hosts use to communicate multicast group membership.
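
A couple of NX-OS commands for checking this (a sketch of the checks, not the original output):

N5K# show ip igmp snooping vlan 200          <- snooping status and querier info for the VLAN
N5K# show ip igmp snooping groups vlan 200   <- which ports have joined which groups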

Ok, so all the server ports in VL200 show in the IGMP Snooping table, and Snooping is enabled by default. There's no L3 device to respond to hosts on the VLAN with IGMP Queries (hosts use IGMP Reports to request a multicast group), which is the communication that keeps an intermediary switch with IGMP Snooping 'in-the-know' about the multicast needs on the VLAN. Also of note: there's no MAC address learned for the multicast group that the server is trying to join.

What I really like about NX-OS is the very detailed logs (debug level) that it stores for just about every process running, at all times.
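
The annotated records below came from the IGMP snooping event-history, pulled with something along these lines (the exact keywords vary a bit by platform and release):

N5K# show ip igmp snooping event-history vlan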

There are 0 multicast sources listed to exclude - meaning any source will do

IGMPv3 report with 1 multicast group seen from host 10.11.200.11 on Eth141/1/22

OIF (Outgoing Interface) Eth141/1/22 for (*,224.1.1.1)

There is no IGMP Proxy info stored on the N5k

From Cisco N5k documentation - "The [IGMP] proxy feature builds the group state from membership reports from the downstream hosts and generates membership reports in response to queries from upstream queriers."

The IGMP packet is forwarded to 'router-ports' - the only one in this system is the vPC Peer-Link

Exactly 3 minutes later (this was a consistent timer, but I can't find any documentation on why - the IGMP group timeout default is 260sec and the Querier timeout default is 255sec; a bug, maybe?):

Still no IGMP proxy records, and since the Nexus never 'saw' an IGMP Query, the 'Noquerier timer expired, remove all groups in this vlan.'

Boom! This correlates to the constant up/down of multicast adjacencies the sysadmin was seeing. We were then also able to watch a particular server form a successful peer adjacency and then drop it, with matching timestamps.

Using the debug command will give you slightly more information than the 'event-history' command (note this output is on the peer N5k, receiving the IGMP packet from the Peer-Link Po999):
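
The debug was along these lines (a sketch - the exact debug keywords vary by NX-OS release, and as always, be careful running debugs on production gear):

N5K# debug ip igmp snooping vlan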

So how do we fix this issue? Well there are a few ways, but in this instance I added Static IGMP Snooping mappings for the 224.1.1.1 multicast group to each server switchport (there were only a handful of ports).

Other methods to fix this would be:

- Add an L3 gateway to the VLAN to reply to the IGMP messages so that snooping works correctly
- Configure a manual IGMP Snooping Querier (for situations like this, where there is no PIM running because the traffic isn't routed)
- Disable snooping for that VLAN altogether

I definitely didn't want to disable snooping, since we don't want that traffic flooded throughout the VLAN, and due to the existing customer layout (and security) I also did not want to create a routable entry point into that VLAN. Going the route of a manual IGMP Snooping Querier would have been an option, but that would have required locating an IP address to use late at night (IPAM is usually an afterthought for many). Using static mappings on specific interfaces allows the most granular control, but at the cost of a little extra complexity (document!).
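
For reference, here's roughly what the fix looked like (a sketch, not the exact change-window config; on newer NX-OS trains the static join lives under the 'vlan configuration' context, while older releases put it under the VLAN itself):

vlan configuration 200
  ip igmp snooping static-group 224.1.1.1 interface Ethernet141/1/22
  <...one static-group line per server switchport...>

N5K# show ip igmp snooping groups vlan 200   <- verify the static joins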

So as you can see from the configuration and output above, the static IGMP Snooping mappings were added, the 'sh ip igmp snooping groups' command showed the server ports joined to the proper group, and the MAC address table showed a multicast MAC for the server ports. Once the statics were in place, the sysadmin immediately saw the application cluster form all of its adjacencies, and everything remained stable.

Monday, November 18, 2013

As you may have read by now, Cisco has announced their first big 'SDN' (Software Defined Networking) solution, named ACI (Application Centric Infrastructure), which tightly pairs with the Nexus 9000 line (announced along with ACI). However, as with most product announcements made far in advance of the actual product release, the technical details are few and far between. I recently had the opportunity to attend a conference where I sat in on an ACI and Nexus 9000 breakout session with presenter Joe Onisick (www.definethecloud.net), a Cisco TME for ACI/N9k.

From the discussions that followed, these were the interesting points and thoughts that stuck out to me about ACI and the N9k:

As Cisco has already stated, the N9k will be shipping soon, but it won't be able to run in ACI-mode until 2HCY14. The upgrade from standalone-mode (standard NX-OS) to ACI-mode will be a major upgrade, as the whole underlying OS/firmware is completely different. No ISSU upgrade.

The N9k and ACI are currently a Data Center-only solution, in a Clos fabric design (Spines and Leafs) with the APIC controller (Application Policy Infrastructure Controller). It was not designed to replace Core, WAN-edge, or Campus network environments - it will likely expand to these other environments after the technology gains momentum in the DC space. The whole concept of SDN is still very early in its infancy - at least for everyone who isn't Google.

The N9k will be priced very competitively - partly due to the use of merchant silicon and mid-plane elimination - but, I would say, more importantly due to the DC-focused scope of software functionality. Technologies like OTV, LISP, etc. will still require an N7k or ASR. Design guides will become available on how to integrate the ACI DC infrastructure with other areas of the network. Since it's using VXLAN as an overlay, there will certainly be VXLAN-gateway functionality to provide that integration.

40G BiDi optics - man, these are great (also announced along with ACI and the N9k)! 40GE over a single pair of OM3 MMF (good for 100m) using essentially CWDM, but with only 2 waves (20G each). And Cisco is able to manufacture and sell them very cost effectively. This could be a major Cisco differentiator when 40G becomes more of the norm. Businesses already have a lot of sunk cost in their fiber cable plants - would they rather replace/add on to accommodate the 12-strand MTP fiber cables for MMF 40G, or use their existing 10G fiber plant? A 'no-brainer' decision. Some great info on them here: http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps13386/white-paper-c11-729493_ns1261_Networking_Solutions_White_Paper.html

How will the APIC controller look/feel/operate? It's still somewhat of a mystery, but I expect it to be very similar to the successful UCS Manager (configuring network/application policies with various metrics/SLAs). After all, the people at Insieme were also the people who created UCS and the Nexus products.

NSH - Network Service Header - a Cisco vPath-like technology that has been submitted to the IETF as a draft (http://tools.ietf.org/html/draft-quinn-nsh-00). See who co-authored it? Cisco and a certain company Cisco announced they were acquiring at the launch of ACI (Insieme). This appears to be one of the major underlying technologies that the APIC (the controller) will use to chain network services (firewalls, load balancers, etc). vPath is a really cool technology that the Cisco Nexus 1000v uses to communicate with VMware ESX and virtual network appliances (VSG, vASA, etc) for logically 'chaining' network services. That makes the Cisco AVS (Application Virtual Switch, also announced along with ACI) fit quite nicely into the mix, as it's essentially a Nexus 1000v that communicates with the ACI infrastructure. And because NSH has a fixed header, it can easily be implemented in hardware - essentially performing the same function as the N1000v and vPath, but with the ability to have hardware ASICs participate in the service chaining.

Starting to see the potential of ACI now? There are still lots of technical details missing - and, for that matter, the actual product. It'll be very interesting to see how the market reacts to ACI and VMware's NSX. VMware has already released NSX, but will customers adopt it? Will NSX be production-ready by the time ACI/APIC are released? Will customers see the need for tighter integration with network and other hardware (VMware has stated that they are working with networking vendors on interop, but how well will that turn out)? These are all questions that come to mind in the race to see who wins SDN in the DC. The next couple of years in the networking field are going to be really interesting.

Saturday, October 5, 2013

Typically, in a routing and switching infrastructure of any size that real-time and business-critical applications rely on, QoS is an absolute must. One of the main advantages of buying Cisco equipment is the extensive set of services that IOS can provide - one of which is granular QoS control.

As real-time services - namely voice and video - increasingly converge onto the IP network, QoS is becoming even more important to ensure a quality end-user experience. Gigabit and multi-Gigabit (EtherChannel bundle) uplinks are becoming more saturated as users and businesses increase their need for data-intensive network connectivity. Queue the requirement for QoS!

Unfortunately, between the different Business Units within Cisco that are responsible for the various Catalyst Switch products (Catalyst 2960, 3560/3750, 4500, 6500 - and the Nexus lines), there is a decent amount of feature and hardware disparity. QoS configurations between the different products can be very different, which makes QoS in switched environments cumbersome to understand and easy to forget. This is especially true since there is often an abundance of bandwidth available in a LAN, making it easy for a network engineer/admin to discount the need for QoS.

Network Management tools may suggest to an engineer that a link is not congested; however, these tools rely on SNMP to poll the device for interface stats, taking the delta of the counters to show a rate. Often these tools cannot poll any faster than 30-second intervals, and they are usually set to 1-5 minute intervals. The result is really an average and doesn't account for spikes in utilization or microbursts. Once these spikes and microbursts of data become too large for the device buffers, packets drop. In the case of real-time traffic, even buffering of the data will cause a degradation of service, because buffering this traffic introduces delay and jitter.

Fortunately, Cisco has a great design guide for QoS called the "Medianet Campus QoS Design 4.0", also known as the QoS Solution Reference Network Design (SRND) 4.0 guide. Here are the web and PDF links to that document:

Most often, the amount of granular control explained in the SRND 4 guide for campus switches is not needed, thanks to the simple feature known on Cisco switches as Auto-QoS. Auto-QoS on switches is essentially a macro that applies all of the recommended configurations from the QoS SRND 4 guide. The Auto-QoS feature simplifies QoS for switches to just a few commands, and probably covers close to 95% of the QoS needs an enterprise might have. Since it's just a macro, the actual configurations created are easily modifiable for any specific or custom requirements.

There can be a few hiccups when using Auto-QoS, and one that I've run into on numerous occasions happens when trying to apply the Auto-QoS generated policies to an EtherChannel interface or the physical interfaces bundled into it.

The information in this article is in reference to Auto-QoS VoIP for the Cat4500-X.

Now let's look at the problems that arise when attempting to configure Auto-QoS for an EtherChannel. The following are the configurations for the 2 physical interfaces with Auto-QoS already applied that we want to bundle into an EtherChannel:
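
The interface configs aren't reproduced here, but structurally they looked like this (the output policy name appears in the errors below; the input policy name is left as a placeholder since it isn't shown):

interface TenGigabitEthernet1/15
 service-policy input <AutoQos-generated-input-policy>
 service-policy output AutoQos-4.0-Output-Policy
!
interface TenGigabitEthernet1/16
 <identical to Te1/15>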

This is what happens when attempting to configure the ports into the EtherChannel:

4500X(config)#int range te1/15-16
4500X(config-if-range)#channel-group 1 mode active
% The attached policymap is not suitable for member either due to non-queuing actions or due to type of classmap filters.
TenGigabitEthernet1/15 is not added to port channel 1
% Range command terminated because it failed on TenGigabitEthernet1/15
4500X(config-if-range)#

The error from IOS says the policy-map attached to the interface is not suitable for a channel member, either due to 'non-queuing actions' or due to the type of class-map filters in the QoS policy-map configuration applied to the interface. The EtherChannel configuration was not applied because of this error.

Ok, so what if we just apply the 'auto qos trust voip' command to the Port-Channel interface itself? Maybe that will work (nope):

4500X(config)#int po1
4500X(config-if)#auto ?
% Unrecognized command

Damn, that would have been nice, Cisco (hint hint). Ok, what happens if we remove the policy-map configurations from the physical interfaces, add them to the EtherChannel, and then reapply the policy-maps to the physical interfaces?
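
In CLI terms, the attempt was along these lines (same placeholder for the input policy name as above):

4500X(config)#int range te1/15-16
4500X(config-if-range)#no service-policy input <AutoQos-generated-input-policy>
4500X(config-if-range)#no service-policy output AutoQos-4.0-Output-Policy
4500X(config-if-range)#channel-group 1 mode active
4500X(config-if-range)#int po1
4500X(config-if)#service-policy input <AutoQos-generated-input-policy>
4500X(config-if)#service-policy output AutoQos-4.0-Output-Policy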

Ok, so at least something was configured that time. We also received a different error, suggesting that the output policy-map has queuing actions, which are only supported on physical ports. Let's try applying only the output policy-map to the physical interfaces, and leave the input policy on the Port-Channel interface:

4500X(config-if)#int range te1/15-16
4500X(config-if-range)#service-policy output AutoQos-4.0-Output-Policy
% A service-policy with more than one type of marking field based filters in the class-map is not allowed on the channel member ports.

Yet another error. Fun. Now there appears to be an issue with the ACLs that are applied to the class-map.

So as not to drag this troubleshooting scenario out any further, these are the limitations you need to work around when adapting the Auto-QoS generated policy for EtherChannel QoS:

- Output policing needs to be configured on the Port-Channel interface, and separated from any queuing.
- Output queuing needs to be configured on the physical interfaces.
- The class-maps for the queuing policy-map can only have one type of match statement (i.e. an ACL, or matching on QoS tags) per class-map.
- The policing policy-map cannot use the 'police ... percent' form of the police command.

The following is a functional QoS policy I adapted for use on EtherChannels; I tried to match the SRND 4.0 configs as closely as possible (most of it comes straight from the SRND configs):
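
The full policy isn't reproduced here, but structurally it splits out like this (a sketch honoring the limitations above; every name except AutoQos-4.0-Output-Policy is illustrative):

interface Port-channel1
 service-policy input PC-Input-Policy        <- classification/marking (illustrative name)
 service-policy output PC-Policing-Policy    <- policing only, separated from any queuing
!
interface range TenGigabitEthernet1/15 - 16
 service-policy output AutoQos-4.0-Output-Policy   <- queuing; one match type per class-map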

Since the last post about reviving this blog, I have been studying for the CCIE RS Written exam again, both to recert my existing Professional-level certifications and to be qualified to attempt the lab again. The good news is I passed the written exam a week ago! There's some renewed interest in going for the IE, so we'll see how the studies go over the next few months.

In other news I have been exploring other options for the blog host to get better design and layout options. Wix.com seemed like a great alternative to Blogger, but their 'blog' widget is still a work in progress. Hopefully soon!

Thursday, June 27, 2013

So it's hard to believe, but it's been 2 years since I last posted on here. Lots of things have happened since then - professional and personal.

I did end up taking the CCIE RS Lab in November of 2011, but unfortunately did not pass. The Troubleshooting section was very difficult, even for someone who considers themselves skilled in the art-of-tshoot, but the Config section was very reasonable. I've been pecking away at re-studying to retake the lab, but I've been lead engineer on a couple of large data center upgrades that have consumed a LOT of my time.

Almost daily, I weigh the benefits of committing several more months to studying for this cert versus using that time to learn other things (security/wireless/DC, programming, Linux). Now, with the prospects for SDN looking very high, I'm starting to see a decline in the value of the CCIE (for me personally).

Back in May of 2013, I purchased the lah.io domain a few hours after Google announced they were promoting the status of the .io TLD for search (to the same level as .com, etc), and I was able to pick up this 3-character domain out of the last hundred or two still available (lucked out!).

There's a bit of renewed interest in blogging again; however, posts will be focused more on day-to-day technology and interesting bits learned from my consulting job (for a major Cisco partner).