We all know and love HSRP. It multicasts HSRP messages to peer routers on 224.0.0.2 / 0100.5e00.0002, and it responds to ARP queries for the virtual IP with one of 3 options:

The HSRP standard MAC in the range 0000.0c07.acXX, where XX is the HSRP group number

A manually configured address with standby mac-address xxxx.xxxx.xxxx

The burned-in address with standby use-bia
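As a sketch, here's roughly how those three options look in IOS interface configuration. The interface, addresses, and group number are hypothetical; the virtual MAC noted in the comment follows the standard HSRP pattern:

```
interface Vlan10
 ip address 192.0.2.2 255.255.255.0
 standby 1 ip 192.0.2.1
 ! Default: virtual MAC is 0000.0c07.ac01 (group 1)
 ! standby 1 mac-address 4000.0000.0001   <- option 2: manual address
 ! standby use-bia                        <- option 3: burned-in address
```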

Other than HSRP coordination traffic (hellos and whatnot) and ARP replies (unicast and gratuitous broadcast on HSRP takeover), I wasn't expecting anything else to come from an HSRP router. But it sends frames to the peculiar STP uplinkfast address too! Could L2 switches running uplinkfast be listening to HSRP routers? Are they doing something with the information in those frames?

Uplinkfast?

Uplinkfast is a Cisco proprietary enhancement to 802.1D (slow) spanning tree. Switches configured for uplinkfast will identify an alternate root port: One that's currently in blocking mode, and which isn't self-looped. If the root port fails, the backup port is put directly into forwarding mode. It skips the time-consuming listening and learning phases.

Any MAC addresses learned on the old root port are moved to the new root port in the forwarding table. They don't need to be re-learned.

Additionally, the switch sends bogus Ethernet frames out the new root port. These frames are stamped with spoofed source addresses belonging to client systems on the switch's designated (downstream) ports. The purpose of these frames is to update the forwarding table on the upstream switches, so they'll forward traffic correctly for our switch's downstream clients, which are now attached to a different spot in the L2 topology.
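For reference, enabling this behavior is a single global command on an access-layer Catalyst. The rate-limit knob shown is the one I believe governs how fast those spoofed frames are sent; treat the number as the documented default rather than a recommendation:

```
! Global configuration on the access switch:
spanning-tree uplinkfast
! Optional: cap the rate of spoofed station-update frames (default 150 pps)
spanning-tree uplinkfast max-update-rate 150
```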

What's in these spoofed frames? Consider:

Our switch is an L2 device. It doesn't know anything about its clients' IP addresses, and might not even have an IP address of its own. IP packets are out.

That's okay, because the goal is to update L2 forwarding (mac-address) tables.

He wants to update every bridge in the spanning tree that's reachable through his root port, so unicast frames are out.

Broadcast frames will be delivered to end stations, who might try to do something with them. Some IP implementations are built like a house of cards, so it would be good to send frames unlikely to be processed by an IP stack. Broadcast frames are out.

When an uplinkfast transition occurs, the switch spoofs frames from each downstream client. The frames are sent to the uplinkfast multicast address 0100.0ccd.cdcd, and flooded throughout the upstream portion of the spanning tree. The upstream switches don't need to (and probably shouldn't!) be running uplinkfast, and don't even need to be Cisco switches. The regular MAC learning mechanisms implemented on any learning bridge will update the L2 forwarding table appropriately.

It doesn't really matter what's in a spoofed uplinkfast packet, but the Catalysts in my lab send two frames for each client:

The first is an 802.3/SNAP encapsulated frame with mostly unrecognizable contents

The second is an Ethernet II encapsulated frame carrying a typecode of ARP, formatted like an ARP frame, but with nonsense contents.

I don't think that Cisco switches apply any special handling to these uplinkfast frames, but I can't be sure about it. On the surface, there doesn't seem to be any requirement for anything beyond the standard learning mechanisms.

So, what does HSRP have to do with this?

When an HSRP router transitions to the active state, it can face a few challenges, mostly in the areas of L2 topology (upstream switches may still be forwarding frames for the virtual MAC toward the old active router) and ARP tables (clients may hold stale IP-to-MAC mappings, particularly with use-bia or a manually configured MAC address).

A single gratuitous ARP packet will solve all of these problems, provided that we don't have a paranoid client that only processes solicited ARP replies. So that's what the router does. He sends a gratuitous ARP, announcing the HSRP IP/MAC mapping to the all-ones (broadcast) Ethernet address. In fact, he sends 3 of them, with a 3-second pause after each one.

But that's not all. Immediately after each broadcast ARP, the newly minted active HSRP router sends another ARP to the STP uplinkfast reserved multicast MAC address. These frames are properly formatted, and except for the different destination MAC (uplinkfast) and ARP target MAC (also uplinkfast) fields, they're identical to the broadcast ARP frames.
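To make the "identical except for two fields" observation concrete, here's a sketch that builds both frames in Python. The virtual IP is a hypothetical example; the virtual MAC follows the standard HSRP group-1 pattern, and the uplinkfast destination is the address observed above:

```python
import struct

def mac(s):
    """Parse 'aa:bb:cc:dd:ee:ff' into 6 bytes."""
    return bytes(int(b, 16) for b in s.split(':'))

def ip(s):
    return bytes(int(o) for o in s.split('.'))

def gratuitous_arp(dst_mac, vmac, vip):
    """Build an Ethernet II frame carrying a gratuitous ARP reply
    announcing vip -> vmac, addressed to dst_mac."""
    eth = mac(dst_mac) + mac(vmac) + struct.pack('!H', 0x0806)
    # htype=1 (Ethernet), ptype=0x0800 (IP), hlen=6, plen=4, op=2 (reply)
    arp = struct.pack('!HHBBH', 1, 0x0800, 6, 4, 2)
    arp += mac(vmac) + ip(vip)       # sender hardware / protocol address
    arp += mac(dst_mac) + ip(vip)    # target hardware / protocol address
    return eth + arp

VMAC = '00:00:0c:07:ac:01'           # HSRP group 1 virtual MAC
VIP = '192.0.2.1'                    # hypothetical virtual IP

broadcast = gratuitous_arp('ff:ff:ff:ff:ff:ff', VMAC, VIP)
uplinkfast = gratuitous_arp('01:00:0c:cd:cd:cd', VMAC, VIP)

# Only the destination MAC and ARP target MAC differ between the two.
```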

I pondered this for a while... An HSRP state transition could have been precipitated by an L2 topology change... Is there some circumstance where it would be useful to have a port in listening mode learn the router MAC anyway? Or maybe have a port in learning mode forward the new router's frame?

I can't figure it out, but there's got to be some reason that Cisco programmed this behavior into their routers, right?

The TAC case has been open for 6 weeks. Explanations they've given include:

The router needs to update L2 forwarding tables. ...Okay, but the broadcast frames do that job equally well, and the broadcast frames are sent first. Why code in this crazy multicast address, which appears to be reserved for a different purpose altogether?

The L2 switches hear the frames, and adapt the shape of the resulting spanning tree to better accommodate the gateway router. ...What, what? How does this work?

The L2 switches respond to the special MAC, and flush the source from their CAM tables. Exactly the opposite of learning. ...Not according to my tests they don't. And if they did, this would lead to ridiculous learn/flush/learn/flush/learn/flush gyrations.

We don't know why this behavior was programmed in. ...Okay, at least this is easy to believe.

If you have any clue why Cisco might have written the uplinkfast address into HSRP code, please share!

Tuesday, November 23, 2010

A couple of years ago I configured a topology for a business partner extranet much like the one sketched below.

No dynamic routing was allowed on the firewall. Layer 9 didn't trust it to run an IGP, so the firewall was configured with static routes:
- Known internal nets (registered and 1918 space) pointed in
- Default route pointed out

Two eBGP sessions were configured to learn business partner prefixes (not shown) from the external switch, and redistribute them into the IGP. It was a small number of prefixes, and they were thoroughly filtered and quantity-limited, making things safe for the IGP.

But it didn't work correctly: Only one BGP session could be brought up at a time, but never both at once.

The cause of the error took me more hours of head-scratching than I care to admit. In my defense, the topology was actually quite a bit more complicated than depicted here. Presented here is the bare minimum required to recreate the problem.

The problem was neither a firewall policy issue, nor a typo. Any typos here are just typos.

Can you spot my mistake? Which session comes up, and what's wrong with the other one?

Wednesday, October 27, 2010

This is an update to the Amazon EC2 IPsec tunnel to Cisco IOS router post I made several weeks ago. Amazon has changed the offering a bit, and neither all of the commands nor the distribution I previously used are still available.

That's it! Now I can ping the private ($EC2PRIVATE) address of the EC2 instance from one of my internal machines at home. This works in my environment because the 10.x.x.x address assigned by Amazon happens to fall within the default route in use by my home gateway. You may need to add a static route if you're pushing the 10/8 block elsewhere in your environment.

Being able to talk securely to the private address is preferable to using the public one because of applications (SIP, FTP) that embed IP address information into their application payload. These don't NAT well, and now they don't have to.

If you want to be able to talk securely to the public address of an EC2 instance, that can probably be done with a dummy interface on the EC2 end. I'll work on that later.

Tuesday, October 19, 2010

This is part 3 in the IPv4 multicast series. Part 1 covered scoping and address assignment. Part 2 covered scoping and RP placement.

The traffic scopes we've defined are: building, campus, region and enterprise. This article explains the strategy we're going to use when bolting these various scopes together with MSDP.

Our hypothetical enterprise has nine sites, arranged according to the following diagram. There's a lot going on in this diagram. It's all meaningful, and I'm going to talk through a good portion of that minutia, but want to begin with the following: The lines on the diagram represent MSDP peering. MSDP is a multihop protocol, so the lines have nothing to do with layer1/2/3 topology, nor with PIM neighbor relationships. This is just the map of how routers around the enterprise share multicast metadata among themselves.

The small colored circles represent individual routers which act as RPs. Their color indicates the scope of data for which the RP holds top-level responsibility.

Each RP is in a building (blue oval). Buildings are in a campus (green oval). Campuses are in a region (red oval). All regions are within the enterprise (no boundary shown).

The lines connecting between RPs indicate the scope of MSDP source-active (SA) advertisements flowing between those routers. The colored lines never cross their same-colored scope boundaries:

Blue lines represent building scope SA's, so they never cross out of building (blue oval)

Green lines represent campus scope SA's, so they never cross out of a campus (green oval)

Red lines represent region scope SA's, so they never cross out of a region (red oval)

Our enterprise has two data centers in the Chicago campus: Building 1 and Building 2. They're responsible for propagating enterprise scope data everywhere, as well as being responsible for the various smaller scopes that they happen to fall in. The RPs in building 3 and 4 share only building scope data among themselves. They exchange SA messages for all larger scopes only with their upstream RPs in the data centers.

The Tokyo office is the only site in the Asia-Pac region, so the Tokyo RPs share building, campus and regional multicast data only among themselves, as explained in Part 2. Enterprise scope data is shared only with the RPs in Chicago.

Finally, I want to call your attention to the pair of RPs in Paris. They share building and campus data among themselves, and share regional data with their upstream RPs in London and Madrid. But with whom do they talk about enterprise flows? Rather than going straight to the source (Chicago 1 and 2), they go to their upstream in-region peers. That's the model for all of this peering: Every router may peer only with his direct neighbor, or with a router one layer above or below in the hierarchy. This isn't the only way to do it, but it's the model I've chosen for this hypothetical deployment.

Monday, October 18, 2010

I've recently discovered that VMware runs its physical NICs in promiscuous mode. At least, I think I have made that discovery.
There's a lot of chat out there about VMware and promiscuity, but it's usually devoted to the virtual host side of the vswitch. On that side of the vswitch, things are usually pretty locked-down:

No dynamic learning of MAC addresses (don't need to learn when you know)

This leads to frustration for people trying to deploy sniffers, intrusion detection, layered virtualization and the like within VMware, and it's not what I'm interested in talking about here.

I'm interested in something much more rudimentary, and which has always been with us. But which has begun to vanish.

History Lesson 1

On a truly broadcast medium (like an Ethernet hub), all frames are always delivered to all stations. Passing a frame from the NIC up to the driver requires an interrupt, which is just as disruptive as it sounds. Fortunately, NICs know their hardware addresses, and will only pass certain frames up the stack:

Frames destined for the burned-in hardware address

Frames destined for the all-stations hardware address

You're probably aware that it's possible to change the MAC address on a NIC. It's possible because the burned-in address just lives in a register on the NIC. The contents of that register can be changed, and in all likelihood were loaded there by the driver at initialization time anyway. The driver can load a new address into this register.

In fact, most NICs have more than one register for holding unicast addresses which must be passed up the stack, allowing you to load several MAC addresses simultaneously.

History Lesson 2

Multicast frames have their own set of MAC addresses. If you switch on a multicast subscriber application, a series of steps happen which culminate in the NIC unfiltering your desired multicast frames and passing them up the stack. This use case is much more common than loading multiple unicast addresses, and hardware designers saw it coming before they allowed for multiple reconfigurable unicast addresses.

This mechanism works in much the same way that an EtherChannel balances load among its links: Deterministic address hashing. But it uses a lot more buckets, and works something like this:

Driver informs the NIC about the multicast MAC that an upstream process is interested in receiving.

The NIC hashes the address to figure out which bucket is associated with that MAC.

The NIC disables filtering for that bucket.

All frames that hash into the selected bucket (not just the ones we want) get passed up the stack.

Software (the IP stack) filters out packets which made it through the hardware filtering, but which turn out to be unwanted.
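The steps above can be sketched in a few lines of Python. Real NICs typically hash with selected bits of an Ethernet CRC-32 of the address; using zlib.crc32 modulo the bucket count is a simplified stand-in, and the 4096-bucket size matches the server-class NICs discussed below:

```python
import zlib

NUM_BUCKETS = 4096  # 12-bit hash table, as on the NICs described here

def bucket(mac_bytes):
    """Hash a destination MAC into a filter bucket (simplified model)."""
    return zlib.crc32(mac_bytes) % NUM_BUCKETS

class MulticastFilter:
    def __init__(self):
        self.enabled = set()          # buckets with filtering disabled

    def join(self, mac_bytes):
        """Driver tells the NIC about a group; NIC unfilters the bucket."""
        self.enabled.add(bucket(mac_bytes))

    def passes(self, mac_bytes):
        """Would the NIC pass a frame with this destination up the stack?"""
        return bucket(mac_bytes) in self.enabled

f = MulticastFilter()
f.join(bytes.fromhex('01005e000001'))   # subscribe to 224.0.0.1's MAC
# Any other MAC that lands in the same bucket also passes; the IP stack
# must filter out those false positives in software, per step 5 above.
```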

Modern implementations

Surprisingly, nothing here has changed. I reviewed data sheets and driver development guides for several NIC chipsets that are currently being shipped by major label server vendors. Lots of good "server class" NICs include 16 registers for unicast addresses and a 4096-bucket (65536:1 overlap) multicast hashing scheme.

And VMware fits in how?

Suppose you're running 20 virtual machines on an ESX server. Each of those VMs has unique IP and MAC addresses associated with it. But the physical NIC installed in that server can only do filtering for 16 addresses!

The only thing VMware can do in this case is to put the NIC (VMware calls them pNIC) into promiscuous mode, then do the filtering in software, where hardware limitations (registers chiseled into silicon) aren't a problem.

It's good news for the VMware servers that they're (probably) not plugged into a hub, because the forwarding table in the physical switch upstream will protect them from traffic that they don't want.

Promiscuity in NICs is widely regarded as suspicious, performance impacting, and a problem. ...and 101-level classes in most OS and network disciplines cover the fact that NICs know their address and filter out all others. The idea that this notion is going away came as a bit of a surprise to me, and makes a strong argument for:

Okay, so 16 addresses per NIC isn't quite so dire. A big VM server running dozens of guests probably has at least a handful of NICs, so the ratio of guests+ESX-overhead/pNIC_count might not be higher than 16 in most cases.

VMware could handle this by using unicast slots one-by-one until they're all full, and only then switching to promiscuous mode.

I've only found one document that addresses this question directly. It says:

Friday, October 15, 2010

Search queries that have led readers here, conversations with customers, and various blogposts, have made it clear that people are running full-speed into an interesting feature of the Cisco Nexus Fabric Extenders: FEX ports always run bpduguard. You can't turn it off.

For the uninitiated, Bridge Protocol Data Units (BPDUs) are the packets used by bridges (switches) to build a loop-free topology. BPDU Guard is a Cisco interface feature that immediately disables an interface if a BPDU arrives there. It's appropriate on interfaces where you never expect to plug in a switch. If you've ever plugged a switch into your cubicle jack at work, you may have experienced this feature firsthand.

BPDU Guard tends to go hand-in-hand with portfast, a feature that makes switch interfaces ready for use as soon as they link up, instead of forcing these fresh links to jump through the Spanning Tree Protocol (STP) loop prevention hoops.

If you're going to use portfast, you must use bpduguard.

Nexus Fabric Extenders (2148, 2248, 2232) run bpduguard on their interfaces all the time. It can't be disabled.

Who cares?

BPDU Guard means that you can't hang a switch from a FEX. If you've adopted Cisco's vision of the modern data center, this can become a problem because the 2148 fabric extenders can only do gigabit. No 10/100 capability here. It turns out there's a lot of stuff in a modern data center that still can't do gigabit, but lives out in the general population that's intended to be served by the Fabric Extenders:

HP server iLO interfaces

Terminal server appliances

Power strips

Environmental monitors

KVM equipment

The best option for these small, far-flung clients might be the installation of a small, 10/100 capable switch nearby. I selected the WS-C2960-24TC-S for this purpose in a recent build because it's super cheap (list price is $725) and because it has dual-purpose uplink ports. The natural inclination is to try to uplink directly into the nearby 2148T fabric extender. But, as soon as you do, the 2960 sends a BPDU, and the 2148 shuts down the interface.

Now what?

You could disable spanning-tree on the 2960, but that's asking for trouble, and makes redundancy impossible. You could disable spanning tree on just the uplink interface because it's safer, but still doesn't accomplish redundancy. You could link the small switch directly to the distribution layer, but that's a lot of buck for relatively little bang.

Another possible answer is flexlinks, a long-forgotten uplink redundancy mechanism that probably predates stable STP operation. I had assumed so, but it doesn't look like flex links is quite that old. I don't know why this feature was introduced, but it's useful here.
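For reference, a minimal flexlinks configuration on the 2960 looks something like this; the interface numbers match the scenario here, and the command makes Gig0/2 a hot standby for Gig0/1:

```
interface GigabitEthernet0/1
 switchport backup interface GigabitEthernet0/2
```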

Now, Gig0/2 backs up Gig0/1. Spanning tree protocol is disabled on these two ports, so BPDU Guard won't cause a problem, and there's no risk of these ports creating a topology loop because only one of them will be in forwarding mode.

Plug both interfaces into a fabric extender and you're in business.

The interfaces can be trunks or they can be access ports. If they're trunks, you can even balance the vlans across the uplinks if you're so inclined (personally, I don't care for it, but this is a very common strategy).

A topology change will flood bogus packets which appear to be sourced from your client systems so that upstream switches update their forwarding tables. ...A process that can be made even quicker with the 'switchport backup mmu' mechanism - but I doubt the Nexus supports that anyway.

Going forward:

100Mb/s is really the only requirement. I can only think of two devices that are limited to only 10Mb/s in any of the networks I work on. And both of those are in my basement, which has not yet been migrated to the Nexus platform. Fortunately, the Nexus 2248 Fabric Extenders can do 100/1000, so you'll be much less inclined to try to hang a switch off of them. Plus they're cheaper, and have a few other benefits over the 2148T. As far as I'm aware, there's not a single technical reason to prefer a 2148 over the 2248, so stop buying 2148s.

Wednesday, October 13, 2010

Unicast traffic in Cisco IOS devices is usually forwarded by the CEF mechanism, which by default will load balance traffic across multiple equal-cost paths.

Multicast traffic is not so lucky. Multicast flows are attracted to receivers by routers in the path. If a subscription needs to be sent upstream toward a multicast source, and multiple equal-cost paths exist, PIM will send the subscription to the upstream router with the highest IP address.

Load Balancing Multicast Flows

In the scenario below, R1 has two equal-cost paths to the 192.168.10.0/24 network, but R1 will send all PIM joins for all 8 flows in the direction of R3, because of R3's higher IP address. The result will be that all 8 multicast flows traverse the R3-R1 link, and the R2-R1 link sits idle.

The fix to balance multicast traffic across both links is to enable one of the ECMP multicast multipathing mechanisms. The simple ip multicast multipath directive in global configuration mode will enable load sharing using the S-Hash algorithm. Much like load sharing in EtherChannels, this is a deterministic hashing mechanism that considers the source IP (for (S,G)) or RP IP (for (*,G)) when performing RPF calculations. Unlike the EtherChannel case (which requires determinism to maintain the ordered delivery LAN invariant), determinism is required here because the RPF calculation performed at PIM-join time must select the same upstream interface as the one used when performing the RPF check on incoming multicast data packets in the future.

In our case, with eight groups evenly distributed across four servers, we're done. But what if there's only one server talking on many groups? Load balancing with S-hash would force all subscriptions for that one server onto one upstream link even though multiple links are available. The next step in load balancing is ip multicast multipath s-g-hash basic. It load balances RPF decisions by taking both the Source IP and the multicast Group address into account, and will satisfactorily balance the few-producers-many-groups scenario.
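Side by side, the global configuration choices discussed so far look like this; only one of the multipath variants would be active at a time:

```
! Default behavior: RPF ties broken toward the highest upstream
! neighbor IP address (no load sharing at all)

ip multicast multipath                  ! S-hash: source (or RP) address
ip multicast multipath s-g-hash basic   ! hash on source AND group
```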

Polarization

Consider a multi-tier network like the one depicted below. R1, R2 and R3 each have two choices (interfaces) when performing RPF lookups for sources on the 192.168.10.0/24 network. For simplicity, I've labeled them "0" and "1".

It doesn't matter whether we use s-hash or s-g-hash algorithm in this example. Assume we've selected one, and applied it to all seven routers. R1 balances the load beautifully: half of the flows are subscribed via upstream link "0" to R2, and the other half are subscribed via upstream link "1" to R3.

What will R2 and R3 do? Remember that the hashing scheme is deterministic. This means that R2 will request all multicast flows from R4. Determinism: Every flow going through R2 is a "link 0 flow", so R2 will always choose R4, because R2's RPF lookup is using the same criteria as R1's. Likewise, R3 will send all join requests to R7. The R5-R2 and R6-R3 links will sit idle.

Polarization Fix

To balance the load equally we need to use different path selection criteria at each routing tier, and ECMP has a mechanism to do this. We can add the next-hop router address to the hashing mix to re-balance the subscription load at each tier. This works because each router in the topology has a unique perspective on the next-hop address. This is implemented with ip multicast multipath s-g-hash next-hop-based in global configuration mode.
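The polarization effect and its fix can be demonstrated with a toy model. The hash function here is a generic stand-in (not Cisco's actual algorithm), and the flow addresses mirror the example topology above:

```python
import hashlib

def pick(paths, *keys):
    """Deterministically pick an upstream path from a set of hash keys.
    Stand-in for the S-G-hash RPF selection described above."""
    h = int(hashlib.sha256('/'.join(keys).encode()).hexdigest(), 16)
    return paths[h % len(paths)]

# Eight flows: four sources on 192.168.10.0/24, two groups each.
flows = [('192.168.10.%d' % s, '239.1.1.%d' % g)
         for s in (1, 2, 3, 4) for g in (1, 2)]

# R1 splits the flows across uplinks "0" and "1".
r1 = {f: pick(('0', '1'), *f) for f in flows}

# R2 sees only the flows R1 hashed to link "0". Because R2 hashes on
# exactly the same keys, every one of those flows hashes to "0" again:
# R2's other uplink sits idle. That's polarization.
r2_flows = [f for f in flows if r1[f] == '0']
r2 = {f: pick(('0', '1'), *f) for f in r2_flows}

# Mixing in each router's unique next-hop address (the
# s-g-hash next-hop-based behavior) breaks the symmetry, so R2's
# choices are no longer forced to mirror R1's.
r2_nh = {f: pick(('0', '1'), *f, 'r2-upstream-next-hop') for f in r2_flows}
```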

Nexus 7000 has a surprising feature when it comes to installing startup configurations: You can install any startup-config you want, as long as it's the running-config.

The Nexus 7000 platform stores the startup configuration in some sort of binary or compiled state, not a flat ascii file like you'd find on an IOS device. I think that when you 'copy running-config startup-config' on the Nexus, your running config gets nvgened, compiled and written, rather than just nvgened and written as would happen on IOS.

Frustratingly, you can't copy to the startup-config from any source other than running-config:

Tuesday, October 12, 2010

When an Ethernet station builds a frame for an IP packet, it needs to know what destination address to put on that frame.

For a unicast IP packet, the sending station uses the destination node's unique MAC address, which it learns through the ARP mechanism.

For a broadcast IP packet, the broadcast MAC address (ff:ff:ff:ff:ff:ff) is used.

But what about multicast IP packets? A unicast MAC address isn't appropriate, because there might be several stations on the segment which are interested in receiving the packet. Conversely, a broadcast frame isn't appropriate, because we'd be bothering systems that don't want to process the packet.

Sensibly, multicast IP packets get encapsulated into multicast Ethernet frames, using a block of addresses from 01:00:5e:00:00:00 - 01:00:5e:7f:ff:ff. RFC 1112 has all the details.

Most network folks have seen this process, and then forgotten it. The times it's come up at work, I've found that people think it's much uglier than it really is. It's a little ugly, but worth learning, and luckily, there's an interesting story behind it.

IP multicast group numbers look like IP addresses. They fit in the "Class D" space from 224.0.0.0 through 239.255.255.255. There are 2^28 unique multicast groups in that range. Unfortunately, there are only 2^23 unique multicast MAC addresses, so there's some overlap which needs to be taken into consideration when handing out multicast groups to applications.

I'm going to cover two historical points here. They're both interesting tidbits that make the multicast mapping rules make sense.

Ethernet frames are structured to make things easy on stations and bridges.

An Ethernet frame doesn't really begin with the destination MAC address. It starts with the preamble, which can be thought of as a way to "wake up" stations on a shared media segment, and get them ready to receive an incoming frame. I think of it like a rumble strip you'd encounter before a highway toll plaza, because it serves a similar function. And because it looks like one. The preamble, along with its partner the start-of-frame-delimiter (SFD), comprises a 64-bit pattern of alternating ones and zeros ending with an errant one: 101010....101011

That pattern-breaking '11' at the end of the SFD indicates that the destination address will begin in the next bit. If you're a bridge, you're going to use the next 6 bytes to make a forwarding decision. If you're a station, you'll use these 6 bytes to decide whether to process the frame or ignore it. The Ethernet designers did this so that the receiving NIC can quickly determine whether the frame is worthy of processing.

But that's not all. The very first bit in those 6 bytes, the bit that comes immediately after the '11' in the SFD, is special: bits within each byte on Ethernet are transmitted least-significant-bit first, so that first bit to arrive is the least significant bit in the first byte of the address, otherwise known as the individual/group bit. If it is a '1', a bridge knows immediately (only one bit into the frame!) that this frame will need to be flooded out all ports. Nifty, and makes very speedy cut-through bridging decisions possible.

If you look at the various hardware addresses in one of your device's mac-address or arp tables, you shouldn't find any stations where the first byte is an odd number because stations must use unicast addresses. An odd numbered first-byte would mean that the individual/group bit is set. The broadcast (all-ones) Ethernet address, appropriately enough, has the bit set. Along with all of the other bits.

Somebody else's tight budget can become your forwarding problem.
The story goes that when Steve Deering was putting RFC 1112 together, he wanted to purchase 16 Ethernet OUIs. Each OUI allows for 2^24 unique addresses, so 16 of them would be required to cover the whole 28-bit IP multicast space. But the budget wouldn't cover 16 OUIs. The budget wouldn't even cover one OUI. Instead, he was able to procure only half of an OUI. So, that's why we map 28 bits of multicast group into 23 bits.

Armed with these two bits of information, we know more than two thirds of the resulting multicast frame. Here's how all 6 bytes of the multicast frame are derived:

Must be an odd number (multicast/broadcast bit is set), happens to be "01". That should be easy to remember now.

Always "00". Memorize it.

Always "5E". Memorize it.

Mapped from the multicast group, keeping in mind that Dr. Deering only procured 7 of the 8 bits in this byte.

Mapped directly from the multicast group.

Mapped directly from the multicast group.
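The whole derivation fits in a few lines of Python. This is the RFC 1112 mapping exactly as described above: the fixed 01:00:5e prefix, then the low-order 23 bits of the group address:

```python
import socket
import struct

def mcast_mac(group):
    """Map an IPv4 multicast group to its Ethernet MAC per RFC 1112:
    01:00:5e followed by the low-order 23 bits of the group address."""
    addr, = struct.unpack('!I', socket.inet_aton(group))
    low23 = addr & 0x7FFFFF   # the half-OUI Dr. Deering could afford
    return '01:00:5e:%02x:%02x:%02x' % (
        (low23 >> 16) & 0xFF, (low23 >> 8) & 0xFF, low23 & 0xFF)

print(mcast_mac('224.0.0.2'))   # 01:00:5e:00:00:02 (all-routers group)

# 28 bits of group squeezed into 23 bits of MAC means 2^5 = 32 groups
# share each frame address -- the overlap mentioned earlier:
assert mcast_mac('224.1.1.1') == mcast_mac('239.129.1.1')
```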

I find that knowing the origin story behind these things makes them much easier to remember than:

The least significant bit of the most significant byte is the multicast/broadcast flag.

The budgetary reasoning behind this technical decision, and the long term implications it has for filtering multicast at L2 is a real bummer.

Sunday, October 10, 2010

I had to build a 2x 10Gb/s aggregate link with one member on a 6704 CFC card and the other member on a 6716 DFC3 card. Not ideal, but it was what I had to work with.

I knew going in that I'd need to configure no mls qos channel-consistency on the port channel interface because these cards have different hardware queueing capabilities. Somehow I forgot this detail immediately prior to the implementation.

The etherchannel interface was up with just the 6704, and I was adding the new link to the mix. My plan was to do show run int tengig x/y on the existing member link, and then copy/paste that configuration onto the new member link.

Everything went fine until I got to switchport trunk encapsulation dot1q, which earned me a %Unrecognized command in reply. I've noticed that on some platforms you can't tell a tagging interface what type of encapsulation to use because you don't have multiple encapsulation options anymore: ISL is obsolete, and support for it is drying up. ...What I hadn't noticed until right then is that support for ISL varies on a per-module basis within a platform: The 6704 could do ISL and required the encapsulation type directive, but the 6716 can't do ISL, and wouldn't accept the encapsulation command. Huh. Encapsulation/decapsulation is a hardware feature. It makes perfect sense, but I'd never considered it before.

The next thing I noticed is that the 6716 wasn't joining the aggregation. I puzzled over this for a while, comparing the running configuration of the two intended link members. With the exception of the 'encapsulation' directive, they were identical, but the 6716 wouldn't join the aggregation.

My problem, of course, was the QoS consistency check. EtherChannel doesn't care whether the configuration on each interface looks the same. It cares whether the interfaces operate the same, and thanks to the different queueing hardware on the two cards, they didn't.

Once I disabled the QoS consistency check, the 6716 link joined the 6704 link, and everything was fine. Of course, I didn't remember to disable the consistency check until immediately after I'd downed the 6716 interface, ripped it out of the channel-group, set the STP port path cost ridiculously high, and brought up the second link as an STP-blocked alternate path.

No change is so simple that it doesn't deserve a detailed execution plan.

In a vanilla PIM deployment, every router knows the one router that serves as RP for any given multicast group. You can have a single active RP (serving all groups), or many RPs, each one serving a different range of groups.

PIM routers can learn the RP address for a given group through one of several mechanisms:

Anycast RP + MSDP
Anycast RP is far more simple than the election-based mechanisms, and lets us do lots of nifty scoping tricks fairly simply. It also saves us from the pain of running multiple different RPs for different purposes. If you're not familiar with anycast, it's a simple concept: Run the same service on the same IP address on multiple different areas in a network, and advertise those IPs into your IGP. Routing protocols will deliver your packets to the closest implementation of that service (IP address). It works great for connectionless services (like PIM or DNS), where it doesn't matter if every packet you send hits exactly the same server.

In the case of anycast RP, we just spin up a loopback interface on every participating router, using the same IP address on each of them. Be careful that the anycast address isn't the numerically highest address among loopback interfaces, or it could get selected as the router ID, which needs to be unique per router. Then manually configure the anycast IP as the RP on all leaf routers in the network, just like you would for static RP configuration.
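A minimal sketch of that configuration, with a hypothetical 10.0.0.10 as the shared anycast RP address:

```
! On every participating RP router:
interface Loopback1
 ip address 10.0.0.10 255.255.255.255

! On every leaf router, static RP configuration pointing at the
! shared anycast address:
ip pim rp-address 10.0.0.10
```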

Leaf routers will now send PIM traffic to the closest RP. But there's a missing bit here:

The missing piece is synchronization between RPs, and that's where Multicast Source Discovery Protocol (MSDP) fits in. MSDP was designed to share information about active multicast sources between the (presumably single) RPs in different administrative domains. You might use MSDP peering with your ISP to learn about active multicast flows out on the Internet, or with a business partner in order to attract flows from their network. Once your RP knows about an active source, it can send subscription requests (PIM joins) in the direction of that source.

A clever use of MSDP is peering between your own Anycast RPs, so that each of the many simultaneously active RPs will know about all active flows in your network:
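A minimal sketch of that peering, with hypothetical addresses. Here RP-A owns unique address 10.0.0.2 on Loopback0 and shares the anycast address 10.0.0.1 on Loopback1; RP-B mirrors the configuration from 10.0.0.3:

```
! On RP-A (RP-B mirrors this, peering back toward 10.0.0.2):
ip msdp peer 10.0.0.3 connect-source Loopback0
! The originator-id must be the router's unique address,
! never the shared anycast address:
ip msdp originator-id Loopback0
```

Sourcing the peering and the originator ID from the unique loopback keeps the Source-Active messages unambiguous, which matters once more than two RPs are involved.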

Anycast RP + MSDP is a great solution. The alternative, Anycast PIM (available on NX-OS), lacks MSDP's rich filtering capability, which is key to the enterprise scoping scheme. You can do interesting combinations of anycast RP with Auto-RP, but it gets complicated quickly, and I don't see a compelling use case for it.

Where do RPs belong?
So, having established that we're going to run lots of RPs, where do they belong exactly? Lots of people spend lots of time thinking about exactly where RPs should go relative to sources and receivers. Someplace between them is ideal, if you can manage it. ...But it probably doesn't matter much. With the default PIM sparse configuration, data won't flow through the RPs, except for a brief moment when subscribers initially come online. If your packet rates are high enough, and your subscribers transient enough, then you should place RPs carefully.

So, where exactly do we need RPs? Start with the smallest routable scope. In our case, it's "building". If there will be intra-building multicast flows, then by definition there must be an RP in each building. And if you need one RP, then you probably need a second one for redundancy. So, that settles it: Two RPs in each building. With wild, crazy and carefully planned MSDP peering to bring them all together.

Thursday, October 7, 2010

This is the first in a series of posts about deploying IPv4 multicast within an enterprise. I'm starting with allocation of multicast group addresses because the way groups get laid out will impact other aspects of the design.

This exercise assumes a large global enterprise running sparse mode PIM with multicast everywhere, and lots of different multicast applications with different relevant scopes. Some applications will never reach beyond the local link, some will multicast between continents, and others fall somewhere in between. This design is a scalable framework. You won't be deploying this whole scheme all at once, but it's helpful to get all of these bits and pieces into place early, leaving room to grow without having to rip things apart later.

I assume we don't have the luxury of using Source Specific Multicast (SSM), a mechanism in which the receivers (or, alternatively, the leaf routers) know the address of the originating endpoints, and ask for them by source IP. Instead, we'll plan for Any Source Multicast (ASM), which requires the placement, configuration and peering of PIM Rendezvous Points (RPs) to bring together active data flows and interested receivers.

IPv4 Multicast falls into the 224/4 address block: 224.0.0.0 - 239.255.255.255. Within this range there are sub-ranges used for various purposes, which will drive our assignment decisions:

RFC 2365 Administratively Scoped IP Multicast

RFC 2365 gives us the "IPv4 Local Scope" 239.255/16. The only restrictions on the use of this scope are that it not be further subdivided, and that it be the smallest scope in use. It's perfect for link-local multicast applications that won't be routed off-net: things like application heartbeat traffic, server load balancing coordination, pricing application backends, etc. Because this traffic won't be routed, the group addresses can be re-used on different subnets. Depending on your perspective, this re-use can make your life easier (production and disaster recovery instances of an application can run with the same backend configuration), or more complicated (trying to keep track of exactly what is using 239.255.5.5 on each subnet). Proceed with caution.

RFC2365 also prescribes the 239.192/14 block for private use within an organization. The block breaks nicely into four /16s, which is exactly the number of routable scopes I'm going to present. If you need more scopes, you can dip into the expansion range described by section 6.2.1 (not recommended), or you can slice the /14 into smaller chunks.

Don't use x.0.0.x or x.128.0.x

There are thirty-two /24s that should be avoided at all costs. These are a byproduct of RFC 1112 section 6.4 and RFC 3171 section 3. These multicast groups map into MAC addresses 01:00:5E:00:00:XX. L2 flooding suppression mechanisms don't work on these groups. They will always flood to all ports in a broadcast domain unless you're using very new and expensive equipment which can constrain L2 multicast traffic based on information in the L3 header.
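The collision is easy to demonstrate. This short Python sketch implements the RFC 1112 mapping: only the low 23 bits of the group address survive, so any group whose second octet is 0 or 128 lands in the 01:00:5e:00:00:xx range that switches always flood:

```python
def mcast_mac(group_ip):
    """Map an IPv4 multicast group to its Ethernet MAC address.

    Per RFC 1112 section 6.4, only the low-order 23 bits of the group
    address are copied into the MAC, so 32 different groups share
    each MAC address.
    """
    octets = [int(o) for o in group_ip.split(".")]
    # The high bit of the second octet is discarded in the mapping.
    return "01:00:5e:%02x:%02x:%02x" % (octets[1] & 0x7F, octets[2], octets[3])

# 239.0.0.5 and 239.128.0.5 both collide with reserved 224.0.0.x space:
print(mcast_mac("224.0.0.5"))    # 01:00:5e:00:00:05
print(mcast_mac("239.0.0.5"))    # 01:00:5e:00:00:05 -- same MAC!
print(mcast_mac("239.128.0.5"))  # 01:00:5e:00:00:05 -- same MAC!
```

Since IGMP snooping deliberately ignores MACs mapped from the 224.0.0.0/24 link-local control range, every group that aliases into that MAC range floods too.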

Internet groups

Most of the 224/4 block is registered space. The applications here could conceivably be delivered to you over The Internet, but more likely will arrive on dedicated circuits from vendors and business partners. A common example of this sort of traffic is market pricing data in financial firms. You won't likely be talking PIM with your ISP, but you might see multicast using registered space at your B2B network edge.

The Scopes

239.255.0.0/16

Link local

Used for non-routable traffic only. Group addresses are universally re-usable.

239.192.0.0/16

Building local scope

Used for applications where the sender and receivers live in the same building. A building-wide public address system might use this scope. Group addresses can be re-used in each building: Perhaps the same PA system is in each building. If you use the same addresses, the application owners won't need a special configuration for each building.

239.193.0.0/16

Campus local scope

The campus local scope works just like the building scope, but has wider reach. Perhaps you're pulling MPEG2 HDTV out of the air and multicasting it onto your LAN. You probably don't want to put these fat streams on the wide area links, so you duplicate the multicasts in each campus. By re-using group addresses, you'll only need a single TV guide for the whole enterprise. Users who tune into 239.193.1.1 will find their local ABC affiliate (for example), no matter which office they're in.

239.194.0.0/16

Region local scope

Works just like Campus and Building scopes, but with national or continental scale.

239.195.0.0/16

Enterprise local scope

The enterprise local scope is for application streams that will be used enterprise-wide. These group addresses are not reusable.

224.0.0.0/4

Internet scope

Subsets of this /4 are registered space. You probably won't be multicasting to or from The Internet anytime soon, but might find yourself forwarding registered applications that arrive on private circuits.

One final detail about these scopes: Let's say you have offices in London, Madrid, Tokyo, and Los Angeles, and a multi-building campus in Chicago.

The Building scopes are easy to identify: Each building is a scope!

The Chicago campus obviously constitutes a campus scope, but what about those one-office cities? They're campuses too. One-building campuses. As applications roll out in LA, pretend you have multiple buildings there, and assign addresses accordingly. If you assign KTLA to a Building-scope multicast group, you'll have to reconfigure things when a new office opens so that those folks won't miss watching any live police chases.

Accordingly, the Tokyo office represents a Building, Campus and Region, all by itself. When the Osaka office opens, the Region-scoped Japanese music-on-hold multicast stream that you deployed in Tokyo will be available for use in Osaka.

PIM BiDir

Finally, we need to carve those scopes up one more time. Bidirectional PIM is a mechanism in which multicast traffic flows in both directions between end stations: everybody is a sender and a receiver at the same time. BiDir PIM doesn't use a shortest-path tree like ASM, so it's good to set aside an address block for it, even if we're not going to use it right away. We'll take the top half of each routable local /16 for BiDir.
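Sketching the carve-up for the building scope, with the same hypothetical RP address used earlier (239.192.128.0/17 being the top half of 239.192/16):

```
! Hypothetical: reserve the top half of the building scope for BiDir
access-list 10 permit 239.192.128.0 0.0.127.255
ip pim rp-address 10.0.0.1 10 bidir
```

The same pattern repeats for the campus, region and enterprise /16s, each with its own group-list ACL.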

Friday, October 1, 2010

Cisco Nexus vPC operation is well documented all over the 'net, so I won't be covering the basics here. Instead, I want to focus on a particular failure scenario, in which the vPC safety mechanisms can indefinitely prolong downtime when other failures have occurred.

Consider the following topology:

Nothing fancy is going on here. Nexus 5000s have 3 downstream (south-facing?) vPC links and a redundant vPC peer-link. The management interface on each Nexus is doing peer-keepalive duty through a management switch.

My previous builds have been somewhat paranoid about redundancy of the peer-keepalive traffic, but I no longer believe that's helpful, and I'll be doing keepalive over the non-redundant mgmt0 interface going forward.

Each Nexus knows to bring up its vPC link members because the peer-link is up, so the activity can be coordinated between chassis. If the peer-link fails, the Nexus pair can still coordinate their vPC forwarding behavior by exchanging state over the peer-keepalive management network.

If a management link (or the whole management switch) were to fail, then no problem. It's the state-only backup to the peer-link, and not required to forward traffic.

If a whole Nexus switch fails, the surviving peer will detect the complete failure of his neighbor, and continue forwarding traffic for the vPC normally.

When the failed Nexus comes back up, he waits until he's established communication with the survivor before bringing up the vPC member links, because it wouldn't be safe to bring up aggregate link members without knowing what the peer chassis is doing.

...And that brings us to the interesting failure scenario: Imagine that a power outage strikes both Nexus 5Ks, but only one of them comes back up. The lone chassis can't reach his peer over the peer link or the peer-keepalive link. He's got no way of knowing whether it's safe to bring up the vPC member links, so they stay down.

If this happened to me in production, I'd probably do two things to bring it back online:

Take steps to ensure the failed box can't come back to life. How do you kill a zombie switch, anyway?

Remove the vpc statement from each vPC port channel interface
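The second step amounts to something like this on each affected bundle (the port-channel and vPC numbers here are hypothetical):

```
interface port-channel 10
  no vpc 10
```

With the vpc statement gone, the port channel is just an ordinary local aggregation, and the lone chassis will bring its members up without waiting to hear from a peer.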

Nortel had a similar problem with their RSMLT mechanism, but that deadlock centered around keeping track of who owns the first-hop gateway address (not HSRP/VRRP/GLBP). They solved it by recording responsibility for the gateway address into NVRAM (flash? spinning rust? wherever it is that Nortel records such things).

Tuesday, September 28, 2010

Cisco switches have a nifty but little-used diagnostic feature: The 'traceroute mac' command.

It does pretty much what you'd expect. It traces the L2 path between two endpoints. Exactly how it accomplishes this feat is much less obvious. Normal (Layer 3) traceroute makes use of progressively larger TTL values, and uses the ICMP "time to live exceeded" errors from routers along the path in order to print the path between two nodes. These mechanisms don't exist in a bridged environment. So how does it work?

Consider the following topology:

Running traceroute mac on the rightmost switch produces the following result:

The procedure for doing this work manually is straightforward. We look for each MAC address in the VLAN 11 forwarding table, then check to see whether the egress port has a CDP neighbor. If so, log into the next switch (using the management address reported by CDP). Lather, rinse, repeat.
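The manual version looks something like this on each switch in the path (the MAC address and port here are hypothetical):

```
! Find the egress port for the station's MAC:
show mac address-table address 0011.2233.4455 vlan 11
! Check whether a CDP neighbor hangs off that egress port:
show cdp neighbors GigabitEthernet0/24 detail
! If one exists, connect to the management address CDP reports and repeat.
```

When the egress port has no CDP neighbor, you've found the edge port where the station lives.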

The manual procedure works on each MAC address independently. The 'traceroute mac' command, however, requires you to specify both source and destination stations. It's curious: L3 traceroute doesn't do that; it assumes you want to trace the path from here to somewhere else. 'traceroute mac', on the other hand, can trace from somewhere else to somewhere else. Here we run a trace from a switch that isn't in the transit path between the two end stations.

In that case, we were logged into the 2950, but traced the L2 path between two stations that were each connected directly to the 2960. Two L2 hops away. The operation of this tool is somewhat mysterious. Here's what I can tell about it so far:

Both MAC addresses must appear in the forwarding table on each switch in the path.

Each switch uses CDP to figure out what next hop lies beyond its local egress port (and the IP address on which it can be reached).

It's an L3 process. The right switch in the figure above is in a different management subnet than his neighbors. These switches all forward traffic directly at L2, but they talk amongst themselves via the router-on-a-stick. This is different than the operation of CDP (a link layer protocol).

If more than one CDP neighbor appears on an interface in the path (several switches hanging from a hub), the process blows up because it's impossible to discern the next hop. You can replicate this easily by changing the hostname of a switch. The neighbors will see two entries until the old name times out.

The switch running the trace communicates directly with each device in the path. The first example involved:

An exchange between the 2960 and the 3550

An exchange between the 2960 and the 2950

The process requires a service running on UDP 2228 on each switch (see it with 'show ip sockets')

If there's no CDP neighbor on a port (like when I switched CDP off on the 2950), then that's the end of the trace.

The wire format is undocumented as far as I can tell. The wireshark wiki page for CDP mentions the protocol, but doesn't have any information on it. It's not CDP, but it's similar. Each packet seems to have some fixed fields, and some TLV sets.

It's very surprising to me that mapping out an L2 environment can be done using L3 (off subnet) tools with seemingly no security. I'm not much of a believer in security by obscurity (I generally run CDP on edge ports), but this level of network mapping without even requiring an SNMP read-only string seems like it could be a problem. The only hint of a complicating factor here is that the name of the target switch is embedded in the request packet. If that name is checked before a reply is sent, there's some small measure of security. But all an attacker needs is the name of a single switch, since that first switch will give up the names of all of his neighbors.

It will be difficult to strike the balance between security and usability when writing ACLs for this service, since you need to protect every IP interface on an L3 switch, while still providing service to clients on every IP interface of every L2/L3 switch.
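One possible shape for such a filter, with hypothetical addresses. The awkward part, as noted above, is that it would need to be applied inbound on every L3 interface of every switch:

```
! Hypothetical: allow the UDP 2228 trace service only from the
! management subnet, block it from everywhere else
ip access-list extended PROTECT-L2TRACE
 permit udp 10.10.10.0 0.0.0.255 any eq 2228
 deny   udp any any eq 2228
 permit ip any any
```

On platforms that support it, a control-plane policy would be a cleaner place to hang this filter than per-interface ACLs.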