Wednesday, December 28, 2011

Tom Hollingsworth published a great post about some of the common NEMA and IEC power connectors that you're likely to encounter when working on network gear in North America. Tom's post inspired me to throw together a short list of gotchas about powering network gear.

North America Power Cord
One of the many power cord choices you'll encounter when ordering a Cisco switch is this one:

CAB-N5K6A-NA

Power Cord, 210/220V 30A North America

This is a post about providing power in North America, so it sounds like a good cable, right? Wrong. The male end of this cable features a NEMA 6-15 plug. It would have been nice of Cisco to mention that detail in the product description, huh? Also, I'm a little puzzled by the "30A" reference in the description. A NEMA 6-15 outlet should be backed by a 15A circuit breaker, and the C13 connector is rated for only 10A, as the sketch below indicates.

It's such a weird cable that Tom didn't bother to mention it in his cable rundown post, and I don't think I've ever seen one of these outlets in the wild. I have, however, seen a customer place a large order, and specify these cables (two of them, actually!) for every top-of-rack component.

Cable Length and Where Is The Inlet?
Consider the following power cord choices for the Nexus 5020 switch (note that this is just 1/3 of the choices for this platform):

CAB-N5K6A-NA

Power Cord, 210/220V 30A North America

CAB-AC-250V/13A

North America, NEMA L6-20 250V/20A plug-IEC320/C13 receptacle

CAB-C13-C14-2M

Power Cord Jumper, C13-C14 Connectors, 2 Meter Length

CAB-9K12A-NA

Power Cord, 125VAC 13A NEMA 5-15 Plug, North America

CAB-C13-CBN

Cabinet Jumper Power Cord, 250 VAC 10A, C14-C13 Connectors

It's pretty common to find power strips with IEC C14 outlets in server racks these days, and the Nexus 5000 has C13 power inlets. A C13-C14 cable seems like the obvious choice. But there are two of them listed here! That's because CAB-C13-C14-2M is 2m long (as indicated), and CAB-C13-CBN is only 0.7m long. Again, it would have been nice of Cisco to mention this detail in the product description.

For a Top-Of-Rack Nexus 5010/5020, CAB-C13-CBN is probably the right choice. The power inlets are on the back (hot aisle side) of the switch, right where we're probably going to find a power outlet. 2m would be way too much power cord for this application.

But what about a Top-Of-Rack Nexus 5500, Nexus 2000, or fixed-configuration Catalyst switch? Those units tend to have the power inlet on the opposite side. A 0.7m cable would be too short, so we should order the CAB-C13-C14-2M for those.

What Voltage?
When talking about NEMA outlets, it's easy to know what voltage you're going to find. NEMA 5-xxx indicates 110V, and NEMA 6-xxx indicates 220V. But the IEC outlets can be unpredictable. They're probably going to be 220V, in spite of the fact that the C13 cords powering the system I'm sitting in front of right now deliver only 110V.

Most data center gear includes auto-ranging power supplies, so it's no problem, but things that weren't intended to live in the data center might have a voltage selector switch. I've seen Sun and Dell workstations blown up by connecting them to 220V power without setting the manual voltage selector switch.

Along those same lines, I always carry one of these adapter cords so that I can charge my laptop in the data center.

It's super-handy, but creates a situation where you might accidentally plug a 110V device into 220V power. Be careful.

Power Supplies: Bigger Isn't Better
Many of my customers will automatically select the biggest available power supply when specifying their chassis-based switches. Hey, bigger is better, and they only cost a little bit more, right?

Large power supplies generally get big power through one of two options:

Multiple power inputs

High current power inputs

The high current power inlet approach can be a problem. The largest removable power cords we have available are the C19 connector type, rated for up to 16A. Power supplies with 30A inlets, like the 4000W unit for the Catalyst 6500 and the 7500W unit for the Nexus 7000, have fixed cords that can't be disconnected from the power supply.

4KW PSU for Catalyst 6500

7.5KW PSU for Nexus 7000
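The wattage math explains why: a detachable cord caps how much power can reach the supply. A rough sketch using nameplate connector ratings (the dictionary and helper function are my own illustration, not an electrical-code calculation):

```python
# Maximum nameplate amps for common IEC 60320 appliance inlets, plus a
# fixed 30A cord for comparison. Illustrative only -- real installations
# must follow electrical-code derating rules.
CONNECTOR_MAX_AMPS = {
    "C13": 10,        # the everyday detachable cord
    "C19": 16,        # the biggest detachable cord mentioned above
    "fixed-30A": 30,  # hardwired cord on the big PSUs
}

def max_watts(connector, volts=220):
    """Upper bound on power through a connector at a given voltage."""
    return CONNECTOR_MAX_AMPS[connector] * volts

# A C19 at 220V tops out at 3520W -- short of a 4000W Catalyst 6500 PSU,
# never mind a 7500W Nexus 7000 PSU. Hence the fixed high-current cords.
```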

When most power supplies fail, you simply unplug the cable from the PSU, replace the PSU, and reattach the cable. With fixed cords, you'll need to fish the whole cord, along with its huge connector, out of the rack. Sometimes pulling that cord out of the rack isn't possible because too many new cables have been run through the penetration in the floor/rack/whatever since the switch was installed. Now what? Chopping off the end of the dead cord is a reasonable option, but how do you get the new PSU installed?

Power supplies with removable cords are often preferable because they simplify operations.

2 PDUs; 2 PSUs; 4 cords -- Now what?
Unfortunately, multiple input power supplies can complicate the initial design. Cisco refers to this set of issues using the phrases "input source redundancy" and "power supply redundancy." The issue boils down to this: how will you mesh the four power cables between two PSUs and two PDUs?

You probably want to run both power cords from each PSU to the same circuit/PDU (assuming the PDU can deliver the required current), but people tend to split each device between multiple PDUs. If/when a PDU fails, suddenly each power supply will drop to half of its previous capacity. If you've configured the switch for full power redundancy mode, then the switch will begin shutting down line cards until the allocated power falls below the threshold represented by half of one power supply. It's ugly, and I've seen it happen more than once.
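The failure math is easy to sketch. This is a deliberately crude model of full-redundancy power budgeting (wattages invented; real chassis power accounting is more involved):

```python
# Crude model of "full power redundancy" budgeting: the chassis only
# allocates line-card power up to what the weakest present supply can
# deliver on its own, so a second failure can't take the box down.
def redundant_budget(psu_watts):
    alive = [w for w in psu_watts if w > 0]
    return min(alive)

# Two PSUs, each with two 3000W inputs. Now PDU "A" fails.
# Split cabling: each PSU loses one input and degrades to 3000W.
split_budget = redundant_budget([3000, 3000])
# Paired cabling: PSU 1 loses both inputs and dies; PSU 2 keeps all 6000W.
paired_budget = redundant_budget([6000])

allocated = 4500  # watts of line cards already powered up (hypothetical)
split_survives = split_budget >= allocated    # cards get shut down
paired_survives = paired_budget >= allocated  # redundancy lost, cards stay up
```

Same hardware, same single failure -- but the split cabling sheds line cards while the paired cabling merely loses redundancy.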

The same thing comes up when configuring 3750X stackable switches with StackPower. Think carefully about how the power cords align with available circuits. Assume you're going to lose a circuit. Will the stack survive? If you configure every switch with dual power supplies, then it'll be fine, but you might as well skip StackPower at that point because you've just made each switch individually redundant.

To solve this problem the StackPower way, you need to make sure that the stack power pool has sufficient capacity even when you're down a circuit. Unfortunately, there's no way to validate power cable to circuit mapping remotely. The only way to make sure that this is done correctly is to visit the closet and trace out the power cords.
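The check itself is worth scripting into your design notes. This hypothetical helper (not a Cisco tool; wattages and draws are example numbers) asks whether the shared pool still covers the stack after a circuit dies:

```python
# Will the StackPower pool survive losing one circuit? Each PSU is a
# (watts, circuit) pair; the surviving pool must cover every switch's draw.
def stack_survives(psus, lost_circuit, switches, draw_per_switch):
    pool = sum(watts for watts, circuit in psus if circuit != lost_circuit)
    return pool >= switches * draw_per_switch

# Four switches, one 715W supply each, alternated across circuits A and B,
# each switch drawing 350W:
psus = [(715, "A"), (715, "B"), (715, "A"), (715, "B")]
stack_survives(psus, "A", switches=4, draw_per_switch=350)  # 1430W pool vs 1400W needed
```

At 350W per switch the stack survives -- barely. Bump the draw to 400W per switch and losing a circuit takes the stack down.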

Tuesday, December 20, 2011

My introduction to enterprise networking was a little backward. I started out supporting trading floors, backend pricing systems, low-latency algorithmic trading systems, etc... I got there because I'd been responsible for UNIX systems producing and consuming multicast data at several large financial firms.

Inevitably, the firm's network admin folks weren't up to speed on matters of performance tuning, multicast configuration and QoS, so that's where I focused my attention. One of these firms offered me a job with the word "network" in the title, and I was off to the races.

It amazes me how little I knew in those days. I was doing PIM and MSDP designs before the phrases "link state" and "distance vector" were in my vocabulary! I had no idea what was populating the unicast routing table of my switches, but I knew that the table was populated, and I knew what PIM was going to do with that data.

More incredible is how my ignorance of "normal" ways of doing things (AVVID, SONA, Cisco Enterprise Architecture, multi-tier designs, etc...) gave me an advantage over folks who had been properly indoctrinated. My designs worked well for these applications, but looked crazy to the rest of the network staff (whose underperforming traditional designs I was replacing).

The trading floor is a weird place, with funny requirements. In this post I'm going to go over some of the things that make trading floor networking... Interesting.

Redundant Application Flows

The first thing to know about pricing systems is that you generally have two copies of any pricing data flowing through the environment at any time. Ideally, these two sets originate from different head-end systems, get transit from different wide area service providers, ride different physical infrastructure into opposite sides of your data center, and terminate on different NICs in the receiving servers.

If you're getting data directly from an exchange, that data will probably be arriving as multicast flows. Redundant multicast flows. The same data arrives at your edge from two different sources, using two different multicast groups.

If you're buying data from a value-add aggregator (Reuters, Bloomberg, etc...), then it probably arrives via TCP from at least two different sources. The data may be duplicate copies (redundancy), or be distributed among the flows with an N+1 load-sharing scheme.

Losing One Packet Is Bad

Most application flows have no problem with packet loss. High performance trading systems are not in this category.

Think of the state of the pricing data like a spreadsheet. Each row represents a security -- something that traders buy and sell. The columns represent attributes of that security: bid price, ask price, daily high and low, last trade price, last trade exchange, etc...

Our spreadsheet has around 100 columns and 200,000 rows. That's 20 million cells. Every message that rolls in from a multicast feed updates one of those cells. You just lost a packet. Which cell is wrong? Easy answer: All of them. If a trader can't trust his data, he can't trade.
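A toy model of that cache makes the failure mode obvious. This sketch invents its own wire format (sequence-numbered cell updates); it's not any vendor's protocol:

```python
# Toy pricing cache: updates are deltas against state you may no longer
# trust. One sequence gap poisons EVERY cell, because you can't know
# which cell the lost update would have touched.
class PriceCache:
    def __init__(self):
        self.cells = {}        # (symbol, field) -> value
        self.expected_seq = 1
        self.trusted = True

    def on_update(self, seq, symbol, field, value):
        if seq != self.expected_seq:   # a gap: SOME cell is now stale,
            self.trusted = False       # but there's no telling which one
        self.expected_seq = seq + 1
        self.cells[(symbol, field)] = value

cache = PriceCache()
cache.on_update(1, "CSCO", "bid", 18.25)
cache.on_update(3, "IBM", "ask", 181.10)   # seq 2 never arrived
cache.trusted                               # False: all 20 million cells are suspect
```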

These applications have repair mechanisms, but they're generally slow and/or clunky. Some of them even involve touch tone. Really.

Because we've got two copies of the data coming in, there's no reason to rush to fix a single failure. If something breaks, you can let it stay broken until the end of the day.

What's that? You think it's worth fixing things with a dynamic routing protocol? Okay cool, route around the problem. Just so long as you can guarantee that "flow A" and "flow B" never traverse the same core router. Why am I paying for two copies of this data if you're going to push it through a single device? You just told me that the device is so fragile that you feel compelled to route around failures!

Don't Cluster the Firewalls

The same reason we don't let routing reconverge applies here. If there are two pricing firewalls, don't tell them about each other. Run them as standalone units. Put them in separate rooms, even. We can afford to lose half of a redundant feed. We cannot afford to lose both feeds, even for the few milliseconds required for the standby firewall to take over. Two clusters (four firewalls) would be okay, just keep the "A" and "B" feeds separate!

Don't team the server NICs

The flow-splitting logic applies all the way down to the servers. If they've got two NICs available for incoming pricing data, these NICs should be dedicated per-flow. Even if there are NICs-a-plenty, the teaming schemes are all bad news because, like flows, application components are also disposable. It's okay to lose one. Getting one back? That's sometimes worse. Keep reading...

Recovery Can Kill You

Most of these pricing systems include a mechanism for data receivers to request retransmission of lost data, but the recovery can be a problem. With few exceptions, the network applications in use on the trading floor don't do any sort of flow control. It's like they're trying to hurt you.

Imagine a university lecture where a sleeping student wakes up, asks the lecturer to repeat the last 30 minutes, and the lecturer complies. That's kind of how these systems work.

Except that the lecturer complies at wire speed, and the whole lecture hall full of students is compelled to continue taking notes. Why should every other receiver be penalized because one system screwed up? I've got trades to clear!

The following snapshot is from the Cisco CVD for trading systems. It shows how aggressive these systems can be. A nominal 5Mb/s trading application regularly hits wire speed (100Mb/s) in this case.

The graph shows a small network when things are working right. A big trading backend at a large financial services firm can easily push that green line into the multi-gigabit range. Make things interesting by breaking stuff and you'll easily overrun even your best 10Gb/s switch buffers (6716 cards have 90MB per port).

Slow Servers Are Good

Lots of networks run with clients deliberately connected at slower speeds than their server. Maybe you have 10/100 ports in the wiring closet and gigabit-attached servers. Pricing networks require exactly the opposite. The lecturer in my analogy isn't just a single lecturer. It's a team of lecturers. They all go into wire-speed mode when the sleeping student wakes up.

How will you deliver multiple simultaneous gigabit-ish multicast streams to your access ports? You can't. I've fixed more than one trading system by setting server interfaces down to 100Mb/s or even 10Mb/s. Fast clients, slow servers is where you want to be.

Slowing down the servers can turn N*1Gb/s worth of data into N*100Mb/s -- something we can actually handle.
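How long do those buffers actually last when you don't slow the servers down? A back-of-envelope sketch (the 90MB figure is from the 6716 note above; the in/out rates are assumptions for illustration):

```python
# Time until a port buffer overflows when arrivals outpace the drain rate.
def seconds_until_overflow(buffer_bytes, in_bps, out_bps):
    surplus_bps = in_bps - out_bps
    if surplus_bps <= 0:
        return float("inf")                  # drain keeps up; buffer never fills
    return buffer_bytes / (surplus_bps / 8)  # convert bits/s to bytes/s

# 90MB of buffer with 10Gb/s arriving and 1Gb/s draining: gone in 80ms.
seconds_until_overflow(90e6, 10e9, 1e9)      # 0.08 seconds
```

Cap the sources at 100Mb/s instead of 1Gb/s and the surplus shrinks by an order of magnitude -- which is exactly why "fast clients, slow servers" works.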

Bad Apple Syndrome

The sleeping student example is actually pretty common. It's amazing to see the impact that can arise from things like:

a clock update on a workstation

ripping a CD with iTunes

briefly closing the lid on a laptop

The trading floor is usually a population of Windows machines with users sitting behind them. Keeping these things from killing each other is a daunting task. One bad apple will truly spoil the bunch.

How Fast Is It?

System performance is usually measured in terms of stuff per interval. That's meaningless on the trading floor. The opening bell at NYSE is like turning on a fire hose. The only metric that matters is the answer to this question: Did you spill even one drop of water?

How close were you to the limit? Will you make it through tomorrow's trading day too?

I read on twitter that Ben Bernanke got a bad piece of fish for dinner. How confident are you now? Performance of these systems is binary. You either survived or you did not. There is no "system is running slow" in this world.

Routing Is Upside Down

While not unique to trading floors, we do lots of multicast here. Multicast is funny because it relies on routing traffic away from the source, rather than routing it toward the destination. Getting into and staying in this mindset can be a challenge. I started out with no idea how routing worked, so had no problem getting into the multicast mindset :-)

NACK not ACK

Almost every network protocol relies on data receivers ACKnowledging their receipt of data. But not here. Pricing systems only speak up when something goes missing.

QoS Isn't The Answer

QoS might seem like the answer to make sure that we get through the day smoothly, but it's not. In fact, it can be counterproductive.

QoS is about managed un-fairness... Choosing which packets to drop. But pricing systems are usually deployed on dedicated systems with dedicated switches. Every packet is critical, and there are probably more of them than we can handle. There's nothing we can drop.

Making matters worse, enabling QoS on many switching platforms reduces the buffers available to our critical pricing flows, because the buffers necessarily get carved so that they can be allocated to different kinds of traffic. It's counterintuitive, but 'no mls qos' is sometimes the right thing to do.

Load Balancing Ain't All It's Cracked Up To Be

By default, CEF doesn't load balance multicast flows. CEF load balancing of multicast can be enabled and enhanced, but doesn't happen out of the box.

We can get screwed on EtherChannel links too: Sometimes these quirky applications intermingle unicast data with the multicast stream. Perhaps a latecomer to the trading floor wants to start watching Cisco's stock price. Before he can begin, he needs all 100 cells associated with CSCO. This is sometimes called the "Initial Image." He ignores updates for CSCO until he's got that starting point loaded up.

CSCO has updated 9000 times today, so the server unicasts the initial image: "Here are all 100 cells for CSCO as of update #9000: blah blah blah...". Then the price changes, and the server multicasts update #9001 to all receivers.

If there's a load balanced path (either CEF or an aggregate link) between the server and client, then our new client could get update 9001 (multicast) before the initial image (unicast) shows up. The client will discard update 9001 because he's expecting a full record, not an update to a single cell.
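Here's a toy rendition of that race, following the CSCO example (transports, sequence numbers, and payloads are invented for the sketch):

```python
# Two messages leave the servers in order; load-balanced paths can
# deliver them to the client in the opposite order.
events_on_the_wire = [
    ("unicast",   9000, "initial image for CSCO"),
    ("multicast", 9001, "update: bid=18.26"),
]
# Reordered delivery: the multicast update wins the race.
events_at_client = sorted(events_on_the_wire, key=lambda e: e[0] == "unicast")

def client_state(events):
    have_image = False
    applied = []
    for transport, seq, payload in events:
        if "initial image" in payload:
            have_image = True
        elif have_image:
            applied.append(seq)
        # else: update discarded -- client is still waiting for a full record
    return applied

client_state(events_on_the_wire)   # in-order delivery: update 9001 applied
client_state(events_at_client)     # reordered delivery: update 9001 thrown away
```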

Post-mortem analysis of these kinds of incidents will boil down to the software folks saying:

We put the messages on the wire in the correct order. They were delivered by the network in the wrong order.

ARP Times Out

NACK-based applications sit quietly until there's a problem. So quietly that they might forget the hardware address associated with their gateway or with a neighbor.

No problem, right? ARP will figure it out... Eventually. Because these are generally UDP-based applications without flow control, the system doesn't fire off a single packet, then sit and wait like it might when talking TCP. No, these systems can suddenly kick off a whole bunch of UDP datagrams destined for a system they haven't talked to in hours.

The lower layers in the IP stack need to hold onto these packets until the ARP resolution process is complete. But the packets keep rolling down the stack! The outstanding ARP queue is only 1 packet deep in many implementations. The queue overflows and data is lost. It's not strictly a network problem, but don't worry. Your phone will ring.
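Here's the mechanism as a toy simulation (an idealized stack: the 1-deep pending queue matches the behavior described above, not any particular OS's implementation):

```python
# A burst of UDP datagrams hits the stack while ARP is unresolved.
# The pending queue holds only one packet; each new arrival evicts
# the previous one.
def send_burst(datagrams, pending_queue_depth=1):
    dropped, pending = [], []
    for d in datagrams:
        pending.append(d)
        if len(pending) > pending_queue_depth:
            dropped.append(pending.pop(0))   # oldest queued packet discarded
    # The ARP reply finally arrives; only what survived the queue goes out.
    return pending, dropped

delivered, dropped = send_burst(list(range(10)))
len(dropped)   # 9 of the 10 datagrams never made it onto the wire
```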

Losing Data Causes You to Lose Data

There's a nasty failure mode underlying the NACK-based scheme. Lost data will be retransmitted. If you couldn't handle the data flow the first time around, why expect to handle wire speed retransmission of that data on top of the data that's coming in the next instant?

If the data loss was caused by a Bad Apple receiver, then all his peers suffer the consequences. You may find yourself with many bad apples in a moment. One Bad Apple will spoil the bunch.

If the data loss was caused by an overloaded network component, then you're rewarded by compounding increases in packet rate. The exchanges don't stop trading, and the data sources have a large queue of data to re-send.

TCP applications slow down in the face of congestion. Pricing applications speed up.

Packet Decodes Aren't Available

Some of the wire formats you'll be dealing with are closed-source secrets. Others are published standards for which no Wireshark decodes are publicly available. Either way, you're pretty much on your own when it comes to analysis.

Updates

Responding to Will's question about data sources: The streams come from the various exchanges (NASDAQ, NYSE, FTSE, etc...). Because each of these exchanges uses its own data format, there are usually some layers of processing required to get them into a common format for application consumption. This processing can happen at a value-add data distributor (Reuters, Bloomberg, Activ), or it can be done in-house by the end user. Local processing has the advantage of lower latency because you don't have to have the data shipped from the exchange to a middleman before you see it.

Other streams come from application components within the company. There are usually some layers of processing (between 2 and 12) between a pricing update first hitting your equipment, and when that update is consumed by a trader. The processing can include format changes, addition of custom fields, delay engines (delayed data can be given away for free), vendor-switch systems (I don't trust data vendor "A", switch me to "B"), etc...

Most of those layers are going to be multicast, and they're going to be the really dangerous ones, because the sources can clobber you with LAN speeds, rather than WAN speeds.

As far as getting the data goes, you can move your servers into the exchange's facility for low-latency access (some exchanges actually provision the same length of fiber to each colocated customer, so that nobody can claim a latency disadvantage), you can provision your own point-to-point circuit for data access, you can buy a fat local loop from a financial network provider like BT/Radianz (probably MPLS on the back end so that one local loop can get you to all your pricing and clearing partners), or you can buy the data from a value-add aggregator like Reuters or Bloomberg.

Responding to Will's question about SSM: I've never seen an SSM pricing component. They may be out there, but they might not be a super good fit. Here's why: Everything in these setups is redundant, all the way down to software components. It's redundant in ways we're not used to seeing in enterprises. No load-balancer required here. The software components collaborate and share workload dynamically. If one ticker plant fails, his partner knows what update was successfully transmitted by the dead peer, and takes over from that point. Consuming systems don't know who the servers are, and don't care. A server could be replaced at any moment.

In fact, it's not just downstream pricing data that's multicast. Many of these systems use a model where the clients don't know who the data sources are. Instead of sending requests to a server, they multicast their requests for data, and the servers multicast the replies back.

Not knowing who your server is kind of runs counter to the SSM ideal. It could be done with a pool of servers, I've just never seen it.

The exchanges are particularly slow-moving when it comes to changing things. Modern exchange feeds, particularly ones like the "touch tone" example I cited, are literally ticker-tape punch signals wrapped up in IP multicast headers.

The old school scheme was to have a ticker tape machine hooked to a "line" from the exchange. Maybe you'd have two of them (A and B again). There would be a third one for retransmit. Ticker machine run out of paper? Call the exchange, and here's more-or-less what happens:

Somebody at the exchange cuts the section of paper containing the updates you missed out of their spool of ticker tape. Actual scissors are involved here.

Next, they grab a bit of header tape that says: "this is retransmit data for XYZ Bank".

They tape these two pieces of paper together, and feed them through a reader that's attached to the "retransmit line".

Every bank in New York will get the retransmits, but they'll know to ignore them because of the header.

XYZ Bank clips the retransmit data out of the retransmit ticker machine, and pastes it into place on the end where the machine ran out of paper.

These terms -- "tick," "line," "retransmit," etc... -- all still apply with modern IP based systems. I've read the developer guides for these systems (to write Wireshark decodes), and it's like a trip back in time. Some of these systems are still so closely coupled to the paper-punch system that you get chads all over the floor and paper cuts all over your hands just from reading the API guide :-)

There's a long tradition of using clever and humorous names for open source software projects. I've recently been introduced to a particularly striking example, and it's got me thinking about some of the funny language games played by open source software folks.

In the 1960s there was the Basic Combined Programming Language (BCPL), which inspired a pared-down descendant known as B. B gave way to a new language: C, which happened to be the next letter in BCPL. At this point, BCPL began to be referred to by the backronym "Before C Programming Language." Next, C was followed up not by P, but by C++, because the ++ operator is how you increment something in C. Then Microsoft gave us C sharp, which includes the musical notation roughly analogous to "increment by one", and kind of looks like two "++" operators. Har har.

In the 1980s, rms decided the world needed a truly free UNIX-like operating system. He named his project GNU, which of course stood for "GNU's Not Unix", leading to much recursive-acronym hilarity. GNU's Hurd kernel stands for "Hird of Unix Replacing Daemons", and Hird stands for "Hurd of Interfaces Representing Depth." Oh my.

The first email client I ever used was Elm (ELectronic Mail), which I later abandoned for Pine (backronymed: Pine Is Not ELM). Due to licensing restrictions, the University of Washington stopped development of Pine, and shifted their effort to an Apache Licensed version of Pine: Alpine. Plus, they're in the Northwest corner of the United States of America. Lots of evergreen trees up there from what I understand. Also mountains. Trees on mountains, even.

I briefly experimented with some Instant Messaging server software called TwoCan. I can't find their logo anymore, but it consisted of two cans and a string. Chat technology at its finest! Also, it reminds me a lot of Jeff Fry's avatar on the twitter.

The NoCat project is a wifi sharing scheme somewhat like a free/open version of what iPass offers. The project's logo makes it perfectly clear that there are no cats involved. They explain the name this way:

Albert Einstein, when asked to describe radio, replied: "You see, wire telegraph is a kind of a very, very long cat. You pull his tail in New York and his head is meowing in Los Angeles. Do you understand this? And radio operates exactly the same way: you send signals here, they receive them there. The only difference is that there is no cat."

The project that got me thinking about clever names and logos is the Linux Pacemaker component of the Linux-HA server clustering project. Pacemaker is an add-on to the heartbeat daemon. Get it? Their logo is a set of stylized rabbit ears, so they've got both the EKG/heartbeat thing, as well as the "set the pace for high performance" rabbit thing going on. Clever stuff.

The clustering world has also given us STONITH, which stands for "Shoot The Other Node In The Head." Shoot him in the head? Well, that should take care of any dual-active problems, all right! The implementation is just as dramatic as the name suggests: basically, it boils down to each node in the cluster being logged into the other node's power strip. Misbehave and I'll cut your power!

Split-brain / dual-active detection and remediation is something with which we're familiar in the networking department. I'm a little bit disappointed that we don't have anything as crazy / awesome as STONITH in our toolbox...

Monday, December 19, 2011

Denton Gentry wrote a great article in which he explained why jumbo frames are not the panacea that so many people expect them to be.

It was a timely article for me to find, because I happened to have been going around in circles with a few different user groups who were sure that jumbo frames are the solution to all of their problems, but are unwilling to do the administrative work required for implementation or the analytic work to see whether there's anything to be gained.

The gist of Denton's argument is that jumbo frames are just one way of reducing the amount of work required for a server to send a large volume of data. Modern NICs and drivers have brought us easier to support ways of accomplishing the same result. The new techniques work even when the system we're talking to doesn't support jumbos, and they even work across intermediate links with a small MTU.

There's a small facet to this discussion that's been niggling at me, but I've been hesitant to bring it up because I'm not sure how significant it is.

Why not just use Jumbos?
I'm always hesitant to enable jumbo frames for a customer because it tends to be a difficult configuration to support. Sure, typing in the configuration is easy, but that configuration takes us down a non-standard rabbit hole where too few people understand the rules.

Every customer I've worked with has made mistakes in this regard. It's a support nightmare that leads to lots of trouble tickets because somebody always forgets to enable jumbos when they deploy a new server/router/switch/coffeepot.

The Rules

All IP hosts sharing an IP subnet need to agree on the IP MTU in use on that subnet.

All L2 gear supporting that subnet must be able to handle the largest frame one of those hosts might generate.

Rule 1 means that if you're going to enable jumbo frames on a subnet, you need to configure all systems at the same time. All servers, desktops, appliances, routers on the segment need to agree. This point is not negotiable. PMTUD won't fix things if they don't agree. Nor will TCP's MSS negotiation mechanism. Just make them match.

Rule 2 means that all switches and bridges have to support at least the largest frame. Larger is okay, smaller is not. The maximum frame size value will not be the same as the IP MTU, because it needs to take into account the L2 header.
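The overhead math from Rule 2, worked out (standard Ethernet header and FCS sizes; the helper function is my own):

```python
# L2 frame size = IP MTU + Ethernet overhead. The overhead depends on
# whether the frame carries an 802.1Q tag.
ETH_HEADER = 14   # dst MAC (6) + src MAC (6) + ethertype (2)
FCS = 4           # frame check sequence
DOT1Q_TAG = 4     # only present on tagged frames

def required_frame_size(ip_mtu, tagged=False):
    return ip_mtu + ETH_HEADER + FCS + (DOT1Q_TAG if tagged else 0)

required_frame_size(9000)               # 9018 bytes on an access port
required_frame_size(9000, tagged=True)  # 9022 bytes across an 802.1Q trunk
```

Larger is okay, smaller is not -- so a switch that accepts 9216-byte frames comfortably covers a 9000-byte IP MTU either way.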

For extra amusement, different products (even within a single vendor's product lineup) don't agree about how the MTU configuration directives are supposed to be interpreted, making the rules tough to follow.

So, what's been niggling at me?
In a modern (Nexus) Cisco data center, we push servers towards using LACP instead of active/standby redundancy. There are various reasons for this preference relating to orphan ports, optimal switching path, the relative high cost of an East/West trip across the vPC peer-link, being confident that the "standby" NIC and switchport are configured correctly, etc... LACP to the server is good news for all these reasons.

But it's bad news for another reason. While aggregate links on a switch are free because they're implemented in hardware, aggregation at the server is another story. Generally speaking, servers instantiate a virtual NIC, and apply their IP configuration to it. The virtual NIC is a software wedge between the upper stack layers and the hardware. It's not free, and it is required to process every frame/PDU/write/whatever handed down from software to hardware, and vice versa.

So, when we turn on LACP on the server, we add per-PDU software processing that wasn't there before, re-kindling the notion that larger PDUs are better. The various TCP offload features can probably be retained, and the performance of aggregate links is generally good. YMMV, check with your server OS vendor.

I'm not sure that we're forcing the server folks to take a step backwards in terms of performance, but I'm afraid that we're supplying a foothold for the pro-jumbo argument which should have ended years ago.

Tuesday, December 6, 2011

Ivan has recently written a couple of posts that have inspired me to put on paper some thoughts about broadcast domain sizing.

We all intuitively know that a "too big" broadcast domain is a problem. But how big is too big, and what are the relevant metrics?

There was a time when servers did lots of irresponsible broadcasting, but in my experience, stuff that's installed in today's virtual data centers is much better behaved than the stuff of a decade ago.

...and just what is a "broadcast storm" anyway? Most descriptions I've read are describing something that can be much better categorized as a "bridging loop". If dozens or hundreds of servers are producing broadcast frames, I benevolently assume that it's because the servers expect us to deliver the frames. Either that, or the servers are broken, and should be fixed.

I have a background in supporting trading systems that regularly produce aggregate broadcast/multicast rates well in excess of 1Gb/s, and that background probably informs my thinking on this point. The use of broadcast and multicast suppression mechanisms seems heavy handed. What is this traffic exactly? Why is the server sending it? Why don't I want to deliver it? QoS is a much more responsible way to handle this problem if/when there really is one.

Whew, I'm well off track already!

The central point here is that I believe we talk about the wrong things when discussing the size and scope of our L2 networks. Here's why.

Subnet size is irrelevant.
I used to run a network that included a /16 server access LAN. The subnet was shockingly full, but didn't really need a full /16. A /18 would have worked, but /19 would have been too small. The L2 topology consisted of Catalyst 2900XL switches (this was a while ago). Was it "too big?"

No. It worked perfectly.

There were only about 100 nodes on this network, and no server virtualization. Each node had around 100 IP addresses configured on its single NIC.

The scheme here was that each server had the potential to run each of 100 different services. I used the third octet to identify the service, and the fourth octet to identify the server. So service 25 on server 27 could be found at 10.0.25.27 (for example).
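The scheme is simple enough to sketch in a few lines. This is just an illustration of the addressing convention described above (the `service_ip` helper is mine, not anything that actually ran on that network), plus a sanity check of the /18-fits-but-/19-doesn't claim:

```python
# Hypothetical sketch of the service/server addressing scheme described
# above: third octet = service number, fourth octet = server number.
import ipaddress

def service_ip(service: int, server: int) -> str:
    """Map (service, server) to an address in 10.0.0.0/16."""
    return f"10.0.{service}.{server}"

# Service 25 on server 27, as in the example above:
assert service_ip(25, 27) == "10.0.25.27"

# Sanity-check the subnet sizing claim: ~100 servers x ~100 services
# needs roughly 10,000 addresses. A /18 fits; a /19 does not.
needed = 100 * 100
print(ipaddress.ip_network("10.0.0.0/18").num_addresses - 2 >= needed)  # True
print(ipaddress.ip_network("10.0.0.0/19").num_addresses - 2 >= needed)  # False
```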

Broadcast frames in this environment were limited to the occasional ARP query.

Host count is irrelevant.
I expect to make a less convincing case here, but my point is that when talking about a virtualized data center, I don't believe we should care about how many (virtual) hosts (IP stacks?) share a VLAN.

The previous example used lots of IP addresses, but only had a small number of hosts present. Now I want to flip the proportions around. Let's imagine instead that we have 10,000 virtual machines in a VLAN, but are somehow able to virtualize them onto a single pair of impossibly large servers.

Is there any problem now? Our STP topology is ridiculously small at just two ports. If the ESX hosts and vSwitches are able to handle the traffic pushed into them, why would we be inclined to say that this broadcast domain is "too big?"

Broadcast domain sizing overlooks the impact on shared resources.
Almost all discussions of broadcast domain sizing overlook the fact that we need to consider what one VLAN will do to another when they share common resources.

Obvious points of contention are 802.1Q trunks, which share bandwidth between all VLANs. Less obvious points of contention are shared ASICs and buffers within a switch. If you've ever noticed how a 100Mb/s server can hurt seven of its neighbors on a shared-ASIC switching architecture, you know what I'm getting at.

Splitting clients into different VLANs doesn't help if the switching capacity isn't there to back it up, but discussions of subnet sizing usually overlook this detail.

How are the edge ports configured?
Let's imagine that we've got a VMware vSwitch with 8 pNICs connected to 8 switch ports.

If those 8 switch ports are configured with a static aggregation (on the switch end) and IP hash balancing (on the VMware end), then we've got a single 8Gb/s port from spanning tree's perspective. If the environment has background "noise" consisting of 100Mb/s of garbage broadcast traffic, then the ESX host gets 100Mb/s of garbage representing 1.25% of its incoming bandwidth capacity. Not great, but not the end of the world.

If those 8 ports are configured with the default "host pinning" mechanism, then the switches have eight 1Gb/s ports from spanning tree's perspective. The 100Mb/s of garbage is multiplied eight times. The server gets 800Mb/s of garbage representing 10% of its incoming bandwidth capacity. Yuck.
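The arithmetic behind those two scenarios is worth making explicit. A quick sketch (the numbers are the ones used in the paragraphs above):

```python
# The same 100Mb/s of broadcast "noise" costs very different fractions
# of host bandwidth depending on how the 8 uplinks look to spanning tree.
GARBAGE_MBPS = 100
NICS = 8
NIC_MBPS = 1000

# Static aggregation + IP hash: one logical 8Gb/s port, so the host
# receives a single copy of the noise.
lacp_fraction = GARBAGE_MBPS / (NICS * NIC_MBPS)

# Host pinning: eight independent 1Gb/s ports from STP's perspective,
# so the noise is flooded down each one -- eight copies.
pinning_fraction = (GARBAGE_MBPS * NICS) / (NICS * NIC_MBPS)

print(f"{lacp_fraction:.2%}")     # 1.25%
print(f"{pinning_fraction:.0%}")  # 10%
```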

This distinction is important, and completely overlooked by most discussions of broadcast domain size.

So, what should we be looking at?
Sheesh, you want answers? All I said was "we're doing it wrong", not "I have answers!"

First, I think we should be looking at the number of spanning tree edge ports. This metric represents both the physical size of the broadcast domain and the impact on our ESX hosts.

Second, I think we should be talking about density of VLANs on trunks. Where possible, it might be worth splitting up an ESX domain so that only certain VLANs are available on certain servers. If the environment consists of LOTS of very sparsely populated VLANs, then an automagic VLAN pruning scheme might be worth deploying.

Third, I think we need to look at what the servers are doing. Lots of broadcast traffic? Maybe we should have disabled NetBEUI? Maybe we shouldn't mingle the trading floor support system with the general population?

I don't have a clear strategy about how to handle discussions about broadcast domain sizing, but I'm absolutely convinced that discussions of host count and subnet size miss the point. There's risk here, but slicing the virtual server population into lots of little VLANs obviously doesn't fix anything that a network engineer cares about.

Do you have an idea about how to better measure these problems so that we can have more useful discussions? Please share!

Thursday, November 24, 2011

My job puts me in contact with lots of first-rate network gear. I like much of that shiny stuff, but there are some relics for which I continue to have a soft spot.

One of those relics remains relevant today, but few seem to know about it.

It's the Xyplex MaxServer line of terminal servers. When deployed for out-of-band console access, these things are capable of doing essentially the same job as a Cisco router with asynchronous serial ports, but at a tiny fraction of the price, and without the annoying octopus cable that's required for most Cisco async ports.

MX-1640 on top, MX-1620 below

Price

The main thing that's so great about these guys is the price. I bought that 40-port unit for US $26 including delivery to my door. That's only $0.65 per serial port.

They're cheap enough that I've got a flock of these things at home. Everything with a serial port in my house is wired up. I can take them to customer sites for labbing purposes and don't care if I get them back.

For comparison, the cheapest NM-16A (16 port asynchronous serial module) on ebay right now is $184. Figure about $20 each for a couple of octal cables, plus $30 for a 2600 series router to run it all, and the cheapest Cisco alternative runs over $250, nearly $16 per serial port.
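The per-port figures quoted above work out like this (prices are the ballpark eBay numbers from the paragraph, not current quotes):

```python
# Per-port cost comparison: Xyplex MX-1640 vs. a minimal Cisco
# 2600 + NM-16A reverse-telnet setup, using the prices quoted above.
xyplex_total = 26                  # 40-port MX-1640, delivered
xyplex_ports = 40

cisco_total = 184 + 2 * 20 + 30    # NM-16A + two octal cables + 2600 chassis
cisco_ports = 16

print(round(xyplex_total / xyplex_ports, 2))  # 0.65
print(round(cisco_total / cisco_ports, 2))    # 15.88
```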

I know that people are buying Cisco 2511s and NM-16As for their home labs, and I'm guessing that ignorance of cheaper options is the reason the Cisco stuff bids to such ridiculous levels on ebay.

In an effort to spread the love for these relics, I'm sharing what I know about them.

Identification

The same basic product is available as both Xyplex MaxServer and MRV MaxServer. I've seen an MRV unit once, but never configured one. It looks identical, but may behave differently.

There is an older model MaxServer 1600 that has 8 or 16 ports, a rounded face with integrated rack ears, and no onboard 10Base-T transceiver. You'll need an AUI transceiver for 10Base-T to use one of these older units. I haven't used one since 1999 or 2000, so I don't remember if they're completely the same, and can't recommend them.

The 1620 and 1640 units have a label on the bottom indicating that their "Model" is MX-1620-xxx or MX-1640-xxx. I'm not sure what the xxx suffix means, but I've got units with suffixes 002, 004 and 014, and I can't tell any difference between them.

Flash Cards

Sometimes these boxes come with a hard-to-find PCMCIA card. If you have the card, then these guys can run totally independently. Without the card, a TFTP server is required for boot and configuration. The TFTP server can be anywhere on the network. It doesn't need L2 adjacency to the MaxServer. If one unit has the card, it can serve other units without cards.

The card has to be some ancient small capacity "linear" flash card, and not just any linear flash card will do. I've tried lots of cards from different vendors, and only found a single 3rd party card that's recognized by the Xyplex (pictured below). They run fine without the cards, so don't sweat looking for the card.

Rack Mounting
OEM rack ears are tough to find.

But the screws and ears from a Cisco 2600 work fine with the addition of a couple holes to the MaxServer chassis:

Cisco 2600, MaxServer with Cisco ear installed, unmolested MaxServer

Use a 1/16" bit and chuck it deep in the drill so that it won't reach more than about 1/4" into the chassis when it punches through.

Software
You'll need a file called xpcsrv20.sys. The md5sum of the file I use is:

b7252070000988ce47c62f1d46acbe11

Drop this file on the root of your TFTP server. While you're there, if your TFTP server doesn't allow remote file creation (most don't), create a couple of empty files for your server to use for its configuration. The format is:

x<last-6-of-MAC>.prm and x<last-6-of-MAC>.bck

So, I've got files named x0eb6b1.prm and x0eb6b1.bck that are used by my MaxServer with hardware address 0800.870e.b6b1. The address is printed on a label near the Ethernet port.
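The naming rule is mechanical enough to script. Here's a small sketch (the `config_filenames` helper is mine, purely illustrative) that derives the two filenames from a MAC address in Cisco-style dotted notation:

```python
# Derive the Xyplex TFTP config filenames from a MAC address:
# "x" + last six hex digits of the MAC, with .prm and .bck extensions.
def config_filenames(mac: str) -> tuple:
    digits = mac.replace(".", "").replace(":", "").replace("-", "").lower()
    suffix = digits[-6:]
    return (f"x{suffix}.prm", f"x{suffix}.bck")

# Matches the example above for hardware address 0800.870e.b6b1:
print(config_filenames("0800.870e.b6b1"))  # ('x0eb6b1.prm', 'x0eb6b1.bck')
```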

Serial Pinout
Connecting these guys to a Cisco console requires a rollover cable. I like to make my own using 2-pair cable for reasons I've previously explained.

Configuration
Here's how I go about configuring one of these guys for:

Static IP addressing

TFTP download of the system software

TFTP download and storage of configuration elements

Telnet access to the serial ports

Telnet access for admin functions

The first step is to reset the box to factory defaults.

Connect a terminal to the first serial port using the same gear you'd use on a Cisco router's console port. 9600,8,N,1

With the system powered on, use a paperclip to manipulate the button behind the hole near the console LED on the front panel.

Press the button

Release the button

Press the button

The numbered LEDs should scan back and forth, then stop with two LEDs lit.

Release the button.

Hit Enter on the terminal a couple of times. You should see something like:

Type access and hit Enter. The letters won't echo on screen. You'll get a configuration menu. The text below shows the steps I follow for the initial configuration purge. Bold text is stuff that I've typed.

Still at the Modify Unit Configuration Menu, configure the first initialization record. This information is what's used by the pre-boot environment. We have to configure the IP information twice: the first instance represents what's used by the bootloader (or whatever), and the second instance is the IP information used by the running system. I'm putting the terminal server at 192.168.15.11/24, and its TFTP server is at 10.122.218.33

At this point, the system should grab xpcsrv20.sys from the TFTP server. Give this a minute to complete. Next you'll get a prompt like this:

Welcome to the Xyplex Terminal Server.
Enter username>

With the way I use these boxes, it doesn't matter what username you type (at this prompt and at later prompts). Just type something in. I suppose it would matter if you configured RADIUS authentication, but I've never done that. The default password is system.

Enter username> foo
Xyplex -901- Default Parameters being used

Xyplex> set priv
Password> system (doesn't echo)

Welcome to the Xyplex Terminal Server.
For information on software upgrades contact your local representative,
or call Xyplex directly at

The prompt with >> indicates that we're in privileged user mode. Configuration parameters are entered with either set or define commands. set commands take effect immediately but are not persistent across reboots. define commands don't take effect until after reboot. Issuing set server change enabled causes define commands to take effect immediately, in addition to persisting across reboots, so I start with that one.

Once things are configured you access these ports in pretty much the same way that you would with a Cisco box configured for "reverse telnet". Each serial port is listening on a TCP port: (portnum * 100) + 2000

So, port 1 is listening on TCP/2100, port 2 on TCP/2200, etc...
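That mapping is trivial to express in code. A sketch (the management IP in the comment is the 192.168.15.11 example used earlier in this post):

```python
# Reverse-telnet style port mapping on the Xyplex:
# TCP port = (serial port number * 100) + 2000.
def tcp_port(serial_port: int) -> int:
    return serial_port * 100 + 2000

print(tcp_port(1))   # 2100
print(tcp_port(2))   # 2200
print(tcp_port(40))  # 6000 -- last port on an MX-1640

# So reaching the console on serial port 1 looks like:
#   telnet 192.168.15.11 2100
```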

Useful commands

The only reason I log into the Xyplex directly is to kill off a stuck telnet session. Log in and access a privileged session like this:

Monday, November 21, 2011

I noticed this morning that a new X2 module appeared in the Cisco docs last week:

Cisco X2-10GB-T

The Cisco 10GBASE-T Module supports link lengths of up to 100m on CAT6A or CAT7 copper cable.

Unfortunately, while the compatibility guide lists this new transceiver, it doesn't indicate that it's supported by any current switches.

As far as I'm aware, this module is the 4th 10GBASE-T device in Cisco's lineup, having been preceded by:

6716/6816-10T card for the 6500 (these are the same card with different DFCs)

4908-10G-RJ45 half-card for the 4900M

2232TM Fabric Extender

This new module isn't showing up in the pricing tool yet.

I'm not too excited about 10GBASE-T connections (and Greg Ferro hates them!), but I've seen a couple of use cases where having just a single port or two on a stackable switch would have been helpful.

It'll be interesting to see how these get deployed. A 10GBASE-T SFP+ module would be much more helpful these days, but I suspect that power, packaging, or both are standing in the way of a 10-gigabit version of the GLC-T.

Update 11/30/2011 -- I missed a 10GBASE-T offering (or maybe it just appeared). There's also the C3KX-NM-10GT, a dual-port 10GBASE-T network module for the 3560-X and 3750-X switches.

Thursday, November 17, 2011

My first couple of blog posts were about building 1Gb/s top of rack switching using the Nexus 5000 product line. This is a new series comparing some options for 10Gb/s top of rack switching with Nexus 5500 switches.

These posts assume a brand-new deployment and attempt to cover all of the bits required to implement the solution. I'm including everything from the core layer's SFP+ downlink modules through the SFP+ ports for server access. I'm assuming that the core consists of SFP+ interfaces capable of supporting twinax cables, and that each server will provide its own twinax cable of the appropriate length.

Each scenario supports racks housing sixteen 2U servers. Each server has a pair of SFP+ based 10Gb/s NICs and a single 100Mb/s iLO interface.

Option 4 - Top of Rack Nexus 2232 + Catalyst 2960

This is a 3-rack pod consisting of two Nexus 5548 switches in a central location near the core, and Nexus 2232 and Catalyst 2960 deployed at the top of rack.

Option 4

Each Nexus 2232 uplinks to a single Nexus 5548, and the Catalyst 2960s are uplinked via vPC to the fabric extenders. Connecting the 2960 to the FEX requires some special consideration. In this case, there are two reasonably safe ways to do it:

Configure BPDUfilter on the uplink EtherChannel and BPDUguard on all the field ports. Put each 2960 into its own VLAN. Any cable mistakenly linked between racks should be killed off by BPDUguard immediately, and the 2960s can't form a loop because they're in different VLANs anyway.

Configure the lanbase-routing feature on the 2960 and run each 2960 uplink as a /31 routed link. BPDUfilter will still be required on the 2960 (routed interfaces aren't supported by lanbase-routing, so you have to use SVI+VLAN), but a loop cannot form because each 2960 (uplink and downlink) is in a different VLAN. This might not be possible on this particular 2960 model; upgrading the 2960 to a WS-C3560V2-24TS-S (about $3K) would buy the limited L3 feature set.

Even if a loop does form, the 5K and 7K should be able to move the 2960's 6.5Mpps maximum capacity with no problem :-) If we're not comfortable with all of this, the 2960s can uplink to the 5548s for an additional $3600 (list) and twelve strands of fiber.

48 servers have 960Gb/s of access bandwidth with 2:1 oversubscription inside the pod. The pod's uplink is oversubscribed 6:1, the same as Options 1 and 3.
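Those oversubscription figures check out on the back of an envelope. A sketch, assuming six 2232s in the pod (two per rack, each half-full with one NIC from each of the rack's 16 servers) with eight 10Gb/s uplinks apiece; the exact FEX count is my inference from the topology described:

```python
# Back-of-the-envelope check of the pod's oversubscription figures.
servers = 48
access_gbps = servers * 2 * 10        # dual 10Gb/s NICs per server
fex_uplink_gbps = 6 * 8 * 10          # six 2232s (assumed), 8 uplinks each
pod_uplink_gbps = access_gbps / 6     # 6:1 pod oversubscription from the text

print(access_gbps)                    # 960
print(access_gbps / fex_uplink_gbps)  # 2.0  (2:1 inside the pod)
print(pod_uplink_gbps)                # 160.0 Gb/s of uplink toward the core
```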

Because the Nexus 5548 are installed in a central location (not in the server row), management connections (Ethernet and serial) do not require any special consideration. Only multimode fiber needs to be installed into the server row.

The 2960 console connections are less critical. My rule on this is: If the clients I'm supporting can't be bothered to provision more than a single NIC, then they can't be all that important. Redundancy and supportability considerations at the network layer may be compromised.

The advantages of this configuration include:

Plenty of capacity for adding servers, because each 10Gb/s FEX is only half-full (oversubscription would obviously increase).

Use of inexpensive twinax connections for 7K<-->5K links. There are more boxes here, but the overall price is lower because of this change.

Centralized switch management - serial and Ethernet management links are all in one place.

This model translates directly to 10GBASE-T switching. When servers begin shipping with onboard 10GBASE-T interfaces, we switch from Nexus 2232PP to 2232TM, and the architecture continues to work. This isn't possible with top-of-rack Nexus 5500s right now.

Basically, it's the same topology as option 3, but I've swapped out the 2224s in favor of 2960s to save a few bucks. Spending $25,000 for 100Mb/s iLO ports drove me a little crazy.