
Corrupted frames are the devil’s spawn. A few noisy links causing frame corruption can quickly degrade network performance, and troubleshooting them is getting harder. These integrity errors generally occur when signal noise causes a binary ‘1’ to be mistaken for a binary ‘0’ or vice versa. This post takes a look at integrity errors and the impact of corrupted frames in a cut-through switched network. Throughout this post I’ll use the term ‘CRC errors’ to refer to frame integrity errors which were detected by CRC comparison.

FCS and the CRC

The Frame Check Sequence (FCS) is a 4-byte (32-bit) trailer added to the end of every Ethernet frame. The originator of the frame calculates a cyclic redundancy check (CRC) code over the layer-2 frame it is sending, and sends this CRC code as the frame’s FCS trailer. The receiver recalculates the CRC over the received frame and compares it with the received FCS. If there’s a mismatch, the receiver will drop the received frame and increment the ‘CRC Errors’ counter on the receive interface.
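To make the mechanics concrete, here’s a minimal Python sketch of the sender and receiver sides. It relies on the fact that zlib’s CRC-32 uses the same polynomial as the Ethernet FCS, with the result placed on the wire least-significant byte first. The frame contents and the helper names (append_fcs, fcs_is_valid, flip_bit) are my own illustrations, not taken from any switch implementation.

```python
import struct
import zlib


def append_fcs(frame_without_fcs: bytes) -> bytes:
    """Sender side: return the frame with a 4-byte FCS trailer appended."""
    crc = zlib.crc32(frame_without_fcs) & 0xFFFFFFFF
    # The FCS goes on the wire least-significant byte first.
    return frame_without_fcs + struct.pack("<I", crc)


def fcs_is_valid(frame_with_fcs: bytes) -> bool:
    """Receiver side: recompute the CRC and compare it with the trailer."""
    body, trailer = frame_with_fcs[:-4], frame_with_fcs[-4:]
    (received_crc,) = struct.unpack("<I", trailer)
    return (zlib.crc32(body) & 0xFFFFFFFF) == received_crc


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate line noise by inverting a single bit."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


# Hypothetical frame: dst MAC, src MAC, EtherType (IPv4), padded payload.
header = bytes.fromhex("ffffffffffff" "020000000001" "0800")
frame = header + b"hello".ljust(46, b"\x00")

sent = append_fcs(frame)
noisy = flip_bit(sent, 100)  # one bit mistaken on the link

print(fcs_is_valid(sent))   # clean frame passes the check
print(fcs_is_valid(noisy))  # corrupted frame fails and would be dropped
```

A real NIC or switch ASIC does this in hardware as the bits arrive; the sketch only shows the comparison that ends up driving the ‘CRC Errors’ counter.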

Disambiguation

There are other error-checking mechanisms in the TCP/IP stack which often get confused with each other. The IP header (just the header, not the payload!) and the full TCP segment both use a 16-bit one’s-complement checksum to detect errors. Although checksums and CRCs fulfill a similar function, they are implemented differently and the terms are not interchangeable.
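Here’s a small Python sketch of why the two aren’t interchangeable. It implements the 16-bit one’s-complement checksum in the style described in RFC 1071 and compares it with zlib’s CRC-32 on a made-up 6-byte message whose first two 16-bit words have been swapped. The sample bytes are arbitrary and only there for illustration.

```python
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


original = b"\x12\x34\xab\xcd\x00\x01"
# Same 16-bit words, first two swapped: a reordering error.
reordered = b"\xab\xcd\x12\x34\x00\x01"

# Addition is order-independent, so the checksum cannot see the swap.
same_checksum = internet_checksum(original) == internet_checksum(reordered)
# The CRC is position-sensitive, so the two messages produce different codes.
same_crc = zlib.crc32(original) == zlib.crc32(reordered)

print(same_checksum, same_crc)
```

The checksum is just a sum, so it is cheap but blind to reordered words; the CRC is a polynomial remainder and catches this class of error.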

The 32-bit CRC used in the layer-2 FCS is a strong error detection mechanism. However, it is worth calling out that no mechanism is foolproof, and that errors can also creep in within the switch itself. Check out router freak’s excellent post on error detection.

For the rest of this post I’ll refer exclusively to the layer-2 frame check sequence and the cyclic redundancy check used to implement it.

How CRC works in store-and-forward mode

In a traditional network which uses store-and-forward switching, the entire frame will be read into the switch’s buffer before a switching decision is made. In store-and-forward mode the switch can wait for the CRC code in the FCS trailer, and check it against the CRC calculated from the received frame. The switch will either discard the frame (incrementing its Rx CRC-error counter for that interface) or forward the frame out an egress interface.

Troubleshooting store-and-forward CRC errors

Thus, troubleshooting CRC errors in a store-and-forward world allows you to make one very important assumption: “Received frames with detected corruption do not get propagated.”

When a corrupt frame is detected, you can deduce that the corruption was introduced either within the sender switch or on the link between the sender and receiver interfaces. So if you observe CRC-error counters incrementing on a particular link, you immediately know where the problem lies. You don’t yet know ‘what’ the cause is, but you know approximately ‘where’ the problem is, i.e. at either end of a single physical link. Troubleshooting steps would look loosely like:

Clear counters and monitor

Look for other interfaces with CRCs (multiple ports might indicate a board/fabric problem).
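The ‘clear counters and monitor’ step boils down to comparing two polls of the same counters. Here’s a rough Python sketch of that comparison; the interface names and the two snapshot dictionaries are invented for illustration, and in practice the values would come from whatever you use to poll the switch.

```python
def crc_deltas(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Return interfaces whose Rx CRC-error counter grew between two polls."""
    deltas = {}
    for ifname, new_count in after.items():
        growth = new_count - before.get(ifname, 0)
        if growth > 0:
            deltas[ifname] = growth
    return deltas


# Hypothetical snapshots of Rx CRC-error counters, taken a few minutes apart.
poll_1 = {"Eth1/1": 120, "Eth1/2": 0, "Eth1/3": 7}
poll_2 = {"Eth1/1": 480, "Eth1/2": 0, "Eth1/3": 7}

still_erroring = crc_deltas(poll_1, poll_2)
print(still_erroring)
```

Eth1/3 has a non-zero counter but it isn’t moving, so it drops out of the result; only the interface that is still taking errors is reported. The sketch ignores counter resets between polls, which would show up as negative growth.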

Cut-through and the corruption propagation pandemic

The switch may not be able to prevent the forwarding of corrupt frames when it operates in cut-through switching mode. Remember that cut-through switches will begin forwarding the frame out the egress interface before the full frame is received. This reduces switching latency but introduces a thorny problem if the frame was corrupted. By the time the CRC is received the frame is already on the wire. The frame transmission has to be completed but it needs an FCS trailer.

What FCS value should the switch use?

Remember that the switch still has to append an FCS trailer to the frame it is transmitting. If it calculates a fresh CRC value for the known-corrupt frame, the frame corruption would be masked and go undetected until the frame arrives at its destination.

The compromise here is to ‘stomp’ the outbound CRC. Stomping the CRC ensures that the next-hop receiver will correctly identify this frame as having a CRC error. Being honest, I’m not sure how this stomping is actually implemented. You could set all 1s or you could re-use the received CRC as long as you knew the L2 frame hadn’t changed.
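Since I don’t know how real hardware does it, treat the following Python sketch as one hypothetical way to stomp, not as the procedure any vendor actually uses. It models a single cut-through hop: by the time the received FCS can be checked the frame body has already gone out, so the only decision left is which 4-byte trailer to append. My assumed stomp rule (STOMP_MASK) inverts every bit of the correct CRC, which guarantees the next hop sees a mismatch.

```python
import struct
import zlib

STOMP_MASK = 0xFFFFFFFF  # hypothetical choice: invert every bit of the CRC


def crc32(data: bytes) -> int:
    return zlib.crc32(data) & 0xFFFFFFFF


def fcs_is_valid(frame: bytes) -> bool:
    (fcs,) = struct.unpack("<I", frame[-4:])
    return crc32(frame[:-4]) == fcs


def cut_through_forward(received: bytes) -> tuple[bytes, bool]:
    """Model one cut-through hop; returns (frame_sent_downstream, stomped)."""
    body = received[:-4]  # treated as already transmitted
    (received_fcs,) = struct.unpack("<I", received[-4:])
    computed = crc32(body)
    if computed == received_fcs:
        return body + struct.pack("<I", computed), False
    # Corrupt on arrival: append a deliberately wrong FCS so the next hop
    # also sees a CRC error instead of a freshly "valid" frame.
    return body + struct.pack("<I", computed ^ STOMP_MASK), True


body = bytes(60)  # hypothetical 60-byte frame body of zeros
good = body + struct.pack("<I", crc32(body))
corrupted = bytes([good[0] ^ 0x01]) + good[1:]  # one bit flipped in flight

forwarded, stomped = cut_through_forward(corrupted)
next_hop_frame, stomped_again = cut_through_forward(forwarded)

print(stomped, fcs_is_valid(forwarded), stomped_again)
```

Running the forwarded frame through a second hop shows the propagation problem in miniature: the second switch also sees a bad FCS and stomps it again, even though its own link is clean.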

Have a look at the Cisco Nexus 5000 stomp procedure. If the sending device increments the Stomp and Tx CRC error counters you know that the device knowingly propagated CRC-errored frames. That’s nice, but if you’ve got a large network with a single bad top-of-rack cable then you’d see Tx and Rx CRCs all across your cut-through switching domain. The behavior on the next-hop cut-through switch would be the same: increment the Rx/Tx/Stomp counters, but still propagate the frame. CRC errors, spreading like the plague!

Troubleshooting cut-through CRC errors

Let’s get this straight. The dodgy links causing the CRCs will still cause the same level of pain and upset to your customers, and application owners will still observe the impact as IP packet loss. Corrupt frames will be propagated farther than normal within your network, but hopefully there aren’t enough errored frames for that to be a bandwidth concern.

No, the problem here is identifying the source of the CRCs. Your monitoring system will now detect CRCs at multiple points in the network for a single noisy-link event. The more truly errored links you have, the harder it will be to trace each one back through the network.
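One way to cut through the noise, assuming your platform exposes a per-port stomp counter and you can pair up both ends of each link, is to look for links where the receiver counted more CRC errors than the sender admits to having stomped. The Python sketch below is a heuristic only; the LinkCounters fields, the port names and the numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkCounters:
    sender: str         # upstream port, e.g. "tor1:Eth1/49"
    receiver: str       # downstream port at the other end of the link
    tx_stomped: int     # stomped frames the sender reports transmitting
    rx_crc_errors: int  # CRC errors the receiver reports on this link


def suspect_links(links: list[LinkCounters]) -> list[tuple[str, str, int]]:
    """Links where the receiver saw more CRC errors than the sender stomped."""
    suspects = []
    for link in links:
        unexplained = link.rx_crc_errors - link.tx_stomped
        if unexplained > 0:
            suspects.append((link.sender, link.receiver, unexplained))
    # Worst offenders first.
    return sorted(suspects, key=lambda s: s[2], reverse=True)


# Hypothetical counter deltas for one polling interval.
links = [
    LinkCounters("server7:eth0", "tor1:Eth1/7", tx_stomped=0, rx_crc_errors=250),
    LinkCounters("tor1:Eth1/49", "spine1:Eth2/1", tx_stomped=250, rx_crc_errors=250),
    LinkCounters("spine1:Eth2/2", "tor2:Eth1/49", tx_stomped=250, rx_crc_errors=250),
]

suspects = suspect_links(links)
print(suspects)
```

In this made-up example three links show Rx CRCs, but two of them are fully explained by stomped frames sent from upstream, leaving the server-facing link as the only suspect. Counters are rarely sampled at exactly the same instant, so small differences shouldn’t be over-interpreted.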

Actions and summary

Be aware that your CRC troubleshooting approach needs to change if you enable cut-through switching.

Monitor every port. It’s very likely that the problems will originate at the edge of your network. If you see Rx CRC at the edge of your network you’re back to single-link troubleshooting. If you have trouble monitoring all your links, check out StatSeeker which is very well suited to this job. There are some good engineer reviews by the Lone SysAdmin, LameJournal and the NetworkingNerd.

Treat CRC errors seriously and act fast. CRCs really hurt your customers, so you should already be reacting fast. However, you need to know that CRC fault location becomes much harder in a cut-through environment when there are multiple errored links. So detect and correct noisy links early, or triage and troubleshooting will get harder still.

Other war stories

Here’s an interesting cut-through war story where the EtherType was being mangled and thus the dot1q header was not interpreted. This led to unicast flooding of corrupt frames on the default VLAN.