Troubleshooting the Switch Fabric Module

Table of Contents:

Introduction

Determining the Last Power-Down Reason

Troubleshooting Symptoms

Information to Collect if You Open a TAC Case

Introduction

This document is primarily for troubleshooting the Switch Fabric Module (SFM) on an E-Series system, but it can also be applied to C-Series SFMs. In the E-Series, the SFM is a discrete component, called a field replaceable unit (FRU). In the C-Series, the switch fabric is integrated into the RPM. Nevertheless, the FTOS commands for managing the SFM, including all those described in this document except where noted, are also useful on the C-Series.

In rare cases, an SFM fails to initialize at bootup or after an upgrade, or it powers down unexpectedly during operation. This document addresses those cases.

Determining the Last Power-Down Reason

remote-power-off - Reported most often, because the SFM is powered off and on whenever the system reboots: once before the reboot and again at system initialization. A "remote-power-off" reason is also reported when the reset sfm slot number command is issued, because this command power-cycles the SFM.

Note: This command is only available in FTOS 6.5.4.0 and later, and only on the E-Series.

card-removed - If you remove and then reinsert an SFM, the show trace output reports card-removed as the last power-cycle reason. This status is also reported when the software cannot read certain information over an internal bus and interprets this state as the SFM being removed.

spurious reset

In addition, if you remotely reset the standby card from the CLI, the trace will display a reason of "remote reset".
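
The reasons described in this document can be summarized as a small lookup. The Python sketch below is illustrative only: the function name explain_power_down_reason and the normalization of reason strings are assumptions, and the exact text printed by show trace may differ between FTOS releases.

```python
# Illustrative lookup of the power-down reasons described in this document.
# The exact strings printed by "show trace" are an assumption; adjust as needed.
REASONS = {
    "remote-power-off": "SFM was power-cycled by software (system reboot or the reset sfm command).",
    "card-removed": "SFM was removed and reinserted.",
    "spurious reset": "Listed as a possible reason; this document gives no further detail.",
    "remote reset": "Standby card was reset remotely from the CLI.",
    "over-temperature": "Hardware powered the SFM off after an over-temperature condition.",
}


def _normalize(reason: str) -> str:
    """Lower-case and treat hyphens and spaces alike, e.g. 'remote power-off'."""
    return " ".join(reason.lower().replace("-", " ").split())


# Hypothetical helper table keyed by the normalized reason text.
_NORMALIZED = {_normalize(key): text for key, text in REASONS.items()}


def explain_power_down_reason(reason: str) -> str:
    """Return a short explanation for a reported reason, or a fallback."""
    return _NORMALIZED.get(
        _normalize(reason),
        "Unknown reason; collect show trace output for TAC.",
    )
```

For example, explain_power_down_reason("remote power-off") and explain_power_down_reason("remote-power-off") return the same explanation, because hyphens and spaces are treated alike.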

Troubleshooting Symptoms

The FTOS Chassis Manager (CHMGR) process monitors the health and status of the SFM. When the process detects a problem with the SFM, RPM0 reports a minor alarm and resets the card in an attempt to restore the SFM. The TSM process reports that an SFM has been found, and the minor alarm condition is cleared.

When the RPM reports "No working standby SFM", the switch is running without the standby SFM. One possible reason is that an SFM in a particular slot is not yet online after a reset. Once this SFM comes online, the minor alarm is cleared, the chassis manager detects the new SFM and, depending on the chassis and the number of SFMs, the "Found X SFMs" message is displayed.

In general, to troubleshoot a problem with the SFM, start by capturing the following output:

If an SFM flaps or cycles through the minor alarm condition, the system may not be getting sufficient power. Under this condition, the system brings down the SFM first. Each SFM is configured with a voltage threshold, and, based on that value, the corresponding SFM will go down first. This SFM flapping continues until the voltage to the system stabilizes. To determine whether there is sufficient power, physically check whether any Valere power rectifiers are experiencing a brick failure. See also the separate document, Troubleshooting Low Power Conditions.

The following sections explain how to troubleshoot specific errors on the SFM.

General Access Errors

There are two types of SFM general access errors:

"m" - MDIO error

"I" - I2C access error

These access errors typically point to a hardware issue.

To determine whether your SFM is experiencing a general access error, look for a relevant syslog message, such as "SFM 3 found general access error."
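
As a rough aid for searching saved logs, the Python sketch below scans syslog lines for general access errors and collects the affected slot numbers. It is an assumption-laden example: the pattern is modeled only on the sample message quoted above, and the names ACCESS_ERROR, ERROR_TYPES, and slots_with_access_errors are hypothetical.

```python
import re

# Pattern modeled on the single example message quoted above; real syslog
# lines may carry timestamps or prefixes, so search() is used, not match().
ACCESS_ERROR = re.compile(r"SFM (\d+) found general access error", re.IGNORECASE)

# Error-type codes as listed in this document (case-sensitive as written).
ERROR_TYPES = {"m": "MDIO error", "I": "I2C access error"}


def slots_with_access_errors(lines):
    """Return the sorted SFM slot numbers named in general access error messages."""
    slots = set()
    for line in lines:
        found = ACCESS_ERROR.search(line)
        if found:
            slots.add(int(found.group(1)))
    return sorted(slots)


if __name__ == "__main__":
    sample = [
        "SFM 3 found general access error.",
        "unrelated message",
    ]
    print(slots_with_access_errors(sample))
```

A slot that appears repeatedly in the result is a candidate for the hardware checks described in this section.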

A major alarm is reported under several conditions. One such condition is exceeding the SFM safe operating temperature, as detected by environmental-monitoring hardware and software. The show environment command may capture the high temperature condition in addition to the error messages:

When this condition occurs, either the SFM genuinely is too hot, or a sensor has malfunctioned. If the directly adjacent SFMs are at normal temperature, suspect a faulty sensor. If the directly adjacent SFMs are not at normal temperature, suspect a genuine overheating condition.
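
The adjacent-slot comparison can be expressed as a short decision function. The Python sketch below is only an illustration: the name classify_overtemp, the normal_max threshold, and the assumption that adjacent SFMs occupy slot numbers one below and one above are all hypothetical, and all temperatures are assumed to be in the same unit.

```python
def classify_overtemp(slot, temps, normal_max):
    """Classify a hot SFM as a likely sensor fault or genuine overheating.

    slot: slot number of the SFM reporting a high temperature
    temps: dict mapping slot number -> reported temperature
    normal_max: highest temperature treated as normal (hypothetical threshold)
    """
    # Assumption: "directly adjacent" means slot - 1 and slot + 1.
    neighbors = [temps[s] for s in (slot - 1, slot + 1) if s in temps]
    if not neighbors:
        return "unknown: no adjacent SFM readings"
    if all(t <= normal_max for t in neighbors):
        return "suspect faulty sensor"
    # Any hot neighbor is treated as evidence of a real thermal problem.
    return "suspect genuine overheating"
```

When only one of two neighbors is hot, this sketch errs on the side of treating the condition as genuine overheating, since that is the safer assumption.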

When the system detects a genuine over-temperature condition, it powers off the SFM until the card cools down and software determines that it is safe to re-power. Upon re-power, the SFM reset reason is reported as "over-temperature" by the hardware. If software detects the over-temperature event and manually shuts down the SFM, the system reports an SFM reset reason of "remote-power-off".

To view the programmed alarm threshold levels, execute the show alarms threshold command.

Verify that a face plate is covering all slots without a line card. Without such plates, a high-temperature condition can occur within five minutes. Spare blanks are available from Force10 Networks.

Ensure that the chassis is not placed on the floor.

Verify that there are sufficient cooling tiles close to the chassis.

If a faulty sensor is suspected, reset the SFM remotely with the reset sfm slot number command. If the temperature really is high, the SFM will probably not turn on. In that case, pull the SFM out just a few inches, so that the card no longer connects to the backplane but still allows proper airflow for the rest of the chassis.

NOTE: This command is only available in FTOS 6.5.4.0 and later, and on the E-Series.

NOTE: Exercise care when removing the SFM; if it is 85 degrees, it could be hot to the touch.

Resetting the active SFM via the reset sfm command can result in traffic disruption, and this message:

Information to Collect if You Open a TAC Case

The level of information provided to Force10 Networks’ Technical Assistance Center (TAC) determines the troubleshooting detail that TAC can provide. With limited information, TAC normally recommends reseating an SFM reported in an error message and closely monitoring the SFM. If the SFM fails again, contact TAC to request further troubleshooting assistance. Please use the Create Service Request form on the isupport page and include the following information if available:

Console captures showing the error messages

Console captures showing the troubleshooting steps taken and the boot sequence during each step

Messages saved to a syslog server, if one is used

Output from the show trace command

Output from the show tech-support command
