An internal fault is any condition that is considered to be unacceptable for normal system operation. When the system has a fault, the Fault LED turns on. When domains encounter hardware errors, the auto-diagnosis and auto-restoration features detect, diagnose, and attempt to deconfigure the components associated with those errors (see Automatic Diagnosis and Recovery Overview for details). However, further troubleshooting by the system administrator may be necessary when there are other system problems or error conditions that are not handled by the auto-diagnosis engine.

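This chapter provides general guidelines for troubleshooting system problems and covers the following topics:

Platform, Domain, and System Messages

Platform and Domain Status Information From System Controller Commands

Diagnostic and System Configuration Information From Solaris Operating Environment Commands

Domain Not Responding

Board and Component Failures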

Platform, Domain, and System Messages

TABLE 12-1 identifies different ways to capture error messages and other system information displayed on the platform or console.

TABLE 12-1 Capturing Error Messages and Other System Information

Error Logging System

Definition

/var/adm/messages

File in the Solaris operating environment containing messages that are reported by the Solaris operating environment as determined by syslog.conf. This file does not contain any system controller or domain console messages.

Note: Messages diverted to external syslog hosts can be found in the /var/adm/messages file of the syslog host.

Platform console

Contains and displays system controller error and event messages.

Domain console

Contains and displays:

Messages written to the domain console by the Solaris operating environment

System controller error and event messages

Note: System controller messages that relate to a domain are reported to the domain console only and are not reported to the Solaris operating environment.

loghost

Used to collect system controller messages. You must set up a syslog loghost for the platform shell and for each domain shell, to capture platform and domain console output. To permanently save loghost error messages, you must set up a loghost server. For details on setting up the loghost for the platform and domains, see TABLE 3-1.

The system controller log files are necessary because they contain more information than the showlogs system controller command. Also, with the system controller log files, your service provider can obtain a persistent, stored history of the system, which can help during troubleshooting.

showlogs

System controller command that displays system controller messages for the platform and domain that are stored in a dynamic buffer. Once the buffer is filled, the old messages are overwritten.

The message buffer is cleared under these conditions:

When you reboot the system controller

When the system controller loses power

However, in systems with enhanced-memory SCs (SC V2s), certain log messages are maintained in persistent storage. These logs persist even after the system is rebooted or the system loses power. The showlogs -p command enables you to view specific persistent logs.

showerrorbuffer

System controller command that displays system error information stored in the system error buffer. The output provides details about the error, such as a fault condition. You and your service provider can review this information to analyze a failure or problem. The first error entry in the buffer is retained for diagnostic purposes. However, once the buffer becomes full, subsequent error messages cannot be stored and are discarded. The error buffer must be cleared by your service provider after the error condition is resolved.

In systems with enhanced-memory SCs (SC V2s), these system error messages are retained in persistent storage. These system error messages persist even after the SC is rebooted or the SC loses power.

showfru

System controller command that displays the field-replaceable units (FRUs) installed in a Sun Fire midrange system. Your service provider uses this information to track the FRUs in a system.
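The showlogs and showerrorbuffer entries in TABLE 12-1 describe two different behaviors once a buffer is full: the dynamic message buffer overwrites its oldest messages, whereas the error buffer retains its earliest entries and discards later ones. The following Python sketch only illustrates that difference. The class name, the buffer capacity, and the sample messages are hypothetical and do not reflect the actual system controller implementation.

```python
from collections import deque


class RetainFirstBuffer:
    """Keeps the earliest entries and discards new ones once full.

    This models the behavior described for the system error buffer.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []

    def append(self, message):
        # Once the buffer is full, later messages are dropped.
        if len(self.entries) < self.capacity:
            self.entries.append(message)
            return True
        return False


def demo(capacity=3):
    # Hypothetical messages, numbered in arrival order.
    messages = ["msg-%d" % i for i in range(1, 6)]

    # Dynamic buffer as described for showlogs: oldest entries are overwritten.
    dynamic = deque(maxlen=capacity)
    # Error buffer as described for showerrorbuffer: first entries are kept.
    retained = RetainFirstBuffer(capacity)

    for message in messages:
        dynamic.append(message)
        retained.append(message)
    return list(dynamic), retained.entries
```

With a capacity of three and five messages, the dynamic buffer ends up holding the three newest messages and the retain-first buffer holds the three oldest. In both cases some messages are lost, which is one reason a loghost is useful for keeping a complete history.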

Platform and Domain Status Information From System Controller Commands

TABLE 12-2 identifies system controller commands that provide platform and domain status information that can be used for troubleshooting purposes.

Displays the assignment information and status for all the components in the system.

showenvironment (platform and domain shells)

Displays the current environmental status, temperatures, currents, voltages, and fan status for the platform or the domain.

showdomain -v (domain shell)

Displays the domain configuration parameters.

showerrorbuffer (platform shell)

Shows the contents of the system errors in the system error buffer.

showfru -r manr (platform shell)

Displays the manufacturing records of FRUs installed in a Sun Fire midrange system.

showlogs -v or showlogs -v -d domainID (platform and domain shells)

Displays the system controller-logged events stored in the dynamic buffer.

showlogs -p -f filter (platform and domain shells)

For systems with SC V2s, displays the system controller-logged messages recorded in persistent storage.

showplatform -v or showplatform -d domainID (platform shell)

Shows the configuration parameters for the platform and specific domain information.

showresetstate -v or showresetstate -v -f URL (domain shell)

Prints a summary report of the contents of registers from every CPU in the domain that has a valid saved state. If you specify the -f URL option with the showresetstate command, the report summary is written to a URL, which can be reviewed by your service provider.

showsc -v (platform shell)

Shows the system controller and clock failover status, ScApp and RTOS versions, and uptime.

For additional information on these commands, refer to their command descriptions in the Sun Fire Midrange System Controller Command Reference Manual.

Diagnostic and System Configuration Information From Solaris Operating Environment Commands

You can obtain diagnostic and system configuration information through the Solaris operating environment, with the following commands:

prtconf command

The prtconf command prints the system configuration information. The output includes:

Total amount of memory

Configuration of the system peripherals formatted as a device tree

This command has many options. For command syntax, options, and examples, see the prtconf(1M) man page in your Solaris operating environment release.

prtdiag command

The prtdiag command displays the following information for the domain of your Sun Fire midrange system:

Configuration

Diagnostic (any failed FRUs)

Total amount of memory

For more information on this command, see the prtdiag(1M) man page in your Solaris operating environment release.

sysdef command

The Solaris operating environment sysdef utility outputs the current system definition in tabular form. It lists:

All hardware devices

Pseudo devices

System devices

Loadable modules

Values of selected kernel tunable parameters

This command generates the output by analyzing the named bootable operating system file (namelist) and extracting configuration information from it. The default system namelist is /dev/kmem.

The Solaris operating environment utility format, which is used to format drives, can also be used to display both logical and physical device names. For command syntax, options, and examples, see the format(1M) man page in your Solaris operating environment release.
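When you gather troubleshooting data, it can be convenient to save the output of these commands to files in one pass. The following Python sketch shows one possible way to do that. It assumes that Python 3 is available in the domain and that prtconf, prtdiag, and sysdef can be found on the PATH, neither of which this manual guarantees (prtdiag in particular is often installed in a platform-specific directory). The function name and the output directory are hypothetical. The commands are run without options.

```python
import shutil
import subprocess
from pathlib import Path

# Commands named in this section, run without options.
# Assumption: each command is on the PATH of the user running the script.
DEFAULT_COMMANDS = {
    "prtconf": ["prtconf"],
    "prtdiag": ["prtdiag"],
    "sysdef": ["sysdef"],
}


def collect(outdir, commands=None):
    """Run each command and save its output to <outdir>/<name>.txt.

    Returns a dict mapping each name to "saved", "not found", or "failed".
    """
    commands = DEFAULT_COMMANDS if commands is None else commands
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    status = {}
    for name, argv in commands.items():
        # Skip commands that cannot be located instead of raising an error.
        if shutil.which(argv[0]) is None:
            status[name] = "not found"
            continue
        try:
            result = subprocess.run(
                argv,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                timeout=120,
            )
        except (OSError, subprocess.TimeoutExpired):
            status[name] = "failed"
            continue
        # Output is saved even if the command exits with a nonzero status,
        # because a diagnostic command may do so when it reports a failure.
        (out / (name + ".txt")).write_bytes(result.stdout)
        status[name] = "saved"
    return status
```

For example, calling collect("/var/tmp/diag-output") would write one text file per command that was found and return a summary of what was saved. The files can then be passed to your service provider together with the system controller data.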

Domain Not Responding

If a domain is not responding, the domain is most likely in one of the following states:

Paused due to a hardware error

If the system controller detects a hardware error, and the reboot-on-error parameter in the setupdomain command is set to true, the domain is automatically rebooted after the auto-diagnosis engine reports and deconfigures components associated with the hardware error.

However, if the reboot-on-error parameter is set to false, the domain is paused. If the domain is paused, reset the domain by turning the domain off with the setkeyswitch off command and then turning the domain on with the setkeyswitch on command.

Hung

A domain can be hung because:

The domain heartbeat stops.

The domain does not respond to interrupts.

In such cases, the system controller automatically performs an XIR and reboots the domain, provided that the hang-policy parameter of the setupdomain command is set to reset.

However, if the domain hangs and the hang-policy parameter of the setupdomain command is set to notify, the system controller reports that the domain is hung but does not automatically recover the domain. In this case, you must recover the hung domain as explained in the following procedure.

A domain is considered to be hard hung when the Solaris operating environment and OpenBoot PROM (OBP) are not responding at the domain console.
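The behavior described in this section depends on two setupdomain parameters, reboot-on-error and hang-policy. The following Python sketch restates those rules as a small decision function so that the combinations are easier to see at a glance. It is a summary of the text above, not system controller code; the function name, the condition labels, and the returned strings are hypothetical.

```python
def expected_sc_response(condition, reboot_on_error, hang_policy):
    """Summarize the system controller behavior described in this section.

    condition:       "hardware-error" or "hang" (labels chosen for this sketch)
    reboot_on_error: True or False, the reboot-on-error parameter value
    hang_policy:     "reset" or "notify", the hang-policy parameter value
    """
    if condition == "hardware-error":
        if reboot_on_error:
            return "automatic reboot after auto-diagnosis"
        # The domain stays paused until it is reset with the keyswitch.
        return "domain paused; setkeyswitch off, then setkeyswitch on"
    if condition == "hang":
        if hang_policy == "reset":
            return "automatic XIR and reboot"
        if hang_policy == "notify":
            return "hang reported; manual recovery required"
        raise ValueError("unknown hang-policy value: %r" % (hang_policy,))
    raise ValueError("unknown condition: %r" % (condition,))
```

Only the hang with hang-policy set to notify, and the hardware error with reboot-on-error set to false, require action from the administrator.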

To Recover From a Hung Domain

Note - This procedure assumes that the system controller is functioning and that the hang-policy parameter of the setupdomain command is set to notify.

1. Determine the status for the domain as reported by the system controller.

Type one of the following system controller commands:

showplatform -p status (platform shell)

showdomain -p status (domain shell)

These commands provide the same type of information in the same format. If the output in the Domain Status field displays Not Responding, the system controller has determined that the domain is hung.

2. Reset the domain by typing the reset command.

Note - A domain cannot be reset while the domain keyswitch is in the secure position.

In order for the system controller to perform this operation, you must confirm it. For a complete definition of this command, refer to the reset command in the Sun Fire Midrange System Controller Command Reference Manual.

The manner in which the domain recovery occurs is determined by the OBP.error-reset-recovery parameter settings in the setupdomain command. For details on the domain parameters, refer to the setupdomain command in the Sun Fire Midrange System Controller Command Reference Manual.

Board and Component Failures

The auto-diagnosis engine can diagnose and identify certain types of components, such as CPU/Memory boards and I/O assemblies, associated with hardware failures. However, other components, such as the System Controller boards, Repeater boards, power supplies, and fan trays are not handled by the auto-diagnosis engine.

Handling Component Failures

This section describes what to do when the following components fail:

CPU/Memory boards

I/O assemblies

Repeater boards

System Controller boards

Power supplies

Fan trays

For additional information about these components, refer to the Sun Fire 6800/4810/4800/3800 Systems Service Manual or the Sun Fire E6900/E4900 Systems Service Manual.

To Handle Failed Components

1. Capture and collect system information for troubleshooting purposes.

In a redundant SC configuration, wait for automatic SC failover to occur. After the failover, review the showlogs command output, the platform loghost, if configured, and platform messages for the working SC to obtain information on the failure condition.

If you have one SC and it fails, collect data from the platform and domain console or loghosts, and output from the showlogs and showerrorbuffer commands.

Power supply failure - If you do not have a redundant power supply, collect troubleshooting data as described in TABLE 12-1 and TABLE 12-2.

Fan tray failure - If you do not have a redundant fan tray, collect troubleshooting data as described in TABLE 12-1 and TABLE 12-2.

2. Contact your service provider for further assistance.

Your service provider will review the troubleshooting data that you gathered and will initiate the appropriate service action.

Recovering from a Repeater Board Failure

If a Repeater board failure occurs, you can use remaining domain resources until the failed board can be replaced. You must set the partition mode parameter (of the setupplatform command) to dual-partition mode and adjust the domain resources to use available domains, as indicated in TABLE 12-3.

TABLE 12-3 Adjusting Domain Resources When a Repeater Board Fails

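Midrange Server                            RP0 Failure      RP1 Failure      RP2 Failure      RP3 Failure      Use Available Domains
Sun Fire E6900 and 6800                    X                -                -                -                C and D
Sun Fire E6900 and 6800                    -                X                -                -                C and D
Sun Fire E6900 and 6800                    -                -                X                -                A and B
Sun Fire E6900 and 6800                    -                -                -                X                A and B
Sun Fire E4900/4810/4800/3800 systems      X                Not applicable   -                Not applicable   C
Sun Fire E4900/4810/4800/3800 systems      -                Not applicable   X                Not applicable   A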
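The information in TABLE 12-3 can also be expressed as a simple lookup, which may be useful if you keep recovery notes or scripts for your site. The following Python sketch transcribes the table. The dictionary keys used for the server families and the function name are labels chosen for this example, not identifiers used by the system controller.

```python
# Data transcribed from TABLE 12-3.
# The family keys are hypothetical labels chosen for this sketch.
AVAILABLE_DOMAINS = {
    "E6900/6800": {
        "RP0": ["C", "D"],
        "RP1": ["C", "D"],
        "RP2": ["A", "B"],
        "RP3": ["A", "B"],
    },
    # RP1 and RP3 are marked "Not applicable" for these systems.
    "E4900/4810/4800/3800": {
        "RP0": ["C"],
        "RP2": ["A"],
    },
}


def available_domains(family, failed_board):
    """Return the domains that remain usable after a Repeater board fails."""
    boards = AVAILABLE_DOMAINS[family]
    if failed_board not in boards:
        raise ValueError(
            "%s is not applicable to %s systems" % (failed_board, family)
        )
    return boards[failed_board]
```

For example, available_domains("E6900/6800", "RP2") returns the list containing A and B, which matches the RP2 Failure row for those systems.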

If you are running host-licensed software on a domain affected by a Repeater board failure, you can also swap the HostID/MAC address of the affected domain with that of an available domain. You can then use the hardware of the available domain to run the host-licensed software without encountering license restrictions. Use the HostID/MAC Address Swap parameter in the setupplatform command to swap the HostID/MAC address between a pair of domains. For details, see Swapping Domain HostID/MAC Addresses.