There are several troubleshooting options that you can implement when you set up and configure the Netra 440 server. By setting up your system with troubleshooting in mind, you can save time and minimize disruptions if the system encounters any problems.

Updated Troubleshooting Information

Sun will continue to gather and publish information about the Netra 440 server long after the initial system documentation is shipped. You can obtain the most current server troubleshooting information in the Product Notes and at Sun web sites. These resources can help you understand and diagnose problems that you might encounter.

Web Sites

SunSolve Online

This site presents a collection of resources for Sun technical and support information. Access to some of the information on this site depends on the level of your service contract with Sun. This site includes the following:

Sun Install Check tool - A utility you can use to verify proper installation and configuration of a new Netra server. This resource checks a Netra server for valid patches, hardware, operating environment, and configuration.

Sun System Handbook - A document that contains technical information and provides access to discussion groups for most Sun hardware, including the Netra 440 server.

Support documents, security bulletins, and related links.

The SunSolve Online Web site is at:

http://sunsolve.sun.com

Big Admin

This web site is a one-stop resource for Sun system administrators.

Firmware and Software Patch Management

Sun makes every attempt to ensure that each system is shipped with the latest firmware and software. However, in complex systems, bugs and problems are discovered in the field after systems leave the factory. Often, these problems are fixed with patches to the system's firmware. Keeping your system's firmware and Solaris OS current with the latest recommended and required patches can help you avoid problems that others might have already discovered and solved.

Firmware and operating system updates are often required to diagnose or fix a problem. Schedule regular updates of your system's firmware and software so that you will not have to update the firmware or software at an inconvenient time.

You can find the latest patches and updates for the Netra 440 server at the Web sites listed in Web Sites.
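As a sketch, you can list the patches already applied to a Solaris system before comparing them against the current recommended patch list. The showrev and patchadd commands shown are standard Solaris utilities; exact output format varies by release.

```shell
# List patches currently applied to this Solaris system.
# Both commands print one line per installed patch ID;
# either can be compared against the recommended patch list.
showrev -p | head
patchadd -p | head
```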

Sun Install Check Tool

When you install the SunSM Install Check tool, you also install Sun Explorer Data Collector. The Sun Install Check tool uses Sun Explorer Data Collector to help you confirm that the Netra 440 server installation has been completed optimally. Together, they can evaluate your system for the following:

Minimum required operating system level

Presence of key critical patches

Proper system firmware levels

Unsupported hardware components

If potential issues are identified, the software generates a report that provides specific instructions for remedying the issues.

You can download the Sun Install Check tool software and documentation from the SunSolve Online web site (see Web Sites).

Sun Explorer Data Collector

The Sun Explorer Data Collector is a system data collection tool that Sun support services engineers sometimes use when troubleshooting Sun SPARC and x86 systems. In certain support situations, Sun support services engineers might ask you to install and run this tool. If you installed the Sun Install Check tool at initial installation, you also installed Sun Explorer Data Collector. If you did not install the Sun Install Check tool, you can install Sun Explorer Data Collector later without the Sun Install Check tool. By installing this tool as part of your initial system setup, you avoid having to install the tool at a later, and often inconvenient, time.

Sun Remote Services Net Connect

SunSM Remote Services (SRS) Net Connect is a collection of system management services designed to help you better control your computing environment. These Web-delivered services enable you to monitor systems, to create performance and trend reports, and to receive automatic notification of system events. These services help you to act more quickly when a system event occurs and to manage potential issues before they become problems.

Configuring the System for Troubleshooting

System failures are characterized by certain symptoms. Each symptom can be traced to one or more problems or causes by using specific troubleshooting tools and techniques. This section describes troubleshooting tools and techniques that you can control through configuration variables.

Hardware Watchdog Mechanism

The hardware watchdog mechanism is a hardware timer that is continually reset as long as the operating system is running. If the system hangs, the operating system is no longer able to reset the timer. The timer then expires and causes an automatic externally initiated reset (XIR), displaying debug information on the system console. The hardware watchdog mechanism is enabled by default. If the hardware watchdog mechanism is disabled, the Solaris OS must be configured before the hardware watchdog mechanism can be reenabled.

The configuration variable error-reset-recovery allows you to control how the hardware watchdog mechanism behaves when the timer expires. The following are the error-reset-recovery settings:

boot (default) - Resets the timer and attempts to reboot the system

sync (recommended) - Attempts to automatically generate a core dump file, reset the timer, and reboot the system

none (equivalent to issuing a manual XIR from the ALOM system controller) - Drops the server to the ok prompt, enabling you to issue commands and debug the system
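From the Solaris OS, one way to inspect or change this OpenBoot configuration variable is with the eeprom command, run as superuser; this is a sketch, since you can also set the variable with setenv at the ok prompt.

```shell
# Display the current setting of the watchdog recovery variable
eeprom error-reset-recovery

# Set it to the recommended value so a core dump file is
# generated before the system reboots
eeprom error-reset-recovery=sync
```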

For more information about the hardware watchdog mechanism and XIR, refer to the Netra 440 Server System Administration Guide (817-3884-xx).

Automatic System Recovery Settings

The automatic system recovery (ASR) features enable the system to resume operation after experiencing certain nonfatal hardware faults or failures. When ASR is enabled, the system's firmware diagnostics automatically detect failed hardware components. An auto-configuring capability designed into the OpenBoot firmware enables the system to unconfigure failed components and to restore system operation. As long as the system is capable of operating without the failed component, the ASR features enable the system to reboot automatically, without operator intervention.

How you configure ASR settings affects not only how the system handles certain types of failures but also how you go about troubleshooting certain problems.

With ASR enabled, diagnostic tests run automatically when most serious hardware and software errors occur. This configuration can save you time when diagnosing problems, since POST and OpenBoot Diagnostics test results are already available after the system encounters an error.

For more information about how ASR works, and complete instructions for enabling ASR capability, refer to the Netra 440 Server System Administration Guide (817-3884-xx).

Remote Troubleshooting Capabilities

You can use the Advanced Lights Out Manager (ALOM) system controller to troubleshoot and diagnose the system remotely. The ALOM system controller lets you do the following:

Turn system power on and off

Control the Locator LED

Change OpenBoot configuration variables

View system environmental status information

View system event logs

In addition, you can use the ALOM system controller to access the system console, provided the console has not been redirected elsewhere.

For more information about the system console, refer to the Netra 440 Server System Administration Guide.

System Console Logging

Console logging is the ability to collect and log system console output. Console logging captures console messages so that system failure data, like Fatal Reset error details and POST output, can be recorded and analyzed.

Console logging is especially valuable when troubleshooting Fatal Reset errors and RED State Exceptions. In these conditions, the Solaris OS terminates abruptly, and although it sends messages to the system console, the operating system software does not log any messages in traditional file system locations such as the /var/adm/messages file.

The error logging daemon, syslogd, automatically records various system warnings and errors in message files. By default, many of these system messages are displayed on the system console and are stored in the /var/adm/messages file. You can direct where these messages are stored or have them sent to a remote system by setting up system message logging. For more information, refer to "How to Customize System Message Logging" in the System Administration Guide: Advanced Administration, which is part of the Solaris System Administrator Collection.
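As a minimal sketch, forwarding system messages to a remote system involves adding a line such as the following to /etc/syslog.conf and then signaling syslogd to reread its configuration. The host name loghost is an assumption, and syslog.conf fields must be separated by tabs.

```shell
# Hypothetical /etc/syslog.conf entry (tab-separated fields):
#   *.err;kern.notice    @loghost
#
# After editing the file, make syslogd reread its configuration:
kill -HUP `cat /etc/syslog.pid`
```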

In some failure situations, a large stream of data is sent to the system console. Because ALOM log messages are written into a "circular buffer" that holds 64 Kbyte of data, it is possible that the output identifying the original failing component can be overwritten. Therefore, you may want to explore further system console logging options, such as SRS Net Connect or third-party vendor solutions. For more information about SRS Net Connect, see Sun Remote Services Net Connect.

Certain third-party vendors offer data logging terminal servers and centralized system console management solutions that monitor and log output from many systems. Depending on the number of systems you are administering, these might offer solutions for logging system console information.

For more information about the system console, refer to the Netra 440 Server System Administration Guide.

The Core Dump Process

In some failure situations, a Sun engineer might need to analyze a system core dump file to determine the root cause of a system failure. Although the core dump process is enabled by default, you should configure your system so that the core dump file is saved in a location with adequate space. You might also want to change the default core dump directory to another locally mounted location so that you can better manage any system core dumps. In certain testing and preproduction environments, this is recommended since core dump files can take up a large amount of file system space.
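As a sketch, you can point savecore at a different locally mounted directory with the dumpadm command; the path /files/crash/mysystem here is a hypothetical example.

```shell
# Direct future core dump files to a locally mounted file
# system with adequate free space (the path is hypothetical)
dumpadm -s /files/crash/mysystem
```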

Swap space is used to save the dump of system memory. By default, Solaris software uses the first swap device that is defined. This first swap device is known as the dump device.

During a system core dump, the system saves the content of kernel core memory to the dump device. The dump content is compressed during the dump process at a 3:1 ratio; that is, if the system were using 6 Gbyte of kernel memory, the dump file would be about 2 Gbyte. For a typical system, the dump device should be at least one-third the size of the total system memory.
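The sizing rule above can be checked with simple shell arithmetic; the 6-Gbyte figure is the example from the text, and the 3:1 compression ratio is the approximation stated above.

```shell
# Approximate dump device sizing: one third of kernel memory,
# based on the ~3:1 compression ratio described above.
MEM_GB=6                          # example kernel memory size
MIN_DUMP_GB=$(( MEM_GB / 3 ))     # minimum dump device size
echo "dump device should be at least ${MIN_DUMP_GB} Gbyte"
```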

See To Enable the Core Dump Process for instructions on how to calculate the amount of available swap space. You would normally enable the core dump process just prior to placing a system into the production environment.

To Enable the Core Dump Process

1. Access the system console.

Refer to the Netra 440 Server System Administration Guide.

2. Check that the core dump process is enabled.

As superuser, type the dumpadm command.

# dumpadm

Dump content: kernel pages

Dump device: /dev/dsk/c0t0d0s1 (swap)

Savecore directory: /var/crash/machinename

Savecore enabled: yes

By default, the core dump process is enabled in Solaris 8.

3. Verify that there is sufficient swap space to dump memory.

Type the swap -l command.

# swap -l

swapfile dev swaplo blocks free

/dev/dsk/c0t3d0s0 32,24 16 4097312 4062048

/dev/dsk/c0t1d0s0 32,8 16 4097312 4060576

/dev/dsk/c0t1d0s1 32,9 16 4097312 4065808

To determine how many bytes of swap space are available, multiply the number in the blocks column by 512. Taking the number of blocks from the first entry, c0t3d0s0, calculate as follows:

4097312 x 512 = 2097823744

The result is approximately 2 Gbyte.
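The arithmetic in step 3 can be reproduced directly in the shell; the block count is taken from the c0t3d0s0 entry in the swap -l output above.

```shell
# Convert 512-byte swap blocks to bytes (c0t3d0s0 entry above)
BLOCKS=4097312
BYTES=$(( BLOCKS * 512 ))
echo "$BYTES bytes of swap available"   # about 2 Gbyte
```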

4. Verify that there is sufficient file system space for the core dump files.

Type the df -k command.

# df -k /var/crash/`uname -n`

By default the location where savecore files are stored is:

/var/crash/`uname -n`

For instance, for the mysystem server, the default directory is:

/var/crash/mysystem

The file system specified must have space for the core dump files.
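A quick way to compare the free space reported by df -k against the expected dump size is sketched below; the free-space figure is a hypothetical stand-in for the avail column of your own df -k output.

```shell
# Compare free space (Kbyte, from the df -k "avail" column)
# against the expected compressed dump size.
FREE_KB=4500000                      # hypothetical df -k value
DUMP_KB=$(( 2097823744 / 1024 ))     # ~2 Gbyte dump, in Kbyte
if [ "$FREE_KB" -ge "$DUMP_KB" ]; then
  echo "enough space for the core dump file"
else
  echo "not enough space in the savecore directory"
fi
```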

If you see messages from savecore indicating that there is not enough space in the /var/crash/ directory, any other locally mounted (not NFS) file system can be used. Following is a sample message from savecore.