This chapter describes how to dynamically reconfigure the CPU/Memory boards on Sun Fire entry-level midrange systems.

Dynamic Reconfiguration

Overview

DR software is part of the Solaris operating environment. With DR you can dynamically reconfigure system boards, and safely remove or install them, while the Solaris operating environment is running and with minimal disruption to user processes. You can use DR to do the following:

Minimize the interruption of system applications while installing or removing a board.

Disable a failing device by removing it before the failure can crash the operating system.

Display the operational status of boards.

Initiate system tests of a board while the system continues to run.

Command Line Interface

The Solaris cfgadm(1M) command provides the command line interface for the administration of DR functionality.

DR Concepts

Quiescence

During the unconfigure operation on a system board with permanent memory (OpenBoot PROM or kernel memory), the operating environment is briefly paused, which is known as operating environment quiescence. All operating environment and device activity on the baseplane must cease during a critical phase of the operation.

Note - Quiescence may take several minutes, depending on workload and system configuration.

Before it can achieve quiescence, the operating environment must temporarily suspend all processes, CPUs, and device activities. It may take a few minutes to achieve quiescence depending on system usage and activities currently in progress. If the operating environment cannot achieve quiescence, it displays the reasons, which may include the following:

An execution thread did not suspend.

Real-time processes are running.

A device exists that cannot be paused by the operating environment.

The conditions that cause processes to fail to suspend are generally temporary. Examine the reasons for the failure. If the operating environment encountered a transient condition--a failure to suspend a process--you can try the operation again.

RPC or TCP Time-out or Loss of Connection

Time-outs occur by default after two minutes. Administrators may need to increase this time-out value to avoid time-outs during a DR-induced operating system quiescence, which may take longer than two minutes. Quiescing a system makes the system and related network services unavailable for a period of time that can exceed two minutes. These changes affect both the client and server machines.

Suspend-Safe and Suspend-Unsafe Devices

When DR suspends the operating environment, all of the device drivers that are attached to the operating environment must also be suspended. If a driver cannot be suspended (or subsequently resumed), the DR operation fails.

A suspend-safe device does not access memory or interrupt the system while the operating environment is in quiescence. A driver is suspend-safe if it supports operating environment quiescence (suspend/resume). A suspend-safe driver also guarantees that when a suspend request is successfully completed, the device that the driver manages will not attempt to access memory, even if the device is open when the suspend request is made.

A suspend-unsafe device allows a memory access or a system interruption to occur while the operating environment is in quiescence.

Attachment Points

An attachment point is a collective term for a board and its slot. DR can display the status of the slot, the board, and the attachment point. The DR definition of a board also includes the devices connected to it, so the term "occupant" refers to the combination of board and attached devices.

A slot (also called a receptacle) has the ability to electrically isolate the occupant from the host machine. That is, the software can put a single slot into low-power mode.

Receptacles can be named according to slot numbers or can be anonymous (for example, a SCSI chain). To obtain a list of all available logical attachment points, use the -l option with the cfgadm(1M) command.

There are two formats used when referring to attachment points:

A physical attachment point describes the software driver and location of the slot. An example of a physical attachment point name is:

/devices/ssm@0,0:N0.SBx

where N0 is node 0 (zero),

SB is a system board,

x is a slot number. A slot number can be 0, 2 or 4 for a system board.

A logical attachment point is an abbreviated name created by the system to refer to the physical attachment point. Logical attachment points take the following form:

N0.SBx

Note that cfgadm also shows the I/O assembly N0.IB6. Because this assembly is non-redundant, no DR actions are allowed on this attachment point.
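The relationship between the two name formats can be sketched as a small shell function. This is an illustration only: physical_ap is a hypothetical helper name, not part of cfgadm, and it assumes the /devices/ssm@0,0: prefix and the slot numbers 0, 2 and 4 given above.

```shell
# Sketch: map a logical attachment point (N0.SBx) to its physical name.
# physical_ap is a hypothetical helper, not a cfgadm feature.
physical_ap() {
    case "$1" in
        N0.SB0|N0.SB2|N0.SB4)
            # Physical form described above: /devices/ssm@0,0:N0.SBx
            printf '/devices/ssm@0,0:%s\n' "$1"
            ;;
        *)
            echo "not a system board attachment point: $1" >&2
            return 1
            ;;
    esac
}
```

For example, physical_ap N0.SB2 prints /devices/ssm@0,0:N0.SB2, and the function rejects N0.IB6 because it is not a system board.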

DR Operations

There are four main types of DR operation.

TABLE 10-1 Types of DR Operation

Connect

The slot provides power to the board and monitors its temperature.

Configure

The operating environment assigns functional roles to a board, loads device drivers for the board, and brings the devices on that board into use by the Solaris operating environment.

Unconfigure

The system detaches a board logically from the operating environment. Environmental monitoring continues, but devices on the board are not available for system use.

Disconnect

The system stops monitoring the board, and power to the slot is turned off.

If a system board is in use, stop its use and disconnect it from the system before you power it off. After a new or upgraded system board is inserted and powered on, connect its attachment point and configure it for use by the operating environment. The cfgadm(1M) command can connect and configure (or unconfigure and disconnect) in a single command, but if necessary, each operation (connection, configuration, unconfiguration, or disconnection) can be performed separately.
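The ordering of the separate operations can be sketched as a dry-run helper that prints the cfgadm commands without running them. dr_plan is a hypothetical name introduced for illustration; only the cfgadm -c operations themselves come from this chapter.

```shell
# Sketch: print (do not run) the cfgadm commands for a DR operation,
# one step at a time. dr_plan is a hypothetical helper.
dr_plan() {
    action=$1 ap_id=$2
    case "$action" in
        add)
            # Separate steps; "cfgadm -c configure" alone can do both.
            echo "cfgadm -c connect $ap_id"
            echo "cfgadm -c configure $ap_id"
            ;;
        remove)
            # Separate steps; "cfgadm -c disconnect" alone can do both.
            echo "cfgadm -c unconfigure $ap_id"
            echo "cfgadm -c disconnect $ap_id"
            ;;
        *)
            echo "usage: dr_plan add|remove ap_id" >&2
            return 1
            ;;
    esac
}
```

Printing the commands first lets you review the sequence before running each one as superuser.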

Hot-Plug Hardware

Hot-plug devices have special connectors that supply electrical power to the board or module before the data pins make contact. Boards and devices that have hot-plug connectors can be inserted or removed while the system is running. The devices have control circuits to ensure they have a common reference and power control during the insertion process. The interfaces are not powered on until the board is home and the System Controller instructs them to.

The CPU/Memory boards used in Sun Fire entry-level midrange systems are hot-plug devices.

Conditions and States

A state is the operational status of either a receptacle (slot) or an occupant (board). A condition is the operational status of an attachment point.

Before you attempt to perform any DR operation on a board or component, you must determine its state and condition. Use the cfgadm(1M) command with the -la options to display the type, state, and condition of each component and the state and condition of each board slot in the system. See the section Component Types for a list of the component types.

Board States and Conditions

This section contains descriptions of the states and conditions of CPU/Memory boards (also known as system slots).

Board Receptacle States

A board can have one of three receptacle states: empty, disconnected, or connected. Whenever you insert a board, the receptacle state changes from empty to disconnected. Whenever you remove a board the receptacle state changes from disconnected to empty.

Caution - Physically removing a board that is in the connected state, or that is powered on and in the disconnected state, crashes the operating system and can result in permanent damage to that system board.

TABLE 10-2 Board Receptacle States

Name

Description

empty

A board is not present.

disconnected

The board is disconnected from the system bus. A board can be in the disconnected state without being powered off. However, a board must be powered off and in the disconnected state before you remove it from the slot.

connected

The board is powered on and connected to the system bus. You can view the components on a board only after it is in the connected state.

Board Occupant States

A board can have one of two occupant states: configured or unconfigured. The occupant state of a disconnected board is always unconfigured.

TABLE 10-3 Board Occupant States

Name

Description

configured

At least one component on the board is configured.

unconfigured

All of the components on the board are unconfigured.
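The receptacle and occupant states together suggest which DR step comes next when you want to remove a board. The following is a minimal sketch of that reasoning; next_removal_step is a hypothetical helper, and the wording of its answers is an assumption made for illustration.

```shell
# Sketch: given receptacle and occupant states, name the next step
# toward removing a board. next_removal_step is a hypothetical helper.
next_removal_step() {
    receptacle=$1 occupant=$2
    case "$receptacle/$occupant" in
        connected/configured)      echo "unconfigure" ;;
        connected/unconfigured)    echo "disconnect" ;;
        disconnected/unconfigured) echo "power off, then remove" ;;
        empty/*)                   echo "nothing to remove" ;;
        *)
            # A disconnected board is always unconfigured, so anything
            # else is treated as an unexpected combination.
            echo "unexpected state: $receptacle/$occupant" >&2
            return 1
            ;;
    esac
}
```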

Board Conditions

A board can be in one of four conditions: unknown, ok, failed, or unusable.

TABLE 10-4 Board Conditions

Name

Description

unknown

The board has not been tested.

ok

The board is operational.

failed

The board failed testing.

unusable

The board slot is unusable.

Component States and Conditions

This section contains descriptions of the states and conditions for components.

Component Receptacle States

A component cannot be individually connected or disconnected. Thus, components can have only one state: connected.

Component Occupant States

A component can have one of two occupant states: configured or unconfigured.

TABLE 10-5 Component Occupant States

Name

Description

configured

Component is available for use by the Solaris operating environment.

unconfigured

Component is not available for use by the Solaris operating environment.

Component Conditions

A component can have one of three conditions: unknown, ok, or failed.

TABLE 10-6 Component Conditions

Name

Description

unknown

Component has not been tested.

ok

Component is operational.

failed

Component failed testing.

Component Types

You can use DR to configure or to unconfigure several types of component.

TABLE 10-7 Component Types

Name

Description

cpu

Individual CPU

memory

All the memory on the board

Nonpermanent and Permanent Memory

Before you can delete a board, the operating environment must vacate the memory on that board. Vacating a board means flushing its nonpermanent memory to swap space and copying its permanent memory (that is, kernel and OpenBoot PROM memory) to another memory board. To relocate permanent memory, the operating environment must be temporarily suspended, or quiesced. The length of the suspension depends on the system configuration and the running workloads.

Detaching a board with permanent memory is the only time when the operating environment is suspended; therefore, you should know where permanent memory resides so that you can avoid significantly impacting the operation of the system. You can display the permanent memory by using the cfgadm(1M) command with the -v option.

When permanent memory is on the board, the operating environment must find another memory component of adequate size to receive the permanent memory. If that is not possible, the DR operation fails.

Limitations

Memory Interleaving

System boards cannot be dynamically reconfigured if system memory is interleaved across multiple CPU/Memory boards.

Reconfiguring Permanent Memory

When a CPU/Memory board containing non-relocatable (permanent) memory is dynamically reconfigured out of the system, a short pause in all domain activity is required, which may delay application response. Typically, this condition applies to one CPU/Memory board in the system. The memory on the board is identified by a non-zero permanent memory size in the status display produced by the cfgadm -av command.
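One way to pick out the board that holds permanent memory is to filter the verbose status display. The sketch below reads cfgadm -av output on standard input. It assumes that memory components appear as ap_id::memory and that the information field contains text of the form "<n> KBytes permanent"; both are assumptions to verify against the output on your own system, and permanent_memory_boards is a hypothetical helper name.

```shell
# Sketch: list memory components reporting a non-zero permanent size.
# Reads "cfgadm -av" output on stdin. The "<n> KBytes permanent" wording
# is an ASSUMED format, not confirmed by this chapter.
permanent_memory_boards() {
    awk '/::memory/ {
        for (i = 3; i <= NF; i++)
            if ($i ~ /^permanent/ && $(i - 1) == "KBytes" && $(i - 2) + 0 > 0)
                print $1, $(i - 2)
    }'
}

# Usage (as superuser on the system itself):
#   cfgadm -av | permanent_memory_boards
```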

DR supports reconfiguration of permanent memory from one system board to another only if one of the following conditions is met:

The target system board has the same amount of memory as the source system board;

-OR-

The target system board has more memory than the source system board. In this case, the additional memory is added to the pool of available memory.
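The two conditions reduce to a single comparison: the target board must have at least as much memory as the source board. A minimal sketch, assuming both sizes are given in the same unit (for example KBytes); can_receive_permanent is a hypothetical helper name.

```shell
# Sketch: succeed if the target board can receive permanent memory
# from the source board, i.e. target memory >= source memory.
# Both arguments must use the same unit. Hypothetical helper.
can_receive_permanent() {
    source_size=$1 target_size=$2
    [ "$target_size" -ge "$source_size" ]
}
```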

The board is assigned, but the hardware has not been configured to use it. The board may be reassigned by the chassis port or released.

Active

The board is being actively used. You cannot reassign an active board.

Displaying Basic Board Status

The cfgadm program displays information about boards and slots. Refer to the cfgadm(1M) man page for options to this command.

Many operations require that you specify the system board names. To obtain these system names, type:

# cfgadm

When used without options, cfgadm displays information about all known attachment points, including board slots and SCSI buses. The following display shows a typical output.

CODE EXAMPLE 10-1 Output of the Basic cfgadm Command

# cfgadm

Ap_Id     Type          Receptacle   Occupant       Condition
N0.IB6    PCI_I/O_Boa   connected    configured     ok
N0.SB0    CPU_Board     connected    configured     unknown
N0.SB4    unknown       empty        unconfigured   unknown
c0        scsi-bus      connected    configured     unknown
c1        scsi-bus      connected    unconfigured   unknown
c2        scsi-bus      connected    unconfigured   unknown
c3        scsi-bus      connected    configured     unknown
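When a script needs the state of a single attachment point, the basic display can be filtered with awk. The following sketch reads cfgadm output on standard input and prints the receptacle state, occupant state, and condition for one Ap_Id. ap_status is a hypothetical helper name; it assumes the last three columns are laid out as in CODE EXAMPLE 10-1.

```shell
# Sketch: print "receptacle occupant condition" for one attachment point.
# Reads basic cfgadm output on stdin; exits non-zero if the Ap_Id is absent.
# ap_status is a hypothetical helper.
ap_status() {
    awk -v id="$1" '
        $1 == id { print $(NF - 2), $(NF - 1), $NF; found = 1 }
        END { exit !found }'
}

# Usage:
#   cfgadm | ap_status N0.SB0
```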

Displaying Detailed Board Status

For a more detailed status report, use the command cfgadm -av. The -a option lists attachment points and the -v option turns on expanded (verbose) descriptions.

CODE EXAMPLE 10-2 is a partial display produced by the cfgadm -av command. The output appears complicated because the lines wrap around in this display. (This status report is for the same system used in CODE EXAMPLE 10-1.) FIGURE 10-1 provides details of each display item.

Command Options

connect

The slot provides power to the board and begins monitoring the board. The slot is assigned if it was not previously assigned.

disconnect

The system stops monitoring the board and power to the slot is turned off.

configure

The operating system assigns functional roles to a board and loads device drivers for the board and for the devices attached to the board.

unconfigure

The system detaches a board logically from the operating system and takes the associated device drivers offline. Environmental monitoring continues, but any devices on the board are not available for system use.

The options provided by the cfgadm -x command are listed in TABLE 10-10.

TABLE 10-10 cfgadm -x Command Options

cfgadm -x Option

Function

poweron

Powers on a CPU/Memory board.

poweroff

Powers off a CPU/Memory board.

The cfgadm_sbd man page provides additional information on the cfgadm -c and cfgadm -x options. The sbd library provides the functionality for hot-plugging system boards of the class sbd, through the cfgadm framework.

Testing Boards and Assemblies

To Test a CPU/Memory Board

Before you can test a CPU/Memory board, it must first be powered on and disconnected. If these conditions are not met, the board test fails.

You can use the Solaris cfgadm command to test CPU/memory boards. As superuser, type:

# cfgadm -t ap_id

To change the level of diagnostics that cfgadm runs, supply a diagnostic level for the cfgadm command as follows:

# cfgadm -o platform=diag=<level> -t ap_id

where level is a diagnostic level, and ap_id is one of the following: N0.SB0, N0.SB2 or N0.SB4.

If you do not supply level, the diagnostic level is default. The diagnostic levels are:

TABLE 10-11 Diagnostic Levels

Diagnostic Level

Description

init

Only system board initialization code is run. No testing is done. This is a very fast pass through POST.

quick

All system board components are tested with few tests and test patterns.

default

All system board components are tested with all tests and test patterns, except for memory and Ecache modules. Note that max and default have the same definition.

max

All system board components are tested with all tests and test patterns, except for memory and Ecache modules. Note that max and default have the same definition.

mem1

Runs all tests at the default level, plus more exhaustive DRAM and SRAM test algorithms. For Memory and Ecache modules, all locations are tested with multiple patterns. More extensive, time-consuming algorithms are not run at this level.

mem2

The same as mem1, with the addition of a DRAM test that does explicit compare operations of the DRAM data.
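To avoid mistyping a diagnostic level, the test command can be assembled by a small helper that checks the level against TABLE 10-11 first. This is a sketch: board_test_cmd is a hypothetical name, and it only prints the command so that you can review it before running it as superuser.

```shell
# Sketch: build (not run) the cfgadm board-test command for a given
# diagnostic level. board_test_cmd is a hypothetical helper; the valid
# levels are those listed in TABLE 10-11.
board_test_cmd() {
    ap_id=$1 level=${2:-default}
    case "$level" in
        init|quick|default|max|mem1|mem2) ;;
        *)
            echo "unknown diagnostic level: $level" >&2
            return 1
            ;;
    esac
    echo "cfgadm -o platform=diag=$level -t $ap_id"
}
```

For example, board_test_cmd N0.SB2 mem1 prints cfgadm -o platform=diag=mem1 -t N0.SB2.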

Installing or Replacing CPU/Memory Boards

Caution - Physical board replacement should only be carried out by qualified service personnel.

To Install a New Board

Caution - For complete information about physically removing and replacing CPU/Memory boards, refer to the Sun Fire E2900 System Service Manual or Sun Fire V1280/Netra 1280 Service Manual, as appropriate. Failure to follow the stated procedures can result in damage to system boards and other components.

Note - When replacing boards, you sometimes need filler panels.

If you are unfamiliar with how to insert a board into the system, read the Sun Fire E2900 System Service Manual or Sun Fire V1280/Netra 1280 Service Manual, as appropriate, before you begin this procedure.

1. Make sure you are properly grounded with a wrist strap.

2. After locating an empty slot, remove the system board filler panel from the slot.

3. Insert the board into the slot within one minute to prevent the system overheating.

Refer to the Sun Fire E2900 System Service Manual or Sun Fire V1280/Netra 1280 Service Manual, as appropriate, for complete step-by-step board insertion procedures.

4. Power on, test, and configure the board using the cfgadm -c configure command:

# cfgadm -c configure ap_id

where ap_id is one of the following: N0.SB0, N0.SB2 or N0.SB4.

To Hot-Swap a CPU/Memory Board

Caution - For complete information about physically removing and replacing boards, refer to the Sun Fire E2900 System Service Manual or Sun Fire V1280/Netra 1280 Service Manual, as appropriate. Failure to follow the stated procedures can result in damage to system boards and other components.

1. Make sure you are properly grounded using a wrist strap.

2. Power off the board with cfgadm.

# cfgadm -c disconnect ap_id

where ap_id is one of the following: N0.SB0, N0.SB2 or N0.SB4.

This command removes the resources from the Solaris operating environment and the OpenBoot PROM, and powers off the board.

3. Verify the state of the Power and Hotplug OK LEDs.

The green Power LED flashes briefly while the CPU/Memory board is cooling down. To safely remove the board from the system, the green Power LED must be off and the amber Hotplug OK LED must be on.

4. Complete the hardware removal and installation of the board.

For more information refer to the Sun Fire E2900 System Service Manual or Sun Fire V1280/Netra 1280 Service Manual, as appropriate.

5. After removing and installing the board, bring the board back to the Solaris operating environment with the Solaris dynamic reconfiguration cfgadm command.

# cfgadm -c configure ap_id

where ap_id is one of the following: N0.SB0, N0.SB2 or N0.SB4.

This command powers the board on, tests it, attaches the board, and brings all of its resources back to the Solaris operating environment.

6. Verify that the green Power LED is lit.
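In addition to checking the LED, you can confirm from the command line that the board is back in service. The sketch below reads basic cfgadm output on standard input and succeeds only if the given Ap_Id is connected and configured and its condition is neither failed nor unusable. board_in_service is a hypothetical helper name, and treating the unknown condition as acceptable is an assumption made for illustration.

```shell
# Sketch: succeed if the attachment point is connected and configured
# and its condition is not "failed" or "unusable".
# Reads basic cfgadm output on stdin. Hypothetical helper.
board_in_service() {
    awk -v id="$1" '
        $1 == id && $(NF - 2) == "connected" && $(NF - 1) == "configured" &&
            $NF != "failed" && $NF != "unusable" { ok = 1 }
        END { exit !ok }'
}

# Usage:
#   cfgadm | board_in_service N0.SB2 && echo "N0.SB2 is in service"
```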

To Remove a CPU/Memory Board From the System

Note - Before you begin this procedure, make sure you have ready a system board filler panel to replace the system board you are going to remove. A system board filler panel is a metal board with slots that allow cooling air to circulate.

1. Detach and power off the board from the system by using the cfgadm -c disconnect command.

# cfgadm -c disconnect ap_id

where ap_id is one of the following: N0.SB0, N0.SB2 or N0.SB4.

Caution - For complete information about physically removing and replacing boards, refer to the Sun Fire E2900 System Service Manual or Sun Fire V1280/Netra 1280 Service Manual, as appropriate. Failure to follow the stated procedures can result in damage to system boards and other components.

2. Remove the board from the system.

Refer to the Sun Fire E2900 System Service Manual or Sun Fire V1280/Netra 1280 Service Manual, as appropriate, for complete step-by-step board removal procedures.

3. Insert a system board filler panel into the slot within one minute of removing the board to prevent system overheating.

To Disconnect a CPU/Memory Board Temporarily

You can use DR to power down the board and leave it in place. For example, you might want to do this if the board fails and a replacement board or a system board filler panel is not available.

Detach and power off the board using the cfgadm -c disconnect command.

Cannot Unconfigure a CPU Before All Memory is Unconfigured

All memory on a system board must be unconfigured before you try to unconfigure a CPU. If you try to unconfigure a CPU before all memory on the board is unconfigured, the system displays an error message such as:

Unable to Unconfigure Memory on a Board With Permanent Memory

To unconfigure the memory on a board that has permanent memory, move the permanent memory pages to another board that has enough available memory to hold them. Such an additional board must be available before the unconfigure operation begins.

Memory Cannot Be Reconfigured

If the unconfigure operation fails with a message such as the following, the memory on the board could not be unconfigured:

Unable to Unconfigure a CPU

CPU unconfiguration is part of the unconfiguration operation for a CPU/Memory board. If the operation fails to take the CPU offline, the following message is logged to the console:

WARNING: Processor number failed to offline.

This failure occurs if:

The CPU has processes bound to it.

The CPU is the last one in a CPU set.

The CPU is the last online CPU in the system.
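The last of these causes can be checked before you start by counting the online CPUs. The sketch below reads psrinfo output on standard input and assumes that each line has the processor ID in the first field and its status (for example on-line or off-line) in the second; that layout is an assumption to verify on your system, and online_cpu_count is a hypothetical helper name.

```shell
# Sketch: count processors whose status field is "on-line".
# Reads psrinfo output on stdin; the two-column layout
# (id, status, ...) is an ASSUMED format. Hypothetical helper.
online_cpu_count() {
    awk '$2 == "on-line" { n++ } END { print n + 0 }'
}

# Usage:
#   psrinfo | online_cpu_count
```

If the count is 1, unconfiguring the board that holds that CPU cannot take the CPU offline.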

Unable to Disconnect a Board

It is possible to unconfigure a board and then discover that it cannot be disconnected. The cfgadm status display lists the board as not detachable. This problem occurs when the board is supplying an essential hardware service that cannot be relocated to an alternate board.

Configure Operation Failure

CPU/Memory Board Configuration Failure

Cannot Configure Either CPU0 or CPU1 While the Other Is Configured

Before you try to configure either CPU0 or CPU1, make sure that the other CPU is unconfigured. Once both CPU0 and CPU1 are unconfigured, it is then possible to configure both of them.

CPUs on a Board Must Be Configured Before Memory

Before configuring memory, all CPUs on the system board must be configured. If you try to configure memory while one or more CPUs are unconfigured, the system displays an error message such as: