
OLAR_intro(5) OLAR_intro(5)
NAME
OLAR_intro, olar_intro - Introduction to Online Addition and Removal (OLAR)
Management
DESCRIPTION
Introduction to Online Addition and Removal (OLAR) Management
Online addition and removal management is used to expand capacity, upgrade
components, and replace failed components while the operating system
services and applications continue to run. This functionality, sometimes
referred to as "hot-swap", provides the benefits of increased system uptime
and availability during both scheduled and unscheduled maintenance. Start-
ing with Tru64 UNIX Version 5.1A, CPU OLAR is supported. Additional OLAR
capabilities are planned to be added for subsequent releases of the operat-
ing system.
OLAR management is integrated with the SysMan suite of system management
applications, which provides the ability to manage all aspects of the sys-
tem from a centralized location.
You must be a privileged user to perform OLAR management operations.
Alternatively, you can grant access to selected authorized users or groups
using Division of Privileges (DOP), as described below.
Note that only one administrator at a time can initiate OLAR operations;
other administrators are prevented from initiating OLAR operations until
the current operation completes.
CPU OLAR Overview
Tru64 UNIX supports the ability to add, replace, and/or remove individual
CPU modules on supported AlphaServer systems while the operating system and
applications continue to run. Newly inserted CPUs are automatically recog-
nized by the operating system, but will not start scheduling and executing
processes until the CPU module is powered on and placed online through any
of the supported management applications as described below. Conversely,
before a CPU can be physically removed from the system, it must be placed
offline and then powered off. Processes queued for execution on a CPU that
is to be placed offline are simply migrated to run-queues of other running
(online) processors.
By default, CPUs that are placed offline will persist across reboot and
system initialization, until the CPU is explicitly placed online. This
behavior differs from the default behavior of previous versions of Tru64
UNIX, where a CPU that was placed offline would return to service automati-
cally after reboot or system restart. Note that for backward compatibility,
the psradm(8) and offline(8) commands still provide the non-persistent
offline behavior. However, these commands are not recommended for
performing OLAR operations.
On platforms supporting this functionality, any CPU can participate in an
OLAR operation, including the primary CPU and/or I/O interrupt handling
CPUs. These roles will be delegated to other running CPUs in the event that
a currently running primary or I/O interrupt handler needs to be placed
offline or removed.
Why Perform OLAR on CPUs
OLAR of CPUs may be performed for the following reasons:
Computational Capacity Expansion
A system manager wants to provide additional computational capacity to
the system without having to bring the system down. As an example, an
AlphaServer GS320 with available CPU slots can have its CPU capacity
expanded by adding additional CPU modules to the system while the
operating system and applications continue to run.
Maintenance Upgrade
A system manager wants to upgrade specific system components to the
latest model without having to bring the system down. As an example, a
GS160 with earlier model Alpha CPU modules can be upgraded to later
model CPUs with higher clock rates, while the operating system contin-
ues to run.
Failed Component Replacement
A system component is indicating a high incidence of correctable errors
and the system manager wants to perform a proactive replacement of the
failing component before it results in a hard failure. As an example,
the Component Indictment facility (described below) has indicated
excessive correctable errors in a CPU module and has therefore recom-
mended its replacement. Once the CPU module has been placed offline and
powered off, either through the Automatic Deallocation Facility (also
described below) or through manual intervention, the CPU module can be
replaced while the operating system continues to run.
Cautions Before Performing OLAR on CPUs
Before performing an OLAR operation, be aware of the following cautions:
+ When offlining or removing one or more CPUs, processes scheduled to
run on the affected CPUs will be scheduled to execute on other running
CPUs, thus redistributing the processing capacity among the remaining
CPUs. In general, this results in a performance degradation proportional
to the number of CPUs taken out of service and the current system load,
for the duration of the OLAR operation. Multithreaded applications that
are written to take advantage of a known number of CPUs can expect to
encounter significant performance degradation during this period.
+ The OLAR management utilities do not presently operate with processor
sets. Processor sets are groups of processors that are dedicated for
use by selected processes (see processor_sets(4)). If a process has
been specifically bound to run on a processor set (see runon(1),
assign_pid_to_pset(3) ), and an OLAR operation is attempted on the
last running CPU in the processor set, you will not be notified by the
OLAR utilities that you are effectively shutting down the entire pro-
cessor set. Offlining the last CPU in a processor set will cause all
processes bound to that processor set to suspend until the processor
set has at least one running CPU. Therefore, use caution when perform-
ing CPU OLAR operations on systems that have been configured with
processor sets.
+ If a process has been specifically bound to execute on a CPU (see
runon(1), bind_to_cpu(3), and bind_to_cpu_id(3) for more information),
and an OLAR operation is attempted on that CPU, you will be notified
by the OLAR utilities that processes have been bound to the CPU prior
to any operation being performed. You may choose to continue or cancel
the OLAR operation. If you choose to continue, processes bound to the CPU
will suspend execution until the process is unbound or the CPU is placed
back online. Note that offlining a CPU with bound processes may have
detrimental consequences for the application, depending upon the
application's characteristics.
+ If a process has been specifically bound to execute on a Resource
Affinity Domain (RAD) (see runon(1) and rad_bind_pid(3) for more
information), and an OLAR operation is attempted on the last running
CPU in the RAD, you will be notified by the OLAR utilities that
processes have been bound to the RAD and that the last CPU in the RAD
has been requested to be placed offline. If you choose to continue,
processes bound to the RAD will suspend execution until the process is
unbound or at least one CPU in the RAD is placed online. Note that
offlining the last CPU in a RAD with bound processes may have detrimental
consequences for the application, depending upon the application's
characteristics.
+ If you are using program profiling utilities such as dcpi, kprofile,
or uprofile, that are aware of the system's CPU configuration,
unpredictable results may occur when performing OLAR operations. It is
therefore recommended that these profiling utilities be disabled prior
to performing an OLAR operation. Ensure that all the processes includ-
ing any associated daemons that are related to these utilities have
been stopped before performing OLAR operations on system CPUs.
The device drivers used by these profiling utilities are usually con-
figured into the kernel dynamically, so the tools can be disabled
before each OLAR operation with the following commands:
# sysconfig -u pfm
# sysconfig -u pcount
The appropriate driver can be re-enabled with one of the following:
# sysconfig -c pfm
# sysconfig -c pcount
The automatic deallocation of CPUs, enabled through the Automatic
Deallocation Facility, should be disabled whenever the pfm or pcount
device drivers are configured into the kernel, or vice versa. Refer
to the documentation and reference pages for these utilities for addi-
tional information.
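The unload step above can be sketched as a small script. This is an illustrative sketch only: the dry-run wrapper (on by default) prints each sysconfig command instead of executing it, since the commands apply only on a live Tru64 UNIX system.

```shell
#!/bin/sh
# Dry-run sketch of unloading the pfm and pcount profiling drivers
# before an OLAR operation. DRY_RUN defaults to "echo", which prints
# each command instead of running it; set DRY_RUN= (empty) on a live
# Tru64 UNIX system to execute the sysconfig commands for real.
DRY_RUN=${DRY_RUN-echo}

unload_profiling_drivers() {
    for drv in pfm pcount; do
        # "sysconfig -u" unconfigures a dynamically loaded subsystem
        $DRY_RUN sysconfig -u "$drv"
    done
}

unload_profiling_drivers
```

Remember to re-enable the drivers with `sysconfig -c` after the OLAR operation completes.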
General Procedures for Online Addition and Removal of CPUs
Caution
Pay attention to the system safety notes as outlined in the
GS80/160/320 Service Manual.
+ Removing a CPU Module
To perform an online removal of a CPU module, follow these steps using
your preferred management application, described in the section "Tools
for Managing OLAR".
1. Off-line the CPU. The operating system will stop scheduling and
executing tasks on this CPU. Using your preferred OLAR management
application, make note of the quad building block (QBB) number
where this CPU is inserted. This is the "hard" (or physical) QBB
number, and does not change if the system is partitioned.
2. Power the CPU module off. The LED on the CPU module will
illuminate yellow, indicating that the CPU module is un-powered,
and safe to be removed.
3. Physically remove the CPU module. Note that the operating system
automatically recognizes that the CPU module has been physically
removed. There is no need to perform a scan operation to update
the hardware configuration.
+ Adding a CPU module
To perform an online addition of a CPU module, follow these steps
using your preferred management application, described in the section
"Tools for Managing OLAR".
1. Select an available CPU slot in one of the configured quad build-
ing blocks (QBB). If there are available slots in several QBBs,
it is typically best to equally distribute the number of CPUs
among the configured QBBs.
2. Insert the CPU module into the CPU slot. Ensure that you align
the color-coded decal on the CPU module with the color-code decal
on the CPU slot. The LED on the CPU module will illuminate yel-
low, indicating that the CPU module is un-powered. Note that the
CPU will be automatically recognized by the operating system,
even though it is un-powered. There is no need to perform a scan
operation for the operating system to identify the CPU module.
3. Power the CPU module on. The CPU module will undergo a short
self-test (7-10 secs), after which the LED will illuminate green,
indicating the module is powered-on and has passed its self-test.
4. On-line the CPU. Once the CPU is on-line, the operating system
will automatically begin to schedule and execute tasks on this
CPU.
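The removal and addition sequences above can be sketched with the hwmgr commands described in the "hwmgr Command Line Interface" section below. This is an illustrative sketch: HWID 58 is the example value used on this page, and the dry-run wrapper (on by default) prints the commands instead of executing them.

```shell
#!/bin/sh
# Dry-run sketch of the hwmgr command sequences behind the OLAR
# procedures. DRY_RUN defaults to "echo"; set DRY_RUN= (empty) on a
# live Tru64 UNIX system to execute the commands for real.
DRY_RUN=${DRY_RUN-echo}
hwid=${1:-58}     # example HWID from this page

remove_cpu() {    # steps 1-2 of "Removing a CPU Module"
    $DRY_RUN hwmgr -offline -id "$1"    # stop scheduling on the CPU
    $DRY_RUN hwmgr -poweroff -id "$1"   # LED turns yellow: safe to pull
}

add_cpu() {       # steps 3-4 of "Adding a CPU module"
    $DRY_RUN hwmgr -poweron -id "$1"    # self-test runs, LED turns green
    $DRY_RUN hwmgr -online -id "$1"     # scheduling resumes on the CPU
}

remove_cpu "$hwid"
add_cpu "$hwid"
```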
Tools for Managing OLAR
When it is necessary to perform an OLAR operation, use the following tools
which are provided as part of the SysMan suite of system management
utilities.
Manage CPUs
"Manage CPUs" is a task-oriented application that provides the following
functions:
+ Change the state of a CPU to online or offline
+ Power on or power off a CPU
+ Determine the status of each inserted CPU
The "Manage CPUs" application can be run equivalently from an X Windows
display, a terminal with curses capability, or locally on a PC (as
described below), thus providing a great deal of flexibility when perform-
ing OLAR operations.
Note
You must be a privileged user to run the "Manage CPUs" application.
Non-root users may also run the "Manage CPUs" application if they are
assigned the "HardwareManagement" privilege. To assign a user the
"HardwareManagement" privilege, issue the following command to launch
the "Configure DOP" application:
# sysman dopconfig [-display <hostname>]
Please refer to the dop(8) reference page and the on-line help in the
'dopconfig' application for further information. Additionally, the
Manage CPUs application provides online help capabilities that
describe the operation of this application.
The "Manage CPUs" application can be invoked using one of the following
methods:
+ SysMan Menu
1. At the command prompt in a terminal window, enter the following
command:
[Note that the "DISPLAY" shell environment variable must be set,
or the "-display" command line option must be used, in order to
launch the X Windows version of SysMan Menu. If there is no
indication of which graphics display to use, or if invoking from
a character cell terminal, then the curses version of SysMan Menu
will be launched.]
# sysman [-display <hostname>]
2. Highlight the "Hardware" entry and press "Select"
3. Highlight the "Manage CPUs" entry and press "Select"
+ SysMan command line accelerator
To launch the Manage CPUs application directly via the command prompt
in a terminal window, enter the following command:
# sysman hw_manage_cpus [-display <hostname>]
[Note that the "DISPLAY" shell environment variable must be set, or
the "-display" command line option must be used, in order to launch
the X Windows version of Manage CPUs. If there is no indication of
which graphics display to use, or if invoking from a character cell
terminal, then the curses version of Manage CPUs will be launched.]
+ System Management Station
To launch the Manage CPUs application from the System Management Sta-
tion, do the following:
1. At the command prompt in a terminal window from a system that
supports graphical display, enter the following command:
# sysman -station [-display <hostname>]
When the System Management Station launches, two separate windows
will appear. One window is the Status Monitor view, and the other
window is the Hardware view, providing a graphical depiction of
the hardware connected to your system.
2. Select the Hardware view window.
3. Select the CPU for an OLAR operation by left-clicking once with
the mouse.
4. Select Tools from the menu bar, or right-click once with the
mouse. A list of menu options will appear.
5. Select Daily Administration from the list.
6. Select the Manage CPUs application.
+ Manage CPUs from a PC or Web Browser
You can also perform OLAR management from your PC desktop or from
within a web browser. Specifically, you can run Manage CPUs via the
System Management Station client installed on your desktop, or by
launching the System Management Station client from within a browser
pointed to the Tru64 UNIX System Management home page. For a detailed
description of options and requirements, visit the Tru64 UNIX System
Management home page, available from any Tru64 UNIX system running
V5.1A (or higher), at the following URL:
http://hostname:2301/SysMan_Home_Page
where "hostname" is the name of a Tru64 UNIX Version 5.1A (or higher)
system.
hwmgr Command Line Interface (CLI)
In addition to its set of generic hardware management capabilities, the
hwmgr(8) command line interface incorporates the same level of OLAR manage-
ment functionality as the Manage CPUs application. You must be root to run
the hwmgr command; this command does not currently operate with DOP.
The following describes the OLAR specific commands supported by hwmgr. To
obtain general help on the use of hwmgr, issue the command:
# hwmgr -help
To obtain help on a specific option, issue the command:
# hwmgr -help "option"
where option is the name of the option you want help on.
1. To obtain the status and state information of all hardware components
the operating system is aware of, issue the following command:
# hwmgr -status comp
                 STATUS    ACCESS   HEALTH     INDICT
 HWID: HOSTNAME  SUMMARY   STATE    STATE      LEVEL  NAME
 -------------------------------------------------------------
    3: wild-one            online   available         dmapi
   49: wild-one            online   available         dsk2
   50: wild-one            online   available         dsk3
   51: wild-one            online   available         dsk4
   52: wild-one            online   available         dsk5
   56: wild-one            online   available         Compaq AlphaServer GS160 6/731
   57: wild-one            online   available         CPU0
   58: wild-one            online   available         CPU2
   59: wild-one            online   available         CPU4
   60: wild-one            online   available         CPU6
or, to obtain status on an individual component, use the hardware id
(HWID) of the component and issue the command:
# hwmgr -status comp -id 58
                 STATUS    ACCESS   HEALTH     INDICT
 HWID: HOSTNAME  SUMMARY   STATE    STATE      LEVEL  NAME
 -------------------------------------------------------------
   58: wild-one            online   available         CPU2
To see the complete list of options for "-status", issue the command:
# hwmgr -help status
2. To view a hierarchical listing of all hardware components the
operating system is aware of, issue the command:
# hwmgr -view hier
 HWID: hardware hierarchy   (!)warning (X)critical (-)inactive (see -status)
 -------------------------------------------------------------------------
    1: platform Compaq AlphaServer GS160 6/731
    9:   bus wfqbb0
   10:     connection wfqbb0slot0
   11:       bus wfiop0
   12:         connection wfiop0slot0
   13:           bus pci0
   14:             connection pci0slot1
          o o o
   57:     cpu qbb-0 CPU0
   58:     cpu qbb-0 CPU2
This example shows that CPU0 and CPU2 are children of bus name
"wfqbb0", and that their physical location is (hard) qbb-0. Note that
hard QBB numbers do not change as the system partitioning changes.
To quickly identify which QBB a CPU is associated with, issue the com-
mand:
# hwmgr -view hier -id 58
 HWID: hardware hierarchy
 -----------------------------------------------------
   58: cpu qbb-0 CPU2
3. To offline a CPU that is currently in the online state, issue the
command:
# hwmgr -offline -id 58
or
# hwmgr -offline -name CPU2
Note that device names are case sensitive. In this example, CPU2 must
be upper case. To verify the new status of CPU2, issue the command:
# hwmgr -status comp -id 58
                 STATUS    ACCESS   HEALTH     INDICT
 HWID: HOSTNAME  SUMMARY   STATE    STATE      LEVEL  NAME
 --------------------------------------------------------------
   58: wild-one  critical  offline  available         CPU2
Note that the offline state will be saved across future reboots of the
operating system, including power cycling the system. If you want the
component to return to the online state the next time the operating
system is booted, use the "-nosave" switch.
# hwmgr -offline -nosave -id 58
or
# hwmgr -offline -nosave -name CPU2
Once again, to verify the status of CPU2, issue the command:
# hwmgr -status comp -id 58
                 STATUS    ACCESS            HEALTH     INDICT
 HWID: HOSTNAME  SUMMARY   STATE             STATE      LEVEL  NAME
 ----------------------------------------------------------------------
   58: wild-one  critical  offline(nosave)   available         CPU2
4. To power off a CPU that is currently in the offline state, issue the
command:
# hwmgr -poweroff -id 58
or
# hwmgr -poweroff -name CPU2
Note that a component must be in the offline state before power can be
removed using hwmgr. Once power has been removed from a component, it
is safe to remove that component from the system.
5. To power on a CPU that is currently powered off, issue the command:
# hwmgr -poweron -id 58
or
# hwmgr -poweron -name CPU2
6. To place a CPU online so that the operating system can start
scheduling processes to run on that CPU, issue the command:
# hwmgr -online -id 58
or
# hwmgr -online -name CPU2
Refer to the hwmgr(8) reference page for additional information on the use
of hwmgr.
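Before powering off a component, you can verify programmatically that it has reached the offline state. The sketch below is illustrative only: the is_offline helper is an invented name, and the sample text is the example output shown earlier on this page; on a live system you would pipe the output of `hwmgr -status comp -id <hwid>` into it instead.

```shell
#!/bin/sh
# Sketch: check whether a component shows as offline in captured
# "hwmgr -status comp" output before powering it off.

is_offline() {    # $1 = HWID; reads hwmgr status output on stdin
    awk -v id="$1:" '$1 == id && /offline/ { found = 1 } END { exit !found }'
}

# Example output from this page, captured as a shell string.
sample='       STATUS    ACCESS   HEALTH     INDICT
HWID:  HOSTNAME  SUMMARY   STATE    STATE      LEVEL  NAME
--------------------------------------------------------------
  58:  wild-one  critical  offline  available         CPU2'

if echo "$sample" | is_offline 58; then
    echo "HWID 58 is offline: safe to power off"
fi
```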
Component Indictment Overview
Component indictment is a proactive notification from a fault analysis
utility, indicating that a component is experiencing high incidence of
correctable errors, and therefore should be serviced and/or replaced. Com-
ponent indictment involves the process of analyzing specific failure pat-
terns from error log entries, either immediately or over a given time
interval, and recommending a component's removal. The fault analysis util-
ity signals the running operating system that a given component is suspect,
causing the operating system to distribute this information via an EVM
indictment event such that interested applications, including the System
Management Station, Insight Manager, and the Automatic Deallocation Facil-
ity can update their state information, as well as take appropriate action
if so configured (see the discussion on Automatic Deallocation Facility
below).
It is possible for more than one component to be indicted simultaneously if
the exact source of error cannot be pinpointed. In these cases, the most
likely suspect will be indicted with a `high` probability. The next likely
suspect will be indicted with a `medium` probability, and the least likely
suspect will be indicted with a `low` probability. When this situation
arises, the indictment events can be tied together by examining the
"report_handle" variable within the indictment events. Indictment events
for the same error will contain the same "report_handle" value.
The indicted state of a component will persist across reboot and system
initialization if no action is taken to remedy the suspect component, such
as an online repair operation. Once an indictment has occurred for a given
component, another indictment event will not be generated for that com-
ponent unless the utility has determined, through additional analysis, that
the original indictment probability should be updated. In this case, the
component will be re-indicted with the new probability. Once the indicted
component has been serviced, it is necessary to manually clear the indicted
component state with the following hwmgr command:
# hwmgr -unindict -id <hwid>
where <hwid> is the hardware id (HWID) of the component.
Allowing the operator to manually clear the indicted state ensures
positive identification of when a replaced component is operating properly.
All component indictment EVM events have an event prefix of
sys.unix.hw.state_change.indicted. You may view the complete list of all
possible component indictment events that may be posted, including a
description of each event, by issuing the command:
# evmwatch -i -f '[name sys.unix.hw.state_change.indicted]' | evmshow -t "@name" -x | more
You may view the list of indictment events that have occurred by issuing
the command:
# evmget -f '[name sys.unix.hw.state_change.indicted]' | evmshow -t "@name"
CPU modules and memory pages are currently supported for component indict-
ment.
Compaq Analyze, included as part of the Web-Based Enterprise Services
(WEBES) 4.0 product (or higher), is the fault analysis utility that sup-
ports component indictment on a Tru64 UNIX (V5.1A or higher) system. The
WEBES product is included as part of the Tru64 UNIX operating system dis-
tribution, and must be installed after installation of the base operating
system. Please refer to the Compaq Analyze documentation, distributed with
the WEBES product, for a list of AlphaServer platforms that support the
component indictment feature.
Automatic Deallocation Facility Overview
The Automatic Deallocation Facility provides the ability to automatically
take an indicted component out of service, allowing the system to heal
itself and improving system reliability and availability. The Automatic
Deallocation Facility currently supports taking indicted CPUs and memory
pages out of service.
The ability to tailor the behavior of the automatic deallocation facility
can be user-controlled on both single and clustered systems, through the
use of the text-based OLAR Policy Configuration files. When operating in a
clustered environment, automatic deallocation policy applies to all members
in a cluster by default. This is specified through the cluster-wide file
/etc/olar.config.common. However, individual cluster-wide policy variables
can be overridden using the member-specific configuration file
/etc/olar.config.
The OLAR Policy Configuration files contain configuration variables that
control specific behaviors of the Automatic Deallocation Facility.
Behaviors such as whether or not to enable automatic deallocation, and what
times of the day automatic deallocation should be enabled can be defined.
Additionally, you can specify a user-supplied script or executable that
acts as the gating factor in deciding whether an automatic deallocation
operation should occur.
Automatic deallocation is supported for those platforms that support the
component indictment feature, as described in the Component Indictment
Overview section above.
Refer to the olar.config(4) reference page for additional information about
the OLAR Policy Configuration files.
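As an illustration only, an OLAR policy configuration file might look like the fragment below. The variable names here are hypothetical, invented for this sketch; consult olar.config(4) for the actual variables and syntax.

```
# Hypothetical sketch of /etc/olar.config.common; variable names are
# invented for illustration, not taken from olar.config(4).

# Whether indicted CPUs may be deallocated automatically.
AUTO_DEALLOC_CPU=yes

# Window of the day during which automatic deallocation may run.
AUTO_DEALLOC_START=01:00
AUTO_DEALLOC_END=05:00

# Optional user-supplied gating script; a nonzero exit status
# vetoes the deallocation operation.
AUTO_DEALLOC_GATING_SCRIPT=/usr/local/sbin/olar_gate
```

On a cluster, a member could override individual variables in its member-specific /etc/olar.config file.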
SEE ALSO
Commands: sysman(8), sysman_menu(8), sysman_station(8), hwmgr(8), codconfig(8), dop(8)
Files: olar.config.common(4)
Guides: System Administration, Configuring and Managing Systems for Increased Availability