Administering Planned Outages

Maintenance on the Primary Node

Maintenance, such as hardware repair or an operating system upgrade, requires a planned outage so that the primary role can be moved to the secondary node. Plan it for a part of the business cycle that is less busy and give advance notification to users. To administer a planned outage on the primary node, perform the following steps:

From the PFSCTL command line, enter the move_primary command to move the primary role to the secondary instance:

PFSCTL> move_primary

Complete maintenance.

Restore the pack to the secondary role on the idle node.

PFSCTL> restore

Note:

The system is now resilient, but the primary and secondary roles are reversed from the initial states. If you want to restore the nodes to their initial states, then continue with the following step.

Move the primary role to the original primary node and the secondary role to the original secondary node (optional):

PFSCTL> switchover
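Taken together, the steps above form a single PFSCTL session (the final switchover is the optional step):

PFSCTL> move_primary

(complete maintenance on the idle node)

PFSCTL> restore
PFSCTL> switchover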

Maintenance on the Secondary Node

Maintenance on the secondary node does not interrupt operation, but the system is not resilient while the secondary node is down. To administer a planned outage on the secondary node, perform the following steps:

Stop the secondary instance:

PFSCTL> stop_secondary

Complete maintenance.

Restore the pack on the secondary node.

PFSCTL> restore

Recovering from an Unplanned Outage on One Node

When an unplanned outage occurs on the primary node, Oracle Real Application Clusters Guard automatically fails over to the secondary node and notifies the user that a role change has occurred. At this point, Oracle Real Application Clusters Guard is operating in a nonresilient state with the primary role on the former secondary node.

After you have performed root cause analysis and repaired the source of the fault, restore the secondary role on the former primary node by using the restore command:

PFSCTL> restore

The primary and secondary roles have now been reversed. Choose one of the following actions:

Operate with Reversed Primary and Secondary Roles

After restoring both packs, you can continue to operate with primary and secondary roles that are reversed from the initial state. For sites with symmetric configurations, there is no need to return to the original state. Returning to the original roles requires a planned outage and can be avoided. In fact, some users intentionally operate with role reversal on a fixed schedule (such as every three months) in order to test the capabilities of the system.

Return to the Original Primary/Secondary Configuration

Returning to the original primary/secondary configuration requires a planned outage while the primary role is moved. Plan it for a less busy part of your business cycle and give advance notice to users. Execute it as follows:

# pfsctl
PFSCTL> switchover

Choose a Less Critical Application to Restore

If your system includes more than one uniquely identified database on each node, then performance may be degraded after a failover. For example, if you have a two-node cluster in a primary/secondary configuration and you are also running an unrelated database on the secondary node, then the secondary node runs the primary services as well as the unrelated database after failover and may be overloaded. In this situation, you should move the less critical service to the other node when it is restored.

Perform the following steps for each of the services that are moved to the restored node:

Set the ORACLE_SERVICE and DB_NAME environment variables. For example:

$ export ORACLE_SERVICE=SALES
$ export DB_NAME=sales

Restore the instance with secondary role:

# pfsctl
PFSCTL> restore

Move the primary role to the original primary node:

PFSCTL> switchover

Recovering from Unplanned Outages on Both Nodes

Figure 6-1 Failure of Both Instances, Part 1

During normal operation, both Node A and Node B are up and operational. Pack A is running on its home node, Node A, and has the primary role. It contains the primary instance and an IP address. Pack B is running on its home node, Node B, and has the secondary role. It contains the secondary instance and an IP address.

When Node A fails, the primary role fails over to the instance on Node B, and Pack A starts on Node B in foreign mode. This means that only the IP address of Pack A is activated on Node B.

Now both Pack A and Pack B are running on Node B. Pack B contains the primary instance and its IP address. Pack A contains only an IP address. Nothing is running on Node A. The system is not resilient.

If the primary instance fails, then Pack A and Pack B contain only IP addresses.

Figure 6-2 Failure of Both Instances, Part 2

Pack B starts on its foreign node (Node A). Pack A is still running on Node B. Only the IP addresses are up on the nodes. Because there is no instance running, Pack B restarts on its home node and tries to restart the primary instance. If restarting the instance is unsuccessful, Pack B again starts on its foreign node. The outcome of double instance failure is:

Both packs are running on their foreign nodes.

Only the IP addresses are up.

No instances are running.

Diagnose and repair the cause of the failures. To restart the instances, you must perform the following steps:

Halt both of the packs. Enter the following command:

PFSCTL> pfshalt

You should see output similar to the following:

pfshalt command succeeded.

Start both of the packs. Enter the following command:

PFSCTL> pfsboot

You should see output similar to the following:

pfsboot command succeeded.

Administering Failover of the Applications

Oracle Real Application Clusters Guard restores service quickly. The application must restart transactions when it receives an Oracle message that indicates that failure has occurred.

Failing over the application when the primary instance fails is straightforward. The application sessions receive the ORA-1089 and ORA-1034 Oracle errors for new requests and the ORA-1041, ORA-3113, and ORA-3114 Oracle errors for active requests. These errors must be trapped by the application. At reconnection, the application connects transparently to the new primary instance. For example, in the case of a Web server, the server threads are restarted for each connection pool against the new primary instance. The current transactions are then resubmitted by the clients.
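The trap-and-reconnect behavior described above can be sketched generically. This is a minimal illustration, not Oracle code: OracleError and the connect and run_transaction callables are hypothetical placeholders for your application's own database layer; only the error numbers come from the text.

```python
# A minimal sketch of the reconnect-and-resubmit behavior described above.
# OracleError, connect(), and run_transaction() are hypothetical placeholders
# for your application's database layer, not a real driver API.
import time

# Oracle errors named in the text for new and active requests.
FAILOVER_ERRORS = {1089, 1034, 1041, 3113, 3114}

class OracleError(Exception):
    """Stand-in for a driver exception that carries an ORA- error code."""
    def __init__(self, code):
        super().__init__("ORA-%05d" % code)
        self.code = code

def run_with_failover(connect, run_transaction, max_attempts=3, delay=0.0):
    """Run a transaction; on a failover error, reconnect and resubmit it."""
    conn = connect()
    for attempt in range(1, max_attempts + 1):
        try:
            return run_transaction(conn)
        except OracleError as exc:
            # Trap only the failover errors; let everything else propagate.
            if exc.code not in FAILOVER_ERRORS or attempt == max_attempts:
                raise
            time.sleep(delay)   # brief pause while the relocatable IP moves
            conn = connect()    # the reconnect reaches the new primary instance
```

A Web server connection pool applies the same pattern when it restarts its server threads against the new primary instance.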

Failing over the application when the primary node fails is not straightforward because of TCP/IP time-out. TCP/IP time-out is a significant problem for high availability: it occurs when a node fails without closing its sockets, so new requests can still be made to an IP address that is unavailable. For active requests, the delays to the client are the values of TCP_IP_ABORT_CINTERVAL and TCP_IP_ABORT_INTERVAL. For sessions that are waiting for read/write completion, the delay is the value of TCP_KEEPALIVE_INTERVAL. The values of these TCP/IP parameters should be tuned at each site.

Note:

These parameters are specific to your operating system. See your operating system-specific documentation for more information.
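For example, on platforms that provide the ndd utility (such as Sun Solaris or HP-UX, where these parameters appear in lowercase as ndd tunables), you can inspect and adjust the values as root. The parameter name and value shown here are illustrative only; check your platform documentation for the correct names and units:

# ndd -get /dev/tcp tcp_keepalive_interval
# ndd -set /dev/tcp tcp_keepalive_interval 60000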

TCP/IP time-outs are addressed in Oracle Real Application Clusters Guard by using relocatable IP addresses and the call-home feature. Because Oracle Real Application Clusters Guard moves the IP addresses, active requests for an address do not wait to time out. Requests for connection are refused immediately and are routed transparently to the new primary instance. Requests that issue SQL statements receive a broken pipe error (ORA-3113), allowing the application to restart. The application should detect this error and take appropriate action.

Enhancing Application Failover with Role Change Notification

The role change notification in Oracle Real Application Clusters Guard can enhance application failover. The feature allows you to implement actions such as running or halting applications when the notification of a role change (UP, PLANNED_UP, PLANNED_DOWN, DOWN, CLEANUP) is received. For example, when the instance starts, the notification can be used to start the applications. When the instance terminates, the notification can be used to halt the applications. It is also possible to halt the application when a role starts. This allows secondary applications to halt when the primary role fails over, for example.

Automatic role change notification behaves as follows:

An UP notification occurs

After the instance (primary or secondary) starts

After an instance role changes from secondary to primary

A DOWN notification occurs before the instance (primary or secondary) is shut down

A CLEANUP notification occurs after the instance (primary or secondary) is shut down

Manual role notification occurs only when PFSCTL commands are executed, for example, during planned outages. Manual role notification behaves as follows:

A PLANNED_UP notification occurs before the instance (primary or secondary) starts

A PLANNED_DOWN notification occurs before the instance (primary or secondary) is shut down
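For example, a site might wire each notification to a shell script that runs or halts the applications. The following sketch is purely hypothetical: the way the event name reaches the script is installation-specific (here it is assumed, for illustration only, to arrive as the script's first argument), and start_apps and halt_apps stand in for your own application control commands:

#!/bin/sh
case "$1" in
    UP|PLANNED_UP)              start_apps ;;
    DOWN|PLANNED_DOWN|CLEANUP)  halt_apps ;;
esac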

Changing the Configuration

Most configuration changes can be made to an Oracle Real Application Clusters Guard environment by switching over to the secondary instance, applying the change, and optionally switching back. This section describes the types of configuration changes and how to make them.

There are several ways to change Oracle Real Application Clusters Guard configuration parameters, depending on what kind of parameter needs to be changed. For example, changing $ORACLE_HOME requires the packs to be re-created, while changing the port numbers requires that the packs, the database, and the listener be halted.

Changing the Configuration of Both Instances of Oracle9i Real Application Clusters

To change initialization parameters for both instances, perform the following steps:

Note:

This applies only to initialization parameters that are not included in the mandatory parameters listed in the $ORACLE_SERVICE_config.pfs, $ORACLE_SERVICE_config.Host.ded.pfs, and init_$ORACLE_SID_Host.ora files. Changing the INSTANCE_NAMES parameter, for example, requires the catpfs.sql script to be rerun.

Modify the desired parameters for both instances.

Stop the secondary instance.

PFSCTL> stop_secondary

Restart the secondary instance.

PFSCTL> restore

Move the primary role to the secondary instance.

PFSCTL> move_primary

Restore the secondary instance on the former primary node.

PFSCTL> restore

Reverse the roles to their original locations, if desired. (Use the switchover command.)
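After the parameters have been modified, the whole procedure is a single PFSCTL session (the final switchover is the optional step):

PFSCTL> stop_secondary
PFSCTL> restore
PFSCTL> move_primary
PFSCTL> restore
PFSCTL> switchover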

Figure 6-3 Using the pfsboot Command During Normal Operation

Before the command is entered, no packs are running. When the pfsboot command is entered, Oracle Real Application Clusters Guard first starts Pack A on Node A, which becomes the primary node. Then Oracle Real Application Clusters Guard starts Pack B on Node B, which becomes the secondary node.

Figure 6-4 shows what happens when PFS_KEEP_PRIMARY is set to $PFS_TRUE and the second pack does not start.

Figure 6-4 Using the pfsboot Command When PFS_KEEP_PRIMARY=$PFS_TRUE and the Secondary Pack Does Not Start

When the pfsboot command is entered, Oracle Real Application Clusters Guard starts Pack A on Node A, which becomes the primary node. However, when Oracle Real Application Clusters Guard tries to start Pack B on Node B, it fails for some reason. If PFS_KEEP_PRIMARY is set to $PFS_TRUE, then Pack A remains up. The system runs without resilience while you diagnose the cause of the failure on Node B.

Figure 6-5 shows what happens when PFS_KEEP_PRIMARY is set to $PFS_FALSE and the second pack does not start.

Figure 6-5 Using the pfsboot Command When PFS_KEEP_PRIMARY=$PFS_FALSE and the Secondary Pack Does Not Start

When the pfsboot command is entered, Oracle Real Application Clusters Guard starts Pack A on Node A, which becomes the primary node. If Oracle Real Application Clusters Guard fails to start Pack B on Node B and PFS_KEEP_PRIMARY is set to $PFS_FALSE, then Oracle Real Application Clusters Guard shuts down Pack A on Node A. No packs are running.

Making Online Changes to the ORAPING_CONFIG Table

The heartbeat monitor uses a database table, ORAPING_CONFIG, to record the configuration information. The use of a table ensures that both instances of the cluster always use the same value. This table is refreshed on an interval defined by the CONFIG_INTERVAL parameter.

Table 6-1 Parameters in the ORAPING_CONFIG Table

Parameter               Default  Description
                                 Number of times to try to execute the heartbeat monitor cycle before declaring failure
SPECIAL_WAIT            300      Time in seconds to wait for special events to complete
RECOVERY_RAMPUP_TIME    300      Time in seconds to wait for ramp-up after failover
CYCLE_TIME              120      Time in seconds to execute heartbeat monitor and sleep cycle
CONNECT_TIMEOUT         30       Time in seconds to establish heartbeat monitor connection
CONFIG_INTERVAL         600      Time in seconds to wait before reading the ORAPING_CONFIG table
TRACE_FLAG              0        Flag to enable (1) or disable (0) SQL trace
TRACE_ITERATIONS        1        Number of heartbeat monitor cycles to trace if trace is enabled
LOGON_STORM_THRESHOLD   50       If the number of sessions logging on to the database during the heartbeat monitor cycle exceeds this value, Oracle Real Application Clusters Guard ignores the CONNECT_TIMEOUT parameter

If performance issues arise during initial testing of the system, you can run Oracle Real Application Clusters Guard with the values in the ORAPING_CONFIG table raised to a level that allows problems to persist long enough for detailed analysis. Lower the values again when the system is stable.

Another reason to change the values in the ORAPING_CONFIG table is to customize them for different workloads. False failovers can occur when workloads are so large that timeouts occur simply because the system is busy.

To change the values in the ORAPING_CONFIG table, perform steps similar to the following:

Connect as the $ORACLE_USER and view the default values in the ORAPING_CONFIG table. Enter the following commands:
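For example, from SQL*Plus:

SQL> SELECT * FROM oraping_config;

To change a value, update the row and commit. The statement below assumes the table stores one row per parameter in NAME and VALUE columns; verify the actual column names with the SQL*Plus DESCRIBE command before updating:

SQL> UPDATE oraping_config SET value = 240 WHERE name = 'CYCLE_TIME';
SQL> COMMIT;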

To find the Oracle Real Application Clusters Guard logs, change to the pfsdump directory. Enter a command similar to the following:

$ cd /mnt1/oracle/admin/sales/pfs/pfsdump

List the contents of the directory. You should see output similar to the following:

pfs_sales_host1.debug pfs_sales_host1_ping.log
pfs_sales_host1.log

Allow sufficient space for the log files. If a log file becomes too large, copy it manually to a backup location. After a log file has been archived, Oracle Real Application Clusters Guard automatically opens a new copy of the file the next time it writes to it.
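For example (the backup location is illustrative):

$ cp pfs_sales_host1.log /backup/pfs/pfs_sales_host1.log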

Recovering from a Failover While Datafiles Are in Backup Mode

When datafiles are in backup mode, they appear to instance recovery as if they are past versions. Oracle issues a message at the next startup that says media recovery is required. Media recovery is not required. Solve the problem by taking the following actions:

Stop the packs.

Mount the database.

Take each affected datafile out of backup mode.

Restart the packs.

Note:

RMAN does not encounter this problem. If you use RMAN, this procedure is not necessary.

The steps are shown in more detail as follows:

Halt the packs. Enter the following command:

PFSCTL> pfshalt

Mount one of the instances. Enter commands similar to the following:

$ sqlplus "system/manager as sysdba"
SQL> startup mount;

Identify the datafiles that are in backup mode. Enter commands similar to the following:
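For example, the V$BACKUP view shows which files are in backup mode; STATUS is ACTIVE while a file is in backup mode:

SQL> SELECT d.name, b.status
  2  FROM v$backup b, v$datafile d
  3  WHERE b.file# = d.file#
  4  AND b.status = 'ACTIVE';

You can then take each listed file out of backup mode with ALTER DATABASE DATAFILE '<filename>' END BACKUP (substitute each file name returned by the query) before restarting the packs.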