Table 4-1 describes unscheduled outages that affect the primary or secondary site components.

Table 4-1 Unscheduled Outages

Outage Type

Description

Examples

Site failure

The entire site where the current production database resides is unavailable. This includes all tiers of the application.

Disaster at the production site such as a fire, flood, or earthquake

Malicious attack on the site

Power outages. If there are multiple power grids and backup generators for critical systems, then this should affect only part of the data center.

Clusterwide failure

The whole cluster hosting the RAC database is unavailable or fails. This includes failures of nodes in the cluster, and any other components that result in the cluster being unavailable and the Oracle database and instances on the site being unavailable.

The last surviving node on the RAC cluster fails and cannot be restarted

Both of the redundant cluster interconnects fail or clusterware failure or problem

Database corruption is severe enough to disallow continuity on the current data server

Disk storage fails

Computer failure (node)

A node of the RAC cluster is unavailable or fails

A database tier node fails or has to be shut down because of bad memory or bad CPU

The database tier node is unreachable

Both of the redundant cluster interconnects fail, resulting in another node taking ownership

Computer failure (instance)

A database instance is unavailable or fails

An instance of the RAC database on the data server fails because of a software bug, an operating system problem, or a hardware problem.

Storage failure

Storage holding some or all of the database contents becomes unavailable, because it has shut down or is no longer accessible

Disk drive failure

Disk controller failure

Storage array failure

Data corruption

Parts of the database are unavailable because of media corruption, inaccessibility, or inconsistencies

A datafile is accidentally removed or is unavailable

Media corruption affects blocks of the database

Oracle block corruption is caused by operating system or other node-related problems

Human error

Parts of the database are unavailable, and transactional or logical data inconsistencies arise. Usually caused by an operator or bugs in the application code.

Localized damage (needs surgical repair) Human error results in a table being dropped or in rows being deleted from a table

Widespread damage (needs drastic action to avoid downtime) Application errors result in logical corruption in the database, or operator error results in a batch job being run more times than specified.

4.1.1.1 Managing Unscheduled Outages on the Primary Site

If the primary site contains the production database and the secondary site contains the standby database, then outages on the primary site are the most crucial. Solutions for these outages are critical for maximum availability of the system. Only the Oracle Database 10g with Data Guard, and the Oracle Database 10g with RAC and Data Guard (MAA) architectures have a secondary site to protect from site disasters.

Footnote 6 Recovery times from human errors depend primarily on detection time. If it takes seconds to detect a malicious DML or DLL transaction, then it typically only requires seconds to flashback the appropriate transactions.

4.1.1.2 Managing Unscheduled Outages on the Secondary Site

For most cases, outages on the secondary site can be managed with no effect on availability of the primary database located on the primary site. However, if the configuration is in maximum protection mode, then unscheduled outages on the last surviving standby database will cause outages on the production database to ensure no data loss when failing over to the standby database. After downgrading the data protection mode, you can restart the production database even without accessibility to the standby databases. Outages on the secondary site might affect the maximum time to recovery (MTTR) if there are concurrent failures on the primary site.

If there is only one standby database and if maximum database protection is configured, then the production database will shut down to ensure that there is no data divergence with the standby database.

There is no effect on production availability if the production database Oracle Net descriptor is configured to use connect-time failover to an available standby instance.

4.1.2.1 Managing Scheduled Outages on the Primary Site

If the primary site contains the production database and the secondary site contains the standby database, then outages on the primary site are the most crucial. Solutions for theses outages are critical for continued availability of the system.

4.1.2.2 Managing Scheduled Outages on the Secondary Site

Outages on the secondary site do not affect availability because the clients always access the primary site. Outages on the secondary site might affect the RTO if there are concurrent failures on the primary site. Outages on the secondary site can be managed with no effect on availability. If maximum protection database mode is configured, then downgrade the protection mode before scheduled outages on the standby instance or database so that there will be no downtime on the production database.

Table 4-6 describes the recovery steps for scheduled outages on the secondary site.

Downtime needed for upgrade, but there is no effect on primary node unless the configuration is in maximum protection database mode.

Downtime needed for upgrade, but there is no effect on primary node unless the configuration is in maximum protection database mode.

4.1.2.3 Preparing for Scheduled Outages on the Secondary Site

To achieve continued service during scheduled outages on a secondary site when in maximum protection mode, downgrade the maximum protection mode to maximum availability or maximum performance temporarily. When scheduling secondary site maintenance, consider that the duration of a site-wide or clusterwide outage adds to the time that the standby database lags behind the production database, which in turn lengthens the time to restore fault tolerance. See Section 2.4.2, "Data Protection Mode" for an overview of the Data Guard protection modes.

4.2Recovering from Unscheduled Outages

This section describes best practices for recovering from various types of unscheduled outages.

4.2.1 Complete Site Failover

With complete site failover, the database, the middle-tier application server, and all user connections fail over to a secondary site that is prepared to handle the production load.

4.2.1.1 When to Use Complete Site Failover

If the standby site meets the prerequisites, then complete site failover is recommended for the following scenarios:

Primary-site disaster, such as natural disasters or malicious attacks

Primary network-connectivity failures

Primary site power failures

4.2.1.2 Best Practices for Complete Site Failover

Site failover can be expedited in minutes by using the following practices:

Use Data Guard configuration best practices

Use Data Guard fast-start failover to automatically fail over to the standby database, with a recovery time objective (RTO) of less than 30 seconds

Maintain a running middle-tier application server on the secondary site to avoid the startup time

Automate the DNS failover procedure

Data loss is dependent on the Data Guard configuration and the use of synchronous or asynchronous redo shipping.

4.2.1.3 Repair Solution

A wide-area traffic manager on the primary and secondary sites provides the site failover function. The wide-area traffic manager can redirect traffic automatically if the primary site, or a specific application on the primary site, is not accessible. It can also be triggered manually to switch to the secondary site for switchovers. Traffic is directed to the secondary site only when the primary site cannot provide service due to an outage or after a switchover. If the primary site fails, then user traffic is directed to the secondary site automatically.

Figure 4-1 illustrates the possible network routes before site failover:

Client requests enter the client tier of the primary site and travel by the WAN traffic manager.

Client requests are sent through the firewall into the demilitarized zone (DMZ) to the application server tier.

Requests are forwarded through the active load balancer to the application servers.

Requests are sent through another firewall and into the database server tier.

The application requests, if required, are routed to a RAC instance.

Responses are sent back to the application and clients by a similar path.

Figure 4-2 illustrates the network routes after site failover. Client or application requests enter the secondary site at the client tier and follow exactly the same path on the secondary site that they followed on the primary site.

The following steps describe the effect on network traffic of a failover or switchover:

The administrator has failed over or switched over the production database to the secondary site. This is automatic if you are using Data Guard fast-start failover.

The administrator starts the middle-tier application servers on the secondary site, if they are not already running.

The wide-area traffic manager selection of the secondary site can be automatic in the case of an entire site failure. The wide-area traffic manager at the secondary site returns the virtual IP address of a load balancer at the secondary site and clients are directed automatically on the subsequent reconnect. In this scenario, the site failover is accomplished by an automatic domain name system (DNS) failover.

Alternatively, a DNS administrator can manually change the wide-area traffic manager selection to the secondary site for the entire site or for specific applications. The following is an example of a manual DNS failover:

Change the DNS to point to the secondary site load balancer:

The master (primary) DNS server is updated with the new zone information, and the change is announced with DNS NOTIFY.

The slave DNS servers are notified of the zone update with a DNS NOTIFY announcement, and the slave DNS servers pull the new zone information.

Note:

The master and slave servers are authoritative name servers. Therefore, they contain trusted DNS information.

Clear affected records from caching DNS servers.

A caching DNS server is used primarily for performance and fast response. The caching server obtains information from an authoritative DNS server in response to a host query and then saves (caches) the data locally. On a second or subsequent request for the same data, the caching DNS server responds with its locally stored data (the cache) until the time-to-live (TTL) value of the response expires. At this time, the server refreshes the data from the zone master. If the DNS record is changed on the primary DNS server, then the caching DNS server does not pick up the change for cached records until TTL expires. Flushing the cache forces the caching DNS server to go to an authoritative DNS server again for the updated DNS information.

Flush the cache if the DNS server being used supports such a capability. The following is the flush capability of common DNS BIND versions:

BIND 9.2.0 and 9.2.1: The entire cache can be flushed with the command rndcflush.

BIND 8 and BIND 9 up to 9.1.3: Restarting the named server clears the cache.

Refresh local DNS service caching.

Some operating systems might cache DNS information locally in the local name service cache. If so, this cache must also be cleared so that DNS updates are recognized quickly.

Solaris:nscd

Linux:/etc/init.d/nscd restart

Microsoft Windows:ipconfig /flushdns

Apple Mac OS X:lookupd -flushcache

The secondary site load balancer directs traffic to the secondary site middle-tier application server.

The secondary site is ready to take client requests.

Failover also depends on the client's Web browser. Most browser applications cache the DNS entry for a period of time. Consequently, sessions in progress during an outage might not fail over until the cache timeout expires. To resume service to such clients, close the browser and restart it.

4.2.2 Database Failover with a Standby Database

Failover is the operation of transitioning one of the standby databases to the role of production database. A failover operation can be invoked when an unplanned failure occurs on the production database, and there is no possibility of recovering the production database in a timely manner.

Oracle Data Guard Broker for information about using Enterprise Manager or the Data Guard broker command-line for database Failover

Data Guard failover is a series of steps to transition a standby database into a production database. The standby database essentially assumes the role of production. A Data Guard failover is accompanied by an application failover and, in some cases, preceded by a site failover. After the Data Guard failover, the secondary site contains the production database. The former production database must be reinstated as a new standby database to restore resiliency. The standby database can be quickly re-created by using Flashback Database. See Section 4.3.2, "Restoring a Standby Database After a Failover".

With Data Guard the process of failover can be completely automated using fast-start failover, or the failover operation can be user driven, also referred to as manual failover. Fast-start failover eliminates the uncertainty of a process that requires manual intervention and automatically executes a zero data loss failover within seconds of an outage being detected. A manual failover allows for a failover process where decisions are user driven. Manual failover can be accomplished by using Oracle Enterprise Manager, by issuing commands at the Oracle Data Guard broker command-line interface, or by issuing the SQL statements described in subsequent sections.

4.2.2.2 Best Practices for Implementing Data Guard Failover

A fast-start failover is completely automated and requires no user intervention. A manual failover, being user-driven, can be performed using Enterprise Manager, the Data Guard broker command-line interface, or SQL*Plus commands:

If you determine that a failure occurred on the primary database and there is no possibility of recovering the primary database in a timely manner, you can start the Failover operation. In configurations with both physical and logical standby databases, Oracle recommends using the physical standby database as the failover target because it will allow the logical standby database to continue to function as a logical standby to the new primary database. If the failover is made to the logical standby, any physical standbys in the configuration will need to be re-created from a backup of the new primary database.

The failover operation enables you to choose one of the following two types of failover operations:

Complete

This operation attempts to minimize data loss by applying all available redo on the standby database.

Immediate

No additional data is applied on the standby database; data might be lost. This is the fastest type of failover.

During the failover, the selected standby database (also referred to as the target standby database) transitions into the primary role. If the failover target is a physical standby database, it is restarted. When completed, the Data Guard Overview page reflects the updated configuration, as shown in Figure 4-6.

In the figure, the Data Guard Status column indicates that the original primary database (North_Sales) is disabled and can no longer be managed through Enterprise Manager until it has been re-enabled as a physical standby database.

4.2.2.2.2 Using SQL to Fail Over to a Physical Standby Database

Follow these steps to fail over to a physical standby database:

If the standby database is a Real Application Clusters database, then issue a SHUTDOWNABORT on all additional standby instances.

Initiate the failover by issuing the following SQL command on the target standby database:

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH;

If the network between the primary and standby sites is unavailable, then the standby RFS processes will wait for the network connections to time out through normal TCP timeout processing before shutting down. While the RFS processes are in this TCP timeout processing, the standby database will not be able to fail over unless you include the FORCE keyword on the RECOVERMANAGEDSTANDBYDATABASEFINISH statement.

Convert the physical standby database to the primary role:

ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;

If the standby database was never opened read-only since the last time it was started, then open the new primary database by issuing the following SQL statement:

ALTER DATABASE OPEN;

If the standby database has been opened read-only, then restart the new primary database before starting Redo Apply.

4.2.2.2.3 Using SQL to Fail Over to a Logical Standby Database

Follow these steps to fail over to a logical standby database:

If the standby database is a Real Application Clusters database , then issue a SHUTDOWNABORT on all additional standby instances.

Initiate the failover by issuing the following SQL command on the target standby database:

ALTER DATABASE ACTIVATE LOGICAL STANDBY DATABASE FINISH APPLY;

This statement stops the RFS process, applies any remaining redo data, stops SQL Apply, and activates the logical standby database in the primary role. To avoid waiting for the redo in the standby redo log file to be applied prior to performing the failover, omit the FINISHAPPLY clause on the statement.

Although omitting the FINISHAPPLY clause will accelerate failover, omitting it will also cause the loss of any unapplied redo data in the standby redo log. To gauge the amount of redo that will be lost, query the V$LOGSTDBY_PROGRESS view. The LATEST_SCN column value indicates the last SCN received from the primary database, and the APPLIED_SCN column value indicates the last SCN applied to the standby database. All SCNs between these two values will be lost.

4.2.3 Database Switchover with a Standby Database

A database switchover performed by Oracle Data Guard is a planned transition that includes a series of steps to switch roles between a standby database and a production database. Following a successful switchover operation, the standby database assumes the production role and the production database becomes a standby database. In a RAC environment, a switchover requires that only one instance is active for each database, production and standby. At times the term switchback is also used within the scope of database role management. A switchback operation is a subsequent switchover operation to return the roles to their original state.

Oracle Data Guard Broker for information about using Enterprise Manager or the Data Guard broker command-line for database switchover

4.2.3.1 When to Perform a Data Guard Switchover

Switchover is a planned operation. Switchover is the capability to switch database roles between the production and standby databases. Switchover can occur whenever a production database is started, the target standby database is available, and all the archived redo logs are available. Switchovers are typically completed in less than five minutes and, in some cases, are optimized to be less than a minute. Switchovers are useful in the following situations:

Scheduled maintenance such as hardware maintenance or firmware patches on the production host

Resolution of data failures when the production database is still opened

Testing and validating the secondary resources, as a means to test disaster recovery readiness

Switchover is not possible or practical under the following circumstances:

If the primary and standby databases are RAC, then cleanly shut down all instances except one. To expedite this operation, issue a SHUTDOWNABORT.

Issue the following SQL statement on the primary database to convert it to a standby database:

ALTER DATABASE COMMIT TO SWITCHOVER TO STANDBY WITH SESSION SHUTDOWN;

After the statement in the previous step completes:

Issue the following SQL statement on the old standby database:

ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;

Immediately after issuing the COMMITTOSWITCHOVERTOPRIMARY statement, restart the old primary database as the new standby database and bring it to the mount state.

When the switchover command completes, issue the ALTERDATABASEOPEN statement on the new primary database to bring it to the open state.

Opening the new primary database from the mount state is possible only if the standby database was never opened read-only since the last time the database was started. If the database has been opened read-only, it will need to be restarted.

If the primary and standby databases are RAC, then start all instances.

Restart user sessions and application processing.

4.2.3.2.3 Using SQL for Data Guard Switchover to a Logical Standby Database

When performing a switchover using SQL*Plus commands it is possible for the old standby database that is to become the new primary database to build and transmit the LogMiner dictionary to the current primary database (the new standby database) prior to the actual switchover. This reduces the total time needed to perform the switchover. The following steps describe how to perform this optimized method:

Issue the following SQL statement on the primary database to enable receipt of redo from the current standby database:

ALTER DATABASE PREPARE TO SWITCHOVER TO LOGICAL STANDBY;

On the current logical standby database, build the LogMiner dictionary and transmit this dictionary to the current primary:

If the primary and standby databases are configured in a RAC, then start all instances.

Restart user sessions and application processing.

4.2.4 RAC Recovery for Unscheduled Outages

This solution is leveraged automatically when there is a node or instance failure. Surviving instances will automatically recover the failed instances and potentially aid in the automatic client failover. Recover times can be bounded by adopting the database and RAC configuration best practices and can usually lead to instance recovery times of seconds to minutes in very large busy systems with no data loss.

4.2.4.1 Automatic Instance Recovery for Failed Instances

Instance failure occurs when software or hardware problems disable an instance. After instance failure, Oracle automatically uses the online redo log file to perform database recovery as described in this section.

Instance recovery in RAC does not include restarting the failed instance or the recovery of applications that were running on the failed instance. Applications that were running continue by using service relocation and fast application notification (as described in Section 4.2.4.2, "Automatic Service Relocation").

When one instance performs recovery for another instance, the recovering instance:

Reads redo log entries generated by the failed instance and uses that information to ensure that committed transactions are recorded in the database. Thus, data from committed transactions is not lost

Rolls back uncommitted transactions that were active at the time of the failure and releases resources used by those transactions

When multiple node failures occur, as long as one instance survives, RAC performs instance recovery for any other instances that fail. If all instances of a RAC database fail, then on subsequent restart of any one instance, crash recovery will occur and all committed transactions will be recovered. If Data Guard is available, you can fail over automatically with Data Guard fast-start failover once all instances are down.

4.2.4.2 Automatic Service Relocation

Service reliability is achieved by configuring and failing over among redundant instances. More instances are enabled to provide a service than would otherwise be needed. If a hardware failure occurs and adversely affects a RAC database instance, then CRS automatically moves any services on that instance to another available instance, as configured with DBCA or Enterprise Manager. Then, Cluster Ready Services (CRS) attempt to restart the failed nodes and instances.

CRS recognizes when a failure affects a service and automatically fails over the service and redistributes the clients across the surviving instance supporting the service. In parallel, CRS attempts to restart and integrate the failed instances and dependent resources back into the system. Notification of failures using fast application notification (FAN) events occur at various levels within the Oracle Server architecture. The response can include notifying external parties through Oracle Notification Service (ONS), advanced queueing, or FAN callouts, recording the fault for tracking, event logging, and interrupting applications. Notification occurs from a surviving node when the failed node is out of service. The location and number of nodes serving a service is transparent to applications. Auto restart and recovery are automatic, including all the subsystems, such as the listener and the ASM instance, not just database.

4.2.4.3 Oracle Cluster Registry

Loss of the Oracle Cluster Registry file affects the availability of RAC and Cluster Ready Services. The OCR file can be restored from a physical backup that is automatically created or from an export file that is manually created by using the ocrconfig tool. Additionally, starting with Oracle Database 10g Release 10.2, Oracle can optionally mirror the OCR so that a single OCR device failure can be tolerated.

4.2.5 Application Failover

With proper configuration, applications can be configured to receive fast and efficient notification when application services become unavailable. Once notified, application connections occur transparently to surviving resources such as nodes in a RAC database or a standby database in a remote datacenter than has assumed the primary role following a failover.

In a RAC configuration, services are essential to achieving fast and transparent application failover. If a service becomes unavailable for a particular instance, the service will fail over to an available instance in the cluster, thereby allowing applications to continue processing. Clients are notified of the service relocation through Fast Application Notification (FAN).

Services are also essential for client failover across sites in a Data Guard configuration. After a site failure in a Data Guard configuration, the new production database will also be configured to automatically publish the production service while notifying affected clients that the services are no longer available on the failed production database through FAN events.

FAN notifications and service relocation enable automatic and fast redirection of clients in the event of a failures for both RAC and Data Guard environments.

If RAC and Data Guard are not used, fix the underlying problem and then restart ASM and the database instances

ASM disk failure

One or more ASM disks fail, but all disk groups remain online.

All data remains accessible. This is possible only with normal or high redundancy disk groups.

ASM automatically rebalances to the remaining disk drives and reestablishes redundancy. There must be enough free disk space in the remaining disk drives to restore the redundancy or the rebalance may fail with an ORA-15041

Note: External redundancy disk groups should use mirroring in the storage array to protect from disk failure. Disk failures should not be exposed to ASM.

4.2.6.1 ASM Instance Failure

If the ASM instance fails, database instances accessing ASM storage from the same node will shut down.If the primary database is using RAC, then application failover will occur automatically and clients connected to the database instance will reconnect to remaining instances providing the required service in the cluster and continue processing. The recovery time is typically in seconds.

If the primary database does not use RAC, then an ASM instance failure will shut down the entire database. If Data Guard is being used and Data Guard fast-start failover is configured, a database failover will be triggered automatically and clients will automatically reconnect to the new primary database following the failover. The recovery time is the time it takes to complete an automatic Data Guard fast-start failover operation. If fast-start failover is not configured, then recovering from this outage is a manual process, which can be accomplished by either restarting the ASM and database instances manually, or by performing a Data Guard failover. If neither RAC nor Data Guard is being used, then restart the ASM instance and restart database instances manually. The recovery time depends on the length of time to start the ASM instance, and the length of time to start the database instances and perform crash recovery.

4.2.6.2 ASM Disk Failure

If an ASM disk group is configured as an external redundancy type, then a failure of a single disk should be handled by the storage array and should not be seen by the ASM instance, and all operations of ASM and databases using the disk group will continue normally. However, if the failure of an external redundancy disk group is seen by the ASM instance, then the ASM instance will take the disk group offline immediately, causing Oracle instances accessing the disk group to crash. If the disk failure is temporary, then ASM and the database instances can be restarted, and crash recovery will occur after the disk group is brought back online. If an ASM disk group is configured as a normal or a high-redundancy type, then disk failure is handled transparently by ASM and the databases accessing the disk group are not affected. An ASM instance automatically starts an ASM rebalance operation to distribute the data on one or more failed disks to alternative disks in the ASM disk group. While the rebalance operation is in progress, subsequent disk failures may affect disk group availability if the disk contains data that has yet to be re-mirrored. When the rebalance operation completes successfully, the ASM disk group is no longer at risk in the event of a subsequent failure. Multiple disk failures are handled similarly, provided the failures affect only one failure group in an ASM disk group.The failure of multiple disks in multiple failure groups where a primary extent and all of its mirror extents have been lost will cause the disk group to go offline.

4.2.6.2.1 Using Enterprise Manager to Repair ASM Disk Failure

Figure 4-10 shows Enterprise Manager reporting disk failures. Three alerts appear at 11:19:29. The first alert is an Offline Disk Count. The second and third alerts are Disk Status messages for data area disk DATA.XBBT1D06_DATA and recovery area disk RECO.XBBT1D06_RECO:

Figure 4-11 shows Enterprise Manager reporting the status of data area disk group DATA and recovery area disk group RECO. The red arrows under Member Disks indicate that one disk has failed in each disk group. The numbers under Pending Operations indicate that one operation is pending for each disk group.

Figure 4-12 shows Enterprise Manager reporting a pending REBAL operation on the DATA disk group. The operation is about one-third done, as shown in % Complete, and the Remaining Time is estimated to be 16 minutes.

4.2.6.3 Data Area Disk Group Failure

A data area disk group failure should occur only when there have been multiple failures. For example, if the data-area disk group is defined as external redundancy, a single-disk failure should not be exposed to ASM. However, multiple disk failures in a storage array may be seen by ASM causing the disk group to go offline. Similarly, multiple disk failures in different failure groups in a normal or high-redundancy disk group may cause the disk group to go offline.

When one or more disks fail in a normal or high redundancy disk group, but the ASM disk group is accessible, there is no loss of data and no immediate loss of accessibility. An ASM instance automatically starts an ASM rebalance operation to distribute the data on the one or more failed disks to alternative disks in the ASM disk group. When the rebalance operation completes successfully, the ASM disk group is no longer at risk in the event of a second failure. There must be enough disk space on the remaining disks in the disk group for the rebalance to complete successfully.

The two possible solutions when the data area disk group fails are summarized in Table 4-8.

Table 4-8 Recovery Options for Data Area Disk Group Failure

Recovery Option

Recovery Time Objective (RTO)

Recovery Point Objective (RPO)

Data Guard failover

Five minutes or less

Varies depending on the data protection level chosen

Local recovery

Database restore and recovery time

Zero

If Data Guard is being used and fast-start failover is configured, then an automatic failover will occur when the database shuts down due to the data-area disk group going offline. If fast-start failover is not configured, then perform a manual failover.

If you decide to perform a Data Guard failover, then the recovery time objective (RTO) will be expressed in terms of minutes or perhaps seconds, depending on the presence of the Data Guard observer process and fast-start failover. However, if a manual failover occurs and not all data is on the standby site, then data loss might result.

After Data Guard failover has completed, and the application is available, the data area disk group failure must still be resolved. Continue with the following "Local Recovery Steps" procedure to resolve the ASM disk group failure.

The RTO for local recovery only is based primarily on the time required to:

Repair and replace the failed storage components

Restore and recover the database

Because the loss affects only the data-area disk group, there is no loss of data. All transactions are recorded in the Oracle redo log members that reside in the flash recovery area, so complete media recovery is possible.

If Data Guard is not being used, then perform the following local recovery steps. The time required to perform local recovery depends on how long it takes to restore and recover the database. There is no data loss when performing local recovery.

Local Recovery Steps

Perform these steps after one or more failed disks have been replaced and access to the storage has been restored:

4.2.6.4Flash Recovery Area Disk Group Failure

When the flash recovery-area disk group fails, the database crashes because the control file member usually resides in the flash recovery area and Oracle requires that all control file members are accessible. The flash recovery area can also contain the flashback logs, redo log members and all backups.

A flash recovery area disk group failure should occur only when there have been multiple failures. For example, if the flash recovery-area disk group is defined as external redundancy, a single-disk failure should not be exposed to ASM. However, multiple-disk failures in a storage array may be seen by ASM causing the disk group to go offline. Similarly, multiple-disk failures in different failure groups in a normal or high-redundancy disk group may cause the disk group to go offline.

Table 4-9 summarizes the two possible solutions when the flash recovery-area disk group fails.

Table 4-9 Recovery Options for Flash Recovery Area Disk Group Failure

Recovery Option

Recovery Time Objective (RTO)

Recovery Point Objective (RPO)

Local Recovery

Five minutes or less

Zero

Data Guard Failover or Switchover

Five minutes or less

Zero

Because the loss affects only the flash recovery-area disk group, there is no loss of data. No database media recovery is required, because the data files and the online redo log files are still present and available in the data area. A fast local restart is to startup the primary database after removing the controlfile member located in the failed flash recovery area and point to a new flash recovery area for local archiving (see "Local Restart Steps" discussion later in this section for the step-by-step procedure). However, this is a temporary fix until a new flash recovery area is created to replace the failed storage components. Oracle recommends using the "Local Recovery Steps" discussion later in this section.

If you decide to perform a Data Guard failover, then the RTO will be expressed in terms of minutes or perhaps seconds depending on the presence of the Data Guard observer process and fast-start failover. After Data Guard failover has completed, and the application is available, the flash recovery area disk group failure must still be resolved. Continue with the instructions in the following "Local Recovery Steps" section to resolve the ASM disk group failure.

If the protection level is maximum performance or the standby database is unsynchronized with the primary database, then temporarily start up the primary database by removing the controlfile member and pointing to a temporary flash recovery area (file system) in the SPFILE. Issue a Data Guard switchover to ensure no data loss. After Data Guard switchover has completed, and the application is available, the flash recovery area disk group failure must still be resolved. Shut down the affected database and continue with the instructions in the following "Local Recovery Steps" section to resolve the ASM disk group failure.

The RTO for local recovery only is based primarily on the time to repair and replace the failed storage components and then on the time to restore the control-file copy. Because the loss affects only the flash recovery-area disk group, there is no loss of data. No database media recovery is required, because the data files and the online redo log files are still present and available in the data area. As mentioned previously, you can start up the primary database by removing the controlfile member and pointing to a new flash recovery area. However, this is a temporary fix filled with availability and performance risks unless the flash recovery area is configured properly. Therefore, Oracle recommends the "Local Recovery Steps" that follow.

Local Restart Steps

For a fast local restart, perform the following steps on the primary database:

Change the CONTROL_FILES initialization parameter to refer only to members in the Data Area. For example:

ALTER SYSTEM SET CONTROL_FILES='+DATA/sales/control1.dbf' SCOPE=spfile;

Change local archive destinations and/or the flash recovery area to the local redundant, scalable destination. For example:

ALTER SYSTEM SET DB_RECOVERY_FILE_DEST='+DATA' SCOPE=spfile;

Startup with new settings:

STARTUP MOUNT:

You may need to disable and reenable Flashback Database because the flashback logs were damaged or lost:

If you performed an Oracle Data Guard failover to a new primary database, then you cannot use this procedure to reintroduce the old primary database as a standby database. This is because Flashback Database log files that are required as part of reintroducing the database have been lost. You must perform a full reinstantiation of the standby database.

Replace or get access to new storage to be leveraged as flash recovery area

4.2.7 Recovering from Data Corruption (Data Failures)

Recovering from data corruption is an unscheduled outage scenario. Data corruption is usually—but not always—caused by some activity or failure that occurs outside the database, even though the problem might be evident within the database.

Data corruption in data files has two categories:

Data file block corruption

A corrupt data file block can be accessed, but the contents in the block are invalid or inconsistent. The typical cause of data file corruption is a faulty hardware or software component in the I/O stack, which includes, but is not limited to, the file system, volume manager, device driver, host bus adapter, storage controller, and disk drive.

The database usually remains available when corrupt blocks have been detected, but some corrupt blocks might cause widespread problems, such as corruption in a file header or with a data dictionary object, or corruption in a critical table that renders an application unusable.

A data fault is detected when it is recognized by the user, administrator, RMAN backup, or application because it has affected the availability of the application. For example:

A single corrupt data block in a user table that cannot be read by the application because of a bad spot of the physical disk

A single corrupt data block because of block inconsistencies detected by Oracle. The block will be marked corrupted and any application accessing the block will receive an ORA-1578 error.

A database that automatically shuts down because of the invalid blocks of a data file in the SYSTEM tablespace caused by a failing disk controller

Media failure

This category of data corruption results from a physical hardware problem or user error. The system cannot successfully read or write to a file that is necessary to operate the database.

In all environments, you can resolve a data corruption outage by one of the following methods:

Other outages that result in database objects becoming unavailable or inconsistent are caused by human error, such as dropping a table or erroneously updating table data. Information about recovering from human error can be found in Section 4.2.8, "Recovering from Human Error".

If the data corruption impacts nondata files, then the repair may be slightly different. Table 4-10 provides a matrix of the key non database object corruption and the recommended repair.

Table 4-10 Non Database Object Corruption and Recommended Repair

Object or Component Affected

Impact

Repair

Any control file

Database fails

Data Guard fast-start failover will automatically fail over to the standby database

Redo log member

None

Investigate failure and check system

Drop and re-create redo log member

Active log group that is archived and not needed for crash recovery

Database fails

Restart database after dropping affected redo log group

Active redo log group not archived and not needed for crash recovery

Database fails

Restart database after dropping affected log group

Create a new backup

Refresh the standby database either by applying an incremental backup or re-creating the standby database from the primary or a backup of the primary database

Active or current redo log group that is still needed for crash recovery

Database fails

Use one of the following solutions:

Data Guard failover

Flashback Database—flash the database back to a consistent time and then issue an OPEN RESETLOGS

Archived redo log file

None

Create database backup

Refresh the standby database either by applying an incremental backup or re-creating the standby database from the primary or a backup of the primary database

4.2.7.1 Use Data Guard to Recover From Data Corruption and Data Failure

Failover is the operation of transitioning the standby databases as the new production database. A database switchover is a planned transition in which a standby database and a production database switch roles. Either of these operations can occur in less than 5 minutes and with no data loss.Use Data Guard switchover or failover for data corruption or data failure when:

The database is down or when the database is up but the application is unavailable because of data corruption or failure, and the time to restore and recover locally is long or unknown.

4.2.7.2 Use RMAN Block Media Recovery

Block media recovery (BMR) recovers one block or a set of data blocks marked "media corrupt" in a data file by using the RMAN BLOCKRECOVER command. When a small number of data blocks are marked media corrupt and require media recovery, you can selectively restore and recover damaged blocks rather than whole data files. This results in lower RTO because only blocks that need recovery are restored and only necessary corrupt blocks undergo recovery. Block media recovery minimizes redo application time and avoids I/O overhead during recovery. It also enables affected data files to remain online during recovery of the corrupt blocks. The corrupt blocks, however, remain unavailable until they are completely recovered.

Use block media recovery when:

A small number of blocks require media recovery and the blocks that need recovery are known. If a significant portion of the datafile is corrupt, or if the amount of corruption is unknown, then a different recovery method should be used.

Blocks are marked corrupt (verified with the RMAN BACKUP VALIDATE command) and only when complete recovery is required.

Backup of the data file containing the corrupted blocks is available locally or can be retrieved from a remote location including from a a physical standby database.

4.2.7.3 Use RMAN Data File Media Recovery

Data file media recovery recovers an entire datafile or set of data files for a database by using the RMAN RECOVER command. When a large or unknown number of data blocks are marked media-corrupt and require media recovery, or when an entire file is lost, the affected data files must be restored and recovered.

Use RMAN file media recovery when the following conditions are true:

The number of blocks requiring recovery is large or unknown

Block media recovery is not available (for example, if incomplete recovery is required, or if only incremental backups are available for the data file requiring recovery)

4.2.7.4 Re-Create Objects Manually

Some database objects, such as small look-up tables or indexes, can be recovered quickly by manually re-creating the object instead of doing media recovery.

Use manual object re-creation when:

You must re-create a small index because of media corruption. Creating an index online enables the base object to be used concurrently.

You must re-create a look-up table or when the scripts to re-create the table are readily available. Dropping and re-creating the table might be the fastest option.

4.2.8 Recovering from Human Error

Oracle flashback technology revolutionizes data recovery. In the past it took seconds to damage a database but hours to days to recover it. With flashback technology, the time to correct errors can be as short as the time it took to make the error. Fixing human errors that require rewinding the database, table, transaction, or row level changes to a previous point in time is easy and does not require any database or object restoration. Flashback technology provides fine-grained analysis and repair for localized damage such as erroneous row deletion. Flashback technology also enables correction of more widespread damage such as accidentally running the wrong application batch job. Furthermore, flashback technology is exponentially faster than a database restoration.

Flashback technologies are applicable only to repairing the following human errors:

Flashback Query enables you to view data at a point in time in the past. It can be used to view and reconstruct lost data that was deleted or changed by accident. Developers can use this feature to build self-service error correction into their applications, empowering end users to undo and correct their errors.

Note: Changes are propagated to physical and logical standby databases.

Flashback Version Query

Flashback Version Query uses undo data stored in the database to view the changes to one or more rows along with all the metadata of the changes.

Note: Changes are propagated to physical and logical standby databases.

Flashback Transaction Query

Flashback Transaction Query enables you to examine changes to the database at the transaction level. As a result, you can diagnose problems, perform analysis, and audit transactions.

Note: Changes are propagated to physical and logical standby databases.

Flashback Drop

Flashback Drop provides a way to restore accidentally dropped tables.

Note: Changes are propagated to physical standby databases.

Flashback Table

Flashback Table enables you to quickly recover a table to a point in time in the past without restoring a backup.

Note: Changes are propagated to physical and logical standby databases.

Flashback Database

Flashback Database enables you to quickly return the database to an earlier point in time by undoing all of the changes that have taken place since that time. This operation is fast because you do not have to restore the backups.

Flashback Database uses the Oracle Database flashback logs, while all other features of flashback technology use the Oracle Database unique undo and multiversion read consistency capabilities. See the configuration best practices for the database—as documented in Section 2.2, "Configuring Oracle Database 10g"—for configuring flashback technologies to ensure that the resources from these solutions are available at a time of failure.

4.2.8.1 Resolving Table Inconsistencies

Oracle provides a FLASHBACK DROP statement to recover from an accidental DROP TABLE statement, and a FLASHBACK TABLE statement to restore a table to a previous point in the database.

Flashback Table provides the DBA the ability to recover a table, or a set of tables, to a specified point in time quickly and easily. In many cases, Flashback Table alleviates the more complicated point in time recovery operations. For example:

This statement rewinds any updates to the ORDERS and ORDER_ITEMS tables that have been done between the current time and specified timestamp in the past. Flashback Table performs this operation online and in place, and it maintains referential integrity constraints between the tables.

Dropping or deleting database objects by accident is a common mistake. Users soon realize their mistake, but by then it is too late and there has been no way to easily recover the dropped tables and its indexes, constraints, and triggers. Objects once dropped were dropped forever. Loss of very important tables or other objects (like indexes, partitions or clusters) required DBAs to perform a point-in-time recovery, which can be time-consuming and lead to loss of recent transactions.

Flashback Drop provides a safety net when dropping objects in Oracle Database 10g. When a user drops a table, Oracle places it in a recycle bin. Objects in the recycle bin remain there until the user decides to permanently remove them or until space limitations begin to occur on the tablespace containing the table. The recycle bin is a virtual container where all dropped objects reside. Users view the recycle bin and undrop the dropped table and its dependent objects. For example, the employees table and all its dependent objects would be undropped by the following statement:

FLASHBACK TABLE employees TO BEFORE DROP;

4.2.8.2 Resolving Row and Transaction Inconsistencies

Resolving row and transaction inconsistencies might require a combination of Flashback Query, Flashback Version Query, Flashback Transaction Query, and the compensating SQL statements constructed from undo statements to rectify the problem. This section describes a general approach using a human resources example to resolve row and transaction inconsistencies caused by erroneous or malicious user errors.

Flashback Query, a feature introduced in the Oracle9i Database, enables an administrator or user to query any data at some point in time in the past. This powerful feature can be used to view and reconstruct data that might have been deleted or changed by accident. For example:

SELECT * FROM EMPLOYEES
AS OF TIMESTAMP
TO_DATE('28-Jun-06 14:00','DD-Mon-YY HH24:MI')
WHERE ...

This partial statement displays rows from the EMPLOYEES table starting from 2 p.m. on June 28, 2006. Developers can use this feature to build self-service error correction into their applications, empowering end users to undo and correct their errors without delay, rather than burdening administrators to perform this task. Flashback Query is very simple to manage, because the database automatically keeps the necessary information to reconstruct data for a configurable time into the past.

Flashback Version Query provides a way to view changes made to the database at the row level. It is an extension to SQL and enables the retrieval of all the different versions of a row across a specified time interval. For example:

This statement displays each version of the row, each entry changed by a different transaction, between 2 and 3 p.m. today. A DBA can use this to pinpoint when and how data is changed and trace it back to the user, application, or transaction. This enables the DBA to track down the source of a logical corruption in the database and correct it. It also enables application developers to debug their code.

Flashback Transaction Query provides a way to view changes made to the database at the transaction level. It is an extension to SQL that enables you to see all changes made by a transaction. For example:

This statement shows all of the changes that resulted from this transaction. In addition, compensating SQL statements are returned and can be used to undo changes made to all rows by this transaction. Using a precision tool like this, the DBA and application developer can precisely diagnose and correct logical problems in the database or application.

Consider a human resources (HR) example involving the SCOTT schema. The HR manager reports to the DBA that there is a potential discrepancy in Ward's salary. Sometime before 9:00 a.m., Ward's salary was increased to $1875. The HR manager is uncertain how this occurred and wishes to know when the employee's salary was increased. In addition, he instructed his staff to reset the salary to the previous level of $1250. This was completed around 9:15 a.m.

The following steps show how to approach the problem.

Assess the problem.

Fortunately, the HR manager has provided information about the time when the change occurred. You can query the information as it was at 9:00 a.m. with Flashback Query.

To can confirm that you have the correct employee by the fact that Ward's salary was $1875 at 09:00 a.m. Rather than using Ward's name, you can now use the employee number for subsequent investigation.

Query past rows or versions of the data to acquire transaction information.

Although it is possible to restrict the row version information to a specific date or SCN range, you might want to query all the row information that is available for the employee WARD using Flashback Version Query.

You can see that WARD's salary was not the only change that occurred in the transaction. The information that was changed for the other four employees at the same time as WARD can now be passed back to the HR manager for review.

Determine if the corrective statements should be executed.

If the HR manager decides that the corrective changes suggested by the UNDO_SQL column are correct, then the DBA can execute those statements individually.

Query the FLASHBACK_TRANSACTION_QUERY view for additional transaction information. For example, to determine the user that performed the erroneous update, issue the following query:

In this example, the query shows that the user MSMITH was responsible for the erroneous transaction.

4.2.8.3 Resolving Database-Wide Inconsistencies

To bring an Oracle database to a previous point in time, the traditional method is point-in-time recovery. However, point-in-time recovery can take hours or even days, because it requires the whole database to be restored from backup and recovered to the point in time just before the error was introduced into the database. With the size of databases constantly growing, it will take hours or even days just to restore the whole database.

Flashback Database is a new strategy for doing point-in-time recovery. It quickly rewinds an Oracle database to a previous time to correct any problems caused by logical data corruption or user error. Flashback logs are used to capture old versions of changed blocks. One way to think of it is as a continuous backup or storage snapshot. When recovery must be performed the flashback logs are quickly replayed to restore the database to a point in time before the error and just the changed blocks are restored. It is extremely fast and reduces recovery time from hours to minutes. In addition, it is easy to use. A database can be recovered to 2:05 p.m. by issuing a single statement. Before the database can be recovered, all instances of the database must be shut down and one of the instances subsequently mounted. The following is an example of a FLASHBACK DATABASE statement.

FLASHBACK DATABASE TO TIMESTAMP SYSDATE-1;

No restoration from tape, no lengthy downtime, and no complicated recovery procedures are required to use it. You can also use Flashback Database and then open the database in read-only mode and examine its contents. If you determine that you flashed back too far or not far enough, then you can reissue the FLASHBACK DATABASE statement or continue recovery to a later time to find the proper point in time before the database was damaged. Flashback Database works with a production database, a physical standby database, or a logical standby database.

Flashback Database does not automatically fix a dropped tablespace, but it can be used to dramatically reduce the downtime. You can flash back the production database to a point before the tablespace was dropped and then restore a backup of the corresponding datafiles from the affected tablespace and recover to a time before the tablespace was dropped.

Follow these recommended steps to use Flashback Database to repair a dropped tablespace:

Determine the SCN or time you dropped the tablespace.

Flash back the database to a time before the tablespace was dropped. You can use a statement similar to the following:

FLASHBACK DATABASE TO BEFORE SCN drop_scn;

Restore, rename, and bring datafiles online.

Restore only the datafiles from the affected tablespace from a backup.

4.3Restoring Fault Tolerance

Whenever a component in a high-availability architecture fails, then the full protection—or fault tolerance—of the architecture is compromised and possible single points of failure exist until the component is repaired. Restoring the high-availability architecture to full fault tolerance to reestablish full RAC, Data Guard, or MAA protection requires repairing the failed component. While full fault tolerance might be sacrificed during planned downtime, the method of repair is well understood because it is planned, the risk is controlled, and it ideally occurs at times best suited for continued application availability. However, for unplanned downtime, the risk of exposure to a single point of failure must be clearly understood.

This section provides the following topics that describe the steps needed to restore database fault tolerance:

4.3.1 Restoring Failed Nodes or Instances in a RAC Cluster

Ensuring that application services fail over quickly and automatically in a RAC cluster—or between primary and secondary sites—is important when planning for both scheduled and unscheduled outages. To ensure that the environment is restored to full fault tolerance after any errors or issues are corrected, it is also important to understand the steps and processes for restoring failed instances or nodes within a RAC cluster or databases between sites.

Adding a failed node back into the cluster or restarting a failed RAC instance is easily done after the core problem that caused the specific component to originally fail has been corrected. However, you should also consider:

When to perform these tasks so as to incur minimal or no effect on the current running environment

Resetting network components (such as load balancer) which were modified for failover and now must be reset

Failing back or rebalancing existing connections

After the problem that caused the initial node or instance failure has been corrected, a node or instance can be restarted and added back into the RAC environment at any time. Processing to complete the reconfiguration of a node may require additional system resources.

Table 4-13 summarizes additional processing that may be required when adding a node.

Table 4-13 Additional Processing When Restarting or Rejoining a Node or Instance

Action

Additional Resources

Restarting a node or rejoining a node into a cluster

When using only Oracle Clusterware, there is no impact when a new node joins the cluster.

When using vendor clusterware, there may be performance degradation while reconfiguration occurs to add a node back into the cluster. The impact on current applications should be evaluated with a full test workload.

Restarting or rejoining a RAC instance

When you restart a RAC instance, there might be some potential performance impact while lock reconfiguration takes place. The impact on current applications is usually minimal, but it should be evaluated with a full test workload.

See Also:

Your vendor-specified cluster management documentation for detailed steps on how to start and join a node back into a cluster

4.3.1.1Recovering Service Availability

After a failed node has been brought back into the cluster and its instance has been started, Cluster Ready Services (CRS) automatically manages the virtual IP address used for the node and the services supported by that instance automatically. A particular service might or might not be started for the restored instance. The decision by CRS to start a service on the restored instance depends on how the service is configured and whether the proper number of instances are currently providing access for the service. A service is not relocated back to a preferred instance if the service is still being provided by an available instance to which it was moved by CRS when the initial failure occurred. CRS restarts services on the restored instance if the number of instances that are providing access to a service across the cluster is less than the number of preferred instances defined for the service. After CRS restarts a service on a restored instance, CRS notifies registered applications of the service change.

For example, suppose the HR service is defined with instances A and B as preferred and instances C and D as available in case of a failure. If instance B fails and CRS starts up the HR service on C automatically, then when instance B is restarted, the HR service remains at instance C. CRS does not automatically relocate a service back to a preferred instance.

Suppose a different scenario in which the HR service is defined with instances A, B, C, and D as preferred and no instances defined as available, spreading the service across all nodes in the cluster. If instance B fails, then the HR service remains available on the remaining three nodes. CRS automatically starts the HR service on instance B when it rejoins the cluster because it is running on fewer instances than configured. CRS notifies the applications that the HR service is again available on instance B.

4.3.1.2 Considerations for Client Connections After Restoring a RAC Instance

After a RAC instance has been restored, additional steps might be required, depending on the current resource utilization and performance of the system, the application configuration, and the network load balancing that has been implemented.

Existing connections (which might have failed over or started as a new session) on the surviving RAC instances, are not automatically redistributed or failed back to an instance that has been restarted. Failing back or redistributing users might or might not be necessary, depending on the current resource utilization and the capability of the surviving instances to adequately handle and provide acceptable response times for the workload. If the surviving RAC instances do not have adequate resources to run a full workload or to provide acceptable response times, then it might be necessary to move (disconnect and reconnect) some existing user connections to the restarted instance.

New connections are started as they are needed, on the least-used node, assuming connection load balancing has been configured. Therefore, the new connections are automatically load-balanced over time.

An application service can be:

Partitioned with services running on a subset of RAC instances

Nonpartitioned so that all services run equally across all nodes

This is valuable for modularizing application and database form and function while still maintaining a consolidated data set. For the cases where an application is partitioned or has a combination of partitioning and nonpartitioning, the response time and availability aspects for each service should be considered. If redistribution or failback of connections for a particular service is required, then you can rebalance workloads manually with the DBMS_SERVICE.DISCONNECT_SESSION PL/SQL procedure. You can use this procedure to disconnect sessions associated with a service while the service is running.

For load-balancing application services across multiple RAC instances, Oracle Net connect-time failover and connection load balancing are recommended. This feature does not require changes or modifications for failover or restoration. It is also possible to use hardware-based load balancers. However, there might be limitations in distinguishing separate application services (which is understood by Oracle Net Services) and restoring an instance or a node. For example, when a node or instance is restored and available to start receiving new connections, a manual step might be required to include the restored node or instance in the hardware-based load balancer logic, whereas Oracle Net Services does not require manual reconfiguration.

Table 4-14 summarizes the considerations for new and existing connections after an instance has been restored. The considerations differ depending on whether the application services are partitioned, nonpartitioned, or are a combination of both. The actual redistribution of existing connections might or might not be required depending on the resource utilization and response times.

Table 4-14 Restoration and Connection Failback

Application Services

Failback or Restore Existing Connections

Failback or Restore New Connections

Partitioned

Existing sessions are not automatically relocated back to the restored instance. Use DBMS_SERVICE.DISCONNECT_SESSION to manually disconnect sessions and allow them to be reestablished on one of the remaining instances that provides the service.

Automatically routes to the restored instance by using the Oracle Net Services configuration.

Nonpartitioned

No action is necessary unless the load must be rebalanced, because restoring the instance means that the load there is low. If the load must be rebalanced, then the same problems are encountered as if application services were partitioned.

Automatically routes to the restored instance (because its load should be lowest) by using the Oracle Net Services configuration

Figure 4-13 shows a two-node partitioned RAC database. Each instance services a different portion of the application (HR and Sales). Client processes connect to the appropriate instance based on the service they require.

If one RAC instance fails, then the service and existing client connections can be automatically failed over to another RAC instance. In this example, the HR and Sales services are both supported by the remaining RAC instance. In addition, new client connections for the Sales service can be routed to the instance now supporting this service.

After the failed instance has been repaired and restored to the state shown in Figure 4-13 and the Sales service is relocated to the restored instance failed-over clients and any new clients that had connected to the Sales service on the failed-over instance might have to be identified and failed back. New client connections, which are started after the instance has been restored, should automatically connect back to the original instance. Therefore, over time, as older connections disconnect, and new sessions connect to the Sales service, the client load migrates back to the restored instance. Rebalancing the load immediately after restoration depends on the resource utilization and application response times.

Figure 4-15 shows a nonpartitioned application. Services are evenly distributed across both active instances. Each instance has a mix of client connections for both HR and Sales.

If one RAC instance fails, then CRS moves the services that were running on the failed instance. In addition, new client connections are routed only to the available RAC instances that offer that service.

After the failed instance has been repaired and restored to the state shown in Figure 4-15, some clients might have to be moved back to the restored instance. For nonpartitioned applications, identifying appropriate services is not required for rebalancing the client load among all available instances. Also, this is necessary only if a single instance is not able to adequately service the requests.

New client connections that are started after the instance has been restored should automatically connect back to the restored instance because it has a smaller load. Therefore, over time, as older connections disconnect and new sessions connect to the restored instance, the client load will again evenly balance across all available RAC instances. Rebalancing the load immediately after restoration depends on the resource utilization and application response times.

4.3.2 Restoring a Standby Database After a Failover

Following unplanned downtime on a production database that requires a failover, full fault tolerance is compromised until the standby database is reestablished. Full database protection should be restored as soon as possible. The steps for restoring fault tolerance differ slightly between physical and logical standby databases.

Reinstating databases is automated if you are using Data Guard fast-start failover. After a fast-start failover completes, the observer automatically attempts to reinstate the former primary database as a standby database. Reinstatement restores high availability to the broker configuration so that, in the event of a failure of the new primary database, another fast-start failover can occur. The reinstated database can act as the fast-start failover target for the new primary database, making a subsequent fast-start failover possible. The new standby database is a viable target of a failover when it begins applying redo data received from the new primary database.

The broker can reinstate the former primary database as a standby database without the need to reenable the primary database or to manually perform a Flashback Database operation. To reinstate the former primary database, the database must be started and mounted, but it cannot be opened. The broker reinstates the database as a standby database of the same type (physical or logical) as the former standby database.

If the former primary database cannot be reinstated automatically, you can manually reinstate it using either the DGMGRL REINSTATE command or Enterprise Manager. Step-by-step instructions for manual reinstatement are described in Oracle Data Guard Broker.

Standby databases do not have to be reinstantiated if you use the Oracle Flashback Database feature. Flashback Database has the following advantages:

Saves hours of database restoration time

Reduces overall complexity in restoring fault tolerance

Reduces the time that the system is vulnerable because the standby database is re-created more quickly

4.3.2.1 Restoring a Standby Database After a Fast-Start Failover

Following a fast-start failover, the observer periodically attempts to reconnect to the original primary database. When the observer regains network access to the original primary database, it initiates a request for the Data Guard broker to automatically reinstate it as a standby database to the new primary. This quickly restores disaster protection and high availability for the new primary database.

You can enable fast-start failover from any site, including the observer site, in Enterprise Manager while connected to any database in the broker configuration. The broker simplifies switchovers and failovers by allowing you to invoke them using a single key click in Oracle Enterprise Manager, as shown in Figure 4-16.

Figure 4-16 Fast-Start Failover and the Observer Are Successfully Enabled

4.3.2.2 Reinstating a Standby Database Using Enterprise Manager After a Failover

Furthermore, you can leverage Enterprise Manager to reinstate the old primary as the new standby. Figure 4-17 shows an example of the warning message that shows in Enterprise Manager when a reinstatement is needed.

Figure 4-17 Reinstating the Former Primary Database After a Fast-Start Failover

4.3.3 Restoring ASM Disk Groups after a Failure

4.3.4 Restoring Fault Tolerance After Planned Downtime on Secondary Site or Clusterwide Outage

After performing the planned maintenance on the secondary site, the standby database and log apply services must be restarted, and then Data Guard will automatically catch up. You can leverage Enterprise Manager and Data Guard broker to monitor the Data Guard state.

The following steps are required to restore full fault tolerance after planned downtime on a secondary site or clusterwide outage:

Note:

The following steps can be accomplished manually (as described below) or automatically using Enterprise Manager.

Start the standby database

You might have to restore the standby database from local backups, local tape backups, or from the primary site backups if the data in the secondary site has been damaged. Re-create the standby database from the new production database by following the steps for creating a standby database in Oracle Data Guard Concepts and Administration.

After the standby database has been reestablished, start the standby database.

For a logical standby database, verify that there are no errors from the logical standby process and that the recovery has applied the redo from the archived redo logs:

SELECT THREAD#, SEQUENCE# SEQ#
FROM DBA_LOGSTDBY_LOG LOG, DBA_LOGSTDBY_PROGRESS PROG
WHERE PROG.APPLIED_SCN BETWEEN LOG.FIRST_CHANGE# AND LOG.NEXT_CHANGE#
ORDER BY NEXT_CHANGE#;

Restore production database protection mode

If you had to change the protection mode of the production database from maximum protection to either maximum availability or maximum performance because of the standby database outage, then change the production database protection mode back to maximum protection depending on your business requirements.

4.3.5 Restoring Fault Tolerance After a Standby Database Data Failure

Following unplanned downtime on the standby database that requires a full or partial datafile restoration (such as data or media failure), full fault tolerance is compromised until the standby database is brought back into service. Full database protection should be restored as soon as possible.

To repair data corruption and data failures on a logical standby database, you require a backup of the logical standby file and not a backup from the primary database. Otherwise, you need to reinstantiate or re-create the relevant objects that got affected by the corruption.To repair data corruption or data failures on the standby database, you can leverage the following repair solutions:

If you had to change the protection mode of the production database from maximum protection to either maximum availability or maximum performance because of the standby database outage, then change the production database protection mode back to maximum protection (depending on your business requirements).

4.3.6 Restoring Fault Tolerance After the Production Database Was Opened Resetlogs

If the production database is activated because it was flashed back to correct a logical error or because it was restored and recovered to a point in time, then the corresponding standby database might require additional maintenance. No additional work is required if the production database did complete recovery with no resetlogs.

After opening the production database with the RESETLOGS option, execute the queries shown in Table 4-17.

Table 4-17 Queries to Determine RESETLOGS SCN and Current SCN OPEN RESETLOGS

Database

Query

Production

SELECT TO_CHAR(RESETLOGS_CHANGE# - 2) FROM V$DATABASE;

Physical standby

SELECT TO_CHAR(CURRENT_SCN) FROM V$DATABASE;

Logical standby

SELECT APPLIED_SCN FROM DBA_LOGSTDBY_PROGRESS;

Table 4-18 shows the actions you take to restore fault tolerance if the standby database is behind the primary database's resetlogs SCN.

Table 4-18 SCN on Standby Database is Behind Resetlogs SCN on the Production Database

Database

Action

Physical standby

Ensure that the standby database has received a new archived redo log file from the production database.

Issue the FLASHBACK DATABASE TO SCNflashback_scn statement where flashback_scn is the SCN returned from the production database query in Table 4-17. The SCN returned from the production database query is 2 less than the RESETLOGS_CHANGE#.

Retrieve production database flashback time or SCN. The flashback time or SCN must be extracted from the production database alert log. This assumes that the clocks are the same between the database machines or the flashback time will need to be adjusted.

The last SQL statement queries the APPLIED_SCN column of the DBA_LOGSTDBY_PROGRESS view, the results of this query should confirm that SQL Apply has applied less than or up to the APPLIED_SCN obtained in step 2.If not, you need to flash back the database further.

Open the logical standby database with resetlogs:

SHUTDOWN;
STARTUP MOUNT EXCLUSIVE;
ALTER DATABASE OPEN RESETLOGS;

Archive the current log on the primary database:

ALTER SYSTEM ARCHIVE LOG CURRENT;

Start SQL Apply:

ALTER DATABASE START LOGICAL STANDBY APPLY;

4.3.7 Restoring Fault Tolerance After Dual Failures

If a dual failure affecting both the standby and production databases occurs, then you must re-create the production database first. Because the sites are identical, the production database can be created wherever the most recent backup resides.

Table 4-20 summarizes the recovery strategy depending on the type of backups that are available.

Table 4-20 Re-Creating the Production and Standby Databases

Available Backups

Re-Creating the Production Database

Local backup on production and standby databases

Restore backup from the production database. Recover and activate the database as the new production database.

Local backup only on standby database. Tape backups on standby database.

Restore the local standby backup to the standby database. Recover and activate the database as the new production database.

Tape backups only

Restore tape backups locally. Recover the database and activate it as the new production database.

4.4.1.1 Migrating to ASM Storage

If you have an existing Oracle database that stores database files on a file system or on raw devices, you can migrate some or all of these database files to ASM. The DBMS_FILE_TRANSFER package can be used for this purpose.

4.4.1.2 Adding and Removing Storage

Disks can be added to and removed from ASM with no downtime. When disks are added or removed, ASM automatically starts a rebalance operation to evenly spread the disk group contents over all drives in the disk group.The best practices for adding or removing storage include:

Investigate methods of adding storage to, and removing storage from, the host operating system with no downtime.

Use a single ALTER DISKGROUP command when adding or removing multiple disk drives.

For example, if the storage maintenance is to add new drives and remove existing drives, use a single ALTER DISKGROUP command with the ADD DISK clause to add the new drives, and the DROP DISK clause to remove the existing drives. For example:

When dropping disks from a disk group, specify the WAIT option in the REBALANCE clause so the ALTER DISKGROUP statement does not return until the contents of the drives being dropped have been moved to other drives. Once the statement completes, the drives can be safely removed from the system. For example:

4.4.2RAC Database Patches

With RAC, you can apply certain database patches to one node or instance at a time, which enables continual application and database availability. "One-off" patches or interim patches to database software are usually applied to implement known fixes for software problems an installation has encountered or to apply diagnostic patches to gather information regarding a problem. Such patch application is often carried out during a scheduled maintenance outage.

Oracle now provides the capability to do rolling patch upgrades with Real Application Clusters with little or no database downtime. The tool used to achieve this is the opatch command-line utility.

The advantage of a RAC rolling upgrade is that it enables at least some instances of the RAC installation to be available during the scheduled outage required for patch upgrades. Only the RAC instance that is currently being patched must be brought down. The other instances can continue to remain available. This means that the effect on the application downtime required for such scheduled outages is further minimized. Oracle's opatch utility enables the user to apply the patch successively to the different instances of the RAC installation.

Rolling upgrade is available only for patches that have been certified by Oracle to be eligible for rolling upgrades. Typically, patches that can be installed in a rolling upgrade include:

Patches that do not affect the contents of the database such as the data dictionary

Patches not related to RAC internode communication

Patches related to client-side tools such as SQL*PLUS, Oracle utilities, development libraries, and Oracle Net

Patches that do not change shared database resources such as datafile headers, control files, and common header definitions of kernel modules

Rolling upgrade of patches is currently available for one-off patches only. It is not available for patch sets.

Rolling patch upgrades are not available for deployments where the Oracle Database software is shared across the different nodes. This is the case where the Oracle home is on Cluster File System (CFS) or on shared volumes provided by file servers or NFS-mounted drives. The feature is only available where each node has its own copy of the Oracle Database software.

4.4.2.1 Best Practices To Minimize Downtime

Use the following recommended practices for all database patch upgrades:

Always confirm with Oracle Support Services that the patch is valid for your problem and for your deployment environment.

Have a plan for applying the patch as well as a plan for backing out the patch.

Apply the patch to your test environment first and verify that it fixes the problem.

When you plan the elapsed time for applying the patch, include time for starting up and shutting down the other tiers of your technology stack if necessary.

If the patch is not a candidate for RAC rolling upgrade and you can incur the downtime for applying the patch, go to Section 4.4.3, "Database Upgrades" to assess whether or not other solutions are feasible.

The following are additional recommended practices for RAC rolling upgrades.

If multiple instances share an Oracle home, then all of them will be affected by application of a patch. The DBA should verify that this will not cause unintentional side effects. Also, all such instances on a node must be shut down during the patch application. Scheduled outage planning should take this into account. As a best practice, only similar applications should share an Oracle home on a node. This provides greater flexibility for patching.

The Oracle inventory on each node is a repository of the Oracle Database software installed on the node. The inventory is node-specific. It is shared by all Oracle software installed on the node. It is similar across nodes only if all nodes are exactly the same in terms of the Oracle Database software deployed, the deployment configuration, and patch levels. Because the Oracle inventory greatly aids the patch application and patch management process, it is recommended that its integrity be maintained. Oracle inventory should be backed up after each patch installation to any Oracle software on a specific node. This applies to the Oracle inventory on each node of the cluster.

Use the Oracle Universal Installer to install all Oracle database software. This creates the relevant repository entries in the Oracle inventory on each node of the cluster. Also, use the Oracle Universal Installer to add nodes to an existing RAC cluster.

However, if this was not done or is not feasible for some reason, adding information about an existing Oracle database software installation to the Oracle inventory can be done with the attach option of the opatch utility. Node information can be also added with this option.

The nature of the rolling patch upgrade enables it to be applied to only some nodes of the RAC cluster. So an instance can be operating with the patch applied, while another instance is operating without the patch. This is not possible for nonrolling patch upgrades. Apply nonrolling patch upgrades to all instances before the RAC deployment is activated. A mixed environment is useful if a patch must be tested before deploying it to all the instances. Applying the patch with the -local option is the recommended way to do this.

In the interest of keeping all instances of the RAC cluster at the same patch level, it is strongly recommended that after a patch has been validated, it should be applied to all nodes of the RAC installation. When instances of a RAC cluster have similar patch software, services can be migrated among instances without running into the problem a patch might have fixed.

All patches (including those applied by rolling upgrades) should be maintained online and not removed once they have been applied. This is useful if a patch must be rolled back or applied again.

The patches should be stored in a location that is accessible by all nodes of the cluster. Thus all nodes of the cluster are equivalent in their capability to apply or roll back a patch.

Rolling patch upgrades, just like any other patch upgrade, should be done when no other patch upgrade or Oracle installation is being done on the node. Application of multiple patches is a sequential process. The scheduled outage should be planned accordingly.

If multiple patches have to be applied and they must be applied at the same time, and if only some of these patches are eligible for rolling upgrade, then apply all of them in a nonrolling manner. This reduces the overall time required to get through the patching process.

For patches that are not eligible for rolling upgrade, the next best option for RAC deployments is the minimize_downtime option of the apply command.

Perform the rolling upgrade when system usage is low. This ensures minimal disruption of service for the end user.

4.4.3.2 Data Guard SQL Apply (Logical Standby)

Data Guard SQL Apply can be used to upgrade a database with minimal downtime by means of a process called rolling upgrade. Data Guard currently supports homogeneous environments where the primary and standby databases are running on the same platform.

Note the following points when deciding if Data Guard SQL Apply is the appropriate method for minimizing downtime during a database upgrade:

Support for rolling upgrade starts with Oracle Database 10g release 1 (10.1.0.3). The supported versions, for both the source database and the target database, are more restrictive than Oracle Streams.

Downtime required for a database upgrade (rolling upgrade) when using Data Guard SQL Apply is determined by the time needed to:

Perform a Data Guard switchover

Reconnect the clients to the new database

Use Data Guard SQL Apply for rolling database upgrade when DBUA will not complete the upgrade within the maintenance window and the application does not use user-defined types.

4.4.3.3 Oracle Streams

Oracle Streams can be used to upgrade the database software from one version to another with minimal downtime. This is because Oracle Streams supports a configuration in which the primary database and its replica are running on different database versions.

Note the following points when deciding if Oracle Streams is an appropriate method for a database upgrade:

Oracle Streams does not support user-defined types, such as object types, REF values, varrays, and nested tables. However, shadow tables can be created on the primary database that do not have the unsupported data types and the shadow tables can be replicated.

The source database must be running Oracle9i release 2 or higher.

The administrative effort required to set up and maintain the Oracle Streams environment is more than if using Data Guard SQL Apply for a database upgrade.

There might be a performance impact on the source database while the source and target databases run in parallel as changes are propagated to the target database.

Downtime required for a database upgrade when using Oracle Streams is determined by the time needed to apply the remaining transactions in the queue and to reconnect the clients to the new database.

Consider using Oracle Streams if the application already uses Streams or when clients do not use user-defined types and the extra administrative effort is worth the opportunity for a very small outage time.

4.4.3.4 Transportable Tablespaces

Transportable tablespaces can be used to accomplish a database upgrade by transporting all user datafiles into a pre-created, prepared target database.

Note the following points when deciding if transportable tablespaces is the appropriate method for performing a database upgrade:

The SYSTEM tablespace cannot be moved with transportable tablespaces. The target database SYSTEM tablespace contents, including user definitions and objects necessary for the application, must be built manually. Use Data Pump to move the contents of the SYSTEM tablespace.

Downtime required for a database upgrade when using transportable tablespaces is determined by the time needed to:

Place the source database tablespaces in read-only mode

Perform a network import of the transportable metadata

If the target database is on a remote system, then include the time to transfer all datafiles from the source system to the target system

The time it takes to transfer the datafiles can be reduced significantly by using a storage infrastructure that can make the datafiles available to the target system without the need to physically move the files, or by using a physical standby database.

Use transportable tablespaces to perform a database upgrade when DBUA will not complete within the maintenance window, and Oracle Streams or Data Guard SQL Apply cannot be used due to data type restrictions.

4.4.4.2 Oracle Streams

Oracle Streams can be used to move a database from one platform to another with minimal downtime. This is because Oracle Streams supports a configuration in which the primary database and its replica are running on different platforms.

Note the following points when deciding if Oracle Streams is an appropriate method for a platform migration:

Oracle Streams does not support user-defined types, such as object types, REF values, varrays, and nested tables.

To perform an upgrade using Oracle Streams, the source database must be running Oracle9i release 2 or higher.

The administrative effort required to set up and maintain the Oracle Streams environment is more than if using Data Guard SQL Apply for a database upgrade.

There might be a performance impact on the source database while the source and target databases run in parallel as changes are propagated to the target database.

Downtime required for a platform migration when using Oracle Streams is determined by the time needed to apply the remaining transactions in the queue and to reconnect clients to the new database.

Consider using Oracle Streams when the application does not use user-defined types and the extra administrative effort is worth the opportunity for a very small outage time.

4.4.4.3 Oracle Data Pump

Oracle Data Pump technology enables very high-speed movement of data and metadata from one database to another, across different platforms and different database versions.

Note the following points when deciding if Data Pump is an appropriate method for a platform migration:

Oracle Data Pump is available only on Oracle Database 10g Release 1 (10.1) and later releases.

Downtime required for a platform migration when using Data Pump is determined by the time needed to perform a full database network import. A network import uses a database link between the target system and the remote source system to retrieve data and write it directly into the target system, without the use of dump files.

Use Data Pump when moving a database to a platform with different endian format when the network import time is acceptable.

4.4.4.4 Transportable Tablespaces

Transportable tablespaces can be used to accomplish a platform migration by transporting all user datafiles into a pre-created, prepared target database.

Note the following points when deciding if transportable tablespaces is the appropriate method for performing a platform migration:

The SYSTEM tablespace cannot be moved with transportable tablespaces. the target database SYSTEM tablespace contents, including user definitions and objects necessary for the clients, must be built manually. Use Data Pump to move the necessary contents of the SYSTEM tablespace.

Downtime required for a platform migration or database upgrade when using transportable tablespaces is determined by the time needed to:

Place the source database tablespaces in read-only mode

Perform a network import of the transportable metadata

Transfer all datafiles from the source system to the target system

This time can be reduced significantly by using a storage infrastructure that can make the datafiles available to the target system without the need to physically move the files

Convert all datafiles to the new platform format using RMAN

Use transportable tablespaces to migrate to a new platform when Oracle Data Pump will not complete within the maintenance window, and Oracle Streams or Data Guard SQL Apply cannot be used due to data type restrictions.

4.4.4.5 Data Guard Redo Apply (Physical Standby Database)

Data Guard Redo Apply can be used to change the location of a database to a remote site with minimal downtime by setting up a temporary standby database at a remote location and performing a switchover operation.

Downtime required for a location migration when using Data Guard Redo Apply is determined by the time required to perform a switchover operation.

4.4.5 Online Database and Application Upgrades

An Oracle database upgrade is the process of transforming an existing, prior release of an Oracle Database system into the current release of the Oracle Database system and can be a very lengthy process. An application upgrade may include a database upgrade and any application code and schema changes required. If database upgrade with Data Guard is not applicable and zero to minimum downtime is required for the database or application upgrade, then configure Oracle Streams to perform a database upgrade with little or no downtime. To do so, you use Oracle Streams to configure a single-source replication environment with the following databases:

Source Database: The original database that is being upgraded

Capture Database: The database where a capture process captures changes made to the source database during the upgrade

Destination Database: The copy of the source database where an apply process applies changes made to the source database during the upgrade process. The apply process can apply to the same or different schema and object structure using

Specifically, you can use the following general steps to perform a database upgrade while the database is online:

Create an empty destination database.

Configure an Oracle Streams single-source replication environment where the original database is the source database and a copy of the database is the destination database for the changes made at the source.

Perform the database upgrade on the destination database. During this time the original source database is available online.

Use Oracle Streams to apply the changes made at the source database to the destination database.

When the destination database has caught up with the changes made at the source database, take the source database offline and make the destination database available for applications and users.

If the schema or object structure is different at the destination database, then Streams transformations need to be incorporated to manipulate the change to its new structure.

4.4.6 Database Object Reorganization

Many scheduled outages related to the data server involve some reorganization of the database objects. The database object reorganization must be accomplished with continued availability of the database. Oracle's online object reorganization capabilities have been available since Oracle8i. These capabilities enable object reorganization to be performed even while the underlying data is being modified.

Table 4-23 describes a few of the object reorganization capabilities available with Oracle Database 10g.

Table 4-23 Some Object Reorganization Capabilities

Object Type

Example of Object Reorganization Solution

Description of Solution

Table

DBMS_REDEFINITION PL/SQL package

A PL/SQL package that provides a mechanism to redefine tables online. This is Oracle's recommended best practice.

Index

Rebuild index

Rebuild an index that has previously been marked as unusable.

Tablespace

Rename tablespace

Enables an existing tablespace to be renamed without rebuilding the tablespace and its contents.

In highly available systems, it is occasionally necessary to redefine large tables that are constantly accessed to improve the performance of queries or DML. The Online Reorganization and Redefinition feature in Oracle Database 10g, offers administrators unprecedented flexibility to modify table physical attributes and transform both data and table structure, while allowing users full access to the database. This capability improves data availability, query performance, response time and disk space usage, all of which are important in a mission-critical environment and it can make the application upgrade process easier, safer and faster.

Oracle's recommended practice is to reorganize tables using the DBMS_REDEFINITION PL/SQL package, because it provides a significant increase in availability compared to traditional methods of redefining tables that require tables to be taken offline. Whether you call DBMS_REDEFINITION manually at the command line or automatically through Oracle Enterprise Manager, the entire reorganization process occurs while users have full access to the table thus ensuring system availability.

Figure 4-19 shows the Reorganize Objects Wizard in Oracle Enterprise Manager that you can use as an alternative to calling the DBMS_REDEFINITION package at the SQL*Plus command line. After you answer a few questions in the wizard, it automatically generates the script and performs the reorganization.

Using the DBMS_REDEFINITION approach, an interim table is created that contains all the desired attributes. The reorganization begins by calling the procedure START_REDEF_TABLE, which is where the column mappings between the current and new version of the table are described. All the dependent objects such as triggers, constraints and indexes are automatically copied to the interim table using the procedure COPY_TABLE_DEPENDENTS. During the reorganization, any changes made to the original table are added to the interim table by calling the procedure SYNC_INTERIM_TABLE. The reorganization is complete when the procedure FINISH_REDEF_TABLE is called and the interim table is renamed as the main table.

A tablespace can be renamed in Oracle Database 10g, similar to the ability to rename a column, table and datafile. Previously, the only way to change a tablespace name was to drop and re-create the tablespace, but this meant that the contents of the tablespace had to be dropped and rebuilt later. With the ability to rename a tablespace online, there is no interruption to the users.

Additionally, consider the following when performing data reorganization:

Concurrent activity on the table during an online operation.

During an online operation, Oracle recommends users minimize activities on the base table. Database activities should impact less than ten percent of the table while online operation is in progress. Also the database administrator can use the Database Resource Manager to minimize the data reorganization impact to users by allocating enough resources to users.

Oracle does not recommend running online operations at peak times or running a batch job that modifies large amount of data during an online data reorganization.

In fact, parallel DML, direct load and import/export cannot be performed during an online operation.

Rebuilding index online vs. dropping an index and then re-creating a new index online.

Rebuilding an index online requires additional disk space for the new index during the operation, whereas dropping an index and then re-creating an index does not require additional disk space.

Coalescing an index online vs. rebuilding an index online.

Online index coalesce is an in-place data reorganization operation, hence does not require additional disk space like index rebuild does. Index rebuild requires temporary disk space equal to the size of the index plus sort space during the operation. Index coalesce does not reduce the height of the B-tree. It only tries to reduce the number of leaf blocks. The coalesce operation does not free up space for users but does improve index scan performance.

If a user needs to move an index to a new tablespace, use online index rebuild.

Local and global indexes.

Oracle Database 10g supports both local and global partitioned indexes with online operations. When tables and indexes are partitioned, this allows administrators to perform maintenance on these objects, one partition at a time, while the other partitions remain online.

4.4.7 System Maintenance

For a scheduled outage that requires an instance, node, or other component to be isolated, RAC provides the ability to relocate, disable, and enable services. Relocation migrates a service to another instance. Services and instances can be selectively disabled while repair, change, or upgrade is performed on hardware or system software and re-enabled after the maintenance is complete. This ensures that the service or instance is not started during the maintenance outage. The service and instance is disabled at the beginning of the planned outage. It is then enabled at the end of the maintenance outage.

When using RAC, Oracle Clusterware daemons start automatically at the time the node is started. When performing maintenance that requires one or more system reboots or requires that all non-operating system processes be shut down, use the crsctl command to stop and disable the startup of the Oracle Clusterware daemons. Once maintenance is complete, enable and start the Oracle Clusterware daemons with crsctl commands.