Adding and Removing Cluster Nodes

One of the central tenets of grid computing is the ability to move resources from one area of need to another: adding capacity where it is needed most, and removing excess or unused capacity so that it can be better utilized elsewhere. For the database grid (that is, RAC), this means the ability to add and remove cluster nodes with relative ease and swiftness. The same need arises in other situations as well: if a node in a cluster fails completely, the HA DBA will want to remove all vestiges of that node from the memories of the remaining nodes, and may later want to add another node, or the same node, back into the cluster. Regardless of the circumstances, this capability has been enhanced in Oracle Database 10g. This section discusses how it is accomplished.

Adding a Cluster Node

We will begin with the scenario of adding a node to your existing cluster. As mentioned, you may need to do this as your capacity needs grow over time (these needs may be permanent or temporary), or a cluster node may have suffered a catastrophic hardware failure, requiring that a new node be brought online. Adding a new node boils down to a four-step process:

1. Configure the new hardware.

2. Configure the new operating system.

3. Add the node to the cluster at the cluster layer.

4. Add the new instance at the database layer.

Step 1: Configuring the Hardware

This step consists of ensuring that the new node is connected to the shared components of your cluster. Make sure that the shared disks are attached and visible to the new node. Ensure that all network cards are correctly connected: the private card(s) should go to the switch(es) used for the interconnect, and the public card should be connected to your public network.

Step 2: Configuring the Operating System

Make sure that the operating system level on the new node is the same as that on existing nodes, including patch levels, kernel settings, oracle user and group settings, and so on. Configure user equivalence between all the nodes. For the complete list of steps required, please refer to the preinstall steps that were described in Chapter 4, applying those same steps to the new node that you intend to add to the cluster. In addition to configuring the hosts file on the new node, make sure that the hosts files on all existing nodes are updated to include references to the new node. Follow this up by verifying that you can ping the new node from all existing nodes in the cluster, and vice versa (using both the public and private node names).

Note

You must also be sure that you have secured a VIP (virtual IP) for the node to be added, and that the VIP is also defined in the hosts file of each node, whether new or existing.
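As an illustration, the hosts file on every node (new and existing) might contain entries along the following lines; the host names follow the examples used in this chapter, but the addresses are purely hypothetical:

```shell
# /etc/hosts -- illustrative entries only; IP addresses are assumptions
# Public names
192.168.1.101   rmsclnxclu1
192.168.1.102   rmsclnxclu2        # the new node
# Private (interconnect) names
10.0.0.1        rmsclnxclu1-priv
10.0.0.2        rmsclnxclu2-priv
# Virtual IPs (VIPs)
192.168.1.201   rmsclnxclu1-vip
192.168.1.202   rmsclnxclu2-vip
```

A quick ping of the public and private names from each existing node to the new node, and the reverse from the new node, confirms that the entries are consistent everywhere.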

Disk Configuration on the New Node If using RAW devices or ASM, configure the /etc/sysconfig/rawdevices file to match the RAW device bindings on the other nodes (or install the ASMLib provided by Oracle, if available for your platform). If using OCFS, install the matching version of the OCFS driver on the new node, and verify that you can mount the OCFS drive from the new node. Again, refer to Chapter 4 for details on configuring these pieces. Another important point to keep in mind if you are using RAW devices is that you will need to configure RAW partitions for your new instance's online redo logs and undo tablespace. If using OCFS or ASM, this is not necessary.
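If you are on RAW devices, the bindings on the new node might look like the following fragment of /etc/sysconfig/rawdevices; the device names and the partitions set aside for the new instance's redo logs and undo tablespace are hypothetical:

```shell
# /etc/sysconfig/rawdevices -- sketch only; device and partition names are assumptions
# Existing bindings, matching the other nodes in the cluster
/dev/raw/raw1   /dev/sdb1      # OCR
/dev/raw/raw2   /dev/sdb2      # voting disk
# New bindings for the instance to be added on this node
/dev/raw/raw10  /dev/sdc5      # online redo logs for the new thread
/dev/raw/raw11  /dev/sdc6      # undo tablespace for the new instance
```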

Step 3: Adding the Node to the Cluster Layer

Once the preaddition steps just listed have been completed, you are now ready to add the node into the CRS layer, or clusterware layer, making the other nodes aware of its existence. To do this, you must go to one of the existing nodes in the cluster, change into the <CRS_HOME>/oui/bin directory, and run the addNode.sh script as the oracle user. This will start up the Oracle Universal Installer in Add Node mode. Be sure that the display is set correctly. The Welcome screen will look the same as it always does.
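In practice, the launch looks something like this, run as the oracle user on an existing node; the CRS home path and display name are assumptions:

```shell
# Run as the oracle user on an EXISTING cluster node
export DISPLAY=workstation:0.0        # point the X display at your terminal
cd /u01/app/oracle/CRS/oui/bin        # assumed CRS_HOME
./addNode.sh                          # starts the OUI in Add Node mode
```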

Click Next on the Welcome screen, and you will see the screen to Specify Cluster Nodes for Node Addition, as shown in Figure 5-1. The upper table will show the existing nodes in the cluster, while you will be able to add the information for the new node in the lower table. Specify the public and private node info for the new node, and then proceed to the next step. In the following screen, you will be prompted to run the orainstRoot.sh script as root on the new node (unless there is an inventory location there already).

Figure 5-1: Adding a node at the cluster layer

Click Next from here, and you will see the Cluster Node Addition Progress page. At this point, the OUI begins copying the binaries from the CRS_HOME on the local node to the CRS_HOME on the new node. It is important to note that you are copying the home from one node to the other (unless the home is a shared CRS_HOME on a cluster file system drive). This has two advantages. First, you do not have to provide the install media for CRS, as it is not needed. Second, if the CRS_HOME on the existing node(s) has already been patched, the patch level is propagated to the new node in one fell swoop, forgoing the need for multiple runs (installing the base release and then patching on top of that). The same is true of the inventory: it is updated on the remote node to reflect the version and patch level of what has been copied over.

After the copy of the CRS_HOME has completed, you will be prompted to run the rootaddnode.sh script, followed by a prompt to run the root.sh script. The rootaddnode.sh script should be run as root on the local node from which you are running the OUI. In our case, we ran the addNode operation from node rmsclnxclu1, so on that node we run rootaddnode.sh:
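A sketch of the invocation, assuming the same CRS home path used in our other examples:

```shell
# Run as root on the EXISTING node from which the OUI was launched
cd /u01/app/oracle/CRS/install        # assumed CRS_HOME
./rootaddnode.sh                      # registers the new node in the OCR
```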

Next you will be prompted to run root.sh on the new node(s) (in our case, the new node is rmsclnxclu2). This will start the CRS stack on the new node. A successful run of root.sh should have the following configuration information at the end:

...
Preparing Oracle Cluster Ready Services (CRS):
Expecting the CRS daemons to be up within 600 seconds.
CSS is active on these nodes.
  rmsclnxclu1
  rmsclnxclu2
CSS is active on all nodes.
Oracle CRS stack installed and running under init(1M)

The last step to complete the install is to connect as user oracle, on either node, and run the RACGONS command from the CRS_HOME/bin directory to add the Oracle Notification Services component. The command should be

./racgons add_config rmsclnxclu2:4948

where rmsclnxclu2 is the new node. The port used should be port 4948.

Step 4: Adding the Node at the RDBMS Layer

Once the new node is up and running as a member of the cluster, the next step is to add the RDBMS layer onto the new node. To do this, again go to an existing node in the cluster, but this time change to the ORACLE_HOME/oui/bin directory (as opposed to CRS_HOME/oui/bin) and run the addNode.sh script as user oracle. Again, the OUI will start up in Add Node mode, and clicking Next on the Welcome screen will take you to the Specify Cluster Nodes to Add to Installation page. Since we have successfully added the new node at the cluster layer, the new node should be listed in the Specify New Nodes section on the lower half of the screen, as shown in Figure 5-2. If the node is not listed, you must recheck the steps in the earlier section on adding the node to the cluster layer. From the Node Selection window at the bottom of the page, select the node you wish to add to the cluster and proceed.
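The launch mirrors the cluster-layer run, only from the database home; the ORACLE_HOME path here is an assumption:

```shell
# Run as the oracle user on an EXISTING cluster node
cd /u01/app/oracle/product/10.1.0/db_1/oui/bin   # assumed ORACLE_HOME
./addNode.sh                                     # OUI in Add Node mode again
```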

Figure 5-2: Adding a node at the RDBMS layer

Again, the add node process will copy the binaries from the ORACLE_HOME on the existing node directly to the ORACLE_HOME on the new node (unless OCFS is used for the ORACLE_HOME), precluding the need to provide the install media or to reapply patches on the new node. At the end of the install, run root.sh as prompted.

Run VIPCA to Complete the RDBMS Install After you have exited the installer, you must run the VIPCA from the command line on either node to ensure that all nodes that are part of the RAC install are included in the node list. Recall that VIPCA has to be run as root. Since this command will start up the GUI Configuration Assistant, make sure that your display is set properly before running the command:

vipca -nodelist rmsclnxclu1,rmsclnxclu2

The GUI Virtual IP Configuration Assistant will show the virtual IP information for the existing node(s) grayed out, allowing you only to specify the VIPs for the node(s) being added, as shown in Figure 5-3.

Figure 5-3: Adding virtual IPs for new nodes

Adding the Instance

Finally, you are ready to run the DBCA to configure the instance on the new node. Running DBCA from an existing node in the cluster, choose Real Application Clusters Database and then Instance Management. From there, select the Add Instance option, where you will be able to select the existing RAC database. You will need to authenticate yourself before you can add the new instance in. After typing in the password, be sure to tab out of the password field; otherwise, the password will not be recognized. Next, assuming that the new node is a visible member of the cluster, you will be given the opportunity to select that node and choose an instance name for it. The DBCA will assume that you want to call the instance <dbname>#, with the number being the next available instance number, as shown in Figure 5-4.

Figure 5-4: Naming and node assignment for the new instance

On the Database Services page, you will have the opportunity to make the new instance either an Available or Preferred instance for the existing services (see Chapter 6). Follow the prompts until you reach the Instance Storage page. If you are using RAW devices, you will need to expand the storage options for the undo tablespace and the redo logs, pointing them to the links for the RAW slices that you created in Step 2 earlier. If using OCFS or ASM for the existing database, you can simply click past this page at the end of the configuration, allowing the DBCA to pick the file locations. Once the instance is added, your new node is fully functional and participating in your database grid.

Removing a Cluster Node

As discussed previously, though it is probably less common, you may need to remove a node from your RAC cluster for various reasons: you may need to shift the hardware resources elsewhere, or the node itself may have failed completely. In the event of a node failure, the node to be removed will no longer be accessible for any cleanup, so depending on the circumstances, the next steps may not all apply. Even so, we need to make the other nodes aware of their cohort's demise; therefore, many of the node removal steps must still be run from one of the existing/remaining nodes in the cluster. The following steps are in essence the inverse of the steps previously listed:

1. Remove the database instance using DBCA.

2. Remove the node from the cluster.

3. Reconfigure the OS and remaining hardware.

Removing an Instance Using DBCA

To remove an instance from the database, you must run the DBCA from one of the existing/remaining nodes in the cluster. If the node is accessible, leave it up and running, as well as the instance that you are removing. This will enable the DBCA to get information regarding the instance, such as bdump/udump locations, and also to archive any unarchived redo logs. As before, select the Instance Management option in the DBCA. This time, choose Delete Instance from the Instance Management page. You will be prompted to select the database from which you intend to delete the instance, and you will also be prompted to supply the sys password. Again, after typing in the password, tab out of the Password field before clicking Next; otherwise, the password will not be recognized.

At this time, the List of Cluster Database Instances page will appear, as shown in Figure 5-5, presenting you the option to choose which instance should be deleted. Highlight the appropriate instance and click Next. If there have been services assigned specifically to the deleted instance, you will be given the opportunity to reassign those services via the Database Services page. We discuss services in more detail in Chapter 6. Modify the services so that each service can run on one of the remaining instances, and set Not Used for each service regarding the instance that is to be deleted. Once the services are reassigned, choose Finish. You will be prompted to confirm your choice.

Figure 5-5: Deleting an instance

Some Manual Cleanup Steps

Once complete, the DBCA should remove references to the instance from all listeners in the cluster, as well as deleting the instance's password file and init file, and removing the udump/bdump/cdump directories for that instance if the node is accessible. It will also disable the redo thread for that instance and modify the spfile, removing specific parameters for the removed instance. In addition, the undo tablespace for the defunct instance will be dropped.

If the instance is unavailable, however, or if the DBCA fails to remove all components, you may need to use SRVCTL commands to remove the instance from the OCR manually. For example:

srvctl remove instance -d grid -i grid2

If for any reason the redo thread is not disabled, refer to the section on redo logs earlier in this chapter, where we discussed disabling redo threads manually. In addition, we discuss SRVCTL more in Chapter 6. To check for possible problems, a log of the DBCA's actions can be found in the $ORACLE_HOME/assistants/dbca/logs directory on the node where the DBCA was run.

Dropping Redo Logs May Fail in Archivelog Mode If you are in archivelog mode, you may also find that the DBCA cannot drop the current log group, as it needs archiving. If this happens, you will see ORA-350 and ORA-312 errors in a pop-up window. Click Ignore and the DBCA will continue, removing everything but the current redo log group of the instance you are deleting. After the DBCA completes, you will need to manually archive the logs for the deleted instance, and then drop that log group afterward via these commands:

alter system archive log all;
alter database drop logfile group 3;

Manually Remove the ASM Instance If this node had an ASM instance, and this node will no longer be used as part of this cluster, you will need to manually remove the ASM instance before proceeding to remove the node from the cluster. Do so with the following command:

srvctl remove asm -n rmsclnxclu2

Removing the Node from the Cluster

After the instance has been deleted, removing the node from the cluster is still essentially a manual process, accomplished by running scripts on the deleted node (if available) to remove the CRS install, as well as scripts on the remaining nodes to update the node list and inform the remaining nodes of who is left. While we expect that this process will be simplified in forthcoming releases, we will go through the steps as they currently exist in the form of an HA Workshop.

HA Workshop: Removing a Node from a Cluster

Workshop Notes

This workshop will walk you step by step through the process of removing a node from a two-node cluster; the same principles apply to a cluster of any size. We assume that the node to be removed is still functioning, that is, this is a resource shift and the node is needed elsewhere. However, most commands can be run from any node in the cluster, so these concepts apply even if the node is defunct. At the end of the workshop, we will point out the steps that would differ if the node targeted for removal is unavailable. Pay heed, as well, to the user that each command must be run as.

Step 1. Start out as the root user. We first want to determine the node name and node number of each node, as stored in the Cluster Registry, so run the OLSNODES command first from the CRS_HOME/bin directory and make note of this information for your cluster:
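On our hypothetical two-node cluster, the check might look like the following; the CRS home path is assumed, and the output shown in the comments is illustrative:

```shell
# Run as root from an existing node; CRS home path is an assumption
cd /u01/app/oracle/CRS/bin
./olsnodes -n          # lists each node name with its node number
# rmsclnxclu1   1
# rmsclnxclu2   2
```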

Step 2. In this case, we want to delete node number 2, which is rmsclnxclu2, but we first must stop the node apps (see Chapter 6). So, still as root, run the following command (be sure that the ASM instance-if it exists-has been removed, as we noted in the previous section):

srvctl stop nodeapps -n rmsclnxclu2

Step 3. Still as root, follow this up by running the rootdeletenode.sh script, passing in the node name to be removed:

$ORACLE_HOME/install/rootdeletenode.sh rmsclnxclu2

Note that this script is run from the ORACLE_HOME/install directory. Even though you are running this as root, the ORACLE_HOME environment variable should be set to the appropriate ORACLE_HOME. This script will remove the CRS node apps (discussed in Chapter 6).

In some cases, we have seen this command fail with the following error:

PRKO-2112 : Some or all node applications are not removed successfully on node: rmsclnxclu2

Most likely this is due to not removing the ASM instance (as noted in the previous section). However, we have found that this is not a critical error, so if you see this error, you can still proceed to the next step.

Step 4. Now switch over to the oracle user account, and follow up by running the installer with the updateNodeList option (do this from the same node as the previous steps):
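The invocation might look like the following; it is a single command, shown here wrapped with backslashes for readability, and the ORACLE_HOME path is an assumption:

```shell
# Run as oracle from an existing node; DISPLAY must point at an X server
cd /u01/app/oracle/product/10.1.0/db_1/oui/bin   # assumed ORACLE_HOME
./runInstaller -updateNodeList \
    ORACLE_HOME=/u01/app/oracle/product/10.1.0/db_1 \
    "CLUSTER_NODES=rmsclnxclu1"     # comma-separated list of REMAINING nodes
```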

Note that this is all one command and should be entered on a single line. In this example, we have a two-node cluster and we are removing node rmsclnxclu2, so rmsclnxclu1 is the only remaining node; therefore, we specified CLUSTER_NODES=rmsclnxclu1. If you have multiple nodes remaining, this should be a comma-separated list of the remaining nodes. Note also that even though the GUI installer window will not open, the DISPLAY environment variable should still be set to a terminal running an X server.

Step 5. Next we switch back to root to finish up the removal. This command must be run from the node that we intend to remove. As root, run the following command to stop the CRS stack and delete the ocr.loc file on the node to be removed. The nosharedvar option assumes that the ocr.loc is not on a shared file system with any other nodes. (If it were, for example, on an HP Tru64 cluster, where the entire operating system exists on a shared cluster file system, you should specify sharedvar instead of nosharedvar.) Again, this command should be run on the node that you intend to remove, and this step is only necessary if that node is still operational.
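In the standard 10g node-removal procedure, the script in question is rootdelete.sh from the CRS home's install directory; verify the script name against your release. A sketch, with the path assumed:

```shell
# Run as root ON THE NODE BEING REMOVED
cd /u01/app/oracle/CRS/install        # assumed CRS_HOME
./rootdelete.sh remote nosharedvar    # use 'sharedvar' if ocr.loc is on shared storage
```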

Aside from stopping the CRS stack, this command will remove the inittab entries for CRS, and also remove the init files from /etc/init.d.

Step 6. Next, switch back to the node where the previous steps have been executed. Still as root, run the rootdeletenode.sh script from the CRS_HOME/install directory. Here, rootdeletenode.sh, as run from the CRS home, must specify both the node name and the node number. Refer back to Step 1, where we ran the OLSNODES command to determine the node number of the node to be deleted. Do not put a space after the comma:
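Using the node name and number recorded in Step 1 (node rmsclnxclu2, node number 2 in our example), the call might look like this; the CRS home path is an assumption:

```shell
# Run as root on a REMAINING node
cd /u01/app/oracle/CRS/install        # assumed CRS_HOME
./rootdeletenode.sh rmsclnxclu2,2     # nodename,nodenumber -- no space after the comma
```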

Step 7. Confirm the success by again running the OLSNODES command to confirm that the node is no longer listed-in our example, the only remaining node is the first node:

root@/u01/app/oracle/CRS/bin>: ./olsnodes -n
rmsclnxclu1     1

Step 8. We are finally near completion. Switch back now to the oracle user, and run the same runInstaller command as before, but this time from the CRS_HOME instead of the ORACLE_HOME. Again, do this on an existing node, be sure that the display is set, and specify all remaining nodes for the CLUSTER_NODES argument, just as we did in Step 4:
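A sketch mirroring Step 4, but run against the CRS home; the paths are assumptions, and the CRS=TRUE flag appears in Oracle's documented node-list update for clusterware homes, so confirm it against your release:

```shell
# Run as oracle from a REMAINING node; DISPLAY must point at an X server
cd /u01/app/oracle/CRS/oui/bin                   # assumed CRS_HOME
./runInstaller -updateNodeList \
    ORACLE_HOME=/u01/app/oracle/CRS \
    "CLUSTER_NODES=rmsclnxclu1" CRS=TRUE         # remaining nodes only
```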

Step 9. Once the node updates are done, you will need to manually delete the ORACLE_HOME and CRS_HOME from the node to be expunged (unless, of course, either of these is on a shared OCFS drive). In addition, while the inittab file will be cleaned up, and the init files will be removed from /etc/init.d, you may still want to remove the soft links from /etc/rc2.d, /etc/rc3.d, and /etc/rc5.d. The links are named K96init.crs and S96init.crs in each directory. Also, you can remove the /etc/oracle directory and the oratab file from /etc, if you wish. The node is now ready to be plugged into another cluster.
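The manual cleanup from this step might look like the following on the removed node; all paths follow the examples used in this chapter and should be adjusted to your actual layout:

```shell
# Run as root on the node that was removed
rm -rf /u01/app/oracle/product/10.1.0/db_1   # ORACLE_HOME (skip if on shared OCFS)
rm -rf /u01/app/oracle/CRS                   # CRS_HOME    (skip if on shared OCFS)
# Remove the leftover CRS init soft links
rm -f /etc/rc2.d/K96init.crs /etc/rc2.d/S96init.crs
rm -f /etc/rc3.d/K96init.crs /etc/rc3.d/S96init.crs
rm -f /etc/rc5.d/K96init.crs /etc/rc5.d/S96init.crs
# Optional: remove remaining Oracle configuration files
rm -rf /etc/oracle
rm -f /etc/oratab
```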

Removing a Node When the Node Is Fried As noted previously, the above workshop assumes that the node to be removed is still fully functional. Obviously, there will be some cases where this is not so. In the above workshop, all of the commands/steps can be run from any node in the cluster, with the exception of Step 5. In the case where the node is no longer accessible, you can simply skip Step 5 altogether, as there is no need to bring down the CRS stack, nor is there a need to modify the inittab if the node is gone.