11g Release 2

One of the major adventures this time of the year involves installing RAC 11.2.0.2 on Solaris 10 10/09 x86-64. The system setup included EMC PowerPath 5.3 as the multipathing solution to the shared storage.

I initially asked for 4 BL685 G6 with 24 cores, but in the end “only” got two, still plenty of resources to experiment with. I especially like the output of this command:

So much for the equipment. The operating system showed 4 NICs, all called bnxen, where n ranged from 0 through 3. The first interface, bnxe0, will be used for the public network. The second NIC is to be ignored, and the final two, bnxe2 and bnxe3, will be used for the highly available cluster interconnect feature. This way I can prevent the use of SFRAC, which inevitably would have meant a clustered Veritas file system instead of ASM.

One interesting point to note is that Oracle MOS document 1210883.1 specifies that the interfaces for the private interconnect are all on the same subnet. So node1 will use 192.168.0.1 for bnxe2 and 192.168.0.2 for bnxe3. Similarly, node2 uses 192.168.0.3 for bnxe2 and 192.168.0.4 for bnxe3. The Oracle example is actually a bit more complicated than it could have been, as they use a /25 subnet mask. But ipcalc confirms that the addresses they use are all well within the subnet:
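For example, with the /25 mask (output abbreviated):

$ ipcalc 192.168.0.0/25
Network:   192.168.0.0/25
HostMin:   192.168.0.1
HostMax:   192.168.0.126
Hosts/Net: 126

All four interconnect addresses fall between .1 and .126, so they are indeed in one subnet.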

This setup will have some interesting implications which I’ll describe a little later.

Part of the test was to find out how mature the port to Solaris on Intel was. So I decided to start off by installing Grid Infrastructure on node 1 first, and extend the cluster to node2 using the addNode.sh script in $ORACLE_HOME/oui/bin.

The installation uses 2 different accounts to store the Grid Infrastructure binaries separately from the RDBMS binaries. Operating system accounts are oragrid and oracle.

I started off by downloading files 1, 2 and 3 of patch 10098816 for my platform. The download ratio for this patch was 243 to 751 between x64 and SPARC, so not a massive uptake of this patchset on Solaris x64 it would seem.

As the oragrid user I created user equivalence with RSA and DSA SSH keys. A little utility will do this for you these days, but I’m old-school and created and exchanged the keys on the hosts myself. Not too bad a task with only 2 nodes.

The next step was to find out about the shared storage, and that, I freely admit, took me a little while: I hadn’t used the EMC PowerPath multipathing software before and found it difficult to approach, mainly for the lack of information about it. Or maybe I just didn’t find the documentation, but device-mapper-multipath, for instance, is easier to understand. Additionally, the fact that this was Solaris on Intel made it a little more complex. First I needed to know what the device names actually mean. As on Solaris SPARC, /dev/dsk lists the block devices and /dev/rdsk the raw devices, so that’s where I was heading. Next I checked the devices, emcpower0a to emcpower9a. In the course of the installation I found out how to deal with these. First of all, on Solaris Intel, you have to create an fdisk partition on the LUN before it can be dealt with in the SPARC way. So for each device you would like to use, run fdisk against the emcpowerxp0 device, i.e.

# fdisk /dev/rdsk/emcpower0p0

If there is no partition, simply answer “y” to the question whether you want to use all of the disk for Solaris, and exit fdisk. Otherwise, delete the existing partition (AFTER HAVING double/triple CHECKED THAT IT’S REALLY NOT NEEDED!) and create a new one of type “Solaris2”. It didn’t seem necessary to make it active.

This particular device will be used for my OCRVOTE disk group, which is why it’s only 1G. The next step is identical to SPARC: start the format tool, select “partition”, change the fourth partition to use the whole disk (with an offset of 3 cylinders at the beginning of the slice) and label the disk. With that done, exit the format application.

This takes me back to the discussion of the emcpower device names. The letters a to p refer to the slices of the device, while a trailing pn denotes an fdisk partition. /dev/dsk/emcpower1c, for example, is slice 2 of the second multipathed device, in other words the whole disk. I usually create a slice 4, which translates to emcpower1e. After having completed the disk initialisation, I had to ensure that the LUNs I was working on were really shared. Unfortunately the emcpower devices are not consistently named across the cluster: what is emcpower0a on node1 turned out to be emcpower2a on the second node. How to check? The powermt tool to the rescue. Similar to “multipath -ll” on Linux, the powermt command can show the underlying disks which are aggregated under the emcpowern pseudo device. So I wanted to know if my device /dev/rdsk/emcpower0e was shared. What I was really interested in was the native device:
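powermt display shows it; the full output, not reproduced here, lists the logical device ID and the native c#t#d# paths, and comparing the logical device IDs across nodes tells you whether two differently named pseudo devices are in fact the same LUN:

# powermt display dev=emcpower0a
# powermt display dev=all | egrep "Pseudo|Logical"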

So yes, it was there. Cool! I checked the two other OCR/voting disk LUNs and they were shareable as well. The final piece was to change the ownership of the devices to oragrid:asmdba and their permissions to 0660.
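A sketch of that final step; the second and third device names are assumptions from my setup:

# chown oragrid:asmdba /dev/rdsk/emcpower0e /dev/rdsk/emcpower1e /dev/rdsk/emcpower2e
# chmod 0660 /dev/rdsk/emcpower0e /dev/rdsk/emcpower1e /dev/rdsk/emcpower2e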

Project settings

Another item to look at is the project settings for the grid owner and oracle. It’s important to set projects correctly, otherwise the installation will fail when ASM starts. All newly created users inherit the settings from the default project. Unless the sys admins have set the default project’s values high enough, you will have to change them. To check, you can use “prctl -i project default” to list all the values for this project.

I usually create a project for the grid owner, oragrid, as well as for the oracle account. My settings are as follows for a maximum SGA size of around 20G:
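As an illustration, such a project could be created like this; the resource control values are assumptions sized for a roughly 20G SGA, not a recommendation:

# projadd -U oragrid -K "project.max-shm-memory=(privileged,24G,deny)" user.oragrid
# projmod -sK "process.max-sem-nsems=(privileged,512,deny)" user.oragrid
# prctl -n project.max-shm-memory -i project user.oragrid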

Finally ready to start the installer! The Solaris installation isn’t any different from Linux, except for the aforementioned fiddling with the raw devices.

The installation went smoothly; I ran orainstroot.sh and root.sh without any problems. If anything, it was a bit slow, taking 10 minutes to complete root.sh on node1. You can tail the rootcrs_node1.log file in /data/oragrid/product/11.2.0.2/cfgtoollogs/crsconfig to see what’s going on behind the scenes. This is certainly one of the biggest improvements over 10g and 11g Release 1.
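For example:

# tail -f /data/oragrid/product/11.2.0.2/cfgtoollogs/crsconfig/rootcrs_node1.log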

Extending the cluster

As I said, the MOS document I was alluding to earlier suggests having all the private NIC IP addresses in the same subnet. That isn’t necessarily to the liking of cluvfy: communication over bnxe3 fails on both hosts, as shown in this example. Tests were executed from node1:
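A node connectivity check along these lines reproduces the problem (run as the grid owner):

$ cluvfy comp nodecon -n node1,node2 -verbose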

This is actually a little more verbose than I needed, but it’s always good to be prepared for an SR with Oracle.

However, addNode.sh performs a prerequisite check before the actual call to runInstaller, and that check repeatedly failed, complaining about connectivity on the bnxe3 network. Checking the contents of the addNode.sh script, I found an environment variable, IGNORE_PREADDNODE_CHECKS, which can be set to “Y” to make the script skip the prerequisite checks. With that set, the addNode operation succeeded.
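The resulting session looks roughly like this; the VIP name is an assumption:

$ export IGNORE_PREADDNODE_CHECKS=Y
$ cd $ORACLE_HOME/oui/bin
$ ./addNode.sh -silent "CLUSTER_NEW_NODES={node2}" \
      "CLUSTER_NEW_VIRTUAL_HOSTNAMES={node2-vip}"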

RDBMS installation

This is actually hardly worth reporting; it’s pretty much the same as on Linux. However, a small caveat is specific to Solaris x86-64: many files in the Oracle inventory didn’t have correct permissions. When launching runInstaller to install the binaries, I was bombarded with complaints about file permissions.

For example, oraInstaller.properties had the wrong permissions. Example from Solaris Intel:
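To track such a file down, check where your inventory lives; the inventory location below is hypothetical:

$ cat /var/opt/oracle/oraInst.loc
inventory_loc=/data/oraInventory
inst_group=oinstall
$ ls -l /data/oraInventory/oraInstaller.properties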

This post is about the installation of Grid Infrastructure, and this is where it gets really exciting: the third, NFS-based voting disk is going to be presented, and I am going to show you how simple it is to add it to the disk group chosen for OCR and voting disks.

Let’s start with the installation of Grid Infrastructure. This is really simple, and I won’t go into too much detail. Start by downloading the required file from MOS; a simple search for patch 10098816 should bring you to the download page for 11.2.0.2 for Linux. Just make sure you select the 64bit version. The file we need just now is called p10098816_112020_Linux-x86-64_3of7.zip. The file names don’t necessarily relate to their contents; the readme helps you find out which piece of the puzzle is used for which functionality.

I alluded to my software distribution method in one of the earlier posts, and here’s the detail. My dom0 exports the /m directory to the 192.168.99.0/24 network, the one accessible to all my domUs. This really simplifies software deployments.
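On the dom0 this boils down to an export and an unzip; the export options and staging directory below are assumptions from my setup:

# grep "^/m" /etc/exports
/m      192.168.99.0/24(ro,no_root_squash,sync)
# cd /m/stage/11.2.0.2
# unzip p10098816_112020_Linux-x86-64_3of7.zip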

This creates the subdirectory “grid”. Switch back to edcnode1 and log in as oracle. As I already explained I won’t use different accounts for Grid Infrastructure and the RDBMS in this example.
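Mounting the export on a domU requires root privileges and is a one-liner; 192.168.99.10 is my dom0’s address on that network:

[root@edcnode1 ~]# mkdir -p /m
[root@edcnode1 ~]# mount -t nfs 192.168.99.10:/m /m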

With the /m directory mounted, move to the newly unzipped “grid” directory under your mount point and begin to set up the user equivalence. On edcnode1 and edcnode2, create RSA and DSA keys for SSH:

[oracle@edcnode1 ~]$ ssh-keygen -t rsa

All prompts can be answered with the return key; it’s important to leave the passphrase empty. Repeat the call to ssh-keygen with argument “-t dsa”. Navigate to ~/.ssh and create the authorized_keys file as follows:

[oracle@edcnode1 .ssh]$ cat *.pub >> authorized_keys

Then copy the authorized_keys file to edcnode2 and add the public keys:
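For example:

[oracle@edcnode1 .ssh]$ scp authorized_keys oracle@edcnode2:.ssh/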

If you are prompted, add the host to the ~/.ssh/known_hosts file by typing in “yes”.

[oracle@edcnode2 .ssh]$ cat *.pub >> authorized_keys

Change the permissions on the authorized_keys file to 0400 on both hosts, otherwise it won’t be considered when trying to log in. With all of this done, you can add all the unknown hosts to each node’s known_hosts file. The easiest way is a for loop:
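A minimal sketch; extend the host list with the fully qualified names if you use them:

[oracle@edcnode1 ~]$ for h in edcnode1 edcnode2 edcnode1.localdomain edcnode2.localdomain; do
> ssh $h hostname
> done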

Run this twice on each node, acknowledging the question whether the new address should be added. Important: ensure that there is no banner (/etc/motd, .profile, .bash_profile etc.) writing to stdout or stderr, or you are going to see strange error messages about user equivalence not being set up correctly.

I hear you say: but 11.2 can create user equivalence in OUI now! This is of course correct, but I wanted to run cluvfy first, which requires a working setup.

Cluster Verification

It is good practice to run a check to see if the prerequisites for the Grid Infrastructure installation are met, and keep the output. Change to the NFS mount where the grid directory is exported, and execute runcluvfy.sh as in this example:
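For example, keeping a copy of the output for later reference:

[oracle@edcnode1 grid]$ ./runcluvfy.sh stage -pre crsinst -n edcnode1,edcnode2 -verbose | tee ~/cluvfy_pre_crsinst.out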

Screen 06: Ensure that both hosts are listed in this screen. Add/edit as appropriate. Hostnames are edcnode{1,2}.localdomain, VIPs are to be edcnode{1,2}-vip.localdomain. Enter the oracle user’s password and click on Next.

Screen 07: Assign eth0 to public, eth1 to private and eth2 to “do not use”.

It’s about time to deal with this subject. If not done so already, start the domU “filer03”. Log in as openfiler and ensure that the NFS server is started. On the services tab, click on enable next to the NFS server if needed. Next navigate to the shares tab, where you should find the volume group and logical volume created earlier. The volume group I created is called “ocrvotenfs_vg”, and it has 1 logical volume, “nfsvol_lv”. Click on the name of the LV to create a new share. I named the new share “ocrvote”; enter this in the popup window and click on “create sub folder”.

The new share should appear underneath the nfsvol_lv now. Proceed by clicking on “ocrvote” to set the share’s properties. Before you get to enter these, click on “make share”. Scroll down to the host access configuration section in the following screen. In this section you could set all sorts of technologies: SMB, NFS, WebDAV, FTP and RSYNC. For this example, everything but NFS should be set to “NO”.

For NFS, the story is different: ensure you set the radio button to “RW” for both hosts. Then click on Edit for each machine. This is important! The anonymous UID and GID must match the grid owner’s uid and gid. In my scenario I entered “500” for both; you can check your settings using the id command as oracle: it prints the UID and GID plus other information.

The UID/GID mapping then has to be set to all_squash, IO mode to sync, and write delay to wdelay. Leave the default for “requesting origin port”, which was set to “secure < 1024” in my configuration.

I decided to create /ocrvote on both nodes to mount the NFS export:

[root@edcnode2 ~]# mkdir /ocrvote

Edit the /etc/fstab file to make the mount persistent across reboots. I added this line to the file on both nodes:
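The entry follows Oracle’s documented NFS mount options for clusterware files; the export path is an assumption based on how openFiler lays out its shares:

filer03:/mnt/ocrvotenfs_vg/nfsvol_lv/ocrvote  /ocrvote  nfs  rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,noac,vers=3,timeo=600,actimeo=0  0 0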

Regardless of what I tried, the system complained. Grudgingly I used the GUI, asmca.

After starting asmca, click on Disk Groups. Then select disk group “OCRVOTE” and right-click to “add disks”. The trick is to click on “change discovery path”. Enter “ORCL:*, /ocrvote/nfsvotedisk01” (without the quotes) in the dialog field and close it. Strangely, the NFS disk now appears. Tick two boxes: the one next to the disk path, and the quorum checkbox. A click on the OK button starts the magic, and you should be presented with a success message. The ASM instance reports a little more:

One of the cool features introduced in 11.1 allows administrators of stretched RAC systems to instruct ASM to read mirrored extents rather than primary extents. This can speed up data access in cases where data would otherwise have been sent from the remote array. Setting this parameter is crucial to many implementations. In preparation for the RDBMS installation (to be detailed in the next post), I created a disk group consisting of 4 ASM disks, two from each filer. The syntax for the disk group creation is as follows:
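A sketch of the statement; the ASMLib disk names are assumptions following my filer-based naming convention, and the failure group names match the preferred read settings below:

SQL> create diskgroup data normal redundancy
  2  failgroup sitea disk 'ORCL:ASM01FILER01', 'ORCL:ASM02FILER01'
  3  failgroup siteb disk 'ORCL:ASM01FILER02', 'ORCL:ASM02FILER02'
  4  attribute 'compatible.asm' = '11.2';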

Now all that remains to be done is to instruct the ASM instances to read from the local storage if possible. This is performed by setting an instance-specific init.ora parameter. I used the following syntax:

SQL> alter system set asm_preferred_read_failure_groups='DATA.SITEB' scope=both sid='+ASM2';
System altered.
SQL> alter system set asm_preferred_read_failure_groups='DATA.SITEA' scope=both sid='+ASM1';
System altered.

So I’m all set for the next step, the installation of the RDBMS software. But that’s for another post…

Wow, that looked like a completely corrupted Oracle installation, and I got ready to roll the change back. However, MOS had an answer for this!

These errors can be ignored as per the MOS document “Relinking Causes Many Warnings on AIX” (ID 1189533.1). See also MOS note 245372.1, “TOC overflow Warning Can Safely Be Ignored”, and BUG:9828407, “TOO MANY WARNING MESSAGES WHEN INSTALLING ORACLE 11.2”, for more information.

Finally I have some more time to work on the next article in this series, dealing with the setup of my two cluster nodes. This is actually going to be quite short compared to the other articles so far. This is mainly due to the fact that I have streamlined the deployment of new Oracle-capable machines to a degree where I can comfortably set up a cluster in 2 hours. It’s a bit more work initially, but it paid off. The setup of my reference VM is documented on this blog as well, search for virtualisation and opensuse to get to the article.

When I first started working in my lab environment I created a virtual machine called “rhel55ref”. In reality it’s OEL, because of Red Hat’s windooze-like policy of requiring an activation code. I would have considered CentOS as well, but when I created the reference VM the community hadn’t provided “update 5” yet. I like the brand new shiny things most :)

Seems like I’m lucky now as well: with the introduction of Oracle’s own Linux kernel I am ready for the future. Hopefully Red Hat will get their act together soon and release version 6 of their distribution. As much as I like Oracle, I don’t want them to dominate the OS market too much. With Solaris now in their hands as well…

Anyway, to get started with my first node I cloned my template. Moving to /var/lib/xen/images, all I had to do was “cp -a rhel55ref edcnode1”. Repeating this for edcnode2 gave me my second node. Xen (or libvirt for that matter) stores the VM configuration in xenstore, a backend database which can be interrogated easily. So I dumped the XML configuration of my rhel55ref VM and stored it in edcnode{1,2}.xml with “virsh dumpxml domainName > edcnode{1,2}.xml”, as shown below.
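In commands, roughly:

[root@dom0 ~]# cd /var/lib/xen/images
[root@dom0 images]# cp -a rhel55ref edcnode1
[root@dom0 images]# cp -a rhel55ref edcnode2
[root@dom0 images]# virsh dumpxml rhel55ref > edcnode1.xml
[root@dom0 images]# cp edcnode1.xml edcnode2.xml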

The domU folder contains the virtual disk for the root file system of my VM, called disk0. I then created a new “hard disk”, called disk1, to contain the Oracle binaries. Experience told me not to make that too small; 20G should be enough for my /u01 mount point for Grid Infrastructure and the RDBMS binaries.

I like to speed the file creation up by using the sparse file trick: the file disk1 will be reported to be 20G in size, but it will only consume that space as the virtual machine actually needs it. It’s a bit like Oracle creating a temporary tablespace.
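One way to create such a file is dd with a zero count and a large seek, which allocates no blocks; a sketch, run inside the domU’s image directory:

[root@dom0 edcnode1]# dd if=/dev/zero of=disk1 bs=1M count=0 seek=20480
[root@dom0 edcnode1]# ls -lh disk1      # reports 20G
[root@dom0 edcnode1]# du -h disk1       # reports next to nothing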

With that information it’s time to modify the dumped XML file. Again it’s important to define MAC addresses for the network interfaces, otherwise the system will try and use DHCP for your NICs, destroying the carefully crafted /etc/sysconfig/network-scripts/ifcfg-eth{0,1,2} files. Oh, and remember that the first three octets of the MAC address are reserved for Xen, so don’t change “00:16:3e”! Your UUID also has to be unique. In the end my first VM’s XML description looked like this:
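The file is too long to reproduce in full; the interface section looked roughly like this, with the MAC suffixes being my (hypothetical) choice:

    <interface type='bridge'>
      <mac address='00:16:3e:00:99:56'/>
      <source bridge='br1'/>
    </interface>
    <interface type='bridge'>
      <mac address='00:16:3e:00:64:56'/>
      <source bridge='br2'/>
    </interface>
    <interface type='bridge'>
      <mac address='00:16:3e:00:65:56'/>
      <source bridge='br3'/>
    </interface>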

You can see that the interfaces refer to br1, br2, and br3. These are the ones that were previously defined in the first article. The <target> element inside the <interface> tag doesn’t matter, as that will be dynamically assigned anyway.

You are directly connected to the VM’s console (80×24, just like in the old times!) and have to wait a looooong time for the DHCP requests for eth0, eth1 and eth2 to time out. This is the first thing to address. As root, log in to the system and navigate straight to /etc/sysconfig/network-scripts to change ifcfg-eth{0,1,2}. Alternatively, use system-config-network-tui to change the network settings.

The following settings should be used for edcnode1:

eth0: 192.168.99.56/24

eth1: 192.168.100.56/24

eth2: 192.168.101.56/24

These are the settings for edcnode2:

eth0: 192.168.99.58/24

eth1: 192.168.100.58/24

eth2: 192.168.101.58/24

The nameserver for both is my dom0, in this case 192.168.99.10. Enter the appropriate hostname as well as the nameserver. Note that 192.168.99.57 and .59 are reserved for the node VIPs, hence the “gap”. Then edit /etc/hosts to enter the information about the private interconnect, which for obvious reasons is not included in DNS. If you like, persist your public and VIP information in /etc/hosts as well. Don’t do this with the SCAN though; it is not recommended to have the SCAN resolve through /etc/hosts, although it works.
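My interconnect entries in /etc/hosts look like this on both nodes; the -priv names are my convention:

192.168.100.56   edcnode1-priv.localdomain   edcnode1-priv
192.168.100.58   edcnode2-priv.localdomain   edcnode2-priv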

Now’s the big moment: restart the network services and get out of the uncomfortable 80×24 character limitation:

[root@edcnode1]# service network restart

The complete configuration is printed here for the sake of completeness for edcnode1:

It's important to edit the initiator name, i.e. the name the initiator reports back to OpenFiler. I changed it to include edcnode1 and edcnode2 on their respective hosts. The file to edit is /etc/iscsi/initiatorname.iscsi
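On Enterprise Linux the file contains a single InitiatorName line; mine ended up looking like this (the IQN suffix is yours to choose):

[root@edcnode1 ~]# cat /etc/iscsi/initiatorname.iscsi
InitiatorName=iqn.1994-05.com.redhat:edcnode1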

Fine! Now over to fdisk the new devices. I know that my “local” storage is named /dev/xvd*, so anything new (“/dev/sd*”) will be iSCSI-provided storage. If you are unsure you can always check the /var/log/messages file to see which devices have just been discovered. You should see something similar to this output:

The output will continue with /dev/sdb and other devices exported by the filer.

Prepare the local Oracle Installation

Using fdisk, modify /dev/xvdb, create a partition spanning the whole disk and set its type to “8e”, Linux LVM. It’s always a good idea to install the Oracle binaries into LVM; it makes later extension of a file system easier. I’ll add the fdisk output here for this device but won’t for later partitioning exercises.

[root@edcnode1 ~]# fdisk /dev/xvdb
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that, of course, the previous
content won't be recoverable.
The number of cylinders for this disk is set to 1305.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
(e.g., DOS FDISK, OS/2 FDISK)
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)
Command (m for help): n
Command action
e extended
p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-1305, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-1305, default 1305):
Using default value 1305
Command (m for help): t
Selected partition 1
Hex code (type L to list codes): 8e
Changed system type of partition 1 to 8e (Linux LVM)
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.

Once /dev/xvdb1 is ready, we need to start its transformation into a logical volume. First, a physical volume is to be created:
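For example:

[root@edcnode1 ~]# pvcreate /dev/xvdb1
  Physical volume "/dev/xvdb1" successfully created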

The physical volume (“PV”) is then used to form a volume group (“VG”). In real life, you’d probably have more than 1 PV to form a VG… I named my volume group “oracle_vg”. The existing volume group is called “root_vg” by the way.
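Creating it is a one-liner:

[root@edcnode1 ~]# vgcreate oracle_vg /dev/xvdb1
  Volume group "oracle_vg" successfully created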

Wonderful! I never quite remember how many extents this VG has, so I need to query it. Using “--size 10g” would throw an error, as some internal overhead reduces the available capacity to something just shy of 10G; creating the LV by extent count avoids that.
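A sketch of querying the extents and creating the LV; the numbers are illustrative and the LV name is my choice:

[root@edcnode1 ~]# vgdisplay oracle_vg | grep -i free
  Free  PE / Size       2559 / 10.00 GB
[root@edcnode1 ~]# lvcreate -n u01_lv -l 2559 oracle_vg
  Logical volume "u01_lv" created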

You are done! Create the mountpoint for your Oracle installation, /u01 in my case, and grant oracle:oinstall ownership to it. In this lab exercise I didn’t create a separate owner for the Grid Infrastructure to avoid potentially undiscovered problems in 11.2.0.2 and stretched RAC. Finally add this to /etc/fstab to make it persistent:
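A sketch of those steps, assuming the logical volume created above:

[root@edcnode1 ~]# mkfs.ext3 /dev/oracle_vg/u01_lv
[root@edcnode1 ~]# mkdir -p /u01
[root@edcnode1 ~]# chown oracle:oinstall /u01
[root@edcnode1 ~]# echo "/dev/oracle_vg/u01_lv /u01 ext3 defaults 1 2" >> /etc/fstab
[root@edcnode1 ~]# mount /u01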

Now continue to partition the iSCSI volumes, but don’t create file systems on top of them. You should not assign a partition type other than the default “Linux” to it either.

ASMLib

Yes, I know… the age-old argument, but I decided to use it anyway. The reason is simple: scsi_id doesn’t return a value in a para-virtualised Linux guest, which makes it impossible to set up device name persistence with udev. And ASMLib is easier to use anyway! But if your system administrators are database-agnostic and not willing to learn the basics about ASM, then ASMLib is probably not a good idea to roll out. It’s only a matter of time until someone runs an “rpm -Uhv kernel*” on your box and of course a) doesn’t tell the DBAs and b) doesn’t bother installing the matching ASMLib kernel module. But I digress.

Before you are able to use ASMLib you have to configure it on each cluster node. A sample session could look like this:

[root@edcnode1 ~]# /etc/init.d/oracleasm configure
Configuring the Oracle ASM library driver.
This will configure the on-boot properties of the Oracle ASM library
driver. The following questions will determine whether the driver is
loaded on boot and what permissions it will have. The current values
will be shown in brackets ('[]'). Hitting <ENTER> without typing an
answer will keep that current value. Ctrl-C will abort.
Default user to own the driver interface []: oracle
Default group to own the driver interface []: dba
Start Oracle ASM library driver on boot (y/n) [n]:
Scan for Oracle ASM disks on boot (y/n) [y]:
Writing Oracle ASM library driver configuration: done
Dropping Oracle ASMLib disks: [ OK ]
Shutting down the Oracle ASMLib driver: [ OK ]
[root@edcnode1 ~]#

Now with this done, it is possible to create the ASMLib-maintained ASM disks. For the LUNs presented by filer01, these will be:

ASM01FILER01

ASM02FILER01

OCR01FILER01

OCR02FILER01

The disks are created using the /etc/init.d/oracleasm createdisk command as in these examples:
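The device-to-disk mapping below is an assumption; check /var/log/messages for the actual iSCSI device names on your system:

[root@edcnode1 ~]# /etc/init.d/oracleasm createdisk ASM01FILER01 /dev/sda1
Marking disk "ASM01FILER01" as an ASM disk:                [  OK  ]
[root@edcnode1 ~]# /etc/init.d/oracleasm createdisk ASM02FILER01 /dev/sdb1
Marking disk "ASM02FILER01" as an ASM disk:                [  OK  ]
[root@edcnode1 ~]# /etc/init.d/oracleasm createdisk OCR01FILER01 /dev/sdc1
Marking disk "OCR01FILER01" as an ASM disk:                [  OK  ]
[root@edcnode1 ~]# /etc/init.d/oracleasm createdisk OCR02FILER01 /dev/sdd1
Marking disk "OCR02FILER01" as an ASM disk:                [  OK  ]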

Switch over to the second node now to validate the configuration and to continue the configuration of the iSCSI LUNs from filer02. Define the domU with a similar configuration file as shown above for edcnode1, and start it. Once the wait for the DHCP timeouts is over and you are presented with a login, set up the network as shown above. Install the iSCSI initiator package, change the initiator name and discover the targets from filer02 in addition to those from filer01.

All done! And I seriously thought initially that this was going to be a shorter post than the others; how wrong I was. Congratulations on having arrived here at the bottom of the article, by the way.

In the course of this post I prepared my virtual machines for the installation of Grid Infrastructure. The ASM disk names will be persistent across reboots thanks to ASMLib, and no messing around with udev for that matter. You might notice that there are 2 ASM disks from filer01 but only 1 from filer02 for the voting disk/OCR disk group, and that’s for a reason. I’m cheeky and won’t tell you here; that’s for another post later…

As you may know, Oracle released the first patchset on top of 11g Release 2. At the time of this writing, the patchset is out for 32bit and 64bit Linux, and 32bit and 64bit Solaris SPARC and Intel. What an interesting combination of platforms… I thought there was no Solaris 32bit on Intel anymore.

Upgrade

Oracle has come up with a fundamentally different approach to patching with this patchset. The long version can be found in MOS document 1189783.1, “Important Changes to Oracle Database Patch Sets Starting With 11.2.0.2”. The short version is that new patchsets will be supplied as full releases. This is really cool, and some people have asked why that wasn’t always the case. In 10g Release 2, to get to the latest version with all the patches, you had to

Install the base release for Clusterware, ASM and at least one RDBMS home

Install the latest patchset on Clusterware, ASM and the RDBMS home

Apply the latest PSU for Clusterware/RDBMS, ASM and RDBMS

Applying the PSUs for Clusterware in particular was very labour-intensive. In fact, for a fresh install it was usually easier to install and patch everything on only one node and then extend the patched software homes to the other nodes of the cluster.

Now in 11.2.0.2 things are different. You no longer have to apply any of the interim releases; the patchset contains everything you need, already at the correct version. The above process is shortened to:

Install Grid Infrastructure 11.2.0.2

Install RDBMS home 11.2.0.2

Optionally, apply PSUs or other patches when they become available. Currently, MOS note 756671.1 doesn’t list any patch as recommended on top of 11.2.0.2.

Interestingly, upgrading from 11.2.0.1 to 11.2.0.2 is more painful than from Oracle 10g, at least on the Linux platform. Before you can run rootupgrade.sh, the script tests whether you have applied the Grid Infrastructure PSU for 11.2.0.1.2. OUI hadn’t performed this test when it checked the prerequisites, which caught me off-guard. The casual observer may now ask: why do I have to apply a PSU when the bug fixes should be rolled up into the patchset anyway? I honestly don’t have an answer, other than that if you are not on Linux you should be fine.

Grid Infrastructure will be an out-of-place upgrade, which means you have to manage your local disk space very carefully from now on. I would not use anything less than 50-75G on my Grid Infrastructure mount point. This takes the new cluster health monitor facility (see below) into account, as well as the fact that Oracle performs log rotation for most logs in $GRID_HOME/log.

The RDBMS binaries can be patched either in-place or out-of-place. I’d say that the out-of-place upgrade for RDBMS binaries is wholeheartedly recommended as it makes backing out a change so much easier. As I said, you don’t have a choice for Grid Infrastructure which is always out-of-place.

And then there is the multicast issue Julian Dyke (http://juliandyke.wordpress.com/) has written about. I couldn’t reproduce the test case, and my lab and real-life clusters run 11.2.0.2 happily.

Changes to Grid Infrastructure

After the successful upgrade you’d be surprised to find new resources in Grid Infrastructure. Have a look at these:
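For example:

[grid@node1] $ crsctl stat res -t -init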

The cluster_interconnect.haip resource is yet another step towards the self-contained system. The Grid Infrastructure installation guide for Linux states:

“With Redundant Interconnect Usage, you can identify multiple interfaces to use for the cluster private network, without the need of using bonding or other technologies. This functionality is available starting with Oracle Database 11g Release 2 (11.2.0.2).”

So, good news for anyone relying on third-party software, such as HP ServiceGuard, for network bonding. Linux has always done this for you, even in the times of the 2.4 kernel, and Linux network bonding is actually quite simple to set up.

But anyway, I’ll run a few tests in the lab when I have time with this new feature enabled, deliberately taking down NICs to see if it works as it says on the tin. The documentation states that you don’t need to bond your NICs for the private interconnect: simply leave the ethx devices (or whatever name your NICs have on your OS) as they are, and mark the ones you’d like to use for the private interconnect as private during the installation. If you decide to add a NIC for use with the private interconnect later, use oifcfg as root to add the new interface (or watch this space for a later blog post on this). Oracle states that if one of the private interconnects fails, it will transparently use another one. In addition to the high availability benefit, Oracle apparently also performs load balancing across the configured interconnects.
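Adding an interface later would look something like this; the interface name and subnet are examples:

[root@node1 ~]# oifcfg setif -global eth3/192.168.100.0:cluster_interconnect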

To learn more about the redundant interconnect feature I had a glance at its profile. As with any resource in the lower stack (or HA stack), you need to append the “-init” argument to crsctl.
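The profile can be queried like this; its output, not reproduced here, includes the agent file name and the start dependencies referred to below:

[grid@node1] $ crsctl stat res ora.cluster_interconnect.haip -init -p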

With this information at hand, we see that the resource is controlled through ORAROOTAGENT, and judging from the start sequence position and the fact that we queried crsctl with the “-init” flag, it must be OHASD’s ORAROOTAGENT.

Indeed, there are references to it in the $GRID_HOME/log/`hostname -s`/agent/ohasd/orarootagent_root/ directory. Further reference to the resource was found in cssd.log which makes perfect sense: it will use it for many things, last but not least fencing.

OK, I now understand this a bit better. But the log information mentioned something else as well: an IP address that I hadn’t assigned to the cluster. It turns out that this is another virtual IP on the private interconnect, called bond1:1

Substitute the correct values of course for interface and source address.

Oracle CRF resources

Another interesting new feature is the CRF resource, which seems to be an implementation of the IPD/OS Cluster Health Monitor on the servers. I need to dig a little deeper into this feature; currently I can’t get any configuration data from the cluster:

[grid@node1] $ oclumon showobjects
Following nodes are attached to the loggerd
[grid@node1] $

You will see some additional background processes now, namely ologgerd and osysmond.bin, which are started through the CRF resource. The resource profile (shown below) suggests that this resource is started through OHASD’s ORAROOTAGENT and can take custom logging levels.

An investigation of orarootagent_root.log revealed that the root agent indeed starts the CRF resource. This resource will start the ologgerd and osysmond processes, which then write their log files into $GRID_HOME/log/`hostname -s`/crf{logd,mond}.

Configuration of the daemons can be found in $GRID_HOME/ologgerd/init and $GRID_HOME/osysmond/init. Except for the PID files for the daemons, there didn’t seem to be anything of value in these directories.

The command line of the ologgerd process shows its configuration options:
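For example:

[grid@node1] $ ps -ef | grep [o]loggerd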

The files in the directory specified by the “-d” flag denote where the process stores its logging information. The files are in BDB format, or Berkeley DB (now Oracle too). The oclumon tool should be able to read these files, but until I can persuade it to connect to the host there is no output.

CVU

Unlike the previous resources, the cvu resource is actually cluster-aware. It’s the Cluster Verification Utility we all know from installing RAC. Going by the profile (shown below), I conclude that the utility is run through the grid software owner’s scriptagent and has exactly 1 incarnation on the cluster. It is only executed every 6 hours and restarted if it fails. If you would like to execute a manual check, simply execute the action script with the command line argument “check”.

Enough for now, this has become a far longer post than I initially anticipated. There are so many more new things around, such as Quality of Service Management, that need exploring, making it very difficult to keep up.

Finally time for a new series! With the arrival of the new 11.2.0.2 patchset I thought it was about time to try and set up a virtual 11.2.0.2 extended distance, or stretched, RAC. So, it’s virtual, fair enough. It doesn’t allow me to test things like the impact of latency on the inter-SAN communication, but it allows me to test the general setup. Think of this series as a guide for after all the tedious work has been done and the SANs happily talk to each other. The example requires some understanding of how Xen virtualisation works, and it’s tailored to openSuSE 11.2 as the dom0 or “host”. I have tried Oracle VM in the past, but back then a domU (or virtual machine) could not mount an iSCSI target without a kernel panic and reboot. Clearly not what I needed at the time. OpenSuSE has another advantage: it uses a recent kernel, not the 3-year-old 2.6.18 you find in enterprise distributions. Also, Xen is recent (openSuSE 11.3 even features Xen 4.0!) and so is libvirt.

The Setup

The general idea follows the design you find in the field, but with fewer cluster nodes. I am thinking of 2 nodes for the cluster, and 2 iSCSI target providers. I wouldn’t use iSCSI in the real world, but my lab isn’t connected to an EVA or similar. A third site will provide quorum via an NFS-provided voting disk.

Site A will consist of filer01 for the storage part, and edcnode1 as the RAC node. Site B will consist of filer02 and edcnode2. The iSCSI targets are going to be provided by openFiler’s domU installation, and the cluster nodes will make use of Oracle Enterprise Linux 5 update 5. To make it more realistic, site C will consist of another openFiler instance, filer03, to provide the NFS export for the third voting disk. Note that openFiler seems to support NFS v3 only at the time of this writing. All systems are 64bit.

The network connectivity will go through 3 virtual switches, all “host only” on my dom0.

Public network: 192.168.99/24

Private network: 192.168.100/24

Storage network: 192.168.101/24

As in the real world, private and storage network have to be separated to prevent iSCSI packets clashing with Cache Fusion traffic. Also, I increased the MTU for the private and storage networks to 9000 instead of the default 1500. If you would like to use jumbo frames you should check whether your switch supports them.
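On the Oracle Enterprise Linux nodes this is a one-line addition to the interface configuration files; an excerpt from ifcfg-eth1 on edcnode1 (repeat for eth2 with the storage network address):

DEVICE=eth1
BOOTPROTO=static
IPADDR=192.168.100.56
NETMASK=255.255.255.0
ONBOOT=yes
MTU=9000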

Grid Infrastructure will use ASM to store OCR and voting disks, and the inter-SAN replication will also be performed by ASM in normal redundancy. I am planning on using preferred mirror read and intelligent data placement to see if that makes a difference.

Known limitations

This setup has some limitations, such as the following:

You cannot test inter-site SAN connectivity problems

You cannot make use of udev for the ASM devices: a Xen domU doesn’t report anything back from /sbin/scsi_id, which makes the mapping to /dev/mapper impossible (maybe someone knows a workaround?)

Network interfaces are not bonded; you certainly would use bonded NICs in real life

No “real” fibre channel connectivity between the cluster nodes

So much for the introduction; I’ll post the setup step by step. The intended series will consist of these articles:

Introduction to XEN on openSuSE 11.2 and dom0 setup

Introduction to openFiler and their installation as a virtual machine

Setting up the cluster nodes

Installing Grid Infrastructure 11.2.0.2

Adding third voting disk on NFS

Installing RDBMS binaries

Creating a database

That’s it for today, I hope I got you interested and following the series. It’s been real fun doing it; now it’s about writing it all up.

Just a quick one to announce that I’ll present at said event. Here’s the short synopsis of my talk:

Upgrading to Oracle Real Application Clusters 11.2

With the end of premier support in sight mid-2011, many businesses are starting to look at possible upgrade paths. With the majority of RAC systems deployed on Oracle 10g, there is a strong demand to upgrade these systems to 11.2. The presentation focuses on different upgrade paths, including Grid Infrastructure and the RDBMS. Alternative approaches to upgrading the software will be discussed as well. Experience from migrations performed at a large financial institution rounds off the presentation.

I have already written about the renamedg command, but have since fallen in love with ASMLib. The use of ASMLib introduces a few caveats you should be aware of.

USAGE NOTES

This document presents research I performed with ASM in a lab environment. It should be applicable to any environment, but you should NOT use this in production: the renamedg command is still buggy, and you should not mess with ASM disk headers in an important system such as production or staging/UAT; you decide what counts as important! The recommended setup for cloning disk groups is to use a Data Guard physical standby database on a different storage array to create a real-time copy of your production database on that array. Again, do not use your production array for this!

Walking through a renamedg session

Oracle ASMLib introduces a new value in the ASM disk header, called the provider string, as the following example shows:

The prefix “ORCLDISK” is automatically added by ASMLib and cannot easily be changed.

The problem with ASMLib is that the renamedg command does NOT update the provider string, which I’ll illustrate by walking through an example session. Disk group “DATA”, set up with external redundancy and two disks, DATA1 and DATA2, is to be cloned to “DATACLONE”.

The renamedg command requires the disk group being cloned to be dismounted. To prevent nasty surprises, you should also stop the databases using that disk group manually.
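Both steps can be done with srvctl; the database name is an example:

[oracle@rac11gr2drnode1 ~]$ srvctl stop database -d dev
[oracle@rac11gr2drnode1 ~]$ srvctl stop diskgroup -g data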

So apart from files in other disk groups, no files are open, and especially none referring to disk group DATA.

Now comes the part where you copy the LUNs, and this entirely depends on your system. The EVA series of storage arrays I worked with in this particular project offered a “snapclone” function, which uses copy-on-write to create an identical copy of the source LUN, with a new WWID (which can be an input parameter to the snapclone call). When you are using device-mapper-multipath, ensure that your sys admins add the newly created LUNs to the /etc/multipath.conf file on all cluster nodes!

I am using Xen in my lab, which makes it simpler: all I need to do is copy the disk containers on the dom0 and then add the new block devices to the running domUs (“virtual machines” in Xen language). This can be done easily, as the following example shows:
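A sketch with hypothetical file names for the copied disk containers:

[root@dom0 ~]# xm block-attach rac11gr2drnode1 file:/var/lib/xen/images/copies/data1.img xvdg 'w!'
[root@dom0 ~]# xm block-attach rac11gr2drnode1 file:/var/lib/xen/images/copies/data2.img xvdh 'w!'
[root@dom0 ~]# xm block-attach rac11gr2drnode2 file:/var/lib/xen/images/copies/data1.img xvdg 'w!'
[root@dom0 ~]# xm block-attach rac11gr2drnode2 file:/var/lib/xen/images/copies/data2.img xvdh 'w!'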

In the example, rac11gr2drnode{1,2} are the domUs, the backend device is the copied file on the file system, the frontend device in the domU is xvd{g,h}, and the mode is read/write, shareable. The exclamation mark is crucial; without it the second domU can’t attach the new block device, as it is already exclusively attached to another domU.

The fdisk command in my example immediately “sees” the new LUNs; with device-mapper multipathing you might have to go through iterations of restarting multipathd and discovering partitions using kpartx. It is again very important to have all disks presented to all cluster nodes!
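For reference, the renamedg call itself might look like this; the discovery string is an assumption, and it must only match the disks you intend to rename:

[oracle@rac11gr2drnode1 ~]$ renamedg phase=both dgname=DATA newdgname=DATACLONE \
      asm_diskstring='/dev/xvd[gh]1' verbose=true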

Although the original disks (/dev/xvdd1 and /dev/xvde1) had their disk group name changed, the provider string remained untouched. So if we were to issue a scandisks command now through /etc/init.d/oracleasm, there’d still be duplicate disk names. This is a bug in my opinion, and a bad thing.

Renaming the disks is straightforward; the difficult bit is finding out which ones have to be renamed. Again, you can use kfed to figure that out. I knew the disks to be renamed were /dev/xvdd1 and /dev/xvde1 after consulting the header information:
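A sketch of checking a header with kfed and renaming the ASMLib labels afterwards; the new disk names are my choice:

[oracle@rac11gr2drnode1 ~]$ kfed read /dev/xvdd1 | egrep "dskname|grpname|provstr"
[root@rac11gr2drnode1 ~]# /etc/init.d/oracleasm force-renamedisk /dev/xvdd1 DATACLONE1
[root@rac11gr2drnode1 ~]# /etc/init.d/oracleasm force-renamedisk /dev/xvde1 DATACLONE2
[root@rac11gr2drnode1 ~]# /etc/init.d/oracleasm scandisks
[root@rac11gr2drnode1 ~]# /etc/init.d/oracleasm listdisks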

Sure enough, the cloned disks were present. Although everything seemed OK at this point, I could not start disk group DATA and had to reboot the cluster nodes to rectify the problem. Maybe there is some not-so-transient information stored somewhere about ASM disks. After the reboot, CRS started my database correctly, and with all dependent resources:

Quick post and note to self. Where are the SCAN listener log files? A little bit of troubleshooting was required, but I guess I could have read the manuals too. In the end it turned out to be quite simple!

First of all, I needed to find out where on my four-node cluster I had a SCAN listener running. This is done quite easily by asking Clusterware:
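For example, with output along these lines:

[grid@rac11gr2node1 ~]$ srvctl status scan_listener
SCAN Listener LISTENER_SCAN1 is enabled
SCAN listener LISTENER_SCAN1 is running on node rac11gr2node2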

Well maybe not. The next option is to query the listener itself via lsnrctl. Nothing easier than that:

LSNRCTL> set current_listener LISTENER_SCAN1
Current Listener is LISTENER_SCAN1
LSNRCTL> show log_file
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER_SCAN1)))
LISTENER_SCAN1 parameter "log_file" set to /u01/app/grid/product/11.2.0/crs/log/diag/tnslsnr/rac11gr2node2/listener_scan1/alert/log.xml
The command completed successfully
LSNRCTL>

Aha, it uses the ADR as well. So back there, change the base and query the file:
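Something like this, using the path from the log_file parameter shown above:

[grid@rac11gr2node2 ~]$ adrci
adrci> set base /u01/app/grid/product/11.2.0/crs/log
adrci> show homes
adrci> set homepath diag/tnslsnr/rac11gr2node2/listener_scan1
adrci> show alert -tail 20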