I will be presenting a session titled “An 18 pointers guide to setting up an Exadata machine” at Cloud Day being organized by North India chapter of AIOUG. Vivek Sharma is doing multiple sessions on various cloud and performance related topics. You can register for the event here

If you have installed some one off ksplice fix for kernel on Exadata, remember to uninstall it before you do a kernel upgrade eg regular Exadata patching. As such fixes are kernel version specific so they may not work with the newer version of the kernel.

A colleague was working on an ASM issue (Standalone one, Version 11.2.0.3 on AIX) at one of the customer sites. Later on, I also joined him. The issue was that the customer added few news disks to an existing diskgroup. Everything went well and the rebalance kicked in. After some time, something happened and all of a sudden the diskgroup was dismounted. While trying the mount the diskgroup again, it was giving

ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "27" is missing from group number "2"

At this stage disk 27 was not readable even with dd. So that means something is wrong with the disk. Since it is an external redundancy diskgroup not much can be done until the disk becomes available.

Speaking to the storage team cleared the air. One that the disk had gone offline at storage level so that is why even dd was not able to read it. Two that all these disks were thin provisioned(over provisioning of the storage space to improve the utilization; similar to over provisioning of CPU cores in the Virtualization world) from the storage. This particular disk 27 was meant for some other purpose but got wrongly allocated to this diskgroup. The actual space available in the pool (of this disk) was less than what was needed. The moment disks were added to the diskgroup, the rebalance kicked in and ASM started writing data to the disk. Within few minutes space became full and the storage software took the disk offline. Since ASM couldn’t write to the disk, the diskgroup was dismounted.

Fortunately, in the same pool, there was another disk that was still unused. So the storage guy dropped that disk and it freed up some space in the pool. He brought this disk 27 online after that. Diskgroup got mounted and the rebalance kicked in again. Finally, we dropped this disk and the rebalance started again. Once the rebalance completed, disk was free to be taken offline.

Scenario : Setting up a physical standby from Exadata to a non-Exadata single instance. tnsping from standby to primary works fine but tnsping from primary to standby fails with:

TNS-12543: TNS:destination host unreachable

I am able to ssh standby from primary, can ping as well but tnsping doesn’t work. From the error description we can figure out that something is blocking the access. In this case it was iptables that was enabled on the standby server.

Stopping the service resolved the issue.

service iptables stop
chkconfig iptables off

The error is an obvious one but sometimes it just doesn’t strike you that it could be something simple like that.

Hit this silly issue in one of the data guard environments today. Primary is a 2 node RAC running 11.2.0.4 and standby is also a 2 node RAC. Archive logs from node2 aren’t shipping and the error being reported is

ORA-12154: TNS:could not resolve the connect identifier specified

We tried usual things like going to $TNS_ADMIN, checking the entry in tnsnames.ora and then also trying to connect using sqlplus sys@target as sysdba. Everything seemed to be good but logs were not shipping and the same problem was being reported repeatedly. As everything on node1 was working fine so it looked even more weird.

From the error it is clear that the issue is with tnsnames entry. Finally found the issue after some 30 mins. It was an Oracle EBS environment so the TNS_ADMIN was set to the standard $ORACLE_HOME/network/admin/*hostname* path (on both the nodes). On node1 there was no tnsnames.ora file in $ORACLE_HOME/network/admin so it was connecting to the standby using the Apps tnsnames.ora which was having the correct entry for standby. On node2 there was a file called tnsnames.ora in $ORACLE_HOME/network/admin but it was not having any entry for standby. It was trying to connect using that file (the default tns path) and failing with ORA-12154. Once we removed that file, it started using the Apps tnsnames.ora and logs started shipping.

So we can see the issue here. We can look at the above trace file also for more detail.

Now to why did this happen ?

The RECOC1 is a HIGH redundancy disk group which means that if we want to place Voting files there, it should have at least 5 failure groups. In this configuration there are only 3 cells and that doesn’t meet the minimum failure groups condition (1 cell = 1 failgroup in Exadata).

Now to how did it happen ?

This one was an Exadata X3 half rack and we planned to deploy it (for testing purpose) as 2 quarter racks : 1st cluster with db1, db2 + cell1, cell2, cell3 and 2nd cluster with db3, db4 + cell4, cell5, cell6, cell7. All the disk groups to be in High redundancy.

Before a certain 12.x Exadata software version it was not even possible to have all disk groups in High redundancy in a quarter rack as to have Voting disk in a High redundancy disk group you need to have a minimum of 5 failure groups (as mentioned above). In a quarter rack you have only 3 fail groups. With a certain 12.x Exadata software version a new feature quorum disks was introduced which made is possible to have that configuration. Read this link for more details. Basically we take a slice of disk from each DB node and add it to the disk group where you want to have the Voting file. 3 cells + 2 disks from DB nodes makes it 5 so all is good.

Now while starting with the deployment we noticed that db node1 was having some hardware issues. As we needed the machine for testing so we decided to build the first cluster with 1 db node only. So the final configuration of 1st cluster had 1 db node + 3 cells. We imported the XML back in OEDA, modified the cluster 1 configuration to 1 db node and generated the configuration files. That is where the problem started. The RECO disk group still was High redundancy but as we had only 1 db node at this stage so the configuration was not even a candidate for quorum disks. Hence the above error. Changing DBFS_DG to Normal redundancy fixed the issue as when DBFS_DG is Normal redundancy, OneCommand will place the Voting files there.

Ideally it shouldn’t happened as OEDA shouldn’t allow a configuration that is not doable. The case here is that as originally the configuration was having 2 db nodes + 3 cells so High redundancy for all disk groups was allowed in OEDA. While modifying the configuration when one db node was removed from the cluster, OEDA probably didn’t run the redundancy check on disk groups and it allowed the go past that screen. If you try to create a new configuration with 1 db node + 3 cells, it will not allow you to choose High redundancy for all disk groups. DBFS will remain in Normal redundancy. You can’t change that.

Hit this silly issue while doing an Exadata deployment for a customer. Step 1 was giving the following error:

ERROR: 192.168.99.102 configured on dm01celadm01.example.com as dm01dbadm02 does not match expected value dm01dbadm02.example.com

I wasn’t able to make sense of it for quite some time until a colleague pointed out that the reverse lookup entries should be done for FQDN only. As it is clear in the above message reverse lookup of the IP 192.168.99.102 returns dm01dbadm02 instead of dm01dbadm02.example.com. Fixing this in DNS resolved the issue.

Actually the customer had done reverse lookup entries for both the hostname and FQDN. As the DNS can return the results in any order, so the error message was bit random. Whenever the the hostname was returned first, Step 1 gave an error. But when the FQDN was the first thing returned, there was no error in Step 1 for that IP.

ld.so.1: lsnodes: fatal: libskgxn2.so: open failed: No such file or directory

Killed

-bash-4.1$

So looks like it is not able to find this library called libskgxn2.so. If we do a find for this file name we can see that it is present in this directory /usr/cluster/lib/sparcv9/libskgxn2.so .

Some googling and MOS searches revealed that it expects the library to be present at /opt/ORCLcluster/lib. This directory doesn’t exist here. As a workaround we can create this directory manually and create symbolic link to file libskgxn2.so

The lsnodes command worked fine after this workaround and installer also shows both the nodes listed.

Just a stupid error. Posting it so that someone else googling for the same thing can get a clue.

An ASM instance running with default parameters (no pfile, no spfile). Updated spfile for the instance with asmcmd spset command and bounced crs. After reboot also, it still wasn’t using spfile. Got puzzled and checked GPnP settings again. All looked good. Then in alert log came across this

ERROR: SPFile in diskgroup <> does not match the specified spfile +DATA/asm/asmparameterfile/registry.253.769187275

The problem was that while copying the spfile path the complete name didn’t get copied. The last character got missed. So the filename that it was looking for wasn’t there. Updating GPnP with correct filename and bouncing crs resolved the issue.

So this customer has an Exadata quarter rack and they have an IB listener configured on both DB nodes (for DB connections from a multi-racked Exalogic system). We were adding a new DB node to this rack. So just followed the standard procedure of creating users, directories etc on the new node, setting up ssh equivalence and running addNode.sh. All went fine but root.sh failed. Little looking into the logs revealed that it failed while running srvctl start listener –n <node_name>

If we manually run this command, it will immediately reveal what the problem is. It is not able to start IB listener on the new node as the IB VIP doesn’t yet exist. It could happen for any of the additional networks added.

There is a MOS note that describes this exact situation but the solution that it gives is to remove the additional listener, complete addNode.sh & root.sh and add the additional listener back. That wasn’t possible in this case. After little bit of googling I stumbled upon this post by Jeremy Schneider. His colleague solved this problem with a very simple and clever workaround. Before root.sh prepares to run srvctl start listener command, run the add VIP command from another Window . Additional network would have already got added when root.sh runs on the new node.

To be able to perform this trick, you have to have the hosts file updated with the new VIP name and IP and be ready with the command to add the VIP. While root.sh is running, it will show a message like “there is already an active cluster, restarting to join”, immediately start trying to run srvctl add vip command in another window. The moment CRS, comes up the command will succeed. Immediately after that root.sh is going to run srvctl start listener command, and this time it shouldn’t fail as the VIP is already added.

Another small mistake we made was not updating the cellip.ora on the new node before running root.sh. That caused the root.sh to fail as it couldn’t talk to ASM running on existing cell nodes. Updating cellip.ora with the existing storage node IPs fixed the problem.

So if you are filling an OEDA for Exadata deployment there are few things you should take care of. Most of the screens are self explanatory but there are some bits where one should focus little more. I am running the Aug version of it and the screenshots below are from that version.

On the Define customer networks screen, the client network is the actual network where your data is going to flow. So typically it is going to be bonded (for high availability) and depending upon the network in your data center you have to select one out of 1/10 G copper and 10 G optical.

If you are going to use trunk VLANs for your client network, remember to enabled it by clicking the Advanced button and then entering the relevant VLAN id.

Also if it is going to be an OVM configuration, you may want to have different VMs in different VLAN segments. It will allow you to change VLAN ids for individual VMs on the respective cluster screens like below

If all the cores aren’t licensed remember to enable Capacity on Demand (COD) on the Identify Compute node OS screen.

On the Define clusters screen make sure that you enter a unique (across your environment) cluster name.

The cluster details screen captures some of the most important details like

Whether you want to have flash cache in WriteBack mode instead of WriteThrough

Whether you want to have a role separated install or want to install both GI and Oracle binaries with oracle user itself.

GI & Database versions and home for binaries. Always good to leave it at the Oracle recommended values as that makes the future maintenance easy and less painful.

Disk Group names, redundancy and the space allocation.

Default database name and type (OLTP or DW).

Of course it is important to carefully fill the information in all the screens but the above ones are some of them which should be filled very carefully after capturing the required information from other teams, if needed.

Faced this error while querying v$asm_disk after adding new storage cell IPs to cellip.ora on DB nodes of an existing cluster on Exadata. Query ends with ORA-03113 end-of-file on communication channel and ORA-56841 is reported in $ORA_CRS_HOME/log/<hostname>/diskmon/diskmon.log. Reason in my case was that the new cell was using different subnet for IB. It was pingable from the db nodes but querying v$asm_disk wasn’t working. Changing the subnet for IB on new cell to the one on existing cells fixed the issue.

On a T5 Super Cluster (running 11.2.0.3) I was creating a cascaded standby from an already functional standby using RMAN DUPLICATE and it errored out with

ORA-01671: control file is a backup, cannot make a standby control file

A quick search reveals that it is bug 11715084 that affects most of the 11.x versions except 11.2.0.4. There is a one off patch available for most of the versions or one can install the bundle patch that includes the fix for this patch. I applied BP26 and it worked fine after that.

A rather not so great post about an ORA-00600 error i faced on a standby database. Environement was 11.2.0.3 on Sun Super Cluster machine. MRP process was hitting ORA-00600 while trying to apply a specific archive log.

Some googling and MOS searches revealed that the error was due to corrupted archive log file. Recopying the archive file from primary and restarting the recovery resolved the issue. The fist argument of the ORA-600 is actually the sequence no of the archive it is trying to apply.

Tim Hall has written some brilliant posts about getting going with writing (blogs, whitepapers etc). This post is the result of inspiration from there only. Tim says that just get started with whatever .

If you are into blogging and no so active or even if you aren’t you may want to take a look at all the posts to get some inspiration to document the knowledge you gain on day to day basis.

Many people have asked me this question that how they can learn Exadata ? It starts sounding even more difficult as a lot of people don’t have access to Exadata environments. So thought about writing a small post on the same.

It actually is not as difficult as it sounds. There are a lot of really good resources available from where you can learn about Exadata architecture and the things that work differently from any non-Exadata platform. You might be able to do lot more RnD if you have got access to an Exadata environment but don’t worry if you haven’t. Without that also there is a lot that you can explore. So here we go:

I think the best reference that one can start with is Expert Oracle Exadata book by Tanel Poder, Kerry Osborne and Randy Johnson. As a traditional book covers the subject topic by topic from ground up so it makes a fun read. This book is also no different. It will teach you a lot. They are already working on the second edition. (See here).

Next you can jump to whitepapers on Oracle website Exadata page, blog posts (keep an eye on OraNA.info) and whitepapers written by other folks. There is a lot of useful material out there. You just need to Google a bit.

Exadata documentation (not public yet) should be your next stop if you have got access to it. Patch 10386736 on MOS if you have got the access.

Try to attend an Oracle Users Group conference if there is one happening in your area. Most likely someone would be presenting on Exadata so you can use that opportunity to learn about it. Also you will get a chance to ask him questions.

Lastly if you have an Exadata machine available do all the RnD you can.

I was troubleshooting some Windows hangs on my Desktop system running Windows 8 and enabled driver verifier. Today when I tried to start VirtualBox it failed with this error message.

Failed to load VMMR0.r0 (VERR_LDR_MISMATCH_NATIVE)

Most of the online forums were asking to reinstall VirtualBox to fix the issue. But one of the thread mentioned that it was being caused by Windows Driver Verifier. I disabled it, restarted Windows and VirtualBox worked like a charm. Didn’t have time to do more research as i quickly wanted to test something. May be we can skip some particular stuff from Driver Verifier and VirutalBox can then work.

Few months ago I contributed a chapter (on Monitoring, Troubleshooting and Performance tuning) to a GoldenGate book on Oracle Press that Robert Freeman was authoring. Thought of posting a small update that the book is now out. My name doesn’t appear on the main page but you will see it in the Acknowledgements section Below is a screenshot taken from Amazon preview .

You may want to grab a copy if you are using/planning to use Oracle GoldenGate 11g.

Here is the link to the book page on Amazon. It seems the book is not published in India yet but one can order the imported edition on amazon.in