Wednesday, June 4, 2014

Rdbms-- Standby problem analysis -- a real life example

To check whether if our standby services are running; we check 2 things in general. 1) Are the changes in our primary database are transffered to our standby datasbaee? 2) Are those changes actually applied to our Standby Databases? The second control ,which is meant to check the MRP services, is very important, as we cant say that our standby services are working properly without seeing the logs are actually applied to our standby databases.
So to control or schedule an alert for standby services should be in two phases.
- check if our logs are transferred , if they are transferred then check if they are applied

In this blog post; I will demonstrate a problematic standby environment , which seems not working even at first glance.

In this kind problematic standby situation, I usually check both the primary and standby sites..

Now suppose that we receive an email from the Customer Site, complaining about a Standby problem; here is what we do for diagnosing the situation in general;

Production Database---------------------------------------------------------------------------------
We check if the changes are transferred to the Standby site?

In this case there are gaps as you see.. So by looking to the output of above, we can say that there is something wrong between archiver and rfs .. We can take 81092 and 81093 as our references for diagnosing further.. It is also interesting that after 81093, the logs have continued to be transferred ( 81094, 81095) till the log with sequence no 81096

Then we open alert log of our production databsae and look for the errors, especially error which are recorded while the archiver was processing our problematic archives..

WARN: ARC1: Terminating pid 11757 hung on an I/O operation

WARN: ARC1: Terminating pid 11763 hung on an I/O operation

krsv_proc_kill: Killing 1 processes (Process by index)

krsv_proc_kill: Killing 1 processes (Process by index)

ARC1: Error 16198 due to hung I/O operation to LOG_ARCHIVE_DEST_2

Hmm.. As we see above, it seems we had an IO problem.. Note that : this kind of IO problem can be related with filesystem, permissions or even this problem can be seen because of the network between the prod and the standby sites.

Note that : after analyzing the alert log a little further, we see that archive log was terminated and restart after these errors..

ORA-27054: NFS file system where the file is created or resides is not mounted with correct options

Linux-x86_64 Error: 13: Permission denied

File #23 added to control file as 'UNNAMED00023'

So , here is the problem... A file named system12.dbf was created in the Production Database, and because there is not a similar directory in the Standby site, this file could not be created in the standby site .. Normally, this unfortunate situation prevents our Log apply services from applying the logs.. If we had used DB_FILE_NAME_CONVERT parameter set, then there would be no problems because of this action, As this parameter converts the filename of a new datafile on the primary database to a filename on the standby database..

In conclusion,

We have seen how to check a standby database environment in case of a problematic situation. Also, we could use the same controls for a daily check routine.

More importantly, in this post, we have seen a problematic situation that have made us recall the following facts:

Use db_file_name_convert if you have a different directory structure in Standby database..