
The Case of a Missing or Corrupt Spfile in your Grid Infra (11.2.0.3.0)

Introduction

On a Grid Infrastructure (11.2.0.3.0) cluster (SRVR1R – SRVR2R – SRVR3R) we received more and more tickets reporting that the ASM instances were suffering from shared pool memory fragmentation. The investigation showed that there was no SPFILE present with automatic memory management settings in it, so this environment clearly deserved better. The only thing found was an out-of-date INIT.ORA. Historically this environment had been set up as an 11gR1 RAC cluster, and some time ago it had been upgraded to 11gR2 Grid Infra.

In this blog you will see, step by step, how to move to an SPFILE in a cluster environment. The same procedure can (and should) also be followed when an SPFILE is lost in a RAC cluster.

Note: what was interesting to find out was the way Oracle uses a specific configuration file (the GPnP profile) as a properties file when starting the Clusterware. It is VERY important to understand that the activities below have to be performed on ONE node in the cluster!

After the work had finished I noticed that Oracle had created a new property file. But there was more to be wowed about: after starting the Grid Infra on the other two nodes, I saw that Grid Infra took care of those files on those machines too. So apparently that profile is leading when the ASM instances start and need to locate their SPFILE.

Investigating the environments

Since I had memory issues on the SRVR1R node (+ASM1), I first checked the settings on both other nodes. The interesting part is that a lot of parameters, like cluster_database = true, were already defaulted; I am sure that is because of the property file I mentioned in the previous chapter. I used the values of asm_diskgroups as preparation for my own new INIT.ORA file, in which I would also increase the memory settings.

Plan of approach

I learned that my only chance to get things right was to stop the FULL Clusterware stack on ALL nodes in the cluster. This cannot be done on a per-node basis (trust me, I have investigated that option). After shutting everything down, you then work with the Clusterware on one node in a kind of restricted mode.

Once that situation is in place (everything down, working from one node without CRS active), I would have to present a prepared INIT.ORA and create an SPFILE in a disk group in ASM.

The steps below follow that scenario, describe it, and show the needed aftercare I came across.
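In outline, the whole scenario boils down to this sequence (a sketch; the Grid home is the one from this environment):

## as root, on ALL nodes: stop the full Clusterware stack
crsctl stop crs -f

## as root, on ONE node only: start the stack exclusive, without CRS
crsctl start crs -excl -nocrs

## as the ASM/Grid owner on that node: start ASM with the prepared pfile,
## create the SPFILE in a diskgroup, shut the instance down again

## as root, on that same node: stop the exclusive stack, then restart normally
crsctl stop crs
crsctl start crs

## finally start the stack on the remaining nodes as usual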

Stopping the grid infra:

First I stopped the databases using the corresponding srvctl commands. NOTE: this is a mixed environment of 11.1 and 11.2 databases and listeners, all under control of the 11.2.0.3.0 Grid Infrastructure, so you NEED to use the appropriate environment settings before going to work.
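As an illustration, stopping one of the 11.2 databases could look like this (the names MYDB1/MYDB are hypothetical; for the 11.1 databases the 11.1 home and its own srvctl have to be used):

export ORACLE_SID=MYDB1
export ORAENV_ASK=NO
. oraenv
srvctl stop database -d MYDB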

As ROOT:

To be done on ALL nodes!

export PATH=/opt/crs/product/112_ee_64/crs/bin:$PATH

which crsctl

crsctl stop crs -f
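Before moving on, it pays to verify on each node that the stack is really down; crsctl should then report that Oracle High Availability Services is offline:

crsctl check crs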

Stopping Grid Infra Information on SRVR1R

The first attempt showed the errors below (marked in red). That was to be expected, since SRVR1R (+ASM1) was the node having the memory issues in the first place.

Oh well, that did not work out very well. The system was hanging, and after some time I noticed the node had panicked and rebooted itself. That was not great at all. I showed patience and waited for the node to come back in full. I asked a Linux colleague to take a look, and in the end he confirmed: the system is struggling, but it will come online. So I waited it out.

Second attempt stop Grid Infra SRVR1R

OK, to let the cat out of the bag: this time it looked MUCH better!

CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'SRVR1R' has completed

CRS-4133: Oracle High Availability Services has been stopped.

So this time there was not only a happy node but also a happy DBA. Of course I did a final check:

ps -ef|grep d.bin

And all looks great now.

Starting Grid Infra on one node in exclusive (restricted) mode:

As the root user I had to start the Clusterware (Grid Infra) in exclusive mode without starting CRS.

SRVR1R:root:/root # crsctl start crs -excl -nocrs

CRS-4123: Oracle High Availability Services has been started.

CRS-2672: Attempting to start 'ora.mdnsd' on 'SRVR1R'

CRS-2676: Start of 'ora.mdnsd' on 'SRVR1R' succeeded

CRS-2672: Attempting to start 'ora.gpnpd' on 'SRVR1R'

CRS-2676: Start of 'ora.gpnpd' on 'SRVR1R' succeeded

CRS-2672: Attempting to start 'ora.cssdmonitor' on 'SRVR1R'

CRS-2672: Attempting to start 'ora.gipcd' on 'SRVR1R'

CRS-2676: Start of 'ora.cssdmonitor' on 'SRVR1R' succeeded

CRS-2676: Start of 'ora.gipcd' on 'SRVR1R' succeeded

CRS-2672: Attempting to start 'ora.cssd' on 'SRVR1R'

CRS-2672: Attempting to start 'ora.diskmon' on 'SRVR1R'

CRS-2676: Start of 'ora.diskmon' on 'SRVR1R' succeeded

CRS-2676: Start of 'ora.cssd' on 'SRVR1R' succeeded

Starting the ASM instance manually.

I had prepared an INIT.ORA file with new and higher settings, and had included the automatic memory management settings that come recommended by Oracle. With that specific INIT.ORA I started the ASM instance on the first node manually.
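For reference, a minimal sketch of what such an ASM init.ora could look like; the memory values match the parameter checks further down, the diskgroup list is illustrative:

*.instance_type='asm'
*.asm_diskgroups='DATA1'          ## illustrative; list ALL diskgroups to mount
*.asm_power_limit=1
*.memory_max_target=4G
*.memory_target=2G
*.large_pool_size=256M
*.processes=999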

SQL> startup pfile='/opt/oracle/initNew.ora';

ASM instance started

Total System Global Area 4275781632 bytes

Fixed Size 2235208 bytes

Variable Size 4239991992 bytes

ASM Cache 33554432 bytes

ASM diskgroups mounted

## I checked all the diskgroups that I wanted to be mounted for all ASM instances, and was happy to see them all there.

SQL> select * from v$asm_diskgroup;

SQL> select name,state from v$asm_diskgroup order by 1;

Note: there was no specific disk group available for the voting disk and for my soon-to-be-born SPFILE. So that was a bummer.

SQL> select group_number, name, compatibility, database_compatibility from v$asm_diskgroup order by 1;

GROUP_NUMBER NAME       COMPATIBIL DATABASE_C
------------ ---------- ---------- ----------
...
           6 DATA1      10.1.0.0.0 10.1.0.0.0
...

17 rows selected.

So I had to change compatible.asm in order to proceed with the disk group I had selected (being DATA1).

alter diskgroup data1 set attribute 'compatible.asm'='11.2';

Diskgroup altered.
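Keep in mind that compatible.asm can only be raised, never lowered again, so this is a one-way change. Verifying the new value is a simple query (V$ASM_ATTRIBUTE only lists attributes once compatible.asm is 11.1 or higher):

SQL> select g.name, a.value from v$asm_attribute a, v$asm_diskgroup g where a.group_number = g.group_number and a.name = 'compatible.asm';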

Then I created the SPFILE from pfile:

SQL> create SPFILE='+DATA1' from pfile='/opt/oracle/initNew.ora';

File created.
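Creating the SPFILE in a diskgroup is also the moment Oracle records its location in the GPnP profile (the property file mentioned in the introduction). A way to peek at that entry, as the Grid owner from the Grid home:

gpnptool get 2>/dev/null | grep -o 'SPFile="[^"]*"'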

SQL> shutdown immediate;

As a non-believer, of course I wanted to make sure about the new automatic memory management settings, because maintenance windows like these should be very rare on a RAC cluster. So I restarted the instance manually one more time, because I wanted to be happy with the final checks.

The parameters shown below are the ones I had increased or had set.
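The restart itself should now be a plain startup, since ASM resolves the SPFILE through the GPnP profile (a sketch, assuming that resolution also works in this exclusive mode):

SQL> startup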

ASM instance started

Total System Global Area 4275781632 bytes

Fixed Size 2235208 bytes

Variable Size 4239991992 bytes

ASM Cache 33554432 bytes

ASM diskgroups mounted

SQL> show parameter SPFILE

NAME TYPE VALUE

———————————— ———– ——————————

SPFILE string +DATA1/srvr1r/asmparameterfile/registry.253.794950891

SQL> show parameter asm

NAME TYPE VALUE

———————————— ———– ——————————

asm_diskgroups string <all diskgroups to be mounted>

asm_diskstring string

asm_power_limit integer 1

asm_preferred_read_failure_groups string

SQL> show parameter memory

NAME TYPE VALUE

———————————— ———– ——————————

memory_max_target big integer 4G

memory_target big integer 2G

SQL> show parameter sga

NAME TYPE VALUE

———————————— ———– ——————————

lock_sga boolean FALSE

sga_max_size big integer 4G

sga_target big integer 0

SQL> show parameter large

NAME TYPE VALUE

———————————— ———– ——————————

large_pool_size big integer 256M

use_large_pages string TRUE

SQL> show parameter proce

NAME TYPE VALUE

———————————— ———– ——————————

processes integer 999
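As an extra cross-check from the Grid home (with the environment pointing at the +ASM1 instance), asmcmd can also report the SPFILE location that is registered for ASM:

asmcmd spget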

This was everything I had planned for, so it was time to shut down the instance.

SQL> shutdown immediate;

ASM diskgroups dismounted

ASM instance shutdown

SQL> exit

Time for the normal restart of the first node in the Grid Infra

Since the node had been started in exclusive mode I first had to stop the cluster again:

As the user ROOT:

crsctl stop crs

I was very curious, and crossed my fingers about the outcome.

As part of the shop's standards, where each node has a mount point per DB, I established the link in /opt/oracle:

ln -s +ASM +ASM1

And I started the Grid Infra stack:

crsctl start crs

Once the first node was up and ASM was on its feet, I did the same for the two other nodes.
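With all three nodes back, a cluster-wide sanity check from any node (as root) shows the state of the CRS, CSS and EVM daemons per node:

crsctl check cluster -all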

Last things

I had to restart the 11.2 databases manually, and noticed that the 11.1 databases had restarted themselves; my stop procedure was causing that. If this scenario is performed on a production box, this part will need improvement.
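Restarting one of the 11.2 databases then looks like this (MYDB1/MYDB are again hypothetical names):

export ORACLE_SID=MYDB1
export ORAENV_ASK=NO
. oraenv
srvctl start database -d MYDB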

The ASM listeners did not come up as resources, due to a missing log directory. I added the directory, but starting the listener from the CRS environment still failed:

ora.SRVR1R.ASM1.asm ONLINE ONLINE on SRVR1R

ora.SRVR1R.listener_+ASM1.lsnr ONLINE OFFLINE

ora.SRVR2R.ASM2.asm ONLINE ONLINE on SRVR2R

ora.SRVR2R.listener_+ASM2.lsnr ONLINE ONLINE on SRVR2R

ora.SRVR3R.ASM3.asm ONLINE ONLINE on SRVR3R

ora.SRVR3R.listener_+ASM3.lsnr ONLINE OFFLINE

Manually I could start the listener after adding the logfile directory (also part of the standards). So I had to understand what was going on. I think the 11gR2 Grid Infra did not accept the '+' in the name, which was perfectly valid in 11.1. This is an action point for the future.
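For completeness, adding the missing log directory and starting the listener by hand amounted to something like this (the directory path and listener name are illustrative; the real values follow from the shop's standards and listener.ora):

mkdir -p /opt/oracle/+ASM1/admin/listener/log    ## hypothetical path
lsnrctl start listener_+ASM1                     ## hypothetical listener name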

Grid Infra did start the listener for ASM2, but when I try to check the resource: