Monday, July 17, 2017

I recently dealt with a database startup problem.
It was critical, because the database was a production database.
All the redologs had been erased; more precisely, they had been zeroed.
At first, I thought that the issue might have been caused by a wrong duplicate command, such as a duplicate command specified with NOFILENAMECHECK.

INFO: NOFILENAMECHECK prevents RMAN from checking whether the source database datafiles and online redo logs files share the same names as the duplicated files. This option is necessary when you are creating a duplicate database in a different host that has the same disk configuration, directory structure, and filenames as the host of the source database. If duplicating a database on the same host as the source database, then make sure that NOFILENAMECHECK is not set.
Later on, however, I learned the truth. The issue was caused by an incorrect controlfile recreation operation performed by a junior DBA.

He was trying to clone a database, which was planned to run on the same database server as the source database. Unfortunately, he recreated the controlfile of this cloned environment by pointing it at the redologs of the production environment. So, he went too far with this.

When I connected to the production database, I saw the redologs were zeroed.

I tried to validate them using alter system dump logfile '+REDO/redo0x.log' validate; and saw that there were no redo records left in them.

At that point, I realized that we were in a critical situation.

There were no redo records in the redologs, and the database was complaining with "ORA-00312: online log x thread x: '+REDO/logx.dbf'" and "ORA-00338: log X of thread X is more recent than control file".
As a result, the instance was terminated with "opiodr aborting process unknown ospid (82519) as a result of ORA-1092".

ORA-00338 normally means that the control file change sequence number in the log file is greater than the number in the control file. But another potential cause of this error is that the listed redo log is not valid (i.e. it contains zeros), and this was actually the case here.

Well... The production database could not be opened, as the recovery was requesting one of the zeroed redologs. (The cloned database had used these redologs and zeroed them, so at this point it was impossible to reuse them with the production database.)

I also saw that the last redolog was lost, but the previous one had been archived.

INFO: In a cooked filesystem like ext3/ext4, if you remove the redologs while the database is open, there are still some ways to get the redolog contents. (Linux/Unix doesn't delete the file contents while the file is still open by some process, so using lsof and the /proc filesystem, you can get the data of those deleted files.) It seems this is not possible with ASM at all.

Likewise, if your database is closed (closed with shutdown normal, not aborted and not crashed) and you delete your redologs (or zero them), then this is not a problem.

However, if the database is open and you shut it down using "shutdown abort", or if the database crashes somehow, then it means you just lost all your redo.
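
The lsof and /proc trick mentioned above for cooked filesystems can be sketched roughly as follows. This is only a minimal illustration for Linux, not something that was used in this incident (our redologs were on ASM). The function names find_deleted_open_files and recover are made up for this example, and it assumes you know the pid of a process that still holds the deleted file open and that you have permission to read its /proc/<pid>/fd entries.

```python
import os
import shutil


def find_deleted_open_files(pid):
    """Map /proc/<pid>/fd entries to the original paths of unlinked files.

    On Linux, the symlink target of a descriptor whose file has been
    removed ends with the marker " (deleted)".
    """
    marker = " (deleted)"
    fd_dir = "/proc/{}/fd".format(pid)
    deleted = {}
    for fd in os.listdir(fd_dir):
        link = os.path.join(fd_dir, fd)
        try:
            target = os.readlink(link)
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        if target.endswith(marker):
            deleted[link] = target[: -len(marker)]
    return deleted


def recover(fd_path, dest):
    """Copy the still-open file contents out through its /proc entry."""
    with open(fd_path, "rb") as src, open(dest, "wb") as out:
        shutil.copyfileobj(src, out)
```

In practice you would first find the pid with something like lsof, call find_deleted_open_files(pid) to locate the entry whose original path is the removed redolog, and then copy it to a safe location with recover(). Copying a log that the database is still writing to gives you a point-in-time image only, so treat the result with care.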

Well.. The production database, including all its redolog files, was on ASM. So there was no way to get the before image of the redolog files, and I decided to force a startup using _allow_resetlogs_corruption=true and startup force.

Well, after this forced startup, the database opened. EBS services started without errors and no problems were encountered, but as recommended by Oracle Support, we needed to rebuild the database after opening it with this kind of method. Rebuilding means doing the following: (1) perform a full database export, (2) create a brand new, separate database, and finally (3) import the recent export dump. When the database is opened this way, the data will be at the same point in time as the datafiles used.

Then I thought, "even if we do a full export and import and become stable, we have still lost some data. We forced the startup, so we didn't apply the redo records (the redologs were already zeroed anyway)."

So, at that time, I also realized that even if we rebuilt the database at this stage, we would never be sure about its stability. The full export itself might encounter errors as well.

At the end of the day, the best option that came to my mind was restoring and recovering the database.

We had the backups (both full and incremental) + we had the backup of the archivelogs + we knew the log sequence number when the instance terminated.

So I said to myself, "why don't we restore and recover it? The database is now open, but it is not stable."

Anyway, "rman" is intelligent enough to use incremental backups during recover operations (if they are available and relevant). Of course, rman applies archivelogs automatically after restoring the database and rolling it forward with the level 1 incremental backups.

We just issued a simple run {} block like the one below and waited.

RUN {
  SET UNTIL SEQUENCE 12538;
  RESTORE DATABASE;
  RECOVER DATABASE;
}

It was a Friday night, and we restored and recovered an EBS database. We opened it with minimal data loss, and luckily that data could be recreated by the business & application guys.

At the end of the day, the lesson learned here was -> "do not place the production and clone environments on the same host".

However, the biggest lesson was "work on the production server only if you know what you are doing" and/or "do not work on production when you lose your focus".