Unfortunately, the documentation is slightly unspecific regarding the requirements: glibc 2.14 is met, but in my system only 64 bit by default.
With a glimpse at this part of the installation path mrprof-7.1.1.3-linux32 it should be obvious, but it wasn't for me, and the error-message was not helpful at first stage.

All together nothing special, but it can cost some time when you want to use a tool for the first time - probably for urgent reasons - and first have to fix such obstacles.
So it's collected here, to save me some time when I hit it again.

Freitag, 15. Dezember 2017

We all know, if we try to look to close on some details, the results we get may change from the original observation.
This is also sometimes true for Oracle Trace events:
During analyses of a query with XMLTYPE which fails with ORA-12850 (in a 4 node RAC DB), Oracle Support asked for some trace files. the query itself is something like

Donnerstag, 14. Dezember 2017

We are facing an issue where Oracles Trace File Analyzer creates high load on some servers.
During the discussion with Oracle Support I installed the latest version of TFA which is 12.2.1.3.0 right now.
With this version, oratop - which oracle calls

oratop is a utility similar to the unix OS utility top which gives a overview of database performance and can be used in combination with top to get a more complete overview of system performance

After another SR , it was clear to be Bug 27103547 - ORATOP DOES NOT WORK IN TFA 12.2.1.3.0
The solution is simple:
In $GRID_HOME/tfa/`hostname`/tfa_home/bin/common/tfactlshare.pm
edit somewhere around line 9010

So the exit 0; is not executed anymore. Now oratop works as expected - even with TFA 12.2.1.3.0
Update: 18-Dec-2017
A new version 12.2.1.3.1 is available - this Bug is fixed there.
thank you Sandesh Rao for informing me.

@martinberx the oratop issue which was in TFA 12.1.2.3.0 has been fixed in 12.1.2.3.1 which was just released , thanks for pointing it out

Dienstag, 12. Dezember 2017

Managing many ORACLE_HOMEs in a complex environment can be tricky, especially when questions about their patches & bugs or comparisons should be done.

It's required to understand the different types of patches available. This can be (according to the documentation & my interpretation). Especially the differences between SYSTEM PATCH (in our case only Bundle Patches), COMPOSITE and SINGLETON patches, and BUGs - of course.

A way to handle all the information is to use the details provided by opatch and the patches, store them in some tables and afterwards query them.
For ORACLE_HOME related information, all the information required can be generated by a simple opatch command:

So a lot of useful queries can be done when information regarding ORACLE_HOMEs is stored in a RDBMS.

Similar things can be done for patches as well. patches contain a lot of useful information. Those are stored in inventory.xml, actions.xml and if they exist in bundle.xml, and any README* files. If these files are loaded into tables similar to the method shown above, they can be used to generate even more insight. Some care must be taken as SYSTEM and COMPOSITE patches contain other patches which must be processed as well (together with their dependencies). We normalized the xml files to several tables during the load for easier/faster queries.

Mittwoch, 6. Dezember 2017

I was on DOAG and UKOUG conferences this year and hold some presentations there.
Of course I attended even more presentations and learned a lot of new things.
But I did not blog about it for some good reasons:
The presentations I attended are done already. Which means the presenters did a lot of research and put it all into great presentations and abstracts - so why should I create a weak copy of their work?

Beside these presentations I had a lot of talks to several people. We shared war stories, created concepts for new solutions over a beer and also discussed hidden, or at least not so well known "features". I took notes for some of these discussions for later analysis.
But these notes contains this important reminder: !no Blog!
That's there for a good reason: We shared information in mutual private trust. There are often details about things (ideas, concepts, tools, features, ...) which are not complete yet, or never will be. Or only applicable in this very specific situation, but not used without proper care. Or many other good reasons.
And this is the real value for going to conferences! These discussions beside and after the presentations.
It's NOT the hunt for hidden information in general, but the possibility to get open minded conversations off the records.
It was great to meet all the old and new friends. I love this community where you are seen as a valuable member based on your contribution; questions are always accepted and not marked as "dumb" or "silly". And even (and special) in disagreement the discussions are on highest technical level, but never personal.

Writing that I have to thank DOAG and UKOUG for their great work running these conferences.

Donnerstag, 16. November 2017

I hit an issue where a full backup took much longer than normal.
In this case there was no alarm yet as no threshold was reached. But I worked on the DB for some other reason and out of a habit I most often start a ASH viewer whenever I work on a system - even if I only check data, it's worth to have an eye on the system.
In this case I saw some top session in waits 'ASM file metadata operation' & 'KSV master wait'.
It wasn't my query session (so I didn't break anything) but some RMAN worker processes.

That's worth to investigate. After some research (Google & MetaLink) I saw some links between ASM rebalance and 'ASM file metadata operation'.

Checking the ASM instance, there was really a ASM rebalance ongoing, but no progress (no change in v$asm_operation.SOFAR over some minutes). It was initiated the other evening by a colleague which added a disk to the DG. I agree with Kevin this is a bad habit, but in this environment it's not enough pain (and multiple teams involved) to re-work all the processes. The RBAL process was waiting in 'enq: RB - contention'.

As ASM rebalance can be stopped or re-started wit othe rpriority easily, I gave this a chance and run ALTER DISKGROUP dg REBALANCE POWER 2 - the power is not important here, ony to stop the current (stalled) rebalance and issue another.
The ASH viewer immediately showed the uncommon waits disappear and in RMAN logs I saw ordinary progress immediately.

To be honest I did not much analysis here, so it might be worth to do better, but in this case it was sufficient and the issue solved even before there was an alarm regarding the blocked backup.

Donnerstag, 26. Oktober 2017

I spent a recent night hunting an issue which was non trivial to detect.
On a 4 node cluster 2 nodes got Linux patches and were rebooted. This was done after 10pm for minimal business impact. (The application is written by Oracle - so it's not fully TAF aware).
The DBs started fine, all services were available.
At 3:50 in the morning I got a call some processes show strange errors.
As I connected to the Server, the Instance shows running proceses. But I could not connect to the instance:
With proper environment variables, the errors were:

My sqlplus process tried to attach the shared memory segment 323780623 at address 0xa0000000 ( shmat(323780623, 0xa0000000, 0) ) but it failed (-1 EINVAL).
The reason for the failure is the alredy linked library /lib64/ld-2.12.so at address 324e000000 (and some others as well).

But why is it so important to have this shared memory at the specific address? As strace shows, it was able to map it on some other address without any issue ( shmat(323780623, 0, 0 = 0x2aaaaac00000 ).

But Oracle requires the shared memory segments mapped at the same address for all processes of an instance. It uses the direct memory address to access all it's memory structures. E.g. when querying saddr or paddr from v$session, these are the real addresses.
So if you map a shared memory process with your own binary at the same address, you can access the memory structures direct using these addresses.

So what bad did prelink make to the system?

The Linux server was patched at about midnight. A lot of new RPMs were installed and the server rebooted. Clusterware started and fine and int started all instances.
But sometime close to 03:34 prelink was started, it did a lot of work with all the new libraries and created a new /etc/prelink.cache file, so all new started binaries loads their libraries according to this file. In my case the shared memory segment could not be loaded at the specific address anymore - so the Instance was still healthy, but no new process could attach anymore.

My solution was to kill the instance and restart it.
Generally I recommend to disable prelink on Linux hosts running Oracle database.