oracle performance

Thank you very much to all those who attended my session "Stats with Confidence". Unfortunately I was delayed by the keynote running late. With the big party coming up, I appreciate the spirit of those brave souls who stayed back. The late start didn't allow me to show the demo completely. But here are the scripts; hope you will be able to follow it along and run it on your own infrastructure.

It contains the presentation as well. Thanks for attending and hope you will find it useful.

The other day I was putting together my presentation for Oracle Open World on Application Profiling in RAC. I was going to describe a methodology for putting a face to an app by measuring how it behaves in a database – a sort of a signature of that application. I was going to use the now-ubiquitous 10046 trace for wait events and other activities inside the database. For resource consumption such as redo generated, logical I/Os, etc., I used the v$sesstat; but then I was stuck. How would I collect the stats of a session when the session has not even started and I don’t know the SID. That problem led to the development of this tool where the stats of a future session can be recorded based on some identifying factors such as username, module, etc. Hope this helps in your performance management efforts.

The Problem

Suppose you want to find out the resource consumed by a session. The resources could be redo generation, CPU used, logical I/O, undo records generated – the list is endless. This is required for a lot of things. Consider a case where you want to find out which apps are generating the most redo; you would issue a query like this:

The value column will show the redo generated. From the SID you can identify the session. Your next stop is v$session to get the other relevant information such as username, module, authentication scheme, etc. Problem solved, right?

Not so fast. Look at the above query; it selects from v$sesstat. When the session is disconnected, the stats disappear, making the entries for that session go from v$sesstat. If you run the query, you will not find these sessions. You have to constantly select from the v$sesstat view to capture the stats of the sessions hoping that you would capture the stats before the session disconnects. But it will be not be guaranteed. Some short sessions will be missed in between collection samples. Even if you are lucky to capture some stats of a short session, the other relevant information from v$session will be gone.

Oracle provides a package dbms_monitor, where a procedure named client_id_stat_enable allows you to enable stats collection on a future session where the client_id matches a specific value, e.g. CLIENT1. Here is an example:

execute dbms_monitor.client_id_stat_enable('CLIENT1');

However there are three issues:

(1) It collects only about 27 stats, out of 400+

(2) It offers only three choices for selecting sessions – client_id, module_name and service_name.

(3) It aggregate them, sums up all stats for a specific client_id. That is pretty much useless without a detailed session level.

So, in short, I didn’t have a readily available solution.

Solution

Well, necessity is the mother of invention. When you can’t find a decent tool; you build it; and so did I. I built this tool to capture the stats. This is version 1 of the tool. It has some limitations, as shown at the end. These limitations do not apply to all situations; so the tool may be useful in a majority of the cases. Later I will expand the tool to overcome these limitations.

Concept

The fundamental problem, as you recall, is not the dearth of data (v$sesstat has plenty); it’s the sessions in the future. To capture those sessions, the tool relies on a post-logon database trigger to capture the values.

The second problem was persistence. V$SESSTAT is a dynamic performance view, which means the records of the session will be gone when the session disappears. So, the tool relies on a table to store the data.

The third problem is the getting the values at the very end of the session. The difference between the values captured at the end and beginning of the session are the stats. To capture the values at the very end; not anytime before, the tool relies on a pre-logoff database trigger.

The fourth challenge is identification of sessions. SID of a session is not unique; it can be reused for a new session; it will definitely be reused when the database is recycled. So, the tool uses a column named CAPTURE_ID, a sequentially incremented number for each capture. Since we capture once at the beginning and then at the end, I must use the same capture_id. I use a package variable to store that capture_Id.

Finally, the tool allows you to enable stats collections based on some session attributes such as username, client_id, module, service_name, etc. For instance you may want to enable stats for any session where the username = ‘SCOTT’ or where the os_user is ‘ananda’, etc. These preferences are stored in a table reserved for that purpose.

Construction

Now that you understand how the tool is structured, let me show the actual code and scripts to create the tool.

(1) First, we should create the table that holds the preferences. Let’s call this table RECSTATS_ENABLED. This table holds all the different sessions attributes (ip address, username, module, etc.) that can enable stats collection in a session.

If you want to enable stats collection of a session based on a session attribute, insert a record into this table with the session attribute and the value. Here are some examples. I want to collect stats on all sessions where client_info matches ‘MY_CLIENT_INFO1’. You would insert a record like this:

Here are some more examples. All sessions where ACTION is ‘MY_ACTION1’:

insert into recstats_enabled values ('ACTION','MY_ACTION1');

Those of user SCOTT:

insert into recstats_enabled values ('SESSION_USER','SCOTT')

Those with service name APP:

insert into recstats_enabled values ('SERVICE_NAME','APP')

You can insert as many preferences as you want. You can even insert multiple values of a specific attribute. For instance, to enable stats on sessions with service names APP1 and APP2, insert two records.

Important: the session attribute names follow the naming convention of the USERENV context used in SYS_CONTEXT function.

Note, I used the tablespace USERS; because I don’t want this table, which can potentially grow to huge size, to overwhelm the system tablespace. The STATISTIC_NAME and STATISTIC_VALUE columns record the stats collected. The other columns record the other relevant data from the sessions. All the attributes here have been shown with VARCHAR2(2000) for simplicity; of course they don’t need that much of space. In the future versions, I will put a more meaningful limit; but 2000 does not hurt as it is varchar2.

The capture point will show when the values were captured – START or END of the session.

(3) We will also need a sequence to identify the sessions. Each session will have 400+ stats; we will have a set at the end and once at the beginning. We could choose SID as an identifier; but SIDs could be reused. So, we need something that is truly unique – a sequence number. This will be recorded in the CAPTURE_ID column in the stats table.

SQL> create sequence seq_recstats;

(4) To store the capture ID when the post-logon trigger is fired, to be used inside the pre-logoff trigger, we must use a variable that would be visible to entire session. A package variable is the best for that.

create or replace package pkg_recstatsis g_recstats_id number;end;

(5) Finally, we will go on to the meat of the tool – the triggers. First, the post-logon trigger to capture the stats in the beginning of the session:

Again the code is self explanatory. We capture the stats only of the global capture ID has been set by the post-logoff trigger. If we didn’t do that all the sessions would have started recording stats at their completion.

Execution

Now that the setup is complete, let’s perform a test by connecting as a user with the service name APP:

SQL> connect arup/arup@app

In this session, perform some actions that will generate a lot of activity. The following SQL will do nicely:

SQL> create table t as select * from all_objects;

SQL> exit

Now check the RECSTATS table to see the stats on this catured_id, which happens to be 1330.

This clearly shows you all the stats of that session. Of course the table recorded all other details of the session as well – such as username, client_id, etc., which are useful later for more detailed analysis. You can perform aggregations as well now. Here is an example of the stats collected for redo size:

Granularity like this shows you how the application from different client servers helped; not just usernames.

Limitations

As I mentioned, there are some limitations you should be aware of. I will address them in the next iterations of the tool. These are not serious and applicable in only certain cases. As long as you don’t encounter that case, you should be fine.

(1) The logoff trigger does not fire when the user exits from the session ungracefully, such as closing down the SQL*Plus window, or closing the program before exiting. In such cases the stats at the end of the session will not be recorded. In most application infrastructure it does not happen; but it could happen for adhoc user sessions such as people connecting through TOAD.

(2) The session attributes such as module, client_id and action can be altered within the session. If that is the case, this tool does not record that fact since there is no triggering event. The logoff trigger records the module, action and client_id set at that time. These attributes are not usually changed in application code; so it may not apply to your case.

(3) Parallel Query sessions will have a special issue since there will be no logoff trigger. So in case of parallel queries, you will not see any differential stats. If you don’t use PQ, as most OLTP applications do, you will not be affected.

(4) If the session just sits there without disconnecting, the logoff trigger will never fire and the stats will never be captured. Of course, it will be eventually captured when the session exits.

Once again, these limitations apply only to certain occasions. As long as you are aware of these caveats, you will be able to use this tool to profile many of your applications.

Query transformation is a set of techniques used by the optimizer to rewrite a query and optimizer it better. Few optimization paths open up to the optimizer after query transformation. Some query transformations must be costed to be chosen and some do not need to be costed. For example, if a table can be eliminated completely from the join, then that transformation is applied and need to cost that transformation is minimal.

Test case

We will use the following test case to illustrate the concepts behind Query transformation. Some of the optimizations that we see here works from version 11gR1 onwards and so, these test cases might not work in the versions 10g and below.

JE is a technique in which one or more tables can be eliminated from the execution plan without altering functional behavior. In the listing 1-1, query selects columns from the table t1 only, but there exists a join predicate between t1 and t2 in that query. Further, no columns are selected from table t2 in this query and join to t2 simply serves as to verify the existence of foreign key values. Enabled Foreign key constraint between these two tables establishes the existence check already and so, there is no need for explicit existence check in the query also. Join to table t2 can be eliminated by the optimizer safely.

As you see from the listing 1-1, Table T2 is removed from the execution plan. Since there is a valid foreign key constraint, optimizer eliminated the join condition to that table t2.

Let’s also discuss another Join Elimination test case. In the Listing 1-2, predicate is “t1.n2 not in (select t2.n1 from t2)”. As the enabled foreign key constraint dictates that this predicate will always be false and no rows will be returned. Optimizer promptly identified this condition and added a filter predicate in the step 1 with “NULL is NOT NULL” as a predicate. Step 1 is executed before step 2; Step 2 is never executed as the value of Starts column is zero in the execution plan.

Various filter predicates are generated and added to the execution plan using enabled and validated constraints (check, not null constraints).

In the Listing 1-4, columns n1 and n2 has enabled valid NOT NULL constraints that precludes null values in the columns n1 and n2. Query in the listing 1-4 has predicate “n1 is null or n2 is null” which can never be true. This fact is used by the optimizer to improve the execution plan. Filter predicate (NULL IS NOT NULL) is added in step 1 which will be FALSE. So, Step 2 is never executed as the value of Starts column is 0 in the execution plan. This means that step (2) in the execution plan was never executed and table T1 was never accessed.

Let’s add a check constraint to this column to explain this further. In the listing 1-5 a check constraint is added which specifies that “n1 200 and generated predicate n1<200 will nullify each other leading to an always FALSE condition. Optimizer identified this condition and added a filter predicate in step 1: NULL IS NOT NULL.

Following lines from the trace file generated from event 10053 shows that a predicate n1<200 is added; This auto-generated predicate and existing predicate canceled each other leading to an eternally FALSE condition.

In some cases, the optimizer can convert a set operator to a join operator. Interestingly, this feature is not enabled by default (up to 11gR1). In the listing 1-6, we enable this parameter. A MINUS set operation has been converted to a join operation.

Subqueries can be unnested in to a join. Listing 1-7 shows that a subquery is unnested in to a view and then joined to other row sources. In this listing, a correlated subquery is moved in to a view VW_SQ_1, unnested and then joined using Nested Loops Join technique. There are many different variations of Subquery Unnesting possible, but the crux of the matter is that subqueries can be unnested, joined and then costed.

SU is one reason why there are many performance issues after a database upgrade to 10g and above. Cost of unnested subquery will go up or down leading to an unfortunate choice of not-so-optimal execution plan.

Use of ORDERED hint can really play havoc with SU feature too. For example, in the listing 1-8, join between t1 and t2 is preferred followed by other joins. You would expect to see the leading table in the join to be T1, but the leading row source is VW_SQ_1.

Is CBO not honoring our hint? It is honoring our hint. Except that ORDERED hint was applied after the SU transformation and so, unnested view is in the leading row source. A variation of the transfromed query from 10053 trace file is printed below. With ORDERED hint, of course, CBO must choose the unnested view as the leading row source. Use LEADING hint instead of ORDERED hint if necessary.

In RAC, every data block is mastered by an instance. Mastering a block simply means that master instance keeps track of the state of the block until the next reconfiguration event (due to instance restart or otherwise).

Hash to the master

These data blocks are mastered in block ranges. For example, range of blocks starting from file 10, block 1 through block 128 may be mastered by instance 1, blocks from file 10, block 129 through 256 are mastered by instance 2 etc. Of course, there are differences between various versions 10g, 11g etc, but Idea here is that block ranges are uniformly mastered between various instances so that Global cache grants are evenly distributed among the instances. Interestingly, length of the block range is 128 from 10g onwards (Julian Dyke mentioned that is 1089 in 9i, but I have not personally tested it). Of course, Support recommends you to unset db_file_multiblock_read_count which will be auto adjusted to 128 which means that Full block range can be read with fewer GC messages, I suppose. I digress.

Further, Michael Möller pointed out that this hash-algorithm is further optimized: The hash-algorithm used when initially computing the master node from the DBA, results in a “virtual master”, which is then translated to a real (online&open) master by a lookup table (the length of which is the maximum number of possible nodes (128 ?). This means that when one node goes off/on-line, RAC does NOT have to recalculate the hash for all blocks, but only distribute the new Hash-to-node table. (One can later visualize dynamic remastering as an additional lookup table between the hash value and node. This table also needs redistributing on node changes.)

Following SQL is helpful in showing masters and owners of the block. This SQL joins, x$kjbl with x$le to retrieve resource name. If you are familiar with Oracle locking strategy, you would probably recognize the format of these cache fusion (aka old PCM) locks. Lock type in this case is BL, id1 is block# and id2 is file_id in this case. Column kjblname2 provides a decimal format lock resource.

A session is trying to access the block file 1, block 10776, but that block is mastered by instance 2 and also that block is owned by instance 2 (meaning, it is in instance 2 cache). So, instance 3 will sent a PR (Protected Read) mode BL lock request on that block to instance 2. Ignoring additional complexities, instance 2 will grant PR mode lock to instance 3 and transfer the block to instance 3. Obviously, this involves multiple GC messages, grants and block transfer. Statistics ‘gc remote grants’ gets incremented too.

Let’s consider that session is trying to access another block: file 1, block 6375. That block is mastered by instance 3 and also owned by instance 3. At this point, there is no additional GCS/GES processing is needed and the session pin that buffer and continue the work.

Let’s consider a third case. Session is trying to access file 1 block 6374. That block is not in any buffer cache, but instance 3 is master of the block, so local affinity locks are acquired with minimal GC messages and waits. That block is read from the disk in to the buffer cache

In the case #2 and #3 above, requesting instance also is the master node of a block or block range. In these cases, statistics ‘gc local grants’ is incremented and cheaper local affinity locks on those block ranges are acquired avoiding many Global cache messages.

So far so good, but what if, say instance 1, is reading one object (table, index etc) aggressively, but other instances are not reading that object at all? [ through some sort of application node partitioning or just plain workload]. Does it make sense for the instance accessing that object aggressively request a grant to the remote instance(s) for each OPEN on that object’s blocks? Especially, if the blocks are read in to the buffer cache, but disappears soon from the buffer cache? Wouldn’t that be better if the instance reading that object aggressively is also the master of that object, as in the cases #2 and #3 above?

In addition to that, if the block is supposed to be thrown away from buffer cache (close of BL lock) or if the block needs to be written, then that will involve additional overhead/messaging between the master instance and owner instance since the ownership needs to be communicated back to the master of the block.

Enter Object remastering.

Object Remastering

There are many new features in 10g/11g RAC stack. One of them is Object remastering feature. This feature was implemented in 10gR1 and improved in 10gR2 and further enhanced in 11g. I realize there are parameters in 9i also, but I don’t think it worked as intended.

With object remastering feature, if an object is accessed by an instance aggressively, then that instance will become the master of the object reducing gc remote grants improving performance of the application. In the prior sentence, I used the word “accessed”, but it is a loose term, and the correct term is if the instance is requesting much BL locks on an object, then that object can be remastered. In an ideal world, even if the application is not partitioned, remastering of the objects that were accessed aggressively from one instance will acquire cheaper local instance affinity locks and effective RAC Tax will be minimal.

Well, I said, in an ideal world There are few issues here:

Instance do not remember prior mastership across restarts. This means that instance needs to re-learn the object mastership map after every restart. I can see the complexities of remembering the mastership, but it is possible to implement that.

Remastering is not exactly cheap. Instance GRD is frozen during reconfiguration and in a very busy instances, this can take many seconds leading to instance freeze for several seconds. While 10gR2 introduce parallel reconfiguration (_rcfg_parallel_Replay parameter controls this behavior) using all LMS processes to complete the reconfiguration, still, several seconds of freeze is not exactly acceptable in many environments.

I advice my clients to keep LMS processes to a lower value (3 to 5), at the most, but instance reconfiguration effective parallelism is reduced if we reduce number of LMS processes.

Last, but not the least important point is that, default values of few parameters that trigger remastering events are quite low for busy environments causing frequent remastering of objects. In an E-Business World, minor mismanagement in the manager configuration can lead to a massive reconfiguration issues.

Parameters, views and internals

Few parameters are controlling this behavior, not well documented, my test case results are not very accurate either. But, these parameters are giving us a picture of what is going on internally. These parameters are applicable to 10gR2 and below. For 11g, whole set of different parameters comes in to play and I will blog about the differences in another blog entry.

X$object_affinity_statistics maintains the statistics about objects and OPENs on those objects. It is important to understand the difference between OPEN and Buffer access. If the block is in the cache already in a suitable mode, there is no need for BL opens on that block. So, if the sessions are accessing the same block repeatedly without requesting any additional BL locks, then the count is not incremented. So, OPEN is simply a number of BL request initiated in an ephimeral time frame.

LCK0 process maintains these object affinity statistics. If an instance opens 50 more opens on an object then the other instance (controlled by _gc_affinity_limit parameter), then that object is a candidate for remastering. That object is queued and LMD0 reads the queue and initiates GRD freeze. LMON performs reconfiguration of buffer cache locks working with LMS processes. All these are visible in LMD0/LMON trace files. Parameter _gc_affinity_time controls how often the queue is checked to see if the remastering must be triggered or not with a default value of 10 minutes.

Now, you don’t want just any object as a candidate for remastering, meaning, if instance 1 opened 101 BL locks on that object and instance 2 opened 50 BL locks on that object, you don’t want to trigger object remastering. Only objects with higher amount of BL lock requests must be queued for remastering. Well, that threshold seems to be controlled by another parameter _gc_affinity_minimum: This parameter is defined as “minimum amount of dynamic affinity activity per minute” to be a candidate for remastering. Defaults to 2500 and I think, it is lower in a busy environment.

Few lines from LMD0 trace files showing that LMD0 is reading a request queue:

Performance can suffer if there are remastering issues. Following AWR report shows that few instances froze due to DRM reconfiguration issue. Same type of freeze is visible in all other nodes too. gc buffer busy is a side effect of DRM freeze (not always, but in this case).

Essentially, an object was remastered after excessive BL locking requests ( in a loose term accesses) on that index.

undo and affinity

Mastering of Undo segments differ from non-undo segment mastering. With non-undo segments, all the blocks are mastered by a hash technique spreading mastership among instances for a segment. Only after an instance opens BL locks aggressively on a segment that segment is mastered. But, for undo segments, Instance that activates an undo segment masters the segment immediately. This makes sense, since that undo segment will be used by the instance opening the segment in most cases. Parameter _gc_undo_affinity controls whether this dynamic undo remastering is enabled or not.

Since undo segments do not have real object_ids, a dummy object_ids over a value of 4294950912 is used. For example, undo segment 1 (with usn=1) will have an object_id of 4294950913, usn=2 will have object_id of 4294950914 etc.
[ 4294950912 = power(2,32) - power (2,14) = xFFFFC000 ]

I was not successful in triggering another undo segment remastering event. I created one active transaction generating 200K undo blocks in one node, another node was reading that table and I can see enormous waits for those undo blocks. But, I didn’t see any remastering events related to that undo segment. Not sure why it did not work, may be the conditions for the undo segment remastering is different.

[ PS: I am able to manually remaster the undo segment using lkdebug command discussed below: So, code must be remastering the undo segments automatically too, but may be some other conditions must be met.

I am not preaching that you should modify these undocumented parameters. Far from it. Understand the parameters, if you run in to wait events such as 'gc remaster', 'gcs freeze for instance reconfiguration', understand whether the default values are quite low. Work with support and see if this can be tuned.

In summary, remastering is a great feature. It is a pity that some times, we fall victims of the side effects. So, if you run in to issues with remastering, don't disable it, but see if you can tune those parameter upwards so as to control the remastering events. If you stil want to disable DRM completely, I would recommend setting _gc_affinity_limit and _gc_affinity_minimum to much higher value, say 10Million. Setting the parameter _gc_affinity_time to 0 will completely disable DRM, but that also means that you can not manually remaster objects. Further, Arup mentioned that x$object_affinity_statistics is not maintained if DRM is disabled.

Again, these are undocumented parameters. Before you change these parameters make sure that Oracle Support agrees with you.

Update 1:

From 11g onwards, affinity management renamed to policy management. For example, x$object_affinity_statistics is renamed to x$object_policy_statistics. Similarly, initialization parameters are renamed too:Parameter _gc_affinity_limit is renamed to _gc_policy_limit; Parameter _gc_affinity_time is renamed to _gc_policy_time; A new view v$policy_history is introduced and all rows with policy_event = 'initiate_affinity' are associated with DRM events.
Other details about the blog still holds good, except that by default _gc_policy_limit is lowered to 1500 which would mean that, by theory, 11g may have more DRM events. YMMV.

[ Many Thanks Arup Nanda and Michael Möller (aka "M2") for reviewing this blog entry, they contributed heavily to some of my discussion. But, any mistakes in this document is solely of mine. ]
This blog is available in a pdf format from Dynamic_remastering_RAC_orainternals.

I will be presenting on two topics in HOTSOS’ 2010, an Oracle performance focused conference, in my home town Dallas, Texas. You can read about my presentation topics in HOTSOS’ 2010 Riyaj . This symposium is a valued conference for performance engineers, DBAs and developers who are interested to know learn about performance. There are many great speakers presenting in this conference and the main page for this conference is HOTSOS ’2010 . BTW, My friend Alex Gorbachev interviewed Gary Goodman and posted a video in his blog also.

Tanel Poder is conducting the HOTSOS training day this year. You can’t miss his training day and I heard that he is working on a MOTS (Mother Of all Tuning Script) and planning to release that in HOTSOS ’2010.

On behalf of the Dallasites, I invite you to visit Dallas and attend this great conference.

Global cache performance metrics are not correctly measured. It is not understood clearly either. There are even few blogs and web pages disseminating incorrect information. This blog entry is an attempt to offer few methods and scripts to understand global cache performance.

Always review all instances

It is very important to review the performance metrics from all instances in that RAC cluster, not just one instance that you are connected. If you have access to AWR reports, then it is critical to generate AWR reports (or statspack reports) from all instances. But, the problem is that, DBAs tend to generate AWR reports after logging in to each instance iteratively, enter couple of parameters and then reports are generated. Not exactly a convenient practice.

There are few issues with this approach. It is a cumbersome practice if the instance count is higher. In addition to that, all of AWR reports are, in turn, accessing underlying AWR tables. Physically, rows from all instances are together in the same block and so, by executing these reports connecting to various instances, Global cache traffic is increased. If the database is suffering from Global cache (GC) performance issues then generating reports connecting to various instances is probably not a grand idea.

I have created few basic scripts to generate AWR reports from all instances in a single step. Script awrrpt_all_gen.sql queries AWR tables and gv$instance to access last snap_id and various instance_ids. Then this script generates AWR reports for all instances in that cluster in the current directory, for the last AWR snap. This script is quite handy while debugging RAC performance issues. Now, we have AWR reports from all instances in one directory and we can easily grep these files to understand metrics in various instances.

It is important to measure Global cache performance metrics correctly, especially those damned averages. Incorrect measurements of averages can lead to faulty analysis. For example, following query is retrieving average global cache CR receive time from gv$sysstat view. Problem with this SQL statement is that this statement is retrieving average global cache receive time from the restart of the instances. Global cache performance is quite sensitive to workload and querying over a wide time range, as in this query, can lead to faulty analysis.

As an analogy, above query is equivalent to a patient-Doctor conversation:

Patient: Doctor, I don’t feel good. I feel that I am having fever.
Doctor : Nope. Your average body temperature from the beginning of this year is around 98. So, you are doing excellent.

It is essential to review the global cache performance metrics at a granular level. Use AWR reports or statspack reports for a short period of time. Better yet, if you want to quickly review GC performance metrics, you can use my script gc_traffic_print.sql. This script captures various statistics relevant to Global cache performance in to a PL/SQL table, sleeps for N seconds (default N=60) and re-captures the statistics after the sleep. Then the script prints all the relevant details in easily readable tabular format. Very useful in quickly reviewing the Global cache performance metrics.

Global cache processing performance can be broken down to its sub components ( excluding wire latency, but that can be calculated). Breakup of this metrics must be done at a granular level also. We can use yet another script gc_traffic_processing.sql to review the performance metrics for the past N seconds. For example, in the output printed below, Global cache CR processing time is broken in to various sub-components of global cache processing. Remember that these performance metrics are in the SEND side of Global cache performance metrics. Meaning, inefficient global cache processing in one instance affects other instances' global cache receive time, but not the self. For example, in the table above, GC CR receive performance was at 12.1ms for instance 1. That is due to another instance (probably node 4 in this case) suffering from high GC processing time (more specifically high 'CR block flush time' in node 4). It is far more accurate to review global cache performance issues in a granular basis. Caution though, try not to specify at least a minute so that these numbers are consistent.

Averages can be misleading. Revealing performance issues underneath the average is a very important task to do. For example, we see that 'gc cr block receive time' average is high. But, let's review the histogram for the gc events related to Global cache receive time. My script gc_event_histogram.sql is quite handy to review event histograms. This script provides a difference in performance statistics between two samples and prints them in a readable format. In this case, node 1 is has 8578 waits for the event 'gc cr block 2-way' under 2ms and the sessions in instance 1 had 9851 waits between 2-4ms. Only few waits over 16ms.

But, we also need to see 3-way waits. Let's review 3-way waits with the above script. We will simply supply 'gc cr block 3-way' as the wait event. In this case, we see that there are 12,062 waits over between 4-8ms. Of course, this must be improved.

In essence, these four scripts will be useful in debugging Global cache performance issues. Please let me know if you encounter errors in these scripts. We will use these scripts in future discussions to show how we resolved few RAC performance issues.

I was reviewing a performance issue with a client recently. Problem is that increased global cache waits causing application slowdown affecting few critical business functions. Using one of my script gc_traffic.sql > and graphing the results with Excel spreadsheet, it is established that there is a marked increase in GC traffic today compared to week earlier. Similar jobs runs every day and so comparing two week days is sufficient to show the GC traffic increase. Graph is between total blocks and AWR snap time in 30 minutes interval. [Click the picture below to review the graph clearly.]

Identifying the object creating this increased GC traffic is essential to identify root cause. We were able to quickly determine that this increase in GC traffic was localized around few SQL statements using ADDM and AWR reports. We decided to focus on one SQL with an obvious increase in elapsed time compared to prior week. So, first question asked, is there a change in the plan? plan_hash_value was reviewed and quickly determined that there is no change in the plan_hash_value.

Little bit of history, there were few changes performed by the vendor over the weekend as a part of few bug fixes. Vendor’s argument was that since there is no change to the plan_hash_value, SQL access plan did not change and so, this can’t be due to vendor changes. My Client’s argument was that there were no changes to the environment and problem started after the application changes.

There are many different things that can go wrong without changes to the execution plan. We can ignore those conditions for now since (a) there has been no changes to the environment (b) no visible changes to the data (c) no error message and average CR recv time is consistent with prior weeks. Well, Let’s cut to the chase. It boiled down to a question “Can SQL plan change without a change in plan_hash_value?”. What do you think? Please answer in the poll below.

Hash_value of a SQL statement is generated from the text of an SQL statement and plan_hash_value is generated from the execution plan of that SQL statement[ More accurately, from that child cursors' execution plan and exactly what is involved in generating plan_hash_value is not published]. It is a general belief that plan_hash_value will change even if there is a slightest change in the execution plan. But, that is not always the case!

Notice that in the test result above filter predicate changed reflecting index column reordering. But the plan_hash_value did not change. Point is that execution plan can change without a change in plan_hash_value (due to change in the underlying tables).

Let’s modify that index dropping a column from the index in the next test case.

We see that predicates changed but the plan_hash_value did not change. In the test case below, we will modify index to be a function based index and test this SQL statement. There are also few more self-evident test cases below.

In summary, plan_hash_value is a very good indicator to see if the plan changed or not. Don’t get me wrong, I also use plan_hash_value, but in addition to comparing CPU time and elapsed time. Execution plan can change even when there is no change to the plan_hash_value. Salient points from these test cases are:

Plan_hash_value is dependent upon partial execution plan not on complete execution plan.

If a predicate is moved from filter_predicate to access_predicate or vice-versa, it doesn’t affect plan_hash_value.

Changes in the parallelism of the queries does not affect plan_hash_value. For example, if a query used 4 parallel slaves today and 16 parallel slaves yesterday, that change is not visible through plan_hash_value. This behavior is expected as parallel_adaptive_multiuser can allocate slaves depending upon the state of that instance.

Plan_hash_value is case-sensitive to index/table names. Also sensitive to white-space characters.

Plan_hash_value is not sensitive to index types. For example, if the index type is a function based index as in our test case #4, as long as, index name did not change plan_hash_value will remain the same.

Plan_hash_value is not sensitive to schema either. SQL statement accessing different schemas can have same plan_hash_value too.

Back to our problem. One of the index was recreated removing a column and caused optimizer to apply filter predicates at table level increasing number of accesses to table block tremendously, leading to more logical reads, more Global cache waits etc. This problem was amplified since this SQL was executed very frequently and concurrently from all RAC instances.

In my client’s defense, this application change was tested thoroughly. But, alas, test data chosen for this performance test was not probing this specific issue. This performance issue did not show up in development. Essentially, chosen data in the performance benchmark suite was not an excellent choice.

As they say in Gaelic language “Go raimh maith agat” to my client for allowing me to post this blog.
This can be read in traditional format inplan_hash_value_and_gc
Update 1: Added the document and also corrected a typo.

My earlier blog about shared pool duration got an offline response from one of my reader:
” So, you say that durations aka mini-heaps have been introduced from 10g onwards. I have been using Steve Adams’ script shared_pool_free_lists.sql. Is that not accurate anymore?”

Shared pool free lists

I have a great respect for Steve Adams . In many ways, he has been a great virtual mentor and his insights are so remarkable.

Coming back to the question, I have used Steve’s script before and it is applicable prior to Oracle version 9i. In 9i, sub-heaps were introduced. Further, shared pool durations were introduced in Oracle version 10g. So, his script may not be applicable from version 9i onwards. We will probe this further in this blog.

This is the problem with writing anything about internals stuff, they tend to change from version to version and In many cases, our work can become obsolete in future releases(including this blog!).

In version 9i, each sub-heap of the shared_pool has its own free list. In version 10g and 11g, each duration in sub-heap has its own free list. This is visible through x$ksmsp and column x$ksmsp.ksmchdur indicates the duration that chunk belongs to. In 9i, that column always has a value of 1 (at least, that I have experimented so far). In 10g & 11g (up to 11.1.0.7), there are exactly 4 durations in each sub-heap and values range from 1-4 for this column ksmchdur. Each duration has its own free list.

After reading my earlier post on shared pool A stroll through shared pool heap , one of my client contacted me with an interesting ORA-4031 issue. Client was getting ORA-4031 errors and shared pool size was over 4GB ( in a RAC environment). Client DBA queried v$sgastat to show that there is plenty of free memory in the shared pool. We researched the issue and it is worth blogging. Client DBA was confused as to how there can be ORA-4031 errors when the shared pool free memory is few GBs.

Heapdump Analysis

At this point, it is imperative to take heapdump in level 2 and Level 2 is for the shared pool heap dump. [ Please be warned that it is not advisable to take shared pool heap dumps excessively, as that itself can cause performance issue. During an offline conversation, Tanel Poder said that heapdump can freeze instance as his clients have experienced.]. This will create a trace file in user_dump_dest destination and that trace file is quite useful in analyzing the contents of shared pool heap. Tanel Poder has an excellent script heapdump_analyzer . I modified that script adding code for aggregation at hea, extent and type levels to debug this issue further and it is available as heapdump_dissect.ksh . ( with a special permission from Tanel to publish this script.)