Blogroll

Posts Tagged ‘performance’

A quick note, Expert Oracle RAC book co-written by me is available now: Expert Oracle RAC 12c. I have written about 6 chapters covering the RAC internals that you may want to learn 🙂 I even managed to discuss the network internals in deep, after all, network is one of the most important component of a RAC cluster.

On February 14-16, I’ll be at the Colorado Convention Center in Denver, Colorado for RMOUG’s Training Days Conference. This is the largest regional Oracle User Conference in North America and attracts presenters from all around the country and the globe. I’ll be presenting:

Presentation Name: Troubleshooting RAC Background Process

Abstract: RAC background process performance is critical to keep the application performance. This session will demo techniques to review the performance of RAC background processes such as LMS, LMD, LMON, etc. using various statistics and UNIX tools. The presentation will also discuss why certain background processes must run in higher priority to maintain the application performance in RAC.

Presentation Name: A Kind and Gentle Introduction to RAC

Abstract: This session will introduce basic concepts such as cache fusion, conversion to RAC, protocols for interconnect, general architectural overview, GES layer locks, clusterware, etc. The session will also discuss the srvctl command and demo a few of these commands to improve the understanding.

Presentation Name: Parallel Execution in RAC

Abstract: This presentation will start to discuss and demo parallel server allocation, intra, and inter node parallelism aspects. The session will discuss the new parallelism features such as parallel statement queuing, parallel auto dop, and discuss the interaction of those features with RAC. The session will probe a few critical parameters to improve PQ performance in RAC.

Click here for more information or to register for RMOUG’s Training Days.

In RAC, every data block is mastered by an instance. Mastering a block simply means that master instance keeps track of the state of the block until the next reconfiguration event (due to instance restart or otherwise).

Hash to the master

These data blocks are mastered in block ranges. For example, range of blocks starting from file 10, block 1 through block 128 may be mastered by instance 1, blocks from file 10, block 129 through 256 are mastered by instance 2 etc. Of course, there are differences between various versions 10g, 11g etc, but Idea here is that block ranges are uniformly mastered between various instances so that Global cache grants are evenly distributed among the instances. Interestingly, length of the block range is 128 from 10g onwards (Julian Dyke mentioned that is 1089 in 9i, but I have not personally tested it). Of course, Support recommends you to unset db_file_multiblock_read_count which will be auto adjusted to 128 which means that Full block range can be read with fewer GC messages, I suppose. I digress.

Further, Michael Möller pointed out that this hash-algorithm is further optimized: The hash-algorithm used when initially computing the master node from the DBA, results in a “virtual master”, which is then translated to a real (online&open) master by a lookup table (the length of which is the maximum number of possible nodes (128 ?). This means that when one node goes off/on-line, RAC does NOT have to recalculate the hash for all blocks, but only distribute the new Hash-to-node table. (One can later visualize dynamic remastering as an additional lookup table between the hash value and node. This table also needs redistributing on node changes.)

Following SQL is helpful in showing masters and owners of the block. This SQL joins, x$kjbl with x$le to retrieve resource name. If you are familiar with Oracle locking strategy, you would probably recognize the format of these cache fusion (aka old PCM) locks. Lock type in this case is BL, id1 is block# and id2 is file_id in this case. Column kjblname2 provides a decimal format lock resource.

A session is trying to access the block file 1, block 10776, but that block is mastered by instance 2 and also that block is owned by instance 2 (meaning, it is in instance 2 cache). So, instance 3 will sent a PR (Protected Read) mode BL lock request on that block to instance 2. Ignoring additional complexities, instance 2 will grant PR mode lock to instance 3 and transfer the block to instance 3. Obviously, this involves multiple GC messages, grants and block transfer. Statistics ‘gc remote grants’ gets incremented too.

Let’s consider that session is trying to access another block: file 1, block 6375. That block is mastered by instance 3 and also owned by instance 3. At this point, there is no additional GCS/GES processing is needed and the session pin that buffer and continue the work.

Let’s consider a third case. Session is trying to access file 1 block 6374. That block is not in any buffer cache, but instance 3 is master of the block, so local affinity locks are acquired with minimal GC messages and waits. That block is read from the disk in to the buffer cache

In the case #2 and #3 above, requesting instance also is the master node of a block or block range. In these cases, statistics ‘gc local grants’ is incremented and cheaper local affinity locks on those block ranges are acquired avoiding many Global cache messages.

So far so good, but what if, say instance 1, is reading one object (table, index etc) aggressively, but other instances are not reading that object at all? [ through some sort of application node partitioning or just plain workload]. Does it make sense for the instance accessing that object aggressively request a grant to the remote instance(s) for each OPEN on that object’s blocks? Especially, if the blocks are read in to the buffer cache, but disappears soon from the buffer cache? Wouldn’t that be better if the instance reading that object aggressively is also the master of that object, as in the cases #2 and #3 above?

In addition to that, if the block is supposed to be thrown away from buffer cache (close of BL lock) or if the block needs to be written, then that will involve additional overhead/messaging between the master instance and owner instance since the ownership needs to be communicated back to the master of the block.

Enter Object remastering.

Object Remastering

There are many new features in 10g/11g RAC stack. One of them is Object remastering feature. This feature was implemented in 10gR1 and improved in 10gR2 and further enhanced in 11g. I realize there are parameters in 9i also, but I don’t think it worked as intended.

With object remastering feature, if an object is accessed by an instance aggressively, then that instance will become the master of the object reducing gc remote grants improving performance of the application. In the prior sentence, I used the word “accessed”, but it is a loose term, and the correct term is if the instance is requesting much BL locks on an object, then that object can be remastered. In an ideal world, even if the application is not partitioned, remastering of the objects that were accessed aggressively from one instance will acquire cheaper local instance affinity locks and effective RAC Tax will be minimal.

Well, I said, in an ideal world 🙂 There are few issues here:

Instance do not remember prior mastership across restarts. This means that instance needs to re-learn the object mastership map after every restart. I can see the complexities of remembering the mastership, but it is possible to implement that.

Remastering is not exactly cheap. Instance GRD is frozen during reconfiguration and in a very busy instances, this can take many seconds leading to instance freeze for several seconds. While 10gR2 introduce parallel reconfiguration (_rcfg_parallel_Replay parameter controls this behavior) using all LMS processes to complete the reconfiguration, still, several seconds of freeze is not exactly acceptable in many environments.

I advice my clients to keep LMS processes to a lower value (3 to 5), at the most, but instance reconfiguration effective parallelism is reduced if we reduce number of LMS processes.

Last, but not the least important point is that, default values of few parameters that trigger remastering events are quite low for busy environments causing frequent remastering of objects. In an E-Business World, minor mismanagement in the manager configuration can lead to a massive reconfiguration issues.

Parameters, views and internals

Few parameters are controlling this behavior, not well documented, my test case results are not very accurate either. But, these parameters are giving us a picture of what is going on internally. These parameters are applicable to 10gR2 and below. For 11g, whole set of different parameters comes in to play and I will blog about the differences in another blog entry.

X$object_affinity_statistics maintains the statistics about objects and OPENs on those objects. It is important to understand the difference between OPEN and Buffer access. If the block is in the cache already in a suitable mode, there is no need for BL opens on that block. So, if the sessions are accessing the same block repeatedly without requesting any additional BL locks, then the count is not incremented. So, OPEN is simply a number of BL request initiated in an ephimeral time frame.

LCK0 process maintains these object affinity statistics. If an instance opens 50 more opens on an object then the other instance (controlled by _gc_affinity_limit parameter), then that object is a candidate for remastering. That object is queued and LMD0 reads the queue and initiates GRD freeze. LMON performs reconfiguration of buffer cache locks working with LMS processes. All these are visible in LMD0/LMON trace files. Parameter _gc_affinity_time controls how often the queue is checked to see if the remastering must be triggered or not with a default value of 10 minutes.

Now, you don’t want just any object as a candidate for remastering, meaning, if instance 1 opened 101 BL locks on that object and instance 2 opened 50 BL locks on that object, you don’t want to trigger object remastering. Only objects with higher amount of BL lock requests must be queued for remastering. Well, that threshold seems to be controlled by another parameter _gc_affinity_minimum: This parameter is defined as “minimum amount of dynamic affinity activity per minute” to be a candidate for remastering. Defaults to 2500 and I think, it is lower in a busy environment.

Few lines from LMD0 trace files showing that LMD0 is reading a request queue:

Performance can suffer if there are remastering issues. Following AWR report shows that few instances froze due to DRM reconfiguration issue. Same type of freeze is visible in all other nodes too. gc buffer busy is a side effect of DRM freeze (not always, but in this case).

Essentially, an object was remastered after excessive BL locking requests ( in a loose term accesses) on that index.

undo and affinity

Mastering of Undo segments differ from non-undo segment mastering. With non-undo segments, all the blocks are mastered by a hash technique spreading mastership among instances for a segment. Only after an instance opens BL locks aggressively on a segment that segment is mastered. But, for undo segments, Instance that activates an undo segment masters the segment immediately. This makes sense, since that undo segment will be used by the instance opening the segment in most cases. Parameter _gc_undo_affinity controls whether this dynamic undo remastering is enabled or not.

Since undo segments do not have real object_ids, a dummy object_ids over a value of 4294950912 is used. For example, undo segment 1 (with usn=1) will have an object_id of 4294950913, usn=2 will have object_id of 4294950914 etc.
[ 4294950912 = power(2,32) – power (2,14) = xFFFFC000 ]

I was not successful in triggering another undo segment remastering event. I created one active transaction generating 200K undo blocks in one node, another node was reading that table and I can see enormous waits for those undo blocks. But, I didn’t see any remastering events related to that undo segment. Not sure why it did not work, may be the conditions for the undo segment remastering is different.

[ PS: I am able to manually remaster the undo segment using lkdebug command discussed below: So, code must be remastering the undo segments automatically too, but may be some other conditions must be met.

I am not preaching that you should modify these undocumented parameters. Far from it. Understand the parameters, if you run in to wait events such as ‘gc remaster’, ‘gcs freeze for instance reconfiguration’, understand whether the default values are quite low. Work with support and see if this can be tuned.

In summary, remastering is a great feature. It is a pity that some times, we fall victims of the side effects. So, if you run in to issues with remastering, don’t disable it, but see if you can tune those parameter upwards so as to control the remastering events. If you stil want to disable DRM completely, I would recommend setting _gc_affinity_limit and _gc_affinity_minimum to much higher value, say 10Million. Setting the parameter _gc_affinity_time to 0 will completely disable DRM, but that also means that you can not manually remaster objects. Further, Arup mentioned that x$object_affinity_statistics is not maintained if DRM is disabled.

Again, these are undocumented parameters. Before you change these parameters make sure that Oracle Support agrees with you.

Update 1:

From 11g onwards, affinity management renamed to policy management. For example, x$object_affinity_statistics is renamed to x$object_policy_statistics. Similarly, initialization parameters are renamed too:Parameter _gc_affinity_limit is renamed to _gc_policy_limit; Parameter _gc_affinity_time is renamed to _gc_policy_time; A new view v$policy_history is introduced and all rows with policy_event = ‘initiate_affinity’ are associated with DRM events.
Other details about the blog still holds good, except that by default _gc_policy_limit is lowered to 1500 which would mean that, by theory, 11g may have more DRM events. YMMV.

[ Many Thanks Arup Nanda and Michael Möller (aka “M2”) for reviewing this blog entry, they contributed heavily to some of my discussion. But, any mistakes in this document is solely of mine. ]
This blog is available in a pdf format from Dynamic_remastering_RAC_orainternals.

I encountered few customer issues centered around library cache lock and library cache pin waits. Library cache lock and pin waits can hang instance, and in few cases, whole clusters of RAC instances can be hung due to library cache lock and pin waits.

Why Library cache locks are needed?

Library cache locks aka parse locks are needed to maintain dependency mechanism between objects and their dependent objects like SQL etc. For example, if an object definition need to be modified or if parse locks are to be broken, then dependent objects objects must be invalidated. This dependency is maintained using library cache locks. For example, if a column is dropped from a table then all SQLs dependent upon the table must be invalidated and reparsed during next access to that object. Library cache locks are designed to implement this tracking mechanism.

In a regular enqueue locking scenarios there is a resource (example TM table level lock) and sessions enqueue to lock that resource. More discussion on enqueue locking can be found in Internal of locks. Similarly, library cache locks uses object handles as resource structures and locks are taken on that resource. If the resources are not available in a compatible mode, then sessions must wait for library cache objects to be available.

Why Library cache pins are needed?

Library cache pins deals with current execution of dependent objects. For example, an underlying objects should not be modified when a session is executing or accessing a dependent object. So, before parse locks on a library cache object can be broken, library cache pins must be acquired in an Exclusive mode and then only library cache objects can be dropped. If a session is executing a stored object, then the library cache pins will not be available and there will be waits for library cache pins. Typically, this happens for a long running statement executing a stored object.

x$kgllk, x$kglpn and x$kglob

Library cache locks and pins are externalized in three x$ tables. x$kgllk is externalizing all locking structures on an object. Entries in x$kglob acts as a resource structure. x$kglpn is externalizing all library cache pins.

x$kglob.kglhdadr acts as a pointer to the resource structure. Presumably, kglhdadr stands KGL handle address. x$kgllk acts as a lock structure and x$kgllk.kgllkhdl points to x$kglob.kglhdadr. Also, x$kglpn acts as a pin stucture and x$kglpn.kglpnhdl points to x$kglob.kglhdadr to pin a resource. To give an analogy between object locking scenarios, x$kglob acts as resource structure and x$kgllk acts as lock structures for library cache locks. For library cache pins, x$kglpn acts as pin structure. x$kglpn also pins that resource using kglpnhdl. This might be clear after reviewing the example below.

Test case

We will create a simple test case to create library cache locks and pin waits

In this test case above, we create a procedure and it accepts two boolean parameters: sleep and compile. Passing true to first argument will enable the procedure to sleep for a minute and passing true for the second argument will enable the procedure to recompile itself.

Let’s create two sessions in the database and then execute them as below.

Well, no surprise there. New session 442 also waiting for library cache pin. But, notice the request mode for session 442. It is 2. Session 442 needs that library cache pin in share mode to start execution. But 313 has already requested that library cache pin in mode 3. A queue is building up here. Many processes can queue behind session 313 at this point leading to an hung instance.

Wait, what happened to session #2? It is not visible in x$kglpn. Querying v$session_wait shows that Session #2 is waiting for library cache lock. We will run yet another query against x$kgllk to see library cache lock waits.

Session 313 is holding library cache lock on that object in mode 3 and session 268 is requesting lock on that library cache object in mode 2. So, session 268 is waiting for library cache lock while session 313 is waiting for library cache pin (self ). Again, point here is that session 268 is trying to access library cache object and need to acquire library cache lock in correct mode. That library cache lock is not available leading to a wait.

Things are different in RAC. Library cache locks and pins are global resources controlled by GES layer. So, these scripts might not work if these library cache lock and pin waits are global events. Let’s look at what happens in a RAC environment

Essentially, this script does not work in a RAC environment since it accesses x$ tables directly, which are local to an instance. To understand the issue in a RAC environment we need to access gv$ views, based on x$kgllk, x$kglpn etc. But, I don’t see gv$ views over these x$ tables. We are out of luck there unless we do some more coding.

Nevertheless, we can see lockers and waiters accessing gv$ges_blocking_enqneue to understand locking in RAC.

We can see that PID 12741 from instance 1 is holding a library cache global lock [LB]. Global resource in this case is [0x45993b44][0x3a1b9eee],[LB] which uniquely identifies a library cache object at the cluster level. Grant_level is KJUSEREX or Exclusive level and request_level from node 2 is KJUSERPR which is Protected Read level. PID 8700 in node 2 is waiting for library cache lock held by PID 12741 in node1. Using this output and our script output, we can pin point which process is holding library cache lock or pin. While Library cache locks are globalized as global locks in the range of [LA] – [LZ], Library cache pins are also globalized as lock types in the range [NA]-[NZ].

I presented about Cost based optimizer explaining why some times CBO chose inefficient access plan, even though, there is an efficient plan in the search space. This entry is to post presentation slides and they can be downloaded from Why_optimizer_hates_my_sql

My client DBA wrote a small script: To drop a column to a busy table. This application is mostly an OLTP application and table is modified heavily. It is hard to get a lock on the table, even though DDL to drop the column will be quick completing within sub-seconds. Following script was tried in the middle of night so as to minimize production impact.

CPU usage was very high in production database server in user% mode. Dynamic performance view v$session_wait indicated excessive waits for latch contention. Output from a script wait_details.sql (scripts available at the end of this post) shows that many sessions were waiting for ‘latch free’ event. Also, address for these latch children are the same, meaning all these sessions are trying to access one latch children.

We can identify SQL causing latch contention querying v$session_wait. From the output below, SQL with hash_value 1509082258 is suspicious since there are many sessions executing that SQL and waiting / waited recently for ‘latch free’ event.

Average wait time for ‘log file sync’ event is 148ms and it is 5ms for ‘log file parallel write’ event. There is an oddity here. Wait time for ‘log file sync’ event seems bit excessive compared to ‘log file parallel write’ event. Ratio between number of waits for ‘log file sync’ and ‘log file parallel write’ is approximately 4. In an average case scenario, if each commit resulted in 4 log writes, then I would expect an Average wait time of 20ms for ‘log file sync’ event. Wait time of 148ms? Needs further analysis.

pstack, pmap and more

Database is running in a Solaris 10 server and database version 9.2.0.8. Since this is a Solaris operating environment, we can use proc utilities to debug this further. We used a script to dump pstack and pmap of LGWR in a loop. This wasn’t much help, but we could guess that LGWR was waiting for I/O since there were many lines in the stack with kaio and aio wait calls. Few lines from pstack output printed below.

I/O statistics did not provide much clarity either. We need to find if LGWR is suffering from a performance issue. To see if LGWR is suffering from any OS related issues, we need to trace system calls from LGWR and Truss utility provides such a facility. Suffering from truss related paranoia, we didn’t want to run truss against LGWR since that can affect performance, more importantly database stability.

Fortunately, dtruss came handy. dtruss is based upon dtrace utility and by design, dtrace is safe in Solaris Sparc platform.

Recently, we resolved a performance issue with one of our esteemed clients. Problem was interesting and worth blogging about it.

Problem
An application makes 300 static connections to database DB1 in database server, say CLNTDB server. Application relies on database link and over a period of time, each session executes SQL through a database link, creating a connection in central database PROD1. So, there are 300 connections in PROD1 database coming from DB1 database through database links. Performance is fine during normal operation.

Problem starts when the application is shutdown. Shutting down the application in DB1, creates massive CPU consumption in PROD1DB server. Unfortunately, this spike in CPU usage lasts for 5-10 seconds and causes ASM heartbeat to fail. Considering that PROD1 is a central database and this has a global effect on application(s) functionality. See presentation below for graphical representation of this problem.

Symptoms
Looking at symptoms in detail, it is visible that CPU usage in sys mode. It is not uncommon to see high CPU usage during a storm of disconnects. But, this specific CPU usage is much higher and uses all CPUs in kernel mode. If this is a problem due to process destruction, then this will show up in CLNTDB server too as there are 300 disconnects there too. But, this problem manifests only in PROD1DB server.

mpstat output from PROD1DB server shows that CPU usage in %sys mode. Notice the numbers below under sys column. almost all CPUs are used in sys mode. We need to drill down further to understand why this much CPUs are used in sys mode. It is also visible that no other columns have any abnormal higher values.

Test case
Let’s analyze this issue for a single connection to simplify the problem. Created a test user in DB1 database, created a private database link from DB1 to PROD1 and then executed a select statement over that database link.

At this point, a database link connection has been created in PROD1 database. Querying v$session in PROD1 database to find session created for database link. SID 4306 is the session created for our test connection from db1 database.

Associating this session to v$process, found UNIX process for this session. Now, we will start truss on this UNIX process( Solaris platform).

truss -p -d -o /tmp/truss.log

In DB1 executed: exit

This will disconnect session in PROD1 database.

Truss output
Truss output contains activity generated for disconnect in PROD1 server. Reading through truss output, we can see that system call (shmdt) consumed CPU time. Each shmdt call consumes approximately 24ms from output below (18.5053-18.4807=0.0242). We are using -d flag as above to print time spent in that call. Of course, this is a system call and so this CPU usage will be in kernel mode.

So, one disconnect executes 10 system calls and consumes approximately 0.24 CPU seconds in kernel mode. Shmdt calls are used to detach from shared memory segment. Since there are 10 shared memory segments ( as visible in ipcs -ma), 10 shmdt calls are executed per session disconnect.

Projecting this calculation for 300 connections, CPU consumption will be for approximately 72 seconds in total. With 16 CPUs, at least, 5-6 seconds will be used in kernel mode, assuming linear projections[ But, in practice, due to mutex calls and such this may not be linear and will be over 72 seconds]. This is matching with our observation: 5-10 seconds of CPU consumption in kernel mode. First thing, we need to do is reduce # of shmdt calls. One way is to reduce # of shared memory segments.

Shared memory segments
We thought that SHMMAX kernel parameter is limiting since SGA size was bigger than SHMMAX size. After changing SHMMAX parameter, restarting server and such, still many shared memory segments were created.

We had a discussion about import performance in this otn forum . During the discussion, OP raised a doubt that import will resort to single row inserts for tables with date columns. Buffer parameter specifies, essentially, size of this array for array inserts.

We know that if a table has lob columns, then import parameter buffer is not honored and import utility resorts to single row inserts for those tables. But, claim here is, tables with date columns also suffers from single row inserts. We will probe this further and validate that claim in this blog.

Above code fragment created a table, inserted 300,000 rows and exported that table to an export dump file. This dump file is ready to be imported. But, we need to trace the import to measure the effect of buffer parameter. Problem is that how to we trace import session alone, without generating every session in the database? This can be achieved by creating a logon trigger as below. Only sessions from a test user will have trace enabled from this trigger (username is cbqt).

REM I could potentially , use "on schema clause too, but this is part of a generic code that I use.
REM Riyaj Shamsudeen - To trace a session through logon trigger
create or replace trigger
set_system_event
after logon on database
declare
v_user dba_users.username%TYPE:=user;
sql_stmt1 varchar2(256) :='alter session set events '||chr(39)||'10046 trace name context forever, level 12'||chr(39);
begin
if (v_user = 'CBQT') THEN
execute immediate sql_stmt1;
end if;
end;
/

Let’s drop the table, import with a default buffer size of 64K. Through logon trigger a new sql trace file will be generated. That trace file will be analyzed with tkprof utility as shown in the code fragment below: