Before proceeding though, please read this post about the safety of different stack sampling approaches.

I have had a few non-trivial Oracle troubleshooting cases, related to query hangs and bad performance, where I’ve wanted to know where exactly in the execution plan the current execution is.
Remember, Oracle is just another program executing instructions grouped into functions on your server, so stack sampling can help out here as well.
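Capturing such samples is trivial: just loop over pstack while the query runs. A minimal sketch (the script name, process ID, sample count and interval are placeholders you would adjust):

    #!/bin/bash
    # stacksample.sh - take a handful of stack samples from a running
    # Oracle server process.
    # Usage: ./stacksample.sh <ospid>   (OS PID of the shadow process)
    PID=$1
    for i in 1 2 3 4 5; do   # 5 samples, 1 second apart
        pstack "$PID"
        sleep 1
    done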

So, I was looking into the following stack trace, taken from an Oracle 10.1 database on Solaris SPARC running a SQL statement with this execution plan.

First, by now it should be obvious that in Oracle, each rowsource operator (the different lines you see in SQL execution plans) is actually just a function inside the Oracle kernel. These are the rowsource functions, starting with “qer”; QER stands for Query Execution Rowsource as far as I know.

Whenever an Oracle process fetches data from a cursor, it calls opifch2(), which in turn calls the root rowsource function of the execution plan. In my case that function was qersoFetch, and my os_explain script just substituted the “qerso” part with SORT (as per the Metalink note I mentioned above). The first child function of qersoFetch was qerhjFetch, which is a hash join rowsource, and so on. Note that os_explain prefixes some lines with an asterisk (*); this indicates that the output of the given function is in turn filtered by a filter operation (the same filter operations you normally see at the bottom of DBMS_XPLAN output).
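To give an idea of the translation, the core of it is just sed substitutions along these lines (a simplified sketch; the real os_explain handles many more prefixes plus the tree indentation):

    # Translate a few rowsource function name prefixes into plan operation
    # names. Illustrative only; see the real os_explain for the full mapping.
    pstack "$PID" | sed \
        -e 's/qerso[A-Za-z]*/SORT/'               \
        -e 's/qerhj[A-Za-z]*/HASH JOIN/'          \
        -e 's/qerjot[A-Za-z]*/NESTED LOOPS JOIN/' \
        -e 's/qertb[A-Za-z]*/TABLE ACCESS/'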

So, logically you can imagine an execution plan as a tree of Oracle functions:

For a SELECT query the OPI fetch function (opifch2) would be the root.

Various join and union functions like qerhjFetch (HASH JOIN) and qerjotFetch (NESTED LOOPS JOIN) would be branches.
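For the example above, such a tree would look something like this (an illustrative sketch, not tool output; the leaf function is an assumed table access):

    opifch2                <- OPI fetch call, the root
      qersoFetch           <- SORT
        qerhjFetch         <- HASH JOIN
          qertbFetch       <- TABLE ACCESS, a leaf fetching the actual rows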

But physically, an execution plan is just a memory structure in subheap 6 of a child cursor (x$kglcursor.kglobhd6), which contains a bunch of rowsource function opcodes.
During plan execution the Oracle process “just” traverses those opcodes, looks up the corresponding rowsource function’s starting address in a lookup table and calls it. That function does its work (possibly calling other rowsource functions recursively) and returns to its caller.

Note that many rowsource functions are designed to be cascading: they do only the work needed to return a small subset of rows, a few rows at a time, as opposed to producing the whole resultset.
This is a very good thing, as rows can be cascaded, or pipelined, back to parent functions as they become available. For example, a table fetch only fetches a handful of rows at a time (not the whole table) and returns these “up” for further processing. Likewise, a nested loop join can pass matching rows “up” from the moment the first matches are found; again, there’s no need to join the full dataset before returning the first rows.
This also means that there is no need to store the whole intermediate resultset somewhere in memory before passing it up to the parent function; instead we just revisit that branch of the execution plan tree whenever we need more rows from it. And the os_explain script shows you exactly which branch of the execution plan is currently executing.

Addition: I will elaborate on how to match the execution plan with the stack trace in an upcoming post; it’s too much material for an introductory post.

So, cascading rowsources allow us to execute plans involving large datasets “incrementally”, without needing to keep the intermediate resultsets in memory. On the other hand, a few rowsource operators in your execution plan, like SORT, cannot return any rows up before all of their children’s rows are processed. With SORT (and aggregate operations, which also use SORT) you just have to process all the source rows before returning any meaningful result. You can’t order only half of the source data and start returning rows, hoping that none of the remaining rows would have sorted ahead of them. This is where SQL cursor workareas come into play for such operations.
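You can feel the difference with any large table: a pipelined plan returns its first rows immediately, while a sort has to consume everything first. A hypothetical illustration (the credentials, big_table and some_col are placeholders):

    sqlplus -s scott/tiger <<'EOF'
    -- Pipelined: rows are returned as soon as they are fetched.
    select * from big_table where rownum <= 10;

    -- Blocking: even a top-N query must feed ALL rows through the SORT
    -- before the first (smallest) ten can be returned.
    select * from (select * from big_table order by some_col)
    where rownum <= 10;
    EOF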

SORT, HASH and BITMAP rowsources can allocate SQL workareas for themselves, while others can’t. This can easily be identified from the execution plan statistics of the following sample query:
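A representative example (the credentials and tables t1 and t2 are placeholders; any statement with a HASH JOIN shows the same pattern):

    sqlplus -s scott/tiger <<'EOF'
    -- gather_plan_statistics enables rowsource statistics for this cursor.
    select /*+ gather_plan_statistics leading(a b) use_hash(b) */ count(*)
    from   t1 a, t2 b
    where  a.id = b.id;

    -- The ALLSTATS LAST format adds the OMem/1Mem/Used-Mem workarea columns.
    select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));
    EOF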

You see how the HASH operations do have their memory columns (OMem/1Mem) populated, therefore those rowsource functions did allocate a SQL workarea for themselves. Others, like NESTED LOOPS joins and data access rowsources, did not have any SQL workarea memory allocated, as they are completely cascading.

Note that os_explain can also read the stack from STDIN, as seen in the example below. Also, the -a option tells os_explain to show all functions in the stack, not only the execution plan ones and their children.
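The invocation itself looks like this (the PID is a placeholder):

    # Pipe a live stack into os_explain...
    pstack 12345 | ./os_explain

    # ...or show all functions in the stack, not just the plan-related ones:
    pstack 12345 | ./os_explain -a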

The FILTER rowsources (not talking about filter predicates on normal rowsources here) can make things more complicated though, as they can introduce their own logic and loops into the execution plan (for running correlated subqueries, etc.).

By the way, see what happens when I run exactly the same query with rowsource-level statistics collection enabled:
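For reference, enabling it for your own session is as simple as this (credentials are placeholders):

    sqlplus -s scott/tiger <<'EOF'
    -- Enable rowsource-level statistics (rows, LIOs, timing) collection
    -- for all statements parsed in this session.
    alter session set statistics_level = all;
    EOF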

Every rowsource is wrapped into a QUERY EXECUTION STATISTICS wrapper, whose task is just to count the number of rows sent “up” to parents in the tree and the logical IOs, and to record rowsource timing info whenever an internal counter (set by the _rowsource_statistics_sampfreq parameter) wraps.

This is just an intro to Oracle execution plan internals and troubleshooting. Hopefully you don’t need this technique too often; however, it has helped me successfully pinpoint the root cause in a few non-trivial database problems.

Note that in Oracle 11g there’s an excellent new feature called Real-Time SQL Monitoring. It allows you to monitor the progress of currently running SQL statements. For serially running statements the monitoring kicks in after a statement has used a total of 5 seconds of CPU or IO time (this threshold is controlled by the _sqlmon_threshold parameter; for PX the monitoring is always enabled). After that you can query V$SQL_MONITOR and V$SQL_PLAN_MONITOR to see how much time that session has spent, and how many rows and LIOs it has processed, while executing a statement. You can see these details even at the SQL execution plan line level. Alternatively, you can use the DBMS_SQLTUNE.REPORT_SQL_MONITOR function to get this info nicely formatted. Greg Rahn has written a good blog entry about it.
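For example, something like this (the credentials and sql_id are made-up placeholders):

    sqlplus -s scott/tiger <<'EOF'
    -- Activity broken down by execution plan line:
    select plan_line_id, plan_operation, output_rows
    from   v$sql_plan_monitor
    where  sql_id = 'abcd1234efgh5'            -- placeholder sql_id
    order  by plan_line_id;

    -- Or the nicely formatted report:
    set long 1000000 pagesize 0
    select dbms_sqltune.report_sql_monitor(sql_id => 'abcd1234efgh5',
                                           type   => 'TEXT')
    from   dual;
    EOF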

However, it’s important to note that using both DBMS_SQLTUNE.REPORT_SQL_MONITOR and V$SQL_MONITOR/V$SQL_PLAN_MONITOR requires the Oracle Tuning Pack license, which in turn requires the Oracle Diagnostics Pack license. Details are in the Oracle 11g Licensing Guide.

So another, lower-level approach for real-time monitoring will still be handy even after 11g becomes mainstream. In a future post I will show how to measure the progress and get an execution profile of your plan by aggregating multiple stack traces, and also some cool opportunities with DTrace.

os_explain is just a shell script using sed & awk. It does nothing more than some string manipulation, search & replace, etc.

DTrace allows accounting of timing information for each rowsource function. On other platforms this could be doable with platform profilers (like OProfile on Linux).
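A minimal sketch of the idea using the DTrace pid provider, here just counting calls into rowsource functions (the PID is a placeholder; proper timing would pair entry/return probes with timestamps):

    # Count calls into rowsource (qer*) functions of a running Oracle
    # process for 10 seconds; run as root on Solaris 10+.
    dtrace -p 12345 -n '
        pid$target:a.out:qer*:entry { @calls[probefunc] = count(); }
        tick-10s                    { exit(0); }
    '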

However, the script fulfills its original purpose as it is right now: I wanted to see where in the execution plan the execution was mostly looping, and taking a few os_explain samples showed it to me. I have yet to write a post about how to map os_explain output and v$sql_plan execution plan lines together…

I’ve been using dbms_xplan.display_cursor quite a bit lately. I love the information it provides but I’m having trouble finding any detailed documentation on it. I’m specifically looking for more information on what the Omem and 1mem columns represent. Can you shed any light on the subject or point me in the right direction?

Well, os_explain gives some indication. If you happen to have any physical IOs, you can map these (using v$session_wait.p1/p2) back to the segment name accessed. And if you know the segment name, you’ll know the execution plan line(s) currently being executed, as access path rowsources show the table/index name in the execution plan as well.
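For db file sequential/scattered read waits, p1 is the file# and p2 the block#, so the classic lookup does the mapping (the literals are placeholders taken from v$session_wait; connect as a user who can see DBA_EXTENTS):

    sqlplus -s scott/tiger <<'EOF'
    -- Map a file#/block# pair back to the segment being read.
    select owner, segment_name, segment_type
    from   dba_extents
    where  file_id = 4                                       -- p1 (placeholder)
    and    1234 between block_id and block_id + blocks - 1;  -- p2 (placeholder)
    EOF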

More options would be to trace consistent reads (with event 10200) and map the data block addresses back to segment names, but this is not a widely used technique (and not one for production).
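For the record, the event is set like this (test systems only; credentials are placeholders):

    sqlplus -s scott/tiger <<'EOF'
    -- Dump info about every consistent read to the session trace file.
    -- Very verbose; do not use in production.
    alter session set events '10200 trace name context forever, level 1';
    EOF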

I would start with pstack, as it’s possible to translate its output back to the execution plan.

If you can still reproduce this, send me a couple of process stacks taken while this query is executing, together with the execution plan, and I can reply with the answer and an explanation of how I worked it out.

I thought I’d report a small glitch I encountered when working with os_explain. It appears that on RHEL 4 the version of ksh that comes installed by default is constrained by the old 1024-element array size limit, which means that having

MAX_ARRAY=4095

inside the f_lifo procedure causes the script to fail with a “subscript out of range” error.

There are a couple of workarounds…

1. (easiest) Change the shell invocation at the top of the script to bash: your script runs fine in bash, and bash does not have such a tiny array upper bound:

#!/bin/bash

2. Upgrade to a modern version of ksh

Also, I noticed that the version of ksh installed with RHEL 5 is newer and does not suffer from this limitation.

This is not a problem with your script per se, but I thought I’d offer this in case anyone else tries using your script on Red Hat Linux and is baffled like I was.

BTW, thanks for bringing your training seminar to the NYOUG earlier this month! It really took my understanding to a new level.

The newer ksh works fine in RHEL 4, but installing it involves upgrading various packages like gcc, glibc, libgcc, etc. to their RHEL 5 versions. I’d hesitate to do that in a production environment, needless to say. Easier just to use bash.

I am curious – I see from your examples and from my own tests that pstack only captures the current stack. You mention that you have a loop to cycle through while the query is running. This raises a couple of questions:
1) A loop is going to, effectively, sample the process stack. Correct?
2) What would it take to stitch together stack traces for os_explain? Not sure if just deleting the “header” lines would do it.

In this script (extracting stack traces) truss can’t be used at all; as far as I know, truss can’t extract snapshots of stack traces like pstack does.

But comparing truss’s a.out:func* notation with DTrace, I would always use DTrace if it’s available and usable. DTrace is safe by design (although you *can* cause trouble with it even in non-destructive mode), but truss has had problems in the past IIRC, such as target processes missing signals sent to them and failing to exit.