What Degree Of Parallelism (DOP) was used for the execution of statement XYZ?

Anyone seriously using the Enterprise Edition Parallel Execution feature should consider upgrading to Oracle 11.2 and getting the Diagnostics and Tuning Pack license, then you might be able to answer some of these questions easily.

Oracle has significantly enhanced the Real-Time SQL Monitoring feature in 11.2, which was introduced in 11.1.

But there are some problems, even with the latest versions of Real-Time SQL Monitoring, one of them being simply that the reports are not persisted by default and they might no longer be available when you need them. I'll come back to this later.

When monitoring Parallel Execution, in particular the "ACTIVE" or "EM" version of the Real-Time SQL Monitoring offers invaluable insights into the SQL execution details.

On top of the general information provided and "Plan Statistics" tab that shows execution details on execution plan line level, the "Parallel", "Activity" and "Metrics" tabs allow gaining a lot of insights that weren't available in previous versions.

Here are some example Real-Time SQL Monitoring screen shots of a Parallel Execution (you can also simply access the active report version using this link):

General Information

So we can see this statement is completed, was a cross-instance Parallel Execution with two instances each running a DOP of (maximum) 16 (shown as DOP 32 in Real-Time SQL Monitoring). We can also see that we have a larger DB Time than Wall-Clock time, so it was a true Parallel Execution, but we probably would expect a much larger DB Time / Wall-Clock Time ratio given the Parallel Degree shown - a first indication that something might not have worked out as expected.

Furthermore we can see that this statement didn't perform any physical I/O, everything came from cache, which looks like an In-Memory Parallel Execution (it was, but not using the new 11.2 In-Memory Parallel Execution feature, but "hand-made")

Plan Statistics

We can see here that three operations of the plan were sampled most, in this case always on CPU.

Parallel

Here we can see various things:

- The work wasn't distributed evenly among the involved RAC nodes

- The work wasn't distributed evenly among the Parallel Slaves, one Parallel Slave had to work much more than the others => Parallel Execution skew problem

Activity

Here it can be seen that the "Parallel Execution skew" problem that could already be identified in the previous tab really had a significant impact on the SQL execution: There is only a short peak with the expected Parallel activity and a very long tail of execution where only a single CPU was active.

Metrics

This tab confirms what we've seen so far, but in this particular case doesn't add more information, but it definitely could in other cases (PGA / TEMP consumption, I/O requests and throughput analysis for example)

So as you can see the Real-Time SQL Monitoring offers great insights for Parallel Execution, but there are - at least in the currently available versions - also some shortcomings:

In a busy system the Real-Time SQL Monitoring information can age out quite quickly out of the Shared Pool, although Oracle tries to keep it as long as possible after the actual execution finished. You could work around this using a custom job that somehow persists the SQL Monitoring information, but that feature is not there out of the box. Note: You could also change the undocumented retention setting if you want to keep them longer. See Doug Burns' post for more information

For cross-instance Parallel Execution the "Activity" and "Metrics" tabs don't group the information provided by instance, and therefore you might not be able to tell if one of the involved instances was overloaded in terms of CPU usage, memory or similar

By default Real-Time SQL Monitoring doesn't monitor plans that have more than 300 lines. This can be changed by modifying undocumented parameters. See Doug Burns' post about this for more information

The Real-Time SQL Monitoring lacks precision in some regards, for example it only reports the maximum DOP found if an execution plan consists of multiple Parallelizers (also called "Data Flow Operations", DFOs). Each DFO gets its own DOP determined (and in principle will have its own Parallel Slave Sets assigned at runtime).

You have to look very closely at the number of executions on plan line level (only applicable to some operations depending on the shape of the execution plan) or have to navigate to the "Parallel" tab and check the multiple "Parallel Group" details how many Parallel Slaves were allocated per DFO to get more details about that, but even then you cannot easily relate this information to the execution plan, which means you don't know which part of the execution plan was executed with what DOP.

Here is an Real-Time SQL Monitoring report for an execution plan consisting of multiple DFOs. Can you easily tell which part of the execution plan ran with what DOP?

If - as in the first example above - Parallel Execution skew causes problems, then unfortunately Real-Time SQL Monitoring lacks the details which operations of the execution plan exactly are/were impacted by the skew, which would allow gaining a better understanding what caused the skew and how the skew problem could be addressed - the basis for a systematic troubleshooting approach.

Look at the various pictures from the report shown at the beginning - where do you think the skew happens and which operations are affected?

Very likely you would turn to the operation that is responsible for the majority of the DB time (The PX SEND HASH operation in the lower part of the execution plan). This approach is in principle a good idea, but in this case leads you into the wrong direction.

And the example is a very simple three-table join where you might be able to work through the different joins manually (tedious job and not always possible with transient data) and find out where things might go wrong, but imagine this was your real-life 150+ lines Parallel Execution plan including dozens of joins?

In short, it is not easy to tell from the Real-Time SQL Monitoring report. You can only tell that some Parallel Slaves have to perform more work than others, but it's not that simple to get more details about the problem for systematic troubleshooting.

Note that when looking at the "raw" (XML) output of the Real-Time SQL Monitoring and the underlying queries used to populate the XML it can be seen that the skew problem is already mentioned there but the information seems to be incomplete and/or is not displayed/processed yet in the report frontend code. May be this will change in a future release.

In order to address above shortcomings I've developed a SQL*Plus script that provides information about the SQL execution based on the Active Session History (ASH). Although that information is only based on sampling active sessions it still can provide a wealth of information, in particular for transient problems that already happened, are potentially hard to reproduce and you don't have a Real-Time SQL Monitoring (anymore) report at hand respectively you face one of the problems described that are not (yet) handled that well by the Real-Time SQL Monitoring.

Note that Oracle 10g already introduced ASH, but in 10g the ASH data lacks the relation to a particular SQL execution and the SQL execution plan lines, therefore the tool only processes ASH data from version 11g on. The information not related to execution plan lines could be extracted from 10g ASH (or free ASH implementations) but it would require a slightly different approach how to restrict the ASH data processed (for example, start sample time / end sample time instead of the 11g approach of identifying the data corresponding to an execution by SQL_EXEC_START / SQL_EXEC_ID).

Furthermore it should be mentioned that ASH requires at least the Enterprise Edition plus Diagnostic Pack license. For Real-Time SQL Monitoring you need the Tuning Pack license in addition.

Let's have a look what the tool provides about an SQL execution based on the ASH data. I'll refer the same SQL executions that were monitored above.

General Information

The first part of the output is general information so that you can verify that you're really looking at the execution that you want to analyse, and which options the tool uses:

Note it can be controlled via options which of the following output should be shown.

ASH Summary

The next part is a high-level summary of the ASH data found for that execution, including the number instances involved (for RAC cross-instance Parallel Execution), whether the execution seems to be active or inactive (which is a good indicator that it is completed or still executing), the assumed duration of the execution, the number of samples found, how many of them were CPU samples, the number of Parallel Slaves found, and the resulting number of Average Active Sessions. This last two bits of information gives already a rough idea about the actual Parallel Execution degree.

RAC Instance Summary

If a cross-instance Parallel Execution was detected, a similar overview on Instance Level will be provided, including an indication of the time the instance was active during the execution - it might happen that an instance was only involved partially in a Parallel Execution depending on the execution plan.

Here we can see that one instance was only active for a short period of time.

DFO Summary

If a Parallel Execution was detected and a corresponding execution plan could be found (from the Shared Pool, or the AWR workload repository, depending on the options used), then there will another summary information per Instance / DFO.

This information is part of the attempt addressing above shortcomings - it will show the number of Parallel Slaves found per DFO and based on the execution plan will try to perform an educated guess about the actual DOP used. Of course this can be incorrect if not all of the Parallel Slaves involved show up in the ASH samples, so this information has to be treated as what it is: a guess.

It will also show a graphical representation of the time the DFO / Instance combination was active relative to the full execution time, which tells another important point: If there were multiple DFOs involved, were they active at the same time, or after each other? In the former case, you can end up with a Parallel Execution that requires more Parallel Slaves than potentially expected (because each DFO gets its own set of Parallel Slaves), in the latter case Oracle usually is clever enough to re-use the same Parallel Slave processes for the different DFO's slave sets.

Here is a sample output for an execution where multiple DFOs were active at the same time:

Here is a sample output where multiple DFOs were active, but not at the same time. It is actually the corresponding output for above Real-Time SQL Monitoring report for the statement consisting of multiple DFOs that ran with different DOPs. It shows clearly which DFO used which DOP:

Average Active Session Graph

If a Parallel Execution was detected the "Average Active Session Graph" follows (if not disabled via the options). This is basically a cheap imitation of the "Activity" tab from the Real-Time SQL Monitoring, from 11.2 on mixed with some information from the "Metrics" tab, since from 11.2 ASH also includes PGA and TEMP figures per sample.

The graph shows the Average Active Sessions in a configurable number of buckets (100 by default), which means that an execution that has more samples than buckets will be compressed and averaged accordingly.

The graph shows CPU and non-CPU samples separately, furthermore the graph is always on instancelevel for cross-instance Parallel Execution, since it is important to see if there is a particular node in trouble. For example the overall memory or CPU usage might be reasonable for a cross-instance Parallel Execution, but you can't tell from such an cross-instance summary if one of the instances actually had to perform the majority of the work. As already mentioned above, this is again something that is not that easy to tell from a Real-Time SQL Monitoring report.

Here we can see that the graph in principle corresponds to the "Activity" tab from the Real-Time SQL Monitoring report and confirms that most of the time only a single CPU was active, with only a short spike of CPU activity at the beginning of the execution.

We can also see that the second instance was not involved in the execution most of the time.

Execution Plan ASH details

If an execution plan could be found by the DBMS_XPLAN.DISPLAY_CURSOR / AWR function used, then the ASH data will be shown along the execution plan per operation.

The following additional columns can be "injected" into the original DBMS_XPLAN output, using the same technique that was already used in the previous prototype implementation.

Note that the prototype extended Rowsource Statistics reporting functionality is still part of the script and can be combined with the ASH reporting.

Act: Indicates the most recent active plan lines according to ASH (only applicable if the statement is currently executing)

Start: Show the number of seconds since statement start the plan line was active the first time

Dur: Show the number of seconds the plan line is/was active

Time Active Graph: Show a graphical representation of the activity timeline of the plan line

Parallel Distribution ASH: Show the Top N processes names along with the number of samples found per SQL plan line id. A trailing "..." indicates that there are more than N processes found

Parallel Distribution Graph ASH: Show the distribution of the top processes names either relative to the number of samples per SQL plan line id or to the total number of samples. A trailing "..." indicates more processes than displayed

Activity Graph ASH: Show a graphical representation of the number of samples against that SQL plan line id relative to the total number of samples

Top N Activity ASH: Show the Top N activities (waits or ON CPU) from ASH for that SQL plan line id. A trailing "..." indicates that there are more than N activities found

Note that it can be configured which of those columns should be shown (or hidden).

Note in particular that the "Activity Graph ASH" roughly corresponds to the "Activity" columns of the "Plan Statistics" tab in Real-Time SQL Monitoring, but the "Parallel Distribution" columns along with the "Time Active Graph" column make it very obvious which parts of the execution plan are affected by (this type of) Parallel Execution skew and where the time is lost. This is critical information in understanding for systematic troubleshooting and what could be changed to address the problem.

This information usually is only available by looking at the dynamic performance view V$PQ_TQSTAT. But this view has some serious limitations:

It is only available from within the Query Coordinator session after statement execution. It cannot be queried from outside. It therefore requires you to reproduce the issue separately in a session which might be hard (sometimes impossible due to transient data / other run time effects) and troublesome.

It is only populated after a successful completion of the statement. In case of errors (running out of TEMP space for example) or cancelled queries due to long execution times it won't be populated.

Depending on the version of Oracle and the execution plan shape (multiple DFOs, Parallel DML/DDL) the output might be incomplete and misleading.

Here is a snippet from the V$PQ_TQSTAT output for this particular statement execution where you can easily tell that a particular re-distribution of the data caused the skew:

Historic ASH

Since the major use case of the tool is post-execution analysis, the same information can be pulled from the historic ASH information persisted on disk and available from the DBA_HIST_ACTIVE_SESS_HISTORY / DBA_HIST_SQL_PLAN views.

Although Oracle by default only writes every 10th sample to disk and hence we end up with samples of sampling, the output can still be helpful. As an example, here is what the tool can extract from the history views for the same execution:

So although we can see that a lot of information has been lost, in this case the most important aspects can still be told from the output based on historic data.

Notes

Note: Depending on the version and the particular circumstances the ASH data might be subject to bugs / instrumentation issues - so some of the samples might miss important information like the correct SQL_ID, or SQL_EXEC_START / SQL_EXEC_ID / SQL_PLAN_LINE_ID is blank. I've noticed that the Real-Time SQL Monitoring attempts to address some of these issues in a way that this script does not yet, so this might explain some differences in the output of the two. Obviously the script can only partially address such issues and the significance of the analysis depends on the accuracy of the ASH sample data.

Furthermore the treatment of recursive SQL is different between Real-Time SQL Monitoring and this script, so depending on what you exactly analyse the output might differ significantly - this script at present only shows output for ASH data that is assigned to the SQL_ID in question. Note that the way the the Real-Time SQL Monitoring handles this can have some interesting side effects, too.
I already have some ideas how this could be changed in a future version, but decided to publish the script in its current version as it should be helpful in many cases as it is now.

The script requires write access to the current working directory. If you get errors shown during the execution of the script like:

SP2-0110: Cannot create save file ".xplan_ash_temp"

plan_operations as *ERROR at line 14:ORA-00933: SQL command not properly ended

then you cannot write to the current work directory

There are multiple comments in the script like "If you need to debug". By following the instructions shown there you can change the verbosity of the script in order to understand what went wrong. I would need such detailed output in order to help if the script doesn't work as intended

A note to RAC users: If the current instance was *not* involved in executing the SQL, and the execution plan should be displayed from the Shared Pool (CURR option), in best case the execution plan cannot be found.

In worst case an incorrect plan will be associated from the local instance Shared Pool - it is possible to have the same SQL_ID / CHILD_NUMBER with different plans in different RAC instances.

Therefore you need to be careful with cross-instance / remote-instance executions in RAC.

Why? The tool at present relies on DBMS_XPLAN.DISPLAY_CURSOR for showing the execution plan from the Shared Pool - but DISPLAY_CURSOR is limited to the local Shared Pool. There is already a future version in planning that addresses this shortcoming

At present the script assigns operations that don't have a SQL_PLAN_LINE_ID to the operation ID = 0. So you need to be careful if you see an unusual high activity of the initial plan operation. It is a good idea in such a case to cross-check the raw ASH data and see if there are many samples that lack a SQL_PLAN_LINE_ID information

Usually the DISTRIB_REL option is the most reasonable choice for analysing the distribution of work between Parallel Slaves using the "Parallel Distribution Graph" on execution line level, but in case of very different workload of execution plan lines (for example one line accounts for thousands of samples, but others only for far less) the skew in other lines might not be that obvious. In that case the DISTRIB option can make that skew more obvious, but the downside of that option is that insignificant plan lines that contribute only a few samples might show up as very skewed which can be misleading. You need to cross-check then the number of actual number samples for such plan lines in order to understand the significance of the output.

Some options of the the script don't work like similar options to Real-Time SQL Monitoring. This is a deliberate design decision, since the tool doesn't require necessarily the Tuning Pack license to function and therefore sometimes uses a different approach. One example is the last SQL executed for a given Session ID - Real-Time SQL Monitoring offers a similar option, but reports then the last monitored SQL of that session, which isn't necessarily the last SQL executed by that session.

SQL_ID (or SID=<nnn>[@<inst_id>])

Specify the SQL_ID to analyse. Alternatively specify a Session Identifier, optionally with a trailing Instance ID (for RAC). The tool will attempt to get the current / previous SQL_ID / CHILD_NUMBER of that session and use that as SQL_ID / CHILD_NUMBER input

CHILD_NUMBER (or PLAN_HASH_VALUE)

Identify the CHILD cursor in the Shared Pool, or the Execution Plan for the given SQL_ID from AWR by PLAN_HASH_VALUE (if you use the other option - see below - to obtain execution plan information from AWR instead of Shared Pool)

If both SQL_ID and CHILD_NUMBER/PLAN_HASH_VALUE are omitted the previous SQL_ID / CHILD_NUMBER of the current session will be used.

Note that if the CHILD_NUMBER is not provided, but a SQL_ID only, based on the remaining options specified or not (SQL_EXEC_START / SQL_EXEC_ID) the script will attempt to lookup this information from ASH / Real-Time SQL Monitoring.

DBMS_XPLAN.DISPLAY* format option

By default the tool uses the most verbose formatting of DBMS_XPLAN. You can either change the default in the script or specify here a different format option. Note that you need to explicitly request ALLSTATS / MEMSTATS / IOSTATS or similar if you want to make use of the extended Rowsource Statistics feature of the tool

SQL_EXEC_START / SQL_EXEC_ID

This is a date / number pair that together uniquely identifies an SQL statement execution instance. The default date format is "YYYY-MM-DD HH24:MI:SS" but can be changed in the configuration section of the script.

If that information is omitted and ASH information should be processed, the tool searches for the last execution of the given SQL_ID information in either ASH or Real-Time SQL Monitoring data.

MONITOR/ASH

Where to search for the last execution, in GV$SQL_MONITOR or GV$ACTIVE_SESSION_HISTORY. By default ASH is searched, but in the presence of Real-Time SQL Monitoring (Tuning Pack license) this can be used to attempt that both tools pick up the same last execution.

ASH options

A string consisting of ASH, DISTRIB[_REL|TOT] and/or TIMELINE that defines which ASH information will be shown in the output. The DISTRIB options relate to Parallel Execution information, the TIMELINE to the "Start / Duration Time Active Graph" info on the execution plan line level, and ASH to everything else.

By default the tool uses the DISTRIB_REL option. The other options DISTRIB_TOT and DISTRIB change the proportion of the "Parallel Distribution Graph". You can play with these different options - depending on the individual execution they might make things more or less obvious.

Note: Usually the DISTRIB_REL option is the most reasonable choice, but in case of very different workload of execution plan lines (for example one line accounts for thousands of samples, but others only for far less) the skew in other lines might not be that obvious. In that case the DISTRIB option can make that skew more obvious, but the downside of that option is that insignificant plan lines that contribute only a few samples might show up as very skewed which can be misleading. You need to cross-check then the number of actual number samples for such plan lines in order to understand the significance of the output.

If NONE (or any other invalid string) is specified as ASH option, no ASH based information will be shown. This might make sense if you only want to use the Rowsource Statistics feature of the script.

CURR/HIST/MIXED

HIST means DBA_HIST_ACTIVE_SESS_HISTORY and DBMS_XPLAN.DISPLAY_AWR for plan formatting - you need to specify the PLAN_HASH_VALUE instead of the CHILD_NUMBER in that case.

MIXED means ASH data from GV$ACTIVE_SESSION_HISTORY but obtain the execution plan via DBMS_XPLAN.DISPLAY_AWR, again you need to specify the PLAN_HASH_VALUE instead of the CHILD_NUMBER in that case

Columns to show / hide

This applies to the output on Execution Plan line level. You can either provide a comma separated list of columns to show, or specify which columns *not* to show by using a leading "-" (minus sign). You can find a list of valid column names in the "configuration" section of the script.

Here are some example calls to the script. Note that by default the interactive prompt for options is enabled in the script, so whatever you pass on the command line can be modified / amended at the interactive prompt:

@XPLAN_ASH

This call will use all configured defaults. The SQL_ID / CHILD_NUMBER will be taken from the current session and will attempt to provide detailed output of the previously executed statement (PREV_SQL_ID from V$SESSION)

@XPLAN_ASH SID=128@3

This call will again use all configured defaults and get the SQL_ID / CHILD_NUMBER information from GV$SESSION for SID=128 on INSTANCE_ID = 3. If the current SQL_ID for that session is blank it will use PREV_SQL_ID instead.

! Note that this is different from Real-Time SQL Monitoring, which will report the latest SQL_ID for this session that has been subject to monitoring, which isn't necessarily the latest SQL executed by that session. !

@XPLAN_ASH xyz

This call will again use all configured defaults and report activity for SQL_ID xyz.

Note that since the CHILD_NUMBER is not provided, based on the remaining options specified or not (SQL_EXEC_START / SQL_EXEC_ID) the script will attempt to lookup this information from ASH / Real-Time SQL Monitoring.

This call specifies most of the options. It will analyze SQL_ID "xyz", child number 2. The call to DBMS_XPLAN.DISPLAY_CURSOR will use the "ADVANCED ALLSTATS ALL" option which includes the Rowsource Statistics for all previous executions (you need this for Parallel Execution with Rowsource Statistics, but need to be careful with multiple executions of the same statement) - assuming that child cursor 2 of SQL_ID xyz holds that additional information (GATHER_PLAN_STATISTICS hint or STATISTICS_LEVEL = ALL). The SQL_EXEC_START and SQL_EXEC_ID information is not provided, but the MONITOR option tells the script to search for the last monitored execution of that SQL_ID / CHILD_NUMBER information.

For the ASH based analysis it uses the "ASH,DISTRIB,TIMELINE" option which is different from the default provided (ASH,DISTRIB_REL,TIMELINE). See above for an explanation of these options.

The ASH source is left as default (should be CURR, but can be configured in the script), and two columns will not be shown on execution plan line level

This call specifies for SQL_ID xyz a PLAN_HASH_VALUE of 11223445566, because it refers to the historic ASH from DBA_HIST_ACTIVE_SESS_HISTORY and tries to get an execution plan via DBMS_XPLAN.DISPLAY_AWR from DBA_HIST_SQL_PLAN. The SQL_EXEC_START and SQL_EXEC_ID are also specified, therefore the ASH/MONITOR option isn't relevant here (the script won't search for an execution instance if a SQL_EXEC_START / SQL_EXEC_ID is explicitly specified). Finally the HIST option specifies where to get the information from.

@XPLAN_ASH xyz 1 "ADVANCED ALLSTATS LAST" "" "" "" NONE

For the given SQL_ID xyz, child cursor 1, provide only the additional information on execution plan line level based on the Rowsource Executionstatistics. No ASH data will be analysed.