March 4, 2011

After you upgrade to 11g from a lower version, you might notice that some of your plans involving nested loops have changed and extra NESTED LOOPS steps appear.
According to the documentation, this is because of the "new" implementation of nested loop joins.

The documentation says: Oracle Database 11g introduces a new implementation for nested loop joins. As a result, execution plans that include nested loops might appear different
than they did in previous releases of Oracle Database. Both the new implementation and the original implementation for nested loop
joins are possible in Oracle Database 11g. So, when analyzing execution plans, it is important to understand that the number of NESTED LOOPS join row sources
might be different.

Oracle Database 11g introduces a new implementation for nested loop joins to reduce overall latency for physical I/O.
When an index or a table block is not in the buffer cache and is needed to process the join, a physical I/O is required.
Oracle Database 11g can batch multiple physical I/O requests and process them using a vector I/O instead of processing them one at a time.

As part of the new implementation for nested loop joins, two NESTED LOOPS join row sources might appear in the execution plan where only one
would have appeared in prior releases. In such cases, Oracle Database allocates one NESTED LOOPS join row source to join the values from the table
on the outer side of the join with the index on the inner side. A second row source is allocated to join the result of the first join,
which includes the rowids stored in the index, with the table on the inner side of the join.
Here are the plan differences between the new implementation and the old one.
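As a rough sketch of the two shapes (table and index names here are made up for illustration, not from the actual test case):

```text
-- Pre-11g shape: a single NESTED LOOPS row source
NESTED LOOPS
  TABLE ACCESS FULL            T1
  TABLE ACCESS BY INDEX ROWID  T2
    INDEX RANGE SCAN           T2_ID_IX

-- 11g batching shape: two NESTED LOOPS row sources; the inner join
-- produces rowids from the index, the outer join visits the table
NESTED LOOPS
  NESTED LOOPS
    TABLE ACCESS FULL          T1
    INDEX RANGE SCAN           T2_ID_IX
  TABLE ACCESS BY INDEX ROWID  T2
```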

In my opinion, this change added extra complexity to reading the plans. NLJ plans prior to 11g are much easier to read, and bear in mind this example is just a simple 2-table join.
Once your query has a crazy number of joins, it gets harder and harder to read. One thing you can spot is that the cost is the "same", which is very useful for plan change analysis.
In my experience, 99% of same-cost plan changes are caused by this implementation, and this made me exclude same-cost, different-plan_hash_value plans from comparisons.

In my last post, I already gave some information about the hidden CBO parameter _nlj_batching_enabled that controls this behaviour.
To repeat it here: this parameter has 2 values, 0 to disable the behaviour and 1 (the default) to enable it, and it has been there since 11gR1.
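For reference, toggling it at session level looks like this (hidden parameters need double quotes; as always, do not touch underscore parameters in production without support's blessing):

```sql
-- Revert to the pre-11g (single NESTED LOOPS) behaviour for this session
ALTER SESSION SET "_nlj_batching_enabled" = 0;

-- Back to the 11g default (batched) behaviour
ALTER SESSION SET "_nlj_batching_enabled" = 1;
```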

When I interpret the sentence "new implementation for nested loop joins to reduce overall latency for physical I/O", I expect some runtime improvement, like the effect of bulk collect or changing your session's array size. But is my assumption true?
I have doubts about this definition, because in reality I do not feel any difference matching the official definition.
Nothing got faster in our production environment in relation to this change.

To test my experience, I built a test case which you can find at the bottom of the page.

My test environment is 11.2.0.1 on Ubuntu with 1.5G memory target on a machine with 4 cores.

For this test I have 3 tables, all of which have an ID column with an index on it.
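A minimal sketch of the kind of schema I mean (names are made up here; the actual test case is at the bottom of the page):

```sql
CREATE TABLE t1 (id NUMBER, pad VARCHAR2(100));
CREATE INDEX t1_id_ix ON t1 (id);

-- t2 and t3 are created the same way, each with an index on its ID column
```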

Since the controlling parameter is _nlj_batching_enabled, I change only this parameter for my tests and keep OFE at 11.2.0.1.

I ran these queries 4×50 times:

1- _nlj_batching_enabled=1 (11g behaviour): flush the buffer cache twice and flush the cursor before the first run, then run the statement 50 times.
2- _nlj_batching_enabled=1 (11g behaviour): flush the cursor at the first run, flush the buffer cache twice before "each run", and run the statement 50 times to force physical reads all the time.
3- _nlj_batching_enabled=0 (pre-11g behaviour): flush the buffer cache twice and flush the cursor before the first run, then run the statement 50 times.
4- _nlj_batching_enabled=0 (pre-11g behaviour): flush the cursor at the first run, flush the buffer cache twice before "each run", and run the statement 50 times to force physical reads all the time.
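The flush steps above map to these commands (run as a privileged user; I am showing ALTER SYSTEM FLUSH SHARED_POOL as a blunt way to get the cursor re-parsed, DBMS_SHARED_POOL.PURGE would be the surgical alternative):

```sql
ALTER SYSTEM FLUSH BUFFER_CACHE;  -- run twice before the timed runs
ALTER SYSTEM FLUSH SHARED_POOL;   -- ages the cursor out so it is hard parsed again
```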

After each run I snapped the statistics from v$sql for the related SQL and then did the comparison afterwards.
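A minimal sketch of the kind of snapshot query I mean (the sql_id substitution variable is hypothetical, substitute the one for your test statement):

```sql
SELECT sql_id, plan_hash_value, executions,
       cpu_time, elapsed_time, buffer_gets, disk_reads
FROM   v$sql
WHERE  sql_id = '&my_sql_id';
```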

Here are the results. (AVG1 includes the first run for steps 1 and 3, which does both physical and buffer cache reads; the first run is not included in the other calculations.)

The new behaviour was faster in my small-dataset test.
I only consider the median: for buffer cache reads it is 10 milliseconds less for CPU and 25 milliseconds less for total ELA.
In percentage terms that is ~5% for CPU and ~6% for total ELA.
When it comes to physical reads, it is again 10 milliseconds less for CPU and 11 milliseconds less for total ELA.
In percentage terms the improvement is ~4% for CPU and ~1.5% for total ELA.

I was initially expecting the bigger dataset to be faster again, because we do more physical reads, but
this time the new behaviour is slower for the big dataset.
For buffer cache reads it is 20 milliseconds slower for CPU and ~3 milliseconds slower for total ELA.
In percentage terms the degradation is ~1.5% for CPU and ~0.06% for total ELA.
For physical reads it is 40 milliseconds slower for CPU and 7.5 milliseconds slower for total ELA.
In percentage terms the degradation is ~2.5% for CPU and ~0.14% for total ELA.

When I interpret these results, my gut feeling is that this is not like what is written in the documentation.
Response-time-wise the behaviour looks nearly the same. From this output one could say either the CBO engineers wanted to make the plans harder to read, or maybe they wanted to be more honest and show what was always there.
It would be better if the optimizer group wrote a blog post about it rather than me 🙂

There is a bug note which also supports the idea that it has been there since 9i.

There are also some other bugs related to dumps and wrong results reported for this behaviour, which is inconsistent with the idea that it has been there for a long time.
The bugs disappear when the behaviour is disabled and the plans revert to the original, so it looks like the change has some effect somewhere, but not necessarily on performance as the documentation might be interpreted.

Now it's time for the bonus material. Careful eyes probably already spotted that the plan with _nlj_batching_enabled=0 is different from the plan that would be generated on 10.2.0.4, and also
that there is a problem with the runtime A-Time values on the 3rd step of the plans when _nlj_batching_enabled=0 in 11.2.0.1. They are basically reported way higher than the actual elapsed time.

First let me show you what the actual pre-11g plan and its reported A-Time values look like.

(To get this plan I flushed the buffer cache twice, flushed the SQL from the shared pool, and set OFE to 10.2.0.4 instead of setting _nlj_batching_enabled=0 with OFE=11.2.0.1.)
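For reference, the session-level OFE setting looks like this:

```sql
ALTER SESSION SET optimizer_features_enable = '10.2.0.4';
```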

As you can see, the timings are now more accurate and the NESTED LOOPS step has changed its location in the plan.
I don't know why A-Time is reported wrong, but thanks to the methods I described in earlier posts, I can say what is causing a plan change difference like the one below.

Since I know what is causing the plan changes for the nested loops, now it's time to go ahead and test whether OFE=10.2.0.4 is better for the tests I ran above.

Here is the comparison between OFE 11.2.0.1 and OFE 10.2.0.4 (instead of setting _nlj_batching_enabled=0 while OFE is still 11.2.0.1).

This time the 10.2.0.4 results were better for both the small and the big dataset.

For the small dataset:
For buffer cache reads, OFE 11.2.0.1 is 10 milliseconds slower for CPU and ~3 milliseconds slower for total ELA.
In percentage terms the degradation is ~7% for CPU and ~0.06% for total ELA.
For physical reads it is 10 milliseconds slower for CPU and ~19 milliseconds slower for total ELA.
In percentage terms the degradation is ~7% for CPU and ~8% for total ELA.

For the big dataset:
For buffer cache reads it is 70 milliseconds slower for CPU and ~65 milliseconds slower for total ELA.
In percentage terms the degradation is ~5% for CPU and ~1.3% for total ELA.
For physical reads it is 80 milliseconds slower for CPU and 120 milliseconds slower for total ELA.
In percentage terms the degradation is ~4.8% for CPU and ~2.2% for total ELA.

As a last word, I can't claim my tests are the best way to measure this; however, the results do not look consistent with what is written in the official documentation.
I think this is a new feature that will affect your plans after you upgrade your DB to 11g, but it will not necessarily make any perceivable difference to your response times.
In my humble opinion, as long as the cost of your queries is the same with the new implementation in place,
you can ignore its existence and never bother to change the parameter unless support says so.

Please feel free to comment if you have anything to say on this one and if you spot anything wrong in my analysis.


While I was recently performing some tests regarding Nested Loop Join Batching, I realized that the Vector / Batched I/O feature seems to become disabled when row source statistics sampling is enabled, so basically anything that touches "_rowsource_statistics_sampfreq" – this can be either STATISTICS_LEVEL = ALL, the GATHER_PLAN_STATISTICS hint, or even enabling SQL trace.

So if all of your tests were using one of these mentioned features, this would explain why you didn't see any significant improvement with Nested Loop Join Batching – it can actually improve physical I/O dramatically in some of my tests.

You can verify this in your tests simply by checking the "Batched IO%" session statistics. If the optimization kicks in, these statistics will be non-zero. In my tests with row source sampling enabled, these statistics stayed at zero and no runtime improvement could be seen (and single block reads were used instead of scattered reads / db file parallel reads).
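A simple sketch for checking those counters for your own session, joining v$mystat to v$statname:

```sql
SELECT n.name, s.value
FROM   v$mystat s, v$statname n
WHERE  s.statistic# = n.statistic#
AND    n.name LIKE 'Batched IO%'
AND    s.value > 0;
```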

I've no idea at present why the optimization gets disabled in these cases, but at least for me it was reproducible in both 11.1 and 11.2 – which makes it hard to "study" the optimization, by the way.