Scripts for Drilling Down Into Unknown Optimizer Changes

2018/11/08

When I delivered my Advanced Oracle SQL Tuning training last week, I jumped into a free-form demo to show that the PARALLEL hints don’t really force anything other than adjusting the cost of parallelizable operations for CBO evaluation as usual. If a serial plan (or branch) still has a lower cost than the respective parallel processing cost, then serial execution would be used there. But when demoing this, I got a surprise.

We still get a serial plan as the optimizer has found that a serial index range scan would still be cheaper than a parallel full table scan or index fast full scan with the specified degree. An excerpt from the CBO trace for the query with PARALLEL hint is here (some lines highlighted with “<<<<<<<<<<<“)

See, the Resp cost - response-time-reduction optimized cost (parallel), not resource-efficiency-optimized cost (serial) is about 95.5 and 11.6 for the TableScan and the index FFS above. But the serial IndexRange cost is still just ~1. So this demo successfully shows that a parallel hint doesn’t force parallelism, it just adjusts costs (and the optimizer typically ends up using parallelism for complex enough plans as a result).

However, I ran the free-form demo slightly differently. While previously I projected just a COUNT(*) that could be satisfied by walking the index structure alone, last week I also selected a table column that wasn’t part of the index:

Uh, what? Now that we have to visit the table too, we suddenly get a parallel plan - just for fetching a single row! If you look in the IN-OUT column you see that the index range scan is actually done in serial mode (IN-OUT column value starts with S) and the following table access is parallelized.

The line #8 - INDEX RANGE SCAN indirectly produces matching rowids to its “grandparent” consumer #3 - TABLE ACCESS BY INDEX ROWID BATCHED through the slave PX communication mechanisms PX* and apparently there’s a BUFFER SORT in the mix too, probably for fetching enough rowids from the index range scan producers, before passing a batch of them to the table access operator. And the #6 - PX SEND HASH (BLOCK ADDRESS) seems to show that the rowids are distributed to the consumer slaves by the data block address of the rowid to ensure better batching efficiency & cache locality for table access slaves.

While this relatively little-known plan shape is interesting to read, during my demo surprise, my main question was “what has changed in the optimizer” as I recall doing such demos without a problem in past. Also, for fetching just one row via an index lookup, doing a couple of logical IOs in serial mode (in the existing session) should definitely be faster than allocating PX processes/sessions and distributing work to parallel slaves.

Switching to Old Optimizer Behavior

Since I didn’t recall such behavior from past and parallelizing such a plan didn’t make sense anyway, I decided to see if I’d get back the old behavior with using smaller (older) values for the optimizer_features_enable setting. You can either set the session level parameter or use a hint as I’ve done below:

I was demoing this on Oracle 18c, so I lowered the OFE parameter first to 12.2.0.1, then 12.1.0.2, then 12.1.0.1 and the problematic PX plan was picked up in all these versions. However with OFE 11.2.0.4 and lower, I got the desired serial plan! So something must have had changed between 11.2.0.4 and 12.1.0.1.

Exploring Optimizer Features

So, which exact optimizer features or decisions have changed across Oracle versions? I have a script cofep.sql (Compare Optimizer Features Enabled Parameters) for getting a high level overview. It actually requires some setup - look into the optimizer_features_matrix.sql setup script and add a schema name if you don’t want the “matrix” table to be created under SYS schema. You don’t have to install this into production, you can use this in your dev/sandbox database as long as it has the same version of Oracle installed as production.

The idea is to change the optimizer_features_enable variable to different DB version numbers and see which (undocumented) optimizer parameters also change as a result. So if you didn’t have a problem say in Oracle 11.2.0.3 but do have a problem in 11.2.0.4, you can list the undocumented parameters (optimizer features) that have been enabled or changed between these versions and see if something looks familiar from there:

Now (after you’ve ran the setup script described above), you can report the optimizer parameter difference between any DB version. For example, here are the optimizer features (that have an associated parameter) that changed between 11.2.0.1 and 11.2.0.2:

Now I have a more detailed view - the 29 parameters that have changed across optimizer versions. This is not nearly the complete amount of changes in the CBO of course, but typically the more prominent (or tunable) optimizer features/settings that “deserve” their own parameters. So as a next step I set the optimizer_features_enable to 12.1.0.1 (the first version where the problem manifested itself) and started setting the more suspicious-looking parameters (given the parallel plan problem we witnessed) from TRUE to FALSE. For example, I disabled features like _optimizer_batch_table_access_by_rowid and _optimizer_cluster_by_rowid in my test environment, but no luck. Note that it’s probably a bad idea to just try out changing a variety undocumented parameters back to their “good” values in production systems, better to find the culprit in a test environment and once the cause is determined, decide what to do with it in production.

Nevertheless, none of parameters I tried gave me the good plan, so I wasn’t able to isolate the optimizer behavior change with the help of (undocumented) parameters.

So I had to go one level deeper.

Exploring Optimizer Bugfixes

Earlier I said that the (undocumented) optimizer parameters that change together with optimizer_features_enable settings don’t represent all changes in the CBO. Not every new behavior and decision has its own parameter. I’m sure there are quite a few completely “dark” changes in the optimizer introduced with every new release of the code. However, since Oracle 10g, Oracle introduced a more fine-grained way for controlling (and reverting) smaller changes in the optimizer behavior. Changes that are typically (but not always) introduced for fixing some specific bug.

So with many optimizer bugfixes and behavior changes, it’s possible to enable/disable or adjust them using the _fix_control session level parameter. You can see all possible (registered) CBO fixes/features from V$SYSTEM_FIX_CONTROL and V$SESSION_FIX_CONTROL. Every fix is identified by a bug number so you could try to look it up in My Oracle Support, but there’s also a helpful description column too.

For example, if I want to see the status of all fixes where the description contains %hash%, I could run this script:

The columns should be pretty self-explanatory, the value column shows whether a fix is enabled (0 = disabled, 1 = enabled). Some fixes use values bigger than 1 too where the decision is not simply about enabling/disabling something, but about thresholds or bitmaps.

In the following example I’m looking a fix up by bug#, not description (scroll right to see the whole output):

So, it is possible to use the same technique of changing the optimizer_features_enable to different database versions and witnessing some fixes automatically to be turned on and off as a result!

Note that while there is an optimizer_features_enablecolumn also in the above V$ view output, in some cases it’s NULL and I want to be sure about behavior by altering the session and checking the true status of these fixes from V$SESSION_FIX_CONTROL.

This is why I recently wrote another script cofef.sql (Compare Optimizer Features Enabled Fixes) that lists you the status of “registered” optimizer bugfixes and behavior changes. And since I knew that the change happened somewhere between 11.2.0.4 and 12.1.0.1, I listed all known optimizer bugfix changes between these versions:

I have removed most of the output, but this script helped me to narrow down my search from over 1300 registered bugfixes (in 18c) to 53. When visually scanning through the list, I easily noticed a candidate with a pretty convincing description - bug 13345888 - parallel table lookup access by rowid. If you scroll right you’ll see that its value has changed from “0” in 11.2.0.4 to “1” in 12.1.0.1.

Now all I had to do is to turn this individual fix off and see if it gives me the better serial plan (having optimizer_features_enable set to 12c and even 18c):

Now, with this specific 12c feature parallel table lookup access by rowid disabled, I get the fast plan again and I have worked around my problem. I could now set that fix control systemwide until I’ve patched the database or fixed the root cause by some other means or use a login trigger to set this fix only for some sessions.

Notes & Summary

Interestingly this problem didn’t show up when I used object level parallelism hints, a’la PARALLEL(t 4) or just set the table/index degree to 4 and removed any hints. I chose to use the statement level parallelism and the CBO happened to hit this problem in this mode. The CBO hit the same problem also when using ALTER SESSION FORCE PARALLEL QUERY PARALLEL 4 to adjust parallel costing.

It looks like not everything listed under the V$SESSION_FIX_CONTROL is about bug-fixes. This parallel table lookup fix above looks like a new feature for me. But perhaps it wasn’t worthy of having its own parameter (18c has over 5000 parameters already).

Coskan and others have written about the _fix_control parameter years ago, so it’s nothing new, but I hope that my scripts and the example story help you to discover some optimizer behavior changes easier.

Out of the 1300+ optimizer fixes in 18c, there are 295 that are apparently not affected by the optimizer_features_enable parameter at all. I have a cofef_missing.sql script for showing such “independent” fixes.

One interesting example of such an independent fix is the “bugfix” 6941515 - use high watermark brokering for insert into single segment. The fix has been there since 10g or so but its value is 0, so despite the fix being shipped, it’s disabled by default!

You can read more about how this HWM brokering issue (and the fix being disabled by default) caused buffer busy waits contention on the SEG$ data dictionary base table in one of the complex Oracle performance troubleshooting case studies I’ve experienced in past.