Monday, December 2, 2013

A new version of the XPLAN_ASH tool (detailed analysis of a single SQL statement execution) is available for download. The previous post includes links to video tutorials explaining what the tool is about.

The new version comes with numerous improvements and new features. The most important ones are:

Real-Time SQL Monitoring info included

Complete coverage including recursive SQL

Improved performance

12c compatible

Simplified usage

Although I unfortunately haven't yet managed to publish the remaining parts of the video tutorial based on the previous version, now that the new version is out I hope to come up with the most interesting parts soon.

You can watch this short video that demonstrates the most important improvements:

Friday, October 11, 2013

When the optimizer has to estimate the data volume (the BYTES column in the plan output), it usually bases this information on the column statistics, if applicable and available (think of complex expressions).
However, whenever there is a VIEW operator in an execution plan, that represents an unmerged view, the optimizer obviously "loses" this information and starts applying defaults that are based on the column definition.
Depending on the actual content of the columns this can lead to dramatic differences in data volume estimates.
Both under- and overestimates are possible, because for character-based columns these defaults seem to assume a 50% fill grade, so a VARCHAR2(100 BYTE) column counts as 50 bytes of data volume.
For multi-byte character sets the same rule applies to the maximum byte width of a column using "char" semantics, so a VARCHAR2(1000 CHAR) column counts as 2000 bytes of data volume when using the AL32UTF8 character set - 50% of the 4000 bytes such a column could require at maximum. With a multi-byte character set this effect can therefore be exaggerated.
The cost calculation of data access operations like full table scans isn't influenced by these different data volume estimates, but the decision for hash joins which of the two row sources will be used as hash and probe is basically driven by the estimated data volume.
Of course the cost estimates of other operations like sorts or aggregations are also based on the data volumes.
But for hash joins in particular, the possible difference in data volume estimates can lead to bad decisions - using the effectively larger row source for building the hash table, and therefore leading to slower, less efficient join processing with increased memory, TEMP and CPU usage.
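The default described above can be sketched as a small calculation (illustrative Python; the 50% fill grade and the 4000 byte cap are assumptions derived from the observations in this post):

```python
# Sketch of the default data volume estimate the optimizer appears to apply
# for character columns of an unmerged view: 50% of the maximum byte width
# of the column definition (an assumption based on the observations above).

def default_bytes_estimate(declared_length, char_semantics=False, max_bytes_per_char=1):
    """Estimated bytes per row for a character column without usable column statistics."""
    max_width = declared_length * (max_bytes_per_char if char_semantics else 1)
    max_width = min(max_width, 4000)  # VARCHAR2 hard limit (pre-12c extended types)
    return max_width // 2             # assumed 50% fill grade

# VARCHAR2(100 BYTE) counts as 50 bytes
print(default_bytes_estimate(100))                                              # 50
# VARCHAR2(1000 CHAR) under AL32UTF8 (up to 4 bytes/char, capped at 4000 bytes)
print(default_bytes_estimate(1000, char_semantics=True, max_bytes_per_char=4))  # 2000
```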
Here is a simple sample demonstrating the point.
First, create two tables, both using VARCHAR2(4000 BYTE) fields, but one has these fields only populated using a single character, whereas the other one fills them completely:

So we can see that the optimizer understands that the table with more rows actually results in a much smaller row source in terms of data volume, as the character columns hold only a single character.
What happens if we now deliberately turn the tables into views?

You can now spot what I've described above: the table T2 row source is actually 50% underestimated by the VIEW operator (152M vs. 76M bytes), because the character columns are actually filled to their maximum size, whereas table T1 is now heavily overestimated in size (1269K vs. 764M (!) bytes). These differences mean that the hash join now uses the actually much larger row source T2 to build the hash table. You can see the effect already in the optimizer's estimates - it now assumes 76M of TEMP space usage for the hash join instead of the 2448K when simply joining the tables.
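The build/probe decision the post refers to can be sketched as follows (illustrative Python, not Oracle's implementation): the side with the smaller estimated volume becomes the hash table, so a bad estimate buffers the wrong side:

```python
# Hedged sketch: a hash join builds its in-memory hash table from the row
# source the optimizer estimates to be smaller, then probes it with the other.
# A wrong volume estimate therefore flips which side gets buffered in memory.

def hash_join(left, right, left_est_bytes, right_est_bytes, key=lambda r: r[0]):
    # choose the (estimated!) smaller side as the build input
    build, probe = (left, right) if left_est_bytes <= right_est_bytes else (right, left)
    table = {}
    for row in build:
        table.setdefault(key(row), []).append(row)
    return [(b, p) for p in probe for b in table.get(key(p), [])]

t1 = [(1, 'x')]              # actually tiny
t2 = [(1, 'y'), (1, 'z')]    # actually large
# with estimates like those above, the large source t2 becomes the build side
rows = hash_join(t1, t2, left_est_bytes=764_000_000, right_est_bytes=76_000_000)
```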
As a side note, this is one of the areas where Dynamic Sampling has a severe shortcoming when comparing the estimates to those based on actual statistics.
This is what I get when deleting the stats from both tables and running the simple join again:

Since Dynamic Sampling doesn't evaluate the average row size, it uses a similar (though not identical) assumption to the VIEW operator, and due to these estimates the hash join again uses the "wrong" row source as source for the hash table.
And finally: It gets even worse when using the VIEW variant with Dynamic Sampling:

The VIEW operator now doesn't change the data volume estimate based on the Dynamic Sampling information, but the hash join suddenly changes the estimated data volume to 1570M (!) bytes, because the join cardinality estimate is now 102K rows instead of the more realistic 10,000 - so the cardinality estimate is now distorted by the VIEW operator.

Summary

If you happen to have a large discrepancy between the column definitions and the actual column usage, which is particularly relevant for character based columns, the data volume estimates can vary significantly between merged and non-merged views. The usage of multi-byte character sets can exaggerate this effect in case of char semantics.
Most significantly this can lead to bad decisions regarding hash joins, using the larger rowsource as hash table.
Whether this effect of the VIEW operator is a feature or a bug I can't tell - there might be good reasons why the column statistics information gets lost - but it certainly can lead to performance problems, in particular with hash joins.
The effect can be reproduced across all currently supported versions including 12.1.

Friday, June 14, 2013

The TIMESTAMP WITH TIME ZONE data type, introduced a long time ago, is known for some oddities - for example, Tony Hasler has a nice summary of some of them here.
Here is another oddity that shows up when trying to aggregate on such a data type. Have a look at the following simple example:

Notice the HASH GROUP BY operation selected by default by the optimizer (which can be influenced using the [NO_]USE_HASH_AGGREGATION hint to switch between a SORT and HASH GROUP BY).
But for the TIMESTAMP WITH TIME ZONE column, the following execution plan will be shown:

Notice the SORT GROUP BY instead - and this cannot be influenced using the above-mentioned hint. So when using TIMESTAMP WITH TIME ZONE, hash aggregation obviously isn't supported, whereas it is for all other DATE / TIMESTAMP data types.
Depending on the scenario this might already influence the performance as the HASH based aggregation in many cases is more efficient than the sort based aggregation (bugs aside).
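Why hash aggregation is unsupported here isn't documented; one plausible explanation (my assumption, not confirmed anywhere official) is that two TIMESTAMP WITH TIME ZONE values denoting the same instant in different time zones compare equal although their stored representations differ, so a hash computed naively over the stored value would scatter equal group keys across buckets, while a comparison-based sort handles them correctly. A small Python sketch of the underlying problem:

```python
from datetime import datetime, timedelta, timezone

# Two "TIMESTAMP WITH TIME ZONE"-like values for the same instant, expressed
# in different time zones: equality comparison treats them as the same group
# key, but their stored fields (local time + offset) differ.
utc = timezone.utc
cet = timezone(timedelta(hours=1))

a = datetime(2013, 6, 14, 12, 0, tzinfo=utc)   # 12:00 UTC
b = datetime(2013, 6, 14, 13, 0, tzinfo=cet)   # 13:00 UTC+1 == 12:00 UTC

assert a == b                                  # same instant -> one group

raw_a = (a.hour, a.utcoffset())                # a naive hash key built from the
raw_b = (b.hour, b.utcoffset())                # stored representation...
assert raw_a != raw_b                          # ...would split the group in two

# a safe hash key must normalize to a canonical zone first, e.g. UTC
assert a.astimezone(utc) == b.astimezone(utc)
```

(Python's own hash() of aware datetimes performs exactly this kind of normalization, which is what a hash aggregation would need.)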
Things however get really bad when using Parallel Execution:

Notice how the (hash) aggregation is performed as a parallel operation (and from 11g on you might even see a so-called GROUP BY PUSHDOWN, represented by a second GROUP BY operation before re-distributing the data, depending on the cardinalities or the usage of the [NO_]GBY_PUSHDOWN hint).
Now look closely at the execution plan of the last statement using the TIMESTAMP WITH TIME ZONE data type:

So the (sort) aggregation is no longer performed in parallel and this means that the single Query Coordinator process has to perform the aggregation on its own, which clearly can be a threat when dealing with larger data sets that need to be aggregated.

Footnote

Since internally the DISTINCT / UNIQUE operator uses the same implementation as the GROUP BY operator, exactly the same limitations apply when trying to do a DISTINCT / UNIQUE on a TIMESTAMP WITH TIME ZONE data type.
I could reproduce the described behaviour on all versions tested, starting from 10.2.0.4 and including the latest 11.2.0.3.

Wednesday, May 8, 2013

In addition to many changes to the way the information is presented and many other smaller changes to functionality there is one major new feature: XPLAN_ASH now also supports S-ASH, the free ASH implementation.

If you run XPLAN_ASH in a S-ASH repository owner schema, it will automatically detect that and adjust accordingly.

XPLAN_ASH was tested against the latest stable version of S-ASH (2.3). There are some minor changes required to that S-ASH release in order to function properly with XPLAN_ASH. Most of them will be included in the next S-ASH release as they really are only minor and don't influence the general S-ASH functionality at all.

If you're interested in using XPLAN_ASH with an existing S-ASH installation get in touch with me so I can provide the necessary scripts that apply the necessary changes.

Rather than writing another lengthy blog post about the changes and new features introduced, I thought I'd start a multi-part video tutorial where I explain the purpose of the tool and how to use it based on the new version. Some parts of the tutorial will focus on specific functionality of the tool and are therefore probably also quite useful as a general tutorial on that Oracle feature and as a SQL execution troubleshooting guide in general.

The tutorial will consist of six parts initially, the first two are already available - the next ones to follow over time.

Feel free to post questions/requests for clarification that are not covered in the tutorials in the comments section - if there are topics of general interest I might publish a seventh part addressing those questions.

In future I might use that video style more often, since it's a nicer way of conveying certain kinds of information.

Wednesday, April 17, 2013

When using Locally Managed Tablespaces (LMT) with variable, system managed extent sizes (AUTOALLOCATE) and data files residing in ASM the Allocation Unit (AU) size can make a significant difference to the algorithm that searches for free extents.
When searching for free extents >= the AU size, the free extent search algorithm seems to consider only free extents starting on AU boundaries, in order to avoid I/O splitting.
Furthermore the algorithm seems to use two extent sizes when searching: a "desired" (for example 8MB) and a "minimum acceptable" (for example 1MB) extent size - however, it is the "desired" size that seems to determine whether the search is limited to free extents on AU boundaries.
This can lead to some surprising side effects, in particular when using 4MB AUs.
It effectively means that although you might have plenty of space (I've seen cases with more than 90% free of a several hundred GB tablespace), processes inserting data might fail with "unable to extend segment" error messages when the "desired" extent size is >= 4MB and the AU size is 4MB, although there might be, for example, hundreds of (fragmented) free extents of < 8MB size available.
This is particularly relevant to Exadata systems because I believe the default ASM AU size is 4MB there.
This behaviour can be influenced by using event 60060, which disables the extent alignment to AU boundaries.
Here is a simple test case demonstrating the issue. It deliberately interleaves extents of two tables and then drops one of them to leave non-contiguous free space behind:

-----------------------------------------------------------------------
-- Exadata / ASM 4MB AUs LMT AUTOALLOCATE extent fragmentation issue --
-----------------------------------------------------------------------
--
-- Tablespaces using Locally Managed Extents and the AUTOALLOCATE option
-- seem to behave differently on Exadata storage / ASM 4MB AUs than on regular storage / ASM 1MB AUs
--
-- On regular storage / ASM 1MB AUs the algorithm has more flexibility when allocating a next extent
-- For example, if according to the rules an 8MB extent would be required but only 1MB extents
-- are available, it gracefully re-uses the smaller extents, at least to some degree,
-- if they are not too small (for example it won't start re-using 64K free extents if an 8MB extent should be allocated)
--
-- On Exadata / ASM 4MB AUs the same test case fails with an ORA-01652 since it obviously can't re-use the existing free extents
-- Since in the below example there is no suitable extent on AU boundary available the table creation fails
--
-- Given a corresponding usage pattern this can lead to huge space wastage or "unable to extend" errors with plenty of (fragmented) free space
-- in Exadata / ASM 4MB AUs tablespaces
set echo on timing on time on
-- Enable this EVENT to disable the extent alignment to AU boundaries
-- This will allow below table to get created also in a tablespace using 4MB AUs in ASM
-- alter session set events '60060 trace name context forever, level 1';
drop tablespace auto_alloc_test including contents and datafiles;
-- Create tablespace either on ASM / Exadata storage or outside on regular file system
-- by uncommenting the datafile name clause
create tablespace auto_alloc_test
datafile
--'auto_alloc.dbf'
size 400M
extent management local
autoallocate
segment space management auto;
-- Create two tables
begin
for i in 1..2 loop
execute immediate 'create table table'||i||'(col1 number,col2 number) /*segment creation immediate*/ tablespace auto_alloc_test';
end loop;
end;
/
-- Interleave the extents until no space left (the ORA-01653 is expected)
-- This will generate lots of 1MB extents
begin
for i in 1..1000 loop
execute immediate 'alter table table'||(mod(i,2)+1)||' allocate extent';
end loop;
end;
/
-- Free half of the tablespace, but free extents are fragmented, and max. free contiguous space is 1MB
-- If you drop TABLE2, no extents on 4MB AU boundaries will be available
drop table table2;
-- If you drop TABLE1, extents on 4MB AU boundaries will be available and below CREATE TABLE will be successful
--drop table table1;
select sum(bytes)/1024/1024 sum_free_mb, max(bytes)/1024/1024 max_free_mb from dba_free_space where tablespace_name='AUTO_ALLOC_TEST';
-- Create a table that fits into the free space (less than 200MB) but usually will request 8MB extents
--
-- When you drop TABLE2 above, this fails on Exadata storage/ ASM 4MB AUs because it attempts to find extents on AU boundaries that are not available
--
-- When you drop TABLE2 above, this succeeds on regular storage / ASM 1MB AUs because it gracefully re-uses the existing 1MB extents if no 8MB extents can be found among the free extents
--
-- This leads to ORA-01652 error messages on Exadata storage / ASM 4MB AUs with a suitable extent usage pattern although there is plenty of (fragmented) free space
create table test
tablespace auto_alloc_test
as
select rpad('x', 100) as col1 from
(select /*+ cardinality(1e5) */ * from dual connect by level <= 1e5),
(select /*+ cardinality(11) */ * from dual connect by level <= 11)
;
-- In case of success check segment size and extent layout
select bytes/1024/1024 as MB from dba_segments where segment_name = 'TEST' and owner = USER;
select extent_id, bytes, blocks from user_extents where segment_name = 'TEST';
-- In case of failure check AU boundaries of free extents
with au_size
as
(
select allocation_unit_size from v$asm_diskgroup where name = (select substr(file_name, 2, instr(file_name, '/') - 2) from dba_data_files where tablespace_name = 'AUTO_ALLOC_TEST')
)
select count(*) from (
-- Note: the factor 16384 assumes a 16KB block size - adjust it to the actual block size of the tablespace
select block_id * 16384 / allocation_unit_size as AU_info, a.* from dba_free_space a, au_size where tablespace_name = 'AUTO_ALLOC_TEST'
)
where au_info = trunc(au_info);

If you run this test case on a tablespace using 4MB AU ASM data files, it will fail to create the last table, although it only requires approx. 127MB and there is approx. 200MB of (non-contiguous) free space (200 times 1MB extents), simply because the "desired" extent size is 8MB and no suitable free extents on "AU boundary" can be found.
The relevance of the "AU boundary" condition can easily be checked by changing the test case to drop the other table instead. That table's extents occupy all the "AU boundaries", so dropping it leaves free extents on "AU boundaries" behind, and hence the table creation will succeed as a sufficient number of (1MB) extents on "AU boundaries" can then be found.
If you repeat the same test case with the event set or using data files with a 1MB AU (or simply standard file system data files) the table will be created successfully (even with no extents on "AU boundaries" available), re-using the available 1MB extents. Although the "desired" extent size is still 8MB, the "minimum acceptable" extent size of 1MB allows the re-usage since the free extents don't need to be aligned on AU boundaries.
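A toy model of the search behaviour described above (my interpretation of the observations, not Oracle's actual algorithm) shows how fragmented 1MB free extents off AU boundaries become unusable once the desired size reaches the AU size:

```python
# Toy model of the described extent search: when the "desired" extent size is
# >= the AU size, only free extents starting on an AU boundary are considered;
# with alignment disabled (event 60060), any large-enough free extent qualifies.

MB = 1024 * 1024

def find_extent(free_extents, desired, minimum, au_size, align_to_au=True):
    """free_extents: list of (offset, size) tuples; returns the chosen offset or None."""
    # assumption: the AU-boundary restriction is driven by the *desired* size
    restrict = align_to_au and desired >= au_size
    for want in (desired, minimum):           # try desired first, then fall back
        for offset, size in free_extents:
            if size >= want and not (restrict and offset % au_size):
                return offset
    return None                               # -> "unable to extend" error

# 200 fragmented 1MB free extents, none starting on a 4MB AU boundary
free = [(off * MB, MB) for off in range(1, 400, 2)]
print(find_extent(free, desired=8 * MB, minimum=MB, au_size=4 * MB))                     # None
print(find_extent(free, desired=8 * MB, minimum=MB, au_size=4 * MB, align_to_au=False))  # 1048576
```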

Performance Vs. Space Usage

So with the event you can choose between performance (larger extents, no I/O splitting for desired larger extents) and space usage (non-contiguous, smaller extents).
Another option in such cases is the usage of tablespaces with UNIFORM extent allocation, however it might be hard to find a good extent size when dealing with many segments of vastly differing sizes.

Footnote

This should only be relevant if the processes inserting into the tablespace run at high concurrency, thereby interleaving the extents (causing non-contiguous free extents when moving / dropping / truncating segments), in conjunction with vastly different extent sizes.
Interestingly, I've observed this behaviour on an Exadata system running 11.2.0.3 BP14 with a tablespace holding mostly partitioned objects. Since partitioned objects use an initial extent size of 8MB in recent versions you wouldn't expect this problem in such a scenario, but it looks like mixing regular load processes with ALTER TABLE ... MOVE COMPRESS PARALLEL can lead to extent sizes much smaller than 8MB, thereby allowing the issue to occur.

Abstract

Oracle Parallel Execution, a feature of the Enterprise Edition, allows you to automatically distribute the processing of a SQL statement execution among multiple worker processes. This requires additional effort when analysing and troubleshooting such parallel executions, since it adds complexity that is simply not there with normal serial execution where only a single process is involved.

In this webinar I'll provide an overview of what these additional challenges are and how these can be approached.
A live Q&A session will follow the presentation.

Wednesday, March 13, 2013

In the previous post I've demonstrated an unexpected Nested Loop Join caused by an extreme data distribution. Although unexpected at first sight, the performance of the execution plan selected by the optimizer is decent - provided the estimates are in the right ballpark.
Here is another case of an unexpected execution plan, this time about Merge Joins.

Merge Joins

In order to appreciate why the execution plan encountered is unexpected, first a quick summary about how Merge Joins work:
A Merge Join is essentially a Nested Loop operation from one sorted row source into another sorted row source. In contrast to a Nested Loop, the join condition is not used for a possible index-driven lookup from the driving, outer row source into the inner row source, simply because Oracle usually first needs to run separate sort operations on each row source.
This means that in most cases the Merge Join requires both row sources to be sorted, and therefore a Hash Join is usually preferred where possible (Hash Joins are only suitable for equi-joins, whereas a Merge Join also supports non-equi joins), because the Hash Join only needs to "prepare" one row source for building the hash table and can then process the second row source as it is, without any further start-up cost / preparation steps.
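To make the mechanics concrete, here is a minimal sketch (illustrative Python, not Oracle's implementation) of a merge join: both inputs are sorted first, then the inner source is walked in lockstep with the outer one:

```python
# Minimal merge join sketch: sort both row sources (the two SORT JOIN
# operations), then "nested loop" from the outer into the inner sorted
# source without ever rewinding further back than the current match group.

def merge_join(left, right, key=lambda r: r[0]):
    left, right = sorted(left, key=key), sorted(right, key=key)  # two SORT JOINs
    out, j = [], 0
    for l in left:
        # advance the inner source to the first potential match
        while j < len(right) and key(right[j]) < key(l):
            j += 1
        # emit all inner rows with an equal key (equi-join case)
        k = j
        while k < len(right) and key(right[k]) == key(l):
            out.append((l, right[k]))
            k += 1
    return out

rows = merge_join([(2, 'a'), (1, 'b')], [(1, 'x'), (2, 'y'), (2, 'z')])
```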
Let's have a look at some common execution plans using Merge Joins. Consider this simple setup:

As usual I had to force the Merge Join via a hint, since in my (default 11.2.0.1) setup a Hash Join would be preferred. Notice the two SORT JOIN operations that first create two (ideally in-memory) sorted/indexed tables from the two row sources to be joined and how the SORT JOIN on the larger row source basically determines the overall cost of this MERGE JOIN.
A corresponding Hash Join could use the smaller row source as hash table and therefore very likely would be much more efficient.
Since the MERGE JOIN usually needs to SORT both row sources it doesn't make such a big difference which of the two row sources comes first, but it is interesting to note that the MERGE JOIN is not able to "swap" the join inputs as the HASH JOIN is able to, which, in particular for outer joins, makes the MERGE JOIN less flexible.
Here is a variation of a MERGE JOIN that avoids a SORT JOIN operation. This is only supported for the "driving" row source:

The MERGE JOIN knows that the driving row source will be accessed in sorted order due to the suitable INDEX RANGE SCAN operation and therefore doesn't add a SORT operation on top.
If we now run the same statement using Parallel Execution (note that the statement level PARALLEL hint used in the example is only supported from 11g on), we'll see the following:

So usually, due to the way things run in parallel, Oracle assumes it cannot guarantee the order of the row source and includes a SORT operation for both row sources joined.
Although there are special cases where this could be avoided even for Parallel Execution, it looks like the code adds this SORT operation unconditionally in case of Parallel Execution. We'll see how this can become a threat in a moment.

The Special Case

Now back to the special case I want to demonstrate here. Let's have a look at the following query:

Notice that I now got a MERGE JOIN although I haven't provided any hints, so this execution plan was automatically favored by the optimizer. Why?
This is a special case, because the optimizer understands that the join key is actually a single value, due to the predicate on T2.FK. So for a serial execution it doesn't bother to SORT the large row source (since it knows there will only be the value "1") and hence the MERGE JOIN comes out with a (slightly) lower cost estimate than a corresponding HASH JOIN.
It's interesting to note that in this particular case the sort of the second row source could be avoided as well, since it, too, can only return a single value. But obviously the MERGE JOIN always runs a SORT JOIN operation on the second row source, as already outlined above.
Due to the way the data is designed and the direction of the outer join a NESTED LOOP join isn't a reasonable alternative either here.
Note that at runtime a HASH JOIN seems to be slightly more efficient in this particular case, so this is already an indication that the cost estimates do not reflect the runtime efficiency very well - in particular the CPU overhead of the actual join operation seems to be underestimated for the MERGE JOIN.
Now let's see what happens if we run this query using Parallel Execution:

Look very carefully at the order of the operations, and what part of the execution plan runs in parallel and what is executed serially.
This is where things become pretty weird and threatening: The TABLE ACCESS to the large row source T2 runs in parallel (with the corresponding lower cost), but the data is then handed over to the Query Coordinator for a SORT JOIN operation - which wasn't there in serial execution and is in fact unnecessary since we still have a single value in the join key.
After sorting the large row source, the MERGE JOIN operation itself is performed by the Query Coordinator, so no Parallel Execution is involved here either.
Both the serial SORT JOIN of the large row source and the MERGE JOIN operation itself are literally free of cost here, which is clearly unreasonable, in particular if the row source is very large.
Although the SORT JOIN will basically turn into a simple "BUFFER SORT" operation, since there is effectively nothing to sort, it still means that a potentially very big volume of data has to be handed over from the Parallel Worker processes scanning the row source to the Query Coordinator - in this particular case by definition an inefficient operation, because a large data volume has to be passed from multiple Parallel Processes to the single Query Coordinator. This potentially very big volume of data then has to be sorted by the Query Coordinator, which very likely means the operation won't fit into the PGA memory of that single process and hence spills to TEMP, causing potentially significant additional (and unnecessary) read and write I/O - all done serially by the Query Coordinator.
This is a textbook example of a Parallel Execution plan that is bound to take longer than the corresponding serial execution plan, and yet it is the execution plan preferred by the optimizer when left unhinted.
Let's have a look at the Parallel Execution plan when using a HASH JOIN:

Looking at the child operations' cost estimates of the HASH JOIN it becomes obvious that it is the costing of the HASH JOIN itself that makes the whole operation more costly than the MERGE JOIN, which is clearly questionable.
So the strange thing about the MERGE JOIN Parallel Execution plan is that the join operation itself is done serially, whereas the HASH JOIN execution plan, although it uses the same access to the row sources (INDEX UNIQUE SCAN and FULL TABLE SCAN), happily runs in parallel.
What causes this strange execution plan shape? Obviously it is the UNIQUE index on the other, smaller row source. Somehow the MERGE JOIN code is misled by the unique index scan, which causes the join operation to run serially.
Replacing the UNIQUE index with a NON-UNIQUE index (and using a UNIQUE constraint on top to achieve the same uniqueness) gives this execution plan:

So now we still have the unnecessary SORT JOIN operation of the large row source, but at least the SORT JOIN and MERGE JOIN operations are now executed in parallel, which should make it far less threatening.
Of course, a corresponding HASH JOIN will still be much more efficient for larger row sources, but needs to be hinted in this special case here.

Summary

For MERGE JOINs there are some special cases where the current costing model doesn't properly reflect the actual work - together with some strange behaviour of the MERGE JOIN code when using Parallel Execution this can lead to questionable execution plans preferred by the optimizer.
Carefully check the resulting execution plans when using Parallel Execution and MERGE JOINs get preferred by the optimizer.

Wednesday, February 27, 2013

Recently I came across some interesting edge cases regarding the costing of joins. They all have in common that they result in (at first sight) unexpected execution plans, but only some of them are actual threats to performance.

Outer Joins

The first one is about outer joins with an extreme data distribution. Consider the following data setup:

The optimizer preferred a Nested Loop join despite the fact that the number of estimated loop iterations is pretty large. Notice in particular the cost column: although the inner row source is estimated to be started 1000K times, the cost of doing so corresponds to just a single execution.
For reference here is a cost estimate for a similar operation that corresponds to the expected costing model:

I now had to force the Nested Loop join via a hint, because by default other join methods were preferred by the optimizer. The cost of a single iteration of the loop has now increased to 3, although the access is exactly the same as before (T2 random table access via index lookup of T2.ID), and the cost of the Nested Loop join corresponds to the known formula: Estimated starts times the cost of a single iteration, which is 3000K in this case here, plus the 4200 of the Full Table Scan for accessing the outer row source, plus CPU costing overhead.
So what makes the difference between the two? It's the data. The column name chosen for the column in T1 already reveals what's so special: The join column used (NULL_FK) is actually NULL for all rows.
The optimizer takes this into account and assumes that none of those rows from the driving row source will actually match a row of the inner row source - in fact the lookup into the inner row source could be short-circuited in some way, since a NULL value by definition isn't supposed to find a match for this join. I haven't investigated to what extent the runtime engine does this; in the Rowsource Statistics the inner row source is started the expected number of times, and although no logical I/O is recorded for it, some CPU time is, so at least some work seems to be done there.
Modifying the test case so that more of the FKs are actually non-null shows that the cost calculation is scaled accordingly. In fact the cost calculation is more or less the same as that of a corresponding inner join, which could filter out those driving rows with NULL values in the join column.
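The costing intuition can be sketched as follows (illustrative Python, with hypothetical cost figures matching the example above): NULL join keys never match in an equi-join, so only the non-null fraction of the driving rows contributes probe cost:

```python
# Sketch of the scaling described above: the Nested Loop probe cost is charged
# only for driving rows with a non-NULL join key, since a NULL key can never
# match in an equi-join. The cost numbers below mirror the example in the post.

def nested_loop_cost(driving_keys, cost_per_probe, outer_scan_cost):
    probes = sum(1 for fk in driving_keys if fk is not None)
    return outer_scan_cost + probes * cost_per_probe

all_null = [None] * 1_000_000            # NULL_FK is NULL for all rows
print(nested_loop_cost(all_null, cost_per_probe=3, outer_scan_cost=4200))   # 4200

half_null = [None, 1] * 500_000          # half the FKs populated
print(nested_loop_cost(half_null, cost_per_probe=3, outer_scan_cost=4200))  # 1504200
```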
The overall performance of the execution plan is quite decent, so although it looks quite unusual it performs pretty well.
In the second part I'll show another interesting, unexpected join execution plan that however can cause real performance problems.

Wednesday, February 20, 2013

Oracle 10g introduced the QB_NAME hint that can come handy in case hints need to be applied to more complex statements, in particular when possibly multiple layers of views / subqueries are involved.
Jonathan Lewis has an older blog post that describes more details.
Just in case you wonder why the QB_NAME hint - along with all other hints that refer to the assigned query block name - sometimes seems to be ignored: one possible reason is an apparently undocumented length limitation of the query block names that can be assigned. 20 characters seem to be the maximum possible (I haven't checked the effect of multi-byte database character sets).
Consider this simple example:

The correlated subquery is now executed using a FILTER operator as requested. Notice the aliases and query block names, in particular for the instance of T1 at operation id = 3.
Now if I repeat the same query, but use a longer query block name, the hints are effectively ignored and the default unnesting takes place again:

Notice how the default aliases and query block names get used although explicitly hinted.
It is probably rather uncommon to use such lengthy query block names, nevertheless it can be puzzling when hitting such an undocumented limitation.

Thursday, January 31, 2013

Here is an interesting limitation to Exadata Smart Scans: if more than 254 columns from a table (not HCC compressed - more on that in a moment) need to be projected, Smart Scans for that particular segment will be disabled and Exadata will fall back to conventional I/O. This means that the number of columns in the projection clause can make a significant difference to performance, since only Smart Scans allow taking advantage of offloading, and in particular of avoiding I/O via Storage Indexes.
Now the expression "254 columns" might ring a bell, since it is the maximum number of columns that Oracle can store in a single row piece - tables consisting of more than 254 columns will have to be stored in multiple row pieces.
However, what I'm talking about here is not related to such columns residing in different row pieces of a row - Smart Scans still happily work even if columns from different row pieces are projected (which was subject to several bugs in the past), although you might end up with additional "cell single block reads" in case of truly chained rows, rather than just additional logical I/O for picking up the different row pieces from the same block (sometimes called "intra-block" chaining).
No, the limitation simply seems to be that Smart Scans - broadly speaking and ignoring edge cases - can only transport a maximum of 254 columns from a single (non-HCC) segment. Requesting more columns will simply disable Smart Scans for that segment.
Now you might say that offloading, and in particular offloaded column projection, isn't that relevant if you select that many columns from a table anyway - but the point is that you lose the ability to benefit from Storage Indexes and from transporting only the relevant rows back to the compute nodes.
Both features can speed up the processing significantly, in particular if the number of rows selected is only a fraction of the total number of rows, and/or the cells could avoid a significant amount of I/O via Storage Indexes.
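The observed rule can be summarized in a trivial sketch (illustrative Python; the 254-column limit is an observation from this post, not a documented constant, and the HCC branch is a deliberate simplification):

```python
# Hedged model of the limitation described: requesting more than 254 columns
# from a non-HCC segment disables the Smart Scan for that segment, losing
# offloading and Storage Index I/O avoidance.

SMART_SCAN_MAX_PROJECTED_COLS = 254   # observed limit, not a documented constant

def smart_scan_possible(projected_cols, hcc_compressed=False):
    if hcc_compressed:
        return True   # simplification: HCC segments behave differently per the post
    return projected_cols <= SMART_SCAN_MAX_PROJECTED_COLS

print(smart_scan_possible(254))   # True
print(smart_scan_possible(255))   # False -> falls back to conventional I/O
```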
To demonstrate the point I've put together a simple test case that generates a test table with more than 254 columns - the script below generates a table of approx. 40GB uncompressed size so that a significant difference in performance could be measured.
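The original generation script isn't reproduced here, but a minimal sketch of how such a wide table could be built might look as follows (the table name, column count and column types are assumptions for illustration only - the table would then need to be populated to the quoted ~40GB size):

```sql
-- Illustrative sketch only, not the original script: build a table with
-- 300 VARCHAR2 columns via dynamic SQL
declare
  l_ddl clob := 'create table t_many_cols (id number not null';
begin
  for i in 1 .. 300 loop
    l_ddl := l_ddl || ', col' || i || ' varchar2(100)';
  end loop;
  l_ddl := l_ddl || ')';
  -- uncomment for the HCC variant discussed below:
  -- l_ddl := l_ddl || ' compress for query high';
  execute immediate l_ddl;
end;
/
```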

Assuming a Storage Index was generated on the ID column a query like the following (using Parallel Query at a DOP of 32 in the test runs here) can benefit from offloading, since in principle all I/O could be avoided via the Storage Index, and virtually no data needs to be transported from the cells to the compute nodes.
Note that it projects exactly 254 columns from different row pieces.
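A sketch of such a query (all names are assumptions; the predicate picks a value outside the stored ID range, so a Storage Index on ID can in principle skip all I/O):

```sql
-- Sketch: project exactly 254 columns at DOP 32; the Storage Index on ID
-- should allow the cells to avoid reading (almost) all of the segment
select /*+ parallel(t 32) */
       col1, col2, col3 /* ... through col254 ... */
  from t_many_cols t
 where id = -1;
```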

As you can see from the snippet, it took less than 2.5 seconds to run the query on the 40GB segment, and the session statistics report 20GB avoided via the Storage Index (which seems to be an instrumentation bug as it always reports only 50% of the total segment size as a maximum, this output was taken from 11.2.0.2 Exadata BP14). Furthermore only a couple of MB were exchanged between the cells and the compute nodes.
The corresponding Real-Time SQL Monitoring report confirms the "Smart Scan":

Increasing the number of columns projected from the segment above 254 (and as outlined above, it doesn't matter which row pieces these columns come from) disables the Smart Scan, and it takes more than 9 seconds to run essentially the same query, pumping all 40GB through the compute nodes to filter all rows.

Funnily, in this deliberately crafted, extreme case here, it is much faster to access the segment twice and get the remaining columns via a self-join in order to benefit from the offloading features - it only takes 4.7 seconds to run the self-join, and the session statistics confirm that both segment scans could leverage offloading and in particular avoid I/O via Storage Indexes:
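A self-join along these lines (assumed names; each branch projects at most 254 columns, so both scans remain offloadable and can use the Storage Index):

```sql
-- Sketch: scan the segment twice, 254 columns per branch, joined on the
-- unique ID column to reassemble the full projection
select /*+ parallel(32) */
       t1.col1 /* ... t1.col2 through t1.col254 ... */
     , t2.col255 /* ... t2.col256 through t2.col300 ... */
  from t_many_cols t1
     , t_many_cols t2
 where t1.id = t2.id
   and t1.id = -1;
```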

Note: This is one of the cases where you don't want the optimizer to eliminate a self-join via a query transformation based on corresponding constraints on the join columns :-)

HCC Compression

Now the funny thing is that if you repeat the table creation script but uncomment the HCC compression, the Smart Scan happily works with up to 1,000 columns of such a compressed segment.
So obviously the general code implementation supports transporting rows with more than 254 columns from the cells to the compute nodes - the question is why it only does so for HCC compressed segments. That is probably a question my client will raise with Oracle Support to find out the answer.

Footnote

At least only the number of "raw" columns projected counts towards the limitation - expressions based on columns don't count, so you can actually project more than 254 expressions from a segment and still benefit from Smart Scans, as long as the expressions refer to a maximum of 254 "base" columns.
The same limitation could also be reproduced using (almost) the latest available Exadata version as of writing (11.2.0.3 BP12 - almost, because BP14 just came out as far as I know).

Sunday, January 13, 2013

When performing aggregate GROUP BY operations an additional filter on the aggregates can be applied using the HAVING clause.
Usually aggregates are one of the last steps executed before the final result set is returned to the client.
However there are various reasons why a GROUP BY operation might sit somewhere in the middle of the execution plan - for example, it might be part of a view that cannot be merged (or was hinted not to be merged using the NO_MERGE hint), or in more recent releases (11g+) the optimizer might have decided to use the GROUP BY PLACEMENT transformation, which deliberately moves the GROUP BY operation to a different execution step of the plan.
In such cases, when the GROUP BY operation will be input to some other operation, it becomes essential for the overall efficiency of the execution plan preferred by the optimizer that the cardinality estimates are in the right ballpark, as it will influence the choice of other related execution steps like join orders and methods or simply the decision between an index-based access or a full table scan.
While the optimizer can come up with a reasonable estimate for the cardinality of the GROUP BY expression based on the statistics (the emphasis is on *can* - it might also be wrong), it is important to understand that an additional filter on the aggregates using the HAVING clause is in principle treated like an "unknown" expression, and therefore the estimates are based on built-in defaults that might not have much to do with the actual selectivity of that HAVING expression.
Here is a simple example to demonstrate the point:

There is a table of 100K rows with two dates in it that each have 100 distinct values (we can ignore here that the generated values are actually correlated) - the ID column is unique.
If I ask for an estimate for queries similar to the following on this table:
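The original scripts aren't shown here; a sketch matching the description (table name, column names and the exact date expressions are assumptions) could look like this:

```sql
-- 100K rows, two date columns with 100 distinct values each, unique ID
create table t
as
select rownum as id
     , date '2013-01-01' + mod(rownum, 100) as start_date
     , date '2013-01-01' + mod(rownum, 100) as end_date
  from dual
connect by level <= 100000;

-- 1) plain aggregate: ID is unique, so 100K groups
select id, min(start_date) as min_d, max(end_date) as max_d
  from t
 group by id;

-- 2) same aggregate plus HAVING predicates that span the whole
--    available date range, i.e. they don't filter anything
select id, min(start_date) as min_d, max(end_date) as max_d
  from t
 group by id
having min(start_date) >= date '2013-01-01'
   and max(end_date) <= date '2013-04-30';
```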

Note how the optimizer gets the cardinality right for the first statement. The aggregation doesn't reduce the cardinality due to the uniqueness of the ID column.
Notice how the filter predicates on the aggregates in the second case actually do not filter anything, as they cover the whole available date range. But the optimizer simply assumes 5% selectivity per unknown range predicate, resulting in 0.05 times 0.05 = 0.0025 selectivity.
I don't think that it is a bug - it simply cannot assume anything regarding the possible MIN and MAX values given any potential arbitrary other filters.
If I now wrap the aggregate query as a view into a more complex statement, it becomes obvious that the HAVING clause can also implicitly be derived by applying corresponding filters to the view:
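A sketch of such a statement (the join table T2 and all names are assumptions; the filters on the view's MIN/MAX columns get pushed into the view as an implicit HAVING clause):

```sql
-- Sketch: the outer range predicates on MIN_D / MAX_D are pushed into
-- the view, implicitly generating the HAVING clause discussed above
select t2.*
  from t2
     , (select id, min(start_date) as min_d, max(end_date) as max_d
          from t
         group by id) v
 where v.id = t2.id
   and v.id > 50000
   and v.min_d >= date '2013-01-01'
   and v.max_d <= date '2013-04-30';
```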

Notice how the incorrect cardinality estimate leads to a NESTED LOOP for the join operation, which is very likely a bad idea in such cases where the number of actual loop iterations is much higher than estimated.
In general such bad cardinality estimates can echo through the whole execution plan with potentially devastating results.
For my particular case here one potential workaround besides using undocumented CARDINALITY or OPT_ESTIMATE hints is to prevent the optimizer from pushing the filter into the view (and thereby avoiding the implicit generation of the HAVING clause) by wrapping the view with another view on top that includes a reference to ROWNUM:
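A sketch of that workaround (assumed names, following the same shape as the statement above):

```sql
-- Sketch: the ROWNUM reference in the wrapping view prevents the
-- optimizer from pushing the outer filters into the aggregate view
select t2.*
  from t2
     , (select v.*, rownum as rn
          from (select id, min(start_date) as min_d, max(end_date) as max_d
                  from t
                 group by id) v
       ) v2
 where v2.id = t2.id
   and v2.id > 50000
   and v2.min_d >= date '2013-01-01'
   and v2.max_d <= date '2013-04-30';
```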

This way the cardinality estimate is much better for my particular case here and a more suitable HASH JOIN gets used instead.
There are however a couple of noticeable drawbacks to that workaround:
- The optimizer is still clueless: If for example the filter on the aggregates actually filtered any data, it would still assume NO filtering at all (still 100K in this case here), which might be just as bad, but in the opposite direction (over-estimate instead of under-estimate)
- Any other potentially useful filters (the "ID > 50000" in my case here) will also be prevented from being pushed into the view, so in my case the GROUP BY has to operate on a much larger data set than actually necessary (100K rows instead of 50K rows) - not good
- The ROWNUM evaluation causes overhead and will be a problem when trying to run this statement using Parallel Execution, as it causes the plan to be split into multiple DFOs (you'll find multiple PX COORDINATOR operations in such plans, which can have nasty side effects - for more info read my OTN mini-series on Parallel Execution), with a serialized operation in between to determine the row numbers

Summary

Be careful whenever the cardinality estimates for HAVING clauses become relevant to your execution plan, as the optimizer simply applies default selectivities that can be way off.
If the aggregation is one of the final steps of execution, the cardinality estimate is probably not that relevant, but there are other cases possible where it matters a lot.

Footnote

From 11.2.0.3 on there is a new undocumented parameter "_optimizer_filter_pushdown" that defaults to "true".
When setting it to "false", the "Filter Pushdown" (FPD) transformation used above (and prevented via ROWNUM) will be disabled - however on a global statement / session level, and not just for a particular view / query block as with the ROWNUM workaround.
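For example, on session level (as with any undocumented parameter, this should only be used for testing or on advice of Oracle Support):

```sql
-- Disable the Filter Pushdown (FPD) transformation for the session
alter session set "_optimizer_filter_pushdown" = false;
```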