Wednesday, January 28, 2009

It seems to be a common issue that the analysis of a potentially poorly performing SQL statement is misled by the fact that running the query in a GUI database front-end tool like TOAD, SQLDeveloper or similar shows an instant reply.

Most of these tools by default fetch only the first n rows of the result set conveniently into a grid-like GUI component.

This might lead to the conclusion that the statement performs well and therefore doesn't require further investigation or tuning efforts.

The potential problem with this approach is that, due to several possible causes, a statement might return the first rows very quickly, but depending on how the statement is actually used this doesn't necessarily mean that all rows will be processed in an efficient manner.

Usually the following circumstances can lead to such side-effects:

- Grossly underestimated cardinalities by the optimizer
- An efficient access path available to the optimizer based on that wrong cardinality estimate, usually an index

There are several scenarios that can be the root cause of the underestimated cardinalities, among them are:

- Wrong statistics of involved tables or indexes
- Sharing of child cursors (execution plans) when using bind variables that imply different cardinalities; this can also be forced indirectly by CURSOR_SHARING=FORCE
- Incorrect estimation of join cardinalities: there are simply scenarios where the optimizer is not able to come up with a good estimate of a join, even with (or possibly because of) the help of histograms

Although from 10g on there is a default statistics collection job that should help to prevent the first point on the list from happening, it's actually this default job, in conjunction with a particular pattern of database usage, that can lead to exactly this result.

The following test case run on Oracle 10g XE should simulate this pattern to demonstrate one of the possible scenarios.

It creates a table which represents a "staging" table of a bulk load process. This "staging" table happens to be empty during the night when the default statistics collection job is executed. Afterwards the table gets loaded with 1,500,000 rows, and a query is executed without refreshing the statistics after the load is complete. A sketch of the setup is shown below.
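The complete script isn't reproduced here, but a minimal sketch of such a setup, with assumed object names and a data distribution made up to match the description further below (skew towards the "VIEW" type, an index supporting the ORDER BY, row order deliberately different from the index order), might look like this:

create table t_fetch_first_rows (
  id      number not null,
  type    varchar2(30) not null,
  measure number,
  filler  varchar2(50)
);

create index t_fetch_first_rows_idx on t_fetch_first_rows (type, id);

-- simulate the nightly default statistics collection job
-- running while the "staging" table happens to be empty
exec dbms_stats.gather_table_stats(null, 'T_FETCH_FIRST_ROWS', cascade => true)

-- the bulk load takes place afterwards, without refreshing the
-- statistics; most rows get the type 'VIEW' and the ids are
-- generated in random order so that the index ends up with a
-- bad clustering factor
insert into t_fetch_first_rows
select trunc(dbms_random.value(1, 1500000)) as id,
       case when mod(rownum, 100) = 0 then 'TABLE' else 'VIEW' end as type,
       mod(rownum, 1000) as measure,
       rpad('x', 50) as filler
from   dual
connect by level <= 1500000;

commit;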

You'll see that based on the wrong statistics the optimizer comes to a wrong conclusion.

The interesting side effect of this is that the query performs very well in a front-end tool that fetches only the first n rows, since the plan accidentally corresponds to a very efficient "first_rows(n)" plan, although the optimizer_mode is ALL_ROWS by default in 10g.

I've carefully crafted this example to show that even the deliberate usage of an "ORDER BY", in an attempt to force the database to process the whole result set for sorting before returning the first rows, can be misleading, since it's possible to process such a query without a sort operation if a suitable index exists.

Only by actually forcing the database to process all rows do you get the real picture, provided you know that all rows from your statement really need to be processed. But it's not that easy to force this: the cost based optimizer is smart, and most of the time simply wrapping the statement in a "COUNT(*) FROM (...)" is again misleading, because the optimizer is often able to avoid the random row access to the table, which is where most of the time is usually spent under such circumstances.

The "empty" staging table has been analyzed and afterwards 1,500,000 rows have been loaded which represent approx. 60MB of data, supported by a 30MB index. The table consists of 7,680 blocks (8KB default block size), the index has 3,840 blocks.

Note that the data is skewed: most of the entries have an object type of "VIEW", so filtering for this value will return almost all of the rows from the table.

Elapsed: 00:00:18.95
SQL>
SQL> set autotrace traceonly
SQL>
SQL> -- Force to fetch all rows
SQL> -- Note that COUNT(*) itself is not sufficient
SQL> -- because the optimizer is smart and doesn't visit the table
SQL> -- in this case which means that the test case would be no longer
SQL> -- representative
SQL> select count(*) from (
  2  select * from t_fetch_first_rows
  3  where type = 'VIEW'
  4  order by id
  5  );

Running the above query including a filter predicate and an ORDER BY clause will return almost immediately, suggesting a good response time.

And by simply wrapping the query in a "SELECT COUNT(*) FROM (...)" block, the result seems to confirm that the query performs well: only 3,758 blocks read (by the way, if you check the execution plan, it's an index-only execution that doesn't visit the table at all; the blocks read basically represent all blocks of the index) in five seconds. So obviously there is nothing wrong...

SQL>
SQL> alter system flush buffer_cache;

System altered.

Elapsed: 00:00:00.01
SQL> -- alter session set tracefile_identifier = 'fetch_first_rows1';
SQL>
SQL> -- exec dbms_monitor.session_trace_enable
SQL>
SQL> set autotrace traceonly
SQL>
SQL> -- Force to fetch all rows
SQL> -- Note that COUNT(*) itself is not sufficient
SQL> -- because the optimizer is smart and doesn't visit the table
SQL> -- in this case which means that the test case would be no longer
SQL> -- representative
SQL> select sum(measure), count(*) from (
  2  select * from t_fetch_first_rows
  3  where type = 'VIEW'
  4  order by id
  5  );

But look what happens if I force the database to process all rows in a realistic manner: it takes roughly a minute and, more significantly, approx. 1,500,000 consistent block gets to process a table that consists of only 7,680 blocks! That's an amount of logical I/O representing roughly 12GB (1,500,000 blocks at 8KB each).

And if you check the AUTOTRACE statistics you'll notice that I was actually lucky: only 11,236 blocks were read physically; the vast majority of the blocks came from the buffer cache once all blocks of the table had been read into it. In a real production system, where the segments might be much larger and potentially multiple such statements might compete for the buffer cache simultaneously, a lot more of the buffer gets might turn into real physical I/O, making things even worse.

The impressive amount of logical I/O is caused by the fact that using this execution plan the database scanned the index leaf blocks for the rows according to the filter criteria and accessed the corresponding table in a random fashion.

But the smallest entity that the database knows in terms of I/O is a block, so in order to read a single random row the whole block the row resides in needs to be read. Due to the bad clustering factor of the index, which means that the table rows are ordered differently from the index, a different block of the table had to be read for each row accessed randomly. So on average each block of the table was accessed almost 200 times during the process (1,500,000 rows / 7,680 blocks ≈ 195)!

If the table and the index were ordered in the same way (represented by a low clustering factor of the index, close to the number of blocks of the table, whereas a bad clustering factor is close to the number of rows of the table), then the amount of logical I/O required could be significantly lower, since rows accessed randomly would reside in the same block and the same table block could be re-used to process the next row.
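The clustering factor can be checked against the table's blocks and rows in the data dictionary, e.g. with a query like this (object names as assumed in the sketch above):

select i.index_name, i.clustering_factor, t.blocks, t.num_rows
from   user_indexes i
join   user_tables t on t.table_name = i.table_name
where  t.table_name = 'T_FETCH_FIRST_ROWS';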

Now I'm going to show that having representative index statistics is as important as having the tables properly analyzed:

A simple full table scan took only three seconds, and processed 7,488 blocks, representing simply the net size of the table. What a vast difference compared to the 12GB processed via the index.

Note that when I repeat the statement using the index access without clearing the buffer cache, its actual run time in my test environment drops from one minute to less than 10 seconds. But that's not really good news, since it's usually quite unrealistic to have all blocks in the buffer cache.

To underline that this inefficient statement represents a real scaling inhibitor, I'll use Tom Kyte's RUNSTATS package to compare the resource consumption of the two statements:
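The actual RUNSTATS run isn't reproduced here; a sketch of such a comparison, assuming the package is installed as RUNSTATS_PKG (as available from asktom.oracle.com) and using the assumed names from above, could look like this:

exec runstats_pkg.rs_start

-- the index-driven variant, forced to process all rows
select sum(measure), count(*) from (
  select * from t_fetch_first_rows
  where type = 'VIEW'
  order by id
);

exec runstats_pkg.rs_middle

-- the same work done via a full table scan (hint assumed)
select sum(measure), count(*) from (
  select /*+ full(t_fetch_first_rows) */ * from t_fetch_first_rows
  where type = 'VIEW'
  order by id
);

exec runstats_pkg.rs_stop(1000)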

The most obvious threat can be seen from the LATCH statistics: Latches are Oracle internal serialization devices that are used to protect resources from being modified by multiple processes at the same time. Latches therefore serialize access to certain memory structures that e.g. are used to maintain the buffers in the buffer cache. The more latch activity, the more CPU is used and the less scalable is a system, since the concurrent processes will have to wait for each other at the serialization points.

So the inefficient statement reported more than 3,000,000 latches, whereas the simple table scan required only approx. 50,000. So even if you are not too concerned about the different run times in this single-user test case, running such statements simultaneously in a real multi-user system will significantly influence the performance of the whole system.

Saturday, January 24, 2009

Riyaj Shamsudeen recently published a very interesting blog post about the 11g extended statistics feature and correlated column values.

Because I think that some of his findings were significantly influenced by the fact that he generated frequency histograms on all columns, I've repeated his test case with some more variations, in particular with and without histograms and in case of histograms, with and without the new feature "Bug 5483301: Cardinality of 1 when predicate value non-existent in frequency histogram" which was introduced in 10.2.0.4 and 11.1.0.6. More details about this change can be found in Metalink Doc ID 5483301.8.

The following results were obtained using 11.1.0.7 Enterprise Edition on Windows 32bit.

I'll start with his case 1:

create table y1 (a number, b number, c number);

begin
  for i in 1..1000 loop
    for j in 1..10 loop
      insert into y1 values (j, mod(j,5), mod(j,2));
    end loop;
  end loop;
end;
/

REM Let's also add an index to this table
create index y1_i1 on y1 (a, b);

REM The number of distinct keys is used
REM to determine selectivity if
REM a) an all-equal operation on the entire index is used
REM and obviously
REM b) NO histogram is present
select distinct_keys from user_indexes where index_name = 'Y1_I1';

REM this shows that although the index is in place
REM the DISTINCT_KEYS information is ignored
REM but the histogram is used instead
REM and we fall back to the default selectivity formula
REM see at the end of the script for a demonstration
REM without index which comes to the same result
alter session set tracefile_identifier = 'correlated1';

In contrast to Riyaj's interpretation, I would say that this shows that the index information is not used to derive the selectivity ("Matches Full:"). Rather, it looks like the presence of the frequency histogram causes a fallback to the default formula used for non-correlated columns: selectivity(A) * selectivity(B), which leads to a selectivity of 0.02 and hence a cardinality of 200.

Now it can be seen that the index information actually gets used ("Matches Full: #1 Partial: Sel: 0.1000") and we come to the correct cardinality estimate.

Unfortunately this approach has some shortcomings, as can be seen when using out-of-range attributes:

REM unfortunately this approach
REM doesn't cope very well with "out-of-range" predicates
REM although the 10053 trace shows a detection of the "out-of-range"
REM attribute, the final index selectivity is left unchanged
REM => Room for improvement?
alter session set tracefile_identifier = 'correlated3';

Although there is an "out-of-range" detection, the final selectivity is left unchanged, which looks a bit odd, and the default formula that adjusts the selectivity based on the distance from the known low/high values doesn't apply either. Switching off the above-mentioned "bug fix" doesn't change this behaviour, which seems consistent since we don't have a histogram in place.
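For reference, the usual out-of-range adjustment (as described in "Cost-Based Oracle Fundamentals") scales the base selectivity down linearly with the distance of the predicate value from the known range, along the lines of:

selectivity_adjusted = selectivity_base * (1 - (value - high_value) / (high_value - low_value))

for a value above the high value (and correspondingly below the low value). This is the formula one might expect to kick in here, but it doesn't.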

REM this corresponds to having histogram
REM but no index in place, since the histograms
REM take obviously precedence
REM not sure how we arrive at the cardinality of 50
REM it looks like the NewDensity is 0.5 / NUM_DISTINCT
REM and gets divided by 2 according to the formula
REM for out-of-range predicates (see case 14 below)
alter session set "_fix_control"='5483301:on';

I'm not sure how the optimizer arrives at the selectivity of 0.005 resp. the cardinality of 50 in this particular case.

Switching off the new behaviour regarding non-existing values in frequency histograms outlined above shows that we get a cardinality of 1 in this case:

REM switch to the old behaviour
REM interestingly this influences the "NewDensity"
REM calculation and sets it effectively to the "OldDensity"
REM Now we get a cardinality of 1
REM when querying for a non-existent value
alter session set "_fix_control"='5483301:off';

The "bug fix" switch for the non-existent values in frequency histograms doesn't change the result - which is consistent since we're looking for a existent value.

But since we have a frequency histogram in place, once we start to query for non-existent values, the bug fix becomes relevant:

REM new rule introduced with 10.2.0.4
REM and 11.1.0.6, see bug 5483301 resp.
REM Metalink Note 5483301.8:
REM Instead of reporting a cardinality of 1
REM for a non-existent value in a frequency
REM histogram, the cardinality will be
REM 0.5 times the cardinality of the least
REM popular value in the histogram
REM => hence 1,000 * 0.5 in our case => 500

The new rule applies and returns 0.5 times the cardinality of the least popular value in the histogram. Since we have a uniform distribution, every value has a cardinality of 1,000, hence the adjusted cardinality according to the new formula is 500 (or let's say, close to 500).
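The frequencies this rule operates on can be checked in the histogram dictionary information, for example:

REM convert the cumulative endpoint numbers of the frequency
REM histogram into per-value frequencies
select endpoint_value,
       endpoint_number - lag(endpoint_number, 1, 0)
         over (order by endpoint_number) as frequency
from   user_tab_histograms
where  table_name = 'Y1'
and    column_name = 'A'
order  by endpoint_number;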

Switching off the bug fix returns us to the old behaviour, which seems to be reasonable: If the value can not be found in the frequency histogram, a cardinality of 1 should be estimated:

REM switch to the old behaviour
REM interestingly this influences the "NewDensity"
REM calculation and sets it effectively to the "OldDensity"
REM Now we get a cardinality of 1
REM when querying for a non-existent value
alter session set "_fix_control"='5483301:off';

So the cardinality estimate is correct, but see what happens again when using "out-of-range" predicates:

REM But same behaviour as case 3 (index without histogram):
REM Unfortunately this approach
REM doesn't cope very well with "out-of-range" predicates
REM although the 10053 trace shows a detection of the "out-of-range"
REM attribute, the final index selectivity is left unchanged
REM => Room for improvement?
alter session set tracefile_identifier = 'correlated12';

REM without extended stats
REM and without index
REM and without histogram
REM the standard formula for non-correlated data
REM applies, which gives us a cardinality of 200
alter session set tracefile_identifier = 'correlated13';

alter session set events '10053 trace name context forever, level 1';

explain plan for select c from y1 where a=1 and b=1;

alter session set events '10053 trace name context off';

The corresponding 10053 trace file excerpt:

SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for Y1[Y1]

REM finally generating histograms
REM but still without extended statistics
REM and without index
REM compare this to the case "index" + "histogram"
REM which was case 1
alter session set tracefile_identifier = 'correlated16';

Note that this case corresponds to the case "index with histogram" and shows that having the histogram in place effectively makes the index useless for selectivity/cardinality estimates. Same applies when querying for non-existent values:

I'm now going to simulate what happens if one decides to merge indexes, e.g. when two indexes on (a,b) and (a,b,c) exist and the one on (a,b) is rendered redundant, or if it is decided to add column c to the existing index because some queries might benefit from this additional column:
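In terms of the test case this corresponds to something like the following (index name re-used for simplicity):

drop index y1_i1;

create index y1_i1 on y1 (a, b, c);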

It's interesting to note that the optimizer uses a higher "CorStregth" value of 10 and decides, based on the partial match ("Partial: #1 (1 2 )"), to use the information from the index, although we don't use the entire index.

For the remaining test cases with out-of-range predicates we get consistent behaviour:

The "CorStregth" value reverts to 5 and the index information is ignored. The selectivity/cardinality corresponds to the default formula used for uncorrelated column values, so in this case we lose the additional information about the correlation and fall back to default statistics.

Finally the "ZODIAC" example which actually represents correlated and skewed data is influenced by the bug fix for the non-existent values in frequency histograms as well.

We can see that we get an estimated cardinality of 3.5 (which is again consistent with the rule introduced by the bug fix: 0.5 times the cardinality of the least popular value in the frequency histogram; the query returns 7 rows => 7 * 0.5 => 3.5) due to an "unpopular" value. But if you think about it, there is no such thing as an "unpopular" value in a frequency histogram; basically the value should be treated as "non-existent", which is prevented by the bug fix.

So in summary this is again a very interesting test case that shows that generating histograms can have a significant impact on the estimates of the optimizer - even in the particular case where the column or expression was distributed uniformly.

Friday, January 9, 2009

As outlined e.g. in Jonathan Lewis' "Cost-Based Oracle Fundamentals" on page 117 et seqq., histograms are based only on a subset of the column data: for character columns only the first 32 bytes (not characters, in the case of multi-byte character sets) are examined, and when converting the resulting substring to a number further truncation takes place. Only the first 15 bytes are used, the remaining ones are padded with zeros, and the resulting hex number is converted to decimal and rounded to 15 significant figures.

Since this means that for single-byte character strings only the first six to seven characters are relevant for the resulting number representation (in case of multi-byte character sets potentially even fewer characters), Oracle populates the ENDPOINT_ACTUAL_VALUE using the first 32 bytes in order to be able to distinguish strings that are the same for the first six to seven bytes.
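The following query is a sketch of that mapping, not the exact internal routine, just retracing the steps described above (first 15 bytes, hex representation zero-padded to 15 bytes, converted to decimal, rounded to 15 significant figures):

with conv as (
  select to_number(
           rpad(rawtohex(utl_raw.cast_to_raw(substr('AARDVARK', 1, 15))),
                30, '0'),
           rpad('X', 30, 'X')) as n
  from dual
)
-- round to 15 significant figures
select round(n, 15 - ceil(log(10, n))) as approx_endpoint_value
from conv;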

Recently there was an interesting thread on the OTN forum where it was pointed out that similar issues can arise if you attempt to generate a histogram on numeric data with a very high precision. It turns out that Oracle actually performs a similar rounding with number data, although from a technical point of view it's not obvious why rounding would be required, since no conversion is necessary for a column of the NUMBER data type; the ENDPOINT_VALUE in the data dictionary seems to be a NUMBER type, too.

The behaviour, however, is a bit different between the handling of strings and numbers: whereas strings that do not differ within the first 32 bytes simply generate a one-bucket histogram (they collapse into a single value), numbers that are the same for the first 15 significant digits still generate the appropriate number of histogram buckets, but the ENDPOINT_VALUE is the same for all those entries. Note that it doesn't matter whether the digits are left or right of the decimal point; any number having a precision greater than 15 will get rounded.

The following test case (run on Windows (32bit) 10.2.0.4; the results on 9.2.0.8 and 11.1.0.7 were the same) demonstrates the issues described. It populates a table with suitable data to demonstrate the rounding issues.
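The script itself isn't reproduced here; a minimal sketch with assumed names could look like this:

create table t_histogram_rounding as
select rownum as id,
       -- 19-digit numbers that differ only beyond the 15th significant digit
       to_number(rpad('1', 15, '0') || lpad(rownum, 4, '0')) as num_high_precision,
       -- strings that are identical for the first 32 bytes
       rpad('x', 32, 'x') || rownum as string_same_prefix
from   dual
connect by level <= 1000;

begin
  dbms_stats.gather_table_stats(
    null,
    'T_HISTOGRAM_ROUNDING',
    method_opt => 'for all columns size 254');
end;
/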

As you can see from the histogram dictionary information, the numeric data gets rounded but the correct number of histogram buckets get generated. The string behaves differently, and you can see the different handling of the ENDPOINT_VALUE and the ENDPOINT_ACTUAL_VALUE.

When the rounding issues kick in for the numbers you start to get (slightly) incorrect estimates based on the histogram information, and once the precision gets greater than 15, the estimates will be significantly incorrect.

As a side note, as you can see from the test case, it's a bad idea to generate a histogram on a string column whose values are the same for at least the first 32 bytes, since you'll get a single bucket and the estimates will be way off.

A potential workaround for the numeric issue is to store the number as a string (as long as the precision is less than 32, due to the limitation of the string handling described above). According to "Cost-Based Oracle Fundamentals" by Jonathan Lewis, from 9.2.0.6 and 10.1.0.4 on Oracle will use the histogram information properly, even with the implicit number conversion in place, but I don't currently have the older versions available for testing, so I can't tell the exact outcome when using the older releases.

Of course this workaround introduces various issues, the most obvious threat being the implicit conversion relying on client NLS settings, as can be seen in the script.
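To illustrate that threat with made-up names: once the number is stored as a string, a numeric predicate triggers an implicit TO_NUMBER of the column, and the outcome then depends on the session's NLS settings:

-- decimal separator matches the stored strings: works as intended
alter session set nls_numeric_characters = '.,';

select count(*) from t_number_as_string
where num_as_string = 1.5;  -- implicit TO_NUMBER(num_as_string)

-- different decimal separator: the implicit conversion may now
-- fail with ORA-01722 or match unexpected rows
alter session set nls_numeric_characters = ',.';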