Wednesday, January 28, 2009

Common tuning issues - avoiding the "fetch first rows quickly" trap

It seems to be a commonly encountered issue that the analysis of a potentially poorly performing SQL statement is misled by the fact that running the query in a GUI database front-end tool like TOAD, SQL Developer or similar shows an instant reply.

Most of these tools by default fetch only the first n rows of the result set conveniently into a grid-like GUI component.

This might lead to the conclusion that the statement seems to perform well and therefore doesn't require any further investigation or tuning effort.

The potential problem with this approach is that, due to several possible causes, a statement might return the first rows very quickly, but depending on how the statement is actually used this doesn't necessarily mean that all rows will be processed in an efficient manner.

Usually the following circumstances can lead to such side-effects:

- Grossly underestimated cardinalities by the optimizer
- An efficient access path available to the optimizer based on that wrong cardinality estimate, usually an index

There are several scenarios that can be the root cause of the underestimated cardinalities, among them are:

- Wrong statistics of the involved tables or indexes
- Sharing of child cursors (execution plans) when using bind variables that correspond to different cardinalities, which can also be forced indirectly by CURSOR_SHARING=FORCE
- Incorrect estimation of join cardinalities: there are simply scenarios where the optimizer is not able to come up with a good estimate of a join, even with (or possibly because of) the help of histograms

Although starting from 10g there is a default statistics collection job which should help to prevent the first point on the list from happening, it's actually this default job, in conjunction with a particular pattern of database usage, that can lead to exactly this result.
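As a side note, in 10g this default collection is implemented as the GATHER_STATS_JOB scheduler job. A minimal sketch of how to check whether it is enabled, assuming access to the DBA views:

select job_name, enabled, state
from   dba_scheduler_jobs
where  job_name = 'GATHER_STATS_JOB';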

The following test case run on Oracle 10g XE should simulate this pattern to demonstrate one of the possible scenarios.

It creates a table which represents a "staging" table of a bulk load process. This "staging" table happens to be empty during the night when the default statistics collection job is executed. Afterwards the table gets loaded with 1,500,000 rows and a query is executed without refreshing the statistics after the load is complete.
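A minimal sketch of such a setup might look like the following. The column and index definitions and the data generation are assumptions for illustration only (the index is assumed to be on (TYPE, ID) so that it can cover both the filter and the ORDER BY used later); the actual test case script may differ:

create table t_fetch_first_rows (
  id      number not null,
  type    varchar2(30) not null,
  measure number,
  filler  varchar2(30)
);

create index t_fetch_first_rows_idx on t_fetch_first_rows (type, id);

-- this is effectively what the nightly default job does while the table is still empty
exec dbms_stats.gather_table_stats(null, 'T_FETCH_FIRST_ROWS', cascade => true)

-- the bulk load happens afterwards, without refreshing the statistics;
-- ordering the rows randomly scatters the ID values across the table blocks,
-- resulting in a bad clustering factor of the index
insert into t_fetch_first_rows
select id, type, measure, filler
from (
  select rownum as id,
         case when mod(rownum, 20) = 0 then 'TABLE' else 'VIEW' end as type,
         rownum as measure,
         rpad('x', 30, 'x') as filler
  from dual
  connect by level <= 1500000
)
order by dbms_random.value;

commit;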

You'll see that based on the wrong statistics the optimizer comes to a wrong conclusion.

The interesting side effect of this is that the query performs very well in a front-end tool that fetches only the first n rows, since the plan accidentally corresponds to a very efficient "first_rows(n)" plan, although the optimizer_mode is ALL_ROWS by default in 10g.

I've carefully crafted this example to show that even the deliberate usage of an "ORDER BY" in an attempt to force the database to process the whole result set for sorting before returning the first rows can be misleading, since it's possible to process such a query without a sort operation in case a suitable index exists.
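Whether the sort is actually avoided can be verified from the execution plan; a minimal sketch, assuming the (TYPE, ID) index from the setup sketch above:

explain plan for
select *
from   t_fetch_first_rows
where  type = 'VIEW'
order by id;

select * from table(dbms_xplan.display);

-- with a suitable index the plan typically shows an INDEX RANGE SCAN
-- plus TABLE ACCESS BY INDEX ROWID and no SORT ORDER BY step, so the
-- first rows can be returned as soon as they are found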

Only by forcing the database to actually process all rows do you get the real picture, provided you know that all rows from your statement actually need to be processed. But it's not that easy to force this: the cost based optimizer is smart, and most of the time simply wrapping the statement in a "SELECT COUNT(*) FROM (...)" is again misleading, because the optimizer is often able to avoid the random row access to the table, which is where most of the time is usually spent under such circumstances.
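The difference can be illustrated by referencing a column of the table in the aggregate, as done in the test output further below; a sketch, using the assumed MEASURE column:

-- COUNT(*) alone can typically be answered from the index alone
-- (here both TYPE and ID are assumed to be in the index), so the
-- table is not visited at all
select count(*) from (
  select * from t_fetch_first_rows
  where type = 'VIEW'
  order by id
);

-- referencing a table column forces the random access to the table rows
select sum(measure), count(*) from (
  select * from t_fetch_first_rows
  where type = 'VIEW'
  order by id
);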

The "empty" staging table has been analyzed and afterwards 1,500,000 rows have been loaded which represent approx. 60MB of data, supported by a 30MB index. The table consists of 7,680 blocks (8KB default block size), the index has 3,840 blocks.

Note that the data is skewed: most of the entries have an object type of "VIEW", so filtering for this value will return almost all of the rows of the table.

Elapsed: 00:00:18.95
SQL>
SQL> set autotrace traceonly
SQL>
SQL> select count(*) from (
  2  select * from t_fetch_first_rows
  3  where type = 'VIEW'
  4  order by id
  5  );

Running the above query, including the filter predicate and the ORDER BY clause, returns almost immediately, suggesting a good response time.

And by simply wrapping the query in a "SELECT COUNT(*) FROM (...)" block it seems the result confirms that the query performs well: only 3,758 blocks read in five seconds (by the way, if you check the execution plan, it's an index-only execution that doesn't visit the table at all; the blocks read basically represent all blocks of the index). So obviously there is nothing wrong...

SQL>
SQL> alter system flush buffer_cache;

System altered.

Elapsed: 00:00:00.01
SQL> -- alter session set tracefile_identifier = 'fetch_first_rows1';
SQL>
SQL> -- exec dbms_monitor.session_trace_enable
SQL>
SQL> set autotrace traceonly
SQL>
SQL> -- Force to fetch all rows
SQL> -- Note that COUNT(*) itself is not sufficient
SQL> -- because the optimizer is smart and doesn't visit the table
SQL> -- in this case which means that the test case would be no longer
SQL> -- representative
SQL> select sum(measure), count(*) from (
  2  select * from t_fetch_first_rows
  3  where type = 'VIEW'
  4  order by id
  5  );

But look what happens if I force the database to process all rows in a realistic manner: it takes roughly a minute and, more significantly, approx. 1,500,000 consistent gets to process a table that consists of 7,680 blocks! That's a logical I/O amount of roughly 12GB.

And if you check the AUTOTRACE statistics you'll notice that I was actually lucky: I only read 11,236 blocks physically; the vast majority of the blocks came from the buffer cache once all blocks of the table had been read into it. In a real production system, where the segments might be much larger and potentially multiple such statements compete for the buffer cache simultaneously, many more of those buffer gets might turn into real physical I/O, making things even worse.
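If you want to see at which plan step those buffer gets actually occur, one option (a sketch, assuming at least 10g and the GATHER_PLAN_STATISTICS hint) is to look at the runtime statistics of the cursor:

select /*+ gather_plan_statistics */ sum(measure), count(*) from (
  select * from t_fetch_first_rows
  where type = 'VIEW'
  order by id
);

select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));

-- the "Buffers" column of the TABLE ACCESS BY INDEX ROWID step should
-- account for the vast majority of the approx. 1,500,000 consistent gets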

The impressive amount of logical I/O is caused by the fact that, using this execution plan, the database scanned the index leaf blocks for the rows matching the filter criteria and accessed the corresponding table rows in a random fashion.

But the smallest entity that the database knows in terms of I/O is a block, so in order to read a single random row the whole block where the row resides needs to be read. Due to the bad clustering factor of the index, which means that the table rows are ordered differently from the index, a different block of the table had to be read for each row accessed randomly. So on average each block of the table was accessed almost 200 times during the process (approx. 1,500,000 consistent gets against 7,680 table blocks)!

If the table and the index were ordered in the same way (represented by a low clustering factor of the index, close to the number of blocks of the table, whereas a bad clustering factor is close to the number of rows of the table), the amount of logical I/O required could be significantly lower, since the rows accessed randomly would reside in the same block and therefore the same table block could be re-used to process the next rows.
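The clustering factor can be compared to the number of table blocks and rows with a simple dictionary query; a sketch, using the object names assumed above:

select t.table_name, t.blocks, t.num_rows,
       i.index_name, i.clustering_factor
from   user_tables t, user_indexes i
where  i.table_name = t.table_name
and    t.table_name = 'T_FETCH_FIRST_ROWS';

-- a CLUSTERING_FACTOR close to BLOCKS means the table is ordered like the
-- index; a value close to NUM_ROWS means almost every row visited via the
-- index requires a different table block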

Now I'm going to show that having representative index statistics is as important as having the tables properly analyzed:

A simple full table scan took only three seconds, and processed 7,488 blocks, representing simply the net size of the table. What a vast difference compared to the 12GB processed via the index.
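For comparison, the full table scan can be forced explicitly; a sketch (the FULL hint simply overrides the index-based access here):

select /*+ full(t) */ sum(measure), count(*)
from   t_fetch_first_rows t
where  type = 'VIEW';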

Note that when I repeat the statement using the index access without clearing the buffer cache its actual run time will drop in my test environment from one minute to less than 10 seconds, but that's not really good news, since it's usually quite unrealistic to have all blocks in the buffer cache.

To underline that this inefficient statement represents a real scaling inhibitor, I'll use Tom Kyte's RUNSTATS package to demonstrate the different resource consumption of the two statements:
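The basic usage pattern of the package looks like this; a sketch, assuming it has been installed as RUNSTATS_PKG (the package itself is available from Tom Kyte's asktom site):

exec runstats_pkg.rs_start

-- statement 1: the index-driven version
select sum(measure), count(*) from (
  select * from t_fetch_first_rows
  where type = 'VIEW'
  order by id
);

exec runstats_pkg.rs_middle

-- statement 2: the full table scan version
select /*+ full(t) */ sum(measure), count(*)
from   t_fetch_first_rows t
where  type = 'VIEW';

-- report only statistics and latches that differ by more than 1,000
exec runstats_pkg.rs_stop(1000)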

The most obvious threat can be seen from the LATCH statistics: latches are Oracle-internal serialization devices that are used to protect resources from being modified by multiple processes at the same time. Latches therefore serialize access to certain memory structures, e.g. those used to maintain the buffers in the buffer cache. The more latch activity, the more CPU is used and the less scalable the system becomes, since the concurrent processes have to wait for each other at these serialization points.

So the inefficient statement reported more than 3,000,000 latches, whereas the simple table scan required only approx. 50,000. So even if you're not too concerned about the different run times in this single-user test case, running such statements simultaneously in a real multi-user system will significantly influence the performance of the whole system.

5 comments:

Anonymous said...
Hi Randolf,

My OLTP application works best with optimizer mode FIRST_ROWS_100.

I followed your example (only changed the name of the table and the index) but found that the CBO is still choosing INDEX as the access path and not TABLE ACCESS FULL. Below are all the steps I took. How come it only chose TABLE ACCESS FULL when I used the ALL_ROWS hint:

In your example you tell the optimizer that you're going to fetch 100 rows at most, so it's legitimate and expected that the optimizer uses the index, no matter how many rows are in the table or satisfy the filter predicate, because using the index will be the cheapest way to fetch the first 100 records.

If you fetch all rows but tell the optimizer that you're only fetching the first 100, you can expect that things are not going to perform well.

You need to understand what the FIRST_ROWS_n mode implies.
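The effect can be seen by simply switching the optimizer mode at session level and comparing the plans; a sketch, using the table from the post:

alter session set optimizer_mode = first_rows_100;

explain plan for
select * from t_fetch_first_rows where type = 'VIEW' order by id;

select * from table(dbms_xplan.display);
-- typically an index access: the cheapest way to return the first 100 rows

alter session set optimizer_mode = all_rows;

explain plan for
select * from t_fetch_first_rows where type = 'VIEW' order by id;

select * from table(dbms_xplan.display);
-- typically a full table scan (plus sort) when almost all rows match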

If your application doesn't behave like only fetching the first rows from a result set, the FIRST_ROWS_n mode is very likely not going to be best for your application.

If you have different performance issues using the ALL_ROWS mode you need to find out what the issue is about in detail, but in general if your application uses to fetch all rows from a result set, ALL_ROWS is the mode to use.

If you have further questions I suggest you post a request on the worldwide OTN forum so that other people can comment, too.

That doesn't explain the previous issue you described, since you were running EXPLAIN PLAN, which should perform a hard parse every time.

But if you have a statement cached in the shared pool the default behaviour of 10g is to invalidate the statements affected by the updated statistics within a certain period, which is by default 18,000 secs (5 hours), controlled by the internal parameter "_optimizer_invalidation_period".

This can lead to the situation that you've updated the statistics but affected statements don't get reparsed immediately, but in a deferred fashion.

The rationale behind this default behaviour is to prevent high system load due to potentially excessive hard parsing.

If you want to have all affected statements invalidated immediately, you need the option you've described.
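A sketch of what such a call might look like, assuming the option in question is the NO_INVALIDATE parameter of DBMS_STATS:

begin
  dbms_stats.gather_table_stats(
    ownname       => user,
    tabname       => 'T_FETCH_FIRST_ROWS',
    cascade       => true,
    no_invalidate => false);
end;
/

-- NO_INVALIDATE => FALSE invalidates dependent cursors immediately instead
-- of within the window controlled by "_optimizer_invalidation_period"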

But again, this is a different issue: This is about the actual execution of the statement.