Saturday, April 25, 2009

This is the first part of a series of posts covering one of the fundamentals of the cost based optimizer in 9i and later. Understanding how the different system statistics modes work is crucial to getting the most out of the cost based optimizer, therefore I'll attempt to provide some detailed explanations of and examples for the formulas and arithmetic used. Finally I'll show (again) that using multiple block sizes for "tuning" purposes is in general a bad idea, along with detailed examples of why I think this is so.

One of the deficiencies of the traditional I/O based costing was that it simply counted the number of I/O requests, making no differentiation between single-block I/O and multi-block I/O.

System statistics were introduced in Oracle 9i to allow the cost based optimizer to take into account that single-block I/Os and multi-block I/Os should be treated differently in terms of costing and to include a CPU component in the cost calculation.

The system statistics tell the cost based optimizer (CBO), among other things, the time it takes to perform a single-block read request and a multi-block read request. Given this information the optimizer ought to be able to arrive at estimates that better fit the particular environment the database is running in, and additionally use an appropriate costing for multi-block read requests, which usually take longer than single-block read requests. Since the optimizer knows the time it takes to perform the read requests, the calculated cost can also be turned into a time estimate.

The cost calculated with system statistics is still expressed in the same units as with traditional I/O based costing, which is in units of single-block read requests.

Although the mode using system statistics is also known as "CPU costing", despite the name the system statistics have their most significant impact on the I/O cost calculated for full table scans, due to the different measure MREADTIM used for multi-block read requests.

Starting with Oracle 10g you actually have the choice of three different modes of system statistics, also known as CPU costing:

- default NOWORKLOAD system statistics
- gathered NOWORKLOAD system statistics (DBMS_STATS.GATHER_SYSTEM_STATS('NOWORKLOAD'))
- gathered WORKLOAD system statistics (DBMS_STATS.GATHER_SYSTEM_STATS using the 'START'/'STOP' or 'INTERVAL' gathering modes)

The important point to understand here is that starting with Oracle 10g system statistics are enabled by default (using the default NOWORKLOAD system statistics), and you can only disable them by either downgrading your optimizer (using the OPTIMIZER_FEATURES_ENABLE parameter) or by using undocumented parameters or hints (the "_optimizer_cost_model" parameter, or the CPU_COSTING and NOCPU_COSTING hints).

This initial part of the series will focus on the default NOWORKLOAD system statistics introduced with Oracle 10g.

Default NOWORKLOAD system statistics

The default NOWORKLOAD system statistics measure only the CPU speed (CPUSPEEDNW), the two other remaining values used for NOWORKLOAD system statistics IOSEEKTIM (seek time) and IOTFRSPEED (transfer speed) are using default values (10 milliseconds seek time and 4096 bytes per millisecond transfer speed).

Using these default values for the I/O part the SREADTIM (single-block I/O read time) and MREADTIM (multi-block I/O read time) values are synthesized for cost calculation by applying the following formula:

SREADTIM = IOSEEKTIM + db_block_size / IOTFRSPEED

MREADTIM = IOSEEKTIM + mbrc * db_block_size / IOTFRSPEED

where "db_block_size" represents your database standard block size in bytes and "mbrc" is either the value of "db_file_multiblock_read_count" if it has been set explicitly, or a default of 8 if left unset. From 10.2 on this is controlled internally by the undocumented parameter "_db_file_optimizer_read_count". This means that in 10.2 and later the "mbrc" used by the optimizer to calculate the cost can be different from the "mbrc" actually used at runtime when performing multi-block read requests. If you leave the "db_file_multiblock_read_count" unset in 10.2 and later then Oracle uses a default of 8 for cost calculation but uses the largest possible I/O request size depending on the platform, which is usually 1MB (e.g. 128 blocks when using a block size of 8KB). In 10.2 and later this is controlled internally by the undocumented parameter "_db_file_exec_read_count".

Assuming a default block size of 8KB (8192 bytes) and "db_file_multiblock_read_count" left unset, this results in the following calculation:

SREADTIM = 10 + 8192 / 4096 = 10 + 2 = 12ms

MREADTIM = 10 + 8 * 8192 / 4096 = 10 + 16 = 26ms

These values will then be used to calculate the I/O cost of single-block and multi-block read requests according to the execution plan (number of single-block reads + number of multi-block reads * MREADTIM / SREADTIM), which means that the I/O cost with system statistics (aka CPU costing) is expressed in units of single-block reads.
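As a quick sanity check of the arithmetic above, the synthesized timings and the resulting full table scan I/O cost can be reproduced with a few lines of Python. This is just a sketch of the formulas; it is nothing Oracle actually executes, and the table size of 10,000 blocks is an arbitrary example value:

```python
# Synthesize NOWORKLOAD SREADTIM / MREADTIM from the default values
IOSEEKTIM = 10      # default seek time in ms
IOTFRSPEED = 4096   # default transfer speed in bytes/ms

def sreadtim(block_size):
    # single-block read time in ms
    return IOSEEKTIM + block_size / IOTFRSPEED

def mreadtim(block_size, mbrc):
    # multi-block read time in ms
    return IOSEEKTIM + mbrc * block_size / IOTFRSPEED

def fts_io_cost(blocks, block_size=8192, mbrc=8):
    # number of multi-block read requests, each weighted by
    # MREADTIM / SREADTIM, so the result is expressed in units
    # of single-block reads
    return blocks / mbrc * mreadtim(block_size, mbrc) / sreadtim(block_size)

print(sreadtim(8192))      # 12.0 ms
print(mreadtim(8192, 8))   # 26.0 ms
print(fts_io_cost(10000))  # 10000 / 8 * 26 / 12, roughly 2708 single-block reads
```

Note how the ratio MREADTIM / SREADTIM (26 / 12, roughly 2.17 here) is what makes each multi-block read request more expensive than a single-block read in this costing model.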

You can derive from the above formula that with system statistics the cost of a full table scan operation is going to be more expensive by approximately the factor MREADTIM / SREADTIM compared to the traditional I/O based costing used by default in pre-10g, therefore system statistics usually tend to favor index access a bit more.

Note that the above factor MREADTIM / SREADTIM is not entirely correct, since the traditional I/O costing introduces an efficiency reduction factor when using higher MBRC settings, presumably to reflect that the larger the number of blocks per I/O request, the higher the probability that the full request size can't be used, due to blocks already being in the buffer cache or hitting extent boundaries.

So with an MBRC setting of 8 the adjusted MBRC used for calculation is actually 6.59. Using e.g. a very high setting of 128 for the MBRC will actually use 40.82 for calculation. So the higher the setting, the more the MBRC used for calculation will be reduced.
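The adjusted values quoted above follow quite closely the well-known empirical curve fit adjusted MBRC = 1.6765 * MBRC^0.6581 published by Wolfgang Breitling. Note that these constants are a curve fit against observed full table scan costs, not documented Oracle internals, so treat the following only as an illustration of the trend:

```python
# Empirical curve fit for the adjusted MBRC used by traditional I/O costing.
# The constants 1.6765 and 0.6581 come from Wolfgang Breitling's curve fit
# against observed costs - they are not documented Oracle internals.
def adjusted_mbrc(mbrc):
    return 1.6765 * mbrc ** 0.6581

for mbrc in (8, 16, 32, 64, 128):
    print(mbrc, round(adjusted_mbrc(mbrc), 2))
# yields roughly 6.59 for MBRC = 8 and about 40.8 for MBRC = 128,
# matching the adjusted values quoted above
```

The diminishing returns are obvious: going from MBRC 8 to 128 (16 times larger) only increases the adjusted value used for costing by a factor of about 6.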

The following test case shall demonstrate the difference between traditional I/O costing, CPU costing and the factor MREADTIM / SREADTIM when using different "db_file_multiblock_read_count" settings. The test case was run against 10.2.0.4 Win32.

Note that the test case removes your current system statistics so you should be cautious if you have non-default system statistics at present in your database.

-- Use default NOWORKLOAD system statistics
-- for test but ignore CPU cost component
-- by using an artificially high CPU speed
begin
  dbms_stats.delete_system_stats;
  dbms_stats.set_system_stats('CPUSPEEDNW', 1000000);
end;
/

-- In order to verify the formula against the
-- optimizer calculations
-- don't increase the table scan cost by one
-- which is done by default from 9i on
alter session set "_table_scan_cost_plus_one" = false;

So as you can see the I/O costs for a full table scan are significantly different when using default NOWORKLOAD system statistics. You can also see that the SREADTIM and MREADTIM values derived are quite different when using different "db_file_multiblock_read_count" settings. Furthermore the difference between traditional I/O based costing and the CPU costing is not the factor MREADTIM / SREADTIM as suggested by the formula, but is reduced by the adjustment applied to the MBRC when using traditional I/O costing.

The next part of the series will cover the remaining available system statistics modes.

Thursday, April 16, 2009

Thanks to Oracle ACE H.Tonguç Yılmaz and special thanks to Oracle ACE Dion Cho, who nominated me for the Oracle ACE award.

Some statistics (since I'm a CBO guy :-):

- I'm truly honored to be Oracle ACE no. 210 in the world
- There are at present 57 Oracle ACEs in the "Database Management & Performance" category (53 in "Database App Development" and 10 in "Business Intelligence")
- There are 7 ACEs from Germany at present

Thursday, April 9, 2009

Oracle uses a true multi-versioning approach for its read consistency model, which allows readers not to block writers and writers not to block readers. Obviously this great feature allowing highly concurrent processing doesn't come for free, since the information needed to build multiple versions of the same data has to be stored somewhere.

Oracle uses the so-called undo information not only to roll back ongoing transactions but also to re-construct old versions of blocks if required. Put very simply, when reading data Oracle knows the point in time (which corresponds to an internal counter called SCN, System Change Number) that the data needs to be consistent with. In the default READ COMMITTED isolation level this point in time is defined when a statement starts to execute. You could also say that at the moment a statement starts to run its result is pre-ordained. When Oracle processes a block it checks if the block is "old" enough; if it discovers that the block content is too new (it has been changed by other sessions, but the current access is not supposed to see this updated content according to the point in time assigned to the statement execution), it will create a copy of the block and use the information available in the corresponding undo segment to re-construct an older version of the block. Note that this process can be iterative: if after re-constructing the older version of the block it's still not sufficiently old, more undo information will be used to go even further back in time.

Since the undo information of committed transactions is marked as re-usable, Oracle is free to overwrite the corresponding undo data under certain circumstances (e.g. no more free space left in the UNDO tablespace). If an older version of a block now needs to be created but the undo information required to do so has been overwritten, the infamous "ORA-01555 snapshot too old" error will be raised, since the required read-consistent view of the data can no longer be generated.

In order to avoid this error, from 10g on you only need a sufficiently large UNDO tablespace in automatic undo management mode, so that the undo information required to create old versions of the blocks doesn't get overwritten prematurely. In 9i you need to set the UNDO_RETENTION parameter according to the longest expected runtime of your queries, and of course have sufficient space in the UNDO tablespace to allow Oracle to adhere to this setting.

So until now Oracle was either able to provide a consistent view of the data according to its read-consistency model, or you would get an error message if the required undo data wasn't available any longer.

-- small cache so that old copies of the blocks won't survive in the buffer cache
-- and delayed block cleanout probability increases
alter system set db_cache_size = 2M scope = memory;

Note that all examples here use DBMS_JOB to simulate the simultaneous modification and reading of data, therefore you need to have the JOB_QUEUE_PROCESSES parameter set accordingly, otherwise the job won't get executed.

I've started with a variation of Tom Kyte's example of how to deliberately force an ORA-01555 error, which looks like this:

1. It creates a table copy of the ALL_OBJECTS view ordered randomly, and adds a primary key index on the OBJECT_ID

2. It issues a query that uses the FIRST_ROWS hint to force an index access to the table because of the available primary key index and the corresponding ORDER BY. It's one of the built-in heuristic rules of the (deprecated) FIRST_ROWS cost based optimizer mode that an ORDER BY is going to use an index if possible to avoid a sort operation. By using this inefficient approach it is ensured that each block of the table will be accessed multiple times due to the random row access driven by the ordered index.

3. It then spawns a job simulating a separate session that starts to overwrite the data row-by-row the query is supposed to read. Specifically the MY_ID column which has been generated with 1 will be set to 0. By committing each single row update operation the small undo tablespace will eventually be filled up and old undo data can be and needs to be overwritten due to insufficient space.

4. While the update loop is running the data from the query gets slowly fetched. Due to the fact that each block will be visited many times according to the index access it's almost guaranteed that the undo information required to re-construct the old version of the block has been overwritten (due to the artificially small undo tablespace) and therefore the ORA-01555 error will occur.

And sure enough, when running this in 11.1.0.7 with the prerequisites met, the output will look similar to the following. Note that the first line shows what we expect to get from the second query: only one distinct value, namely 1.

Re-running the test case shows that you still get the same error; obviously the SCN_ASCENDING hint doesn't help to avoid the error in this case.

Now if you read the Metalink note subject again, you might notice that it says: "Allow ORA-1555 to be ignored during table scan". Since our example doesn't use a full table scan but a table access by ROWID, the hint maybe doesn't work as expected.

Let's modify our test case a little bit to use a full table scan instead of the index access path:

It can be clearly seen that the 0s returned by the query shouldn't be there according to the first line of the output, so this scary feature seems to have worked in this case.

Interestingly you get the same result and behaviour when running the test case against 10.2.0.4, so although the hint is not documented for that version it seems to work there, too.

I couldn't reproduce this on 9.2.0.8, so obviously it wasn't backported there.

Here's another, slightly more complex but even more impressive test case, which basically does the same, but introduces some further stuff. Note that it might require the following to be granted as user SYS to the user executing the test case. The grant on DBMS_LOCK is actually also required for Tom Kyte's demonstration code above:

1. Creates two tables, one that will be modified and read, and another one whose sole purpose is to ensure that the undo will be overwritten

2. Uses a pipelined table function to fetch data slowly from a passed ref cursor object. Note that this is purely optional for demonstration purposes of a pipelined table function and as you can see the commented part simply fetches from the initial cursor directly to achieve the same result.

3. Uses DBMS_PIPE to perform very rudimentary synchronisation between the spawned jobs and the main session.

4. The basic principle is similar, but somewhat different to the previous test case:
- We open a cursor. At that moment the result is pre-ordained.
- Then we spawn a separate job that modifies the complete table that the query is based on.
- Once this is successfully done we spawn another job that attempts to fill up and overwrite our small undo tablespace.
- While this job is running we start to fetch from the initially opened cursor.
- As soon as the fetch is complete, either due to errors or successfully completed, we tell the job to stop the update operation and finally clean up if the job for whatever reason is still running (which should not happen).

Here is the result from 11.1.0.7 (and 10.2.0.4 which behaves the same) without the SCN_ASCENDING hint:

In this case again I couldn't prevent the ORA-01555 error, so this seems to corroborate the theory that only full table scans are able to use the SCN_ASCENDING request successfully.

So in summary I have to say that this feature seems to be quite questionable, maybe even buggy, and even when it works it looks quite scary given the otherwise very robust multi-versioning capabilities of Oracle, which represent one of the cornerstones of its fundamental architecture.

I haven't checked yet whether the hint also modifies the behaviour of DML statements, but since these already employ their "write" consistency, as it is called by Tom Kyte, it's quite unlikely that the SCN_ASCENDING hint is applicable. This means that an update DML statement (or SELECT FOR UPDATE) that while processing encounters data that has been modified in the meantime by others is going to "restart". This effectively means that any changes already applied are rolled back and the statement starts again from scratch based on the latest data. Note that this restart can happen multiple times, and yes, the amount of undo and redo generated will increase if this happens, although Oracle seems to roll back the changes only once and from then on switch to a SELECT FOR UPDATE mode first.

This is a bit similar to the SCN_ASCENDING behaviour, but the crucial difference is that the DML statement is able to re-start its work, whereas the query might already have fetched numerous rows that have been processed by the client. So whereas the DML statement is still consistent because it starts all over again, the query results are potentially inconsistent, since no re-start is possible once the client has already processed a part of the result set.

As a side note: The "restart" effect of the "write" consistency can actually lead to triggers being fired multiple times for the same row and is one of the reasons why you should never perform non-transactional operations (typically sending an email) from within a trigger. The non-transactional operation cannot be rolled back and therefore will be potentially repeated, e.g. sending out emails to the same recipient multiple times. One possible solution to this problem is to encapsulate the non-transactional operation into something that is transactional, e.g. a job submitted via DBMS_JOB, because DBMS_JOB is transactional and the job creation will be rolled back as part of the DML restart.