Tuesday, May 26, 2009

The answer - as already disclosed by Nicolas Gasparotto - to the question that I asked here in the "Weekend Quiz" is to run the script as SYS user, and then run the query shown against these objects in the SYS schema (tested against 10g XE, 10.2.0.4 and 11.1.0.7 on Win32).

Note: It's not recommended to create any non-SYS objects in the SYS schema and you should only perform this (if at all) in a test database.

All this came up in this recent OTN forum thread where it became obvious that the issue can only be reproduced if the objects are owned by SYS.

There are two interesting points to derive from this (apart from the obvious that one should not create any user objects in the SYS schema):

1. The optimizer seems to treat objects owned by SYS differently, in particular regarding the transformations applied. Note that the crucial point is not that the query is executed as SYS user, but that the objects are owned by the SYS user. Granting appropriate privileges on the SYS-owned objects to a non-SYS user makes it possible to reproduce the issue even as a non-SYS user.

2. It's something to keep in mind if you ever need to understand a performance issue with a recursive dictionary query performed on SYS-owned objects. Although you obviously can't influence the SQL generated by Oracle itself, it might help you to understand the issue and take appropriate steps to address it.

Oh, by the way, have I already mentioned that it's really a bad idea to create user objects in the SYS schema?

Sunday, May 24, 2009

I've already mentioned it several times on my blog but I would like to take the chance here again to stress the point that the cost based optimizer does a bad job when it comes to calculating costs for full table scans of objects residing in non-default block sizes. It really looks like this feature was introduced to support transportable tablespaces, but it obviously hasn't been tested very thoroughly when it comes to cost calculation.

Each of the different costing modes has its deficiencies when dealing with objects in non-default block sizes. The somewhat odd thing is that the traditional I/O costing does the best job, while all system statistics based calculations are utterly wrong.

Traditional I/O based costing

The traditional I/O based costing simply scales the MBRC up or down according to the non-default block size to arrive at the same I/O read request size. So if you e.g. have an MBRC of 8 and a default block size of 8KB and now calculate the cost for an object residing in a 2KB tablespace, the MBRC will be multiplied by 4, resulting in an MBRC of 32. The I/O cost will nevertheless be different, due to the different adjustment applied to the higher MBRC setting: the adjusted MBRC for 32 is 16.39 whereas the adjusted MBRC for 8 is 6.59, so the calculated cost for the full table scan of the object residing in the 2KB tablespace will be higher. Likewise, for an object in a 16KB non-default tablespace the MBRC will be reduced accordingly to 4 to arrive at the same I/O read size again. Since the adjusted MBRC for an MBRC of 4 is 4.17, the cost calculated will actually be lower for the object residing in the 16KB tablespace.
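The effect described above can be sketched with a few lines of Python. This is a simplified model, not the optimizer's exact code path: it assumes the traditional full table scan cost is approximately "blocks below HWM / adjusted MBRC" and uses the adjusted MBRC values quoted in the text (6.59 for 8, 16.39 for 32, 4.17 for 4).

```python
# Sketch of the traditional I/O cost calculation for a full table scan
# of the same 80MB segment in different block sizes. Assumption: the
# cost is roughly blocks / adjusted MBRC, with the adjusted MBRC values
# as quoted in the text.

ADJUSTED_MBRC = {4: 4.17, 8: 6.59, 32: 16.39}

def traditional_fts_cost(blocks, default_bs, object_bs, mbrc=8):
    # scale the MBRC so the I/O read request size stays the same
    scaled_mbrc = mbrc * default_bs // object_bs
    return blocks / ADJUSTED_MBRC[scaled_mbrc]

cost_8k = traditional_fts_cost(10_000, 8192, 8192)    # ~1,517
cost_2k = traditional_fts_cost(40_000, 8192, 2048)    # ~2,441 (higher)
cost_16k = traditional_fts_cost(5_000, 8192, 16384)   # ~1,199 (lower)
print(round(cost_8k), round(cost_2k), round(cost_16k))
```

The relative ordering matches the text: the 2KB object costs more than the 8KB default, the 16KB object slightly less, although at runtime all three perform roughly the same number of I/O requests.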

System statistics

Things get more complicated when using NOWORKLOAD or WORKLOAD system statistics. To recall, the formula to calculate the I/O cost of a full table scan with system statistics is:

Number of blocks / MBRC * MREADTIM / SREADTIM

And in case of NOWORKLOAD system statistics the MREADTIM and SREADTIM are synthesized using the following formula:

SREADTIM = IOSEEKTIM + DB_BLOCK_SIZE / IOTRFSPEED

MREADTIM = IOSEEKTIM + DB_BLOCK_SIZE * MBRC / IOTRFSPEED
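Plugging the default NOWORKLOAD values into these formulas (IOSEEKTIM = 10 ms, IOTFRSPEED = 4096 bytes/ms, which are the documented defaults) gives the synthesized timings and the resulting I/O cost; the 10,000 blocks / MBRC 8 figures below are the default block size case used throughout this series:

```python
# Synthesizing SREADTIM/MREADTIM from the default NOWORKLOAD values
# and deriving the full table scan I/O cost, following the formulas above.

IOSEEKTIM = 10       # ms, default
IOTFRSPEED = 4096    # bytes/ms, default

def synthesized_timings(block_size, mbrc):
    sreadtim = IOSEEKTIM + block_size / IOTFRSPEED
    mreadtim = IOSEEKTIM + block_size * mbrc / IOTFRSPEED
    return sreadtim, mreadtim

def fts_io_cost(blocks, block_size, mbrc):
    sreadtim, mreadtim = synthesized_timings(block_size, mbrc)
    return blocks / mbrc * mreadtim / sreadtim

# 10,000 blocks in the 8KB default block size with MBRC = 8
sreadtim, mreadtim = synthesized_timings(8192, 8)
print(sreadtim, mreadtim)            # 12.0 ms and 26.0 ms
print(fts_io_cost(10_000, 8192, 8))  # ~2,708
```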

Now if the object resides in a non-default block size tablespace, an inconsistent adjustment is applied to these formulas: the MBRC is scaled according to the block size, but the DB_BLOCK_SIZE used to synthesize the timings remains the default one.

So obviously something is odd in the above formulas: the SREADTIM and MREADTIM values are synthesized from a mixture of a scaled MBRC (according to the block size) and a non-adjusted default DB_BLOCK_SIZE, resulting in a large variation in cost. A full table scan in a smaller block size is much more expensive than in the default block size, and likewise a full table scan in a larger block size is much cheaper. Unfortunately this doesn't reflect the runtime behaviour at all: since Oracle actually scales the I/O read request size accordingly, the runtime difference usually is negligible, yet the calculated costs differ dramatically.
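The inconsistency can be sketched as follows. This is an assumption-laden model of the behaviour described above (scaled MBRC, non-adjusted default DB_BLOCK_SIZE in the synthesized timings), not a documented formula:

```python
# Sketch of the inconsistent NOWORKLOAD adjustment described above:
# the MBRC is scaled according to the block size, but the synthesized
# timings keep using the *default* DB_BLOCK_SIZE.

IOSEEKTIM, IOTFRSPEED = 10, 4096
DEFAULT_BS = 8192

def noworkload_cost(blocks, object_bs, mbrc=8):
    scaled_mbrc = mbrc * DEFAULT_BS // object_bs      # same I/O request size
    sreadtim = IOSEEKTIM + DEFAULT_BS / IOTFRSPEED    # default block size!
    mreadtim = IOSEEKTIM + DEFAULT_BS * scaled_mbrc / IOTFRSPEED
    return blocks / scaled_mbrc * mreadtim / sreadtim

cost_8k = noworkload_cost(10_000, 8192)    # ~2,708
cost_2k = noworkload_cost(40_000, 2048)    # ~7,708, much more expensive
cost_16k = noworkload_cost(5_000, 16384)   # 1,875, much cheaper
print(round(cost_8k), round(cost_2k), round(cost_16k))
```

Under this model the same 80MB segment costs almost three times as much in a 2KB tablespace as in the 8KB default, even though the runtime I/O is virtually identical.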

WORKLOAD system statistics:

MBRC as measured/set

SREADTIM as measured/set

MREADTIM as measured/set

I/O cost = Number of blocks / MBRC * MREADTIM / SREADTIM

This is even worse than the NOWORKLOAD result above, because the calculated I/O cost simply differs by the factor of number of blocks in the non-default block size / number of blocks in the default block size. E.g. an object residing in a 2KB block size will have an I/O cost four times higher than an object residing in an 8KB default block size, since the MBRC is not adjusted at all for the calculation.
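The factor-of-four effect follows directly from the formula, since only the block count changes with the block size. The measured values below are hypothetical; any consistent set shows the same ratio:

```python
# With WORKLOAD system statistics the measured MBRC/SREADTIM/MREADTIM are
# used as-is, so for the same 80MB segment only the block count changes.
# Hypothetical measured values (assumption, for illustration only):
MBRC, SREADTIM, MREADTIM = 8, 12.0, 26.0

def workload_cost(blocks):
    return blocks / MBRC * MREADTIM / SREADTIM

cost_8k = workload_cost(10_000)   # segment in the 8KB default block size
cost_2k = workload_cost(40_000)   # same 80MB segment in 2KB blocks
print(cost_2k / cost_8k)          # 4.0 -> four times the cost
```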

Some calculation examples

The following examples attempt to summarize what you get if you're using an object in a non-default blocksize tablespace:

1. I/O Cost of a full table scan of a 10,000 blocks / 80MB segment residing in a default blocksize of 8KB (in case of system statistics excluding the CPU cost) and using an MBRC of 8:

WORKLOAD system statistics: same as above, provided you use the same MREADTIM and SREADTIM values; otherwise as measured

2. I/O Cost of a full table scan of a 40,000 blocks / 80MB segment residing in a non-default blocksize of 2KB (in case of system statistics excluding the CPU cost) and using a default blocksize of 8KB and a MBRC of 8:

3. I/O Cost of a full table scan of a 5,000 blocks / 80MB segment residing in a non-default blocksize of 16KB (in case of system statistics excluding the CPU cost) and using a default blocksize of 8KB and a MBRC of 8:

And here is a final example showing that simply moving an object to a non-default block size tablespace can change the execution plan: an index scan is used instead of the full table scan, since the cost of the full table scan operation is significantly increased. The SQL trace reveals, however, that at runtime there is no significant difference between the full table scan executed in the default tablespace and in the non-standard block size tablespace.

Don't be misled by the different number of consistent gets; this is just due to the different block size. As can be seen from the raw trace files, the multi-block reads in the 2KB block size read 32 blocks at a time, whereas in the 8KB default block size 8 blocks are read at a time, so the number of I/O requests will be similar given a segment of the same size.

Note 1: It's assumed that everything is equal that is (normally) considered by the cost based optimizer: Object statistics, System statistics, segment sizes, the actual data and optimizer related settings including all "underscore" parameters. It can be reproduced on the same database.

Hint: The solution should not be applied to a production system.

Note 2: And no, Timur, you're not allowed to participate, since you already know the answer!

Sunday, May 17, 2009

In 10g the CPU costing mode is enabled by default and is supported by the default NOWORKLOAD system statistics.

But you can use system statistics already in 9i, although you have to enable them explicitly.

Oracle 9i is out of support, but I guess there are still a lot of systems out there using the 9.2 release, therefore I find it worth mentioning what you can do with system statistics in 9i. This can be helpful if you are considering testing your application with system statistics in the 9i environment before upgrading to 10g.

In most descriptions about 9i and system statistics only WORKLOAD system statistics are mentioned, but 9i also supports NOWORKLOAD system statistics, although not in the same full flavour as 10g does.

You can activate CPU costing in 9i by running DBMS_STATS.GATHER_SYSTEM_STATS('NOWORKLOAD'), but this seems to work differently than in 10g and later.

Whereas in 10g and later this actually measures the IOSEEKTIM and IOTFRSPEED values, in 9i it seems to activate something comparable to the default NOWORKLOAD system statistics of 10g.

The SYS.AUX_STATS$ table does not show any actual values, but tests revealed that in that case 9i (at least 9.2.0.8) seems to measure the CPU speed (in 10g this is the CPUSPEEDNW value) and uses the same default values for IOSEEKTIM and IOTFRSPEED as 10g does (10ms and 4096 bytes/ms respectively).

Running some tests showed that you arrive at the same I/O cost as you do in 10g with the default NOWORKLOAD system statistics.

exec dbms_stats.delete_system_stats;

exec dbms_stats.gather_system_stats('NOWORKLOAD')

alter session set "_table_scan_cost_plus_one" = false;

explain plan for select max(val) from t1;

You can't modify the NOWORKLOAD system statistics values in 9i (i.e. you can't manipulate CPUSPEEDNW, IOSEEKTIM and IOTFRSPEED using DBMS_STATS.SET_SYSTEM_STATS). The following excerpt from the 9.2.0.8 10053 trace file for the 10,000 blocks MSSM table with 8KB block size shows what is used when NOWORKLOAD system statistics are active:

If you gather or set WORKLOAD system statistics 9i calculates the same cost as 10g does, based on the SREADTIM, MREADTIM and MBRC values.

You can manipulate the WORKLOAD system statistics using DBMS_STATS.SET_SYSTEM_STATS like you can do in 10g. So the following code snippet mimics the default NOWORKLOAD SREADTIM and MREADTIM values for an 8KB blocksize with a DB_FILE_MULTIBLOCK_READ_COUNT = 8.

If the WORKLOAD system statistics are invalid (e.g. the MREADTIM is not greater than the SREADTIM value, or some values are missing), then 9i falls back to the traditional I/O based costing for the I/O cost but obviously still adds CPU costs. This is different from 10g, which falls back to NOWORKLOAD system statistics.

If you change the corresponding code block in the above snippet accordingly, you get an interesting mixture of traditional I/O costing and CPU costing.

Quirks and oddities

While running my tests I came across several quirks and oddities:

- When gathering NOWORKLOAD system statistics with 10.2.0.4 and 11.1.0.7 on my test environment it happened regularly that the IOTFRSPEED was left unchanged at the default of 4096, but sometimes it also changed to a value that looked more like a measured one.

- You obviously can't set the CPUSPEEDNW/IOSEEKTIM/IOTFRSPEED values using DBMS_STATS.SET_SYSTEM_STATS in 9i for the NOWORKLOAD system statistics, so you're bound to the default values used by Oracle.

- If you're on 10g or later and use OPTIMIZER_FEATURES_ENABLE = '9.2.0.8', the NOWORKLOAD system statistics are not used properly. Although the 10053 optimizer trace suggests otherwise by mentioning the NOWORKLOAD values, the final cost calculated falls back to traditional I/O based costing (with no CPU costing at all). This is probably caused by the fact that 9i doesn't support the NOWORKLOAD parameters in the SYS.AUX_STATS$ table that are always present in 10g. System statistics do work in 9.2.0.8 compatibility mode, however, if you're using WORKLOAD system statistics with their corresponding values in SYS.AUX_STATS$.

- Altering the SYSTEM value of db_file_multiblock_read_count: This seems to be buggy because the value set is not getting used in the session that altered the system. You need to set other optimizer related values in the same session to make the change of the parameter effective. For more information, see Dion Cho's blog post about the issue.

The next part of the series will cover my "favorite" topic using objects in multiple block sizes and effects on costing, in particular when using system statistics.

Sunday, May 3, 2009

Before heading on to the remaining modes of system statistics, let's summarize what has been observed in part 1 regarding the default NOWORKLOAD system statistics in 10g and later. The following table shows what the test case from the previous post demonstrated:

If you happen to have a 16KB default blocksize the results would look like the following. Note that the table is now only 5,000 blocks in size, and the SREADTIM is now a bit longer (10+16384/4096=14ms instead of 10+8192/4096=12ms) therefore the 16KB blocksize calculation makes the full table scan look a bit cheaper to the optimizer when using the default NOWORKLOAD system statistics.

If you gather NOWORKLOAD system statistics using DBMS_STATS.GATHER_SYSTEM_STATS('NOWORKLOAD') the values IOSEEKTIM and IOTFRSPEED will actually get measured and used accordingly.

The remaining calculations for deriving the SREADTIM and MREADTIM values correspond to what has been outlined above.

Gathering NOWORKLOAD statistics:

exec DBMS_STATS.GATHER_SYSTEM_STATS('NOWORKLOAD')

This actually gathers the IOTFRSPEED and IOSEEKTIM values in addition to CPUSPEEDNW rather than using the default values of 4096 and 10.

In 10g and later this may take from a couple of seconds to a couple of minutes depending on the size of your database. Note that this puts additional load onto your system while gathering the NOWORKLOAD system statistics since it submits random reads against all data files.

The following test case shows the different I/O cost calculations when using default NOWORKLOAD system statistics and custom gathered NOWORKLOAD system statistics. It again creates the 10,000 blocks table in an 8KB default block size locally managed tablespace using manual segment space management:

Based on the gathered IOSEEKTIM and IOTFRSPEED values the I/O cost calculated is significantly different.

Applying the known formulas we can reproduce the calculated figures:

SREADTIM = IOSEEKTIM + DB_BLOCK_SIZE / IOTFRSPEED

SREADTIM = 14.226 + 8192 / 32,517.754 = 14.478

MREADTIM = IOSEEKTIM + MBRC * DB_BLOCK_SIZE / IOTFRSPEED

MREADTIM = 14.226 + 8 * 8192 / 32,517.754 = 16.241

FTS cost = Blocks below HWM / MBRC * MREADTIM / SREADTIM

FTS cost = 10,000 / 8 * 16.241 / 14.478 = 1,403
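The arithmetic above can be double-checked in a few lines, using the gathered values from the test case (IOSEEKTIM = 14.226 ms, IOTFRSPEED = 32,517.754 bytes/ms):

```python
import math

# Reproducing the calculation above with the gathered NOWORKLOAD values.
IOSEEKTIM = 14.226       # ms, as gathered
IOTFRSPEED = 32_517.754  # bytes/ms, as gathered

sreadtim = IOSEEKTIM + 8192 / IOTFRSPEED       # ~14.478 ms
mreadtim = IOSEEKTIM + 8 * 8192 / IOTFRSPEED   # ~16.241 ms
fts_cost = 10_000 / 8 * mreadtim / sreadtim    # ~1,402.3 -> 1,403 rounded up

print(round(sreadtim, 3), round(mreadtim, 3), math.ceil(fts_cost))
```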

Gathered WORKLOAD system statistics

Gathering WORKLOAD system statistics measures a different set of values, including the actual MBRC, SREADTIM and MREADTIM values. The cost calculation therefore doesn't use the synthesized SREADTIM and MREADTIM values any longer, nor does it use the "_db_file_optimizer_read_count" parameter in 10g and later, but uses simply the measured values.

Therefore the I/O costs calculated with WORKLOAD system statistics do not depend on the "db_file_multiblock_read_count" value in effect at cost calculation time. The important point to keep in mind, however, is that the gathered WORKLOAD system statistics are based on the "db_file_multiblock_read_count" (in 10g and later, the internal parameter "_db_file_exec_read_count") used at runtime, so the values measured are obviously influenced by this setting ("_db_file_exec_read_count" equals "db_file_multiblock_read_count" if that has been set and the underscore parameter hasn't been modified).

As already mentioned in part 1, Oracle introduced in 10.2 the behaviour that if "db_file_multiblock_read_count" is left unset, different values are used for cost calculation and at execution time (8 for calculation and the largest possible I/O size at runtime, usually 1MB on most platforms). In general this points in the right direction, since it allows the calibration code to work out the largest MBRC achievable at runtime. Note that Christian Antognini doesn't agree with this approach in his book "Troubleshooting Oracle Performance", where he advises to manually work out the "optimal" MBRC setting by running suitable I/O tests.

Note that in 10g and later the runtime engine still uses the "_db_file_exec_read_count", regardless of the MBRC used to calculate the cost.

If you run the following code snippet in 10g and later and check the resulting trace files, you'll see this confirmed:

exec DBMS_STATS.GATHER_SYSTEM_STATS('START')

-- some significant (ideally "representative") workload needs to be performed
-- otherwise some or all of the measured values will be missing

exec DBMS_STATS.GATHER_SYSTEM_STATS('STOP')

or

exec DBMS_STATS.GATHER_SYSTEM_STATS('INTERVAL', )

Note that gathering workload system statistics doesn't put additional load onto your system, since the values are derived from the delta in statistics already maintained by Oracle during database activity.

Furthermore, if your workload doesn't use "db file scattered read", i.e. multi-block reads that work with the buffer cache, then you might end up with WORKLOAD system statistics that are missing the MBRC and MREADTIM components (null values). This might happen if you e.g. use only index access paths with random table row lookups by ROWID, and/or all your table scans go parallel, or in 11g use serial direct reads that bypass the buffer cache (which can be activated in pre-11g using the hidden parameter "_serial_direct_read").

The same applies to "db file sequential read", i.e. single-block reads: if you only perform multi-block reads in your workload, then the SREADTIM information might be missing from the gathered statistics.

Although the official documentation says in 10.2 for that case the following:"During the gathering process of workload statistics, it is possible that mbrc and mreadtim will not be gathered if no table scans are performed during serial workloads, as is often the case with OLTP systems. On the other hand, FTS occur frequently on DSS systems but may run parallel and bypass the buffer cache. In such cases, sreadtim will still be gathered since index lookup are performed using the buffer cache. If Oracle cannot gather or validate gathered mbrc or mreadtim, but has gathered sreadtim and cpuspeed, then only sreadtim and cpuspeed will be used for costing. FTS cost will be computed using analytical algorithm implemented in previous releases. Another alternative to computing mbrc and mreadtim is to force FTS in serial mode to allow the optimizer to gather the data."

And the 11g documentation says this:"If Oracle Database cannot gather or validate gathered mbrc or mreadtim values, but has gathered sreadtim and cpuspeed values, then only the sreadtim and cpuspeed values are used for costing. In this case, the optimizer uses the value of the initialization parameter DB_FILE_MULTIBLOCK_READ_COUNT to cost a full table scan. However, if DB_FILE_MULTIBLOCK_READ_COUNT is not set or is set to 0 (zero), then the optimizer uses a value of 8 for costing."

But when testing this it looked as if, when either MBRC or MREADTIM was missing, the optimizer simply reverted to the available NOWORKLOAD system statistics (note that this applies to 10g and later; I'll show in the next part of the series what happens in 9i, since things are different there).

Note that in order to have the optimizer accept the WORKLOAD system statistics, the MREADTIM needs to be greater than the SREADTIM. If your multi-block read requests are served from a cache, or your storage system performs aggressive read-aheads, the measured MREADTIM can be less than the SREADTIM. In this case you might need to adjust the MREADTIM manually using the GET_SYSTEM_STATS/SET_SYSTEM_STATS API, which will be covered below.

One interesting oddity showed up when MBRC was available but MREADTIM was missing or not greater than SREADTIM: in that case it looks like the NOWORKLOAD calculations use the MBRC that has been set when synthesizing the MREADTIM and calculating the full table scan cost. This makes sense, but is an interesting mixture of NOWORKLOAD and WORKLOAD system statistics.

The following test case shows how to gather WORKLOAD system statistics, and how to manually correct an MREADTIM value that was gathered too low.

Number of multi-block read requests * MREADTIM = time it takes to perform this number of read requests, in milliseconds.

Finally this is divided by SREADTIM to arrive at the known unit used for cost representation which is number of single read requests.

10,000 / 8 = 1,250 multi-block read requests

1,250 * 12.93 = 16,162.5 ms execution time (which is shown as 17 seconds in the plan by the way)

16,162.5 / 8.021 = 2,015.02 (which is shown as 2,016 in the plan)
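The three steps above can be reproduced directly, using the values from this test case (MBRC = 8, SREADTIM = 8.021 ms, MREADTIM = 12.93 ms):

```python
# Reproducing the WORKLOAD cost arithmetic above.
MBRC, SREADTIM, MREADTIM = 8, 8.021, 12.93   # values from the test case

multi_block_reads = 10_000 / MBRC            # 1,250 read requests
exec_time_ms = multi_block_reads * MREADTIM  # ~16,162.5 ms (~17 s in the plan)
io_cost = exec_time_ms / SREADTIM            # ~2,015 single-block read units

print(multi_block_reads, round(exec_time_ms, 1), round(io_cost, 2))
```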

Manually writing and maintaining system statistics

Using the DBMS_STATS.GET_SYSTEM_STATS / SET_SYSTEM_STATS API you can write your own set of system statistics for both NOWORKLOAD and WORKLOAD values.

You can use DBMS_STATS.DELETE_SYSTEM_STATS to remove the system statistics, which will activate the default NOWORKLOAD system statistics in 10g and disable CPU costing in 9i.

You can use DBMS_STATS.EXPORT_SYSTEM_STATS / IMPORT_SYSTEM_STATS to export and import system statistics to a user statistics table created with DBMS_STATS.CREATE_STAT_TABLE.

Note that DBMS_STATS.GATHER_SYSTEM_STATS, when used with a user stats table (created with DBMS_STATS.CREATE_STAT_TABLE), behaves differently than e.g. DBMS_STATS.GATHER_TABLE_STATS: whereas object related statistics always go to the data dictionary (you only have the option to save the current statistics to the user stats table before replacing them with the new values), GATHER_SYSTEM_STATS actually writes the system statistics into the user stats table and doesn't change the actual system statistics if you supply a user stats table name.

For further discussion of how you could use system statistics, see Jonathan Lewis' thoughts on this topic:

The next part of the series will cover the usage of system statistics in 9i, highlight some quirks and oddities observed and finally show what happens if you attempt to use multiple block sizes for "tuning" purposes.