Sunday, August 28, 2016

This is the third part of this installment, comparing the performance consistency of the DBaaS cloud offering with a dedicated physical host. This time, instead of burning CPU using logical I/O only (see part 2), the test harness executes a SQL statement that has to perform mostly physical I/O. To be more precise, this round is a read-only test, so it doesn't include any modification to data (except the logging and other background, system-generated write activity, like maintaining ASH / AWR data).

In order to maximize the physical I/O part the database instance was configured with a minimum sized buffer cache (16k block size in this case) and the following script was executed as many times as CPUs were available:

declare
  n number;
begin
  loop
    select /*+ leading(t_o) use_nl(t_i) index(t_o) index(t_i) */
           sum(t_i.n)
    into   n
    from   t_o
         , t_i&tabname t_i
    where  t_o.id_fk = t_i.id;

    insert into timings(testtype, thread_id, ts)
    values ('&testtype', &thread_id, systimestamp);

    commit;
  end loop;
end;
/

The outer table T_O was created like this:

create table t_o (id primary key, id_fk)
organization index
as
select
       rownum as id
     , case mod(rownum, 2) + 1
         when 1
         then mod(rownum, &tab_size / 20)
         else &tab_size / 10 - mod(rownum, &tab_size / 20) + 1
       end as id_fk
from   dual
connect by level <= &tab_size;

Each thread got its dedicated inner table of the Nested Loop join, created like this:

create table t_i' || i || ' (id not null, n, filler)
pctfree 99 pctused 1
tablespace &tbs
as
select
       cast(rownum as integer) as id
     , cast(rownum as number) as n
     , cast(rpad('x', 200) as varchar2(200)) as filler
from   dual
connect by level <= &tab_size / 10;

create index t_i' || i || '_idx on t_i' || i || ' (id, filler) pctfree 99 tablespace &tbs;

Due to the way the data was crafted, and thanks to the Nested Loop join batching performed by Oracle, this ensured that the sessions were performing "db file parallel read" I/O as much and as hard as possible, submitting multiple I/O requests in a single call and taking advantage of asynchronous I/O where possible.

When running with 8 threads on those 8 CPUs on the DBaaS service this resulted in approx. 30.000 IOPS for those 16K blocks (approx. 480MB / sec throughput). Note that I got these 30.000 IOPS for 16K blocks also on a different instance with just 4 CPUs (2 OCPUs) and 4 threads, so it looks like this is a common 16K block IOPS limit for a DBaaS instance, independently of the CPU count and without further tweaking (I didn't attempt to optimize this in any way but used the service as provided / configured by the Oracle Cloud). It looks like this is more a throughput limitation at 480MB / sec than an IOPS limitation, because when using a 4K blocksize I was able to get 60.000 IOPS (middle value), but the IOPS rate fluctuated much more (between 20.000 and 90.000), whereas the 30.000 IOPS using a 16K blocksize were pretty stable.

On the physical host I got approx. half of these IOPS (roughly 15.500) for those 16K blocks (approx. 250MB / sec throughput), which is very likely the throughput limit for this rather outdated hardware configuration, albeit one using a rather modern Samsung SSD drive. It also meant that running at 4 threads I got a comparable number of IOPS per thread as in the DBaaS setup. Using a 4K blocksize the physical host maxed out at (very stable) 52.000 IOPS.

The overall results look like this:

DBaaS:

Physical host:

Again the graph is supposed to show how many of the runs deviated how much from the overall median runtime. As in the previous, logical I/O only test, the DBaaS service shows a significantly different profile, with an even larger spread of deviation of up to almost 9 percent from the median runtime, whereas the physical host only shows significant deviation up to 2.5 percent.

Again, like in the previous test, the physical host shows more extreme outliers than the DBaaS service.

The same graph on a per day basis this time shows significant differences between the days for the DBaaS service:

The physical host shows a very common pattern, except for the first day:

Looking at the individual performance of each thread the DBaaS shows a significant variation in performance per day:

The physical host shows a pretty consistent performance pattern, but interestingly the different threads show different, but consistent runtimes:

Sunday, August 21, 2016

Here is another example I've recently come across at a client site where the default implementation of Adaptive Cursor Sharing fails to create a more suitable execution plan for different bind variable values (besides the facts that Adaptive Cursor Sharing only gets evaluated during a PARSE call (still valid in 12c) and supports a maximum of 14 bind variables).
Broken down to a bare minimum the query was sometimes executed using non-existing values for a particular bind variable, but other times these values were existing and very popular. There were two suitable candidate indexes and one of them appeared to the optimizer more attractive in case of the "non-existing" value case. Unfortunately this index was a very bad choice for the "existing and very popular" case.
The critical point of the case is that due to other, additional filters on the same table/index the final number of rows produced by the row source operation was (almost) the same for the "non-existing" and the "existing and very popular" case, but in the latter case a large proportion of the bad index had to be scanned due to the popularity of the bind value, which was the only predicate that could be used as ACCESS predicate and the additional ones could only be applied as FILTER on the index.
So although there was a suitable histogram available on the column in question, and the optimizer in principle was aware of the different popularities of values and marked the cursor as "bind sensitive", it looks like when a cursor is then monitored it simply gets assigned to different buckets, as exposed via V$SQL_CS_HISTOGRAM, based on the number of rows produced by the row source. Since the operation in question didn't differ much in the final number of rows produced between the two cases (either 0 rows or 1 row), the different executions were all assigned to the same bucket - although the amount of work required to produce this single row differed hugely, as reflected in the execution time and the physical / logical I/Os performed per execution.
Obviously this vastly different amount of work performed to produce a similar number of rows is not used as a criterion for marking a cursor as "bind aware" and evaluating different execution plans based on ranges of bind values.
There is a view V$SQL_CS_STATISTICS that includes information about CPU_TIME and BUFFER_GETS but in 11.2.0.4 it only reflects the initial execution when starting with the "non-existing" value case, but not the slow executions afterwards. In 12.1.0.2 the view is no longer populated at all, which suggests that this information is simply not used for deciding the "bind aware" status of a cursor.
Discussing the case with Mohamed Houri, who has spent significant time investigating the "Adaptive Cursor Sharing" feature, seemed to confirm this assumption.
Here is a simple test case that allows reproducing the issue:
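The original test case isn't reproduced here; the following is a minimal sketch of the same mechanism under assumptions - the table and index names are invented, the data skew is chosen so that FILTER1 = 1 is very popular while negative values don't exist, and the wider index ends up with the higher BLEVEL:

```sql
-- Sketch only: names and data distribution are assumptions, not the original client case.
-- FILTER1 = 1 is very popular (half the rows), negative values don't exist.
create table t
as
select rownum as id
     , case when mod(rownum, 2) = 0 then 1 else rownum end as filter1
     , mod(rownum, 100) as filter2
     , rpad('x', 100) as padding
from   dual
connect by level <= 1000000;

-- "Bad" index: FILTER2 can only be applied as a FILTER predicate on the index,
-- so the popular FILTER1 value forces scanning a large proportion of the index
create index t_idx_bad on t (filter1, id, filter2);

-- "Good" index: FILTER1 and FILTER2 both usable as ACCESS predicates,
-- but a higher BLEVEL due to the large PADDING column included
create index t_idx_good on t (filter1, filter2, padding);

-- Histogram on FILTER1 so the optimizer sees the skew and marks the cursor bind sensitive
exec dbms_stats.gather_table_stats(null, 't', method_opt => 'for all columns size 1 for columns filter1 size 254')

-- Execute first with :b1 = -1 (non-existing), then repeatedly with :b1 = 1 (popular)
select max(id) from t where filter1 = :b1 and filter2 = :b2;
```

Both cases produce at most a single row, so they fall into the same V$SQL_CS_HISTOGRAM bucket despite the vastly different amount of work.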

The critical point is that the good index has a higher BLEVEL than the bad index. So in case a non-existing value for FILTER1 gets used the optimizer will favor the index with the lower BLEVEL, which is for that case the key cost component:

So when optimizing for the two different cases the optimizer does the right thing and understands that for the "existing and very popular" case T_IDX2 is a bad choice. It's also obvious from the "Predicate Information" section that the index T_IDX2 can only use FILTER1 as an ACCESS predicate.
But when using bind variables the different cases are not recognized and the bad index is used for both cases when the optimization is based on the "non-existing value" case:

Executing the unhinted, default case leads to a very inefficient execution of the thousand executions using the popular value (and the actual table / index at the client site was much larger than this example and the query got executed very frequently) - compare that to the execution time for the other two variants, either using the BIND_AWARE hint or simply requesting the good index.
So when requesting to mark the cursor BIND_AWARE immediately via the corresponding hint (only possible if the cursor qualifies as bind sensitive), Oracle happily generates a second child cursor at the second PARSE call with a different, more efficient execution plan for the popular value. When a cursor is marked bind aware, a completely different algorithm gets used that is based on ranges of values as exposed via V$SQL_CS_SELECTIVITY, and since the different values fall into different ranges of values, a new plan gets generated that is different from the previous one and hence gets used from then on for those (ranges of) values.
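For illustration, with a hypothetical query like the one discussed (table and column names are assumptions), the hinted variant simply looks like this:

```sql
-- Marks the cursor bind aware immediately, skipping the
-- monitoring phase based on V$SQL_CS_HISTOGRAM buckets
select /*+ bind_aware */ max(id)
from   t
where  filter1 = :b1
and    filter2 = :b2;
```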

Monday, August 15, 2016

Having done my mini-series on Nested Loop join logical I/O optimizations a while ago I unfortunately never managed to publish anything regarding the Nested Loop join physical I/O optimizations, which are certainly much more relevant to real-life performance.
Therefore the main purpose of this blog post is to point you to Nikolay Savvinov's (whose blog I can recommend in general) great mini-series covering various aspects of these optimizations:
Part 1
Part 2
Part 3
Summary
One point that - at least to me - isn't entirely clear when reading Nikolay's series is which specific plan shape he refers to, in particular since in 12c even more plan shapes for a Nested Loop join are possible.
Hence here is an attempt to summarize the various two table Nested Loop join plan shapes as of 12c and what kind of physical I/O optimizations one can expect from them:
1. Nested Loop Join Batching, the default in most cases since 11g and also in 12c

This is the plan shape that Nikolay calls "batch" or "batching" in his posts. It can provide batched I/O (mainly "db file parallel read" issuing multiple I/O requests in a single call, potentially asynchronous I/O) on both the INDEX RANGE SCAN as well as the TABLE ACCESS BY ROWID operation as described in Nikolay's blog post.
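For reference, the batching plan shape for a two table join between alias A and B looks like this (a schematic sketch, operation names only, index name assumed):

```
NESTED LOOPS
  NESTED LOOPS
    <driving row source A>
    INDEX RANGE SCAN             B_IDX
  TABLE ACCESS BY INDEX ROWID    B
```

The tell-tale sign is the doubled NESTED LOOPS operation with the table access pulled up above the inner join.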
Please note that I deliberately write "can provide batched I/O", because, as Nikolay also points out in his posts, the runtime engine monitors the execution and can dynamically adjust the behaviour, which also means that it might decide to use conventional single block reads ("db file sequential read").

Please also note that in addition to the "db file parallel read" operation that submits multiple I/O requests in a single I/O submit call (see Frits Hoogland's blog post from some time ago about Linux internals of this I/O operation) the runtime engine might use "db file scattered read" multi-block I/Os under certain circumstances, in particular when performing "cache warmup prefetching".

Also note that as Nikolay points out when enabling SQL Trace or "rowsource statistics" the "Batched I/O" optimization for some reason gets disabled.

This plan shape, at least in 12c, also seems to lead to inconsistencies in Real-Time SQL Monitoring in several ways. First, the number of "Executions" for the INDEX RANGE SCAN and TABLE ACCESS operations of the inner row source is not necessarily consistent with the number of rows in the driving row source; second, almost all activity and I/O volume seems to be attributed to the INDEX RANGE SCAN plan operation and not the TABLE ACCESS operation, even if it's the TABLE ACCESS that causes the physical I/O.

2. Nested Loop Join Prefetch combined with TABLE ACCESS BY INDEX ROWID BATCHED

This is the Nested Loop prefetching plan shape introduced in Oracle 9i combined with the new TABLE ACCESS BY INDEX ROWID BATCHED introduced in 12c. It's the plan shape Nikolay refers to as "BATCHED ROWID" in his posts. In my (and Nikolay's) tests this provides the most aggressive batching of I/O (highest number of I/O requests submitted per call in "db file parallel read") for the TABLE ACCESS BY INDEX ROWID BATCHED, but didn't perform the same on the index ("db file sequential read", single block reads on the index segment) - which doesn't mean that different test and data setups might not provide that, too.
Note that you should get this plan shape only with explicit hinting or by disabling Nested Loop Join Batching via parameter.
In 12c, given a two table join between alias A and B, the following hints would be required to arrive at this plan shape:
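A plausible reconstruction of such a hint set (the exact hints shown in the original post are not reproduced here; the combination below is based on the documented 12c hints and is a sketch):

```sql
select /*+ leading(a)
           use_nl(b)
           index(b)
           no_nlj_batching(b)
           batch_table_access_by_rowid(b) */
       ...
from   a
     , b
where  ...
-- resulting plan shape (schematic):
--   TABLE ACCESS BY INDEX ROWID BATCHED  B
--     NESTED LOOPS
--       <driving row source A>
--       INDEX RANGE SCAN                 B_IDX
```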

The "batch_table_access_by_rowid" isn't strictly necessary since it's the default behaviour in 12c.
Other variants are possible, like "use_nl_with_index" instead of "use_nl" and "index" separately or "opt_param('_nlj_batching_enabled', 0)" instead of "no_nlj_batching" to disable the batching (but then batching is disabled for the whole execution plan, not just for a particular join).
3. Classic Nested Loop Join Prefetch introduced in 9i

This is what you get in pre-12c when preventing Nested Loop Join Batching; in 12c the BATCHED table access needs to be disabled in addition.
This plan shape provides the less aggressive table prefetching as described in Nikolay's posts (he refers to this plan shape as "Prefetch"); the maximum number of requests submitted per call in a "db file parallel read" operation seems to be 39.
As mentioned above, it didn't provide index prefetching in my tests.
The session statistics don't mention "Batched I/O", so although the "db file parallel read" wait event is the same the internal implementation obviously is a different code path.
In 12c, given a two table join between alias A and B, the following hints would be required to arrive at this plan shape:
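Again as a sketch (the original hint text isn't reproduced here; hint names are the documented 12c ones):

```sql
select /*+ leading(a)
           use_nl(b)
           index(b)
           no_nlj_batching(b)
           no_batch_table_access_by_rowid(b) */
       ...
from   a
     , b
where  ...
-- resulting plan shape (schematic):
--   TABLE ACCESS BY INDEX ROWID  B
--     NESTED LOOPS
--       <driving row source A>
--       INDEX RANGE SCAN         B_IDX
```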

4. Classic Nested Loop Join combined with TABLE ACCESS BY INDEX ROWID BATCHED

This is the classic Nested Loop join plan shape (pre-9i) combined with the new TABLE ACCESS BY INDEX ROWID BATCHED introduced in 12c.
In my tests it provides only "intra-loop" table prefetching, so if the INDEX RANGE SCAN of a single loop iteration points to different table blocks the TABLE ACCESS BY INDEX ROWID BATCHED can make use of "db file parallel read" to batch these I/O requests to the table blocks, and the session statistics show "Batched I/O" counters increasing.
However, it doesn't attempt to optimize "inter-loop" / multiple loop iterations, so if each loop iteration via the INDEX RANGE SCAN only points to a single table block, only single block reads ("db file sequential read") on the TABLE ACCESS BY INDEX ROWID BATCHED can be seen.
In my tests this plan shape didn't provide index prefetching / I/O optimizations and performed single block reads ("db file sequential read") on the INDEX RANGE SCAN operation.
For a two table join it needs explicit hinting in 12c to arrive at this plan shape, but it is the default plan shape for the "inner" joins in case of nested Nested Loop joins (multiple, consecutive Nested Loop joins in a row), see below for more information.
In 12c, given a two table join between alias A and B, the following hints would be required to arrive at this plan shape:
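A sketch of a hint set that produces this shape in my understanding - note that disabling the classic table prefetching requires undocumented underscore parameters, so the parameter names below are an assumption and should be verified (and used in production only with Oracle Support's blessing):

```sql
select /*+ leading(a)
           use_nl(b)
           index(b)
           batch_table_access_by_rowid(b)
           opt_param('_nlj_batching_enabled', 0)
           opt_param('_table_lookup_prefetch_size', 0)
           opt_param('_table_lookup_prefetch_thresh', 0) */
       ...
from   a
     , b
where  ...
-- resulting plan shape (schematic):
--   NESTED LOOPS
--     <driving row source A>
--     TABLE ACCESS BY INDEX ROWID BATCHED  B
--       INDEX RANGE SCAN                   B_IDX
```

As mentioned above for the opt_param variants, these parameters disable the optimization for the whole execution plan, not just this particular join.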

5. Classic Nested Loop Join

This is the classic Nested Loop join plan shape (pre-9i).
Interestingly, in my tests, at least in 12c, it also provides "intra-loop" table prefetching: if the INDEX RANGE SCAN of a single loop iteration points to different table blocks, the TABLE ACCESS BY INDEX ROWID can make use of "db file parallel read" to submit multiple of these I/O requests to the table blocks in a single call. However, the session statistics don't show "Batched I/O" counters increasing, and the maximum number of requests seems to be 39, so it is less aggressive and looks very similar to the internal implementation used for the classic "table prefetching" plan shape above.
There is no sign of "inter-loop" optimizations and the INDEX RANGE SCAN seems to make use of single block reads only ("db file sequential read").
Pre-12c this is the default plan shape used for the "inner" joins in case of nested Nested Loop joins (multiple, consecutive Nested Loop joins in a row), see below for more information.
In 12c, given a two table join between alias A and B, the following hints would be required to arrive at this plan shape:
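Analogous to the previous shape, a sketch only - again the underscore parameter names are an assumption to be verified:

```sql
select /*+ leading(a)
           use_nl(b)
           index(b)
           no_batch_table_access_by_rowid(b)
           opt_param('_nlj_batching_enabled', 0)
           opt_param('_table_lookup_prefetch_size', 0)
           opt_param('_table_lookup_prefetch_thresh', 0) */
       ...
from   a
     , b
where  ...
-- resulting plan shape (schematic):
--   NESTED LOOPS
--     <driving row source A>
--     TABLE ACCESS BY INDEX ROWID  B
--       INDEX RANGE SCAN           B_IDX
```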

Multiple, consecutive Nested Loop Joins

Another point that I think is important to mention when describing these "inter-loop" prefetching and batching Nested Loop join optimization techniques is that they only apply to the "outer-most" Nested Loop join in case of multiple, consecutive Nested Loop joins, which to me makes the "game changing" character that Nikolay mentions in his posts less strong.
For example, this is the 12c default shape of a four table join using three Nested Loop joins:
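Schematically it looks like this (the T_I1 / T_I2 / T_I3 inner tables are mentioned below; the driving table and index names are assumptions):

```
NESTED LOOPS
  NESTED LOOPS
    NESTED LOOPS
      NESTED LOOPS
        TABLE ACCESS FULL                    T1
        TABLE ACCESS BY INDEX ROWID BATCHED  T_I1
          INDEX RANGE SCAN                   T_I1_IDX
      TABLE ACCESS BY INDEX ROWID BATCHED    T_I2
        INDEX RANGE SCAN                     T_I2_IDX
    INDEX RANGE SCAN                         T_I3_IDX
  TABLE ACCESS BY INDEX ROWID                T_I3
```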

As can be seen, only the outer-most Nested Loop join gets the "batching" plan shape and can benefit from the optimizations described above. The inner Nested Loop joins show the classic plan shape, in case of 12c with the ROWID BATCHED option, so they can only benefit from the "intra-loop" optimizations described above if a single loop iteration's INDEX RANGE SCAN points to several table blocks.
Note that even when using explicit, additional NLJ_BATCHING hints for the inner tables joined I wasn't able to enforce any other plan shape.
If there are multiple blocks of (consecutive) Nested Loop joins (for example a HASH JOIN in between), then in each block the outer-most Nested Loop join gets the optimization, so multiple of those are possible per execution plan:

As can be seen, now that T_I2 is joined via a HASH JOIN, both Nested Loop joins to T_I1 and T_I3 get the "batched" optimization plan shape.
I don't know why this limitation is there that only the outer-most in a block of Nested Loop joins gets the optimized plan shape, but it certainly can reduce the possible benefit.
Since I don't think there is a costing difference on CBO level for the different Nested Loop join plan shapes (so the decision which shape to use isn't cost-driven) it might be possible to see a significant performance difference depending on which join gets the optimization - there might be room for improvement by manually influencing the join order in case of multiple, consecutive Nested Loop joins.

Sunday, August 7, 2016

This is the second part of this installment, comparing the performance consistency of the DBaaS cloud offering with a dedicated physical host. This time instead of burning CPU using a trivial PL/SQL loop (see part 1) the test harness executes a SQL statement that performs logical I/O only, so no physical I/O involved.

In order to achieve that a variation of Jonathan Lewis' good old "kill_cpu" script got executed. In principle each thread performed the following:
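The exact harness isn't shown here; the following is a sketch based on Jonathan Lewis' well-known "kill_cpu" query, with the per-thread timing insert assumed to be analogous to the other parts of this series:

```sql
declare
  n number;
begin
  loop
    -- Jonathan Lewis' "kill_cpu" query: the CONNECT BY explodes
    -- combinatorially, so the runtime doubles with each row in the table
    select count(*)
    into   n
    from   kill_cpu&thread_id
    connect by n > prior n
    start with n = 1;

    insert into timings(testtype, thread_id, ts)
    values ('&testtype', &thread_id, systimestamp);

    commit;
  end loop;
end;
/
```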

Each thread got its own, exclusive "kill_cpu<n>" table, so this setup attempted to avoid competition for buffer cache latches. The tables (in fact IOTs) were created like the following:

create table kill_cpu<n> (n primary key)
organization index
as
select
       rownum n
from   all_objects
where  rownum <= 25
;

Again there were as many threads started as CPUs available - which meant 8 threads for the DBaaS environment and 4 threads for the physical host - and again this was left running for several days.

The overall results look like this:

DBaaS:

Physical host:

Again the graph is supposed to show how many of the runs deviated how much from the overall median runtime. In contrast to the previous, simple PL/SQL test this time the DBaaS service shows a significantly different profile with a larger spread of deviation up to 3 percent from the median runtime, whereas the physical host only shows significant deviation up to 1.5 percent.

It's interesting to note however, that the physical host this time shows more extreme outliers than the DBaaS service, which was the other way around in the previous PL/SQL test.

The same graph on a per day basis doesn't show too significant differences between the days for either environment (here DBaaS):

Physical host:

Looking at the individual performance of each thread, the DBaaS again shows a similar behaviour to last time - the different threads show slightly different performance, and they also get slightly slower towards the end of the measurement period:

And again the physical host shows a more consistent performance between threads:

Wednesday, August 3, 2016

This is just a short 12c update on my post about gathering statistics on tables with many columns from some time ago.
I'm currently investigating the "Incremental Statistics" feature in 12.1.0.2 for a client, which probably will be worth one or more other posts, but since we're still in the process of evaluating and installing various patches it's too early to write about that.
As part of the investigation I've noticed a significant change in behaviour in 12.1.0.2 compared to previous versions when it comes to gathering statistics on tables with many columns, hence this post here.
The key message of the original post was that DBMS_STATS potentially needs several passes when gathering statistics on tables with many columns, which meant a significant increase in overall work and resource consumption, aggravated by the fact that tables with that many columns consist of multiple row pieces.
Along with other significant improvements in 12c (among others new types of histograms and the capability to derive some of these histogram types from a single pass, with no need to run separate passes for each histogram), multiple passes are obviously no longer required for such tables - Oracle can now cope with up to and including 1000 columns in a single pass.
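As a sketch of how such a wide table can be set up for this kind of test (the table name, column names and generation loop here are illustrative assumptions, not the original script):

```sql
-- Generate and create a 1000 column table, then gather stats on it
declare
  s varchar2(32767) := 'create table many_cols (';
begin
  for i in 1 .. 1000 loop
    s := s || 'col' || i || ' number, ';
  end loop;
  s := rtrim(s, ', ') || ')';
  execute immediate s;
end;
/

exec dbms_stats.gather_table_stats(null, 'MANY_COLS', method_opt => 'for all columns size auto')
```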
Repeating the test case from the original post for a table with 1.000 columns / 10.000 blocks and using a default METHOD_OPT setting of "FOR ALL COLUMNS SIZE AUTO" results in the following single query executed by the DBMS_STATS call:

As can be seen, the "APPROXIMATE NDV AGGREGATE" operation introduced in 11g for the new statistics gathering mode has been renamed in 12c to "OPTIMIZER STATISTICS GATHERING".
Apart from that, this is good news, as it minimizes the work required to gather statistics for such tables - it takes the minimum of logical / physical I/Os to process the data. And as already mentioned, 12c is also capable of more, like generating synopses for incremental statistics and histograms from such a single pass.