
In this post I pointed out that I like to have a copy of the documentation in PDF format on my notebook. In the same post, and its comments, I also described how I generate the scripts I use to download the files. Recently I updated the scripts and, as a result, I thought I'd share them with you. So, below you'll find the CMD and SH scripts for the documentation related to 10.2, 11.1 and 11.2.

Continuing from the previous installment of this series, in this post I'll cover some of the inevitable classics regarding Parallel Execution Control. So forgive me if you're bored by the repetition of known facts - however, I still see these things used incorrectly far too often, so I decided: this is worth mentioning and remembering!

- Up to and including version 10.2, PARALLEL (without any parameters) and NOLOGGING are valid keywords only in DDL commands

- This applies to all versions: NOLOGGING cannot be used as a hint. It can only be specified as part of DDL, for example ALTER INDEX ... REBUILD PARALLEL NOLOGGING, as illustrated below.
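To make the distinction concrete, here is a minimal sketch (the object names are made up):

-- valid: NOLOGGING and PARALLEL as part of the DDL command
alter index t_idx rebuild parallel 4 nologging;

-- not valid as a hint: the "hint" below is silently ignored
-- and the insert is logged as usual
insert /*+ nologging */ into t select * from t_source;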

When I read the recent post by the optimizer group about the new concurrent gather stats feature added in 11.2.0.2, it reminded me that I had already intended to publish something based on the same idea some time ago.

The Problem

It was motivated by a client's regular need, during a transition phase from non-Exadata to Exadata, to create literally thousands of indexes with potentially a multitude of (sub-)partitions as fast as possible, as part of a full datapump import job of a multi-terabyte database running 11.1.0.7 and 11.2.0.1 (Exadata V2).

There are actually two issues regarding the index creation part of a large database import:

1. The datapump import performs the index creation with only a single worker thread, even when using the PARALLEL worker thread import feature. Although an individual index can be created in parallel, if you have thousands of smaller index objects this single worker thread potentially does not make efficient use of the available hardware resources on high-end configurations, including, and in particular, Exadata.

2. There is a nasty bug 8604502, introduced with 11.1.0.7, that also affects 11.2.0.1 (fixed in 11.2.0.2; a generic one-off patch is available on My Oracle Support for 11.1.0.7 and 11.2.0.1): IMPDP creates all indexes serially, even those supposed to be created in parallel, and only ALTERs them to the defined PARALLEL degree after creation. Note that the fix only addresses the problem at actual execution time; even with the fix installed (and in 11.2.0.2) the SQLFILE option of IMPDP still generates CREATE INDEX DDLs that always have the parallel degree set to PARALLEL 1 (see MOS document 1289032.1 and bug 10408313 - INDEXES ARE CREATED WITH PARALLEL DEGREE 1 DURING IMPORT - which has been closed as not being a bug). This "not-being-a-bug" also affects all other versions that support the datapump utility: the SQLFILE option always generates CREATE INDEX scripts with the parallel degree set to 1, no matter what the actual degree of the index is supposed to be. It's only the ALTER INDEX command following the CREATE INDEX command that sets the parallel degree correctly.

For this client, these two issues in combination meant that a full database import job took ages to complete the index creation step, after loading the vast amount of table data quite quickly in parallel.

In the case of partitioned indexes there is another complication, independent of the issues mentioned: Oracle uses only one parallel slave per partition for the creation. With large and/or few partitions this again doesn't make efficient use of the available resources.

Oracle therefore provides several means to speed up index creation and rebuild tasks, in particular the documented DBMS_PCLXUTIL package, which has been around since the Oracle 8 days, to overcome the above-mentioned limitation of partitioned index creation by spawning multiple jobs, each rebuilding an index partition in parallel.
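For reference, a typical invocation of that documented package looks like the following sketch (table and index names are made up; the local index is expected to have been created UNUSABLE beforehand):

-- rebuild the partitions of local index T_IDX on table T using
-- 4 concurrent jobs, each running with parallel degree 2
begin
  dbms_pclxutil.build_part_index(
    jobs_per_batch => 4
  , procs_per_job  => 2
  , tab_name       => 'T'
  , idx_name       => 'T_IDX'
  );
end;
/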

Another, undocumented feature is the DBMS_INDEX_UTL package, which is obviously used internally as part of several maintenance operations, for example DDLs that include the "UPDATE INDEXES" clause. According to its spec it allows rebuilding multiple indexes concurrently by spawning multiple jobs. However, since it is undocumented it might not be safe to use in production-like configurations; furthermore it might be changed in future releases without notice and is therefore potentially unreliable.

A Solution

Since the client wanted a quick solution that ideally addressed all of the above issues I came up with a simple implementation that uses Advanced Queueing and background jobs to create as many indexes as desired concurrently.

The solution is targeted towards the client's scenario, so the following is assumed:

- There is a SQL file that contains the CREATE INDEX statements. This can easily be generated via IMPDP based on the dump files using the SQLFILE option.

- To address the CREATE INDEX (not-being-a-)bug (as mentioned above, the bugfix for bug 8604502 still generates incorrect CREATE INDEX DDLs with the SQLFILE option of IMPDP), I've created a combination of "sed" and "awk" Unix scripts that take the IMPDP SQLFILE, potentially including all DDL commands, as input and create an output file that consists solely of the CREATE INDEX commands with correct PARALLEL clauses, based on the ALTER INDEX command following each CREATE INDEX in the script

- To address the lengthy index creation process, I've created a small PL/SQL package that sets up the required AQ infrastructure, takes the CREATE INDEX DDL file as input, populates a queue with the index creation commands and spawns as many worker threads as specified, which take care of the actual index creation (which in turn might be a parallel index creation) - a simplified sketch of the worker logic follows below
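For illustration, here is a heavily simplified sketch of what such a worker job body could look like - dequeue a DDL command and execute it until the queue is exhausted. The queue and payload type names are made up; the actual package is more elaborate (logging, two worker sets, DBMS_SQL for pre-11g CLOB handling etc.):

-- assumed payload type: create type ddl_payload_t as object (ddl_text clob);
declare
  l_dequeue_options    dbms_aq.dequeue_options_t;
  l_message_properties dbms_aq.message_properties_t;
  l_msgid              raw(16);
  l_payload            ddl_payload_t;
begin
  -- stop as soon as no further message is available
  l_dequeue_options.wait := dbms_aq.no_wait;
  loop
    begin
      dbms_aq.dequeue(
        queue_name         => 'create_index_queue'
      , dequeue_options    => l_dequeue_options
      , message_properties => l_message_properties
      , payload            => l_payload
      , msgid              => l_msgid
      );
    exception
      when others then exit;  -- queue exhausted (simplified error handling)
    end;
    -- from 11g on EXECUTE IMMEDIATE supports CLOBs;
    -- earlier versions need DBMS_SQL to execute DDL texts larger than 32KB
    execute immediate l_payload.ddl_text;
  end loop;
  commit;  -- make the dequeues permanent
end;
/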

As a side note, it is interesting that Oracle actually allows several indexes to be built concurrently on the same segment (which makes total sense, but probably doesn't happen too often in practice).

Note that in principle this code could be used as a general template to execute arbitrary DDLs concurrently (of course with corresponding modifications).

- correct_parallel_clause: This directory contains the Unix scripts mentioned above that allow you to process a SQLFILE generated by IMPDP and output a DDL file consisting solely of the CREATE INDEX commands contained in the SQLFILE. The generated CREATE INDEX statements also use a correct PARALLEL clause; the degree is taken from the ALTER INDEX command following the CREATE INDEX in the SQLFILE. For further details refer to the README.txt in that directory. Note that at present the script does not handle Domain Indexes, only conventional and bitmap ones.

- source: Contains the package source for the concurrent index creation, furthermore a package that is required by the provided automated unit testing (see below for more details) and a script that prompts for the required details to initiate a concurrent index creation. The README.txt in that directory provides a quick-start guide on how to use the concurrent index creation.

- test: Contains two flavours of test harnesses for automated unit testing of the package: one based on the unit testing feature implemented in SQLDeveloper 2.1.1, and another based on "dbunit", an open-source unit testing framework based on JUnit. The README.txt files in the respective subdirectories explain how to use these unit tests.

How to use it

The usage is split into two parts: The first part deals with preparing a suitable text file that consists of the CREATE INDEX commands, the second part is about processing this text file with as many worker threads as desired.

Preparing the file is straightforward: You can use the "transform_all_sql.sh" script to generate the required CREATE INDEX script from a DDL script created via IMPDP SQLFILE.

The script has been tested primarily with bash, sed and awk under Cygwin 1.7.1 and OEL5, different Unix flavors might have different versions of the shell, awk or sed and therefore might behave differently.

Simply put all four Unix scripts from the "correct_parallel_clause" directory into the same directory, mark them as executable and run "transform_all_sql.sh" like this:

./transform_all_sql.sh < input_file > output_file

where "input_file" is the file generated via IMPDP SQLFILE option and "output_file" will be the result.

In order to perform the parallel index creation, you need an account with suitable privileges. Since it is assumed that the indexes will have to be created in different schemas, this account needs extended privileges. The package is implemented using invoker's rights, so granting these privileges via roles is sufficient. A quick and dirty solution could be to create a temporary account and simply grant it the DBA role (this is what I used to do to test it). In addition:

- The account requires EXECUTE privileges on the DBMS_AQ and DBMS_AQADM packages for the AQ infrastructure.

- It needs a simple logging table where errors and progress are written to, as well as a type that is used as the payload of the queue.

- It needs to be able to create jobs - in this version of the package this is done via DBMS_SCHEDULER.

- At execution time the package creates a queue plus a queue table that needs to be stored in a tablespace, so make sure that the account (or at least the database) executing the index creation has an appropriate default tablespace defined.
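A quick-and-dirty setup along those lines could look like this (run as a DBA; the account name and password are placeholders):

create user idx_creator identified by "change_me"
default tablespace users quota unlimited on users;

grant dba to idx_creator;

-- the explicit grants mentioned above for the AQ infrastructure
grant execute on dbms_aq to idx_creator;
grant execute on dbms_aqadm to idx_creator;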

You can simply run the "pk_create_index_concurrent.sql" script (located in the "source" directory) in such an account; it will deinstall/install all required objects.

The execution of the index creation is then straightforward (taken from the package specification):

/**
 * The main entry point to create indexes via parallel threads / AQ
 *
 * @param p_directory_name        The directory where the file containing the CREATE INDEX DDLs resides
 * @param p_file_name             The file name in the directory above
 * @param p_parallel_degree_set_1 The number of threads to start for worker thread set 1, which usually represents the SERIAL_INDEX threads - G_AUTO_PARALLEL_DEGREE means use the CPU_COUNT and CLUSTER_DATABASE_INSTANCES parameters to determine the number of threads automatically
 * @param p_parallel_degree_set_2 The number of threads to start for worker thread set 2, which usually represents the PARALLEL_INDEX threads - G_AUTO_PARALLEL_DEGREE means use the CPU_COUNT and CLUSTER_DATABASE_INSTANCES parameters to determine the number of threads automatically; however 1 is the default here, since we assume that these indexes use parallel DDL
 * @param p_job_submit_delay      The number of seconds each job will be delayed to allow Oracle proper load balancing in a cluster, default 30 seconds (commented out at present due to odd locking issues on the queue table in RAC environments)
 * @param p_sleep_seconds         The number of seconds to wait for the threads to start up before attempting to tear down the AQ infrastructure again
 * @param p_optional_init         Optionally a SQL statement, usually used to initialize the session, for example forcing a particular parallel degree
 * @param p_worker_set_id_1       The character identifier used to identify the indexes to be processed by the first worker thread set. Default value is "SERIAL_INDEX"
 * @param p_worker_set_id_2       The character identifier used to identify the indexes to be processed by the second worker thread set. Default value is "PARALLEL_INDEX"
 **/
procedure create_index_concurrent(
  p_directory_name        in varchar2
, p_file_name             in varchar2
, p_parallel_degree_set_1 in integer  default G_AUTO_PARALLEL_DEGREE
, p_parallel_degree_set_2 in integer  default 1
, p_job_submit_delay      in integer  default 30
, p_sleep_seconds         in integer  default 10
, p_optional_init         in varchar2 default null
, p_worker_set_id_1       in varchar2 default G_WORKER_SET_ID_1
, p_worker_set_id_2       in varchar2 default G_WORKER_SET_ID_2
);

Note that the "p_job_submit_delay" parameter is currently not used - there were some odd locking issues on the AQ table in case of a RAC environment when using that option so I have commented out its usage at present - I haven't had a chance yet to investigate further what the problem actually was.

So the only required input to the CREATE_INDEX_CONCURRENT procedure is the name of the directory object that points to the directory where the file to process resides and the name of the file itself.

You probably want to specify the number of worker threads for the two sets: The idea here is to distinguish between the creation of serial and parallel indexes. The first parameter specifies the number of worker threads used for serial indexes, the second one the number of concurrent threads for parallel indexes.

The default is CPU_COUNT * INSTANCES threads for serial indexes and a single thread for parallel indexes.

If you don't want or need this separation of serial and parallel indexes, simply use the same "worker_set_id" for both parameters "p_worker_set_id_1" and "p_worker_set_id_2", specify the desired total parallel degree in one of the degree parameters and set the other one to 0 (the 0 is required, otherwise one of the DBMS_SCHEDULER.CREATE_JOB calls will fail with a "duplicate job name / job name already exists" error).

The "p_sleep_seconds" parameter is only used to allow the jobs spawned to put a lock on the queue table - the teardown is then going to wait until all locks have been removed and therefore all queue processing has ended. The default of 10 seconds was sufficient in all cases I've encountered.

Since the package requires as prerequisite a directory where the file to process resides, I've prepared the script "create_index_concurrent.sql" that guides through the required inputs and takes care of that step as well.

It takes the full O/S path to the file and the file name as input, creates a directory object CREATE_INDEX_CONCURRENT_DIR pointing to that path, then prompts for the two degrees and the names of the two worker thread sets before calling the CREATE_INDEX_CONCURRENT stored procedure.
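If you prefer to call the procedure directly, an invocation could look like the following sketch (the file name and degrees are examples only; the package name is taken from the install script mentioned above):

begin
  pk_create_index_concurrent.create_index_concurrent(
    p_directory_name        => 'CREATE_INDEX_CONCURRENT_DIR'
  , p_file_name             => 'create_index.sql'
  , p_parallel_degree_set_1 => 8  -- worker threads for the serial indexes
  , p_parallel_degree_set_2 => 2  -- worker threads for the parallel indexes
  );
end;
/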

Caveats

Please note that you should double-check that you do not pass a non-transformed SQLFILE generated via IMPDP to the procedure - the results may be dire, since the generated SQLFILE always contains much more than the bare CREATE INDEX commands, no matter what options you use for IMPDP. Always use the provided Unix scripts to post-process the SQLFILE before initiating the index creation.

Furthermore you need to be aware of a current limitation of the package: it does not attempt to tokenize the file contents. It simply uses the semicolon as a delimiter to separate the DDL commands. This should be sufficient for most cases, but if you have a function-based index whose definition includes a string expression containing a semicolon, this will not work as expected. Likewise, if you plan to use this package for other DDL execution activities, like CTAS statements, you might again hit this limitation if the DDL text contains semicolons.

Note that creating indexes using this tool potentially results in different index statistics than creating them via IMPDP, since IMPDP by default also imports the index statistics, whereas the indexes created using this tool will end up with the index statistics automatically generated during index creation (from 10g onwards; the code requires at least 10.2). If you want the index statistics imported, you can run IMPDP after the index creation using the INCLUDE=INDEX_STATISTICS option. This should complete fairly quickly and will import only the index statistics.
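Such a follow-up run could look like this (connection details, directory and dump file name are placeholders):

impdp system directory=DATA_PUMP_DIR dumpfile=full_db.dmp full=y include=INDEX_STATISTICS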

If you have SERVEROUTPUT enabled by default, then you will very likely see some errors printed by the initial attempt to tear down the AQ infrastructure. These errors are expected if the previous run completed successfully, or in case of an initial run, and can be ignored (they will be caught/ignored by the default implementation).

Note also that all provided scripts except the Unix shell scripts use DOS file format - under OEL this isn't a problem, but it might be on your platform.

Finally the inevitable disclaimer: Although this has been tested thoroughly it comes with absolutely no warranty. Use it at your own risk and test it in your environment before attempting any runs against anything important.

Monitoring the execution

The code logs errors and progress into the table CREATE_INDEX_CONCURRENT_LOG. At present the code logs every attempt to execute DDL into the table as well as any errors that are raised during that DDL execution.

So the table can be used both for monitoring the progress and for checking for errors. The code currently continues execution when errors are encountered, using the dreaded WHEN OTHERS THEN NULL construct, but the code is already prepared for more granular error handling if required - see the defined exceptions and the commented-out exception handler.

You can view the queue contents in the corresponding queue view created by the AQ setup (AQ$CREATE_INDEX_QUEUE) in order to see the data to process. Note that, because the worker threads do not commit the queue transaction, you won't be able to see the progress in the queue table until all worker threads have committed. If you don't like that, you can remove the wait and the "teardown_aq" call at the end of the main procedure "create_index_concurrent" and uncomment the dequeue option "visibility=immediate" in the "create_index_thread" procedure. You would then need to call "teardown_aq" in a separate step as desired. With this modification you can monitor the progress by monitoring the queue, but the provided automated unit testing won't work with that variant, since it relies on the main call waiting for all worker threads to complete before validating the results.

However you can also see the progress in the log table, using a sample query similar to the following (a sketch; the column names are assumptions, so check the actual table definition):
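-- column names are assumptions - adapt to the actual log table definition
select log_timestamp
     , ddl_text
     , error_text
  from create_index_concurrent_log
 order by log_timestamp desc;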

If you want to perform more sophisticated queries on that table, you might need to use casts similar to the following, because the text columns are defined as CLOBs in order to hold the complete DDLs and error messages in case of errors. The casts allow you to perform, for example, GROUP BYs etc.
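A sketch of such a cast-based query (again with assumed column names):

-- DBMS_LOB.SUBSTR returns VARCHAR2, so the result can be used in a GROUP BY
select cast(dbms_lob.substr(ddl_text, 100, 1) as varchar2(100)) as ddl_start
     , count(*) as cnt
  from create_index_concurrent_log
 group by cast(dbms_lob.substr(ddl_text, 100, 1) as varchar2(100));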

Here we come to a completely different issue that is off-topic for this post; however, in my experience so far it seems to be a very important one, and I hope to find the time to cover it in separate posts in the future.

Generally speaking, I've seen too many shops that don't follow best practice when it comes to database deployment and development, so here, in a nutshell, is what you ideally should know and do about it:

- Treat your database like source code, which means put everything related to the database under version control. This includes not only the obvious database source code but also DDL and DML scripts for schema evolution

- Use unit testing to test database code. Automate this unit testing

- Automate the deployment of your database related changes

- Install a continuous integration environment that runs the automated deployment and unit tests regularly, for example every night

- Automate deployment everywhere - starting from the development databases up to the production environment

- Follow your guidelines strictly - for example any hotfix-like adhoc change should still go through the established processes - code changes, testing, deployment etc.

I've helped several clients in the past to set up corresponding tools and processes for implementing the above - if you are interested, get in touch with me.

So, as a bonus, if you haven't spent much time yet on the above-mentioned topics, and in order to get you started at least with automated unit testing, I've included two different examples for this small source provided: one using the built-in unit test feature of SQLDeveloper and the other one using "dbunit". You can find both in the corresponding subdirectories of the "test" folder in the archive.

The unit testing is based on the "pk_create_index_concur_test.sql" package that is used to set up and tear down the environment for running the unit tests. At present it assumes the existence of a directory "C:\app\oracle\admin\orcl112\dpdump" at O/S level. It will create a directory object for that path and attempt to create/write a file used for the unit test runs. You can pass any valid O/S directory path to the "pk_create_index_concur_test.setup" procedure if you want/need to use a different one.

All provided automated tests assume that both scripts, "pk_create_index_concurrent.sql" and "pk_create_index_concur_test.sql" have been run in the schema that should be used for test runs.

You can use the SQLDeveloper Unit Test feature to run the provided unit test. You can either use the GUI to import and run the test, or you can use a command line version that actually uses ANT to run the UTUTIL command line tool that comes with SQLDeveloper. You can read and follow the instructions in the "README.txt" in the test/SQLDeveloper directory on how to do so. Either way (GUI or UTUTIL command line), you'll need to set up a unit test repository initially if you want to use SQLDeveloper's unit testing feature. See the SQLDeveloper user's guide or the online help on how to do that (hint: the menu item "Extras->Unit Testing" gets you started).

If you don't like the SQLDeveloper unit test approach, or you are simply too lazy to install the tool, the unit test repository etc., you can alternatively try the automated unit testing using "dbunit". Follow the instructions in the "README.txt" in the test/dbunit directory on how to run the unit tests using "dbunit".

This version of the package has successfully been tested using these unit tests on 10.2.0.4, 10.2.0.5, 11.1.0.7, 11.2.0.1 and 11.2.0.2 (after all it's dead easy with automated unit testing :-).

Summary

The provided tool set should represent a solid foundation for the given task of concurrent index creation. In particular it has been designed with the following in mind:

- Efficient use of privileges granted via roles: The package uses invoker's rights and most operations use dynamic SQL to avoid compilation issues; therefore granting the required privileges to the account via roles should be sufficient

- The Unix scripts should be able to deal with table-, schema- and database-level datapump formats from Oracle 10g and 11g (all these variants use slightly different texts to identify the relevant sections of the generated SQLFILE by IMPDP)

- Optional use of two separate worker thread sets: This allows the concurrent creation of a multitude of indexes, be it serial or parallel, with a clear distinction between the handling of serial (possibly many worker threads) and parallel indexes (usually only a few worker threads)

- Support for arbitrarily sized SQL: The DDL commands for (sub-)partitioned indexes can become quite large due to the way the Oracle meta data API generates the SQL. Therefore these generated SQLs can easily exceed the usual 32KB limit for PL/SQL character strings. The implementation uses CLOBs for the processed SQLs (and DBMS_SQL in versions lower than 11 to handle these piecewise) to support these potentially very large SQLs

- RAC/Grid/Cluster support via DBMS_SCHEDULER: The usage of DBMS_SCHEDULER allows fine-grained control of the resource consumption via the optional use of job classes (not implemented yet, but easily added - it is a simple additional parameter to the CREATE_JOB procedure), which allow you to specify a resource consumer group and a specific service name for the spawned worker threads. A sketch of such a job submission follows below
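For illustration, such a job submission boils down to something like the following sketch (the job name and job class are hypothetical; the real package constructs the job action from the worker procedure):

begin
  dbms_scheduler.create_job(
    job_name   => 'CREATE_INDEX_THREAD_1'
  , job_type   => 'PLSQL_BLOCK'
  , job_action => 'begin null; end;'  -- the real job calls the worker procedure
--, job_class  => 'INDEX_CREATION_CLASS'  -- optional: maps to a consumer group / service
  , enabled    => true
  );
end;
/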

I've recently come across an interesting variation of a "famous" ASSM bug. Probably some of you will remember the ASSM bug that was caused by row migrations in larger block sizes (16K/32K).

If you don't remember or don't know what I'm talking about, you can have a look here where Greg Rahn provides a summary of the issue or check My Oracle Support bug description 6918210.

Greg also links to a script, originally created by Jonathan Lewis, that allows you to reproduce the issue at will.

So far the issue had only been reproduced on block sizes greater than 8K - the variation I've encountered, however, allows the issue to be reproduced on 8K and 4K, possibly also on 2K, but I haven't tested 2K yet.

Below is my version of the script. If you compare it to Jonathan's version you'll notice that it is very similar, not to say almost the same, except for additional optional instrumentation that you can simply un-comment if you've installed my Advanced Oracle Troubleshooting script package, which is based on Tanel Poder's awesome "tpt_public" tool set.

The SESSPACK tool can be found in Tanel's tool set (tools/sesspack_0.05_release) and the SNAP_KCBSW package was developed by Jonathan a long time ago - it can be found here. Note that it only works for versions below 11g - this instrumentation has been "optimized away" in 11g, unfortunately.

In order to reduce the runtime, I've simply limited the number of rows in the table to 50,000 rows.

Here is the task: You are allowed to modify the script at exactly one single location - the modification can take a maximum of four keywords, which means you can add/modify/remove at most four keywords.

With the correct modification you will be able to reproduce the bug even in 8K and lower block sizes.

So, what to modify and why?

If you want to actually run the script yourself, you need to use a database version prior to 11.2, because the bug is obviously fixed there - this includes 10.2.0.5, which interestingly doesn't have the bug fixed.

I've used an 8K/4K ASSM tablespace with UNIFORM 1M extents for my tests, but I don't think the extent management matters in this case. My test database uses 8K as the default block size.

You'll notice the bug when checking the runtime and the trace file. If you encounter the bug, the runtime for the update will be several seconds (more than 10 seconds seen on my test system in some cases) and the number of current mode gets for the update will be in the millions.

If you've enabled the additional instrumentation, it will tell you that the reasons for most of the buffer gets were "ktspfsrch" and "ktspscan_bmb". You can also take stack traces (e.g. using Tanel's OStackProf tool) if you use more than 50,000 rows to get a longer runtime of the update statement, which will show you similar function names on the stack.

If you don't hit the bug, the update usually takes at most 1-2 seconds, and the current mode gets should be far fewer than one million when sticking to the 50,000 rows.

P.S.: There is more than one correct answer - and it is possible to hit the bug for 8K block sizes with a single keyword modification (full points!).

Update 24th Jan: P.P.S: No takers yet... So here's an additional hint: The issue is caused by row migration...

Update 26th Jan: OK, time to post a quick answer here. As pointed out by Narendra below, simply setting PCTFREE to 0 is already sufficient to reproduce the issue with smaller block sizes. However, there is much more to tell about this, so it deserves a separate post that I'll publish in the next couple of days.

For the time being here are the correct answers that I'm aware of at present:

- PCTFREE 0
- COMPRESS
- COMPRESS FOR ALL OPERATIONS
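For illustration, the decisive change boils down to something like this in the table creation DDL of the test script (a sketch with made-up data, not the actual script):

-- PCTFREE 0 leaves no free space in the blocks, so subsequent updates
-- that increase the row length immediately lead to row migration
create table t1
pctfree 0
as
select rownum as id, rpad('x', 50) as filler
  from dual
connect by level <= 50000;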

But as I already said, there is much more, in particular when partitioning comes into the picture - and I hope to cover all these details in the upcoming post.

Oracle introduced hash aggregation as a new feature in Oracle 10g. It can be used for both GROUP BY and UNIQUE operations (HASH GROUP BY and HASH UNIQUE respectively) and is by default the preferred aggregation method if there is no particular reason that lets the cost-based optimizer prefer the sort-based aggregation (SORT GROUP BY and SORT UNIQUE), for example when the GROUP BY is followed by an ORDER BY on the same expression (using the SORT GROUP BY in such cases is not always beneficial, by the way - see Guy Harrison's blog for an example).

Ever since its introduction, I've heard complaints from time to time about performance degradations of aggregation operations that are based on the new hash aggregation algorithm, compared to the previously used sort-based aggregations.

Now there may be many potential reasons for such performance degradations (and possibly many of them have nothing to do with the hash aggregation), but here is a surprising revelation that might explain why some of them were indeed caused by the switch to the new algorithm: the hash aggregation operation does not work very well together with automatic PGA management (WORKAREA_SIZE_POLICY = AUTO, the default since 9i). The fundamental defect is that it is not able to dynamically resize to a larger workarea size when using automatic PGA management and therefore remains more or less at its initial expected size, which is based on the estimates at optimization time.

This effectively means that the efficiency of the hash aggregation operation under automatic PGA management is heavily dependent on the cardinality estimates at optimization time - with estimates in the right ballpark, the memory used at execution time will correspond to the actual requirements at runtime, but with bad estimates the operation potentially uses far less memory than available and unnecessarily spills to disk.

I first start by showing the current PGA_AGGREGATE_TARGET (P_A_T) and PROCESSES settings.

I then create two tables as sample data that will be joined, and already collect information about the minimum and maximum value of the join column.

Any cursors that are dependent on table T1 will then be invalidated to make sure that a reoptimization of the next query takes place (of course, running the complete script invalidates such cursors anyway, due to the drop and re-create of the tables).
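One lightweight way to force such an invalidation is a harmless DDL against the table, for example (assuming the table name used here):

-- a COMMENT is DDL and therefore invalidates all dependent cursors
comment on table t1 is 'force reoptimization of dependent cursors';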

I then run a simple join followed by a group by operation. I've added a cartesian join by the way, but its only purpose is to ensure that the generated row source is sufficiently large to give the group by something to do.

I have used hints to ensure that I always get the same execution plan even if I later on manipulate the statistics to see the effect of incorrect cardinality estimates.

After getting the actual runtime execution plan along with execution statistics, I modify the column statistics of one of the tables' join columns in such a way that the optimizer thinks there is no overlap between the join column values and therefore computes a very low join cardinality.
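A sketch of how such statistics fudging can be done via DBMS_STATS (the table/column names and values are made up; the actual script may do this differently):

declare
  l_srec    dbms_stats.statrec;
  l_numvals dbms_stats.numarray := dbms_stats.numarray(1000000, 2000000);
begin
  -- pretend the low/high values of the join column lie far outside
  -- the range actually used by the other table
  l_srec.epc := 2;
  dbms_stats.prepare_column_values(l_srec, l_numvals);
  dbms_stats.set_column_stats(
    ownname => user
  , tabname => 'T2'
  , colname => 'FK'
  , srec    => l_srec
  );
end;
/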

After making sure again that a reoptimization will take place I run the same statement again with the same data volume and the same hints in place.

And this is what I get for all major versions that are currently out there (10.2.0.4, 10.2.0.5, 11.1.0.7, 11.2.0.1 and 11.2.0.2):

As can be seen, with an estimate in the right ballpark the HASH GROUP BY operation completes in memory (very close to the 20% of PGA_AGGREGATE_TARGET maximum size of a single workarea that applies from 10.2 on with automatic PGA management for PGA_AGGREGATE_TARGET < 500M; for more information see Joze Senegacnik's paper on the internals of automatic PGA management).

However, when repeating exactly the same operation with the fudged statistics, it spills to disk and uses only 7M, although it could have used up to 40M (given that there is no concurrent workload).

The same does not happen when repeating this experiment with other operations that use a workarea - the most obvious one being a SORT GROUP BY, as can be seen from this output:

All I've done is reduce the data set - because the SORT GROUP BY in this case requires more memory for the same amount of data than the HASH GROUP BY - in order to prevent the operation from spilling to disk with a 200M PGA_AGGREGATE_TARGET setting, and change the USE_HASH_AGGREGATION hint to NO_USE_HASH_AGGREGATION.

As you can see, the operation completes both times in memory and uses the same amount of memory no matter what the estimates look like.

I've tested the serial execution of the most common workarea based operations like HASH JOIN, SORT ORDER BY, WINDOW SORT, SORT UNIQUE and all of them were able to dynamically resize the workarea in cases where the initial estimated size was too small.

If you carefully check then you'll notice that I haven't mentioned the HASH UNIQUE operation yet, and later on you'll see why.

A cunning feature of the automatic PGA management comes to help, however: a kind of "feedback loop" for the workareas, based on statistics maintained by the automatic PGA memory management. And indeed, when the HASH GROUP BY cursor based on the incorrect cardinality estimate gets shared (not invalidated) and re-executed, the next execution will look like this:

So Oracle this time based the workarea memory requirements on the feedback from the previous execution and therefore allocated sufficient memory to complete the operation without spilling to disk.

Notice however that the feedback loop unfortunately does not work as desired when the workarea execution fails due to insufficient TEMP space. Ideally the feedback loop should allow the subsequent executions to grab more memory specifically in such cases, but at present it doesn't - the failure obviously prevents an update of the statistics and therefore subsequent executions continue to fail since they still use the same amount of memory.

You can test this by simply assigning a very small TEMP tablespace to the user executing the query, so that the second execution fails due to insufficient TEMP space. If you repeat the execution in this case, it will fail again and keep doing so.

Back to square one - with the invalidation the statistics are gone, too, so the bad cardinality estimate again leads to the suboptimal execution of the HASH GROUP BY.

So you might think, why bother? Only the first execution of a cursor without workarea execution statistics will be affected by the problem, subsequent execution of the same cursor will benefit from the statistics from the previous executions.

The problem, however, is that this is only fine for applications that share cursors. Unfortunately, applications that perform heavy-duty aggregations, like data warehouses, typically do not share cursors, since they do not care about the optimization overhead and deliberately use literals to provide as much information to the optimizer as possible.

Also these heavy duty aggregations usually use Parallel Execution features, and as you'll see from the output of the same test case, if I run the HASH GROUP BY in parallel by simply setting table T1 to parallel degree 2, a similar problem occurs - so Parallel Execution is affected as well.

So, summarizing: for every application that

- uses the hash aggregation HASH GROUP BY (or HASH UNIQUE, more on that in a moment)
- and does not share cursors

every execution that is based on bad cardinality estimates potentially has a problem with the hash aggregation because it might not make efficient use of the available memory.

The same applies to applications that share cursors, however in that case only the first execution after re-optimization / invalidation is affected.

So you might want to carefully check the runtime execution statistics of your critical hash aggregations.

Mind you, things could be worse, and that is where the HASH UNIQUE operation comes into the picture.

When I realised the issue with the HASH GROUP BY operation, I was quite certain that the HASH UNIQUE operation would be affected in a similar way, since internally Oracle seems to use the same mechanism for both operations (in the SMM trace files both are called HASH GROUP BY).

To my surprise I noticed that in 10g versions below 10.2.0.5 and 11g versions below 11.2.0.1 (which means in this case 10.2.0.4 and 11.1.0.7 respectively; I didn't test other versions) the HASH UNIQUE operation suffers from an even more dramatic problem: the cardinality estimate is not considered, and the initial workarea size is always based on minimum assumptions.

In passing: according to the SMM trace files that can be activated using the undocumented "_smm_trace" parameter, it looks like many of the sort-based workareas, such as SORT ORDER BY or WINDOW SORT, suffer from the same defect. However, since they are able to dynamically resize and make use of the workarea statistics feedback, they effectively work as expected - they just start with a too-low workarea size estimate every time they are executed for the first time after re-optimization / invalidation.

The combination of the two issues - inability to dynamically resize and ignoring the optimizer estimates - leads to a dire result: Every first execution of a HASH UNIQUE operation in those versions will only use the minimum amount of memory. The following test case shows the problem:

Obviously it is very similar to the HASH GROUP BY test case, this time however I've used a DISTINCT clause and replaced the HASH JOIN with a NESTED LOOP which makes the generated trace files easier to read, since there is exactly one workarea involved in this execution.

And this is what I get from 11.1.0.7, this time using AMM, with PGA_AGGREGATE_TARGET therefore set to 0 (you'll get the same result from 10.2.0.4 with a corresponding P_A_T setting):

As you can see, no matter what the estimates look like, on this system the first execution of a HASH UNIQUE will not get more than 6M - subsequent executions of the same cursor benefit from the statistics of the previous run, as seen before.

Again, switching to a SORT UNIQUE and reducing the data set accordingly, the problem cannot be reproduced:

In 10.2.0.5 and 11.2.0.1/11.2.0.2 this issue is apparently fixed and the cardinality estimates are used for an initial workarea size calculation - however this doesn't mean that the dynamic resize problem is fixed - it simply means that the HASH UNIQUE in those versions behaves exactly as the HASH GROUP BY and the memory usage for the first execution depends on the cardinality estimates.

Summary

If your database uses automatic PGA management, then for the hash aggregation operations (HASH GROUP BY / HASH UNIQUE) every initial execution that is based on bad cardinality estimates potentially has a problem, because it might not make efficient use of the available memory.

The same applies to applications that share cursors, however in that case only the initial execution after re-optimization / invalidation is affected, subsequent executions benefit from the workarea statistics feedback mechanism.

Furthermore in 10g versions below 10.2.0.5 and 11g versions below 11.2.0.1 the initial execution of a HASH UNIQUE operation ignores the cardinality estimates and will always be based on minimum assumptions.

So you might want to carefully check the runtime execution statistics of your critical hash aggregations.

Possible Workarounds

Different strategies are available as workarounds, depending on the situation:

- Upgrading to versions where the HASH UNIQUE operation at least considers the optimizer estimates might be beneficial

- Obviously good cardinality estimates are crucial; with those in place, and when using the correct Oracle versions, you should be fine. Using the undocumented OPT_ESTIMATE hint (or the deprecated, undocumented CARDINALITY hint) might help in cases where other options, like manually crafted statistics, are not able to help the optimizer come up with reasonable cardinality estimates.

- Applications that are able to share cursors might not be affected too much, due to the ability to reuse the workarea execution statistics for subsequent executions of the same cursor

- The described problems disappear when switching to a manual workarea size policy and allowing sufficient workarea memory. Interestingly, it is the less obvious SORT_AREA_SIZE that is used for manual control of the HASH GROUP BY operation, not the HASH_AREA_SIZE. Of course, using manual PGA memory management system-wide is highly unrecommended, so this might only be a workaround in certain cases of large batch jobs where manual workarea sizes are used anyway. Also note, when switching to manual workareas, be aware of a nasty bug that was introduced in the 10.2.0.3 patchset. For more information see MOS note "6053134.8: ALTER SESSION to set SORT_AREA_SIZE not honoured" and, for example, my and Jonathan Lewis' posts about the issue. According to the MOS note the bug has been fixed in 10.2.0.4.3, 10.2.0.5 and 11.1.0.7 / 11.2

- Obviously the hash aggregation can be avoided at statement level using the NO_USE_HASH_AGGREGATION hint, or at session / system level using the _GBY_HASH_AGGREGATION_ENABLED parameter (see the sketches after this list). Since the sort-based aggregation can be less efficient (but note that it doesn't have to be - it depends on the individual grouping and data pattern), this again doesn't necessarily solve the problem, since sort workareas might spill to disk where this wouldn't have been necessary had the hash aggregation been used and worked as expected.
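For illustration, hedged sketches of the last two workarounds (the GROUP BY below uses made-up demo objects; the underscore parameter is undocumented, so check with Oracle Support before using it):

-- disable hash aggregation for a single statement
select /*+ no_use_hash_aggregation */ deptno, count(*)
  from emp
 group by deptno;

-- disable hash aggregation for the session (undocumented parameter)
alter session set "_gby_hash_aggregation_enabled" = false;

-- manual workarea sizing for a large batch job; note that it is
-- SORT_AREA_SIZE (not HASH_AREA_SIZE) that governs the HASH GROUP BY
alter session set workarea_size_policy = manual;
alter session set sort_area_size = 104857600;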

Final Note

Of course there are a lot of things that I haven't touched yet, or only briefly, like Parallel Execution of the hash aggregations, workareas used for index sorts, and all the further details of what happens when a workarea spills to disk.

Oracle had the new version of the Exadata Machine on display at Oracle Open World this week. It’s called the Exadata X2-8. That’s a catchy name. It sounds very Iron Man-ish! In fact they had these fellows on display next to the demo machines.

The X2-8 uses two 4U Sun Fire X4800 servers, each of which has 8 eight-core Intel CPUs (X7560) and 1 terabyte of memory, along with 14 Exadata Storage Servers. Here’s a link to the spec sheet for the X2-8 and for the Oracle Sun X4800. Below are a couple of pictures. The first one shows one of the database servers (X4800) with one of the CPU modules out.

The storage cells have not changed much from the original V2. They still have 2 CPUs, 12 drives and 384G of flash cache, although I’ve been told they have newer (faster) 6-core Intel CPUs. (I did get a look at an unpublished spec sheet on an Oracle employee’s iPhone, and it said they were using the 5670 CPUs.) Oh, and they will allow you to run Solaris on the database servers. Of course they will have to finish a new version of Solaris 11 before that can happen. It’s worth noting that the X2-8 can be ordered, but there is no firm delivery date yet.

So this configuration is definitely for the high-end market and addresses a couple of issues. The increased memory will allow us to have a more robust consolidation platform. It will also allow bigger OLTP-type systems to run better (i.e. the additional memory means we can support many more concurrent users and have much larger buffer caches). Note that Exadata’s offloading generally reduces the memory requirements, but nevertheless, very large systems, particularly ones with lots of fast OLTP-type transactions and lots of users, will be better served by this type of configuration. Also note that there is no little brother version of the X2-8. It comes in a full rack only, which makes sense because there are only two database servers. I don’t believe the price has been set yet, but the word on the street is that the hardware will be about 50% more than the full rack version with the small 2-CPU blades (now renamed X2-2, by the way).

I did a post a couple of weeks ago (Thoughts on Exadata V3) about what I thought we might get to see in the way of Exadata changes. We got most of the things I was expecting, but not all of them. Among the things we got are bigger/beefier servers with more memory and available slots for HBAs to provide some additional connection capabilities (although I’m not sure Oracle is going to want people to open the machines up and put additional controllers in). I did see mention of an HBA in the X4800 spec sheet, so they may actually have one in there already (I need to check that out). They also announced that they will be offering a version of Solaris that can run on the database servers, which I was expecting, although they are still using Intel chips. The thing I was expecting that didn’t happen was a change of mindset about flexibility of configuration. They seem pretty set on maintaining a fixed configuration that will be the same for every customer. That is probably not such a bad idea. Certainly it’s easier to support and faster to deploy. But you know how customers are - they want their hamburgers with extra mustard and no pickles. So we’ll see how that works out over time. But for now, it’s a fixed menu. To quote Henry Ford: “You can get it in any color you want as long as it’s black”.

So that’s all I can think of at this point. Please let me know if you have any questions and I’ll see what I can find out.

I saw a post today where the subject of Exadata Storage Indexes was being discussed. One of the things that caught my eye was a discussion of whether Storage Indexes work with bind variables. One of the posters observed that since smart scans are aimed at data warehouse type queries, bind variables are pretty much irrelevant. Which is true. Still, it’s an interesting question. So I thought I’d give it a quick test.

As usual I used a couple of scripts:

fsx.sql – queries v$sql and shows whether a statement has been offloaded or not (slightly modified to remove 2 columns)
mystats.sql – just queries v$mystat
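The actual test statements aren’t reproduced here, but the pattern looks roughly like this (table and column names are placeholders, not the original script):

-- define and set a SQL*Plus bind variable, then filter on it
variable min_id number
exec :min_id := 9999998

select count(*) from big_table where id > :min_id;

-- afterwards check the storage index savings, e.g. via the statistic
-- "cell physical IO bytes saved by storage index" in v$mystat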

So the Storage Index was clearly used on this statement with a SQL*Plus number variable. Here’s some 10046 trace data showing that the smart scan wait event was used - note also the “enq: KO - fast object checkpoint” wait event, which occurs before the direct path reads (replaced by the “cell smart table scan” event in Exadata land).

So it looks like a definite yes for offloading with bind variables and for using Storage Indexes with bind variables - at least with numeric variables. Now let’s check out a varchar2 column, and while we’re at it let’s check wildcarding using % and the LIKE operator.

So, like regular B-Tree indexes, implicit conversion, functions applied to columns, etc. can disable Storage Indexes. Not too surprising. It’s also interesting that Storage Indexes on dates are a little persnickety. Literals work fine (at least in SQL*Plus) as long as the full 4-digit year is specified. You can see that the format Oracle converts it to is ‘syyyy-mm-dd hh24:mi:ss’. I was unable to get Storage Indexes to work with date columns using SQL*Plus varchar2 variables, though. Anyone got any ideas?

Even though parallel full table scans in general perform direct reads, some exceptions exist. The aim of this post is to show such an exception.
For test purposes I built, in my own schema, a copy of the SH.SALES table (the one distributed by Oracle with the demo schemas…). On that table I built an [...]

When two tables are equi-partitioned on their join keys, the query optimizer is able to take advantage of partition-wise joins. To make sure that the tables are equi-partitioned, as of Oracle Database 11g reference partitioning can be used. In fact, by definition, with reference partitioning all “related” tables have exactly the same partitioning schema. If [...]