May 29, 2007

Autoallocate and PX

Here’s a quirky little set of results from trace files. First a 10391 at level 64 – this shows the way in which an object was broken into granules for a parallel tablescan. I’ve just listed the first few granules:

Now, with just a little cosmetic enhancement, the 10046 trace at level 8 of the parallel server process that was given the first granule to process. You can tell from the parameter naming that this is a 10g trace file. Note especially the order in which the blocks are visited.

As you can see, this stream of waits corresponds to a PX slave waiting to be told to process the first granule (PX Deq: Execution Msg), and ends with the slave waiting to be told which granule to process next.

But why are the direct path reads so small and in such a funny order? After all, the slave was given a stream of 99 consecutive blocks – and even that’s a little odd because if you count from block 12 to block 130 you get rather more than 99 blocks. Moreover, there seem to be some blocks that didn’t get read.

The answer lies in the title to this piece. I have created the table in a tablespace with system managed extents (the auto allocation type for locally managed tablespaces). In a “clean” datafile, the effect of this is that if you create a table and start to insert data into it, the first 16 extents will be allocated at 64KB, the next 63 at 1MB, followed by 120 at 8MB, before Oracle finally switches to 64MB extents. (Current observations, not documented by Oracle).

However, if you run a parallel “insert /*+ append */”, every parallel execution slave thinks it is populating its own private data segment, so each slave will generate 16 extents of 64KB, then 63 extents of 1MB, and so on. Moreover, if you use 3 separate “insert /*+ append */” statements to load the table, each statement starts the cycle all over again.

On top of that, when you use autoallocate, the last stage in the insertion process does extent trimming to release the trailing free space from the last extent allocated by each slave.

Oracle tries very hard to avoid problems with this extent trimming – it’s done to the nearest “unit size”, i.e. to the next 64KB, 1MB, or 8MB boundary above the actual limit of the data. This has the effect of avoiding randomly sized holes all over the tablespace but it will still leave several holes of varying sizes.

Now assume you have a very large tablespace, are running with 32 CPUs and default parallelism, frequently drop, re-create, and populate objects, even dropping and recreating several objects concurrently. If you do this you will probably find that you gradually accumulate lots of little holes of all sorts of sizes all over the tablespace.

Oracle does try to use up these holes as fast as possible – but only by allocating extents of the legal unit sizes, starting from the start of the file and working along it in order. So if you have lots of little holes scattered through the tablespace, and lots of parallel slaves busily soaking them up, you end up with situations like the above: 108 consecutive blocks in one file consisting of 14 chunks of 8 blocks allocated by 14 different parallel execution slaves, and read in order of extent_id when a parallel tablescan starts.

Mixing very large autoallocate tablespaces with repeated parallel create/insert at high degrees of parallelism can result in objects acquiring a very large number of small extents – which can have a surprising impact on subsequent parallel queries.

[…] type of operation then these are just some of the questions you need to answer. (See, for example, a note I wrote three years ago about some of the anomalies of I/O sizes when running parallel query, and a related enhancement in […]

That ‘s solve our problem. we have a huge data warehouse with 32K block size and with quite a good IO system with oracle 11r2. but we never manage to get db_file_multiblock_read_count=128 and we have alot of holes in tablespace also. Now we are setting initial=8M but not able to decide Next value. Also a value of db_file_multiblock_read_count=256 is under discussion. our system admin says that we can change this value by setting the maxphy tunable parameter of solaris.

[…] few years ago (2007) I wrote about a problem that could appear when you mixed parallel execution with system managed extent allocation. A couple of years later I added a note that Christian Antognini had observed a patch in 11.1.0.7 […]