When it runs serially, the join completes in memory and the only I/O you see comes from the two tablescans. When the query runs parallel, something causes a spill to the temporary tablespace.
Here’s the code to build the tables, followed by the execution plans from 9.2.0.8 and 10.2.0.3:

I’ve created table t1 with 10,000 rows and a (unique) column n1 that ranges from 1 to 10,000. Table t2 then has 20,000 rows, again with an n1 column that ranges from 1 to 10,000, in this case with two rows per value. When you join the two tables on their n1 columns, the result set will have 20,000 rows.

The table-creation code is more complex than it really needs to be, but I wrote it that way in case I needed to create a few million rows in each table.

You will notice that I have switched to manual workarea management so that I can set the hash_area_size (rather than letting Oracle pick a dynamic limit based on the pga_aggregate_target and current PGA usage). This is simply to make the test case reproducible. I’ve also included the pq_distribute() hint in the query to tell the optimizer to use the hash/hash distribution rather than broadcasting the first table (which could be forced by the hint /*+ pq_distribute(t2, broadcast, none) */). In my example, the optimizer took the hash/hash option automatically and I have just included the hint for reasons of reproducibility, and to demonstrate the hint syntax.

Mechanisms:

Checking the execution plans, we can build the following picture:

One set of PX slaves scans t1 (referenced as para_1 in the picture) and distributes the data by hashing (on the n1 column) to the second set of slaves by writing virtual table (or table queue) Q0, then the same set of slaves scans table t2 (para_2) and distributes the data in the same way, using virtual table Q1.

The second set joins the two virtual tables by reading Q0 to build the hash table then probing the hash table with each row from Q1 as it arrives, passing each join result to the query coordinator by writing to virtual table Q2.
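As a rough illustration of the mechanism (a conceptual sketch only, not Oracle's implementation – the `consumer_for()` function and the use of Python's `hash()` are my own simplifications), the hash/hash distribution and the build/probe cycle look something like this:

```python
# Conceptual sketch of a hash/hash distributed hash join at degree 2.
# The consumer slave that receives a row is chosen by hashing the join
# column, so matching rows from t1 and t2 always land on the same slave.

DEGREE = 2

def consumer_for(n1):
    # hash on the join column to pick a consumer slave (0 or 1)
    return hash(n1) % DEGREE

# Producer side: one slave set scans both tables and distributes
# each row source by the same hash function.
t1_rows = [(n1, 't1 data') for n1 in range(1, 10001)]
t2_rows = [(n1, 't2 data') for n1 in range(1, 10001) for _ in range(2)]

q0 = {s: [] for s in range(DEGREE)}   # virtual table Q0 (t1 rows)
q1 = {s: [] for s in range(DEGREE)}   # virtual table Q1 (t2 rows)
for row in t1_rows:
    q0[consumer_for(row[0])].append(row)
for row in t2_rows:
    q1[consumer_for(row[0])].append(row)

# Consumer side: each slave builds a hash table from its share of Q0,
# then probes it with its share of Q1, writing results to Q2.
q2 = []
for s in range(DEGREE):
    build = {n1: payload for n1, payload in q0[s]}
    for n1, payload in q1[s]:
        if n1 in build:
            q2.append((n1, build[n1], payload))

print(len(q2))   # 20000 join rows, as expected
```

Because both row sources are hashed on the join column with the same function, no consumer slave ever needs a row that was sent to a different consumer.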

Update: in response to a question on the OTN DB Forum, I’ve been prompted to point out the significance of the “98,{N}” in the 9i plan and the “TQ1,{N}” in the 10g plan. The first number is the “dataflow operation (DFO)” – it identifies a single parallel flow of data all the way through the query; the second number is the virtual table number within the data flow operation. The two numbers appear in combination in the 10g “Name” column to identify completely the virtual table that an operation is populating. (If you see more than one dataflow operation in a single query then something inefficient may be going on.)

Data Volume

So, I’ve got 10,000 rows of about 200 bytes each for a total of about 2MB, and a table of 20,000 rows of about 500 bytes each for a total of about 10MB, and a join result of 20,000 rows of about 700 bytes each for a total of 14MB. Since we are running parallel 2, we might expect each PX slave to see about 5,000 rows (1MB) from t1, 10,000 rows (5MB) from t2, and 10,000 rows (7MB) of result.
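The arithmetic, for checking (decimal megabytes, consistent with the rounded figures above):

```python
# Data volumes from the test case (row sizes are approximate, as in the text)
t1_rows, t1_row_bytes = 10_000, 200
t2_rows, t2_row_bytes = 20_000, 500
join_rows, join_row_bytes = 20_000, 700
MB = 1_000_000

print(t1_rows * t1_row_bytes / MB)        # 2.0  -> ~2MB from t1
print(t2_rows * t2_row_bytes / MB)        # 10.0 -> ~10MB from t2
print(join_rows * join_row_bytes / MB)    # 14.0 -> ~14MB of join result

# At degree 2 each PX slave sees roughly half of each row source
print(t1_rows // 2, t2_rows // 2, join_rows // 2)   # 5000 10000 10000
```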

Since we set the hash_area_size to 10MB, we seem to have enough memory to hold the hash table completely in memory. We could even hold the hash table and all the data we use from t2 in memory – we could even hold the hash table and the result set in memory. So why do we see writes and reads to the temporary tablespace?

Analysis

The first thing to do, of course, is to find out where that I/O comes from. Enabling events 10046 (wait tracing) and 10104 (hash join tracing) is a good starting point.

The 10104 trace (9i version shown below) starts like this (the line numbers are my addition to the normal trace):

The trace reports 10MB of memory available for operating the hash join (line 1).

The optimizer has pre-empted the possibility of the hash table spilling to disc by allowing for 9 partitions (line 6) although it talks about the “fanout” being 8 (line 5). Now the number of partitions is “always” a power of 2 and “always” matches the fanout – so what’s that extra partition for?

We see that the optimizer has decided to use units of 91 blocks for any I/O to the temporary tablespace (line 8) – this translates into a size of 728KB (line 10) as the unit I/O (“cluster”, or “slot”) size. (In terms of parameters, this setting can be dictated by the parameter hash_multiblock_io_count – which is hidden in newer versions of Oracle.)

This “slot” size has a very important impact on the efficiency of the hash join. To start with, each partition needs a minimum of one slot to hold data. This means that the minimum in-memory size of the hash table will have to be at least 5.6MB (728KB * 8) because the hash table is going to be partitioned 8 ways. We can see this a little further down the trace file, as the hash table build completes.
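Assuming an 8KB block size (which is what makes 91 blocks come to 728KB), the arithmetic is simple:

```python
# Slot-size arithmetic from the 10104 trace (assuming an 8KB block size)
block_size_kb = 8
slot_blocks = 91
slot_kb = slot_blocks * block_size_kb
print(slot_kb)                      # 728 -> 728KB per I/O "slot"

fanout = 8                          # partitions used for the hash table
min_hash_table_kb = slot_kb * fanout
print(min_hash_table_kb / 1024)     # 5.6875 -> the "5.6MB" minimum footprint
```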

This means the entire hash table has been built completely in memory (kept = 1 for every partition) with a huge amount of space to spare. We have 5,029 rows in memory, and could handle an estimated 77,792 in the space allocated.

After the build completes, the 10046 trace details show the process reading data from the next virtual queue then, a bit further down the file, we get the following extra clues from the 10104 trace:

We have been dumping partition 8 to disc – but the hash table is in partitions 0 to 7, so what are we dumping?

There are only two possibilities – either Oracle is copying the incoming probe table to partition 8 before it does the join, or it is writing the result of the join to partition 8 before re-reading it and forwarding it to the query coordinator. A couple of simple checks (e.g. adjusting the n1 values in t2 to ensure that the join returns no data) show that it’s the result set that is being dumped.

Looking back at lines 17 and 18 of the trace file, we can see that this was dictated by the “immutable flags” which were set to “BUFFER the output of the join for Parallel Query”.

Why ?

This behaviour is astonishing – there is clearly no need to do things this way; it’s a total waste of effort and time. Why write, re-read, then forward the result set when it could simply be sent as it was created? Alternatively, why doesn’t the PX slave write its virtual table (Q2) to the temporary tablespace and let the Query Co-ordinator read it? The code seems to be doing something quite bizarre.

Here’s my guess. First, we’re looking at a special case; secondly, in the more general case a different read/write strategy could become quite complex – especially when the query started to run across multiple RAC instances.

Generically, a single DFO (data flow operation) may involve far more than two layers of parallel execution; but a single DFO can use only two sets of parallel execution slaves. So if a parallel execution plan has three “layers” of operations, this is what happens: slave set one does the first operation, slave set two does the second operation, and slave set one does the third operation.

But slave set two (in the middle) may have some output ready for the third operation (slave set one) before the first operation (also slave set one) has completed – which means slave set one won’t be in place to receive the output. If this happens, slave set two has to do something with the output – which explains why it has to have some sort of buffer operation.

My guess is that it is this buffering operation that appears in our example, even though in this particular case it is “obviously” not needed.

If you accept the need for some sort of buffering mechanism, the next question is: why not make one layer of slaves write to the temporary tablespace and the next layer read from it? This could cut out a huge amount of messaging between the two layers of slaves – the only information that would have to be sent through the SGA is a list of which blocks in the temporary segment should be read by which slaves.

This sounds reasonable – but the code needed to handle the generic case may be sufficiently complicated that the cost/risk/benefit analysis says that it’s not worth doing.

Think about a simple cycle where slave set 1 writes to the temporary tablespace for slave set 2 to read, and slave set 2 writes to the temporary tablespace for slave set 1. We don’t need a more complicated example because slave set 1 is not going to start reading for the third operation until after it has finished the first operation, so a double layer is as complex as things can get.

But, even though we have a limit of two layers, think about the number of communication channels. If we are running degree N there are N slaves in slave set 1 which need to distribute data across N slaves in slave set 2, so we need to isolate N*N channels of communication. Then you have to double that number because you could have two layers of communication live at once.

There is a Metalink note somewhere that points out that the amount of memory you should allocate for PX messaging in the large pool is something like 3 * parallel_execution_message_size * parallel_max_servers * parallel_max_servers / 4 (which allows 3 buffer pages per possible channel when one query runs at degree parallel_max_servers / 2).
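Both calculations are easy to sketch. (The 2,148-byte message size in the example is purely illustrative – check your own parallel_execution_message_size.)

```python
# Channel and message-buffer arithmetic for PX messaging
def channels(degree):
    # N producers each talk to N consumers, and two layers of
    # communication can be live at once
    return 2 * degree * degree

print(channels(3))   # 18 channels at degree 3

# Memory suggested by the Metalink formula quoted above:
# 3 buffer pages per possible channel when one query runs at
# degree parallel_max_servers / 2
def px_msg_pool_bytes(message_size, parallel_max_servers):
    return 3 * message_size * parallel_max_servers * parallel_max_servers // 4

print(px_msg_pool_bytes(2148, 64))   # 6598656 -> ~6.6MB for 64 slaves
```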

If you tried to switch this messaging to the temporary tablespace, would you allocate one extent for each channel? Probably not, as that could need a huge, and sparsely used, temporary tablespace.

On the other hand, if you tried something like one private segment per query, or per operation per query, you then have to write code that allows slaves in the same layer to negotiate with each other to ensure that they don’t overwrite each other’s blocks in that extent. This is the efficient, but complex, solution. And when you realise that this could include communication across a RAC interconnect, you can appreciate that Oracle Corp. might avoid rewriting the code unless it proved to be absolutely necessary for a number of really large customers.

So parallel queries may do far more I/O to the temporary tablespace than you expect (and this is just one of the reasons). The “HASH JOIN BUFFERED” operation that becomes visible in 10g (see line 3 of the plan above) may tell you where some of that I/O is coming from. (The same action appears in all versions I tested from 8i to 11g, by the way – it’s just that 10g is the first version to report the operation).

Damage limitation

If you do find that your parallel hash joins are working harder than you expect, check the plan, and check the distribution. It isn’t well known, but you can make a huge difference to the resource consumption by controlling exactly what Oracle does with parallel hash joins. In this particular case, switching to a “broadcast” distribution bypassed the need for any sort of buffering at all, and the I/O to the temporary tablespace disappeared completely. I’ll be writing more about that option some time in the future.

Footnote:

Although the hash join spilled to disc (direct path writes and reads to the temporary tablespace) it still appeared in v$sql_workarea as an optimal workarea operation, and the session statistics reported it (twice because of the parallelism) under “workarea executions – optimal”.

Update

In December 2012 Randolf Geist wrote an article showing that I was wrong in my description of the buffering. Because Oracle keeps an in-memory bitmap of the hash buckets used by the build table it is able to check very quickly whether or not an incoming row from the probe table hashes to a possible match in the build table – so it is able to buffer (and then spill to disc) just those rows from the probe table that are potential matches without actually going through all the steps of doing the hash join.
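The idea Randolf described can be sketched very loosely as follows (this is my own toy model – the bucket count, the `bucket()` function, and the use of Python's `hash()` are all illustrative, not Oracle's internals):

```python
# Sketch of the build-side bitmap filter: keep an in-memory bitmap of
# the hash buckets used by the build table, then buffer (and spill)
# only the probe rows that hash to an occupied bucket - the potential
# matches. Definite non-matches are discarded without touching temp.
N_BUCKETS = 1024

def bucket(key):
    return hash(key) % N_BUCKETS

build_keys = range(0, 1000, 2)        # build table: even keys only
bitmap = [False] * N_BUCKETS
for k in build_keys:
    bitmap[bucket(k)] = True

probe_keys = range(1000)              # probe rows: even and odd keys
buffered = [k for k in probe_keys if bitmap[bucket(k)]]

# Probe rows that merely collide with an occupied bucket may still be
# buffered (false positives), but the filter never loses a real match.
print(len(buffered))                  # 500 of the 1000 probe rows survive
```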

Related

One of our customers wants to have his temporary tablespaces put on RAM disks.
It has been done in the past for other databases.
We are a little reluctant to do so but reading your article I conclude that they have a point there.
Just setting pga_aggregate_target higher will not always reduce writing to the temporary tablespace as I understand.

Maybe this could also be done to cope with a slow client, or the risk of the client being slow. That is, “partition 8” (which buffers the result set) is dumped to the temp tablespace to enable the kernel to free all the RAM it allocated (which could be huge, of course). The result set rows are then read “slowly” as the client asks “slowly” for them by issuing FETCH db-calls (more precisely, as the virtual table Q2 empties thanks to the client reading).

I’m thinking of something similar to the mechanism linked to the “sort_area_retained_size” parameter for serial sorts.

It couldn’t be that bad when the client is fast enough to consume the result set as it is produced: when the hash join completes, the buffer is empty, hence there will be nothing to dump.

That was a description of the sorting mechanism – with the side effect that the memory needed for sorting N items of data required an overhead of roughly (N * 3 * pointer size) bytes – where a pointer is 4 bytes on 32-bit systems and 8 bytes on 64-bit systems.

I don’t know the algorithm that is used for the so-called “V2” sort in 10.2, but it seems to need an overhead of just one pointer per item. Not only is it less memory intensive, the algorithm is also less CPU intensive.
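Putting the two overhead figures side by side (using the rough per-item costs quoted above – the three-pointers-per-item figure for the old sort and one pointer per item for the V2 sort are the approximations from the discussion, not exact internals):

```python
# Overhead comparison for sorting n items, 64-bit pointers assumed
def old_sort_overhead(n, pointer_size=8):
    # roughly three pointers per item for the pre-10.2 sort
    return n * 3 * pointer_size

def v2_sort_overhead(n, pointer_size=8):
    # the 10.2 "V2" sort appears to need about one pointer per item
    return n * pointer_size

n = 1_000_000
print(old_sort_overhead(n) / 1e6)   # 24.0 -> 24MB of overhead
print(v2_sort_overhead(n) / 1e6)    # 8.0  -> 8MB for the same sort
```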

I think I suggested in my book that it might be a variation of the heap sort algorithm, but I recall an informal comment from an Oracle employee that it wasn’t.

The trick with the “direct path reads” is that Oracle manages to dispatch N buffers for read requests (probably 4), which is a CPU and memory intensive strategy, then comes back to the first buffer – which in this case requires a very short wait to fill. Then Oracle uses the data, dispatches a new read and moves on to the second buffer. In the interim the second read has nearly completed, but requires a few more microseconds… if Doug’s hardware were just a little faster, we wouldn’t see any direct path read waits.

Jonathan, since parallel processing uses direct path reads: if my DOP is 3 then I have 3 producer processes producing data for the consumer processes. Since the PGA is being used for reading data, in which area of the PGA is the data buffered, and can we increase or decrease the size of that area?

There are at least three different areas to consider.
a) Buffer memory the process reserves for reading data blocks. This is process memory and is accounted for against the pga_aggregate_target; and I believe there is some scope for adjusting this by playing with the _db_file_direct_io_count (with the usual warning about hidden parameters, of course)

b) Memory used by PX slaves to pass data to other PX slaves – this comes from the SGA, and possibly the only way to modify it would be to change the parallel_execution_message_size (originally the default was ca. 2KB).

c) Memory used by a slave for the “PX Buffer” type of operation described in this note, and that comes out of a normal ‘pga workarea’ – with the usual limitations on how you can affect the amount of memory that can be used for workareas.

Thank you for your comment – it has drawn my attention to the fact that 11g (possibly 11.2) has added a new feature to the pq_distribute hint that was not in (or, at least, was not documented for) 10g: the “single distribution” option for dealing with data loading.

The posting is simply an example of the type of investigation and level of knowledge required to make a decision of the type indicated by the manuals, so the quick answer to your last question is: “yes”.

[…] Pursue that “long chain” hypothesis to a slight extreme – what if there’s one mes_mese_id that appears in 250,000 of the 2M rows, and the sistema value is the same for every one of those quarter million rows – that would require Oracle to walk a chain of 250,000 elements 250,000 times, for a total of 62.5 billion pointers to follow and comparisons to make – how much CPU might that take? Worse still, since having the same mes_mese_id (the hashing is only on the equality predicate) means the quarter of a million rows had to go into the same hash bucket, we might end up doing a multipass operation because that bucket was much larger than anything that Oracle had anticipated when it did its internal partitioning calculations for the hash table. (There are some related notes in this article.) […]