World's Fastest Scalable Join

One glance at my golf clubs would be enough to determine that I'm a terrible golfer. The pitching wedge is dirty. Nine-iron: dirty. Same with the eight, seven and six irons. Five, four and three irons are fairly clean. Woods: pristine. I play percentage golf (actually 110%, if you count penalties); I figure a 5-iron 150 meters down the fairway is a better bet than a 3-wood 200 meters into the trees.

So I've got a golf bag with 2 clubs that I paid for but never use. Madness? Well no, not really; but then I'm not paid to play golf. Can you imagine a professional golfer never using the driver? It wouldn't happen.

Can you picture an Oracle programmer never using the most powerful join method available? No? Get a mirror.

The Nested Loop join is an iterative join: for each row in the first (outer, or driving) row source, look up the matching rows in the second (inner) row source. If the lookup into the second row source performs a Unique or Range Index Scan, then we call this Indexed Nested Loops.
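As a sketch, a join like the following (using hypothetical EMP and DEPT tables with an index on the inner table's join key) would typically resolve as Indexed Nested Loops; the LEADING and USE_NL hints make the choice explicit:

```sql
-- Hypothetical tables: EMP has an index on DEPTNO.
-- For each DEPT row (outer), Oracle probes the EMP index (inner).
SELECT /*+ LEADING(d) USE_NL(e) */
       d.dname, e.ename
FROM   dept d
JOIN   emp  e ON e.deptno = d.deptno
WHERE  d.loc = 'DALLAS';
```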

Indexed Nested Loops is used primarily in low volume joins; it is efficient over small volumes and versatile enough to be used in a variety of situations. Although it is fully scalable, Indexed Nested Loops is inefficient over large data volumes.

The hash join is used for high-volume equi-joins (joins with equals predicates). Oracle performs a single read of the smaller row source (call this T1) and builds a hash table in memory. The join key is used as the hash-key of the hash table. Then a single pass of the larger row source (call this T2) is performed, hashing the join key of each row to obtain an address in the hash table where it will find matching T1 rows.
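For illustration (the table names here are my own invention), the hints below ask Oracle to build the hash table on the smaller row source and probe it with the larger one:

```sql
-- T1 (small_dim) is scanned once to build the in-memory hash table;
-- T2 (big_fact) is scanned once, each row's join key hashed to
-- probe the table for matching T1 rows.
SELECT /*+ LEADING(t1) USE_HASH(t2) */
       t1.description, t2.amount
FROM   small_dim t1
JOIN   big_fact  t2 ON t2.dim_id = t1.dim_id;
```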

Provided T1 remains small enough to build the hash table in memory, T2 can be scaled up to any arbitrarily large volume without affecting throughput or exceeding temp space. If T1 cannot be hashed in memory, then a portion of the hash-table spills to disk. When the hash table is probed by T2, the rows with join keys that match those parts of the in-memory hash table are joined immediately; the rest are written to TEMP and joined in a second pass. The bigger T1 is, the smaller the proportion of the hash table that can fit in memory, and the larger the proportion of T2 that must be scanned twice. This slows the Hash Join down considerably and also makes the join non-scalable.

A sort-merge join works by reading each row source in the join separately; sorting both sets of results on the join column(s); then concurrently working through the two sorted lists, joining the rows with matching keys. Sort-Merge is generally faster than Indexed Nested Loops but slower than Hash Join for equi-joins. It is used almost exclusively for non-equi joins (>, <, BETWEEN), and will occasionally be used when one of the row sources is pre-sorted (eg. a GROUP BY inline view).
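A classic non-equijoin that favours Sort-Merge is a range lookup, such as the old EMP/SALGRADE example; the USE_MERGE hint makes it explicit:

```sql
-- BETWEEN predicate: a hash join is not possible, so Sort-Merge
-- will usually beat Nested Loops here.
SELECT /*+ USE_MERGE(e s) */
       e.ename, s.grade
FROM   emp e
JOIN   salgrade s ON e.sal BETWEEN s.losal AND s.hisal;
```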

If both row sources are small then they may both be sorted in memory; however, large sorts spill to disk, making them non-scalable.

There is no way to make a Sort-Merge join scalable. The only other way to resolve a non-equijoin is to use Nested Loops, which is slower. As volumes increase, Sort-Merge will continue to out-perform Nested Loops, but will eventually run out of Temp space. The only solution is to extend TEMP, or convert the join to Nested Loops (and then wait).

Indexed Nested Loops may be slow over large volumes, but at least they don't use TEMP space. One of the subordinate operations (eg. you may be joining to a GROUP BY inline view) may use TEMP, but it would still do so even without the join.

You cannot determine whether a Sort-Merge join or a Hash join is using TEMP space unless you run it. For Sort-Merge, you can use SQL*Plus AUTOTRACE, and for Hash you can use a 10104 event trace.
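In SQL*Plus, the two checks look roughly like this (substitute your own join for the placeholder SELECT):

```sql
-- Sort-Merge: AUTOTRACE statistics include a "sorts (disk)" counter.
SET AUTOTRACE TRACEONLY STATISTICS
SELECT ... ;  -- run the join here

-- Hash Join: event 10104 writes hash-join details (including any
-- spill to TEMP) to the session trace file.
ALTER SESSION SET EVENTS '10104 trace name context forever, level 10';
SELECT ... ;  -- run the join here
ALTER SESSION SET EVENTS '10104 trace name context off';
```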

Observe the second-last line in the Statistics: sorts (disk). If this value is anything but 0, then the SQL is performing a sort too large to be completed in memory. If your explain plan shows two or more sort steps (as is the case with a Sort-Merge join) then it is not possible to determine which step resulted in the disk sort.

The bad news is that making joins scalable is not a tuning exercise, it's a design exercise. It should be taken into consideration when a table is first built, as it can be difficult and risky to retrofit post-implementation.

It's worth the effort to design tables for scalable joins; not only do you not have to worry about TEMP space, the joins also run faster.

Besides Indexed Nested Loops, which performs sub-optimally over large data volumes, there are three other scalable joins: Serial Partition-Wise Joins, Hash Cluster Nested Loops, and Cluster Joins. Prior to writing this article, I was of the mistaken belief that Cluster Join was the fastest type of join. In the course of benchmarking these three joins, I discovered that the awesome memory management of the Hash Join, combined with the scalability of partitions, performed up to twice as fast. Although Hash Cluster Nested Loops and Cluster Joins are both fully scalable, they are specialised solutions to problems other than scalability, so will not be discussed further.

A Serial Partition-Wise Join is only possible when joining two equi-partitioned tables that are partitioned on the join key, ie. both tables have the same number of partitions with the same bounds. They work by breaking a very large join down into several smaller independent parts. Partition-Wise joins are most commonly used in parallel queries where the purpose is to improve performance by joining different chunks at the same time, but when run in serial mode they can also be used to limit or avoid the use of TEMP space.
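A minimal sketch of equi-partitioning (table and column names are hypothetical): both tables are hash-partitioned on the join key with the same partition count, so each pair of partitions can be joined independently.

```sql
-- Both tables hash-partitioned on the join key (cust_id) with the
-- same number of partitions: partition 1 of CUSTOMERS joins only
-- to partition 1 of ORDERS, and so on.
CREATE TABLE customers (
    cust_id   NUMBER PRIMARY KEY,
    cust_name VARCHAR2(100)
)
PARTITION BY HASH (cust_id) PARTITIONS 16;

CREATE TABLE orders (
    order_id   NUMBER,
    cust_id    NUMBER,
    order_date DATE
)
PARTITION BY HASH (cust_id) PARTITIONS 16;
```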

Serial Partition-Wise Joins are not inherently scalable; you need to make sure that no pair of partitions is too large to be written to available TEMP space. Avoiding TEMP space altogether is impractical: it would require every partition of the smaller table to be hashable in memory, which would mean impractically small partitions.

Disk usage (and execution time) is much reduced when the join does not use TEMP space.

That's great, but how practical is it considering most tables are partitioned by date but joined on some other foreign key?

Partition-wise joins are the primary reason for the existence of Composite partitioning (sub-partitions within partitions). If you want to enable a partition-wise join between two tables but one (or both) of them is already partitioned by a date, then you hash-sub-partition that table by the join-key. The other table must also be hash-partitioned or hash-sub-partitioned on the join key.
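Sketching that with hypothetical tables: SALES is range-partitioned by date for its own reasons, but hash-sub-partitioned on the join key so it can be joined partition-wise to PRODUCTS, which is hash-partitioned on the same key with the same partition count.

```sql
-- SALES: range-partitioned by date, hash-sub-partitioned on prod_id.
CREATE TABLE sales (
    sale_id   NUMBER,
    prod_id   NUMBER,
    sale_date DATE
)
PARTITION BY RANGE (sale_date)
SUBPARTITION BY HASH (prod_id) SUBPARTITIONS 16 (
    PARTITION p2023 VALUES LESS THAN (DATE '2024-01-01'),
    PARTITION p2024 VALUES LESS THAN (DATE '2025-01-01')
);

-- PRODUCTS: hash-partitioned on the same key, same partition count,
-- so a join on prod_id can proceed sub-partition by partition.
CREATE TABLE products (
    prod_id   NUMBER PRIMARY KEY,
    prod_name VARCHAR2(100)
)
PARTITION BY HASH (prod_id) PARTITIONS 16;
```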

When to use Serial Partition-Wise Joins

Use Serial Partition-Wise Joins when you are joining two large tables (>1M rows), especially when one of them is monolithic (>10M rows). You will need to use hash-sub-partitions if one or both is already partitioned on a column other than the join-key.

Things to watch out for

Take care with locally partitioned indexes when partitioning and sub-partitioning tables. Range scans on locally partitioned indexes can dramatically degrade in performance because they have to scan many index segments instead of just the one. You should never physically reorganise a table (partitioning, clustering, index-organised) without prior performance analysis and regression testing.

If you are designing a physical data model with large tables, it is simply irresponsible not to consider the scalability of joins. Hash Joins are almost always the top performers; however, there are unusual circumstances with very large volumes (which I have been unable to reproduce here), described in the Oracle Performance Tuning manual, where a Hash Join requires as many as two reads and a re-write of the larger row source in TEMP space.

Large tables should be designed to exploit partition-wise joins wherever possible; this will all but guarantee future join scalability. If scalability is not an issue, the joins can be parallelised, making them even faster.

Partitioning and sub-partitioning a table for performance is not simple; it is very easy to harm performance in an attempt to improve it. Before attempting this technique on a live system, make sure you read the relevant chapters of the Oracle Performance Tuning manual, perform copious benchmarking on full production volumes, and regression test all affected software.

This very nearly didn't get published. I've been working on it on-and-off for two weeks. Wasn't happy with it and chucked it in the 'maybe later' folder. Did a major edit yesterday, changing the focus and deleting about 200 lines. The original article was about the benefits of clusters - perhaps another day.