Merge Join

In this post, I’ll describe the second physical join operator: merge join (MJ). Unlike the nested loops join which supports any join predicate, the merge join requires at least one equijoin predicate. Moreover, the inputs to the merge join must be sorted on the join keys. For example, if we have a join predicate “T1.a = T2.b,” table T1 must be sorted on T1.a and table T2 must be sorted on T2.b.

The merge join works by simultaneously reading and comparing the two sorted inputs one row at a time. At each step, we compare the next row from each input. If the rows are equal, we output a joined row and continue. If the rows are not equal, we discard the lesser of the two inputs and continue. Since the inputs are sorted, we know that we are discarding a row that is less than any of the remaining rows in either input and, thus, can never join.

We can express the algorithm in pseudo-code as:

get first row R1 from input 1get first row R2 from input 2while not at the end of either input begin if R1 joins with R2 begin return (R1, R2) get next row R2 from input 2 end else if R1 < R2 get next row R1 from input 1 else get next row R2 from input 2 end

Unlike the nested loops join where the total cost may be proportional to the product of the number of rows in the input tables, with a merge join each table is read at most once and the total cost is proportional to the sum of the number of rows in the inputs. Thus, merge join is often a better choice for larger inputs.

One-to-many vs. many-to-many merge join

The above pseudo-code implements a one-to-many merge join. After we join two rows, we discard R2 and move to the next row of input 2. This presumes that we will never find another row from input 1 that will ever join with the discarded row. In other words, there may not be duplicates in input 1. On the other hand, it is acceptable that we might have duplicates in input 2 since we did not discard the current row from input 1.

Merge join can also support many-to-many merge joins. In this case, we must keep a copy of each row from input 2 whenever we join two rows. This way, if we later find that we have a duplicate row from input 1, we can play back the saved rows. On the other hand, if we find that the next row from input 1 is not a duplicate, we can discard the saved rows. We save these rows in a worktable in tempdb. The amount of disk space we need depends on the number of duplicates in input 2.

A one-to-many merge join is always more efficient than a many-to-many merge join since it does not need a worktable.

To use a one-to-many merge join, the optimizer must be able to determine that one of the inputs consists strictly of unique rows. Typically, this means that there is either a unique index on the input or there is an explicit operator in the plan (perhaps a sort distinct or a group by) to ensure that the input rows are unique.

Sort merge join vs. index merge join

There are two ways that we can get sorted inputs for a merge join: we may explicitly sort the inputs using a sort operator or we may read the rows from an index. In general, a plan using an index to achieve sort order is cheaper than a plan using an explicit sort.

Join predicates

Merge join supports multiple equijoin predicates so long as the inputs are sorted on all of the join keys. The specific sort order does not matter so long as both inputs are sorted in the same order. For example, if we have a join predicate “T1.a = T2.a and T1.b = T2.b,” we can use a merge join so long as T1 and T2 are both sorted either on “(a, b)” or on “(b, a).”

Merge join also supports residual predicates. For example, consider the join predicate “T1.a = T2.a and T1.b > T2.b.” Although we cannot use the inequality predicate as part of a merge join, we can still use the equijoin portion of this predicate to perform a merge join (presuming the inputs are sorted on column “a”). For each pair of rows that join on column “a,” we can then apply the inequality predicate. If the inequality evaluates to true, we return the joined row; if not, we discard it.

Outer and semi-joins

Merge join supports all outer and semi-joins variations. To implement an outer join, we simply need to track whether each row has joined. Instead of discarding a row that has not joined, we can NULL extend it and output it as appropriate. We can implement a semi-join or an anti-semi-join in a similar way.

Merge join supports a special case for full outer join. In some cases, we generate a merge join for a full outer join even if we have no equijoin predicate. This is equivalent to a many-to-many merge join where all rows from one input join with all rows from the other input. We will build a worktable to store and playback all rows from the second input. We support this plan as an alternative to the full outer join transformation which supports nested loops join. (See the discussion of full outer join in my previous post.)

Examples

The following examples use this simple schema:

createtable T1 (a int, b int, x char(200))

createtable T2 (a int, b int, x char(200))

setnocounton

declare @i int

set @i = 0

while @i < 1000

begin

insert T1 values(@i * 2, @i * 5, @i)

insert T2 values(@i * 3, @i * 7, @i)

set @i = @i + 1

end

Let’s start with a simple example:

select*

from T1 join T2 on T1.a = T2.a

option(merge join)

Since we’ve asked for a merge join with the option hint and since there are no indexes on these two tables, the optimizer must add explicit sorts to the plan:

Although the rows in both input tables are indeed unique, the optimizer does not know that and cannot enforce it so we generate a many-to-many join as evidenced by the MANY-TO-MANY keyword.

Note that as I mentioned above, each table scan executed only once regardless of the number of rows processed. Also, note that the sort on T2 only returned 668 rows out of 1000 rows. What happened? After processing 668 rows from T2, the merge join encountered a row from T2 that is greater than any row in T1. At this point, the merge join read all the remaining rows from T1. Once it reached the end of T1, the merge join exited even though it did not read all of the rows in T2.

This is the same plan except that we now need a sort only on T2 since we can use the index on T1 to achieve the sort order. Note that even though we now have a unique index on T1, we still have a many-to-many join. Why? The index guarantees uniqueness on “(a, b)” not on column “a” alone. We could have duplicate values of column “a” and, thus, so long as we join only on a, we need the many-to-many join.

Observe that, with both indexes, we no longer need any sorts. Moreover, the unique index on T2 does guarantee uniqueness on column “a” so we can now do a one-to-many join. Notice that the MANY-TO-MANY keyword is gone in this example. (There is no ONE-TO-MANY keyword in text showplan.) Notice also that the optimizer switched the order of the inputs to put the unique input T2 first so we can use a one-to-many join.

Note that we now merge on two keys. Even more interesting, we still use the indexes to achieve order including the single column index on T2. How can a single column index on column “a” provide order on two columns “(a, b)”? Recall that the index on T2 is a unique index. We can append any set of columns to the keys of a unique index and still get order. This works because there is only one row for each unique key value. A set of one row is automatically sorted on any additional columns that we append.

Here is another two column join, but now we cannot merge on both columns:

Note that we now process all 1000 rows from both tables; we no longer terminate the join after processing only 668 rows of T2. Because of the outer join, we must read all rows of T2 and NULL extend those that do not join.

Finally, here is an example of a full outer join with no equijoin predicate. (Warning: This example takes much longer to execute than the prior examples.)

select*

from T1 fullouterjoin T2 on T1.a < T2.a

Rows

Executes

666334

1

|--Merge Join(Full Outer Join, , RESIDUAL:([T1].[a]<[T2].[a]))

1000

1

|--Clustered Index Scan(OBJECT:([T2].[T2a]))

1000

1

|--Clustered Index Scan(OBJECT:([T1].[T1ab]))

Note that we have no merge keys for this plan; just a residual predicate.

Up next …

In my next post, I will describe hash join, the third and final physical join operator.

In your pseudo code it appears like every page of both inputs will be read always. Is this true? Or are there optimizations to "seek" to the next possible applicable row (in equijoin). I.e. if input A is values 1..1000 and input B is 500..600, then is there an optimization to "seek" to 500 in input A and start from there?

In SQL 2008, I see Stats I/O output that seems to support this. I.e. even though I see two clustered scans, my stats I/O output doesn't reflect anywhere close to the number of pages for the two inputs. (The two inputs are clustered on their PKs and the equijoin pred is A.pk = B.pk, thus there is no sort operator).

In your example, as soon as input A reaches a value greater than 600, input B will reach the end and SQL Server will stop the join. Thus, SQL Server will NOT read the pages from input A beyond that point. There is no such optimization for the pages at the beginning of the input and this optimization only applies to inner joins. If an outer join requires that all rows from an input be returned, then SQL Server must read all pages of that input.

Yuxiu, you are correct that in this case, the residual predicate is implied by the join predicate (and, thus, is indeed redundant). The extra predicate is harmless (except for some minor performance overhead).