Rob Farley - Owner/Principal with LobsterPot Solutions (a MS Gold Partner consulting firm), Microsoft Certified Master, Microsoft MVP (SQL Server), APS/PDW trainer and leader of the SQL User Group in Adelaide, Australia. Rob is a former director of PASS, and provides consulting and training courses around the world in SQL Server and BI topics.

Probe Residual when you have a Hash Match – a hidden cost in execution plans

No, this post has nothing to do with airport security, and nothing to do with marijuana.

Being honest, this post is only half about Hash Matches and Probe Residuals. It’s more about the types of predicates you can see mentioned in a SQL Server Execution Plan (or Query Plan, but I prefer to call them Execution Plans) – but you may well see some described as a Probe Residual when you look at the properties of a Hash Match operator.

The main point of this post is: Some of these predicates can be really bad, even if they’re part of things which seem really good (like Seeks or Merge Joins).

Let’s consider a join. Two streams of data, of which matching rows must be found. These matching rows will be the ones that satisfy the join conditions, expressed through predicates listed in the ON clause and/or the WHERE clause. In fact, these predicates might only involve one side of the join, such as OrderDate <= '2011-03-22T00:00:00.000'. There are plenty of times when a join condition will incorporate a one-sided predicate like this – imagine a scenario in which matching rows can be easily located in an index, but only those that match an additional condition are allowed to be included.

In a join (ha!) there are these two streams of data, joined by one of a few different operators. This operator could be a Hash Match; it could be a Merge Join. It could even be a Nested Loop, although then the predicates are (generally) handled in the second (lower) data source. In fact, let’s start with that scenario.

When a Seek is performed, there are two main kinds of predicates that can be included. One is the Seek Predicate, and one is simply listed as Predicate. I prefer to call this second one the Residual Predicate. It’s the leftover one, after the Seek has been performed. This often happens when SARGability isn’t possible. SARGability is about being able to use an index effectively (with a seek), so if you have a predicate which doesn’t allow SARGability (for example, being able to consider the last letter of a character string.

You’ll see here that for each Product we find, we look up the Subcategory name for it. We have [s].[ProductSubcategoryID] before the Seek is done (the Nested Loop calls the Seek operator using each one), but although it can quickly seek to the right row(s) involved, it performs an additional check, making sure that the Product Name ends in a 'y'.

The index lets the system immediately search for the rows needed. It can seek to the rows in rf_ix_Product_Subcat_inc that match the SubcategoryID, but this is only half the story. Having applied this Seek Predicate, there’s still the matter of the last letter of the Product Name, which isn’t something that can be checked easily with the index. The values are there, but each row that the Seek finds must be checked individually, with this leftover, or Residual, Predicate.

Hopefully this helps show why I want to call the Predicate here a Residual Predicate.

But this Residual Predicate must be tested on every row that is fetched by the Seek. That could well be a lot of rows, if the Seek isn’t particularly selective.

The Seek might feel really nice, and might be returning very few rows. But the effort could be a lot larger than you expect if most of the work is being done in the Residual Predicate.

The Nested Loop operator pulls rows from the first stream of data, and passes the required values down to the next row of operators, pulling a stream of data which is then simply joined. All the rows that come in from the second stream of data are known to be matches with the row that provided the values, so that there is relatively little work in doing the actual join.

In a Merge Join or Hash Match, things are slightly different. The predicate checking happens in the actual join operator. However, there are still two types of predicates – the main ones and the residuals.

In a Merge Join, the two data streams are ordered by columns appropriate for the join, but there could still be a leftover predicate.

Consider this query (but I’ve dropped the extra index at this point). It’s a very similar query to before, but I’m forcing a Merge Join with a Join Hint, and I’ve thrown in a predicate which is non-SARGable from the perspective of either table. This is part of the clue to the problem – it’s a non-SARGable predicate. Mind you, it could seem perfectly SARGable. It might simply be that the stream of data being used for the Merge Join isn’t ordered by all the columns involved in the join.

When hovering over the Merge Join operator in this query’s plan, we see this tooltip. You’ll notice that it has a section called “Where (join columns)” which shows that the join is being done on the ProductSubcategoryID.

However, the other predicate isn’t mentioned. It’s nowhere in the tooltip. Hitting F4 shows the Properties window, and this is where we find it, in a property called Residual.

Residual: [AdventureWorks].[Production].[Product].[ProductSubcategoryID] as [p].[ProductSubcategoryID]=[AdventureWorks].[Production].[ProductSubcategory].[ProductSubcategoryID] as [s].[ProductSubcategoryID] AND right([AdventureWorks].[Production].[Product].[Name] as [p].[Name],len([AdventureWorks].[Production].[ProductSubcategory].[Name] as [s].[Name]))=N'blah'

Interestingly, it re-checks that the ProductSubcategoryID columns match, but the important thing is that it’s here (and only here) that Residual Predicate is tested.

This Residual Predicate must be tested on every combination of rows that match the ‘Where (join columns)’ predicate. That could well be a lot of rows, if those rows aren’t particularly selective.

With a Hash Match, the join is done by first applying a Hash function to columns involved in the join, using the resultant Hash Key to populate the data (including all the required columns) into a Hash Table. Once that has been done for one stream of data, the second stream is pulled in, and the Hash function applied to the columns from the second stream. The result of each row from the second stream is used in a Probe of the Hash Table. However, predicates which don’t work nicely with the Hash Key concept (such as the ‘blah’ predicate I used earlier) are considered residual.

So when candidate rows are identified via the Probe, the Probe Residual still needs to be tested. Just like the Predicate after the Seek Predicate was done, and the Residual after the Where (join columns) are handled.

This Probe Residual must be tested on every combination of rows that satisfies the Hash Key Probe. That could well be a lot of rows, if that probe isn’t particularly selective.

When you’re trying to tune your query, you need to consider how many rows are being matched by each section of the join.

Imagine with me that you have a Merge Join (or a Hash Match), in which you have a predicate such as p1.ListPrice - p2.Listprice = 0

(Incidentally, this query could use any of the three joins, depending on indexes and other filters. Put an index on ProductSubcategoryID including ListPrice, and then run the query with either no WHERE clause (Hash Match), a WHERE clause for ProductSubcategoryID < 2 (Nested Loop) or < 5 (Merge Join).)

The predicate featuring the ListPrice column is always going to be treated as residual. It’s something that can only be tested once both values are known, and is considered a non-SARGable predicate. Regardless of what type of join is done, the ListPrice predicate is handled as a Residual.

For this query, the answer is hopefully obvious. Rewriting the predicate as p1.ListPrice = p2.ListPrice will resolve it nicely, but an example you have might not be so straightforward.

Residual predicates can be expensive, and the bottleneck of your query might not be obvious from looking at the plan. The mere fact that a Residual in a Merge Join is not shown in the tooltip could mean you miss it significantly. Don’t worry – you’ll be in good company. Plenty of proper experts miss this.

Luckily, the answer is simple. Look at your Seek Predicates, your Where (join columns) and your Hash Keys Probes, and compare this to the Residuals. If the Residuals are needing to be checked for a lot more rows that you’d like, then you have a tuning opportunity you can leverage. Ideally, Residuals only need to be applied on a tiny number of rows.

Remember, a Residual Predicate will feel like a Scan, because it’s not using the Index nicely. Scanning a tiny table might be fine, but scanning a large one could be horrible.