Fun with Aggregates

There are interesting things to be learned from even the simplest queries. For example, imagine you are given the task of writing a query to list AdventureWorks product names where the product has at least one entry in the transaction history table, but fewer than ten.

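For concreteness, one natural way to write that requirement in T-SQL (a reconstruction for illustration, assuming the standard AdventureWorks schema) is:

```sql
SELECT p.Name
FROM Production.Product AS p
JOIN Production.TransactionHistory AS th
    ON th.ProductID = p.ProductID
GROUP BY p.Name
HAVING COUNT_BIG(*) < 10;
```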
The execution plan looks a bit different from the written form of the query: the base tables are accessed in reverse order, and the aggregation is performed before the join. The general idea is to read all rows from the history table, compute the count of rows grouped by ProductID, merge join the results to the Product table on ProductID, and finally filter to only return rows where the count is less than ten.

This ‘fully-optimized’ plan has an estimated cost of around 0.33 units. The reason for the quote marks there is that this plan is not quite as optimal as it could be – surely it would make sense to push the Filter down past the join too? To answer that, let’s look at some other ways to formulate this query. This being SQL, there are any number of ways to write logically-equivalent query specifications, so we’ll just look at a couple of interesting ones. The first query is an attempt to reverse-engineer T-SQL from the optimized query plan shown above. It joins the result of pre-aggregating the history table to the Product table before filtering:

SELECT p.Name
FROM
(
    SELECT
        th.ProductID,
        cnt = COUNT_BIG(*)
    FROM Production.TransactionHistory AS th
    GROUP BY
        th.ProductID
) AS q1
JOIN Production.Product AS p
    ON p.ProductID = q1.ProductID
WHERE
    q1.cnt < 10;

Perhaps a little surprisingly, we get a slightly different execution plan:

The results are the same (23 rows) but this time the Filter is pushed below the join! The optimizer chooses nested loops for the join, because the cardinality estimate for rows passing the Filter is a bit low (estimate 1 versus 23 actual), though you can force a merge join with a hint and the Filter still appears below the join. In yet another variation, the < 10 predicate can be ‘manually pushed’ by specifying it in a HAVING clause in the “q1” sub-query instead of in the WHERE clause as written above.
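As a sketch of that last variation (my phrasing, not shown above), moving the predicate into a HAVING clause inside the sub-query looks like this:

```sql
SELECT p.Name
FROM
(
    SELECT
        th.ProductID,
        cnt = COUNT_BIG(*)
    FROM Production.TransactionHistory AS th
    GROUP BY th.ProductID
    HAVING COUNT_BIG(*) < 10
) AS q1
JOIN Production.Product AS p
    ON p.ProductID = q1.ProductID;
```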

The reason this predicate can be pushed past the join in this query form, but not in the original formulation, is simply an optimizer limitation – the optimizer does make efforts (primarily during the simplification phase) to encourage logically-equivalent query specifications to produce the same execution plan, but the implementation is not completely comprehensive.

Moving on to a second example, the following query specification results from phrasing the requirement as “list the products where there exist fewer than ten correlated rows in the history table”:

SELECT p.Name
FROM Production.Product AS p
WHERE EXISTS
(
    SELECT *
    FROM Production.TransactionHistory AS th
    WHERE th.ProductID = p.ProductID
    HAVING COUNT_BIG(*) < 10
);

Unfortunately, this query produces an incorrect result (86 rows):

The problem is that it lists products with no history rows, though the reasons are interesting. The COUNT_BIG(*) in the EXISTS clause is a scalar aggregate (meaning there is no GROUP BY clause) and scalar aggregates always produce a value, even when the input is an empty set. In the case of the COUNT aggregate, the result of aggregating the empty set is zero (the other standard aggregates produce a NULL). To make the point really clear, let’s look at product 709, which happens to be one for which no history rows exist:

-- Scalar aggregate
SELECT COUNT_BIG(*)
FROM Production.TransactionHistory AS th
WHERE th.ProductID = 709;

-- Vector aggregate
SELECT COUNT_BIG(*)
FROM Production.TransactionHistory AS th
WHERE th.ProductID = 709
GROUP BY th.ProductID;

The estimated execution plans for these two statements are almost identical:

You might expect the Stream Aggregate to have a Group By for the second statement, but this is not the case. The query includes an equality comparison to a constant value (709), so all qualified rows are guaranteed to have the same value for ProductID and the Group By is optimized away.

In fact there are some minor differences between the two plans (the first is auto-parameterized and qualifies for trivial plan, whereas the second is not auto-parameterized and requires cost-based optimization), but there is nothing to indicate that one is a scalar aggregate and the other is a vector aggregate. This is something I would like to see exposed in showplan output, so I suggested it on Connect. Anyway, the results of running the two queries show the difference at runtime:

The scalar aggregate (no GROUP BY) returns a result of zero, whereas the vector aggregate (with a GROUP BY clause) returns nothing at all. Returning to our EXISTS query, we could ‘fix’ it by changing the HAVING clause to reject rows where the scalar aggregate returns zero:

SELECT p.Name
FROM Production.Product AS p
WHERE EXISTS
(
    SELECT *
    FROM Production.TransactionHistory AS th
    WHERE th.ProductID = p.ProductID
    HAVING COUNT_BIG(*) BETWEEN 1 AND 9
);

The query now returns the correct 23 rows:

Unfortunately, the execution plan is less efficient now – it has an estimated cost of 0.78 compared to 0.33 for the earlier plans. Let’s try adding a redundant GROUP BY instead of changing the HAVING clause:

SELECT p.Name
FROM Production.Product AS p
WHERE EXISTS
(
    SELECT *
    FROM Production.TransactionHistory AS th
    WHERE th.ProductID = p.ProductID
    GROUP BY th.ProductID
    HAVING COUNT_BIG(*) < 10
);

Not only do we now get correct results (23 rows), this is the execution plan:

I like to compare that plan to quantum physics: if you don’t find it shocking, you haven’t understood it properly :) The simple addition of a redundant GROUP BY has resulted in the EXISTS form of the query being transformed into exactly the same optimal plan we found earlier. What’s more, in SQL Server 2008 and later, we can replace the odd-looking GROUP BY with an explicit GROUP BY on the empty set:

SELECT p.Name
FROM Production.Product AS p
WHERE EXISTS
(
    SELECT *
    FROM Production.TransactionHistory AS th
    WHERE th.ProductID = p.ProductID
    GROUP BY ()
    HAVING COUNT_BIG(*) < 10
);

I offer that as an alternative because some people find it more intuitive (and it perhaps has more geek value too). Whichever way you prefer, it’s rather satisfying to note that the result of the sub-query does not exist for a particular correlated value where a vector aggregate is used (the scalar COUNT aggregate always returns a value, even if zero, so it always ‘EXISTS’ regardless of which ProductID is logically being evaluated).

The following query forms also produce the optimal plan and correct results, so long as a vector aggregate is used (you can probably find more equivalent query forms):

WHERE Clause

SELECT p.Name
FROM Production.Product AS p
WHERE
(
    SELECT COUNT_BIG(*)
    FROM Production.TransactionHistory AS th
    WHERE th.ProductID = p.ProductID
    GROUP BY ()
) < 10;

APPLY

SELECT p.Name
FROM Production.Product AS p
CROSS APPLY
(
    SELECT NULL
    FROM Production.TransactionHistory AS th
    WHERE th.ProductID = p.ProductID
    GROUP BY ()
    HAVING COUNT_BIG(*) < 10
) AS ca (dummy);

FROM Clause

SELECT q1.Name
FROM
(
    SELECT
        p.Name,
        cnt =
        (
            SELECT COUNT_BIG(*)
            FROM Production.TransactionHistory AS th
            WHERE th.ProductID = p.ProductID
            GROUP BY ()
        )
    FROM Production.Product AS p
) AS q1
WHERE
    q1.cnt < 10;

This last example uses SUM(1) instead of COUNT and does not require a vector aggregate…you should be able to work out why :)

SELECT q.Name
FROM
(
    SELECT
        p.Name,
        cnt =
        (
            SELECT SUM(1)
            FROM Production.TransactionHistory AS th
            WHERE th.ProductID = p.ProductID
        )
    FROM Production.Product AS p
) AS q
WHERE q.cnt < 10;

The semantics of SQL aggregates are rather odd in places. It definitely pays to get to know the rules, and to be careful to check whether your queries are using scalar or vector aggregates. As we have seen, query plans do not show in which ‘mode’ an aggregate is running, and getting it wrong can cause poor performance, wrong results, or both.

I'm not sure I see what you mean about never skipping the GROUP BY clause. There are certainly times when a scalar aggregate is what is required (counting all the rows in a table for example). There are also times when the scalar semantic is necessary for correct results. It's all about being aware of the differences.

I really enjoyed reading your post. I think that with SUM(1) the cardinality estimate is calculated as the ‘All density’ for ProductID * number of rows after stream aggregation, i.e. 0.002267574 * 441 ≈ 1. That’s fine, but I wonder why there is a difference between COUNT_BIG and SUM with regard to cardinality estimation. I tried with COUNT(1) but still got the same estimates.

I enjoyed reading your post, but I think the HAVING clause is redundant. As such, it should be eliminated – that would simplify and improve SQL. There would be fewer things to learn, and fewer chances to shoot ourselves in the foot.

Thanks. It's the other way around: the cardinality estimate for COUNT is derived from statistics but SUM(1) results in a 30% guess.

Anyway, the reason for the difference is just that cardinality estimation knows how to predict a value for COUNT, but not for SUM(1), so it guesses.

I suppose cardinality estimation could treat the special case of SUM(1) as COUNT (though they produce different results on empty sets) but it doesn't. More generally, SUM(1) is just a special case of SUM(constant) and there's no sensible way to handle statistics given a comparison on that result.

Finally, COUNT(1) and COUNT(*) are the same. When you write COUNT(1) in a query, the query plan will show the aggregate as count star.

Good to hear from you. I'm starting to think I over-emphasised the HAVING examples :) Forgive me if I am misunderstanding you, but this post is not about HAVING - it is about the subtle semantic differences between aggregate queries with and without GROUP BY:

-- Try with the GROUP BY commented and uncommented
-- 23 versus 86 rows
-- No HAVING in sight!
SELECT p.Name
FROM Production.Product AS p
WHERE
(
    SELECT COUNT_BIG(*)
    FROM Production.TransactionHistory AS th
    WHERE th.ProductID = p.ProductID
    --GROUP BY p.ProductID
) < 10;

Anyway, yes I agree HAVING is pure sugar; we can always replace it with a filter over a derived table. Nevertheless, it is standard SQL and hardly likely to be removed, whatever I think about it.
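To illustrate that equivalence (a sketch of my own, not part of the original exchange), any HAVING clause can be rewritten as a WHERE over a derived table containing the aggregate:

```sql
-- HAVING form
SELECT th.ProductID
FROM Production.TransactionHistory AS th
GROUP BY th.ProductID
HAVING COUNT_BIG(*) < 10;

-- Equivalent filter over a derived table
SELECT q.ProductID
FROM
(
    SELECT th.ProductID, cnt = COUNT_BIG(*)
    FROM Production.TransactionHistory AS th
    GROUP BY th.ProductID
) AS q
WHERE q.cnt < 10;
```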