SQL University: Advanced indexing – Sorting and Grouping

Good day everyone and welcome to another week of SQL University. I know we’re getting close to the end of the year and everyone’s looking forward to a nice long vacation soaking up the sun at the beach, but a little bit of attention would be nice. Thank you.

This week is Advanced Indexing, and I mean advanced, none of that selectivity, SARGable, predicate stuff that gets repeated all over the place. If you need a refresher on the basics before we get started, the following can be considered pre-requisite reading for this course

Right, now that the admin has been handled, let’s get straight into things. Nothing like starting at the deep end…

Most people would rightly associate indexes with where clause predicates and joins, after all, the main usage of an index is to reduce the rows in consideration for a query as fast as possible. However there’s another portion of your queries that indexes can, if appropriately designed, help with – grouping and sorting.

Sorting is an extremely expensive operation, especially on large numbers of rows. For the academics in the audience, the algorithmic complexity of sorting is above linear, the time to sort a set of data increases faster than the number of items in the list. The common sorting algorithms have an average time complexity of O(n log n). It’s better than O(n2), but it can still hurt at the higher row counts.

The other reason that sorting hurts is that it needs a memory grant to do the sort. If there’s a shortage of memory the query could have to wait a while for the memory to be granted and, if the optimiser mis-estimates the number of rows to be sorted, the memory grant could be insufficient and the sort would have to spill into TempDB. You don’t want that happening.

Finally, sort is a blocking operator in the execution plan (all rows must have been fetched from the source before any output can be returned), and so the client won’t start seeing rows returned until the entire sort has been completed. This can make the query feel like it takes longer than it really does.

Grouping and aggregation are much the same. To aggregate one set of values based on another set of values, SQL has to get all the like values of the grouping columns together so that it can do the aggregation. That sounds suspiciously like a sort doesn’t it?

So for both sorting and grouping, the query processor’s job would be a lot easier if there was some way that it could get the data ordered by the sorting or grouping columns without having to do the work of actually sorting. Sounds impossible? No.

Enter indexes. The b-tree structure of an index makes it great for quickly finding matching rows, but today we’re more interested in the leaf-level of the index than the upper levels. The leaf level of an index is logically ordered by the index key (not physically ordered, let’s stop that myth right here please).

The leaf level of an index is logically ordered by the index key columns. So, if the index leaf level is read, it can return the data in the order of the index key. If that order is what the query processor needs in order to process a sort or grouping, then there’s no need for an expensive sort to be done, the underlying order can be leveraged.

Now, before someone misquotes me and goes off crying out that data is always returned in the order of the index used, no it is not. If there’s no order by on a query, there’s no guarantee of order regardless of indexes. However if there is an order by, the query optimiser and query processor may be able to utilise the underlying index order and avoid the cost of actually sorting the data.

Ok, enough theory for now, let’s look at some practical examples. We’ll use AdventureWorks here and I’m going to use the TransactionHistoryArchive table (in the Product schema)

Firstly a simple (and silly) example:

SELECT ProductID, TransactionDate, ActualCost
FROM Production.TransactionHistoryArchive
ORDER BY ActualCost

This one’s not very realistic, but it’s a nice simple one to start with. Give me all the rows in the table ordered by the ActualCost column. There’s no filter here so many would say that indexes can’t help here, and it’s true that an index can’t help with finding rows (because all are required), but it can help.

421ms of CPU time, with an estimated cost of 93% for the sort. Let’s see if we can make this any better.

If I create an index on ActualCost, that gives SQL the option of scanning the index (scan, because there’s no seek predicate) to retrieve the data in the order of the ActualCost column. I’m going to have to make it a covering index, as there is no way at all that SQL will willingly do key lookups for every single row of the table.

The reads have dropped slightly, because we’re now scanning a nonclustered index which is smaller than the clustered index, but that’s not the main point here. The CPU usage has dropped by around a factor of 8. 421ms down to 47ms. If this was a critical query that ran several times a minute then this change could make a nice improvement to throughput and overall CPU usage.

That was a silly example, let’s try for something a little more complex:

Well, it’s not exactly taking ages on the CPU (it’s a tiny resultset), but this can still be better than it is. The key lookup is there because the existing index is only on two columns – ReferenceOrderID and ReferenceOrderLineID. I can’t modify that index for this query without potentially breaking some other query’s use of it, so I’ll create a new index.

Why TransactionType first? Come back on Friday and I’ll be discussing that.

Plan’s now much simpler, the key lookup has gone, and now the sort is the majority of the cost

The key to getting rid of that sort is to understand that a multi-column index is sorted by the entire key. So, if we have an index on Col1, Col2, then for rows where Col1 has the same value, those rows are listed within the index ordered by Col2. In other words, if we read the leaf level of that index it would look like this:

Col1 Col2
A 1
A 3
A 8
B 4
B 8
B 9
B 14
C 0
\C 1

And so on and so on. Hence, if I filtered that by Col1 = B, the resulting rows could be returned ordered by Col2 with no additional work.

So in the above case, if I add the sort column (TransactionDate) as a key column in the index (it’s currently an Include column), then once the filter on TransactionType and ReferenceOrderID is done, the qualifying rows can be read in order of TransactionDate.

Let’s drop the index we created and create one with TransactionDate as an additional key column.

Let’s look at that simplistic index that I described earlier and see how the inequality would play out.

Col1 Col2
A 1
A 3
A 8
B 4
B 8
B 9
B 14
C 0
C 1

If I filter that for Col1 > ‘A’, are the resultant rows ordered by Col2?

No, because the filter on Col1 is an inequality, a filter that returns multiple different values of Col1, the results aren’t ordered by the second column and hence we can’t use an index here to both support the filter and the order.

Given this one, we could create an index to support the filter and have SQL sort the rows that qualified or we could create an index to support the order by and have SQL scan that and filter out rows that don’t match.

In general, the first option is the one that will be best in the majority of cases. The primary use of an index is to locate rows and filter resultsets. There are cases where the alternative may be appropriate, if the vast majority of the rows in the table qualify for the filter then it may be more optimal to support the sort and let SQL scan and filter. Definitely not the usual case though.

That about wraps up sorts, now for a quick look at group by.

As mentioned earlier, SQL has two ways to process grouping, it can do what is called a Stream Aggregate or it can use a hash table and do a Hash Aggregate. Stream Aggregate requires that the resultset be sorted in the order of the grouping columns. Well, given that fact and that we’ve spend a lot of time showing how to use indexes to support a sort, this should be quick and easy.

Let’s dive straight into some examples, because the theory is the same as for the sort described earlier

SELECT ProductID, SUM(ActualCost) AS TotalCostPerProduct
FROM Production.TransactionHistoryArchive
GROUP BY ProductID

If we want a stream aggregate, then we need to get the rows entering the aggregation ordered by ProductID. SQL is not going to willingly sort the entire resultset (the hash aggregate is cheaper), so the only way we’re going to get a stream aggregate (other than with a hint) is by adding an index so that the data can be read from the index already ordered. There’s no filter here so it’s a straightforward index

Share this:

15 Comments

Are there suppose to be INCLUDE columns in the first 2 indexes? I am not getting a change in the execution plan after running the queries. If I add the additional columns in the query as INCLUDE columns, then it works like yours.

I am guessing my browser (IE 8), does not display the whole CREATE INDEX text, because when I highlight the 3rd create index and copy/paste, the include part shows up. I think it is a display problem with the blog.

I just focused on the problem at hand. In real life, this is not something I would create because the TransactionDate could be greater than any date. But, I wasn’t able to get rid of the SORT operator any other way. Would love to see alternative solutions.

Thanks for clarifying. Sorry I should have paid attention myself.
Here’s the new index
CREATE NONCLUSTERED INDEX idx_TransactionHistoryArchive_2
ON Production.TransactionHistoryArchive (ReferenceOrderID, TransactionDate)
INCLUDE (ActualCost)
no sort, index scan and stream aggregate

SQL will try to put a new index down in physical order (clustered and nonclustered alike), but there’s no guarantee it will stay that way. If the clustered index was always stored in physical order of the index key, a clustered index would always have 0% logical fragmentation, as logical fragmentation is the measure of how badly the logical and physical orders differ.

Since I’m sure you agree that clustered indexes can indeed have a non-zero logical fragmentation, it thereby follows that the clustered index is not always stored in physical order.