Index columns, selectivity and equality predicates

There’s a common piece of advice given about columns in an index key that says that the most selective column should go first. I’m not going to say that’s incorrect, because it’s not. The problem is that it’s often given without any explanation as to why the most selective column should go first, nor are the other considerations for index key order mentioned.

This can lead to misunderstandings like, in the extreme case, where one person after hearing that advice went and added the primary key column as the leading column of every single nonclustered index (because it’s highly selective), and then wondered why his database performance decreased dramatically.

The comment about selectivity is because of the way SQL keeps statistics on indexes (see my post on statistics for more info on what they are). SQL only keeps the histogram for the first column of the index. That means that it only knows the actual distribution of values of the first column. If the first column is not selective, the index may not be used. However, that’s not the whole story.

SQL also, in addition to the histogram, keeps density values for all of the left-based subsets of the index keys. So, for a 3 column index key, SQL knows the density of the first column, of the first and second and of all three. The density is, in a nutshell, a value that shows how unique the set of columns is. It’s 1/(distinct values). The value can be seen for any index using DBCC Show_Statistics with the DENSITY_VECTOR option.

This means, while SQL only knows the actual data distribution of the first column, it does know, on average, how many rows will be returned by an equality match on any left-based subset of the index keys

So, what’s my rule for the order of columns in an index key? Put the most selective columns first, when all other considerations are equal.

Say this table has 10000 rows. It’s a heap, no clustered index. Let’s further say that there are 100 different values for Somestring, and 5000 different values for SomeDate. ID, since it’s an identity, is unique.

There’s a single nonclustered index on all three columns, in the order ID, SomeDate, SomeString.

Considering equalities first, this index can only used to seek for queries of the following forms:
… WHERE ID = @ID AND SomeDate = @dt AND SomeString = @str
… WHERE ID = @ID AND SomeDate = @dt
… WHERE ID = @ID

In other words, it’s only useful when the columns in the where/join are a left-based subset of the index key columns.

It cannot be used to seek on by a query that only filters on SomeDate, only on SomeString or the combination of the two. It’s like using the phonebook (which is an index ordered by surname, first initial) to find all the people whose firstname starts with M. It’s possible, but only by reading the entire book, i.e. by doing a scan.

It’s useless to have the most selective column of the index on the left if very few queries filter on it. Queries that don’t filter on it, but do filter on the other columns of the index will have to scan, and scans are expensive.

If it turns out that there’s more than one column that can be first, based on the queries that run against the table, or if all queries do equality matches on two or more columns, then put the most selective one first so that SQL has more chance to know that the index is useful. Otherwise, put the columns that are more often filtered on earlier in the index, so that the index can be used by more queries.

Let’s take a look at some query scenarios based on the hypothetical table above to see how that index will be used.

Scenario 1: Equality match on the ID column

This one’s the simplest. Since it’s a direct equality match on the leading column of an index, it’s a single seek operation to find the row.

Scenario 2: Equality match on the ID and Date columns

This one’s also simple. Since it’s a direct equality match on a left-based subset of the index columns, it’s a single seek operation to find the row.

Scenario 3: Equality match on the ID and String columns

This one’s a little harder. Only the filter on ID can be done as part of the index seek because the string column is not the second column in the index. The date is, and this query’s not filtering on the date. Hence this will require a seek operation on ID and then the string column checked to see if it matches. It’s no longer a single seek operation, even though the second predicate is done by the seek operator. SQL has to seek to find the matching ID and then compare the value of the string column to see if it matches.

Scenario 4: Equality match on the Date and String Columns

In this case, SQL can’t do a seek at all. The leading column of the index is not used in the where clause and, as such, the only way to satisfy this query is to scan. In fact, SQL decides to do a table scan and evaluate each of the rows in the table against the values specified for the date and string columns

I think that’s pretty much enough on indexes and equality predicates. I know I’ve probably muddied the water more than clarifying it, but I hope that I’ve given at least someone a better idea of how indexes work and are used for equality matches. Either later this week or next week (work permitting) I’ll take a look at some more complicated scenarios, that of inequality predicates

I’ve been using SQL Server for about 5 years, and have found that every resource I’ve encountered on indexing always sort of glosses over how to actually do it. I was still left asking “but what are the rules?”. Your 24 hours of Pass talk on Effective Indexing was the first time I took a page of notes that actually summarised some key rules – technicques and rules of thumb for equalities vs inequalities, left based sets, group by and order by etc. Thanks for laying out some really useful indexing guidlines!

I have created created partitioning with non clustered index on date time column. If verify partitioning by using object explorer
(storage in particular table properties ), it showing , table is not partitioned. How is the best way to find partitioned or not.
thanks.
RR

I was wondering if you could answer a few queries I have, please? You said in this post:

There’s a single nonclustered index on all three columns, in the order ID, SomeDate, SomeString.

Considering equalities first, this index can only used to seek for queries of the following forms:
… WHERE ID = @ID AND SomeDate = @dt AND SomeString = @str
… WHERE ID = @ID AND SomeDate = @dt
… WHERE ID = @ID

In other words, it’s only useful when the columns in the where/join are a left-based subset of the index key columns.

If you had a WHERE predicate that includes the first and last column of the index, like this,

… WHERE ID = @ID AND SomeString = @str

Would the optimiser use a nonclustered index seek (assuming that each column is highly selective and the rows returned are highly selective)?

Also, if you have a nonclustered index containing, say, 7 highly selective columns and have a WHERE predicate for the 2nd and the 3rd columns, will nonclustered index seek be used?