Selectivity

The fraction of rows from the input set of the predicate that satisfy the predicate. More sophisticated selectivity measures are also used to estimate the number of rows produced by joins, DISTINCT, and other operators.

Bart Duncan wrote a nice detailed blog post a while back explaining the difference between density, selectivity and cardinality. In summary, indexes have density, a measure of how unique the left-based column subsets within them are; predicates have selectivity, a measure of what portion of the table they affect; operators have cardinality, a measure of how many rows the operator processes.

Indexes cannot be said to be selective or not, they can only be said to have a high or low density. It is possible for a predicate on a very low density column (unique) to have a very poor selectivity (large percentage of the table affected) Imagine ID > 0 where ID is an int identity column. The column is unique, but the predicate affects the entire table. Low density (which is good), but poor selectivity.

So let’s alter the original question. “If an index has a high density (not very unique, lots of duplicate values), will query operators against it always be index scans rather than index seeks?”

Seek and Scan

Before we go on, I want to quickly look at the main difference between a seek operation and a scan operation.

A seek is an operation which navigates down the index’s b-tree looking for a row or for the start/end of a range of rows. A seek requires a predicate and that predicate must be of the form that can be used as a search argument (SARGable)

A scan is a read of part or all of the leaf level of an index.

High-density indexes

So what exactly is the problem with a high density index? In short, it returns a lot of rows for any predicate filters against it (unless there’s a TOP involved, but let’s ignore those cases here). If the index has a high density (and lets assume for simplicity there’s no data skew here), any predicate using that index automatically has a poor selectivity, it returns a large portion of the table.

If we take as an example a 100 000 row table, with an column called status that has 4 values only, then, assuming that the distribution of those values is equal, a query with a predicate searching for one of those values will read 25000 rows. If we have a nonclustered index on that integer column, it works out that the nonclustered index has 223 pages at the leaf level and is 2 levels deep in total. Given that the four values have equal distribution, an index seek to retrieve the rows for one of those status values will require approximately 57 pages to be read.

Is the index scan better?

The scan will read all the leaf pages, that’s what a scan does (ignoring cases like min, max and top where it can scan and read only part of the index). So if SQL decided to use an index scan because of the high density of the index it will have to read all 100 000 rows on all 223 pages (plus the index root page)

57 pages for the index seek vs 224 pages for the index scan. Looks pretty obvious which is better. To prove that I’m not making things up, let me test this and get actual numbers.

Yup, index seek is better and the one that the optimiser choses if it is allowed to chose.

High density indexes and the clustered index

So why the confusion around index scans on high density indexes? I suspect it’s because of the way the optimiser handles noncovering indexes where the predicates are not selective. This has nothing to do with the efficiency of the seek or scan operators on the nonclustered index though, it’s got to do with the mechanism used for the key lookup.

If a nonclustered index that SQL could use for a query is not covering, then for each row in that resultset it has to do a lookup back to the cluster/heap for the rest of the columns. Those key (or RID) lookups are expensive operations. If there are too many needed then the optimiser switches to a scan, not of the nonclustered index (it would be pointless, it’s still not covering), but of the clustered index because that at least has all the columns needed for the query (it could also switch to a scan of a different nonclustered index if there is one that’s covering but with columns in the wrong order to be seekable)

Summary

In summary, does having a high density nonclustered index result in index scans of that index? No (unless the predicate is not SARGable), however it can result in scans of a different index (probably the cluster) if the index is not covering for a query and that high density index being unused.

Right, I know it’s Friday and everyone’s tired and looking forward to the weekend, but I do need to finish off this indexing section and I’ll try to keep this short and interesting and hopefully keep everyone awake.

There’s no shortage of information available on how to create indexes. Hell, I’ve written a copious amount myself. Most of these many articles however are written from the point of indexing single queries. What you chose for a where clause, what has to go into the include to create the perfect index for this query. Now that’s all well and good, but I’ve never met a system that had only one query per table (maybe there is such a system out there, but I’ve never found it)

So what I’m going to try to do today is address the topic of a strategy for indexing. How to approach indexing, not for a single query, but for the system as a whole. I won’t be able to cover this in-depth, this is material worthy of an entire book chapter, if not an entire book, but I can at least touch on the essential portions.

Now, there’s two main positions that we could be in when considering indexing strategies for an entire system
1) A brand new system that’s still in development
2) An existing system that’s being used actively.

One at a time…

Indexing strategies for a brand new system

Start by choosing a good clustered index. What makes a good clustered index? Well, it depends 🙂

The clustered index is the base, it will affect each and every nonclustered index, and it’s not trivial to change once the system is in use, so chose carefully. I’m not saying another word on the subject of a clustered index, not today.

Welcome back to day 2 of Advanced Indexing. Today we’re going to look at a feature that was added in SQL Server 2008 – filtered indexes.

In versions previous, indexes were always on the entire table. An index would always have the same number of rows as the table it was built on did (which is why COUNT(*) can just scan the smallest index on the table)

With filtered indexes, it’s possible to have an index that’s built on a subset of the rows in the table. The definition for a filtered index contains a WHERE clause predicate that determines if a row in the table will be in the index or not.

This can be a major advantage on really large tables where most queries are only interested in a small fraction of the table. A normal index would be based on the entire table regardless of the fact that most of the table is of no interest, meaning the index would be larger than necessary, deeper than necessary and take up more space than would be ideal. With a filtered index on just the interesting portion of the table, the index size is kept to a minimum, meaning it’s shallower than an index on the entire table and hence more efficient.

There are two main uses of a filtered index:
1) Enforcing moderately complex uniqueness requirements
2) Supporting queries that filter on common subsets of a table

Filtered indexes and unique columns

One very interesting use of filtered indexes is in enforcing uniqueness over portions of a table. One requirement that come up again and again is to have a nullable column that must have unique entries in it, but whose entries are optional. Basically, the column must be unique or null. Sounds easy, but the problem is that a unique index allows only one null. So much for nulls not being equal to anything including other nulls.

Good day everyone and welcome to another week of SQL University. I know we’re getting close to the end of the year and everyone’s looking forward to a nice long vacation soaking up the sun at the beach, but a little bit of attention would be nice. Thank you.

This week is Advanced Indexing, and I mean advanced, none of that selectivity, SARGable, predicate stuff that gets repeated all over the place. If you need a refresher on the basics before we get started, the following can be considered pre-requisite reading for this course

Right, now that the admin has been handled, let’s get straight into things. Nothing like starting at the deep end…

Most people would rightly associate indexes with where clause predicates and joins, after all, the main usage of an index is to reduce the rows in consideration for a query as fast as possible. However there’s another portion of your queries that indexes can, if appropriately designed, help with – grouping and sorting.

Sorting is an extremely expensive operation, especially on large numbers of rows. For the academics in the audience, the algorithmic complexity of sorting is above linear, the time to sort a set of data increases faster than the number of items in the list. The common sorting algorithms have an average time complexity of O(n log n). It’s better than O(n2), but it can still hurt at the higher row counts.

The other reason that sorting hurts is that it needs a memory grant to do the sort. If there’s a shortage of memory the query could have to wait a while for the memory to be granted and, if the optimiser mis-estimates the number of rows to be sorted, the memory grant could be insufficient and the sort would have to spill into TempDB. You don’t want that happening.

Finally, sort is a blocking operator in the execution plan (all rows must have been fetched from the source before any output can be returned), and so the client won’t start seeing rows returned until the entire sort has been completed. This can make the query feel like it takes longer than it really does.

Grouping and aggregation are much the same. To aggregate one set of values based on another set of values, SQL has to get all the like values of the grouping columns together so that it can do the aggregation. That sounds suspiciously like a sort doesn’t it?

So for both sorting and grouping, the query processor’s job would be a lot easier if there was some way that it could get the data ordered by the sorting or grouping columns without having to do the work of actually sorting. Sounds impossible? No.

All of the indexing strategy posts I’ve written in the past have been concerned with predicates combined with ANDs. That’s only one half of the possibilities though. There’s the case of predicates combines with ORs, and the guidelines for indexing that work well with ANDs don’t work with ORs

When dealing with predicates combined with AND, the predicates are cumulative, each one operates to further reduce the resultset.

For this reason, multi-column indexes support multiple predicates combined with AND operators.

Again I’m going to be lazy and get SQLDataGenerator to generate a few rows.

With that two column index on those columns and a query that looks for Surname = ‘Kelley’ AND Name = ‘Rick’, SQL can do a double column seek to go directly to the start of the range then just read down the index to the end of the range, basically until it finds the first row that it’s not interested in.

So how does that that differ when the operator is an OR?

The main difference is that with an OR, the predicates are independent. The second doesn’t serve to reduce the recordset, but rather to expand it. It’s similar to evaluating two separate predicates and combining the result. Let’s have a look at that 2 column index again when the two predicates are combined with an OR.

If we try to use that index to evaluate Surname = ‘Kelley’ OR Name = ‘Rick’, there’s a problem. While the first of those predicates can be evaluated with a seek (it’s a sargable predicate on the left-most column of an index), the second predicate cannot. It’s sargable, but it is on the second column of the index (and for the moment let’s assume there are no other indexes on the table). Seeks are only possible if the predicate filters on a left-based subset of the index key.

Hence to evaluate that predicate SQL will have to do an index scan. Since it has to do a scan to evaluate the one predicate, it won’t bother also doing a seek to evaluate the first predicate as it can also evaluate that during the scan.

I see a lot of advice that talks about the clustered index been the best index for use for range queries, that is queries with inequalities filters, queries that retrieve ranges of rows, as opposed to singleton queries, queries that retrieve single rows (including, unfortunately, a Technet article).

I suspect the reasoning behind this advice is the idea that the clustered index stores the data in order of the clustering key (ack) and hence it’s ‘logical’ that such a structure would be best for range scans as SQL can simply start at the beginning of the range and read sequentially to the end.

The query that I’ll be testing with will do a sum of the SomeValue column for a large range of ID values. That means that of the three indexes that I’m testing, one is clustered, one is a nonclustered that does not cover the query and the third is a covering nonclustered index.

SELECT SUM(SomeValue)
FROM TestingRangeQueries
WHERE ID BETWEEN 20000 and 200000 -- 180 001 rows, 18% of the table

I’m going to run the same range scan query three times, each with an index hint so that SQL will use the three different indexes, regardless of which one it thinks is best.

First up, the clustered index.

As expected, we get a clustered index seek (the predicate is SARGable) and a stream aggregate.

Or “If one index is good, surely many indexes (indexes? indices? indi?) will be better”

This is a question that comes up very often on the forums. Something along the lines of:

I have a query with multiple where clause conditions on a table. Should I create one index for each condition, or one index with all the columns in it?

The question basically boils down to this: Which is more optimal and more likely for the optimiser to pick, a single seek operation against a wide index that seeks on all three conditions in one go, or three seek operations against three indexes followed by a join to get back the final set of rows.

One thing to keep in mind is that one of the jobs of an index is to reduce the number of rows in consideration for a query as early as possible in the query’s execution.

So let’s take a made-up example. Let’s say we have a table with a number of columns in it. A query is run against that table with three conditions in the where clause

WHERE ColA = @A AND ColB = @B AND ColC = @C

Let’s further say that 1000 rows qualify for the condition ColA = @A, 15000 rows qualify for ColB = @B and 30000 rows qualify for ColC = @C. The total number of rows that qualify for all three conditions is 25.

Which sounds like it would be more efficient?

Seek on an index with all three columns and retrieve just 25 rows

Seek on an index on ColA, retrieve 1000 rows, seek on an index on ColB, retrieve 15000 rows, seek on an index on ColC, retrieve 30000 rows then join the three result-sets together to get the desired 25 rows (called an Index Intersection)

This one comes up from time to time, so I thought I’d have a go at addressing it.

Let’s imagine a hypothetical DBA who’s doing some performance tuning. He looks at a query plan for a moderately complex query and panics because there’s a couple of index scans and he wants to rather see index seeks.

Is that correct, are index scans bad and index seeks good?

Well, most of the time yes. Most of the time a scan is a problem and indicates a missing index or a query problem, but there are other times that it’s the most optimal way to get the required rows from the table.

Let’s start with the simplest case, and I’m going to use the AdventureWorks database for these queries.

[source:sql]select ProductID, Name from Production.Product[/source]

In this case I get an index scan on the AK_Product_Name index and that makes perfect sense. I’m asking for all the rows in the table. there is no way that SQL can use a seek to execute that query. For there to be a seek, there has to be a SARGable predicate within the query that can be used for the seek.

Most people who know SQL execution plans will say, without reservation, that an index seek on a particular index is better than an index scan on the same index. In the vast majority of cases, that’s true, but there are times when what appears in the execution plan as an index seek is actually an index scan.

We have a table here with an identity column on it, starting at 1 and incrementing by 1 row. Hence, there will be no negative values in the table. I’m going to then put a nonclustered index on that column (the table has no cluster, it’s a heap)

CREATE NONCLUSTERED INDEX idx_Seek ON TestingSeeks (id)

Fair enough. If I query all the rows in the table and retrieve just the ID column, I’ll get a scan on that index, as is pretty much expected and Statistics IO tells me that 935 pages were read

Well, that’s a rather contentious title. There are probably several people shaking their heads at this point. Let me explain.

I was listening to a podcast with Kimberly Tripp this morning, and she mentioned this briefly. I thought it would be a good discussion to end a short series on indexes and selectivity.

The Clustered Index

A clustered index has to be unique, because the clustering key acts as the row’s location in the table. If the index is not defined as unique, SQL will make it unique by adding a uniquifier, a 4-byte integer that’s hidden behind the scenes and is added when necessary to make the clustered index unique.

It’s not documented anywhere clearly, but it is mentioned in a couple of places. From msdn:

If the clustered index is not a unique index, SQL Server makes any duplicate keys unique by adding an internally generated value called a uniqueifier. This four-byte value is not visible to users. It is only added when required to make the clustered key unique for use in nonclustered indexes. SQL Server retrieves the data row by searching the clustered index using the clustered index key stored in the leaf row of the nonclustered index.