The table scan from hell

Greg Linwood, a fellow SQL Server MVP, has started a series of articles in which he attempts to prove that having a clustered index on each table is not a good practice. However, he fails to take the effects of fragmentation into account, so I decided to run some tests for myself. One of those tests had rather upsetting results.

First a bit of background, for those of you who have never read a book by Kalen Delaney. A table without a clustered index (also known as a “heap”) can be read in two ways: using a bookmark lookup or a table scan.

A bookmark lookup is used when a nonclustered index can first be used to narrow down the number of rows to read. Each entry in the nonclustered index holds a pointer to the row in the heap; this pointer is called the RID (short for Row ID), and consists of a file number, a page number, and a slot number. Using the RID, the storage engine can directly fetch the required page and read the row.
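
To see a bookmark lookup in action, you can force a plan that uses a nonclustered index and watch the execution plan and I/O statistics. The following sketch is mine, not part of the original tests; it assumes the dbo.Persons heap used later in this post, and the index name ix_Persons_Email is hypothetical.

-- Hypothetical index, added only to demonstrate a bookmark (RID) lookup.
CREATE NONCLUSTERED INDEX ix_Persons_Email ON dbo.Persons(Email);
go
SET STATISTICS IO ON;
-- With a selective predicate, the plan shows an Index Seek on the
-- nonclustered index followed by a RID Lookup against the heap.
SELECT *
FROM   dbo.Persons
WHERE  Email = 'brenda20@adventure-works.com';
SET STATISTICS IO OFF;
go
DROP INDEX dbo.Persons.ix_Persons_Email;
go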

A table scan is used when there is no nonclustered index or when statistics indicate that the nonclustered index is not selective enough. A table scan is driven by the IAM (Index Allocation Map), basically a bitmap that indicates which pages in the database are allocated for this table. Table scans tend to be slow, because each page in the table has to be read and each row processed. This is offset somewhat by the fact that the pages are read in order of page number, so if there is no file fragmentation, movement of the disk heads will be minimal and SQL Server can use the highly efficient read-ahead technology rather than the usual random I/O. Or at least, that’s what I (and many others) always assumed.
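
If you want to see which pages the IAM maps for a heap, the undocumented (but widely used) DBCC IND command will list them; this is my addition, not part of the original post, and the database name below is a placeholder.

-- List all pages allocated to dbo.Persons ( -1 = all indexes plus the heap );
-- rows with page type 10 are the IAM page(s) that drive a table scan.
DBCC IND ('TestDB', 'dbo.Persons', -1);
go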

Before moving on to the issue of fragmentation, let’s first see this in action:
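
The original listing is not reproduced here, but a sketch along the following lines matches the description below; it assumes a dbo.Persons heap of roughly 20,000 rows with PersonID and Email columns.

-- Dummy table scan, so that statistics on the Email column are created
-- before the cache is emptied (see the note below).
SELECT COUNT(*)
FROM   dbo.Persons
WHERE  Email LIKE 'zzz%';

-- Flush dirty pages and empty the data cache, so that all reads are physical.
CHECKPOINT;
DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS;

-- Do table scan and show pages read, then show #pages in table.
SET STATISTICS IO ON;
SELECT *
FROM   dbo.Persons
WHERE  Email LIKE 'brenda20@adventure-works.com%';
SET STATISTICS IO OFF;
DBCC CHECKTABLE('dbo.Persons');
go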

In case you are wondering about the “dummy table scan” – on my first tests, I got only logical reads, which was odd considering I had just forced the cache to be cleaned. Fellow MVP Steve Kass found the cause: the first time the Email column is used in a search condition, statistics have to be computed, which is done by scanning the table. The dummy scan ensures that the statistics are created before I clear out the cache.

On my SQL Server 2005 server, I got this output (edited for readability):

There are 149 pages used by the table, and 149 pages are read. This is as expected. I must admit that I don’t know why there are 16 extra read-ahead reads, though.

Updates of rows with variable-length columns are one of the main causes of fragmentation in a heap. SQL Server will try to perform an update in place. This is always possible if the new data occupies less or the same number of bytes as the old data, but if the new data needs more room and there’s not enough space left in the page, the row has to move. SQL Server finds a page with enough free space or allocates a new page, then stores the new data of the row in that page. The data at the original location is replaced with a so-called “forwarding pointer”: a RID that points to the new location of the row. Because of this forwarding pointer, there is no need to change RID pointers to the row in existing nonclustered indexes; when trying to read this row, they can simply follow the forwarding pointer to its new location. And at the new location, a “back pointer” to the original location is added, so that if the row has to move again, the original forwarding pointer can be updated rather than forming an endless chain of forwarding pointers.

Run this code now to simulate how a table might get fragmented over time as its contents are regularly updated. Note that this code does 5,000 random updates; one out of four rows on average will be updated, many rows will never be touched, and a few might even be updated twice or more.

-- Simulate update activity to create fragmentation.
-- Temporarily add index to improve performance.
CREATE UNIQUE NONCLUSTERED INDEX ix_PersonID ON dbo.Persons(PersonID);

-- Do 5,000 random updates to the email column, increasing its length.
DECLARE @cnt int;
SET @cnt = 1;
WHILE @cnt <= 5000
BEGIN;
   SET @cnt = @cnt + 1;
   UPDATE dbo.Persons
   SET    Email = LEFT(Email + Email, 7000)
   WHERE  PersonID = CAST(RAND() * 20000 AS int);
END;

-- Drop the index, we no longer need it.
DROP INDEX dbo.Persons.ix_PersonID;
go
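
To check how many forwarding pointers the simulated updates actually left behind, sys.dm_db_index_physical_stats can be queried in DETAILED mode (its forwarded_record_count column is only populated for heaps). This check is my addition, not part of the original test.

-- Count pages and forwarded records in the dbo.Persons heap (index id 0).
SELECT page_count, forwarded_record_count
FROM   sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Persons'),
                                      0, NULL, 'DETAILED');
go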

There now should be plenty of forwarding pointers in the table. This means that many bookmark lookups will now take two reads instead of just one. But how are table scans affected by these forwarding pointers? I see two possible methods:

1. Follow the IAM from start to finish. Ignore forwarding pointers. Process all rows, both with and without back pointers. Processing is still completely sequential.

2. Follow the IAM from start to finish. When encountering a forwarding pointer, immediately fetch the page that the pointer points to and process that row. When encountering a row with a back pointer, ignore it (it has already been, or will be, processed when the corresponding forwarding pointer is read).

If the table is read WITH (NOLOCK) or in TRANSACTION ISOLATION LEVEL READ UNCOMMITTED, the second method has to be used. Otherwise, a row that is moved during the scan might be included twice or not at all (Paul Randal explains this in a bit more detail in his post about heaps). But in all other cases, method 1 can safely be used since the table lock issued by the table scan effectively prevents any modification while the scan is running.

So let’s see what method SQL Server picks:

-- Do table scan and show pages read, then show #pages in table.
CHECKPOINT;
DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS;
SET STATISTICS IO ON;
SELECT *
FROM   dbo.Persons
WHERE  Email LIKE 'brenda20@adventure-works.com%';
SET STATISTICS IO OFF;
DBCC CHECKTABLE('dbo.Persons');
go

The relevant part of the output on my system (note that numbers vary slightly between runs, due to the use of RAND() in the simulated updates):

The number of pages occupied by the table has grown a fair bit, but that was to be expected after doubling 5,000 email addresses. The troubling part of the output is the number of logical reads. Apparently, the average page in the table is read 10 times! This shows that, even though the default READ COMMITTED isolation level makes it unnecessary, SQL Server will follow method 2.

Reading the same pages over and over will of course significantly impact performance. This gets even worse when the size of the table exceeds the available server memory, so that not all pages fit in the data cache. Since the order in which pages are accessed is completely random (for instance, the first page holds forwarding pointers to pages 6632, 91, 93055, 17, 494, and 220568; the second page holds forwarding pointers to pages 902, 14, 6632, 1905408, and 3, etc.), a requested page will probably just have been removed from cache, so it has to be read from disk once more. Because of the forwarding pointers (and an apparent oversight from Microsoft), a table scan on a heap might come with an amount of physical I/O that is a multiple of the actual size of the table. And there is no easy way to fix or prevent this, since heaps (unlike indexes) cannot be defragmented. The only option is to periodically copy over all data to a new table, as that is the only way to get rid of the fragmentation.
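
The “copy everything to a new table” workaround could look something like the sketch below. The table name is hypothetical, and in practice you would also have to recreate indexes, constraints and permissions, and block modifications while the copy runs.

-- Rebuild the heap by copying all rows into a new table, then swapping names.
SELECT *
INTO   dbo.Persons_defrag
FROM   dbo.Persons;
go
DROP TABLE dbo.Persons;
go
EXEC sp_rename 'dbo.Persons_defrag', 'Persons';
go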

Bottom line: Beware of table scans, and beware of the effects of fragmentation on a heap. Even though I agree with Greg that some tables don’t need a clustered index, those are the exception. So if you choose to use rules of thumb, make sure that having a clustered index for each table is one of them. It might not be optimal, but it’s a lot better than not having a clustered index for any table!

You are right, creating and dropping a clustered index will indeed remove all fragmentation from a heap. I had not thought about that.

Of course, the overhead of creating and removing a clustered index on a big table is enormous - all data has to be moved, and all nonclustered indexes have to be redone as well - and all this twice!! Unless your database has a very long maintenance window, there's no way to do this without significantly impacting performance.
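
For completeness, the create-and-drop approach mentioned above boils down to something like this (the index name is hypothetical): building the clustered index rewrites the table without forwarding pointers, and dropping it again turns it back into a now-defragmented heap. Both steps also rebuild all nonclustered indexes.

-- Temporarily cluster the table to remove heap fragmentation, then drop
-- the clustered index again so the table becomes a heap once more.
CREATE CLUSTERED INDEX cx_Persons_defrag ON dbo.Persons(PersonID);
go
DROP INDEX dbo.Persons.cx_Persons_defrag;
go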

What about a multi-column clustered index? My experience has been exceptional in that M$ has definitely had their stuff in gear. Best practices indicate that we almost never have a single-column clustered index "key". At least that's my experience and my $0.02. I have not run the metrics yet, but I am very impressed with in-place production examples on large-volume databases in both SQL 2000 and 2005 versions. Of course, my production systems rarely exceed 20m records.

The entire table must be rewritten twice: once for the sort and once for the new table. Consider an offline bulk copy out and in instead.

No LOB types are permitted for an online rebuild. So no MAX types, XML, and so on. Quite a restriction.

Nonclustered indexes must be maintained, requiring yet more time and space, including the extra mapping table.

If nonclustered indexes are dropped first, then in what sense is the operation online?

Online index builds almost always result in greater remaining fragmentation than offline builds.

Building a non-unique clustered index will result in uniqueifiers being added to any nonclustered indexes temporarily.

Building a unique clustered index will abort if a user inserts a duplicate value that the online process has not yet processed.

Online index building is not free. In real-life production systems with a heavy OLTP load, online rebuilds are not practical.

These are some of the reasons that "building a clustered index on a heap" will significantly affect performance. There may be others.

Your point about avoiding scans on heaps is a good one, as is the remark about widening updates. However, real-life systems are not always as well designed as we might like. Missing non-clustered indexes, 100% fillfactors, widening updates and so on are all things that will happen. A lot of this is down to old code which was written in a hurry, is complex and hard to maintain, and which frankly people are too scared to touch. In situations like these, adding a clustered index to the heap permanently will make the problems go away. Bear in mind that it has been shown over and over again that heaps are inferior when it comes to INSERT, UPDATE and DELETE performance - the lifeblood of OLTP.

As for a real-life example of widening updates, please feel free to pop across the ditch to see how my workplace uses transactional replication, and how often widening updates occur. This is on a system serving 60M web page impressions per day. Setting a low fill factor to work around the weakness reduces data density - less real data fits in the pages held in the buffer pool, for example.

The heap versus clustered index debate was over a long time ago. Clustered indexes by default, heaps in very rare circumstances, following extensive testing and justification. Sure, both have their weaknesses, but the list is very much longer on one side. Attempting to dismiss the technical inadequacies by saying they are not important or significant is churlish, borderline arrogant. There will be systems out there where these issues are very important indeed, on a daily basis.

The point about overhead on key lookups versus RID lookups is interesting, but ultimately flawed. The extra operations will always be on data already in memory, so the only overhead we need concern ourselves with is CPU. This is cheap and getting cheaper. For the very, very small number of instances where CPU is a problem, there must be so many key lookups that it seems the clustered index must be on the wrong column. Range seeks on the clustered index for very large tables - that is what it is good at. On smaller tables, clustering on an identity column is pretty good practice, overall.

In your comparison with Oracle - I'm afraid it left me comparing the list prices and considering the value of benchmarks. I note that you fail to mention the much larger page size used by Oracle in those tests.

Apologies for the long reply. You're a smart guy Greg, with lots of good insights, knowledge and experience, but this holy war on clustered indexes does you no good at all I'm sad to say.


About Hugo Kornelis

Hugo is co-founder and R&D lead of perFact BV, a Dutch company that strives to improve analysis methods and to develop computer-aided tools that will generate completely functional applications from the analysis deliverable. The chosen platform for this development is SQL Server. In his spare time, Hugo likes to visit the SQL Server newsgroups, in order to share and enhance his knowledge of SQL Server.