When can allocation order scans be used?

I know this is jumping the gun a little as I haven't made it this far in my series on fragmentation, but this came up in a chalk-talk I did yesterday at TechEd Developers in Barcelona and is worth blogging about.

You'd expect a SELECT * FROM mytable query on a table with a clustered index to use an allocation order scan to return the data, as that's the fastest way to read all the pages at the leaf level of an index - and indeed the query plan will show an unordered clustered index scan. Well, guess again. What I didn't remember, and thanks to SQL Server MVP Maciej Pilecki for pointing this out during my talk (even though I didn't believe him at first), is that an allocation order scan can't be used when there's any possibility of the data changing beneath the scan.

Consider an example where an allocation order scan is progressing and a page that the scan has already read then splits - adding a newly allocated page at the end of the allocation-order list of pages. The scan will eventually come to the new page and then re-read some of the rows it has already read - producing duplicates in the scan output - clearly undesirable.

The only time such a scan will be used is when there's no possibility of the data changing (e.g. when the TABLOCK hint is specified, or when the table is in a read-only database), or when it's explicitly stated that we don't care (e.g. when the NOLOCK hint is specified or under the READ UNCOMMITTED isolation level). As a further twist, there's a trade-off between the setup cost of an allocation order scan and the number of pages that will be read - an allocation order scan will only be used if there are more than 64 pages to be read.

Below I've included a simple script that demonstrates the behavior. For me, the funny thing is that I should have remembered this behavior as it's exactly the same reason why DBCC SHOWCONTIG takes a shared table lock when in the default mode - it's using an allocation order scan and needs to ensure that the data doesn't change and produce duplicates. Oh well - you (re)learn something every day!

-- Drop and recreate the test database
USE master;
DROP DATABASE allocationordertest;
CREATE DATABASE allocationordertest;
USE allocationordertest;
GO

-- Create a simple table that we can fill up quickly, plus a clustered index
CREATE TABLE t1 (c1 INT, c2 VARCHAR(8000));
CREATE CLUSTERED INDEX t1c1 ON t1 (c1);
GO

-- Add some rows, making sure to produce an out-of-order dataset when scanned in allocation order.
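The script was cut off at this point, so here's a minimal sketch of how the demo might continue. The row counts, the REPLICATE padding size, and the interleaved insert order are my assumptions, not the original script - they're chosen so that each row fills a page, the table exceeds 64 pages, and the page splits leave the allocation order different from the key order:

-- Insert the odd keys first; at ~5000 bytes per row, each row fills a page
DECLARE @a INT;
SELECT @a = 1;
WHILE (@a <= 127)
BEGIN
    INSERT INTO t1 VALUES (@a, REPLICATE ('a', 5000));
    SELECT @a = @a + 2;
END;
-- Now the even keys, forcing page splits that allocate new pages
-- at the end of the allocation order list
SELECT @a = 2;
WHILE (@a <= 128)
BEGIN
    INSERT INTO t1 VALUES (@a, REPLICATE ('a', 5000));
    SELECT @a = @a + 2;
END;
GO

-- Ordinary scan: rows come back in key order
SELECT * FROM t1;
GO

-- With TABLOCK the data can't change beneath the scan, so an allocation
-- order scan is possible and the c1 values may come back out of order
SELECT * FROM t1 WITH (TABLOCK);
GO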

I am trying to clarify the whole 64K "stripe size" idea, and I am getting very confused.

I have been told that the 64K size that is recommended for SQL Server really corresponds to the "allocation unit" that you see when you do a chkdsk. For instance, on my laptop, the allocation unit size is 4K. This is the Windows default. I have been told that for SQL to work the "best" this size should be 64K, with an offset of 1 (which can be checked using diskpar).

Now, combine this with the technology of a SAN (EMC for example). When they build RAID groups, they use a stripe size of 128K. Does this make any difference to SQL? Is this in any way related to the 64K allocation unit size that is recommended?

Would you mind shedding some light on this? I have read and Googled my heart out and am still very confused. I have lots of people telling me different things, and none of it is giving me a clear picture. Do you know something I could read that would clear this up?

One thing I am wondering about all the defragmentation talk is that none of the possible ways to "defragment" handles all the types of fragmentation.

alter index reorganize handles pages that are out of order

alter index rebuild rebuilds the index, but it still allocates space wherever there is available space, making it only slightly better than reorganize

dbcc emptyfile doesn't help much either

How can I force it to completely rebuild an index so that it is 100% contiguously allocated on disk?

I have found one way, but it is incredibly annoying to work with: create a new tablespace, drop the index, and recreate it in the new tablespace. For a clustered index, create a new table and move over all the data. And then do it all again the other way around when the old tablespace is empty.

That's about the only way to guarantee 100% contiguous allocation - create a new filegroup and rebuild the index using CREATE INDEX WITH DROP_EXISTING, specifying the new filegroup.
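In T-SQL that looks something like the following, using the t1 table from the script above; the NewFG filegroup name is a placeholder, and the filegroup and its files must be created first:

-- Rebuild the clustered index onto a fresh filegroup so its pages are
-- allocated contiguously (NewFG is an illustrative filegroup name)
CREATE CLUSTERED INDEX t1c1
ON t1 (c1)
WITH (DROP_EXISTING = ON)
ON [NewFG];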

Achieving 100% contiguity is not vital - as long as there are as few fragments as possible relative to the volume of data read during a scan, the interruption to large-I/O readahead will be minimal (and hence scan performance will hardly be affected).

You will not see allocation order scans on any instance that has the server configuration option "cursor threshold" set to a value other than the default of -1. It seems that IAM scans do not play nicely with asynchronously populated cursors. Disabling allocation order scans generally, and not just for cursors that might be populated asynchronously, seems unexpected.