Statistics Sample Rates

I had a question about sample rates for statistics yesterday for SQL Server 200x statistics. Here is my basic rule-of-thumb.

1. If you have no issues with statistics or query plan quality, then you can leave it alone and not think about it.

2. If you have data with very "spikey" distributions that is not caught by the default sample rate, then you should consider a higher sample rate.

The core reason for this is that the sampling algorithm for SQL Server is not entirely "random". First, it samples pages and then uses all rows on the page. Second, it will actually sample the same pages each time, mostly to retain sanity within the test team here at Microsoft. So, it is possible that the default sample rate will sample pages that do not contain all of the interesting rows that define the "spikes" in your data distribution. A higher sample rate can capture more of these rows, although it will be at a higher cost. the async statistics update feature in SQL 2005 can help reduce the perception of that cost for some applications by moving autostats updates to a background job queue.

Do you think the fact that the same pages are chosen hurts performance? I guess if the stats generated would be different enough to cause different plans to be chosen, making the page choice more random could hurt or help performance whenever the stats are updated. This could affect the sanity of more than just testers.

I suspect it’s not a problem doing it the way it is.

But, if there’s an advantage to having them change each time, maybe you could add an optional seed value so that testers and users who prefer stability could ensure repeatability.

Do I think that sampling the same pages hurts performance? Well, I think it reduces the quality of the sample, which can lead to lower quality query plans, and those can be faster or slower. So, I think that it leads to slightly lower guarantees about the average quality of a query plan. That can manifest as slower performance. To be honest, we usually see that the sample _rate_ is the bigger issue, but that could partially be a result of the data on the sampled pages not being representative instead of just "not enough pages sampled".