
Conventional wisdom would suggest that updating statistics with a full sample provides SQL Server with the optimum information to make the best decisions when forming a query plan, and therefore the best performance.

However, I have a number of queries (produced by Business Objects) which perform better with a sample below 100%, and the exact sample percentage needed for a good plan varies. What I want to know (but am struggling to find online) is why a 100% sample doesn't always produce the best performance. Index maintenance is regular, every night, following Ola Hallengren's method.

The nature of the query makes it difficult to post (it contains sensitive information), but those who are familiar with Business Objects queries will know they can be unwieldy and rather unusual in their construction; I often think it's their construction that hurts performance.

2 Answers

A perfect query optimizer should make the best plan if provided with statistics from a full sample. But even a perfect optimizer does not need the information provided by a full sample to make the best plan. A simple example:

select count(*)
from tab1
where col1 = 3
  and col2 = 4;

Assume that for this table tab1 there exists a nonclustered index on col1 (a minimal schema sketch follows the list below). The optimizer has two possible plans:

1) Scan the whole table and check for each row whether col1 = 3 and col2 = 4.
2) Consult the index to find the positions of all rows in the table with col1 = 3, look up each of these rows in the table, and check whether col2 = 4.
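
As a minimal sketch of such a setup (the table and column names come from the example above; the key and data types are assumptions added for illustration):

create table dbo.tab1
(
    id   bigint identity primary key,  -- assumed clustered key, not part of the original example
    col1 int not null,
    col2 int not null
);

-- the nonclustered index on col1 assumed by the two plans above
create nonclustered index IX_tab1_col1 on dbo.tab1 (col1);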

Now assume the table has a size of 123.456 GB and 1,123,456,789 rows, and 567,890,123 of them (about half) have col1 = 3. What should the optimizer propose? It should propose a full table scan; the other possibility, looking up 567,890,123 rows one by one, is far too expensive.

Now assume the optimizer only has a rough estimate of the table data: the table size is about 100 GB, there are about 1,000,000,000 rows, and about 600,000,000 of them have col1 = 3. Based on this information the optimizer will choose the same good plan as in the first case.

The decision is based on the ratio 100 * (number of rows with col1 = 3) / (number of all rows): if this value is larger than some threshold, say 10% (I don't know which value is really used), the full table is scanned. The decision is not based on the fifth decimal place of that ratio.
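
To make this concrete, plug the numbers from above into that ratio (the 10% tipping point is purely illustrative):

    exact statistics:  100 * 567,890,123 / 1,123,456,789 ≈ 50.55%
    rough estimate:    100 * 600,000,000 / 1,000,000,000 = 60.00%

The two estimates differ considerably, yet both land far on the same side of the threshold, so both lead to the same full-table-scan plan.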

In the special case where the exact ratio is 10.00001% and the ratio based on the rough estimate is 9.99999%, the optimizer will choose different plans; but the execution time may be the same for both plans, or may even be better for the plan based on the rough estimate.

Administrative jobs like calculating statistics should be done with the least resource consumption possible, so they should aim only for the precision that is actually needed.
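
For instance, on the hypothetical tab1 above, a sampled statistics update reads only a fraction of the table, which is usually precise enough:

-- sampled update: cheap, and usually precise enough
update statistics dbo.tab1 with sample 10 percent;

-- full scan: reads the whole table for extra precision that may not matter
update statistics dbo.tab1 with fullscan;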

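A rough sketch of the kind of setup code this walkthrough assumes (the badGuid table and someGuid statistics names come from the DBCC calls below; the data types, row count, and GUID bounds are illustrative guesses):

-- Build a table with a GUID column populated via NEWID()
CREATE TABLE dbo.badGuid
(
    id       INT IDENTITY PRIMARY KEY,
    someGuid UNIQUEIDENTIFIER NOT NULL
);

INSERT INTO dbo.badGuid (someGuid)
SELECT TOP (1000000) NEWID()
FROM sys.all_objects a
CROSS JOIN sys.all_objects b;

-- Create a statistics object named someGuid using the default (sampled) scan
CREATE STATISTICS someGuid ON dbo.badGuid (someGuid);

-- Have a look at the stats
DBCC SHOW_STATISTICS(badGuid, someGuid)
GO

-- A range query of the shape discussed below; the GUID bounds are
-- placeholders, not real histogram step boundaries
SELECT COUNT(*)
FROM dbo.badGuid
WHERE someGuid BETWEEN '20000000-0000-0000-0000-000000000000'
                   AND '40000000-0000-0000-0000-000000000000';
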
You can see the sampled histogram has sampled ~360k records and consolidated them into 8 steps, whereas the full-sample histogram has only 2 steps (see below). You will get slightly different results if you run the above code, as the NEWID() distribution is random, but hopefully you get the idea.

If I write a query before the full sample, say to get all records in the 3rd step of the histogram, I get an estimated rowcount of 29,329.6, which looks like a calculation using 50% of the RANGE_ROWS or DISTINCT_RANGE_ROWS for the step, plus 1 for the EQ_ROWS, plus another mystery 1 for something else.

If I now update the stats with a fullscan I get a very different histogram and estimated rowcount:

UPDATE STATISTICS badGuid WITH FULLSCAN
GO
-- Have a look at the stats again
DBCC SHOW_STATISTICS(badGuid, someGuid)
GO

So this is a slightly contrived example, as it's unlikely you would use the BETWEEN operator with GUIDs, but we can see that the statistics behave differently under a full scan, and how this can lead to marked differences in estimated rowcounts. Inaccurate rowcounts can proliferate through a plan, producing even more inaccurate rowcounts higher up, especially in more complex queries.