Geek City: Accessing Distribution Statistics

Distribution statistics are one of the most important sources of information that the Query Optimizer uses to determine a good query plan. In this post, I’m not going to tell you everything about distribution statistics. I’m just going to show you a few tricks for getting access to the statistics.

Microsoft does provide a tool for examining distribution statistics: DBCC SHOW_STATISTICS.

Microsoft has gradually been making more and more of the old DBCC commands available as DMVs, even some undocumented ones. For example, one of my favorites, DBCC IND, has now been replaced in SQL Server 2012 by sys.dm_db_database_page_allocations.

I have been wishing for several versions that Microsoft would make the DBCC SHOW_STATISTICS information available as a DMV. But it hasn’t happened yet, and I’m tired of waiting, so I decided to do something about it.

My solution is not quite as easy to use as a DMV might be, but it allows you to get the information that DBCC SHOW_STATISTICS provides into a set of three tables that can then be saved into a more permanent location of your choice, and/or queried as desired.

DBCC SHOW_STATISTICS returns three sets of information, with different columns in the output, so three different tables are needed. The DBCC SHOW_STATISTICS command can be called with an argument that specifies that you just want one of the three sets returned. The options are:

WITH STAT_HEADER – returns basic info such as the last update date and the number of rows in the table/index. It also reports the number of steps returned in the HISTOGRAM section.

WITH DENSITY_VECTOR – returns density info for each left-based subset of columns in the index. For example, an index on (lastname, firstname, city) would have a density value for (lastname), for (lastname, firstname), and for (lastname, firstname, city). Each density value is a single number equal to 1 divided by the number of distinct values in the column (or combination of columns), so it represents the average fraction of rows sharing any one value. For example, if there are only 2 possible values in the column, the density would be 0.5. Multiplying the density by the number of rows reported in the STAT_HEADER section gives the average number of rows you'd expect a query to return for an equality predicate on the specified column(s).

WITH HISTOGRAM – returns a set of ordered values from the first column of the index, creating a histogram. This histogram provides the optimizer with selectivity information for specific values or ranges of values in the first column of the index.
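For example, each section can be requested individually for the index used later in this post:

```sql
-- Each call returns only the named section of the statistics output
DBCC SHOW_STATISTICS ('Sales.SalesOrderDetail', 'IX_SalesOrderDetail_ProductID')
    WITH STAT_HEADER;
DBCC SHOW_STATISTICS ('Sales.SalesOrderDetail', 'IX_SalesOrderDetail_ProductID')
    WITH DENSITY_VECTOR;
DBCC SHOW_STATISTICS ('Sales.SalesOrderDetail', 'IX_SalesOrderDetail_ProductID')
    WITH HISTOGRAM;
```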

To collect this info, I will use one of my favorite tricks, which is to create a table in the master database with a name starting with sp_. (I’ve written about this trick several times, including in this earlier blog post.) Once I have the table(s) created, I can access them from any database. So here are the three tables:
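Here is a sketch of how the three tables can be defined. The column names and types below are my best approximation of the DBCC SHOW_STATISTICS output (the sql_variant for the histogram key is so that keys of any data type can be stored; the last two STAT_HEADER columns apply to SQL Server 2008 and later):

```sql
USE master;
GO
-- Matches the STAT_HEADER result set
CREATE TABLE sp_stat_header
(   [Name]              sysname,
    Updated             datetime,
    [Rows]              bigint,
    Rows_Sampled        bigint,
    Steps               smallint,
    Density             float,
    Average_Key_Length  float,
    String_Index        varchar(10),
    Filter_Expression   nvarchar(max),   -- SQL Server 2008 and later
    Unfiltered_Rows     bigint           -- SQL Server 2008 and later
);
GO
-- Matches the DENSITY_VECTOR result set
CREATE TABLE sp_density_vector
(   All_Density         float,
    Average_Length      float,
    [Columns]           nvarchar(4000)
);
GO
-- Matches the HISTOGRAM result set
CREATE TABLE sp_histogram
(   RANGE_HI_KEY        sql_variant,
    RANGE_ROWS          float,
    EQ_ROWS             float,
    DISTINCT_RANGE_ROWS float,
    AVG_RANGE_ROWS      float
);
GO
```

Note that INSERT … EXEC requires the target table's columns to line up exactly with the result set, so adjust the column lists if your version of SQL Server returns different output.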

The second trick is to use INSERT … EXEC to execute a DBCC statement and populate the tables. I will build the DBCC command dynamically, after capturing the schema, table and index names in variables. You of course could take this code and turn it into a stored procedure, for which the schema, table and index names are passed as parameters. I’ll use as an example a table in the AdventureWorks2008 sample database, just so you can try running the code, and I can verify that it actually works!

I will use the table Sales.SalesOrderDetail and the index IX_SalesOrderDetail_ProductID. So the object name (@oname) is SalesOrderDetail, the schema name (@sname) is Sales, and the index name (@iname) is IX_SalesOrderDetail_ProductID.
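A sketch of the population step, assuming the three sp_ tables (sp_stat_header, sp_density_vector, sp_histogram) have already been created in master with columns matching the DBCC output:

```sql
DECLARE @sname sysname = N'Sales';
DECLARE @oname sysname = N'SalesOrderDetail';
DECLARE @iname sysname = N'IX_SalesOrderDetail_ProductID';

-- Build the common part of the DBCC command dynamically
DECLARE @dbcc nvarchar(500);
SET @dbcc = N'DBCC SHOW_STATISTICS ('''
          + QUOTENAME(@sname) + N'.' + QUOTENAME(@oname)
          + N''', ''' + @iname + N''') WITH ';

-- INSERT ... EXEC captures each DBCC result set into the matching table
INSERT INTO sp_stat_header
    EXEC (@dbcc + N'STAT_HEADER');
INSERT INTO sp_density_vector
    EXEC (@dbcc + N'DENSITY_VECTOR');
INSERT INTO sp_histogram
    EXEC (@dbcc + N'HISTOGRAM');
```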

So now you can look at the values collected, filtering or querying them in any way you like, or use SELECT INTO to save them into another table, so the sp_ tables can be reused the next time you want to capture distribution statistics information.

SELECT * FROM sp_stat_header;

SELECT * FROM sp_density_vector;

SELECT * FROM sp_histogram;
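Once populated, the tables can also be combined. For instance, the density-times-rowcount estimate described earlier can be computed directly (a sketch; the All_Density, Columns, and Rows column names are my assumed names for the captured DBCC output):

```sql
-- Estimated rows returned by an equality predicate on each
-- left-based column subset: All_Density * table rowcount
SELECT dv.[Columns],
       dv.All_Density * sh.[Rows] AS Estimated_Rows_Per_Value
FROM sp_density_vector AS dv
CROSS JOIN sp_stat_header AS sh;
```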

Let me know if you find this useful, and especially if you embellish it to create a procedure or an automated process of your own!


Comments

I’m guessing you already know, but SQL 2008 R2 introduces a new DMV, sys.dm_db_stats_properties, that has almost all of the same properties that DBCC SHOW_STATISTICS WITH STAT_HEADER returns. The DMV is missing the density (which I believe is no longer used by the optimizer), the string index flag, the average key length, and the filter expression. The filter expression can be picked up from the filter_definition column of sys.stats. It’s not a perfect fit, but if you can do without the average key length and the string index flag, it’s a nice alternative. You can also pull in the average record length for level 0 of the index behind a statistic, if it has one, from the sys.dm_db_index_physical_stats DMV using the DETAILED mode. It’s not exactly the same thing, but it may give some insight into the statistic. This combination has been getting me through a high-level look at statistics for a database when I’m having performance problems.

I like your solution as it solves the issue for getting some of the more detailed statistics information.

Thanks sfibich... I actually just found out about the new DMV when this issue of programmatically accessing stats info was being discussed on a private forum. Someone asked why we couldn't use sys.dm_db_stats_properties, and that was the first time I saw it! But as you say, it doesn't have everything.

Thank you! I forgot about the INSERT INTO preceding a command feature.

This is really helpful for dynamically identifying cardinality for columns, as well as identifying the top 200 most common values in each column. Since I have auto statistics on, I just pass the statistic name into DBCC and I get the generated stats on every column of every table.

Hi Jeff, I'm glad you found this useful. Keep in mind that the 200 steps in the histogram are very unlikely to be the 200 most common values. They are sample points spread throughout the ordered key set that give us useful information on the intervals.

Thanks Steve... who would thunk average key length needed to be more than 32K bytes... but then of course this can keep track of any stats, not just stats on indexed columns. So we could have stats on a varchar(max) or some such.