Build a Generic Histogram Generator for SQL Server

Histograms help people analyze large amounts of data, whether you display them as tables or as charts. This article shows you how to do both.

by Michael Sorens

Jul 3, 2008

Page 1 of 7

he histogram is a mathematical tool that is invaluable when analyzing large quantities of data. A histogram may be represented in a table of numbers or in a chart. Because databases tend to hold large quantities of data, it follows that a database interface that leverages histograms would be a useful tool to have in your analysis toolbox. That's exactly what you'll see here—how to generate complex histograms with a single line of T-SQL code. The T-SQL procedure provides tabular output within the confines of a DB query tool (e.g., SQL Server Management Studio or Query Analyzer). As just one example of a way to convert the tabular data to a graphical representation, you'll also see how to hook up the output of the stored procedure to the input of a chart within Excel.

Wikipedia's histogram overview defines a histogram as "a mapping. . . that counts the number of observations that fall into various disjoint categories." The Wikipedia article provides both a tabular and a graphical representation of a simple example: measuring the commute time for US workers. Table 1 shows a subset of the Wikipedia table that is the most relevant portion for this discussion. The first column is the list of disjoint categories, also referred to as bins or buckets. The second column shows the count of the workers (in thousands) corresponding to each bucket. From census data, the article author determined that 4.18 million people had commute times of 5 minutes or less; 13.687 million had commute times between 5 and 10 minutes; 18.618 million had commute times between 10 and 15 minutes; and so forth.

Table 1. Sample Histogram: A histogram showing the distribution of commute times for US workers in 5-minute buckets.

Minutes

Workers (in thousands)

0

4180

5

13687

10

18618

15

19634

20

17981

25

7190

30

16369

35

3212

40

4122

For this article, I want to produce a histogram to show the same data in a different way. Table 2 shows exactly the same data but with the X values transposed with the Y values, so each column defines a bucket instead of each row defining a bucket. This format has several advantages:

Column-wise buckets visually map to a graphical representation more directly. Typically a histogram represented as a graph shows each bucket as a vertical bar in a bar chart. So each column in Table 2 corresponds directly to each bar in its corresponding graph.

Column-wise buckets map to a single-row result set from a database query. This provides the flexibility to show either a summary histogram—which I define as a single-row result set—or a detail histogram containing more than one row. A detail histogram for this same data set might, for example, enumerate age groups of workers. Perhaps the first row would represent workers under age 20, the next row workers over 20 and under 30, etc.

The header row shows ranges not single values. Each column header contains not just a single value but explicitly states the range contained in its bucket eliminating any ambiguity. With single values, does "10" mean "5-10" or "10-15" or even "7.5-12.5" ?

Table 2. Transposed Histogram: This table shows a transposition of the histogram in Table 1, showing the same data in a form more conducive for the techniques discussed.

Commute Time (minutes)

0-4

5-9

10-14

15-19

20-24

25-29

30-34

35-39

40-45

Number of Workers

4180

13687

18618

19634

17981

7190

16369

3212

4122

What You Need

SQL Server 2000 or later

SQL Server Query Analyzer, SQL Server Management Studio, or a similar tool for querying a database and receiving a result set

Microsoft Excel 2003 for creating charts from result sets

Installing the Sample Code To follow along, first download the zipped sample code that accompanies this article, and unzip the file. Next, run histogram.sql to install the histogram stored procedure and then run both BoundSmoothing.sql and GetDataTypeName.sql to install the necessary support functions. You will, of course, need sufficient database privileges to do this; talk to your DBA if you get an error when you try to install a procedure or function.

Author's Note: This stored procedure makes use of dynamic SQL. Dynamic SQL, while sometimes necessary, and frequently useful, always presents a risk of abuse to your system. See The Curse and Blessings of Dynamic SQL by a noted SQL Server MVP, Erland Sommarskog.

You also need the standard AdventureWorks database to replicate the specific examples discussed. If you don't have it, you can download it here, and install it before continuing.

Finally, to generate connections between Excel and your SQL Server database you will need to install Microsoft Query from your Excel installation disk, because it is not installed by default with Excel. To determine whether you have it installed or not, simply follow the steps in the first Excel exercise and the program will tell you if Query is not loaded.