Design Elements Part 7: Statistical Functions

Last updated Mar 28, 2003.

We are surrounded with statistics. Used correctly, they form the bedrock of
successful decisions. Used incorrectly, they can destroy a company by causing
management to make bad decisions, or cause a faulty conclusion in a lab test
evaluation.

The ability to properly deliver statistics is a great career skill. No
programming arsenal of knowledge is complete without a sound understanding of
statistical methods and their proper use.

In this week's tutorial, we'll learn the most important various
statistical methods and how to apply T-SQL functions and algorithms to solve
them.

If heavy statistics and graphics are necessary in a project, the proper
choice is normally to house the data in SQL Server and process it with a
specialized package. T-SQL, however, offers useful commands and functions to
deal with general statistics.

The Use of Statistics

One of the most basic uses of statistics is to summarize data.
Business or scientific requirements often involve simple summaries of large
quantities of data, which answer questions such as "How many?",
"What's the average?" and "How many times does this
occur?"

Another function statistical methods can help us with is grouping
data. This use is called representing data since it takes the original
order and rearranges it to show patterns or groups. The question here is
"Do we have a meaningful group of data here or just random
values?"

Statistical methods also help us to compare data so that we can view
one thing versus another. Typical comparison questions are "How many
workers are at plant X versus plant Y?" or "Do African honeybees fly
farther in a day than European honeybees?"

Related to the comparison questions are statistical results that show a
difference, such as "How much more protein is in peanut butter than
in peanuts?"

Finally, statistics can be used to show relationships. This type of
statistic is the most interesting, and the most dangerous. It seeks to answer
questions such as "Does spending more money per student yield better grades
overall?" This type of question is often difficult to answer, and one
sampling of data or formula is adequate to answer it.

With these uses and cautions in mind, let's take a look at the methods
used to gather the data our algorithms will use.

Statistical Method

The statistical method begins with asking questions. We've seen a
few of those already, but there is a specific way to ask these kinds of
questions. If the focus is a business system, the business users should be
contacted with the questions and interpretations. If the data sets are
scientific in nature, then the end users will help formulate the questions and
relationships they are looking for. The point is, as the DBA or developer, you
shouldn't try to guess what the questions are.

After we determine the questions, we need to collect the data. We
covered the subject of database design earlier, so we use those concepts to
create a model and store the data. This covers the question of how we
store the data. The larger question is what we store.

Data collected for statistical use is called a sample. There are some
important concepts to keep in mind when discussing statistical data, such as the
sample size, sample period, and distribution of the sample.

The sample size is the amount of data that makes the results valid.
For instance, if we asked the first three people we met whether they liked a
recent movie, we can't really call that a representative size. Instead, we
need to determine the number that allows us to have confidence in the results.
For minor, less-important questions, this size might be quite small, but
generally the more wide-spread the result set (kinds of food, house size, etc.),
the larger the sample size needs to be.

The sample period determines how long the sample data collection
should be taken. For instance, to check the average temperature in an area,
several years of data collection is needed.

Finally, the sample distribution of data is important to gain the
widest audience or participants of the event. For our movie question, it might
be important to ask people of many ages and social demographics to see if it was
widely liked. On the other hand, if we're only asking to determine if we
want to see the movie, the distribution needs to include only those people who
share our tastes.

After we collect the data, we analyze it (or at least provide it to
someone else to analyze). This is where we apply the methods we're about to
learn to derive meaning or information from the raw sample data.

Now that we have our methods down, let's get right to the Transact-SQL
that we can use to put it all together. For many of the algorithms we need, the
statistical functions are part of the aggregate functions we've studied
before. We'll just layer these concepts to get at what we need.

Summarizing Data

We'll begin with the basics: summarizing data. The primary functions
we'll use here are SUM and COUNT. Here's an example:

This simple statement returns the sum of the sales made at all stores. As we
learned in our aggregate studies, we can also provide data that is broken down
by store number:

SELECT stor_id, 'Sales' = SUM(qty)
FROM sales
GROUP BY stor_id
ORDER BY Sales DESC
GO
-----------------------

stor_id

Sales

7131

130

7066

125

7067

90

8042

80

7896

60

6380

8

This data is a bit more useful. It's often also helpful to
show not only a ranking of the data such as we have here, but also the sample
size. We do that with the COUNT function. It's pretty simple:

SELECT COUNT(*)
FROM sales
------------------------
21

The answer here is that we don't have a great many stores reporting
 or does it? Actually, this shows how many lines of data were collected,
which might very well mean something else entirely. It's important to keep
these facts in mind when analyzing numerical data.

In both the SUM and COUNT functions, as with most functions, the WHERE
condition can limit the results.

In addition to adding and counting the result sets, the average or mean of
the data is part of the summarization step. Averages are probably one of the
most misused functions in statistics. If I tell you that I have several coins in
my pocket and the average value of them is 5 cents, what do I have in my pocket?
Well, it all depends on the sample size and its distribution. I could have a few
coins or many, and no nickels whatsoever, or they could all be nickels. For that
reason, the average is often taken with several other measures. Also for this
reason there are several kinds of averages, such as weighted or binomial
averages.

In T-SQL, what we have is the numerical average, or the function that
represents the formula:

Average = The sum of the units divided by the number of units

The format of the command to get the numeric average looks like this:

SELECT 'Average Store Sales' = AVG(qty)
FROM sales
GO

Again, we could use a WHERE clause or a GROUP BY clause to display the
desired result.

Next week we'll take a look at some statistical functions that SQL
Server doesn't natively provide: we'll have to write our own.
We'll cover medians, modes, and more. We'll explain when to use each
and how we can implement them. In the meantime, there's a little light
reading in the references.