Determining Sample Size

Developing proprietary insight often requires doing primary research. We are often asked how we determine the sample size necessary for such research.

Larger sample sizes usually cost more. Therefore, we want to use the minimum sample size that will provide a useful answer. When the costs of being wrong are very high, it may be worth paying for a large sample to achieve very high confidence in the answer. For example, if a client is considering a $400 million investment whose success depends on the true value of a particular variable, she is likely to be willing to spend a fair amount to be highly confident in an estimate of that variable.

In other cases, limits on the resources available may require smaller sample sizes. However, when we know very little to begin with, even a relatively small sample can achieve a significant reduction in uncertainty. That reduction can be worth far more than the cost of obtaining a modest sample.

Key Background Assumption

This post is intended as a reminder and reference for people who have had some previous exposure to statistics, perhaps as part of a statistics course in college or graduate school. It assumes a basic understanding of statistical concepts such as mean, sample, population, standard deviation, and normal distributions.

Confidence Levels and Confidence Intervals

A confidence interval is the range of values around a sample mean judged to include the true population mean with a given confidence level, or probability. For example, suppose one has calculated the mean customer acquisition cost of $5000 for a sample of companies. One might wish to calculate the 90% confidence interval around that value, or the range of values that is 90% likely to include the mean acquisition cost of the population.

Increasing the confidence level or reducing the size of the confidence interval requires a larger sample. Therefore, a prerequisite for determining sample size is deciding what level of uncertainty is acceptable.

In academic research, one often sees confidence levels of 95% or 99%. However, we often find in our work that these result in an excessively large sample size. That is, clients are not willing to pay for the additional work (samples) necessary to achieve a 95% confidence level. For this reason, we typically assume our goal is a 90% confidence level. However, this can be adjusted upward or downward based on the client’s sensitivity to risk and his budget. Even an 80% confidence level, or a 90% confidence level around a fairly wide confidence interval, could provide enough uncertainty reduction to draw worthwhile conclusions in a cost-constrained and “noisy” business environment.

Determining Sample Size for Continuous Variables

For continuous variables—that is, variables that can assume any value within a range, such as the salaries of marketing managers, the gross margin percentages of companies in an industry, or acquisition costs per customer of companies in that industry—the required sample size, n, to estimate a population mean with a given confidence level and confidence interval can be computed as:

n = (tα/2 × s / B)²

where:

tα/2 = the t-value that locates an area α/2 in each of the tails of the t-distribution, given the appropriate degrees of freedom

s = an estimate of the population standard deviation (or the actual population standard deviation, if available)

B = the acceptable margin of error for the mean being estimated (i.e., one half of the width of the required confidence interval)
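For orientation, the large-sample version of this calculation (where the t-value converges to the z-value, as discussed in note 4 below) can be sketched using only Python’s standard library; an exact small-sample calculation would need t-values from a t-table or a statistics package:

```python
from math import ceil
from statistics import NormalDist

def approx_sample_size(confidence: float, s: float, B: float) -> int:
    """Large-sample approximation of n = (t_{a/2} * s / B)^2,
    substituting the normal z-value for the t-value."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z that leaves alpha/2 in each tail
    return ceil((z * s / B) ** 2)

# e.g., 90% confidence, estimated std dev of $2000, margin of error $1000
print(approx_sample_size(0.90, 2000, 1000))  # 11
```

Because the t-distribution has more weight in the tails than the z-distribution, this approximation slightly understates the required sample size for small samples.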

Five important notes apply when using this formula:

1. The sample must be random

The above formula assumes the individual observations made are not related to one another any more than all the elements in the population are related. For example, if one is conducting a survey of voters, a sample of people on a street corner in the financial district is not a random sample, nor is one taken from the members of a single church.

2. The population must be normally distributed – OR –
The sample size must be ≥30 and the population must not be scalable

The formula above works only if the sampling distribution of the sample mean is approximately normal. This will be true if the distribution of the population is normal. Even if the population is not normal, however, the sampling distribution will be approximately normal if n is sufficiently large (generally, ≥30) and the population is not “scalable”. Non-scalable means that the probabilities of deviations from the mean drop faster and faster the further one goes from the mean. This is the case when there are strong forces of equilibrium or other limitations preventing very extreme observations.

3. It is frequently necessary to estimate the standard deviation

If we know the standard deviation of the population, we should use it in the calculation. However, in many problems we encounter in business, we must estimate the standard deviation. We have four ways to do this:

If for some reason the sample is already complete, use the sample standard deviation.

Use the sample standard deviation from previous studies of the same or a similar population.

Take the sample in two steps, and use the standard deviation of the first sample as an estimate of the standard deviation of the population to determine how many additional samples are needed.

Make an educated guess about population standard deviation based on experience, common sense, or some other reasonable logic.[1]

It is not uncommon for the fourth option to be the best option available.
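As an illustration of the third option, a two-step calculation might look like the following sketch. The pilot data are hypothetical, and the t-value (for 9 degrees of freedom at a 90% confidence level) comes from a standard t-table:

```python
from math import ceil
from statistics import stdev

# Hypothetical first-step sample: acquisition costs from 10 dealers
pilot = [2000, 8000, 3000, 7000, 4500, 5500, 1500, 9000, 4000, 6000]

s_est = stdev(pilot)      # sample std dev as an estimate of the population's
t = 1.833                 # t_{0.05, 9}: 90% confidence, 9 degrees of freedom
B = 1000                  # acceptable margin of error, in dollars

n_total = ceil(t**2 * s_est**2 / B**2)            # total sample required
n_additional = max(0, n_total - len(pilot))       # beyond the pilot sample
print(n_total, n_additional)  # 22 12
```

(Strictly, once the final n is known the t-value should be re-checked for the larger degrees of freedom, per note 4.)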

4. Use of the t-distribution requires the sample size as an input to the calculation, requiring iterative logic

To account for the fact that our estimate of the population standard deviation may not equal the actual population standard deviation, we use the t-distribution to determine the required sample size instead of the more familiar standard normal (“z”) distribution. The t-distribution is similar to the z-distribution: mound-shaped, symmetric, and with a mean of zero, but it has more weight in the tails (except in the case of large samples, in which case the t-distribution is effectively the same as the z-distribution).

However, the shape of the t-distribution is dependent on the sample size. The appropriate t-distribution to use is the one for n-1 “degrees of freedom.” Thus, we may need to use an iterative process to determine the desired sample size. Typically, we’d start with the t-value for the smallest n we might possibly consider.

5. The acceptable margin of error is often not specified

The client often does not specify a desired margin of error, B. Therefore, we need to choose a reasonable value. However, decreasing the margin of error increases the sample size. In fact, reducing the margin of error by half increases the required sample size four times! Although a narrow margin of error is intellectually appealing, choose the widest margin that is useful for decision making.
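The quadratic effect of the margin of error is easy to verify from the formula, since n is proportional to 1/B²; the figures below (a 90% confidence z-value and a $2000 standard deviation) are purely illustrative:

```python
def required_n(crit, s, B):
    # n = crit^2 * s^2 / B^2, so n is proportional to 1/B^2
    return (crit * s / B) ** 2

n_wide = required_n(1.645, 2000, 1000)    # margin of error $1000
n_narrow = required_n(1.645, 2000, 500)   # margin of error halved to $500
print(n_narrow / n_wide)  # 4.0 -- half the margin, four times the sample
```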

Example and Implementation

Suppose we are estimating the average cost of customer acquisition in a particular industry with a survey of dealers in that industry. Our best a priori estimate of the average is $5000. However, whatever the true answer turns out to be, we want to be 90% confident that the estimate from our sample is within $1000 (20%) of the true value.

We have a database of 500 dealers in the industry. We are going to contact them at random to solicit their participation until we get to the desired sample size, so we believe our sample will be random.

We also expect that the true distribution of acquisition cost among dealers is normal, because we can’t think of a good reason it would be distributed bi-modally, exponentially, or some other way. So, we aren’t particularly concerned that the sample size be 30 or more.

We don’t know the population standard deviation and we have no prior information about it, so we will have to make a best-guess estimate using some common-sense assumptions. Given what we know about the business and our a priori guess that the mean is $5000, we think it is extremely unlikely that more than 5% of the values will be less than $1000 or more than $9000. Thus, our estimate of the standard deviation is $2000.[2]

Recall the formula for calculating the sample size is:

Since we are aiming for a 90% confidence level, α = .1. We have to specify the degrees of freedom of the t-distribution to get its shape. Just as a starting point, we decide to assume n = 15, so we get the t-distribution for 14 degrees of freedom. Our estimate for s is $2000 and B is $1000. Therefore:

n = (1.761 × $2000 / $1000)² ≈ 12.4

Since n is less than the estimate we used to calculate the appropriate t-value, we recalculate using this lower n to determine the t-value (i.e., we re-calculate with n − 1, or 11, degrees of freedom):

n = (1.796 × $2000 / $1000)² ≈ 12.9

Thus, we require a sample of at least 13 dealers.
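The iterative logic of this example can be sketched in Python; the critical t-values are two-tailed values for α = 0.10 (i.e., tα/2 for the relevant degrees of freedom) taken from a standard t-table:

```python
from math import ceil

# Two-tailed critical t-values for alpha = 0.10, i.e. t_{0.05, df},
# from a standard t-table, for the degrees of freedom this example can visit
T_CRIT = {11: 1.796, 12: 1.782, 13: 1.771, 14: 1.761}

s, B = 2000, 1000        # estimated std dev and margin of error, in dollars
n_guess = 15             # starting guess for the sample size

while True:
    t = T_CRIT[n_guess - 1]          # t-value for n_guess - 1 degrees of freedom
    n = ceil(t**2 * s**2 / B**2)     # required sample size for that t-value
    if n >= n_guess:                 # stable: required n is not below our guess
        break
    n_guess = n                      # otherwise re-try with the smaller n

print(n)  # 13
```

Because this sketch rounds n up at every step, it passes through 12 degrees of freedom rather than 11, but it converges to the same answer of 13.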

Calculating Sample Size in Excel

The formula to calculate tα/2 in Excel 2010 is =T.INV.2T(α, n-1). Therefore, the formula to calculate the required sample size in Excel is:

=T.INV.2T(α, n-1)^2*s^2/B^2.

For prior versions of Excel, replace T.INV.2T with TINV.

A Simpler Implementation

One could take issue with the above approach on the grounds that the answer, n, depends on itself. A practical way to avoid this is to use the z-distribution instead of the t-distribution, but to use the distribution for a higher confidence level than one actually requires.[3]

For example, the critical values of the z-distribution for 90% and 95% confidence levels are 1.645 and 1.960, respectively. For conservatism, one could use 2.000. (This is also higher than the t-value for the 90% confidence level for all sample sizes above 6.)

Thus:

n = (2.000 × $2000 / $1000)² = 16

This is a larger sample than strictly necessary, but the answer is both conservative (i.e., the true requirement is a smaller n) and easier to calculate.
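The same conservative calculation, using the example’s figures, is a one-liner in Python:

```python
from math import ceil

z, s, B = 2.000, 2000, 1000   # conservative critical value, est. std dev, margin
n = ceil((z * s / B) ** 2)    # no degrees of freedom, so no iteration needed
print(n)  # 16
```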

Limitations

This approach for determining sample size works for continuous variables. A slightly different approach is required for scaled variables (such as ratings on a scale of 1 to 7) and proportional variables (such as the female share of a population). We will discuss how to calculate sample size for those types of variables in a subsequent post.

[1] One might use ¼ of the expected range of values in the sample in the absence of any other intuition.

[2] If we expect values of less than $1000 would happen at most 2.5% of the time, $1000 is at least two standard deviations from the mean. Thus $4000 is two standard deviations and $2000 is one standard deviation. ($5000-$1000=$4000. $4000 / 2 = $2000.)

[3] The shape of the z-distribution is independent of the sample size, and thus it is not necessary to specify the number of degrees of freedom.