I have a soft spot for crank science. Recently I visited Norumbega
Tower, which is an enduring monument to the crackpot theories of Eben
Norton Horsford, inventor of double-acting baking powder and fake
history. But that’s not what this chapter is about.

This chapter is about the Variability Hypothesis, which

"originated in the early nineteenth century with Johann Meckel, who
argued that males have a greater range of ability than females,
especially in intelligence. In other words, he believed that most
geniuses and most mentally retarded people are men. Because he
considered males to be the ‘superior animal,’ Meckel concluded that
females’ lack of variation was a sign of inferiority."

I particularly like that last part, because I suspect that if it turns
out that women are actually more variable, Meckel would take that as a
sign of inferiority, too. Anyway, you will not be surprised to hear
that the evidence for the Variability Hypothesis is weak.

Nevertheless, it came up in my class recently when we looked at data
from the CDC’s Behavioral Risk Factor Surveillance System (BRFSS),
specifically the self-reported heights of adult American men and women.
The dataset includes responses from 154,407 men and 254,722 women.
Here’s what we found:

The average height for men is 178 cm; the average height for
women is 163 cm. So men are taller, on average. No surprise there.

For men the standard deviation is 7.7 cm; for women it is 7.3
cm. So in absolute terms, men’s heights are more variable.

But to compare variability between groups, it is more meaningful
to use the coefficient of variation (CV), which is the standard
deviation divided by the mean. It is a dimensionless measure of
variability relative to scale. For men the CV is 0.0433; for women it
is 0.0444.

That’s very close, so we could conclude that this dataset provides
weak evidence against the Variability Hypothesis. But we can use
Bayesian methods to make that conclusion more precise. And answering
this question gives me a chance to demonstrate some techniques
for working with large datasets.

I will proceed in a few steps:

We’ll start with the simplest implementation, but it only works
for datasets smaller than 1000 values.

By computing probabilities under a log transform, we can scale
up to the full size of the dataset, but the computation gets slow.

Finally, we speed things up substantially with Approximate
Bayesian Computation, also known as ABC.

In Chapter 9 we estimated two parameters simultaneously
using a joint distribution. In this chapter we use the same
method to estimate the parameters of a Gaussian distribution:
the mean, mu, and the standard deviation, sigma.

For this problem, I define a Suite called Height that
represents a map from each mu, sigma pair to its probability:
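
A minimal sketch of such a class might look like this (assuming the
book's thinkbayes module, where a Suite's Likelihood method takes a
data value and a hypothesis, here a mu, sigma pair; I use
scipy.stats.norm.pdf for the Gaussian density):

    from scipy.stats import norm

    import thinkbayes  # assumed: the book's module of Pmf/Suite classes

    class Height(thinkbayes.Suite):
        """Maps each (mu, sigma) pair to its probability."""

        def __init__(self, mus, sigmas):
            # the hypotheses are all pairs of mu and sigma;
            # the prior over them is uniform
            pairs = [(mu, sigma) for mu in mus for sigma in sigmas]
            thinkbayes.Suite.__init__(self, pairs)

        def Likelihood(self, data, hypo):
            # data is one reported height; the Gaussian density at that
            # point is proportional to the probability we want
            x = data
            mu, sigma = hypo
            return norm.pdf(x, mu, sigma)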

If you have studied statistics from a mathematical perspective,
you know that when you evaluate a PDF, you get a probability
density. In order to get a probability, you have to integrate
probability densities over some range.

But for our purposes, we don’t need a probability; we just
need something proportional to the probability we want.
A probability density does that job nicely.

The hardest part of this problem turns
out to be choosing appropriate ranges for mus and
sigmas. If the range is too small, we omit some
possibilities with non-negligible probability and get the
wrong answer. If the range is too big, we get the right answer,
but waste computational power.

So this is an opportunity to use classical estimation to
make Bayesian techniques more efficient. Specifically, we can use
classical estimators to find a likely location for mu and sigma, and use the standard errors of those estimates to choose a
likely spread.

If the true parameters of the distribution are µ and σ, and
we take a sample of n values, an estimator of µ is the sample
mean, m.

And an estimator of σ is the sample standard
deviation, s.

The standard error of the estimated µ is s / √n,
and the standard error of the estimated σ is
s / √(2(n−1)).
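
In code, this step might look like the following sketch (the name
FindPriorRanges and its parameters are mine; num_stderrs, discussed
below, controls how many standard errors of spread the ranges cover):

    import math
    import numpy

    def FindPriorRanges(xs, num_points, num_stderrs=4.0):
        """Uses classical estimates to choose ranges for mu and sigma.

        xs: observed values
        num_points: how many values to put in each range
        num_stderrs: how many standard errors of spread to cover
        """
        n = len(xs)
        m = numpy.mean(xs)   # sample mean, an estimator of mu
        s = numpy.std(xs)    # sample standard deviation, an estimator of sigma

        # standard errors of the estimates
        stderr_m = s / math.sqrt(n)
        stderr_s = s / math.sqrt(2 * (n - 1))

        mus = numpy.linspace(m - num_stderrs * stderr_m,
                             m + num_stderrs * stderr_m,
                             num_points)
        sigmas = numpy.linspace(s - num_stderrs * stderr_s,
                                s + num_stderrs * stderr_s,
                                num_points)
        return mus, sigmas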

This process might seem bogus, because we use the data to choose the
range of the prior distribution, and then use the data again to do the
update. In general, using the same data twice is, in fact, bogus.

But in this case it is ok. Really. We use the data to choose the
range for the prior, but only to avoid computing a lot of
probabilities that would have been very small anyway. With
num_stderrs=4, the range is big enough to cover all values with
non-negligible likelihood. After that, making it bigger has no effect
on the results.

In effect, the prior is uniform over all values
of mu and sigma, but for computational efficiency
we ignore all the values that don’t matter.

If we select the first 100 values from the BRFSS dataset and run the
analysis I just described, it runs without errors and we get posterior
distributions that look reasonable.

If we select the first 1000 values and run the program again, we get
an error in Pmf.Normalize:

ValueError: total probability is zero.

The problem is that we are using probability densities to compute
likelihoods, and densities from continuous distributions tend to be
small. And if you take 1000 small values and multiply
them together, the result is very small. In this case it is so small
it can’t be represented by a floating-point number, so it gets rounded
down to zero, which is called underflow. And if all
probabilities in the distribution are 0, it’s not a distribution any
more.
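
A quick way to see the problem, assuming likelihoods on the order of
0.01:

    total = 1.0
    for _ in range(1000):
        total *= 0.01        # a typical probability density for one data point
    print(total)             # 0.0 -- the product underflows to zero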

A possible solution is to renormalize the Pmf after each update,
or after each batch of 100. That would work, but it would be slow.

A better alternative is to compute likelihoods under a log
transform. That way, instead of multiplying small values, we can add
up log likelihoods. Pmf provides methods Log, LogUpdateSet and Exp to make this process easy.

Before applying the log transform, Log uses MaxLike to find
m, the highest probability in the Pmf. It divides all
probabilities by m, so the highest probability gets normalized
to 1, which yields a log of 0. The other log probabilities are all
negative. If there are any values in the Pmf with probability 0, they
are removed.

While the Pmf is under a log transform, we can’t use Update,
UpdateSet, or Normalize. The result would be nonsensical;
if you try, Pmf raises an exception.
Instead, we have to use LogUpdate
and LogUpdateSet.

If the log-likelihoods are large negative numbers, the resulting
likelihoods might underflow. So Exp finds the maximum
log-likelihood, m, and subtracts it from all the log-likelihoods
before exponentiating; since m is a large negative number, this
shifts everything up. The resulting distribution has a maximum
likelihood of 1. This process inverts the log transform with
minimal loss of precision.

To review, Log puts the suite under a log transform.
LogUpdateSet calls LogUpdate, which calls
LogLikelihood. LogUpdate uses Pmf.Incr,
because adding a log-likelihood is the same as multiplying
by a likelihood.

After the update, the log-likelihoods are large negative
numbers, so Exp shifts them up before inverting the
transform, which is how we avoid underflow.

Once the suite is transformed back, the probabilities
are “linear” again, which means “not logarithmic”,
so we can use Normalize again.
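
Putting the pieces together, the update looks like this (assuming
suite is a Height suite and dataset is the sequence of observed
heights):

    suite.Log()                  # convert probabilities to log probabilities
    suite.LogUpdateSet(dataset)  # add the log-likelihood of each data point
    suite.Exp()                  # shift by the maximum, then exponentiate
    suite.Normalize()            # back on a linear scale, so this works again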

Using this algorithm, we can process the entire dataset without
underflow, but it is still slow. On my computer it might
take an hour. We can do better.

This section uses math and computational optimization
to speed things up by a factor of 100. But the following section
presents an algorithm that is even faster. So if you want to
get right to the good stuff, feel free to skip this section.

Suite.LogUpdateSet calls LogUpdate once for each data
point. We can speed it up by computing the log-likelihood of the entire
dataset at once. For a sample from a Gaussian distribution, the
log-likelihood of the whole dataset is

−n log σ − Σ (x−µ)² / (2σ²)

plus a term that does not depend on the parameters.

By itself, this would be a small improvement, but it
creates an opportunity for a bigger one. Notice that the
summation, Σ (x−µ)², only depends on mu, not sigma, so we only
have to compute it once for each value of mu.
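
Here is a sketch of that optimization; the names LogUpdateSetFast and
Summation are mine, and I assume the Pmf methods Values and Incr for
iterating over hypotheses and adding to their log probabilities:

    import math

    def Summation(xs, mu, cache={}):
        """Returns the sum of squared deviations of xs from mu, with caching.

        The mutable default argument acts as a cache keyed by (xs, mu),
        so the sum is computed only once per value of mu.
        """
        try:
            return cache[xs, mu]
        except KeyError:
            total = sum((x - mu) ** 2 for x in xs)
            cache[xs, mu] = total
            return total

    def LogUpdateSetFast(self, data):
        """Intended as a method of the Height suite: updates all hypotheses
        with the log-likelihood of the whole dataset at once."""
        xs = tuple(data)        # tuples are hashable, so they can be cache keys
        n = len(xs)

        for hypo in self.Values():
            mu, sigma = hypo
            total = Summation(xs, mu)
            loglike = -n * math.log(sigma) - total / (2 * sigma ** 2)
            self.Incr(hypo, loglike)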

But maybe you don’t have that kind of time. In that case, Approximate
Bayesian Computation (ABC) might be the way to go. The motivation
behind ABC is that the likelihood of any particular dataset is:

Very small, especially for large datasets, which is why we had
to use the log transform,

Expensive to compute, which is why we had to do so much
optimization, and

Not really what we want anyway.

We don’t really care about the likelihood of seeing the exact dataset
we saw. Especially for continuous variables, we care about the
likelihood of seeing any dataset like the one we saw.

For example, in the Euro problem, we don’t care about the order of
the coin flips, only the total number of heads and tails. And in
the locomotive problem, we don’t care about which particular trains were
seen, only the number of trains and the maximum of the serial numbers.

Similarly, in the BRFSS sample, we don’t really want to know the
probability of seeing one particular set of values (especially since
there are hundreds of thousands of them). It is more
relevant to ask, “If we sample 100,000 people from a population
with hypothetical values of µ and σ, what would be
the chance of collecting a sample with the observed mean and
variance?”

For samples from a Gaussian distribution, we can answer this question
efficiently because we can find the distribution of the sample
statistics analytically. In fact, we already did it when we computed
the range of the prior.

If you draw n values from a Gaussian distribution with parameters
µ and σ, and compute the sample mean, m, the
distribution of m is Gaussian
with parameters µ and σ / √n.

Similarly, the distribution of the sample standard deviation, s, is
Gaussian with parameters σ and σ / √(2(n−1)).

We can use these sample distributions to compute the likelihood of the
sample statistics, m and s, given hypothetical values
for µ and σ. Here’s a new version of LogUpdateSet
that does it:
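
A sketch of LogUpdateSetABC, again intended as a method of the Height
suite; I use scipy.stats.norm.logpdf for the Gaussian log density and
assume the Values and Incr methods as before:

    import math
    import numpy
    from scipy.stats import norm

    def LogUpdateSetABC(self, data):
        """Updates the suite using the likelihoods of the sample statistics."""
        xs = data
        n = len(xs)

        # summary statistics of the observed sample
        m = numpy.mean(xs)
        s = numpy.std(xs)

        for hypo in self.Values():
            mu, sigma = hypo

            # log-likelihood of the sample mean, m
            stderr_m = sigma / math.sqrt(n)
            loglike = norm.logpdf(m, mu, stderr_m)

            # log-likelihood of the sample standard deviation, s
            stderr_s = sigma / math.sqrt(2 * (n - 1))
            loglike += norm.logpdf(s, sigma, stderr_s)

            self.Incr(hypo, loglike)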

Figure 10.1: Contour plot of the posterior joint distribution of
mean and standard deviation of height for men in the U.S.

Figure 10.2: Contour plot of the posterior joint distribution of
mean and standard deviation of height for women in the U.S.

We are almost ready to look at results, but we have one more
problem to deal with. There are a number of outliers in this
dataset that are almost certainly errors. For example, there
are three adults with reported height of 61 cm, which would
place them among the shortest living adults in the world.
At the other end, there are four women with reported height
229 cm, just short of the tallest women in the world.

It is not impossible that these values are correct, but it is
unlikely, which makes it hard to know how to deal with them.
And we have to get
it right, because these extreme values have a disproportionate
effect on the estimated variability.

Because ABC is based on summary statistics, rather than the entire
dataset, we can make it more robust by choosing summary statistics
that are robust in the presence of outliers. For example, rather
than use the sample mean and standard deviation, we could use the median
and inter-quartile range
(IQR), which is the difference between the 25th and 75th percentiles.

More generally, we could compute an inter-percentile range (IPR) that
spans any given fraction of the distribution, p:
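
Here is a sketch using NumPy percentiles (the version described below
builds a CDF object, but the effect is the same):

    import numpy

    def MedianIPR(xs, p):
        """Computes the median and the inter-percentile range that spans p."""
        median = numpy.median(xs)

        alpha = (1 - p) / 2
        low, high = numpy.percentile(xs, [100 * alpha, 100 * (1 - alpha)])
        ipr = high - low
        return median, ipr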

xs is a sequence of values. p is the desired range;
for example, p=0.5 yields the inter-quartile range.

MedianIPR works by computing the CDF of xs,
then extracting the median and the difference between two
percentiles.

We can convert from ipr to an estimate of sigma using the
Gaussian CDF to compute the fraction of the distribution covered by a
given number of standard deviations. For example, it is a well-known
rule of thumb that 68% of a Gaussian distribution falls within one
standard deviation of the mean, which leaves 16% in each tail. If we
compute the range between the 16th and 84th percentiles, we expect the
result to be 2 * sigma. So we can estimate sigma by
computing the 68% IPR and dividing by 2.

More generally we could use any number of sigmas.
MedianS performs the more general version of this
computation:
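
A sketch, using scipy.stats.norm.cdf to find the fraction of a
Gaussian within num_sigmas standard deviations of the mean, and the
MedianIPR sketch above:

    from scipy.stats import norm

    def MedianS(xs, num_sigmas):
        """Estimates mu and sigma using the median and an inter-percentile range."""
        # fraction of a Gaussian within num_sigmas standard deviations of the mean
        p = 2 * norm.cdf(num_sigmas) - 1

        median, ipr = MedianIPR(xs, p)
        s = ipr / 2 / num_sigmas   # the IPR spans 2 * num_sigmas standard deviations

        return median, s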

Again, xs is the sequence of values; num_sigmas is the
number of standard deviations the results should be based on. The
result is median, which estimates µ, and s, which
estimates σ.

Finally, in LogUpdateSetABC we can replace the sample mean and
standard deviation with median and s. And that pretty
much does it.

It might seem odd that we are using observed percentiles to
estimate µ and σ, but it is an example of the
flexibility of the Bayesian approach. In effect we are asking,
“Given hypothetical values for µ and σ, and
a sampling process that has some chance of introducing errors,
what is the likelihood of generating a given set of sample
statistics?”

We are free to choose any sample statistics we like, up to a point:
µ and σ determine the location and spread of
a distribution, so we need to choose statistics that capture those
characteristics. For example, if we chose the 49th and 51st percentiles,
we would get very little information about spread, so it
would leave the estimate of σ relatively unconstrained
by the data. All values of sigma would have nearly the
same likelihood of producing the observed values, so the posterior
distribution of sigma would look a lot like the
prior.

Figure 10.3: Posterior distributions of CV for men and women, based on
robust estimators.

Finally we are ready to answer the question we started with: is the
coefficient of variation greater for men than for women?

Using ABC based on the median and IPR with num_sigmas=1, I
computed posterior joint distributions for mu and sigma. Figures 10.1 and 10.2
show the results as a contour plot with mu on the x-axis, sigma on the y-axis, and probability on the z-axis.

For each joint distribution, I computed the posterior distribution of
CV. Figure 10.3 shows these distributions for men
and women. The mean for men is 0.0410; for women it is 0.0429.
Since there is no overlap between the distributions, we conclude with
near certainty that
women are more variable in height than men.

So is that the end of the Variability Hypothesis? Sadly, no. It turns
out that this
result depends on the choice of the
inter-percentile range. With num_sigmas=1, we conclude that
women are more variable, but with num_sigmas=2 we conclude
with equal confidence that men are more variable.

The reason for the difference is that there
are more men of short stature, and their distance from the mean is
greater.

So our evaluation of the Variability Hypothesis depends on the
interpretation of “variability.” With num_sigmas=1 we
focus on people near the mean. As we increase
num_sigmas, we give more weight to the extremes.

To decide which
emphasis is appropriate, we would need a more precise statement
of the hypothesis. As it is, the Variability Hypothesis may be
too vague to evaluate.

Nevertheless, it helped
me demonstrate several new ideas and, I hope you agree,
it makes an interesting example.

There are two ways you might think of ABC. One interpretation
is that it is, as the name suggests, an approximation that is
faster to compute than the exact value.

But remember that Bayesian analysis is always
based on modeling decisions, which implies that there is no
“exact” solution. For any interesting
physical system there are many possible models, and each model
yields different results. To interpret the results, we have to
evaluate the models.

So another interpretation of ABC is that it represents an alternative
model of the likelihood. When we compute p(D|H), we are asking
“What is the likelihood of the data under a given hypothesis?”

For large datasets, the likelihood of the data is very small, which
is a hint that we might not be asking the right question. What
we really want to know is the likelihood of any outcome
like the data, where the definition of “like” is yet another
modeling decision.

The underlying idea of ABC is that two datasets are alike if they yield
the same summary statistics. But in some cases, like the example in
this chapter, it is not obvious which summary statistics to choose.

For example, we could use data from the BRFSS to estimate the
difference in height between men and women. By sampling values
from the posterior distributions of µ and
σ, we could generate the posterior distribution of this
difference.

But it might be better to use a dimensionless measure of effect
size, rather than a difference measured in cm. One option is
to divide through by the standard deviation (similar to what
we did with the coefficient of variation).

If the parameters for Group 1 are (µ1, σ1), and the
parameters for Group 2 are (µ2, σ2), the dimensionless
effect size is

(µ1 − µ2) / ((σ1 + σ2) / 2)

Write a function that takes joint distributions of
mu and sigma for two groups and returns
the posterior distribution of effect size.

Hint: if enumerating all pairs from the two distributions takes too
long, consider random sampling.