December 18, 2014

Effective Sample Size

Posted by Tom Leinster

On a scale of 0 to 10, how much does the average citizen of the Republic of
Elbonia trust the president?

You’re conducting a survey to find out, and you’ve calculated that in order
to get the precision you want, you’re going to need a sample of 100
statistically independent individuals. Now you have to decide how to do
this.

You could stand in the central square of the capital city and survey the
next 100 people who walk by. But these opinions won’t be independent:
probably politics in the capital isn’t representative of politics in
Elbonia as a whole.

So you consider travelling to 100 different locations in the country and
asking one Elbonian at each. But apart from anything else, this is far
too expensive for you to do.

Maybe a compromise would be OK. You could go to 10 locations and ask… 20
people at each? 30? How many would you need in order to match the
precision of 100 independent individuals — to have an “effective
sample size” of 100?

The answer turns out to be closely connected to a quantity I’ve written
about many times before: magnitude. Let me explain…

The general situation is that we have a large population of individuals (in
this case, Elbonians), and with each there is associated a real number
(in this case, their level of trust in the president). So we have a probability
distribution, and we’re interested in discovering some statistic $\theta$
(in this case, the mean, but it might instead be the median
or the variance or the 90th percentile). We do this by taking some sample
of $n$ individuals, and then doing something with the sampled data to
produce an estimate of $\theta$.

The “something” we do with the sampled data is called an estimator.
So, an estimator is a real-valued function on the set of possible sample
data. For instance, if you’re trying to estimate the mean of the
population, and we denote the sample data by $Y_1, \ldots, Y_n$, then the
obvious estimator for the population mean would be just the sample mean,

$$\frac{1}{n} Y_1 + \cdots + \frac{1}{n} Y_n.$$

But it’s important to realize that the best estimator for a given statistic
of the population (such as the mean) needn’t be that same statistic applied
to the sample. For example, suppose we wish to know the mean mass of
men from Mali. Unfortunately, we’ve only weighed three men from Mali, and
two of them are brothers. You could use

$$\frac{1}{3} Y_1 + \frac{1}{3} Y_2 + \frac{1}{3} Y_3$$

as your estimator, but since body mass is somewhat genetic, that would give
undue importance to one particular family. At the opposite extreme, you
could use

$$\frac{1}{2} Y_1 + \frac{1}{4} Y_2 + \frac{1}{4} Y_3$$

(where $Y_1$ is the mass of the non-brother). But that would be going too
far, as it gives the non-brother as much importance as the two brothers put
together. Probably the best answer is somewhere in between. Exactly
where in between depends on the correlation between masses of brothers,
which is a quantity we might reasonably estimate from data gathered elsewhere
in the world.

(There’s a deliberate echo here of something I wrote previously: in what
proportions should we sow poppies, Polish wheat and Persian wheat in order
to maximize biological diversity? The similarity is no coincidence.)

There are several qualities we might seek in an estimator. I’ll focus on
two.

High precision. The precision of an estimator is the
reciprocal of its variance. To make sense of this, you have to realize
that estimators are random variables too! An estimator with high
precision, or low variance, is not much changed by the effects of
randomness. It will give more or less the same answer if you run it
multiple times.

For instance, suppose we’ve decided to do the Elbonian survey by asking
30 people in each of the 5 biggest cities and 20 people from each of 3
chosen villages, then taking some specific weighted mean of the resulting
data. If that’s a high-precision estimator, it will give more or
less the same final answer no matter which specific Elbonians happen to
have been stopped by the pollsters.

Unbiased. An estimator of some statistic is unbiased if its expected value is
equal to that statistic for the population.

For example, suppose we’re trying to estimate the variance of some
distribution. If our sample consists of a measly two individuals, then the
variance of the sample is likely to be much less than the variance of the
population. After all, with only two individuals observed, we’ve barely
begun to glimpse the full variation of the population as a whole. It can
actually be shown that with a sample size of two, the expected value of the
sample variance (computed with divisor $n$, here $2$) is half the
population variance. So the sample variance is a biased estimator of the
population variance, but twice the sample variance is an unbiased
estimator.
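
The factor of two can be checked exactly on a toy population. The sketch below is only an illustration under stated assumptions: the population is a hypothetical set of eleven equally likely trust scores, "sample variance" means the version with divisor $n$, and samples are drawn with replacement. It averages the sample variance over every possible sample of size two.

```python
from fractions import Fraction
from itertools import product

def population_variance(values):
    """Variance of the uniform distribution on `values` (divisor len(values))."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((v - mean) ** 2 for v in values) / n

def expected_sample_variance(values, sample_size):
    """Exact expected value of the divisor-n sample variance, averaged over
    every equally likely ordered sample drawn with replacement."""
    samples = list(product(values, repeat=sample_size))
    total = sum(population_variance(s) for s in samples)
    return total / len(samples)

# Hypothetical population: trust scores 0..10, each equally likely.
population = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
sigma2 = population_variance(population)
e_s2 = expected_sample_variance(population, 2)
print(sigma2, e_s2, 2 * e_s2)
```

Because the arithmetic uses `Fraction`, the comparison between the population variance and twice the expected sample variance is exact rather than approximate.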

(Being unbiased is perhaps a less crucial property of an estimator than
it might at first appear. Suppose the boss of a chain of pizza takeaways
wants to know the average size of pizzas ordered. “Size” could be measured
by diameter — what you order by — or area — what you eat.
But since the relationship between diameter and area is quadratic rather
than linear, an unbiased estimator of one will be a biased estimator of the
other.)

No matter what statistic you’re trying to estimate, you can talk about
the “effective sample size” of an estimator. But for simplicity, I’ll only
talk about estimating the mean.

Here’s a loose definition:

The effective sample size of an estimator of the population mean is
the number $n_{eff}$ with the property that our estimator has the same
precision (or variance) as the estimator obtained by sampling $n_{eff}$
independent individuals.

Let’s unpack that.

Suppose we choose $n$ individuals at random from the population (with
replacement, if you care). So we have independent, identically distributed
random variables $Y_1, \ldots, Y_n$. As above, we take the sample mean

$$\frac{1}{n} Y_1 + \cdots + \frac{1}{n} Y_n$$

as our estimator of the population mean. Since variance is additive for
independent random variables, the variance of this estimator is

$$n \cdot Var\Bigl(\frac{1}{n} Y_1\Bigr) = n \cdot \frac{1}{n^2} \sigma^2 = \frac{\sigma^2}{n},$$

where $\sigma^2$ is the population variance. The precision of the
estimator is, therefore, $n/\sigma^2$. That makes sense: as your sample
size $n$ increases, the precision of your estimate increases too.

Now, suppose we have some other estimator $\hat{\mu}$ of the population
mean. It’s a random variable, so it has a variance $Var(\hat{\mu})$. The
effective sample size of the estimator $\hat{\mu}$ is the number $n_{eff}$
satisfying

$$\sigma^2/n_{eff} = Var(\hat{\mu}).$$

This doesn’t entirely make sense, as the unique number $n_{eff}$ satisfying
this equation needn’t be an integer, so we can’t sensibly talk about a
sample of size $n_{eff}$. Nevertheless, we can absolutely rigorously
define the effective sample size of our estimator $\hat{\mu}$ as

$$n_{eff} = \sigma^2/Var(\hat{\mu}).$$

Trivial examples. If $\hat{\mu}$ is the mean value of $n$
uncorrelated individuals, then the effective sample size is $n$. If
$\hat{\mu}$ is the mean value of $n$ extremely highly correlated
individuals, then the variance of the estimator is little less than the
variance of a single individual, so the effective sample size is little
more than $1$.
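
Both trivial examples can be read off from one formula. The following is a sketch under an extra assumption that is not in the text above: every individual has variance $\sigma^2$ and every pair of distinct individuals has the same correlation $\rho \geq 0$.

```latex
\begin{align*}
Var\Bigl(\frac{1}{n}\sum_{i=1}^n Y_i\Bigr)
  &= \frac{1}{n^2}\Bigl(\sum_{i} Var(Y_i) + \sum_{i \neq j} Cov(Y_i, Y_j)\Bigr) \\
  &= \frac{1}{n^2}\bigl(n\sigma^2 + n(n-1)\rho\sigma^2\bigr)
   = \frac{\sigma^2}{n}\bigl(1 + (n-1)\rho\bigr), \\
n_{eff} &= \frac{\sigma^2}{Var(\hat{\mu})} = \frac{n}{1 + (n-1)\rho}.
\end{align*}
```

Setting $\rho = 0$ gives $n_{eff} = n$, and letting $\rho \to 1$ gives $n_{eff} \to 1$, matching the two cases described.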

Now, suppose our pollsters have come back from their trips to various parts
of Elbonia. Together, they’ve asked $n$ individuals how much they trust the
president. We want to take that data and use it to estimate the population
mean — that is, the mean level of trust in the president across
Elbonia — in as precise a way as possible.

We’re going to restrict ourselves to unbiased estimators, so that the
expected value of the estimator is the population mean. We’re also going
to consider only linear estimators: those of the form

$$a_1 Y_1 + \cdots + a_n Y_n$$

where $Y_1, \ldots, Y_n$ are the trust levels expressed by the $n$
Elbonians surveyed.

Correlation and covariance

Variance is a quadratic form, and covariance is the corresponding bilinear
form. That is, take two random variables $X$ and $Y$, with respective
means $\mu_X$ and $\mu_Y$. Then their covariance is

$$Cov(X, Y) = E((X - \mu_X)(Y - \mu_Y)).$$

This is bilinear in $X$ and $Y$, and $Cov(X, X) = Var(X)$.

$Cov(X, Y)$ is bounded above and below by $\pm \sigma_X \sigma_Y$, the
product of the standard deviations. It’s natural to normalize, dividing
through by $\sigma_X \sigma_Y$ to obtain a number between $-1$ and $1$.
This gives the correlation coefficient

$$\rho_{X, Y} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}.$$

Given random variables $Y_1, \ldots, Y_n$ with standard deviations
$\sigma_1, \ldots, \sigma_n$, their correlation matrix $R$ is the
$n \times n$ matrix with entries $R_{i j} = \rho_{Y_i, Y_j}$. It is
symmetric and has $1$s down the diagonal.

The matrix is positive semidefinite. That’s because the corresponding
quadratic form is $(a_1, \ldots, a_n) \mapsto Var(\sum a_i
Y_i/\sigma_i)$, and variances are nonnegative.

And actually, it’s not so hard to prove that any matrix with these
properties is the correlation matrix of some sequence of random variables.

In what follows, for simplicity, I’ll quietly assume that the correlation
matrices we encounter are strictly positive definite. This only amounts to
assuming that no nontrivial linear combination of the $Y_i$s has variance
zero; in other words, that there are no exact linear relationships between
the random variables involved.

Back to the main question

Here’s where we got to. We surveyed $n$ individuals from our population,
giving $n$ identically distributed but not necessarily independent random
variables $Y_1, \ldots, Y_n$. Some of them will be correlated because of
geographical clustering.

We’re trying to use this data to estimate the population mean in as precise
a way as possible. Specifically, we’re looking for numbers $a_1, \ldots,
a_n$ such that the linear estimator $\sum a_i Y_i$ is unbiased and has the
maximum possible effective sample size.

The effective sample size was defined as $n_{eff} = \sigma^2/Var(\sum a_i
Y_i)$, where $\sigma^2$ is the variance of the distribution we’re drawing
from. Now we need to work out the variance in the denominator.

Let $R$ denote the correlation matrix of $Y_1, \ldots, Y_n$. I said a
moment ago that $(a_1, \ldots, a_n) \mapsto Var(\sum a_i Y_i)$ is the
quadratic form corresponding to the bilinear form represented by the
covariance matrix. Since each $Y_i$ has variance $\sigma^2$, the
covariance matrix is just $\sigma^2$ times the correlation matrix $R$. Hence

$$Var\Bigl(\sum a_i Y_i\Bigr) = \sigma^2 \cdot a^T R a,$$

where $a = (a_1, \ldots, a_n)$ is regarded as a column vector, and so the
effective sample size of the estimator is $n_{eff} = 1/(a^T R a)$. The
estimator is unbiased, whatever the population mean $\mu$, exactly when
$\sum a_i = 1$, since $E(\sum a_i Y_i) = (\sum a_i) \mu$. So the maximum
possible effective sample size is

$$\sup \Bigl\{ \frac{1}{a^T R a} : a \in \mathbb{R}^n, \ \sum a_i = 1 \Bigr\}.$$

Which $a \in \mathbb{R}^n$ achieves this maximum, and what is the maximum
possible effective sample size? That’s easy, and in fact it’s something
that’s appeared many times at this blog before…

The magnitude of a matrix

The magnitude $|R|$ of an invertible $n \times n$ matrix $R$ is the sum of
all $n^2$ entries of $R^{-1}$. To calculate it, you don’t need to go as
far as inverting $R$. It’s much easier to find the unique column vector
$w$ satisfying

$$R w = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}$$

(the weighting of $R$), then calculate $\sum_i w_i$. This sum is the
magnitude of $R$, since $w_i$ is the $i$th row-sum of $R^{-1}$.
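
The recipe "solve for the weighting, then add up its entries" can be sketched in a few lines. This is an illustrative implementation, not code from any magnitude library: the function names `weighting` and `magnitude` are made up here, the solver is plain Gauss-Jordan elimination over exact fractions, and it assumes the matrix is square and invertible.

```python
from fractions import Fraction

def weighting(R):
    """Solve R w = (1, ..., 1) exactly by Gauss-Jordan elimination.
    Assumes R is square and invertible."""
    n = len(R)
    # Augmented matrix [R | 1], with exact rational entries.
    A = [[Fraction(x) for x in row] + [Fraction(1)] for row in R]
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero pivot.
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        # Scale the pivot row so that the pivot becomes 1.
        p = A[col][col]
        A[col] = [x / p for x in A[col]]
        # Clear this column in every other row.
        for r in range(n):
            if r != col and A[r][col] != 0:
                f = A[r][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    return [row[n] for row in A]

def magnitude(R):
    """Sum of the weighting, which equals the sum of the entries of R^{-1}."""
    return sum(weighting(R))

# A 2 x 2 correlation matrix with correlation 1/2 (an arbitrary choice).
rho = Fraction(1, 2)
R = [[1, rho], [rho, 1]]
print(weighting(R), magnitude(R))
```

For larger or floating-point matrices a numerical linear solver would be the more practical choice; exact fractions are used here only so that small examples can be checked by hand.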

Most of what I’ve written about magnitude
has been in the situation where we start with a finite metric space $X =
\{x_1, \ldots, x_n\}$, and we use the matrix $Z$ with entries $Z_{i j} =
\exp(-d(x_i, x_j))$. This turns out to give interesting information about
$X$. In the metric situation, the entries of the matrix $Z$ are between
$0$ and $1$. Often $Z$ is positive definite (e.g. when $X
\subset \mathbb{R}^n$), as correlation matrices are.

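When $R$ is positive definite, there’s a third way to describe the
magnitude:

$$|R| = \sup \Bigl\{ \frac{1}{a^T R a} : a \in \mathbb{R}^n, \ \sum a_i = 1 \Bigr\}.$$

The supremum is attained at $a = w/|R|$, where $w$ is the weighting of
$R$. So the maximum possible effective sample size is $|R|$, achieved by
the unbiased linear estimator $\frac{1}{|R|} \sum w_i Y_i$.

But the result is so simple that I’d imagine it’s much older. I’ve been
wondering whether it’s essentially the Gauss-Markov theorem; I
thought it was, then I thought it wasn’t. Does anyone know?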
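
For readers who want to see why no unbiased linear estimator can have effective sample size above $|R|$, here is one possible argument, sketched with the Cauchy-Schwarz inequality. It assumes $R$ is positive definite, writes $w$ for the weighting of $R$ and $\mathbf{1}$ for the column vector of $1$s, and takes $a$ to be any vector with $\sum a_i = 1$. It is offered as a plausible derivation, not necessarily the one the author had in mind.

```latex
\begin{align*}
1 = \sum_i a_i &= a^T \mathbf{1} = a^T R w
  \leq \sqrt{a^T R a}\,\sqrt{w^T R w}
  && \text{(Cauchy-Schwarz for } \langle x, y \rangle = x^T R y\text{)} \\
w^T R w &= w^T \mathbf{1} = \sum_i w_i = |R| \\
\Longrightarrow \quad \frac{1}{a^T R a} &\leq |R|,
  \quad \text{with equality when } a = w/|R|.
\end{align*}
```

Equality at $a = w/|R|$ can be checked directly: the entries sum to $1$, and $a^T R a = w^T R w / |R|^2 = 1/|R|$.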

The surprising behaviour of effective sample size

You might expect the effective size of a sample of $n$ individuals to be at
most $n$. It’s not.

You might expect the effective sample size to go down as the correlations
within the sample go up. It doesn’t.

This behaviour appears in even the simplest nontrivial example:

Example. Suppose our sample consists of just two individuals.
Call the sampled values $Y_1$ and $Y_2$, and write the correlation matrix
as

$$R = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.$$

Then the maximum-precision unbiased linear estimator is $\frac{1}{2}(Y_1 +
Y_2)$, and its effective sample size is

$$|R| = \frac{2}{1 + \rho}.$$

As the correlation $\rho$ between the two variables increases from $0$ to
$1$, the effective sample size decreases from $2$ to $1$, as you’d expect.

But when $\rho < 0$, the effective sample size is greater than $2$. In
fact, as $\rho \to -1$, the effective sample size tends to $\infty$.
That’s intuitively plausible. For if $\rho$ is close to $-1$ then, writing
$Y_1 = \mu + \varepsilon_1$ and $Y_2 = \mu + \varepsilon_2$, we have
$\varepsilon_1 \approx -\varepsilon_2$, and so $\frac{1}{2}(Y_1 + Y_2)$ is
a very good estimator of $\mu$. In the extreme, when $\rho = -1$, it’s an
exact estimator of $\mu$: it’s infinitely precise.
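
The two-variable formulas can be obtained by solving for the weighting by hand. A short worked version, assuming $-1 < \rho < 1$ so that the matrix is invertible and writing $w = (w_1, w_2)$ for the weighting:

```latex
\begin{align*}
R w = \begin{pmatrix} 1 \\ 1 \end{pmatrix}
  &\iff w_1 + \rho w_2 = 1, \quad \rho w_1 + w_2 = 1
  \iff w_1 = w_2 = \frac{1}{1 + \rho}, \\
|R| &= w_1 + w_2 = \frac{2}{1 + \rho}, \\
\frac{1}{|R|}(w_1 Y_1 + w_2 Y_2) &= \frac{1}{2}(Y_1 + Y_2).
\end{align*}
```

So the weighting depends on $\rho$, but after dividing by the magnitude the estimator itself does not.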

The fact that the effective sample size can be greater than the actual
sample size seems to be very well known. For instance, there’s a whole
page about it in the documentation for Q, which is apparently “analysis
software for market research”.

What’s interesting is that this doesn’t only occur when
some of the variables are negatively correlated. It can also happen when
all the correlations are nonnegative, as in the following example from the
paper by Eaton cited above.

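Example. Suppose our sample consists of three individuals, with sampled
values $Y_1$, $Y_2$ and $Y_3$ and correlation matrix

$$R = \begin{pmatrix} 1 & 0 & \rho \\ 0 & 1 & \rho \\ \rho & \rho & 1 \end{pmatrix},$$

where $0 \leq \rho < \sqrt{2}/2$. A routine computation shows that

$$|R| = \frac{3 - 4\rho}{1 - 2\rho^2}.$$

As we’ve shown, this is the greatest possible effective sample size you can achieve by taking an unbiased linear combination of $Y_1$, $Y_2$ and $Y_3$.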

When $\rho = 0$, it’s $3$, as you’d
expect: the variables are uncorrelated. As $\rho$ increases, $|R|$
decreases, again as you’d expect: more correlation between the variables
leads to a smaller effective sample size. This behaviour continues until
$\rho = 1/2$, where $|R| = 2$.

But then something strange happens. As $\rho$ increases from $1/2$ to
$\sqrt{2}/2$, the effective sample size increases from $2$ to $\infty$.
Increasing the correlation increases the effective sample size. For
instance, when $\rho = 0.7$, we have $|R| = 10$: the
maximum-precision estimator is as precise as if we’d chosen $10$
independent individuals! For that value of $\rho$, the maximum-precision
estimator turns out to be

$$\frac{3}{2} Y_1 + \frac{3}{2} Y_2 - 2 Y_3.$$

Go figure!
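
The figure of $10$ can be checked directly from the quadratic form. The sketch below assumes the three-variable correlation matrix has $0$ in the $(1, 2)$ position and $\rho$ in the $(1, 3)$ and $(2, 3)$ positions (the matrix described in the comments below, and consistent with the formula for $|R|$), and uses the fact that an unbiased linear estimator $\sum a_i Y_i$ of equal-variance variables has effective sample size $1/(a^T R a)$. The function name is made up for this illustration.

```python
from fractions import Fraction

def effective_sample_size(a, R):
    """n_eff = 1 / (a^T R a) for an unbiased linear estimator sum a_i Y_i,
    assuming every Y_i has the same mean and variance."""
    if sum(a) != 1:
        raise ValueError("weights must sum to 1 for an unbiased estimator")
    n = len(a)
    quad = sum(a[i] * R[i][j] * a[j] for i in range(n) for j in range(n))
    return 1 / quad

rho = Fraction(7, 10)
# Assumed correlation matrix: Y1 and Y2 uncorrelated, each correlated with Y3.
R = [[1, 0, rho],
     [0, 1, rho],
     [rho, rho, 1]]

best = [Fraction(3, 2), Fraction(3, 2), Fraction(-2)]  # weights quoted above
plain_mean = [Fraction(1, 3)] * 3                      # ordinary sample mean

n_eff_best = effective_sample_size(best, R)
n_eff_mean = effective_sample_size(plain_mean, R)
print(n_eff_best, n_eff_mean)
```

Comparing the two results shows how much of the gain comes from the negative weight on $Y_3$: the ordinary sample mean of the same three observations has an effective sample size below $2$.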

This is very like the fact that a metric space with $n$ points can have
magnitude (“effective number of points”) greater than $n$, even if the
associated matrix $Z$ is positive definite.

These examples may seem counterintuitive, but Eaton cautions us
to beware of our feeble intuitions:

These examples show that our rather vague intuitive feeling that
“positive correlation tends to decrease information content in an
experiment” is very far from the truth, even for rather simple normal
experiments with three observations.

Anyone with any statistical knowledge who’s still reading will easily have
picked up on the fact that I’m a total amateur. If that’s you, I’d love to
hear your comments!

Posted at December 18, 2014 10:25 PM UTC

26 Comments & 0 Trackbacks

Re: Effective Sample Size

I am also an amateur at statistics. However, on the question of how n positively correlated samples can have an effective sample size greater than n, I wonder how you can know what the true correlation matrix of your samples is. Presumably that knowledge is what somehow gets you the extra power of your experiment.

Re: Effective Sample Size

That’s a question I’ve wondered about myself.

I suppose one can never know the correlation, but one can take a good guess at it. Perhaps there’s a survey of trust in the Elbonian president taken annually, and although that trust level swings around wildly from year to year, the correlations within and between different towns remain about the same. In that case, it would be reasonable to assume that they’ll be about the same this year.

Or perhaps we know nothing about the mass of men in Mali, but we do know how well-correlated the masses of brothers tend to be in other countries, and we therefore feel it’s safe to assume that the correlation is similar there.

But I’d be happy if someone more knowledgeable gave their point of view.

Re: Effective Sample Size

Probably part of the story is that having a high $n_{\mathrm{eff}}$ doesn’t really guarantee that your sample is “statistically powerful”.

For one thing, notice in the two examples that Tom gave that as the magnitude tends to $\infty$, the covariance matrix tends toward a singular matrix, for which no weighting exists. When no weighting exists, it seems that you can’t actually construct an unbiased estimator.

What if the magnitude is very large, but not infinite? In the 2-element sample, the weighting for covariance of $\rho$ is $[\frac 1 {1+\rho}, \frac 1 {1+\rho}]^T$. So if you think that $\rho$ is close to $-1$, but it could be off by $\epsilon$, then all you know about the correct weighting is that it’s of the form $[\alpha, \alpha]^T$ for some $\alpha$ with $\frac 1 \epsilon \leq \alpha \leq \infty$. So actually choosing a correct estimator is infeasible.

I’m trying to wrap my head around this intuitively – if a two-element sample is identically distributed and perfectly anticorrelated, then their sum always gives the mean exactly, right? So why doesn’t $[1,1]^T$ come out as the optimal estimator?

Anyway, I’m guessing the connection between infinite $n_{\mathrm{eff}}$ and a singular covariance matrix is a general phenomenon. Having a very high $n_{\mathrm{eff}}$ probably goes hand in hand with having a nearly-singular covariance matrix and having a weighting which is very sensitive to perturbations in the matrix.

where the $\epsilon_i$s are independent with variance 1, mean 0. When the coefficient of $\epsilon_3$ is zero you can get a linear combination with just $\mu$, and the covariance matrix is a sum of two rank 1 matrices.

where $\mu$ is any constant and the $\epsilon_i$ are independent with mean $0$ and variance $1$, here’s the story.

I said in my post that any real positive semidefinite $n \times n$ matrix $R$ with $1$s down the diagonal is the correlation matrix of some $n$-tuple of random variables. The proof I know uses the fact that $R$ has a real symmetric square root $S$. In fact, all that really matters is that there’s some real matrix $S$ satisfying $S S^t = R$.

Re: Effective Sample Size

Hi Tim. In the 2-element example, the weighting is the transpose of $(\frac{1}{1 + \rho}, \frac{1}{1 + \rho})$, so yes, that varies with $\rho$. But the best estimator is

$$\frac{1}{|R|}(w_1 Y_1 + w_2 Y_2),$$

which (by calculation or simply by symmetry) is always $\frac{1}{2}(Y_1 + Y_2)$, regardless of $\rho$.

(When you said “if a two-element sample is identically distributed and perfectly anticorrelated, then their sum always gives the mean exactly”, you were out by a factor of 2.)

Knowing that two variables are strongly anticorrelated tells you a great deal, it seems. And surely related to that is that it’s rather hard to think of situations where you would know that variables were strongly anticorrelated.

Re: Effective Sample Size

Ah, I see that I made the very silly mistake of missing a factor of $\frac 1 {|R|}$. Thanks for setting me straight.

One thing to notice is that if we drop the assumption that the variables are identically distributed, the power of anticorrelation goes away, intuitively. How much of this whole story survives if we do drop this assumption?

Re: Effective Sample Size

Is it obvious that the actual value of the (anti-)correlation value changes its effectiveness as the distributions become different? Since you can show that the correlation for a signal is maximised/minimised by equal/negated versions of the original signal (respectively). As such, as the distributions become more different the range of attainable correlation values is reduced. So the different distributions reduce the knowledge “through” how the correlation value behaves; do the distributions of the random variables have any effect other than this?

Re: Effective Sample Size

where $0 < \rho < \sqrt{2}/2$. This is positive definite and has $1$s down the diagonal, so it’s a correlation matrix. (Indeed, ap’s comment gives an explicit construction of some random variables that it’s the correlation matrix of.) But if it came from a metric space in the way you describe, it would satisfy a version of the triangle inequality:

$$R_{1 2} \geq R_{1 3} R_{3 2},$$

which is false. (More intuitively, the “$0$” in the $(1, 2)$ position says that the 1st and 2nd points are infinitely far apart, whereas the “$\rho$”s at $(1, 3)$ and $(2, 3)$ say that both the 1st and 2nd points are at finite distance from the 3rd point.)

Re: Effective Sample Size

Note that there’s another problem to surmount: the correlation between the random variable $X$ and $X+c$ is 1, so any transformation that maps that to 0 will violate the “distance zero means equal” condition (unless you possibly redefine what equal means).

Re: Effective Sample Size

Yes, good point: having correlation $1$ doesn’t mean being identical.

Somewhat relatedly, having correlation $0$ is a much weaker condition than being independent.

The Wikipedia page on uncorrelated random variables has a nice example (which I guess is standard). Let $X$ be distributed uniformly on $[-1, 1]$ and $Y = X^2$. Then $X$ and $Y$ are not independent, to say the least! But their correlation coefficient is zero.

Roughly, the reason they’re uncorrelated is that an increase in $Y$ is equally likely to have been produced by an increase or a decrease in $X$. E.g. if we know that $Y$ has changed from $0.3$ to $0.31$, then that means that $X$ has either changed from $+\sqrt{0.3}$ to $+\sqrt{0.31}$ or changed from $-\sqrt{0.3}$ to $-\sqrt{0.31}$, and the two possibilities are equally probable.
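
The vanishing covariance in this example can also be checked by a direct calculation, using only the symmetry of the uniform distribution on $[-1, 1]$ and the identity $Cov(X, Y) = E(XY) - E(X)E(Y)$:

```latex
\begin{align*}
E(X) &= \frac{1}{2}\int_{-1}^{1} x \, dx = 0, \qquad
E(X^3) = \frac{1}{2}\int_{-1}^{1} x^3 \, dx = 0, \\
Cov(X, Y) &= E(XY) - E(X)E(Y) = E(X^3) - 0 \cdot E(X^2) = 0.
\end{align*}
```

Since the covariance is zero and both variances are positive, the correlation coefficient is zero even though $Y$ is a function of $X$.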

Re: Effective Sample Size

These examples show that our rather vague intuitive feeling that “positive correlation tends to decrease information content in an experiment” is very far from the truth, even for rather simple normal experiments with three observations.

The way I justified this observation to myself back in the nineties was that when the variables are correlated to an unknown degree, there is actual information hidden in the difference between a sampled value of a variable and the expected value based on the assumed correlation and the sampled values of other variables. In the limit when the correlation is 1.0, any deviation at all would produce a numerically infinite information value given that the sampled value is supposedly impossible.

In such cases I found it made intuitive sense to treat such extra information as pertaining to the correlation itself and tweak that to minimize the effect.

Re: Effective Sample Size

I’ve been turning this over in my mind in the last 24 hours or so, and I think I kind of get what you mean, but it’s fuzzy.

One point is that we don’t see this effect with two positively correlated variables. There, the effective sample size is $2/(1 + \rho)$, where $\rho$ is the correlation coefficient. This decreases as $\rho \to 1$.

Any explanation needs to account for why the effect isn’t seen until $n = 3$. Do you have an intuition as to why that is?

Re: Effective Sample Size

Sorry for the long delay, I forgot I actually posted that.

Any explanation needs to account for why the effect isn’t seen until $n = 3$. Do you have an intuition as to why that is?

An intuition, nothing more. For two points in a metric manifold, a single number is sufficient to represent the distance between two points. For three points, the sum of pairwise distances (the perimeter of the triangle) can be used in the same way but this ends up ignoring the described area that carries information about the separation of the points as well. For four or more points the informational value of the single scalar drops more as the dimensionality of the ignored information rises.

I posit that the underlying assumption that a real-valued correlation factor is a good choice for three or more variables is false and loses information about the nature of the correlation itself.

Re: Effective Sample Size

I was thinking about an effective-sample-size-like notion the other day.

Shine a laser at a rough surface and you see a speckle pattern like this.

The intensity at each point can be modelled as the sum of many Gaussian variables. But if you look at the intensity at point A close to point B they are correlated. The distance from A to B has to be the size of a couple of speckle “lumps” before the correlation is small. So if you’re looking at some area with a speckle pattern on it, it makes intuitive sense to talk of an effective number of independent variables per unit area underlying that pattern. I’m not sure if this can be carried through rigorously but it seems related to what you’re talking about.

One reason I mention this is that you can think of speckle as emerging from a Feynman path integral. The speckle pattern arises from the statistics of summing over many paths from light source to surface to eye, each with a different phase. So this may connect back to notions of size mentioned way back on the n-category cafe.

Making the story fit the math

In your analysis you require the variables $Y_1, \ldots, Y_n$ to be identically distributed to the distribution of interest. To make the Elbonia surveying story fit this assumption, you’d have to send each of your surveyors to a randomly chosen region of the country, but in such a manner that the probability of a region getting a surveyor is proportional to the region’s population. (Otherwise people from regions of low population density would exert an undue influence on the results.) Then each surveyor would be instructed to measure a number of people in their assigned region (presumably with known correlation coefficients among those measurements).

Re: Making the story fit the math

Actually, I asked a bit more than I needed. It would have been enough to ask that $Y_1, \ldots, Y_n$ have the same mean and variance. (The latter condition goes by the superb name of homoscedasticity, I recently learned.) But I’m not sure that makes a substantial difference.

Re: Effective Sample Size

I just got around to reading this post. I hope to find the time to give it more thought sometime soon, but in the meantime I have a comment on one small part:

You might expect the effective size of a sample of $n$ individuals to be at most $n$. It’s not.

Personally, I wouldn’t expect this. Here’s why: Saying that the effective sample size is $k$ means that, in some sense, it gives you the same amount of information about the underlying distribution as a sample of $k$ independent individuals. The thing is, independent samples are by no means the best possible for learning a distribution. It’s better if each individual strikes some balance between being typical and being as different as possible from the previously sampled individuals. (The precise meanings of “better”, “some balance”, “typical”, and “as different as possible” all depend on each other, of course.)

For example, say $Y$ is uniformly distributed in $\{1, \ldots, N\}$. A best possible sample would be if $(Y_1, \ldots, Y_N)$ is a uniformly chosen permutation of $\{1, \ldots, N\}$. These are very much not independent. Coming at this from the opposite direction, if $Y_1, \ldots, Y_n$ are independent and uniformly chosen from $\{1, \ldots, N\}$, then you need $n \approx N \log N$ even to expect to see all the $N$ possible values of this distribution. (This is a classic problem in probability called the coupon collector’s problem.)
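
To put numbers on the coupon collector comparison, here is a small sketch using the standard closed form for the expected number of independent uniform draws needed to see all $N$ values, namely $N(1 + \frac{1}{2} + \cdots + \frac{1}{N})$, which grows like $N \log N$. The function name is made up for this illustration.

```python
from fractions import Fraction

def expected_draws_to_see_all(N):
    """Expected number of independent uniform draws from {1, ..., N}
    needed to see every value at least once: N * (1 + 1/2 + ... + 1/N)."""
    return N * sum(Fraction(1, k) for k in range(1, N + 1))

# Compare with a permutation-style sample, which needs exactly N draws.
for N in (2, 10, 100):
    print(N, float(expected_draws_to_see_all(N)))
```

A uniformly chosen permutation sees every value after exactly $N$ draws, so the ratio between the two grows with $N$.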

Re: Effective Sample Size

Isn’t it great how trainable intuition is? Isn’t it great talking to other people whose intuition is trained in directions that your own isn’t?

Your mathematical point reminds me of the following story. In the early days of the iPod, Apple were inundated with complaints that the shuffle function wasn’t truly random. Everyone kept telling them how songs by the same artiste would clump together: one Madonna song would usually be followed by another, and so on.

They had their technical people check the algorithm, and it turned out that nothing was wrong with it. All that was wrong was people’s perception of randomness. So they changed the algorithm to forbid clumping — making it less random in order to persuade humans that it was more random.

Re: Effective Sample Size

Among other things, there’s a terminological problem highlighted by the anecdote about iPods, and my sympathies lie more with the users.

Strictly speaking, any way of picking something is random: even a constant is a random variable, albeit a boring one. (As usual, xkcd has a great comment on this issue.) The trouble is that many people, including many professional probabilists who secretly know better, use “random” to mean something much stronger. Typically, a probabilist will say that $X$ is “random” in a set $\Omega$ if $X$ is uniformly distributed in $\Omega$ (assuming we’re in a context in which that even means anything), and that a sequence $X_1, X_2, \ldots$ is “random” if it is a sequence of independent (and uniform, if applicable) random variables.

Now independent sequences of random variables are a reasonable model of many real-world phenomena, and it’s true that people have very poor intuition about how such sequences behave. In particular, people underestimate how common clumping is. Among other things, this contributes to people’s tendency to ascribe winning streaks in sports or gambling to something other than a perfectly ordinary side effect of randomness. (I understand that careful studies by statisticians of sports statistics have found that “hot streaks”, about which many professional athletes have cherished superstitions, happen about as often and last about as long as independent-random-variable models would predict.)

On the other hand, this by no means means that a “random” selection of songs ought to be chosen with independent picks. It’s perfectly reasonable that a shuffle function ought to behave in a way that matches users’ intuition about randomness better than independent random variables. To make a semi-concrete proposal, if $X_1, X_2, \ldots$ are the song choices, a good shuffle algorithm ought to result in the empirical measures

$$\frac{1}{n} \sum_{i=1}^n \delta_{X_i}$$

being good approximations of the uniform measure for large $n$. In fact the classical Glivenko–Cantelli theorem says that this will be the case for independent picks, but the approximation will not be the best possible.

So from my point of view, the initial choice of an algorithm that chose successive tracks independently was a design flaw, albeit one that would probably be made by any other company.

Re: Effective Sample Size

Related to this point, I’d be very interested to find some perspective that makes sense of the possibility that a metric space has magnitude greater than its cardinality. Thinking about that might help clarify what the magnitude of a metric space means.

Re: Effective Sample Size

OK, so what we could do is:

take a positive definite metric space $X$ with magnitude larger than its cardinality (such as $0.35 K_{3, 2}$, where $K_{3, 2}$ is a complete bipartite graph, as in Example 2.4.11 of ‘The magnitude of metric spaces’)

work out some string of $n$ random variables whose correlation matrix is the similarity matrix of $X$ (which we know is possible)

understand why the effective sample size represented by those $n$ random variables is greater than $n$

use that understanding to improve our understanding of why metric magnitude can be greater than cardinality.

In the example I cited, the phenomenon of magnitude greater than cardinality only shows up at a very narrow range of scales. Specifically, it’s only for scale factors inside the range 0.345 to 0.355. So understanding why it happens at all may be difficult.

Nevertheless, it might be possible. As you know, what’s going on here is that the magnitude function $t \mapsto |t K_{3, 2}|$ has a singularity at $t = \log(2)/2 \approx 0.347$. Just to the left of that singularity, the magnitude tends to $-\infty$, and just to the right, it tends to $+\infty$.
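
The behaviour near the singularity can be explored numerically. The sketch below is an illustration under assumptions: it takes $K_{3, 2}$ with the shortest-path metric (distance $1$ between vertices on opposite sides, $2$ between distinct vertices on the same side), scales all distances by $t$, builds the matrix with entries $\exp(-t \, d(x_i, x_j))$, and sums the weighting. The helper names are made up, and the linear solver is a plain floating-point Gaussian elimination.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (floats)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        # Swap in the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def magnitude_tK32(t):
    """Magnitude of K_{3,2} scaled by t, assuming the shortest-path metric."""
    side = [0, 0, 0, 1, 1]  # three vertices on one side, two on the other

    def d(i, j):
        if i == j:
            return 0
        return 1 if side[i] != side[j] else 2

    Z = [[math.exp(-t * d(i, j)) for j in range(5)] for i in range(5)]
    return sum(solve(Z, [1.0] * 5))

t_sing = math.log(2) / 2
for t in (0.30, t_sing - 1e-3, t_sing + 1e-3, 0.35, 0.40):
    print(round(t, 4), magnitude_tK32(t))
```

Printing a few values on either side of $\log(2)/2$ shows the sign change across the singularity, and the value at $t = 0.35$ can be compared with the cardinality $5$.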