Undergraduate Capacity Control

Attention conservation notice: 2200+ words expounding on
some notions from statistical theory about over-fitting, plus describing a
paper on some experiments trying to apply these ideas to UW-Madison undergrads,
concluding with some speculative nihilism about common forms of trying to
understand social life, going far beyond anything actually supported by either
the general theory or the psychological experiments. Contains equations,
ugly graphs, and long quotations from an academic paper.

Mostly written several years ago, retrieved from the drafts folder for lack
of time to write anything new.

"Capacity control" is a central idea for statistical learning. The great
challenge of devising a reliable learning procedure is avoiding over-fitting
— getting models to memorize apparent features of the training data
which are really just noise, and will not generalize to new data. The
"capacity" of a class of models is, roughly speaking, how many different data
sets it can (seem to) fit well. High-capacity model classes are more flexible,
but also more prone to over-fitting. In particular — a point
insufficiently appreciated by many users — getting a really good fit
in-sample from a high capacity model class is not strong evidence that
the trained model will generalize well. Capacity control is the art of using
models that are flexible enough to get good fits, without using ones which are
so flexible that they over-fit.

There are various ways of shaping up this rough notion of "capacity" into
something more quantitative, which we can work into theorems and practical
tools. One nice one is what's called "Rademacher complexity", which goes as
follows. Our models are just simple input-output functions; we get an input \( x
\), we apply a particular model/function \( f \) to get a prediction \( f(x)
\), and we want this to be close to the actual output \( y \). The Rademacher
complexity of the class of models \( F \) is
\[
R_n(F) \equiv \mathbf{E}_{X}\left[\mathbf{E}_{Z}\left[ \sup_{f\in F}{\left|\frac{1}{n} \sum_{i=1}^{n}{Z_i f(X_i) } \right|} \right]\right]
\]
where the \( Z_i \) are a sequence of random variables, independent of each
other and everything else, and equal to +1 or -1 with equal probability,
a.k.a. "Rademacher random variables". (It's customary to include an over-all
factor of 2 in this definition, but I have my reasons for leaving it out here.)

To figure out what's going on here, start with the inner-most bit:
\[
\left|\frac{1}{n} \sum_{i=1}^{n}{Z_i f(X_i) } \right|
\]
This is the magnitude of the sample covariance between a particular realization
of the binary noise \( Z \), and the predictions of a particular model \( f \)
on a particular set of input values \( X \). Next, holding fixed the noise and
the inputs, we ask how big that covariance could appear to get over the whole
model class:
\[
\sup_{f\in F}{\left|\frac{1}{n} \sum_{i=1}^{n}{Z_i f(X_i)} \right|}
\]
Then we average over realizations of the binary noise:
\[
\mathbf{E}_{Z}\left[ \sup_{f\in F}{\left|\frac{1}{n} \sum_{i=1}^{n}{Z_i f(X_i)} \right|} \right]
\]
This tells us, holding the inputs fixed, how well our model class would
typically seem to predict binary noise. This is sometimes called the
"empirical Rademacher complexity", because it's conditional on a particular
training sample. Finally, we average over input training data. Generally,
doing that last step as an actual calculation is hard, but you can show that
the probability of the empirical Rademacher complexity fluctuating away from
its expected value goes to zero exponentially fast in the sample size \( n \),
so we can typically avoid the expectation over \( X \) at a small cost in
approximation.
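The definition translates directly into a Monte Carlo calculation. Here is a minimal sketch in Python (the threshold-classifier class and all the function names are my illustrative choices, not anything canonical): holding the inputs fixed, draw many sign vectors, and for each one record the best apparent fit over the model class.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(X, predict_all, n_draws=2000, rng=rng):
    """Monte Carlo estimate of the empirical Rademacher complexity:
    the average, over random sign vectors Z, of the largest magnitude
    of (1/n) sum_i Z_i f(X_i) achievable by any f in the class."""
    n = len(X)
    preds = predict_all(X)  # shape (num_models, n): row j holds f_j(X_i)
    total = 0.0
    for _ in range(n_draws):
        Z = rng.choice([-1.0, 1.0], size=n)      # Rademacher noise
        total += np.max(np.abs(preds @ Z)) / n   # sup over the class
    return total / n_draws

# An illustrative model class: threshold classifiers
# f_t(x) = +1 if x > t, else -1, for thresholds t on a grid.
def threshold_preds(X, thresholds=np.linspace(0, 1, 101)):
    return np.where(X[None, :] > thresholds[:, None], 1.0, -1.0)

X = rng.uniform(0, 1, size=40)
r = empirical_rademacher(X, threshold_preds)
print(r)  # well below 1: even the best threshold can't track pure noise
```

Averaging such estimates over many fresh draws of the inputs would approximate \( R_n(F) \) itself.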

Now, if the class \( F \) just contained a single function, the business
with the supremum wouldn't matter, and the Rademacher complexity would just be the
expected magnitude of the sample covariance of the function with the noise.
Then \(R_n(F) \) has to go to zero as \( n \) grows, because, by the law of
large numbers, the sample covariance is converging to the true covariance,
which is zero. In fact if \( F \) contains only finitely many functions, \(
R_n(F) \) has to go to zero because all the sample covariances become, with
arbitrarily high probability, arbitrarily close to the true covariance, which
is zero. But if there are infinitely many functions in \( F \), there might be
trouble. How much trouble depends on, precisely, how many different sets of
outputs we can construct for typical inputs.
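To make the single-function case quantitative: if \( f \) only outputs +1 or -1 (the binary-classifier setting used below), then, since the \( Z_i \) are independent with mean zero, Jensen's inequality gives
\[
\mathbf{E}_{Z}\left[ \left|\frac{1}{n} \sum_{i=1}^{n}{Z_i f(X_i)} \right| \right] \leq \sqrt{\mathbf{E}_{Z}\left[ \left(\frac{1}{n} \sum_{i=1}^{n}{Z_i f(X_i)}\right)^2 \right]} = \sqrt{\frac{1}{n^2} \sum_{i=1}^{n}{f^2(X_i)}} = \frac{1}{\sqrt{n}}
\]
since the cross terms \( \mathbf{E}\left[Z_i Z_j\right] \) vanish for \( i \neq j \). So a one-function class has Rademacher complexity at most \( 1/\sqrt{n} \).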

Why this is a useful way of quantifying model capacity
is a bit more
involved than I feel like explaining, but the intuition, hopefully, is
clear: a class of models which could give a good fit to pure noise is a class
of models ripe for over-fitting. If on the other hand the Rademacher
complexity of your class of models is low, and you nonetheless got a good fit
to your training data, that is actually evidence that you will continue to fit
well on new data. There's a simple formula if data are independent draws from
a constant (but otherwise arbitrary) distribution, the functions \( f \) also
have to output +1 or -1 (i.e., are binary classifiers), and we measure error by
the probability of mis-classification. With probability at least \( 1-h \)
under the true data-generating distribution, for all \( f \),
\[
\mathrm{error}(f) - \widehat{\mathrm{error}}(f) \leq R_n(F) + \sqrt{\frac{\ln{1/h}}{2n}}
\]
The left-hand side of this bound is the amount by which the true error rate
exceeds the in-sample error rate, i.e., the amount by which the model
over-fits. The right-hand side combines the Rademacher complexity, which says
how well the model class can seem to fit noise, plus a term which grows as we
demand a higher level of confidence (smaller \( h \)). The lower the
Rademacher complexity, the tighter the bounds we can put on the amount of
over-fitting. Turned around, the probability that the amount of over-fitting
exceeds the Rademacher complexity by any given amount is exponentially small.
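Plugging numbers into the bound is straightforward; the sketch below uses made-up values for \( R_n(F) \) (computing that is the hard part) just to show how the confidence term behaves as \( h \) shrinks.

```python
import math

def overfit_bound(rademacher, n, h):
    """Bound on error(f) - in-sample error(f), holding simultaneously for
    every f in the class with probability at least 1 - h."""
    return rademacher + math.sqrt(math.log(1.0 / h) / (2.0 * n))

# Hypothetical numbers: R_n(F) = 0.1 at sample size n = 100.
for h in (0.1, 0.01, 0.001):
    print(h, round(overfit_bound(0.1, 100, h), 3))
```

Because the confidence term grows only like \( \sqrt{\ln{1/h}} \), demanding much higher confidence costs surprisingly little.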

These bounds make no reference to the actual learning process
— they just depend on the kind of model available, and on how much
flexibility those models have when presented with typical data from the
distribution. The benefit of this is that we don't have to worry
about how our learner works, just the kind of things it can learn.
The drawback is that the bounds might be too conservative — maybe
this learning procedure, on this sort of data, is much less
prone to over-fitting than the bounds indicate.

Which brings me to a paper that applies these ideas to human learners: Xiaojin Zhu, Bryan R. Gibson and Timothy T. Rogers, "Human Rademacher Complexity" (NIPS 2009). The abstract:

Abstract: We propose to use Rademacher complexity, originally developed in computational learning theory, as a measure of human learning capacity. Rademacher complexity measures a learner's ability to fit random labels, and can be used to bound the learner's true error based on the observed training sample error. We first review the definition of Rademacher complexity and its generalization bound. We then describe a "learning the noise" procedure to experimentally measure human Rademacher complexities. The results from empirical studies showed that: (i) human Rademacher complexity can be successfully measured, (ii) the complexity depends on the domain and training sample size in intuitive ways, (iii) human learning respects the generalization bounds, (iv) the bounds can be useful in predicting the danger of overfitting in human learning. Finally, we discuss the potential applications of human Rademacher complexity in cognitive science.

The procedure for estimating the Rademacher complexity of a human learner is
ingenious. They gave each subject \( n \) labeled examples drawn from some
domain, but with labels ("A" and "B" rather than -1 and +1) which were assigned
completely randomly --- that is, the labels were Rademacher noise. After
letting the subjects study the examples, they were distracted by being asked to
do some arithmetic problems involving 10-digit numbers, and then given their \(
n \) training examples again, but unlabeled, in a different, randomized order,
and without being told these were the same examples. The subjects were asked
to classify them, and the empirical Rademacher complexity for the subject was
taken to be the sample covariance between guesses and labels; this was averaged
over subjects.
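In symbols, each subject plays the role of a model \( \hat{f} \): their per-subject score is \( \left|\frac{1}{n}\sum_{i=1}^{n}{Z_i \hat{f}(X_i)}\right| \), with \( \hat{f}(X_i) \) their recalled guess for item \( X_i \). A toy sketch of the estimator (the subject behavior is simulated; nothing here is from the paper's data):

```python
import numpy as np

rng = np.random.default_rng(1)

def subject_complexity(labels, guesses):
    """Per-subject estimate from the 'learning the noise' task: the magnitude
    of the mean product of the random +/-1 labels and the subject's guesses."""
    labels = np.asarray(labels, dtype=float)
    guesses = np.asarray(guesses, dtype=float)
    return abs(np.mean(labels * guesses))

n = 40
labels = rng.choice([-1, 1], size=n)  # the Rademacher noise shown in training

# Two simulated extremes: perfect memorization vs. pure guessing.
perfect = subject_complexity(labels, labels)
guessing = subject_complexity(labels, rng.choice([-1, 1], size=n))
print(perfect, guessing)
```

Averaging such per-subject scores is what yields the empirical human Rademacher complexity.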

This task was inflicted on 80 undergrads (presumably taking Intro. Psych. at
Madison), with two different domains. Here I'll just quote:

The "Shape" Domain ... consists of 321 computer-generated 3D shapes
... parametrized by a real number \( x \in [0, 1] \), such that small \( x \)
produces spiky shapes, while large \( x \) produces smooth ones.... It is
important to note that this internal structure is unnecessary to the definition
or measurement of Rademacher complexity per se. .... The "Word"
Domain ... consists of 321 English words. We start with the Wisconsin
Perceptual Attribute Ratings Database ... which contains words rated by 350
undergraduates for their emotional valence. We sort the words by their emotion
valence, and take the 161 most positive and the 160 most negative ones for use
in the study. ... Participants have rich knowledge about this domain. The
size of the domain for shapes and words was matched to facilitate comparison.

As they note, in both domains, Rademacher complexity shrinks as \( n \)
grows, so learning should be possible.

when \( n = 5 \), our interviews show that, in both domains, 9 out of 10
participants offered some spurious rules of the random labels. For example,
one subject thought the shape categories were determined by whether the shape
"faces" downward; another thought the word categories indicated whether the
word contains the letter T. Such beliefs, though helpful in learning the
particular training samples, amount to over-fitting the noise. In contrast,
when \( n = 40 \), about half the participants indicated that they believed the
labels to be random, as spurious "rules" are more difficult to find.

Also, the students had higher Rademacher complexity for words, with which
they are very familiar, than for strange synthetic shapes:

The higher complexity indicates that, for the same sample sizes, participants are better able to find spurious explanations of the training data for the Words than for the Shapes. Two distinct strategies were apparent in the Word domain interviews: (i) Some participants created mnemonics. For example, one subject received the training sample (grenade, B), (skull, A), (conflict, A), (meadow, B), (queen, B), and came up with the following story: "a queen was sitting in a meadow and then a grenade was thrown (B = before), then this started a conflict ending in bodies & skulls (A = after)." (ii) Other participants came up with idiosyncratic, but often imperfect, rules. For instance, whether the item "tastes good," "relates to motel service," or "physical vs. abstract."

Having measured the Rademacher complexity of Madison undergrads, Zhu et
al. then took the same domains and gave the students some actual prediction
problems where generalization was possible: separating spiky from not-so-spiky
shapes, separating shapes of medium spikiness from those which are either very
spiky or very smooth, separating positive-valence words from those of negative
valence, and separating words which are over five letters long from shorter
words. (They did not tell them that these were the categories, of course.)
This lets them compare the actual generalization performance --- i.e., the
error rate on new data, not seen during training --- from the bounds implied by
the Rademacher complexity.

It is not interesting that the humans' generalization error is lower than
the theoretical bound on the generalization error; that goes along with being
an upper bound. What's interesting is that Rademacher complexity nonetheless
predicts how much over-fitting the students will do. It's not surprising that
they do better at guessing the rules when they've seen 40 examples than when
they've seen 5, but that seeing 40 word examples leads to more error than
seeing 40 shapes is remarkable. What would be very nice, but doesn't
appear here, would be experiments where domain, distribution and sample size
were all different, but adjusted so that the human Rademacher complexity was
equal, which would let us tease out how much that complexity as such predicts
human learning.

While we await such experiments, I will speculate irresponsibly. I have
long been skeptical about the sort of journalism or scholarship which looks at
some trend and purports to tell us what it Really Means — often, in the
more scholarly versions, the Deep Social Forces behind it, or the Hidden
Ideological Motives. (See, e.g., the "Trees"
section here.) Even when the
trend actually exists, this seems to me far too vulnerable to Just So
story-telling: the explainers could provide equally compelling stories no
matter what actually happened. To put it in terms of this paper, everyone is
really, really familiar with the social life of the East African Plains Ape,
yet we have very few truly independent historical examples, so our capacity to
spin such stories is very large and \( n \) is very small, so \( R_n \) has to
be very large, and none of our stories is going to generalize well. This is
also part of what Duncan Watts is
saying in Everything Is Obvious,
Once You Know the Answer, about how, once they hear
social-scientific findings, people find it easy to rationalize them as obvious
--- but would be equally capable of coming up with equally-convincing
rationalizations for the opposite findings. Hence the complete lack of
substance in most of what passes for commentary on our common life. (Thomas
Friedman is clearly a
queen-in-a-meadow-with-grenades-and-skulls
guy, while David Brooks is
more "relates
to motel service".) Some of these stories may even be right, but pointing
to how compellingly they explain everything we see is actually not very good
evidence for them.