Foundations of Machine Learning

Machine Learning as Normal Science

This is a fairly gentle introduction to what is now the core theory of
machine learning, which one might operationally define as "the kind of thing
you see at NIPS
or ICML". Foundations of Machine
Learning focuses on proving that various modern algorithms, used in
contemporary problems, will work well with high probability, and showing the
tools and techniques used to craft such proofs generally. It is neither
polemical nor eclectic, and un-self-consciously so; this just is what the field
is about now. I will follow the authors' example and likewise just plunge in.

Chapter 1 introduces the idea of wanting to learn functions that generalize
well from old data to new data,
and cross-validation as a
practical tool of evaluation.

Chapter 2 explains the fundamental idea of a "probably approximately
correct" (PAC) algorithm for a classification problem: you have a collection of
\( n \) labeled data points, all of them belonging to one class or the other.
This training data consists of independent and identically distributed (IID)
draws from some fixed but unknown distribution; you can think of it as the
outcome of a well-designed survey. You also have a "hypothesis space" \( H \)
of possible prediction functions, which take data points and return labels.
I demand that you come up with a way of picking one function \( h \) from \( H
\) so that, with probability at least \( 1 -\delta \), the risk that \( h \)
mis-classifies a new data point is at most \( \epsilon \). (A variant is that
\( h \) must have a risk within \( \epsilon \) of the best function in \( H \),
which might be imperfect.) You get to respond with the sample size \(
n(\delta,\epsilon) \) you would need to do this. I in turn get to make \(
\delta \) and \( \epsilon \) (the "probably" and "approximately" in PAC) as
small as I like, and your \( n \) can't grow faster than polynomially in \(
1/\delta \) and \( 1/\epsilon \). Moreover, your guarantee has to hold no
matter what the true rule is, or the distribution which generates training
examples. Finally, your method must
be polynomial time in \( n
\), \( 1/\delta \) and \( 1/\epsilon \).

Despite all these restrictions, PAC learning is possible. Suppose, for
instance, that \( H \) consists of a finite number of rules, one of which is
right. By the law of large numbers, the rate at which each rule makes mistakes
on the training data converges to its true risk. One can strengthen this
(via Hoeffding's inequality) to say that the probability of any one rule's
"in-sample error rate" departing from its true risk by as much as \( \epsilon/2
\) is at most \( 2e^{-n\epsilon^2/2} \). If there are \( |H| \) rules in all,
the probability of such a large deviation is at most \( 2|H|e^{-n\epsilon^2/2}
\). If this bad event doesn't happen, that is, if the sample is representative for all
the rules, then the risk of the best-in-sample rule must be within \(
\epsilon/2 + \epsilon/2 = \epsilon \) of the risk of the best rule. Requiring \(
2|H|e^{-n\epsilon^2/2} \leq \delta \) and solving for \( n \) gives \( n \geq
2\epsilon^{-2}\ln{\frac{2|H|}{\delta}} \), which is polynomial as requested.
(This isn't the tightest bound possible here.)
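To make the bound concrete, here is a toy calculator of my own (not from the book); the function just inverts the inequality above:

```python
import math

def pac_sample_size(h_size, epsilon, delta):
    """Sample size sufficient for the finite-hypothesis-space bound:
    n >= (2 / epsilon^2) * ln(2 |H| / delta)."""
    return math.ceil(2 / epsilon**2 * math.log(2 * h_size / delta))
```

Note how gently \( n \) grows in \( |H| \) (logarithmically), compared to \( 1/\epsilon \) (quadratically); this is why crude union bounds over large rule collections are not as hopeless as they first look.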

The finite-hypothesis-space setting is of no intrinsic interest; teachers like
it because it does three things. First, it shows that PAC learning isn't
strictly impossible. Second, it warms students up for the work of combining
probabilistic inequalities (here, Hoeffding's inequality) with combinatorial
arguments (here, the trivial "union bound") to bound generalization error.
Third, it suggests that something fancier is going to be needed to handle
infinite hypothesis spaces. Chapter 2 includes a simple, explicit example
showing that PAC learning can happen even when \( H \) is infinite,
but doesn't say how to handle infinite hypothesis spaces in general.

This is what chapter 3 does. It
introduces Rademacher complexity, and shows
that it bounds the expected largest gap between a classifier's in-sample and
out-of-sample performance. Another application of the law of large numbers
shows that the actual gap will tend to be close to the expected gap, so true
risk can be bounded in terms of in-sample error, Rademacher complexity, and the
desired confidence level \( \delta \). When the Rademacher complexity goes to
zero as the sample size increases, we can learn. However, the Rademacher
complexity varies with the distribution of the data, which is inconvenient, and
involves optimization, which is slow. The VC
dimension is introduced as a tool for guaranteeing that the Rademacher
complexity will go to zero under all distributions. This chapter also shows
how, if the Adversary is left free to design the distribution, finite VC
dimension is necessary (not just sufficient) for learning.
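Empirical Rademacher complexity is easy enough to estimate by Monte Carlo that it's worth seeing in code; this is my own sketch with made-up sizes, for a finite hypothesis class represented by its predictions on a fixed sample:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(preds, n_draws=2000, rng=rng):
    """Monte Carlo estimate of the empirical Rademacher complexity of a
    finite hypothesis class.  `preds` is a (|H|, n) array whose rows are
    each hypothesis's +/-1 predictions on the fixed sample of size n."""
    n = preds.shape[1]
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)  # random "coin-flip" labels
        total += np.max(preds @ sigma) / n       # best correlation over H
    return total / n_draws
```

The quantity being averaged is how well the best rule in the class can correlate with pure noise; a class that can chase noise well will overfit, which is exactly the intuition the chapter's bounds formalize.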

Chapter 4 introduces linear classifiers, which categorize data points
according to which side of a straight line (plane, hyper-plane) they fall on.
The difficulty is that if there is one line which manages to classify the
training examples correctly, there are infinitely many others, even when one
uses measures like VC dimension, so initial bounds are quite pessimistic. This
leads, however, to the idea of looking at the "margin" — roughly
speaking, how far the training points are from the rule's boundary between
classes, since there are vastly fewer rules with large margins than with narrow
ones. Margin can be linked to Rademacher complexity, leading to much nicer
generalization bounds which tell us to get wide margins. Maximizing the margin
of a linear classifier leads naturally to a nice convex optimization problem (a
kind of quadratic programming), whose solution is called a "support vector
machine". These three ideas — margin analysis, the link between large
margins and low Rademacher complexity, and the importance of convex
optimization — are recurring themes for the book.
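To pin down what "margin" means here, a tiny sketch of my own (my notation, not the book's) for the geometric margin of a fixed linear rule:

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """Geometric margin of the hyperplane w.x + b = 0 on labeled data
    (X, y), with y in {-1, +1}: the smallest signed distance of any
    point to the boundary.  Positive iff every point is correctly
    classified."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

# Two points on opposite sides of the vertical line x1 = 0.
X = np.array([[2.0, 0.0], [-1.0, 3.0]])
y = np.array([1.0, -1.0])
w, b = np.array([1.0, 0.0]), 0.0
m = geometric_margin(w, b, X, y)   # the closer point is 1 unit away
```

The support vector machine is then the \( (w, b) \) maximizing this quantity over all separating hyperplanes, which is the quadratic program the chapter sets up.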

Linear methods work better in high dimensions (if the margin can be
controlled), and if one first takes nonlinear transformations of the data and
then uses linear methods in this "feature space", one gets nonlinearity in the
original data. Actually computing a huge set of non-linear transformations of
the data, however, is awkward. Chapter 5 explains the "kernel trick", which
short-cuts this by replacing linear operations in the high-dimensional feature
space with applying kernel functions to the original data. The theory for
linear machines developed already then translates very simply into theory for
kernel machines, and instead of worrying about crafting
features/transformations, one crafts kernels. Section 5.5, rather unusually,
looks at kernels for sequences, and their definition in terms of finite
automata.
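A minimal numeric check of the trick (the standard textbook example, not specifically the book's): the degree-2 polynomial kernel computes an inner product in a three-dimensional feature space without ever constructing that space.

```python
import numpy as np

def poly2_kernel(x, z):
    """Degree-2 homogeneous polynomial kernel: k(x, z) = (x . z)^2."""
    return float(np.dot(x, z)) ** 2

def phi(x):
    """Explicit feature map for 2-D inputs whose inner products
    reproduce poly2_kernel: phi(a, b) = (a^2, sqrt(2) a b, b^2)."""
    a, b = x
    return np.array([a * a, np.sqrt(2) * a * b, b * b])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
# Same number, two routes: explicit features vs. one kernel evaluation.
assert np.isclose(phi(x) @ phi(z), poly2_kernel(x, z))
```

For a degree-\( d \) kernel in high dimensions the explicit map blows up combinatorially while the kernel evaluation stays one dot product, which is the whole point.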

Chapter 6 covers boosting, one of the leading methods for combining
individually-weak predictors. Having just
written far too much about boosting, I won't
rehash the subject. This book largely agrees
with Freund and Schapire in emphasizing
margin-based analyses of why boosting works, and its interesting interpretation
as a zero-sum game between a player who tries to find data points which are
hard to classify, and another who tries to find accurate classifiers.
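For readers who haven't seen boosting in code, a bare-bones AdaBoost sketch of my own, with threshold stumps on one-dimensional data (far from the book's generality):

```python
import numpy as np

def adaboost_stumps(x, y, rounds=5):
    """Bare-bones AdaBoost with threshold stumps on 1-D data.
    Returns a list of (threshold, direction, vote-weight) triples."""
    n = len(x)
    w = np.full(n, 1.0 / n)                   # distribution over examples
    ensemble = []
    for _ in range(rounds):
        best = None
        for t in x:                            # candidate thresholds
            for s in (1.0, -1.0):              # which side counts as +1
                pred = s * np.where(x >= t, 1.0, -1.0)
                err = w[pred != y].sum()       # weighted training error
                if best is None or err < best[0]:
                    best = (err, t, s, pred)
        err, t, s, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # weak learner's vote
        w = w * np.exp(-alpha * y * pred)      # re-weight toward mistakes
        w = w / w.sum()
        ensemble.append((t, s, alpha))
    return ensemble

def adaboost_predict(ensemble, x):
    score = sum(a * s * np.where(x >= t, 1.0, -1.0) for t, s, a in ensemble)
    return np.sign(score)
```

The re-weighting line is the game mentioned above: the data-player concentrates weight on points the current classifiers get wrong, and the next weak learner is chosen against that distribution.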

Chapter 7 turns to on-line learning, where the goal is not to have low
expected future risk, but to retrospectively have done almost as well as the
in-hindsight-best hypothesis in the space. They properly emphasize that while
their earlier (and later) theorems need IID data, one can get low regret
without making any probabilistic assumptions about the data at all.
They give about equal attention to "prediction with expert advice", and to
particular on-line algorithms for classification (such as the perceptron).
They also look at risk bounds for on-line algorithms when you do have
IID data. This chapter
follows Cesa-Bianchi and
Lugosi less closely than I'd have guessed.
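The "expert advice" setting is simple enough to demo; here is my sketch of the standard multiplicative-weights ("Hedge") strategy, with parameter choices of my own:

```python
import numpy as np

def hedge(expert_losses, eta=0.5):
    """Prediction with expert advice via multiplicative weights.
    expert_losses: (T, K) array of per-round losses in [0, 1].
    Returns (learner's cumulative expected loss, best expert's loss)."""
    T, K = expert_losses.shape
    w = np.ones(K)
    learner = 0.0
    for t in range(T):
        p = w / w.sum()                        # current trust in each expert
        learner += p @ expert_losses[t]        # follow the weighted advice
        w *= np.exp(-eta * expert_losses[t])   # shrink weights of lossy experts
    return learner, expert_losses.sum(axis=0).min()
```

The guarantee is that the learner's regret, its loss minus the best expert's, is at most \( \ln K / \eta + \eta T / 8 \), and this holds for *any* loss sequence: no probabilistic assumptions about the data at all, which is the point the chapter emphasizes.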

Chapters 8 and 9 look at, respectively, more-than-binary classification
(self-explanatory) and rankings — given two new data points, say which
one is preferred to the other. (Actually, the idea of "ranking" used here is
something more like "a possibly-asymmetric relation" than even a partial
ordering.) Both require discrete predictions, like binary classification, but
with much more structure, which really needs to be taken into account. Again,
the main tools here are margin bounds and Rademacher complexity; some of the
results in these chapters are new.

Chapter 10 turns to regression, i.e., to predicting real-valued variables.
They assume that the loss is bounded, which, since they mostly work with the
squared-error loss, implies that the predictions have to be bounded. With this
restriction, they transfer the machinery of Rademacher complexity and VC
dimension with little trouble, and so analyze ordinary least squares ("Its
performance is typically poor in most applications"), as well as kernel ridge
regression, support vector
regression, the lasso, etc.
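As an illustration of one of those methods, a compact kernel ridge regression sketch (toy data and hyper-parameters of my own choosing, not the book's):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def kernel_ridge(X, y, lam=1e-3, gamma=1.0):
    """Kernel ridge regression: solve (K + lam I) alpha = y, then
    predict with f(z) = sum_i alpha_i k(x_i, z)."""
    alpha = np.linalg.solve(rbf_kernel(X, X, gamma) + lam * np.eye(len(y)), y)
    return lambda Z: rbf_kernel(Z, X, gamma) @ alpha

X = np.linspace(0.0, 6.0, 30).reshape(-1, 1)
y = np.sin(X).ravel()
f = kernel_ridge(X, y)   # smooth fit to a noiseless sine, for illustration
```

Everything happens through the kernel matrix, so the same few lines serve any kernel, which is the pay-off of the chapter 5 machinery.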
The case of unbounded range is, as they say, harder, and is left to references
in the end-notes.

So far, all the generalization error bounds have only invoked properties of
the hypothesis space, not of the algorithm used to search it. Chapter 11 thus
shifts focus to look at bounds based on algorithmic stability — the notion
that if we change just one point in the training data, we can't change the risk
of the returned hypothesis too much. If algorithms become more and more stable
as the training data grows in size, and stabilize fast enough, this alone is
enough to bound the generalization error. (The converse, that generalization
implies stability, is not true, though it
fails in
interesting and complicated ways.) The kernel algorithms advocated earlier
turn out to be sufficiently stable.
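The flavor of a stability argument can be seen empirically; here is a toy demonstration of my own with ridge regression, which is one of the uniformly stable algorithms:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Ridge regression coefficients: (X'X + lam I)^{-1} X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

w_full = ridge_fit(X, y)
y_pert = y.copy()
y_pert[0] += 5.0                  # grossly corrupt a single training label
w_pert = ridge_fit(X, y_pert)
shift = np.linalg.norm(w_full - w_pert)   # O(1/n): tiny for n = 200
```

The shift in the fitted coefficients shrinks like \( 1/n \), and that rate, not any property of the hypothesis space, is what drives the chapter's generalization bounds.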

The remaining chapters are more miscellaneous: chapter 12 looks at
dimensionality reduction, mostly principal components analysis and its
kernelized versions (including some nonlinear manifold-learning techniques) but
also a cute proof of
the Johnson-Lindenstrauss
theorem on random projections. Chapter 13 offers a glimpse of the
huge area of automata learning and
grammatical
inference. (The fascinating work
by Kontorovich, Cortes,
and Mohri on kernel methods
for learning grammars is only mentioned in passing.) Chapter 14 looks
at reinforcement
learning, building up to ideas
of stochastic
approximation
and Q-learning.
Appendices review linear algebra (A), convex optimization (B), probability (C),
and deviation inequalities (D) — though appendix D will be a review for
instructors, not students.
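The Johnson-Lindenstrauss result from chapter 12 is also easy to see numerically; a sketch of my own, with dimensions picked for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 1000, 300            # points, ambient dim, projected dim

X = rng.normal(size=(n, d))
P = rng.normal(size=(d, k)) / np.sqrt(k)   # scaled Gaussian projection
Z = X @ P

def pairwise_dists(A):
    """All pairwise Euclidean distances between the rows of A."""
    sq = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)
    return np.sqrt(sq[np.triu_indices(len(A), k=1)])

distortion = np.abs(pairwise_dists(Z) / pairwise_dists(X) - 1.0)
# With k on the order of log(n)/eps^2, all 1225 pairwise distances
# survive the 1000 -> 300 squeeze to within a small relative error.
```

Notably, the projection is chosen without looking at the data at all, which is what makes the theorem's proof short enough to be "cute".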

There are some omissions: unsupervised problems like clustering (other than
dimensionality reduction); neural networks; graphical models more
generally; minimax lower
bounds; non-IID data; Bayesian nonparametrics and PAC-Bayesian results*.
Some of these topics are absent because they haven't lent themselves to the
style of result the authors favor (though Mohri and Rostamizadeh have
written important papers
on dependent data), others perhaps are just a case of not being able to fit
everything into a mere 400 pages.

This is very much a book about the mathematical theory of learning
algorithms. (Applications are left to a dozen or so of the extensive exercises;
those exercises are generally good.) The sole focus is proving theorems.
Those theorems are nicely formulated, and the proofs are clear, well-prepared,
and (usually) insightful. Moreover, the material goes well beyond the staples
of binary classification and regression, covering lots of interesting modern
topics from a unified perspective. The implied reader is a new graduate
student in theoretical computer science (caring deeply about proofs of
correctness and knowing about computational complexity and finite-state
machines, but ignorant of statistics); a good teacher could use it with
beginning graduate students in statistics, applied math, or many other
subjects. It's not the kind of book which would fire the reader with
enthusiasm for the subject, but it is solid; it does the job it set out to do.
I can't think of any other book now on the market which I'd rather use to teach
a first course in learning theory.

Disclaimer: I asked for, and got, an examination copy of this book
from the publisher. Perhaps more importantly, I have a collaborator in common
with Mohri. But I have no stake in the book's success.

*: I am limiting myself to "things you'd see at NIPS or ICML
without boggling" topics. Of course, once upon a time "machine learning" would
have embraced much of good old fashioned AI, computational models of human and
animal learning, evolutionary search (e.g., genetic programming), inductive
logic, etc. These all got pushed to the margins of the field, or even beyond
the margin, during the revolution of the 1980s and 1990s.
(Somebody really needs to write a history of this, while the
participants are mostly still alive and their e-mail is readable.)