Abstract

Machine Learning (ML) is one of the most exciting and dynamic areas of modern research and application. The purpose of this review is to provide an introduction to the core concepts and tools of machine learning in a manner easily understood and intuitive to physicists. The review begins by covering fundamental concepts in ML and modern statistics such as the bias-variance tradeoff, overfitting, regularization, and generalization before moving on to more advanced topics in both supervised and unsupervised learning. Topics covered in the review include ensemble models, deep learning and neural networks, clustering and data visualization, energy-based models (including MaxEnt models and Restricted Boltzmann Machines), and variational methods. Throughout, we emphasize the many natural connections between ML and statistical physics. A notable aspect of the review is the use of Jupyter notebooks to introduce modern ML/statistical packages to readers using physics-inspired datasets (the Ising Model and Monte-Carlo simulations of supersymmetric decays of proton-proton collisions). We conclude with an extended outlook discussing possible uses of machine learning for furthering our understanding of the physical world as well as open problems in ML where physicists maybe able to contribute.

I Introduction

Machine Learning (ML), data science, and statistics are fields that describe how to learn from, and make predictions about, data. The availability of big datasets is a hallmark of modern science, including physics, where data analysis has become an important component of diverse areas, such as experimental particle physics, observational astronomy and cosmology, condensed matter physics, biophysics, and quantum computing. Moreover, ML and data science are playing increasingly important roles in many aspects of modern technology, ranging from biotechnology to the engineering of self-driving cars and smart devices. Therefore, having a thorough grasp of the concepts and tools used in ML is an important skill that is increasingly relevant in the physical sciences.

The purpose of this review is to serve as an introduction to foundational and state-of-the-art techniques in ML and data science for physicists. The review seeks to find a middle ground between a short overview and a full-length textbook. While there exist many wonderful ML textbooks Friedman et al. (2001); Murphy (2012); Bishop (2006); Abu-Mostafa et al. (2012), they are lengthy and use specialized language that is often unfamiliar to physicists. This review builds upon the considerable knowledge most physicists already possess in statistical physics in order to introduce many of the major ideas and techniques used in modern ML. We take a physics-inspired pedagogical approach, emphasizing simple examples (e.g., regression and clustering), before delving into more advanced topics. The intention of this review and the accompanying Jupyter notebooks (available at https://physics.bu.edu/~pankajm/MLnotebooks.html) is to give the reader the requisite background knowledge to follow and apply these techniques to their own areas of interest.

While this review is written with a physics background in mind, we aim for it to be useful to anyone with some background in statistical physics, and it is suitable for both graduate students and researchers as well as advanced undergraduates. The review is based on an advanced topics graduate course taught at Boston University in Fall of 2016. As such, it assumes a level of familiarity with several topics found in graduate physics curricula (partition functions, statistical mechanics) and a fluency in mathematical techniques such as linear algebra, multi-variate calculus, variational methods, probability theory, and Monte-Carlo methods. It also assumes a familiarity with basic computer programming and algorithmic design.

i.1 What is Machine Learning?

Most physicists learn the basics of classical statistics early on in undergraduate laboratory courses. Classical statistics is primarily concerned with how to use data to estimate the value of an unknown quantity. For instance, estimating the speed of light using measurements obtained with an interferometer is one such example that relies heavily on techniques from statistics.

Machine Learning is a subfield of artificial intelligence with the goal of developing algorithms capable of learning from data automatically. In particular, an artificially intelligent agent needs to be able to recognize objects in its surroundings and predict the behavior of its environment in order to make informed choices. Therefore, techniques in ML tend to be more focused on prediction rather than estimation. For example, how do we use data from the interferometry experiment to predict what interference pattern would be observed under a different experimental setup? In addition, methods from ML tend to be applied to more complex high-dimensional problems than those typically encountered in a classical statistics course.

Despite these differences, estimation and prediction problems can be cast into a common conceptual framework. In both cases, we choose some observable quantity x of the system we are studying (e.g., an interference pattern) that is related to some parameters w (e.g., the speed of light) of a model p(x|w) that describes the probability of observing x given w. Now, we perform an experiment to obtain a dataset X and use these data to fit the model. Typically, “fitting” the model involves finding ^w that provides the best explanation for the data – in this case, that means that the estimated parameters maximize the probability of observing the data (i.e., ^w=argmaxw{p(X|w)}). Estimation problems are concerned with the accuracy of ^w, whereas prediction problems are concerned with the ability of the model to predict new observations (i.e., the accuracy of p(x|^w)). Although the goals of estimation and prediction are related, they often lead to different approaches. As this review is aimed as an introduction to the concepts of ML, we will focus on prediction problems and refer the reader to one of many excellent textbooks on classical statistics for more information on estimation Lehmann and Casella (2006); Lehmann and Romano (2006); Witte and Witte (2013); Wasserman (2013).

i.2 Why study Machine Learning?

The last three decades have seen an unprecedented increase in our ability to generate and analyze large data sets. This “big data” revolution has been spurred by an exponential increase in computing power and memory commonly known as Moore’s law. Computations that were unthinkable a few decades ago can now be routinely performed on laptops. Specialized computing machines (such as GPU-based machines) are continuing this trend towards cheap, large-scale computation, suggesting that the “big data” revolution is here to stay.

This increase in our computational ability has been accompanied by new techniques for analyzing and learning from large datasets. These techniques draw heavily from ideas in statistics, computational neuroscience, computer science, and physics. Similar to physics, modern ML places a premium on empirical results and intuition over the more formal treatments common in statistics, computer science, and mathematics. This is not to say that proofs are not important or undesirable. Rather, many of the advances of the last two decades – especially in fields like deep learning – do not have formal justifications (much like there still exists no mathematically well-defined concept of the Feynman path-integral in d>1).

Physicists are uniquely situated to benefit from and contribute to ML. Many of the core concepts and techniques used in ML – such as Monte-Carlo methods, simulated annealing, variational methods – have their origins in physics. Moreover, “energy-based models” inspired by statistical physics are the backbone of many deep learning methods. For these reasons, there is much in modern ML that will be familiar to physicists.

Physicists and astronomers have also been at the forefront of using “big data”. For example, experiments such as CMS and ATLAS at the LHC generate petabytes of data per year. In astronomy, projects such as the Sloan Digital Sky Survey (SDSS) routinely analyze and release hundreds of terabytes of data measuring the properties of near a billion stars and galaxies. Researchers in these fields are increasingly incorporating recent advances in ML and data science, and this trend is likely to accelerate in the future.

Besides applications to physics, part of the goal of this review is to serve as an introductory resource for those looking to transition to more industry-oriented projects. Physicists have already made many important contributions to modern big data applications in an industrial setting Metz (2017). Data scientists and ML engineers in industry use concepts and tools developed for ML to gain insight from large datasets. A familiarity with ML is a prerequisite for many of the most exciting employment opportunities in the field, and we hope this review will serve as a useful introduction to ML for physicists beyond an academic setting.

i.3 Scope and structure of the review

Any review on ML must simultaneously accomplish two related but distinct goals. First, it must convey the rich theoretical foundations underlying modern ML. This task is made especially difficult because ML is very broad and interdisciplinary, drawing on ideas and intuitions from many fields including statistics, computational neuroscience, and physics. Unfortunately, this means making choices about what theoretical ideas to include in the review. This review emphasizes connections with statistical physics, physics-inspired Bayesian inference, and computational neuroscience models. Thus, certain ideas (e.g., gradient descent, expectation maximization, variational methods, and deep learning and neural networks) are covered extensively, while other important ideas are given less attention or even omitted entirely (e.g., statistical learning, support vector machines, kernel methods, Gaussian processes). Second, any ML review must give the reader the practical know-how to start using the tools and concepts of ML for practical problems. To accomplish this, we have written a series of Jupyter notebooks to accompany this review. The notebooks introduce the nuts-and-bolts of how to use, code, and implement the methods introduced in the main text. Luckily, there are numerous great ML software packages available in Python (scikit-learn, tensorflow, Pytorch, Keras) and we have made extensive use of them. We have also made use of a new package, Paysage, for energy-based generative models which has been co-developed by one of the authors (CKF) and maintained by Unlearn.AI (a company affiliated with two of the authors: CKF and PM). The purpose of the notebooks is to both familiarize physicists with these resources and to serve as a starting point for experimenting and playing with ideas.

ML can be divided into three broad categories: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning concerns learning from labeled data (for example, a collection of pictures labeled as containing a cat or not containing a cat). Common supervised learning tasks include classification and regression. Unsupervised learning is concerned with finding patterns and structure in unlabeled data. Examples of unsupervised learning include clustering, dimensionality reduction, and generative modeling. Finally, in reinforcement learning an agent learns by interacting with an environment and changing its behavior to maximize its reward. For example, a robot can be trained to navigate in a complex environment by assigning a high reward to actions that help the robot reach a desired destination. We refer the interested reader to the classic book by Sutton and Barto Reinforcement Learning: an IntroductionSutton and Barto (1998). While useful, the distinction between the three types of ML is sometimes fuzzy and fluid, and many applications often combine them in novel and interesting ways. For example, the recent success of Google DeepMind in developing ML algorithms that excel at tasks such as playing Go and video games employ deep reinforcement learning, combining reinforcement learning with supervised learning methods based on deep neural networks.

Here, we limit our focus to supervised and unsupervised learning. The literature on reinforcement learning is extensive and uses ideas and concepts that, to a large degree, are distinct from supervised and unsupervised learning tasks. For this reason, to ensure cohesiveness and limit the length of this review, we have chosen not to discuss reinforcement learning. However, this omission should not be mistaken for a value judgement on the utility of reinforcement learning for solving physical problems. For example, some of the authors have used inspiration from reinforcement learning to tackle difficult problems in quantum control Bukov et al. (2017).

In writing this review, we have tried to adopt a style that reflects what we consider to be the best of the physics tradition. Physicists understand the importance of well-chosen examples for furthering our understanding. It is hard to imagine a graduate course in statistical physics without the Ising model. Each new concept that is introduced in statistical physics (mean-field theory, transfer matrix techniques, high- and low-temperature expansions, the renormalization group, etc.) is applied to the Ising model. This allows for the progressive building of intuition and ultimately a coherent picture of statistical physics. We have tried to replicate this pedagogical approach in this review by focusing on a few well-chosen techniques – linear and logistic regression in the case of supervised learning and clustering in the case of unsupervised learning – to introduce the major theoretical concepts.

In this same spirit, we have chosen three interesting datasets with which to illustrate the various algorithms discussed here. (i) The SUSY data set consists of 5,000,000 Monte-Carlo samples of proton-proton collisions decaying to either signal or background processes, which are both parametrized with 18 features. The signal process is the production of electrically-charged supersymmetric particles, which decay to W bosons and an electrically-neutral supersymmetric particle, invisible to the detector, while the background processes are various decays involving only Standard Model particles Baldi et al. (2014). (ii) The Ising data set consists of 104 states of the 2D Ising model on a 40×40 square lattice, obtained using Monte-Carlo (MC) sampling at a few fixed temperatures T. (iii) The MNIST dataset comprises 70000 handwritten digits, each of which comes in a square image, divided into a 28×28 pixel grid. The first two datasets were chosen to reflect the various sub-disciplines of physics (high-energy experiment, condensed matter) where we foresee techniques from ML becoming an increasingly important tool for research. The MNIST dataset, on the other hand, introduces the flavor of present-day ML problems. By re-analyzing the same datasets with multiple techniques, we hope readers will be able to get a sense of the various, inevitable trade-offs involved in choosing how to analyze data. Certain techniques work better when data is limited while others may be better suited to large data sets with many features. A short description of these datasets are given in the Appendix.

This review draws generously on many wonderful textbooks on ML and we encourage the reader to consult them for further information. They include Abu Mostafa’s masterful Learning from Data, which introduces the basic concepts of statistical learning theory Abu-Mostafa et al. (2012), the more advanced but equally good The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman Friedman et al. (2001), Michael Neilsen’s indispensable Neural Networks and Deep Learning which serves as a wonderful introduction to the neural networks and deep learning Nielsen (2015) and David MacKay’s outstanding Information Theory, Inference, and Learning Algorithms which introduced Bayesian inference and information theory to a whole generation of physicists MacKay (2003). More comprehensive (and much longer) books on modern ML techniques include Christopher Bishop’s classic Pattern Recognition and Machine LearningChristopher (2016) and the more recently published Machine Learning: A Probabilistic Perspective by Kevin Murphy Murphy (2012). Finally, one of the great successes of modern ML is deep learning, and some of the pioneers of this field have written a textbook for students and researchers entitled Deep LearningGoodfellow et al. (2016). In addition to these textbooks, we have consulted numerous research papers, reviews, and web resources. Whenever possible, we have tried to point the reader to key papers and other references that we have found useful in preparing this review. However, we are neither capable of nor have we made any effort to make a comprehensive review of the literature.

The review is organized as follows. We begin by introducing polynomial regression as a simple example that highlights many of the core ideas of ML. The next few chapters introduce the language and major concepts needed to make these ideas more precise including tools from statistical learning theory such as overfitting, the bias-variance tradeoff, regularization, and the basics of Bayesian inference. The next chapter builds on these examples to discuss stochastic gradient descent and its generalizations. We then apply these concepts to linear and logistic regression, followed by a detour to discuss how we can combine multiple statistical techniques to improve supervised learning, introducing bagging, boosting, random forests, and XG Boost. These ideas, though fairly technical, lie at the root of many of the advances in ML over the last decade. The review continues with a thorough discussion of supervised deep learning and neural networks, as well as convolutional nets. We then turn our focus to unsupervised learning. We start with data visualization and dimensionality reduction before proceeding to a detailed treatment of clustering. Our discussion of clustering naturally leads to an examination of variational methods and their close relationship with mean-field theory.
The review continues with a discussion of deep unsupervised learning, focusing on energy-based models, such as Restricted Boltzmann Machines (RBMs) and Deep Boltzmann Machines (DBMs). Then we discuss two new and extremely popular modeling frameworks for unsupervised learning, generative adversarial networks (GANs) and variational autoencoders (VAEs). We conclude the review with an outlook and discussion of promising research directions at the intersection physics and ML.

Ii Why is Machine Learning difficult?

Figure 1: Fitting versus predicting for noiseless data. Ntrain=10 points in the range x∈[0,1] were generated from a linear model (top) or tenth-order polynomial (bottom). This data was fit using three model classes: linear models (red), all polynomials of order 3 (yellow), all polynomials of order 10 (green) and used to make prediction on Ntest=20 new data points with xtest∈[0,1.2] (shown on right). Notice that in the absence of noise (σ=0), given enough data points that fitting and predicting are identical.

ii.1 Setting up a problem in ML and data science

Almost every problem in ML and data science starts with the same ingredients. The first ingredient is the dataset X. The second is the model g(w), which is a function of the parameters w. The final ingredient is the cost function C(X,g(w)) that allows us to judge how well the model g(w) explains, or in general performs on, the observations X. The model is fit by finding the value of w that minimizes the cost function. For example, one commonly used cost function is the squared error. Minimizing the squared error cost function is known as the method of least squares, and is typically appropriate for experiments with Gaussian measurement errors.

ML researchers and data scientists follow a standard recipe to obtain models that are useful for prediction problems. We will see why this is necessary in the following sections, but it is useful to present the recipe up front to provide context. The first step in the analysis is to randomly divide the dataset X into two mutually exclusive groups Xtrain and Xtest called the training and test sets. The fact that this must be the first step should be heavily emphasized – performing some analysis (such as using the data to select important variables) before partitioning the data is a common pitfall that can lead to incorrect conclusions. Typically, the majority of the data are partitioned into the training set (e.g., 90%) with the remainder going into the test set. The model is fit by minimizing the cost function using only the data in the training set ^w=argminw{C(Xtrain,g(w))}. Finally, the performance of the model is evaluated by computing the cost function using the test set C(Xtest,g(^w)). The value of the cost function for the best fit model on the training set is called the in-sample error Ein=C(Xtrain,g(^w)) and the value of the cost function on the test set is called the out-of-sample error Eout=C(Xtest,g(^w)).

One of the most important observations we can make is that the out-of-sample error is almost always greater than the in-sample error, i.e. Eout≥Ein. We explore this point further in Sec. VI and its accompanying notebook. Splitting the data into mutually exclusive training and test sets provides an unbiased estimate for the predictive performance of the model – this is known as cross-validation in the ML and statistics literature. In many applications of classical statistics, we start with a mathematical model that we assume to be true (e.g., we may assume that Hooke’s law is true if we are observing a mass-spring system) and our goal is to estimate the value of some unknown model parameters (e.g., we do not know the value of the spring stiffness). Problems in ML, by contrast, typically involve inference about complex systems where we do not know the exact form of the mathematical model that describes the system. Therefore, it is not uncommon for ML researchers to have multiple candidate models that need to be compared. This comparison is usually done using Eout and the model that minimizes this out-of-sample error is chosen as the best model (i.e. model selection). Note that once we select the best model on the basis of its performance on Eout, the real-world performance of the winning model should be expected to be slightly worse because the test data was now used in the fitting procedure.

ii.2 Polynomial Regression

In the previous section, we mentioned that multiple candidate models are typically compared using the out-of-sample error Eout. It may be at first surprising that the model that has the lowest out-of-sample error Eout usually does not have the lowest in-sample error Ein. Therefore, if our goal is to obtain a model that is useful for prediction we do not want to choose the model that provides the best explanation for the current observations. At first glance, the observation that the model providing the best explanation for the current dataset probably will not provide the best explanation for future datasets is very counter-intuitive.

Moreover, the discrepancy between Ein and Eout becomes more and more important, as the complexity of our data, and the models we use to make predictions, grows. As the number of parameters in the model increases, we are forced to work in high-dimensional spaces. The “curse of dimensionality” ensures that many phenomena that are absent or rare in low-dimensional spaces become generic. For example, the nature of distance changes in high dimensions, as evidenced in the derivation of the Maxwell distribution in statistical physics where the fact that all the volume of a d-dimensional sphere of radius r is contained in a small spherical shell around r is exploited. Almost all critical points of a function (i.e., the points where all derivatives vanish) are saddles rather than maxima or minima (an observation first made in physics in the context of the p-spin spherical spin glass). For all these reasons, it turns out that for complicated models studied in ML, predicting and fitting are very different things.

Figure 2: Fitting versus predicting for noisy data. Ntrain=100 noisy data points (σ=1) in the range x∈[0,1] were generated from a linear model (top) or tenth-order polynomial (bottom). This data was fit using three model classes: linear models (red), all polynomials of order 3 (yellow), all polynomials of order 10 (green) and used to make prediction on Ntest=20 new data points with xtest∈[0,1.2](shown on right). Notice that even when the data was generated using a tenth order polynomial, the linear and third order polynomials give better out-of-sample predictions, especially beyond the x range over which the model was trained.

To develop some intuition about why we need to pay close attention to out-of-sample performance, we will consider a simple one-dimensional problem – polynomial regression. Our task is a simple one, fitting data with polynomials of different order. We will explore how our ability to predict depends on the number of data points we have, the “noise” in the data generation process, and our prior knowledge about the system. The goal is to build intuition about why prediction is difficult in preparation for introducing general strategies that overcome these difficulties.

Before reading the rest of the section, we strongly encourage the reader to read Notebook 1 and complete the accompanying exercises.

Consider a probabilistic process that assigns a label yi to an observation xi. The data are generated by drawing samples from the equation

yi=f(xi)+ηi,

(1)

where f(xi) is some fixed (but possibly unknown) function, and ηi is a Gaussian, uncorrelated noise variable, such that

⟨ηi⟩=0,

⟨ηiηj⟩=δijσ2.

We will refer to the f(xi) as the function used to generate the data, and σ as the noise strength. The larger σ is the noisier the data; σ=0 corresponds to the noiseless case.

To make predictions, we will consider a family of functions gα(x;wα) that depend on some parameters wα. These functions represent the model class that we are using to model the data and make predictions. Note that we choose the model class without knowing the function f(x). The gα(x;wα) encode the features we choose to represent the data. In the case of polynomial regression we will consider three different model classes: (i) all polynomials of order 1 which we denote by g1(x;w1), (ii) all polynomials up to order 3 which we denote by g3(x;w3), and (iii) all polynomials of order 10, g10(x;w10). Notice that these three model classes contain different number of parameters. Whereas g1(x;w1) has only two parameters (the coefficients of the zeroth and first order terms in the polynomial), g3(x;w3) and g10(x;w10) have four and eleven parameters, respectively. This reflects the fact that these three models have different model complexities. If we think of each term in the polynomial as a “feature” in our model, then increasing the order of the polynomial we fit increases the number of features. Using a more complex model class may give us better predictive power, but only if we have enough statistical power to accurately learn the model parameters associated with these extra features from the training dataset.

To learn the parameters wα, we will train our models on a training dataset and then test the effectiveness of the model on a different dataset, the test dataset. Since we are interested only in gaining intuition, we will simply plot the fitted polynomials and compare the predictions of our fits for the test data with the true values. As we will see below, the models that give the best fit to existing data do not necessarily make the best predictions even for a simple task like polynomial regression.

To illustrate these ideas, we encourage the reader to experiment with the accompanying notebook to generate data using a linear function f(x)=2x and a tenth order polynomial f(x)=2x−10x5+15x10 and ask how the size of the training dataset Ntrain and the noise strength σ affect the ability to make predictions. Obviously, more data and less noise leads to better predictions. To train the models (linear, third-order, tenth-order), we uniformly sampled the interval x∈[0,1] and constructed Ntrain training examples using (1). We then fit the models on these training samples using standard least-squares regression. To visualize the performance of the three models, we plot the predictions using the best fit parameters for a test set where x are drawn uniformly from the interval x∈[0,1.2]. Notice that the test interval is slightly larger than the training interval.

Figure 1 shows the results of this procedure for the noiseless case, σ=0. Even using a small training set with Ntrain=10 examples, we find that the model class that generated the data also provides the best fit and the most accurate out-of-sample predictions. That is, the linear model performs the best for data generated from a linear polynomial (the third and tenth order polynomials perform similarly), and the tenth order model performs the best for data generated from a tenth order polynomial. While this may be expected, the results are quite different for larger noise strengths.

Figure 3: Fitting versus predicting for noisy data. Ntrain=104 noisy data points (σ=1) in the range x∈[0,1] were generated from a tenth-order polynomial. This data was fit using three model classes: linear models (red), all polynomials of order 3 (yellow), all polynomials of order 10 (green) and used to make prediction on Ntest=100 new data points with xtest∈[0,1.2](shown on right). The tenth order polynomial gives good predictions but the model’s predictive power quickly degrades beyond the training data range.

Figure 2 shows the results of the same procedure for noisy data, σ=1, and a larger training set, Ntrain=100. As in the noiseless case, the tenth order model provides the best fit to the data (i.e., the lowest Ein). In contrast, the tenth order model now makes the worst out-of-sample predictions (i.e., the highest Eout). Remarkably, this is true even if the data were generated using a tenth order polynomial.

At small sample sizes, noise can create fluctuations in the data that look like genuine patterns. Simple models (like a linear function) cannot represent complicated patterns in the data, so they are forced to ignore the fluctuations and to focus on the larger trends. Complex models with many parameters, such as the tenth order polynomial in our example, can capture both the global trends and noise-generates patterns at the same time. In this case, the model can be tricked into thinking that the noise encodes real information. This problem is called “overfitting” and leads to a steep drop-off in predictive performance.

We can guard against overfitting in two ways: we can use less expressive models with fewer parameters, or we can collect more data so that the likelihood that the noise appears patterned decreases. Indeed, when we increase the size of the training data set by two orders of magnitude to Ntrain=104 (see Figure 3) the tenth order polynomial clearly gives both the best fits and the most predictive power over the entire training range x∈[0,1], and even slightly beyond to approximately x≈1.05. This is our first experience with what is known as the bias-variance tradeoff, c.f. Sec. III.2. When the amount of training data is limited as it is when Ntrain=100, one can often get better predictive performance by using a less expressive model (e.g., a lower order polynomial) rather than the more complex model (e.g., the tenth-order polynomial). The simpler model has more “bias” but is less dependent on the particular realization of the training dataset, i.e. less “variance”. Finally we note that even with ten thousand data points, the model’s performance quickly degrades beyond the original training data range. This demonstrates the difficulty of predicting beyond the training data we mentioned earlier.

This simple example highlights why ML is so difficult and holds some universal lessons that we will encounter repeatedly in this review:

Fitting is not predicting. Fitting existing data well is fundamentally different from making predictions about new data.

Using a complex model can result in overfitting. Increasing a model’s complexity (i.e number of fitting parameters) will usually yield better results on the training data. However when the training data size is small and the data are noisy, this results in overfitting and can substantially degrade the predictive performance of the model.

For complex datasets and small training sets, simple models can be better at prediction than complex models due to the bias-variance tradeoff. It takes less data to train a simple model than a complex one. Therefore, even though the correct model is guaranteed to have better predictive performance for an infinite amount of training data (less bias), the training errors stemming from finite-size sampling (variance) can cause simpler models to outperform the more complex model when sampling is limited.

It is difficult to generalize beyond the situations encountered in the training data set.

Iii Basics of Statistical Learning Theory

In this section, we briefly summarize and discuss the sense in which learning is possible, with a focus on supervised learning. We begin with an unknown function y=f(x) and fix a hypothesis setH consisting of all functions we are willing to consider, defined also on the domain of f. This set may be uncountably infinite (e.g. if there are real-valued parameters to fit). The choice of which functions to include in H usually depends on our intuition about the problem of interest. The function f(x) produces a set of pairs (xi,yi), i=1…N, which serve as the observable data. Our goal is to select a function from the hypothesis set h∈H which approximates f(x) as best as possible, namely, we would like to find h∈H such that h≈f in some strict mathematical sense which we specify below. If this is possible, we say that we learnedf(x). But if the function f(x) can, in principle, take any value on unobserved inputs, how is it possible to learn in any meaningful sense?

The answer is that learning is possible in the restricted sense that the fitted model will probably perform approximately as well on new data as it did on the training data. Once an appropriate error function E is chosen for the problem under consideration (e.g. sum of squared errors in linear regression), we can define two distinct performance measures of interest. The in-sample error, Ein, quantifies performance of a particular hypothesis on the training data we used to fit the model. The out-of-sample or generalization error, Eout is the performance on new data. Recall that our goal is to make the out-of-sample error, Eout, as small as possible. However, when we are training our model, we only have access to the in-sample error, Ein, on the training data. Note that these two errors are generally distinct because finite-size effects stemming from sampling noise ensure that Ein will not be the same as Eout, and the parameters are tuned to minimize Ein. This is precisely the distinction between fitting and predicting introduced in Sec II.

This raises a natural question: Can we say something general about the relationship between Ein and Eout? Surprisingly, the answer is ‘Yes’. We can in fact say quite a bit. This is the domain of statistical learning theory, and we give a brief overview of the main results in this section. Our goal is to briefly introduce some of the major ideas from statistical learning theory because of the important role they have played in shaping how we think about machine learning. However, this is a highly technical and theoretical field, so we will just briefly skim over some introductory topics. An in-depth and more thorough introduction to statistical learning theory can be found in the introductory textbook by Abu Mustafa Abu-Mostafa et al. (2012).

iii.1 Three simple schematics that summarize the basic intuitions from Statistical Learning Theory

Figure 4: Schematic of typical in-sample and out-of-sample error as a function of training set size. The typical in-sample or training error, Ein, out-of-sample or generalization error, Eout, bias, variance, and difference of errors as a
function of the number of training data points. The schematic assumes that the number of data points is large (in particular, the schematic does not show the initial drop in Ein for small amounts of data), and that our model cannot exactly fit the true function f(x).

The basic intuitions of statistical learning can be summarized in three simple schematics. The first schematic, shown in Figure 4, shows the typical out-of-sample error, Eout, and in-sample error, Ein, as a function of the amount of training data. In making this graph, we have assumed that the true data is drawn from a sufficiently complicated distribution, so that we cannot exactly learn the function f(x). Hence, after a quick initial drop (not shown in figure), the in-sample error will increase with the number of data points, since our models are not powerful enough to learn the true function we are seeking to approximate. In contrast, the out-of-sample error will decrease with the number of data points in this high data regime. As the number of data points gets large, the sampling noise decreases and the training data set becomes a better and better representative of the true distribution from which the data is drawn. For this reason, in the infinite data limit, the in-sample and out-of-sample errors must approach the same value, which is called the “bias” of our model.

The bias represents the best our model could do if we had an infinite amount of training data to beat down sampling noise. The bias is a property of the kind of functions, or model class, we are using to approximate f(x). In general, the more complex the model class we use, the smaller the bias. However, we do not generally have an infinite amount of data. For this reason, to get best predictive power it is better to minimize the out-of-sample error, Eout rather than the bias. As shown graphically in Figure 4, Eout can be naturally decomposed into a bias, which measures how well we can hypothetically do in the infinite data limit, and a variance which measures the typical errors introduced in training our model due to sampling noise from having a finite training set.

The final quantity shown in Figure 4 is the difference between the generalization and training error. It measures how well our in-sample error reflects the out-of-sample error, and measures how much worse we would do on a new data set compared to our training data. For this reason, the difference between these errors is precisely the quantity that measures the difference between fitting and predicting. Models with a large difference between the in-sample and out-of-sample errors are said to “overfit” the data. One of the lessons of statistical learning theory is that it is not enough to simply minimize the training error, since the out-of-sample error can still be large. As we will see in our discussion of regression in Sec. VI, this insight naturally leads to the idea of “regularization”.

Figure 5: Bias-Variance tradeoff and model complexity. This schematic shows the typical out-of-sample error Eout as function of the model complexity for a training dataset of fixed size. Notice how the bias always decreases with
model complexity, but the variance, i.e. fluctuation in performance due to finite size sampling effects, increases with model complexity. Thus, optimal performance is achieved at intermediate levels of model complexity.

The second schematic, shown in Figure 5, shows the out-of-sample, or test, error Eout as function of “model complexity”. Model complexity is a very subtle idea and defining it precisely is one of the great achievements of statistical learning theory. However, roughly speaking, model complexity is a measure of the complexity of the model class we are using to approximate the true function f(x). For example, a model with more free parameters is generally more complex than one with fewer fitting parameters111Note that models with more parameters are not always more complex. One neat example in the context of one-dimensional regression in given in Friedman et al. (2001), Figure 7.5.. In the example of polynomial regression discussed above, higher-order polynomials are more complex than the linear model. If we consider a training dataset of a fixed size, Eout will be a non-monotonic function of the model complexity, and is generally minimized for models with intermediate complexity. The underlying reason for this is that, even though using a more complicated model always reduces the bias, at some point the model becomes too complex for the amount of training data and the generalization error becomes large due to high variance. Thus, to minimize Eout and maximize our predictive power, it may be more suitable to use a more biased model with small variance than a less-biased model with large variance. This important concept is commonly called the bias-variance tradeoff and gets at the heart of why machine learning is difficult.

Figure 6: Bias-Variance tradeoff. Another useful depiction of the bias-variance tradeoff is to think about how Eout varies as we consider different training data sets of a fixed size. A more complex model (green) will exhibit larger fluctuations (variance) due to finite size sampling effects than the simpler model (black). However, the average over all the trained models (bias) is closer to the true model for the more complex model.

Another way to visualize the bias-variance tradeoff is shown in Figure 6. In this figure, we imagine training a complex model (shown in green) and a simpler model (shown in black) many, many times on different training sets of a fixed size N. Due to the sampling noise from having finite size data sets, the learned models will differ for each choice of training sets. In general, more complex models need a larger amount of training data. For this reason, the fluctuations in the learned models (variance) will be much larger for the more complex model than the simpler model. However, if we consider the asymptotic performance as we increase the size of the training set (the bias), it is clear that the complex model will eventually perform better than the simpler model. Thus, depending on the amount of training data, it may be more favorable to use a less complex, high-bias model to make predictions.

iii.2 Bias-Variance Decomposition

In this section, we dig further into the central principle that underlies much of machine learning: the bias-variance tradeoff. Roughly speaking, this says that there is a tradeoff between how expressive our model class is and how sensitive the fitted model is to sample fluctuations in the training data. That is, the more (less) expressive the model, the larger (smaller) the fluctuations. Less expressive models exhibit bias in what they are able to fit, and thus there is a tradeoff between the bias and the variance of the fitted model. Oftentimes in physics, we are mostly concerned with expressivity, e.g. whether the true ground state wavefunction can be well-approximated by a class of variational wavefunctions such as a matrix product state. In the learning context, there is the additional challenge of finding the best variational state with finite sampling. We will see that while this concept is a generally useful heuristic to keep in mind, it is a mathematically precise statement when decomposing the squared error. Finally, we note that a better term would be the bias-variance decomposition, as it is possible to have high bias and high variance.

We will discuss the bias-variance tradeoff in the context of continuous predictions such as regression. However, many of the intuitions and ideas discussed here also carry over to classification tasks. Consider a dataset L consisting of the data XL={(yj,xj),j=1…N}. Let us assume that the true data is generated from a noisy model

y=f(x)+ϵ

(2)

where ϵ is normally distributed with mean zero and standard deviation σϵ.

Assume that we have a statistical procedure (e.g. least-squares regression) for forming a predictor ^gL(x) that gives the prediction of our model for a new data point x. This estimator is chosen by minimizing a cost function which we take to be the squared error

C(X,^g(x))=∑i(yi−^gL(xi))2.

(3)

We are interested in the generalization error on all data drawn from the true model, not just the error on the particular training dataset L that we have in hand. This is just the expectation of the cost function over many different data sets {Lj}. Denote this expectation value by EL. In other words, we can view ^gL as a stochastic functional that depends on the dataset L and we can think of EL as the expected value of the functional if we drew an infinite number of datasets {L1,L2,…}.

We would also like to average over different instances of the “noise” ϵ and we denote the expectation value over the noise by Eϵ. Thus, we can decompose the expected generalization error as

where in the last line we used the fact that our noise has zero mean and variance σ2ϵ and the sum over i applies to all terms. It is also helpful to further decompose the second term as follows:

EL[(f(xi)−^gL(xi))2]

=

EL[(f(xi)−EL[^gL(xi)]+EL[^gL(xi)]−^gL(xi))2]

(5)

=

+

2EL[(f(xi)−EL[^gL(xi)])(^gL(xi)−EL[^gL(xi)])]

=

(f(xi)−EL[^gL(xi)])2+EL[(^gL(xi)−EL[^gL(xi)])2].

The first term is called the bias

Bias2=∑i(f(xi)−EL[^gL(xi)])2

(6)

and measures the deviation of the expectation value of our estimator (i.e. the asymptotic value of our estimator in the infinite data limit) from the true value. The second term is called the variance

Var=∑iEL[(^gL(xi)−EL[^gL(xi)])2],

(7)

and measures how much our estimator fluctuates due to finite-sample effects. Combining these expressions, we see that the expected out-of-sample error of our model can be decomposed as

Eout=EL,ϵ[C(X,^g(x))]=Bias2+Var+Noise.

(8)

The bias-variance tradeoff summarizes the fundamental tension in machine learning, particularly supervised learning, between the complexity of a model and the amount of training
data needed to train it. Since data is often limited, in practice it is often useful to use a less-complex model with higher bias – a model whose asymptotic performance is worse than another model – because it is easier to train and less sensitive to sampling noise arising from having a finite-sized training dataset (smaller variance). This is the basic intuition behind the schematics in Figs. 4, 5, and 6.

Iv Gradient Descent and its Generalizations

Almost every problem in ML and data science starts with the same ingredients: a dataset X, a model g(θ), which is a function of the parameters θ and a cost function C(X,g(θ)) that allows us to judge how well the model g(θ) explains the observations X. The model is fit by finding the values of θ that minimize the cost function. In this section, we discuss one of the most powerful and widely used classes of methods for performing this minimization – gradient descent and its generalizations. The basic idea behind these methods is straightforward: iteratively adjust the parameters in the direction where the gradient of the cost function is large and negative. In this way, the training procedure ensures the parameters flow towards a local minimum of the cost function. However, in practice gradient descent is full of surprises and a series of ingenious tricks have been developed by the optimization and machine learning communities to improve the performance of these algorithms.

The underlying reason why training a machine learning algorithm is difficult is that the cost functions we wish to optimize are usually complicated, rugged, non-convex functions in a high-dimensional space with many local minima. To make things even more difficult, we almost never have access to the true function we wish to minimize but, instead must estimate this function directly from data. In modern applications, both the size of the dataset and the number of parameters we wish to fit is often enormous (millions of parameters and examples). The goal of this chapter is to explain how gradient descent methods can be used to train machine learning algorithms even in these difficult settings.

This chapter seeks to both introduce commonly used methods and give intuition for why they work. We also include some practical tips for improving the performance of stochastic gradient descent LeCun et al. (1998b); Bottou (2012). To help the reader gain more intuition about gradient descent and its variants, we have developed a Jupyter notebook that allows the reader to visualize how these algorithms perform on two dimensional surfaces. The reader is encouraged to experiment with the accompanying notebook whenever a new method is introduced (especially to explore how changing hyper-parameters can effect performance). The reader may also wish to consult useful reviews that cover these topics Ruder (2016) and this blog http://ruder.io/optimizing-gradient-descent/.

iv.1 Gradient Descent and Newton’s method

We begin by introducing a simple first-order gradient descent method and comparing and contrasting it with another algorithm, Newton’s method. Newton’s method is intimately related to many algorithms (conjugate gradient, quasi-Newton methods) commonly used in physics for optimization problems. Denote the function we wish to minimize by E(θ). In the context of machine learning, E(θ) is just the cost function E(θ)=C(X,g(θ)). As we shall see for linear and logistic regression in Secs. VI, VII, this energy function can almost always be written as a sum over n data points,

E(θ)=n∑i=1ei(xi,θ).

(9)

For example, for linear regression ei is just the mean square-error for data point i whereas, for logistic regression, it is the cross-entropy. To make analogy with physical systems, we will often refer to this function as the “energy”.

In the simplest gradient descent (GD) algorithm, we iteratively update the parameters according to the following rule. Initialize the parameters to some value θ0 and iteratively update the parameters according to the equation

vt=ηt∇θE(θt),

θt+1=θt−vt

(10)

where ∇θE(θ) is the gradient of E(θ) w.r.t to θ and we have introduced
a learning rate, ηt, that controls how big a step we should take in the direction of the gradient at time t.
It is clear that for sufficiently small choice of the learning rate ηt this methods will converge to a local minimum of the cost function. However, choosing a small ηt comes at a huge computational cost. The smaller ηt, the more steps we have to take to reach the local minimum. In contrast, if ηt is too large, we can overshoot the minimum and the algorithm becomes unstable (it either oscillates or even moves away from the minimum). This is shown in Figure 7. In practice, one usually specifies a “schedule” that decreases ηt at long times. Common schedules include power law and exponential decays in time.

Figure 7: Gradient descent exhibits three qualitatively different regimes as a function of the learning rate. Result of gradient descent on surface z=x2+10y2−1 for learning rate of η=0.1,0.9,1.01.
Notice that the trajectory converges to the global minima in multiple steps for small learning rates (η=0.1). Increasing the learning rate further (η=0.9) causes the trajectory to oscillate around the global minima before converging. For even larger learning rates (η=1.01) the trajectory diverges from the minima. See corresponding notebook for details.

To better understand this behavior and highlight some of the shortcomings of GD, it is useful to contrast GD with Newton’s method which is the inspiration for many widely employed optimization methods. In Newton’s method, we choose the step v for the parameters in such a way as to minimize a second-order Taylor expansion to the energy function

E(θ+v)≈E(θ)+∇θE(θ)v+12vTH(θ)v,

where H(θ) is the Hessian matrix of second derivatives. Differentiating this equation respect to v and noting that for the optimal value vopt we expect ∇θE(θ+vopt)=0, yields the following equation

0=∇θE(θ)+H(θ)vopt.

(11)

Rearranging this expression results in the desired update rules for Newton’s method

vt=H−1(θt)∇θE(θt)

(12)

θt+1=θt−vt.

(13)

Since we have no guarantee that the Hessian is well conditioned, in almost all applications of Netwon’s method, one replaces the inverse of the Hessian H−1(θt) by some suitably regularized pseudo-inverse such as [H(θt)+ϵI]−1 with ϵ a small parameter Battiti (1992).

For the purposes of machine learning, Newton’s method is not practical for two interrelated reasons. First, calculating a Hessian is an extremely expensive numerical computation. Second, even if we employ first-order approximation methods to approximate the Hessian (commonly called quasi-Newton methods), we must store and invert a matrix with n2 entries, where n is the number of parameters. For models with millions of parameters such as those commonly employed in the neural network literature, this is close to impossible with present-day computational power.
Despite these practical shortcomings, Newton’s method gives many important intuitions about how to modify GD algorithms to improve their performance. Notice that, unlike in GD where the learning rate is the same for all parameters, Newton’s method automatically “adapts” the learning rate of different parameters depending on the Hessian matrix.
Since the Hessian encodes the curvature of the surface we are trying to find the minimum of – more specifically, the singular values of the Hessian are inversely proportional to the squares of the local curvatures of the surface – Newton’s method automatically adjusts the step size so that one takes larger steps in flat directions with small curvature and smaller steps in steep directions with large curvature.

Our derivation of Newton’s method also allows us to develop intuition about the role of the learning rate in GD. Let’s first consider the special case of using GD to find the minimum of a quadratic energy function of a single parameter θLeCun et al. (1998b). Given the current value of our parameter θ, we can ask what is the optimal choice of the learning rate ηopt, where ηopt is defined as the value of η that allows us to reach the minimum of the quadratic energy function in a single step (see Figure 8). To find ηopt, we expand the energy function to second order around the current value

One can show that there are four qualitatively different regimes possible (see Fig. 8) LeCun et al. (1998b). If η<ηopt, then GD will take multiple small steps to reach the bottom of the potential. For η=ηopt, GD reaches the bottom of the potential in a single step. If ηopt<η<2ηopt, then the GD algorithm will oscillate across both sides of the potential before eventually converging to the minima. However, when η>2ηopt, the algorithm actually diverges!

Figure 8: Effect of learning rate on convergence. For a one dimensional quadratic potential, one can show that there exists four different qualitative behaviors for gradient descent (GD) as a function of the learning rate η depending on the relationship between η and ηopt=[∂2θE(θ)]−1. (a) For η<ηopt, GD converges to the minimum. (b) For η=ηopt, GD converges in a single step. (c) For ηopt<η<2ηopt, GD oscillates around the minima and eventually converges. (d) For η>2ηopt, GD moves away from the minima. This figure is adapted from LeCun et al. (1998b).

It is straightforward to generalize this to the multidimensional case. The natural multidimensional generalization of the second derivative is the Hessian H(θ). We can always perform a singular value decomposition (i.e. a rotation by an orthogonal matrix for quadratic minima where the Hessian is symmetric, see Sec. VI.2 for a brief introduction to SVD) and consider the singular values {λ} of the Hessian. If we use a single learning rate for all parameters, in analogy with (16), convergence requires that

η<2λmax,

(17)

where λmax is the largest singular value of the Hessian. If the minimum eigenvalue λmin differs significantly from the largest value λmax, then convergence in the λmin-direction will be extremely slow! One can actually show that the convergence time scales with the condition number κ=λmax/λminLeCun et al. (1998b).

iv.2 Limitations of the simplest gradient descent algorithm

The last section hints at some of the major shortcomings of the simple GD algorithm described in (10). Before proceeding, we briefly summarize these limitations and discuss general strategies for modifying GD to overcome these deficiencies.

GD finds local minima of our function. Since the GD algorithm is deterministic, if it converges, it will converge to a local minimum of our energy function. Because in ML we are often dealing with extremely rugged landscapes with many local minima, this can lead to poor performance. A similar problem is encountered in physics. To overcome this, physicists often use methods like simulated annealing that introduce a fictitious “temperature” which is eventually taken to zero. The “temperature” term introduces stochasticity in the form of thermal fluctuations that allow the algorithm to thermally tunnel over energy barriers. This suggests that, in the context of ML, we should modify GD to include stochasticity.

GD is sensitive to initial conditions. One consequence of the local nature of GD is that initial conditions matter. Depending on where one starts, one will end up at a different local minima. Therefore, it is very important to think about how one initializes the training process. This is true for GD as well as more complicated variants of GD introduced below.

Gradients are computationally expensive to calculate for large datasets. In many cases in statistics and ML, the energy function is a sum of terms, with one term for each data point. For example, in linear regression, E∝∑ni=1(yi−wT⋅xi)2; for logistic regression, the square error is replaced by the cross entropy, see Secs. VI, VII. Thus, to calculate the gradient we have
to sum over alln data points. Doing this at every GD step becomes extremely computationally expensive. An ingenious solution to this, discussed below, is to calculate the gradients using small subsets of the data called “mini batches”. This has the added benefit of introducing stochasticity into our algorithm.

GD is very sensitive to choices of learning rates. As discussed above, GD is extremely sensitive to the choice of learning rates. If the learning rate is very small, the training process take an extremely long time. For larger learning rates, GD can diverge and give poor results. Furthermore, depending on what the local landscape looks like, we have to modify the learning rates to ensure convergence. Ideally, we would “adaptively” choose the learning rates to match the landscape.

GD treats all directions in parameter space uniformly. Another major drawback of GD is that unlike Newton’s method, the learning rate for GD is the same in all directions in parameter space. For this reason, the maximum learning rate is set by the behavior of the steepest direction and this can significantly slow down training. Ideally, we would like to take large steps in flat directions and small steps in steep directions. Since we are exploring rugged landscapes where curvatures change, this requires us to keep track of not only the gradient but second derivatives of the energy function (note as discussed above, the ideal scenario would be to calculate the Hessian but this proves to be too computationally expensive).

GD can take exponential time to escape saddle points, even with random initialization. As we mentioned, GD is extremely sensitive to initial condition since it determines the particular local minimum GD would eventually reach. However, even with a good initialization scheme, through the introduction of randomness to be introduced later, GD can still take exponential time to escape saddle points, which are prevalent in high-dimensional space, even for non-pathological objective functions Du et al. (2017). Indeed, there are modified GD methods developed recently to accelerate the escape. The details of these boosted method are beyond the scope of this review, and we refer avid readers to Jin et al. (2017) for details.

In the next few subsections, we will introduce variants of GD that address many of these shortcomings. These generalized gradient descent methods form the backbone of much of modern deep learning and neural networks, see Sec IX. For this reason, the reader is encouraged to really experiment with different methods in landscapes of varying complexity using the accompanying notebook.

iv.3 Stochasticity Gradient Descent (SGD) with mini-batches

One of the most widely-applied variants of the gradient descent algorithm is stochastic gradient descent (SGD)Bottou (2012); Williams and Hinton (1986). As the name suggests, unlike ordinary GD, the algorithm is stochastic. Stochasticity is incorporated by approximating the gradient on a subset of the data called a minibatch222Traditionally, SGD was reserved for the case where you train on a single example – in other words minibatches of size 1. However, we will use SGD to mean any approximation to the gradient on a subset of the data.. The size of the minibatches is almost always much smaller than the total number of data points n, with typical minibatch sizes ranging from ten to a few hundred data points. If there are n points, and the mini-batch size is M, there will be n/M minibatches. Let us denote these minibatches by Bk where k=1…n/M. Thus, in SGD, at each gradient descent step we approximate the gradient using a minibatch Bk,

∇θE(θ)=n∑i∇θei(xi,θ)⟶∑i∈Bk∇θei(xi,θ),

(18)

cycling over all M minibatches. A full iteration over all n data points – in other words using all M minibatches – is called an epoch. For notational convenience, we will denote the mini-batch approximation to the gradient by

∇θEMB(θ)=M∑i∈Bk∇θei(xi,θ).

(19)

With this notation, we can rewrite the SGD algorithm as

vt=ηt∇θEMB(θ),

θt+1=θt−vt.

(20)

Thus, in SGD, we replace the actual gradient over the full data at each gradient descent step by an approximation to the gradient computed using a minibatch. This has two important benefits. First, it introduces stochasticity and decreases the chance that our fitting algorithm gets stuck in isolated local minima. Second, it significantly speeds up the calculation as one does not have to use all n data points to calculate the gradient. Empirical and theoretical work suggests that SGD has additional benefits. Chief among these is that introducing
stochasticity is thought to act as a natural regularizer that prevents overfitting in deep, isolated minima Bishop (1995a); Keskar et al. (2016).

iv.4 Adding Momentum

In practice, SGD is almost always used with a “momentum” or inertia term that serves as a memory of the direction we are moving in parameter space. This is typically
implemented as follows

vt

=

γvt−1+ηt∇θE(θt)

θt+1

=

θt−vt,

(21)

where we have introduced a momentum parameter γ, with 0≤γ≤1, and for brevity we dropped the explicit notation to indicate the gradient is to be taken over a different mini-batch at each step. We call this algorithm gradient descent with momentum (GDM). From these equations, it is clear that vt is a running average of recently encountered gradients and (1−γ)−1 sets the characteristic time scale for the memory used in the averaging procedure. Consistent with this, when γ=0, this just reduces down to ordinary SGD as described in Eq. (20). An equivalent way of writing the updates is

Δθt+1=γΔθt−ηt∇θE(θt),

(22)

where we have defined Δθt=θt−θt−1. In what should be a familiar scenario to many physicists, momentum based methods were first introduced in old, largely forgotten (until recently) Soviet papers Polyak (1964); Nesterov (1983).

Before proceeding further, let us try to get more intuition from these equations. It is helpful to consider a simple physical analogy with a particle of mass m moving in a viscous medium with drag coefficient μ and potential
E(w)Qian (1999). If we denote the particle’s position by w, then its motion is described by

md2wdt2+μdwdt=−∇wE(w).

(23)

We can discretize this equation in the usual way to get

mwt+Δt−2wt+wt−Δt(Δt)2+μwt+Δt−wtΔt=−∇wE(w).

(24)

Rearranging this equation, we can rewrite this as

Δwt+Δt=−(Δt)2m+μΔt∇wE(w)+mm+μΔtΔwt.

(25)

Notice that this equation is identical to Eq. (22) if we identify the position of the particle, w, with the parameters θ. This allows
us to identify the momentum parameter and learning rate with the mass of the particle and the viscous drag as:

γ=mm+μΔt,η=(Δt)2m+μΔt.

(26)

Thus, as the name suggests, the momentum parameter is proportional to the mass of the particle and effectively provides inertia. Furthermore, in the large viscosity/small learning rate limit, our memory time scales as (1−γ)−1≈m/(μΔt).

Why is momentum useful? SGD momentum helps the gradient descent algorithm gain speed in directions with persistent but small gradients even in the presence of stochasticity, while suppressing oscillations in high-curvature directions. This becomes especially important in situations where the landscape is shallow and flat in some directions and narrow and steep in others. It has been argued that first-order methods (with appropriate initial conditions) can perform comparable to more expensive second order methods, especially in the context of complex deep learning models Sutskever et al. (2013). Empirical
studies suggest that the benefits of including momentum are especially pronounced in complex models in the initial “transient phase” of training, rather than during a subsequent fine-tuning of a coarse minimum. The reason for this is that, in this transient phase, correlations in the gradient persist across many gradient descent steps, accentuating the role of inertia and memory.

These beneficial properties of momentum can sometimes become even more pronounced by using a slight modification of the classical momentum algorithm called Nesterov Accelerated Gradient (NAG)
Sutskever et al. (2013); Nesterov (1983). In the NAG algorithm, rather than calculating the gradient at the current parameters, ∇θE(θt), one calculates the gradient at the expected value of the parameters given our current momentum, ∇θE(θt+γvt−1). This yields the NAG update rule

vt

=

γvt−1+ηt∇θE(θt+γvt−1)

θt+1

=

θt−vt.

(27)

One of the major advantages of NAG is that it allows for the use of a larger learning rate than GDM for the same choice of γ.

iv.5 Methods that use the second moment of the gradient

In stochastic gradient descent, with and without momentum, we still have to specify a “schedule” for tuning the learning rates ηt as a function of time. As discussed in the context of Newton’s method, this presents a number of dilemmas. The learning rate is limited by the steepest direction which can change depending on the current position in the landscape. To circumvent this problem, ideally our algorithm would keep track of curvature and take large steps in shallow, flat directions and small steps in steep, narrow directions. Second-order methods accomplish this by calculating or approximating the Hessian and normalizing the learning rate by the curvature. However, this is very computationally expensive for extremely large models. Ideally, we would like to be able to adaptively change the step size to match the landscape without paying the steep computational price of calculating or approximating Hessians.

Recently, a number of methods have been introduced that accomplish this by tracking not only the gradient, but also the second moment of the gradient. These methods include AdaGrad Duchi et al. (2011), AdaDelta Zeiler (2012), RMS-Prop Tieleman and Hinton (2012), and ADAM Kingma and Ba (2014). Here, we discuss the last two as representatives of this class of algorithms.

In RMS prop, in addition to keeping a running average of the first moment of the gradient, we also keep track of the second moment denoted by st=E[g2t]. The update rule for RMS prop is given by

gt

=

∇θE(θ)

(28)

st

=

βst−1+(1−β)g2t

θt+1

=

θt−ηtgt√st+ϵ,

where β controls the averaging time of the second moment and is typically taken to be about β=0.9, ηt is a learning rate typically chosen to be 10−3, and ϵ∼10−8 is a small regularization constant to prevent divergences. Multiplication and division by vectors is understood as an element-wise operation. It is clear from this formula that the learning rate is reduced in directions where the norm of the gradient is consistently large. This greatly speeds up the convergence by allowing us to use a larger learning rate for flat directions.

A related algorithm is the ADAM optimizer. In ADAM, we keep a running average of both the first and second moment of the gradient and use this information to adaptively change the learning rate for different parameters. In addition to keeping a running average of the first and second moments of the gradient (i.e. mt=E[gt] and st=E[g2t], respectively), ADAM performs an additional bias correction to account for the fact that we are estimating the first two moments of the gradient using a running average (denoted by the hats in the update rule below). The update rule for ADAM is given by (where multiplication and division are once again understood to be element-wise operations below)

gt

=

∇θE(θ)

(29)

mt

=

β1mt−1+(1−β1)gt

st

=

β2st−1+(1−β2)g2t

^mt

=

mt1−βt1

^st

=

st1−βt2

θt+1

=

θt−ηt^mt√^st+ϵ,

where β1 and β2 set the memory lifetime of the first and second moment and are typically taken to be 0.9 and 0.99 respectively, and η and ϵ are identical to RMSprop.

Like in RMSprop, the effective step size of a parameter depends on the magnitude of its gradient squared. To understand this better, let us rewrite this expression in terms of the variance σ2t=^st−(^mt)2. Consider a single parameter θt. The update rule for this parameter is given by

Δθt+1=−ηt^mt√σ2t+m2t+ϵ.

(31)

We now examine different limiting cases of this expression. Assume that our gradient estimates are consistent so that the variance is small. In this case our update rule tends to Δθt+1→−ηt (here we have assumed that ^mt≫ϵ). This is equivalent to cutting off large persistent gradients at 1 and limiting the maximum step size in steep directions. On the other hand, imagine that the gradient is widely fluctuating between gradient descent steps. In this case σ2≫^m2t so that our update becomes Δθt+1→−ηt^mt/σt. In other words, we adapt our learning rate so that it is proportional to the signal-to-noise ratio (i.e. the mean in units of the standard deviation). From a physical point of view, this is extremely desirable. The standard deviation serves as a natural adaptive scale for deciding whether a gradient is large or small. Thus, ADAM has the beneficial effects of adapting our step size so that we cut off large gradient directions (and hence prevent oscillations and divergences) and measuring gradients in terms of a natural length scale, the standard deviation σt. The discussion above also explains empirical observations showing that the performance of both ADAM and RMSprop is drastically reduced if the square root is omitted in the update rule. It’s also worth noting that recent studies have shown adaptive methods like RMSProp, ADAM, and AdaGrad tend to generalize worse than SGD in classification tasks, though they achieve smaller training error. Such discussion is beyond the scope of this review so we refer readers to Wilson et al. (2017) for more details.

iv.6 Comparison of various methods

To better understand these methods, it is helpful to visualize the performance of the five methods discussed above – gradient descent (GD), gradient descent with momentum (GDM), NAG, ADAM, and RMSprop. To do so, we will use Beale’s function:

f(x,y)

=

(1.5−x+xy)2

+(2.25−x+xy2)2+(2.625−x+xy3)2.

This function has a global minimum at (x,y)=(3,0.5) and an interesting structure that can be seen in Fig. 9. The figure shows the results of using all five methods for Nsteps=104 steps for three different initial conditions. In the figure, the learning rate for GD, GDM, and NAG are set to η=10−6 whereas RMSprop and ADAM have a learning rate of η=10−3. The learning rates for RMSprop and ADAM can be set significantly higher than the other methods due to their adaptive step sizes. For this reason, ADAM and RMSprop tend to be much quicker at navigating the landscape than simple momentum based methods (see Fig. 9). Notice that in some cases (e.g. initial condition of (−1,4)), the trajectories do not find the global minimum but instead follow the deep, narrow ravine that occurs along y=1. This kind of landscape structure is generic in high-dimensional spaces where saddle points proliferate. Once again, the adaptive step size and momentum of ADAM and RMSprop allows these methods to traverse the landscape faster than the simpler first-order methods. The reader is encouraged to consult the corresponding Jupyter notebook and experiment with changing initial conditions, the loss surface being minimized, and hyper-parameters to gain more intuition about all these methods.

Figure 9: Comparison of GD and its generalization for Beale’s function. Trajectories from gradient descent (GD; black line), gradient descent with momentum (GDM; magenta line), NAG (cyan-dashed line), RMSprop (blue dash-dot line), and ADAM (red line) for Nsteps=104. The learning rate for GD, GDM, NAG is η=10−6 and η=10−3 for ADAM and RMSprop.
β=0.9 for RMSprop, β1=0.9 and β2=0.99 for ADAM, and ϵ=10−8 for both methods. Please see corresponding notebook for details.

iv.7 Gradient descent in practice: practical tips

We conclude this chapter by compiling some practical tips from experts for getting the best performance from gradient descent based algorithms, especially in the context of deep neural networks discussed later in the review, see Secs. XVI.2, IX. This section draws heavily on best practices laid out in LeCun et al. (1998b); Bottou (2012); Tieleman and Hinton (2012).

Randomize the data when making mini-batches. It is always important to randomly shuffle the data when forming mini-batches. Otherwise, the gradient descent method can fit spurious correlations resulting from the order in which data is presented.

Transform your inputs. As we discussed above, learning becomes difficult when our landscape has a mixture of steep and flat directions.
One simple trick for minimizing these situations is to standardize the data by subtracting the mean and normalizing the variance of input variables. Whenever possible, also decorrelate the inputs. To understand why this is helpful, consider the case of linear regression. It is easy to show that for the squared error cost function, the Hessian of the energy matrix is just the correlation matrix between the inputs. Thus, by standardizing the inputs, we are ensuring that the landscape looks homogeneous in all directions in parameter space. Since most deep networks can be viewed as linear transformations followed by a non-linearity at each layer, we expect this intuition to hold beyond the linear case.

Monitor the out-of-sample performance. Always monitor the performance of your model on a validation set (a small portion of the training data that is held out of the training process to serve as a proxy for the test set – see Sec. XI for more on validation sets). If the validation error starts increasing, then the model is beginning to overfit. Terminate the learning process. This early stopping significantly improves performance in many settings.

Adaptive optimization methods don’t always have good generalization. As we mentioned, recent studies have shown that adaptive methods such as ADAM, RMSPorp, and AdaGrad tend to have poor generalization compared to SGD or SGD with momentum, particularly in the high-dimensional limit (i.e. the number of parameters exceeds the number of data points) Wilson et al. (2017). Although it is not clear at this stage why these methods perform so well in training deep neural networks such as generative adversarial networks (GANs) Goodfellow et al. (2014), simpler procedures like properly-tuned SGD may work as well or better in these applications.

V Overview of Bayesian Inference

Statistical modeling usually revolves around the estimation or prediction of unknown quantities Jaynes (1996). Bayesian methods are based on the fairly simple premise that probability can be used as a mathematical tool for describing uncertainty. This is not that different in spirit from the main idea of statistical mechanics in physics, where we use probability to describe the behavior of large systems where we cannot know the positions and momenta of all of the particles even if the system itself is fully deterministic (at least classically). In practice, Bayesian inference provides a set of principles and procedures for learning from data and for describing uncertainty. In this section, we’ll give a gentle introduction to Bayesian inference, with special emphasis on its logic (i.e. Bayesian reasoning) and connection to ML discussed in Sec. II and III. For a technical account of Bayesian inference in general, we refer readers to Gelman et al. (2014); Barber (2012).

v.1 Bayes Rule

To solve a problem using Bayesian methods, we have to specify two functions: the likelihood function p(X|w), which describes the probability of observing a dataset X for a given value of the unknown parameters w, and the prior distribution p(w), which describes any knowledge we have about the parameters before we collect the data. Note that the likelihood should be considered as a function of the parameters w with the data X held fixed. The prior distribution and the likelihood function are used to compute the posterior distribution p(w|X) via Bayes’ rule:

p(w|X)=p(X|w)p(w)∫dwp(X|w)p(w).

(33)

The posterior distribution describes our knowledge about the unknown parameter w after observing the data X. In many cases, it will not be possible to analytically compute the normalizing constant in the denominator of the posterior distribution (i.e. the partition function p(X)=∫dwp(X|w)p(w)) and Markov Chain Monte Carlo (MCMC) methods are needed to draw random samples from p(w|X).

The likelihood function p(X|w) is a common feature of both classical statistics and Bayesian inference, and is determined by the model and the measurement noise. Many common statistical
procedures such as least-square fitting can be cast into the formalism of Maximum Likelihood Estimation (MLE). In MLE, one chooses the parameters ^w that maximize the likelihood (or equivalently the log-likelihood since log is a monotonic function) of the observed data:

^w=argmaxwlogp(X|w).

(34)

In other words, in MLE we choose the parameters that maximize the probability of seeing the observed data given our generative model. MLE is an important concept in both frequentist and Bayesian statistics.

The prior distribution, by contrast, is uniquely Bayesian. There are two general classes of priors: if we do not have any specialized knowledge about w before we look at the data then we would like to select an uninformative prior that reflects our ignorance, otherwise we should select an informative prior that accurately reflects the knowledge we have about w. This review will focus on informative priors that are commonly used for ML applications. However, there is a large literature on uninformative priors, including reparameterization invariant priors, that would be of interest to physicists and we refer the interested reader to Jeffreys (1946); Jaynes (1996); Berger and Bernardo (1992); Gelman et al. (2014); Mattingly et al. (2018).

Using an informative prior tends to decrease the variance of the posterior distribution while, potentially, increasing its bias. This is beneficial if the decrease in variance is larger than the increase in bias. In high-dimensional problems, it is reasonable to assume that many of the parameters will not be strongly relevant. Therefore, many of the parameters of the model will be zero or close to zero. We can express this belief using two commonly used priors: the Gaussian prior p(w|λ)=∏j√λ2πe−λw2j is used to express the assumption that many of the parameters will be small, and the Laplace prior p(w|λ)=∏jλ2e−λ|wj| is used to express the assumption that many of the parameters will be zero. We’ll come back to this point later in Sec. VI.6.

v.2 Bayesian Decisions

The above section presents the tools for computing the posterior distribution p(w|X), which uses probability as a framework for expressing our knowledge about the parameters w. In most cases, however, we need to summarize our knowledge and pick a single “best” value for the parameters. In principle, the specific value of the parameters should be chosen to maximize a utility function. In practice, however, we usually use one of two choices: the posterior mean ⟨w⟩=∫dwwp(w|X), or the posterior mode ^w=argmaxwp(w|X). Often, ⟨w⟩ is called the Bayes estimate and ^w is called the maximum-a-posteriori or MAP estimate. While the Bayes estimate minimizes the mean-squared error, the MAP estimate is often used instead because it is easier to compute.

v.3 Hyperparameters

The Gaussian and Laplace prior distributions used to express the assumption that many of the model parameters will be small or zero both have an extra parameter λ. This hyperparameter or nuisance variable has to be chosen somehow. One standard Bayesian approach is to define another prior distribution for λ – usually using an uninformative prior – and to average the posterior distribution over all choices of λ. This is called a hierarchical prior. Computing averages, however, often requires long Markov Chain Monte Carlo simulations that are computationally intensive. Therefore, it is simpler if we can find a good value of λ using an optimization procedure instead. We will discuss how this is done in practice when discussing linear regression in Sec. VI.

Vi Linear Regression

In Section II, we performed our first numerical ML experiments by fitting datasets generated by polynomials in the presence of different levels of additive noise. We used the l fitted parameters to make predictions on ‘unseen’ observations, allowing us to gauge the performance of our model on new data. These experiments highlighted the fundamental tension common to all ML models between how well we fit the training dataset and predictions on new data. The optimal choice of predictor depended on, among many other things, the functions used to fit the data and the underlying noise level. In Section III, we formalized this by introducing the notion of model complexity and the bias-variance decomposition, and discussed the statistical meaning of learning. In this section, we take a closer look at these ideas in the simple setting of linear regression.

As in Section II, fitting a given set of samples (yi,xi) means relating the independent variables xi to their responses yi. For example, suppose we want to see how the voltage across two sides of a metal slab V changes in response to the applied electric current I. Normally we would first make a bunch of measurements labeled by i and plot them on a two-dimensional scatterplot, (Vi,Ii). The next step is to assume, either from an oracle or from theoretical reasoning, some models that might explain the measurements and measuring their performance. Mathematically, this amounts to finding some function f such that Vi=f(Ii;w), where w is some parameter (e.g. the electrical resistance R of the metal slab in the case of Ohm’s law). We then try to minimize the errors made in explaining the given set of measurements based on our model f by tuning the parameter w. To do so, we need to first define the error function (formally called the loss function) that characterizes the deviation of our prediction from the actual response.

Before formulating the problem, let us set up the notation. Suppose we are given a dataset with n samples D={(yi,xi)}ni=1, where xi is the i-th observation vector while yi is its corresponding (scalar) response. We assume that every sample has p features xi∈Rp. Let f be the true function/model that generated these samples via yi=f(xi;wtrue)+ϵi, where wtrue∈Rp is a parameter vector and ϵi is some i.i.d. white noise with zero mean and finite variance. Conventionally, we cast all samples into a n×p matrix, X∈Rn×p, called the design matrix, with the rows xi,⋯,xn being observations and the columns X1,⋯,Xp being measured features. Bear in mind that this function f is never known to us explicitly, though in practice we usually presume its functional form. For example, in linear regression, we assume yi=f(xi;wtrue)+ϵi=xTiwtrue+ϵi for some unknown but fixed wtrue∈Rp.

We want to find a function g with parameters w fit to the data D that can best approximate f. When this is done, meaning we have found a ^w such that g(x;^w) yields our best estimate of f, we can use this g to make predictions about the response y0 for a new data point x0, as we did in Section II.

It will be helpful for our discussion of linear regression to define one last piece of notation. For any real number p≥1, we define the Lp norm of a vector x=(x1,⋯,xd)∈Rd to be

||x||p=(|x1|p+⋯+|xd|p)1p

(35)

vi.1 Least-square regression

Ordinary least squares linear regression (OLS) is defined as the minimization of the L2 norm of the difference between the response yi and the predictor g(xi;w)=xTiw:

minw∈Rp||Xw−y||22=minw∈Rpn∑i=1(xTiw−yi)2.

(36)

In other words, we are looking to find the w which minimizes the L2 error. Geometrically speaking, the predictor function g(xi;w)=xTiw defines a hyperplane in Rp. Minimizing the least squares error is therefore equivalent to minimizing the sum of all projections (i.e. residuals) for all points xi to this hyperplane (see Fig. 10). Formally, we denote the solution to this problem as ^wLS:

^wLS=argminw∈Rp||Xw−y||22,

(37)

which, after straightforward differentiation, leads to

^wLS=(XTX)−1XTy.

(38)

Note that we have assumed that XTX is invertible, which is often the case when n≫p. Formally speaking, if rank(X)=p, namely, the predictors X1,…,Xp (i.e. columns of X) are linearly independent, then ^wLS is unique. In the case of rank(X)<p, which happens when p>n, XTX is singular, implying there are infinitely many solutions to the least squares problem, Eq. (37). In this case, one can easily show that if w0 is a solution, w0+η is also a solution for any η which satisfies Xη=0 (i.e. η∈ null(X)). Having determined the least squares solution, we can calculate y, the best fit of our data X, as ^y=X^wLS=PXy, where PX=X(XTX)−1XT, c.f. Eq. (36). Geometrically, PX is the projection matrix which acts on y and projects it onto the column space of X, which is spanned by the predictors X1,⋯,Xp (see FIG. 11). Notice that we found the optimal solution ^wLS in one shot, without doing any sort of iterative optimization like that discussed in Section IV.

Figure 10: Geometric interpretation of least squares regression. The regression function g defines a hyperplane in Rp (green solid line, here we have p=2) while the residual of data point xi (hollow circles) is its projection onto this hyperplane (bar-ended dashed line).

In Section III we explained that the difference between learning and fitting lies in the prediction on “unseen" data. It is therefore necessary to examine the out-of-sample error. For a more refined argument on the role of out-of-sample errors in linear regression, we encourage the reader to do the exercises in the corresponding Jupyter notebooks. The upshot is, following our definition of ¯Ein and ¯Eout in Section III, the average in-sample and out-of-sample error can be shown to be

¯Ein

=

σ2(1−pn)

(39)

¯Eout

=

σ2(1+pn),

(40)

provided we obtain the least squares solution ^wLS from i.i.d. samples X and y generated through y=Xwtrue+ϵ333This requires that ϵ is a noise vector whose elements are i.i.d. of zero mean and variance σ2, and is independent of the samples X.. Therefore, we can calculate the average generalization error explicitly:

|¯Ein−¯Eout|=2σ2pn.

(41)

This imparts an important message: if we have p≫n (i.e. high-dimensional data), the generalization error is extremely large, meaning the model is not learning. Even when we have p≈n, we might still not learn well due to the intrinsic noise σ2. One way to ameliorate this is, as we shall see in the following few sections, to use regularization. We will mainly focus on two forms of regularization: the first one employs an L2 penalty and is called Ridge regression, while the second uses an L1 penalty and is called LASSO.

vi.2 Ridge-Regression

In this section, we study the effect of adding to the least squares loss function a regularizer defined as the L2 norm of the parameter vector we wish to optimize over. In other words, we want to solve the following penalized regression problem called Ridge regression:

^wRidge(λ)=argminw∈Rp(||Xw−y||22+λ||w||22).

(42)

This problem is equivalent to the following constrained optimization problem

^wRidge(t)=argminw∈Rp:||w||22≤t||Xw−y||22.

(43)

This means that for any t≥0 and solution ^wRidge in Eq. (43), there exists a value λ≥0 such that ^wRidge solves Eq. (42), and vice versa444Note that the equivalence between the penalized and the constrained (regularized) form of least square optimization does not always hold. It holds for Ridge and LASSO (introduced later), but not for best subset selection which is defined by choosing a L0 norm: λ||w||0. In this case, for every λ>0 and any ^wBS that solves the penalized form of best subset selection, there is a value t≥0 such that ^