Getting to the Bottom of Regression with Gradient Descent


Authors: Jocelyn T. Chi and Eric C. Chi

Date: 10 Jan 2014


This article on gradient descent was written as part of a series on optimization methods for Statisticsviews.com. The article presents an overview of the gradient descent algorithm, offers some intuition on why the algorithm works and where it comes from, and provides examples of implementing it for ordinary least squares and logistic regression in R.

Introduction

From splines to generalized linear models, many problems in statistical estimation and regression can be cast as optimization problems. To identify trends and associations in data, we often seek the “best” fitting curve from a family of curves to filter the noise from our signals. In this article, we will consider “best” to be in terms of maximizing a likelihood.

Consider, for example, ordinary least squares (OLS) regression. OLS regression amounts to finding the line, or more generally, the hyperplane, that best fits the data in the sense of minimizing the sum of squared residuals, namely the vertical distances between the observed responses and the fitted values. Equivalently, the vector of fitted values is the orthogonal projection of the response vector onto the column space of the design matrix. Formally, we seek a regression vector b that minimizes the objective, or loss, function \[ \begin{equation} \label{eq:min_problem} f({\bf b}) = \frac{1}{2} || {\bf y} - {\bf X}{\bf b} ||^{2}, \end{equation} \] where y is the vector of responses and X is the design matrix. Although we can obtain the solution to OLS regression directly by solving the normal equations, many other estimation and regression problems cannot be answered by simply solving a linear system of equations. Moreover, even solving a linear system can become non-trivial when there are many parameters to fit. We are therefore motivated to seek a simple approach to solving optimization problems that is general and works for a wide range of objective functions.

In this article, we review gradient descent, one of the simplest numerical optimization algorithms for minimizing differentiable functions. While more sophisticated algorithms may be faster, gradient descent is a reliable option when there is a need to fit data with a novel model, or when there are many parameters to be fit. Additionally, gradient descent presents a basis for many powerful extensions, including stochastic and proximal gradient descent. The former enables fitting regression models in very large data mining problems, and the latter has been successfully applied in matrix completion problems in collaborative filtering and signal processing. Understanding how gradient descent works lays the foundation for implementing and characterizing these more sophisticated variants.

In the rest of this article, we will outline the basic gradient descent algorithm and give an example of how it works. Then we will provide some intuition on why this algorithm works and discuss implementation considerations. We end with three examples: two OLS problems, and a logistic regression problem.

The Gradient Descent Algorithm.

The Basic Algorithm.

Let \( f({\bf b}) \) denote the function we wish to minimize. Starting from some initial point \( {\bf b}_0 \), the gradient descent algorithm repeatedly performs two steps to generate a sequence of parameter estimates \( {\bf b}_1, {\bf b}_2, \dots \). At the \( m^{\text{th}} \) iteration, we calculate \( \nabla f({\bf b}_m) \), the gradient of \( f({\bf b}) \) at \( {\bf b}_m \), and then take a step in the opposite direction. Combining these two steps, we arrive at the following update rule to go from the \( m^{\text{th}} \) iterate to the \( (m+1)^{\text{th}} \) iterate: \[ \begin{equation*} {\bf b}_{m+1} \gets {\bf b}_m - \alpha \nabla f({\bf b}_m), \end{equation*} \] where \( \alpha \) is a step-size that controls how far we step in the direction of \( -\nabla f({\bf b}_m) \). In principle, we repeatedly invoke the update rule until the iterate sequence converges, since a fixed point \( {\bf b}^\star \) of the update rule is a stationary point of \( f({\bf b}) \), namely \( \nabla f({\bf b}^\star) = {\bf 0} \). In practice, we stop short of convergence and run the algorithm until the Euclidean norm of \( \nabla f({\bf b}_m) \) is sufficiently close to zero. In our examples, we stop the algorithm once \( \lVert \nabla f({\bf b}_m) \rVert \leq 1 \times 10^{-6} \).
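As a sketch, the update rule translates into just a few lines of R. This toy function is our own illustration, not the package's gdescent function; the names gradient_descent, tol, and maxit are ours.

```r
# A minimal sketch of gradient descent for a generic differentiable f.
# grad_f, b0, and alpha are supplied by the user.
gradient_descent <- function(grad_f, b0, alpha, tol = 1e-6, maxit = 10000) {
  b <- b0
  for (m in seq_len(maxit)) {
    g <- grad_f(b)
    if (sqrt(sum(g^2)) <= tol) break  # stop when the gradient norm is small
    b <- b - alpha * g                # step against the gradient
  }
  b
}

# Example: minimize f(b) = (b - 3)^2, whose gradient is 2 * (b - 3).
gradient_descent(function(b) 2 * (b - 3), b0 = 0, alpha = 0.1)
# converges to approximately 3
```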

We notice that the gradient descent algorithm stops when the iterates are sufficiently close to a stationary point rather than the global minimizer. In general, stationarity is a necessary but not sufficient condition for a point to be a local minimizer of a function. Convex functions constitute an important exception, since all of their local minima are global minima. Thus, finding a global minimizer of a convex function is equivalent to finding a stationary point. For general functions, however, there are no guarantees that gradient descent will arrive at the globally best solution, a limitation shared by iterative optimization algorithms in general.

Later in this article, we will discuss implementation considerations for the gradient descent algorithm, including choosing a step-size \( \alpha \) and initializing \( {\bf b} \). First, however, let us work through the mechanics of applying gradient descent on a simple example.

A Simple Example: Univariate OLS

We made an R package titled “Getting to the Bottom - A Package for Learning Optimization Methods” to enable reproduction of the examples in this article. Use of this package requires R (>= 3.0.2). Once you have R working, you can install the package by typing install.packages("gettingtothebottom") into the R console and load it with library(gettingtothebottom). After loading the package and calling the help file for the gdescent function, the code for each example can be copied and pasted from the help panel into the console.

We run our simple example using an initial value of \( {\bf b}_0 = {\bf 0} \) and \( \alpha = 0.01 \). By default, the gdescent function prepends a column of \( 1 \)'s to \( {\bf X} \) to estimate an intercept. If you prefer to omit the intercept from your model, you can do so by including intercept = FALSE when you call the gdescent function.
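The setup code for this example lives in the package's help file; as a sketch, a call might look like the following. The data here are simulated, and the (X, y, b) argument convention for f and grad_f is our assumption about the package's interface.

```r
library(gettingtothebottom)

# Illustrative univariate data
set.seed(12345)
X <- matrix(rnorm(25), nrow = 25)
y <- 2 * X + rnorm(25)

# OLS loss and its gradient; gdescent adds an intercept column by default
f <- function(X, y, b) (1 / 2) * norm(y - X %*% b, "F")^2
grad_f <- function(X, y, b) t(X) %*% (X %*% b - y)

simple_ex <- gdescent(f, grad_f, X, y, alpha = 0.01)
```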

The gdescent output returns the minimum value for \( f \), an intercept, and values for the coefficients from \( {\bf b} \).

The progress of the algorithm can be observed by calling the plot_loss function.

plot_loss(simple_ex)

The plot shows how the loss function \( f \) decreases as the algorithm proceeds. The algorithm makes very good progress towards finding a local minimum early on. As it edges closer towards the minimum, however, progress becomes significantly slower.

As a sanity check, we can compare our gdescent results with what we obtain from the lm function for the same problem.
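One way to run that check, with simulated data standing in for the example's \( {\bf X} \) and \( {\bf y} \):

```r
# Illustrative data standing in for the example's X and y
set.seed(12345)
X <- matrix(rnorm(25), nrow = 25)
y <- 2 * X + rnorm(25)

# lm estimates an intercept by default, matching gdescent's default;
# its coefficients should agree with gdescent's to several decimal places
coef(lm(y ~ X))
```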

One way to understand why gradient descent works is to view each update as minimizing a quadratic approximation to \( f \). Given a current iterate \( \tilde{{\bf b}} \), consider the approximation \[ \begin{align*} f_{\alpha}({\bf b}; \tilde{{\bf b}}) = f(\tilde{{\bf b}}) + \nabla f(\tilde{{\bf b}})^{t}({\bf b} - \tilde{{\bf b}}) + \frac{1}{2\alpha}\lVert {\bf b} - \tilde{{\bf b}} \rVert^{2}, \end{align*} \] whose minimizer is exactly the gradient descent update \( \tilde{{\bf b}} - \alpha \nabla f(\tilde{{\bf b}}) \). When \( \alpha \) is very large, \( \frac{1}{2\alpha}\lVert {\bf b} - \tilde{{\bf b}}\rVert ^{2} \) becomes very small and the approximation of \( f \) at \( {\bf b} \) becomes flat. Conversely, when \( \alpha \) is very small, \( \frac{1}{2\alpha}\lVert {\bf b} - \tilde{{\bf b}}\rVert ^{2} \) becomes quite large and, correspondingly, the curvature of the approximation becomes more pronounced. The figure below illustrates how the choice of \( \alpha \) affects the quadratic approximation employed in the algorithm.

example.quadratic.approx(alpha1 = 0.01, alpha2 = 0.12)

In this figure, the black line depicts the OLS objective function \( f \) that we minimized in Example 1. The green dot denotes an initial starting point \( {\bf b} \) for the gradient descent algorithm. The red and blue curves show the quadratic approximations for \( f \) when \( \alpha=0.01 \) and \( \alpha=0.12 \), respectively. The dotted vertical lines intersect the curves at their minima. The lines also show how the minima provide different anchor points for the quadratic approximation of \( f \) in the next iteration of the algorithm.

Intuitively, choosing large values for the step-size \( \alpha \) results in greater progress towards a minimum in each iteration. As \( \alpha \) increases, however, the approximation becomes more linear and the minimizer of the approximation can drastically overshoot the minimizer of \( f \). We can ensure monotonically decreasing objective function values (i.e. \( f({\bf b}_m) \geq f({\bf b}_{m+1}) \geq f({\bf b}_{m+2}) \geq \ldots \)) if the approximations always “sit on top” of \( f \) (like the red approximation above). This prevents us from wildly overshooting the minimizer. Formally, we can guarantee monotonically decreasing objective function values when \( \nabla f({\bf b}) \) is \( L \)-Lipschitz continuous and \( \alpha \leq 1/L \). Recall that \( \nabla f({\bf b}) \) is \( L \)-Lipschitz continuous if \[ \begin{align*} \lVert \nabla f({\bf b}) - \nabla f(\tilde{{\bf b}}) \rVert \leq L \lVert {\bf b} - \tilde{{\bf b}} \rVert, \end{align*} \] for all \( {\bf b} \) and \( \tilde{{\bf b}} \). When \( f \) is twice differentiable, this means that the largest eigenvalue of \( \nabla^2 f({\bf b}) \), the Hessian of \( f \), is no greater than \( L \) for all \( {\bf b} \). Lipschitz continuous functions have a bound on how rapidly they can vary. So when the gradient of a function is Lipschitz continuous, we know that roughly speaking, the function has a maximum bound on its curvature. Thus, it is possible to find quadratic approximations that always sit on top of the function, as long as we employ a step-size equal to, or less than, the reciprocal of that bound.

In OLS, \( f({\bf b}) \) is twice differentiable and its Hessian is \( {\bf X}^t{\bf X} \), which does not depend on \( {\bf b} \). Therefore, the smallest Lipschitz constant of \( \nabla f \) is the largest eigenvalue of \( {\bf X}^t{\bf X} \). Naturally, we want to take the biggest steps possible, so if we can compute the Lipschitz constant \( L \) we set \( \alpha = 1/L \).
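When \( {\bf X} \) is available, this step-size is straightforward to compute. The matrix below is an illustrative stand-in for a design matrix.

```r
# For OLS, the Hessian is t(X) %*% X, so the smallest Lipschitz constant
# of the gradient is its largest eigenvalue.
set.seed(12345)
X <- matrix(rnorm(100 * 3), nrow = 100)  # illustrative design matrix

L <- max(eigen(crossprod(X), symmetric = TRUE, only.values = TRUE)$values)
alpha <- 1 / L   # the largest "safe" step-size
alpha
```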

Using the simple example we employed previously, we observe that when our choice of \( \alpha \) is not small enough, the norm of the gradient will diverge towards infinity and the algorithm will not converge. (Note that the code for this example is provided below but in the interest of space, we do not include the output in this article.)

simple_ex2 <- gdescent(f,grad_f,X,y,alpha=0.05,liveupdates=TRUE)

The live updates in this example show the norm of the gradient in each iteration and we can see that the norm of the gradient diverges when \( \alpha \) is not sufficiently small. The following two figures illustrate why this might occur. In the first figure, \( \alpha \) is sufficiently small so each iteration in the algorithm results in a step towards the minimum, resulting in convergence of the algorithm.

example.alpha(0.01)

## Minimum function value:
## 38.81
##
## Coefficient(s):
## 2.114

In the second figure, \( \alpha \) is too large and each subsequent iterate increasingly overshoots the minimum, resulting in divergence of the algorithm.

example.alpha(0.12)

## Minimum function value not attained. A better result might be obtained by decreasing the step size or increasing the number of iterations.

In the following section, we discuss some of the decisions required in implementing the gradient descent algorithm.

Some Implementation Considerations.

Choosing a Step-Size Manually

As the figures above indicate, the choice of the step-size \( \alpha \) is very important in gradient descent. When \( \alpha \) is too large, the objective function values diverge and the algorithm will not converge towards a local minimum. On the other hand, when \( \alpha \) is too small, each iteration takes only a tiny step towards the minimum and the algorithm may take a very long time to converge.

A safe choice for \( \alpha \) is the one derived above using the Lipschitz constant. Sometimes it is easy to determine a Lipschitz constant, and the step-size can be set accordingly. But how does one pick an appropriate \( \alpha \) when the Lipschitz constant is not readily available? In these cases, it is useful to experiment with step-sizes. If you know that the objective function has a Lipschitz continuous gradient but do not know the Lipschitz constant, you still know that there exists an \( \alpha \) that will lead to a monotonic decrease of the objective function. So you might start with \( \alpha = 0.01 \), and if the function values are diverging or oscillating, make your step-size smaller, say \( \alpha = 0.001 \), or \( 10^{-4} \), and so forth, until the function values decrease monotonically.

The automation of this manual procedure underlies a more principled approach to searching for an appropriate step-size called backtracking. In backtracking, at every iteration, we try taking a gradient step and check to see if the step results in a “sufficient decrease” in \( f \). If so, we keep the step. If not, we try again with a smaller step-size. In the interest of space, we defer further discussions on backtracking to another time.
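A minimal sketch of this idea, using the classical Armijo sufficient-decrease condition, follows; this helper is our own illustration and is not part of the gdescent function.

```r
# Backtracking (Armijo) line search: halve the trial step-size until the
# sufficient-decrease condition holds, then take the accepted step.
backtrack <- function(f, grad_f, b, alpha0 = 1, shrink = 0.5, c1 = 1e-4) {
  g <- grad_f(b)
  alpha <- alpha0
  while (f(b - alpha * g) > f(b) - c1 * alpha * sum(g^2)) {
    alpha <- shrink * alpha   # step too ambitious; try a smaller one
  }
  b - alpha * g               # return the accepted step
}

# One backtracking step on f(b) = (b - 3)^2, starting from b = 0
backtrack(function(b) (b - 3)^2, function(b) 2 * (b - 3), b = 0)
# accepts alpha = 0.5 and returns 3
```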

Finally, we note that in choosing \( \alpha \), it may be very helpful to plot the objective function at each iteration of the algorithm. Such a plot can also provide a useful check against mistakes in the implementation of your algorithm. In the gettingtothebottom package, a plot of the objective function values can be obtained using the plot_loss function.

Initializing the \( {\bf b} \) vector.

When the objective \( f \) is convex, all its local minima are also global minima, so the choice of the initial \( {\bf b} \) vector does not alter the result obtained from gradient descent. Choosing an initial \( {\bf b} \) that is closer to the minimum, however, will allow the algorithm to converge more quickly.

On the other hand, when optimizing a non-convex function, distinct local minima may exist, and the choice of the initial \( {\bf b} \) vector does matter since the algorithm may converge to a different local minimum depending on where it starts. Unfortunately, there is no better remedy than trying several different starting vectors and selecting the best minimizer if multiple minima are obtained.

Determining Convergence Measures and Tolerance Setting.

The algorithm in the gdescent function stops when \( \lVert \nabla f(x) \rVert \) is sufficiently close to zero. As previously discussed, this is because \( \nabla f(x) = 0 \) is a necessary condition for the solution(s) to \[ \begin{align*} \underset{x \in \mathbb{R}^{n}}{\text{minimize}} \quad f(x). \end{align*} \]

For appropriate objective functions and properly chosen step-sizes, gradient descent is guaranteed to converge to a stationary point of the objective. Convergence will likely only occur in the limit, however. Although it is unlikely that a problem will require a solution that is correct up to infinite precision, very accurate solutions typically require substantially more iterations than less accurate ones. This is because as the gradient approaches zero, \( \alpha \nabla f(x) \) also approaches zero, so each successive iteration makes less progress towards a minimum. Thus, it can be computationally expensive to drive \( \lVert \nabla f({\bf b}) \rVert \) to something very small, and when accuracy is at a premium, second order Newton or quasi-Newton methods may be more appropriate.

Additionally, as the gradient vanishes, \( \alpha \nabla f(x) \) also vanishes, so iterate values will not change very much near a minimum. Thus, one might also choose to stop the algorithm when the change between the \( m^{\text{th}} \) and the \( (m-1)^{\text{th}} \) iterates becomes sufficiently small. The figure below shows the iterates converging to the solution in our simple example from before.
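This alternative stopping rule can be sketched as follows, here on the toy objective \( f(b) = (b-3)^2 \) rather than on any of the article's examples:

```r
# Stop when successive iterates stop moving, for f(b) = (b - 3)^2
# with gradient 2 * (b - 3).
grad_f <- function(b) 2 * (b - 3)
b <- 0
alpha <- 0.1
repeat {
  b_new <- b - alpha * grad_f(b)
  if (abs(b_new - b) <= 1e-6) break  # iterates have essentially converged
  b <- b_new
}
b
# converges to approximately 3
```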

plot_iterates(simple_ex)

Scaling the Variables.

In practice, gradient descent works best with objective functions that have very “round” level sets, and tends to perform slowly on functions with more elliptical level sets. For twice differentiable objective functions, this notion of eccentricity of level sets is captured by the condition number of the Hessian, or the ratio of the largest to the smallest eigenvalues of the Hessian. In the context of regression problems, ill-conditioned Hessians may arise when covariates contain values on drastically different scales. A natural remedy is to put all the columns of the design matrix \( {\bf X} \) on approximately the same scale. This process is also referred to as feature scaling, and can be achieved by standardization or even more simply, by dividing each element in \( {\bf X} \) by the maximal element in its column. The gdescent function performs feature scaling by default but one can also opt out of feature scaling by setting autoscaling=FALSE when calling the function. We provide an example of using gradient descent both with and without feature scaling later in this article.
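By way of illustration, manual feature scaling can be done with R's built-in scale function; the matrix below is a made-up example with columns on very different scales.

```r
# Two covariates on drastically different scales (illustrative values);
# gdescent scales automatically unless autoscaling = FALSE is set.
X <- cbind(budget = c(2e7, 5e6, 1.5e8, 4e7),
           year   = c(1999, 2005, 2012, 2008))

X_scaled <- scale(X)  # center each column and divide by its std. deviation
apply(X_scaled, 2, sd)  # each column now has unit standard deviation
```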

A Summary of Simple Sanity Checks.

In view of the preceding considerations, we highlight a few possible indicators of necessary adjustment in implementing the gradient descent algorithm.

The function values are diverging.
– If the step-size is small enough and the objective function has a Lipschitz continuous gradient, the function values should always decrease with each iteration. If the function values are constantly increasing, the algorithm will not converge. This is an indicator that a smaller step-size should be selected. In the gdescent function, the plot_loss() function enables verification that the objective function values are decreasing over the course of the implementation.

The norm of the gradient is not approximately zero when the algorithm finishes running.
– We know that the gradient must vanish at a minimizer so if the norm of the gradient is not approximately zero, the algorithm has not minimized the function. In the gdescent function, the plot_gradient() function enables verification that the norm of the gradient is converging to zero.

The algorithm is taking a very long time to run when the number of covariates is not very large.
– This is not a guarantee that something is wrong with the implementation but it may be an indicator that the columns of the design matrix \( {\bf X} \) have not been scaled.

In this section, we discussed some of the key implementation considerations for the gradient descent algorithm. In the remaining section, we provide several more examples to illustrate the gradient descent algorithm in action.

Examples.

Least Squares Regression with Gradient Descent.

The next two examples demonstrate use of the gradient descent algorithm to solve least squares regression problems.

In this first example, we use the moviebudgets dataset included in the gettingtothebottom package to estimate movie ratings based on their budgets. The dataset contains ratings and budgets for 5,183 movies. A quick look at the data shows us that the values in the budget and rating variables are on drastically different scales.
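The code for this example is in the package's help file; a sketch of how the call might look follows, assuming the dataset loads with data(moviebudgets) and that the (X, y, b) argument convention we use for f and grad_f matches the package's.

```r
library(gettingtothebottom)
data(moviebudgets)  # assumed loading mechanism

# Predict rating from budget (column names taken from the text)
X <- matrix(moviebudgets$budget, ncol = 1)
y <- matrix(moviebudgets$rating, ncol = 1)

# OLS loss and gradient in the assumed (X, y, b) convention
f <- function(X, y, b) (1 / 2) * norm(y - X %*% b, "F")^2
grad_f <- function(X, y, b) t(X) %*% (X %*% b - y)

movie_ex <- gdescent(f, grad_f, X, y, alpha = 0.01)
```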

Below is an example of the same problem without feature scaling. We observe that without feature scaling, the gradient descent algorithm requires a much smaller step-size and many more iterations. (In the interest of space, we do not include that output in this article.)

In the next example, we apply gradient descent to a multivariate linear regression problem using data from the baltimoreyouth dataset included in the gettingtothebottom package. Here, we model the relationship between the percentage of students receiving free or reduced meals and the high school completion rate within each of the Community Statistical Areas (CSAs) in Baltimore. The model controls for the percentage of students suspended or expelled during the year, the percentage of youths aged 16-19 who are employed, and the percentage of students chronically absent in each CSA.
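The introduction also mentions a logistic regression example. A minimal sketch of how the logistic negative log-likelihood and its gradient could be supplied to gdescent follows; the data are simulated for illustration, and the (X, y, b) argument convention is our assumption about the package's interface.

```r
library(gettingtothebottom)

# Negative log-likelihood for logistic regression and its gradient:
# f(b) = sum( log(1 + exp(Xb)) - y * Xb ),  grad f(b) = t(X) (p - y)
f <- function(X, y, b) {
  eta <- X %*% b
  sum(log(1 + exp(eta)) - y * eta)
}
grad_f <- function(X, y, b) {
  p <- 1 / (1 + exp(-X %*% b))  # fitted probabilities
  t(X) %*% (p - y)
}

# Simulated data with a binary response
set.seed(12345)
X <- matrix(rnorm(100 * 2), nrow = 100)
y <- matrix(rbinom(100, 1, 1 / (1 + exp(-X %*% c(1, -1)))), ncol = 1)

logistic_ex <- gdescent(f, grad_f, X, y, alpha = 0.01)
```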

Conclusion

In this article, we saw that the gradient descent algorithm is an extremely simple algorithm. Much more can be said about its theoretical properties, but our aim in this article was to highlight the basic mechanics, intuition, and biggest practical issues in implementing gradient descent so that you might be able to use it in your work. Many fine texts delve further into details we did not have space to explore here; in particular, we point interested readers to Numerical Optimization by Nocedal and Wright and Numerical Analysis for Statisticians by Lange for a more thorough treatment.