Other sites

Regression via Gradient Descent in R

In a previous post I derived the least squares estimators using basic calculus, algebra, and arithmetic, and also showed how the same results can be achieved using the canned functions in SAS and R or via the matrix programming capabilities offered by those languages. I’ve also introduced the concept of gradient descent here and here.

Given recent course work in the online machine learning course via Stanford, in this post, I tie all of the concepts together. The R -code that follows borrows heavily from the Statalgo blog, which has done a tremendous job breaking down the concepts from the course and applying them in R. This is highly redundant to that post, but not quite as sophisticated (i.e. my code isn’t nearly as efficient), but it allows me to intuitively extend the previous work I’ve done.

If you are not going to use a canned routine from R or SAS to estimate a regression, then why not at least apply the matrix inversions and solve for the beta’s as I did before? One reason may be that with a large number of features (independent variables) the matrix inversion can be slow, and a computational algorithm like gradient descent may be more efficient.

From my earliest gradient descent post, I provided the intuition and notation necessary for performing gradient descent to estimate OLS coefficients. Following the notation from the machine learning course mentioned above, I review the notation again below.

In machine learning, the regression function y = b0 + b1x is referred to as a ‘hypothesis’ function:

The idea, just like in OLS is to minimize the sum of squared errors, represented as a ‘cost function’ in the machine learning context:

This is minimized by solving for the values of theta that set the derivative to zero. In some cases this can be done analytically with calculus and a little algebra, but this can also be done (especially when complex functions are involved) via gradient descent. Recall from before, the basic gradient descent algorithm involves a learning rate ‘alpha’ and an update function that utilizes the 1st derivitive or gradient f'(.). The basic algorithm is as follows:

For the case of OLS, and the cost function J(theta) depicted above, the gradient descent algorithm can be implemented in pseudo-code as follows:

It turns out that:

and the gradient descent algorithm becomes:

The following R -Code implements this algorithm and achieves the same results we got from the matrix programming or the canned R or SAS routines before. The concept of ‘feature scaling’ is introduced, which is a method of standardizing the independent variables or features. This improves the convergence speed of the algorithm. Additional code intuitively investigates the concept of convergence. Ideally, the algorithm should stop once changes in the cost function become small as we update the values in theta.

# typically we would iterate the algorithm above until the # change in the cost function (as a result of the updated b0 and b1 values)# was extremely small value 'c'. C would be referred to as the set 'convergence'# criteria. If C is not met after a given # of iterations, you can increase the# iterations or change the learning rate 'alpha' to speed up convergence