February 7, 2016

Philosophy of GD vs. normal regression

Back in 2010, I graduated from the UW with a degree in Economics and a certificate in econometrics. Econometrics is a mix of statistics, linear algebra, calculus, and computer science used to fit systems of equations to real-world data. For example, if you wanted to model house prices, econometrics would let you take a set of prices and information about houses (size, number of bedrooms, permits, average wages in the area, and so on, collectively called features) and determine which of those features matter and how much. As students, we did a lot of linear regression, with two full courses focused on the theory and implementation of linear regressions. The core computational concept used for solving regressions in this field is a matrix transpose based approach.

More recently, for fun, I’ve been learning Machine Learning (ML), which uses a different computational concept for solving a regression system of equations. In ML, the most common method, called “Gradient Descent” (GD), is based on a fixed number of iterations with some magic constants and the derivatives of a cost function. ML also uses a matrix approach, based on the same method used in econometrics, which ML folks call the “Normal Equation”.

There are a lot of differences between the methods, but both can solve the core optimization problem of reducing error between a prediction equation and actual data values. First, the central regression equation is this:

y = A + B1*x1 + B2*x2 + … + Bn*xn + e

where y is a measured data value, A and B1…Bn are unknown, x1…xn are known feature values, and e is the error term.

So, here’s an example of the equation from a single data point for a house price with a single feature of house size in square feet.

$100,000 = A + B1 * (1,000 sf) + e.

You’re solving for A, B1, and sometimes e ( which can sometimes also have functions wrapping it ). An example solution might be:

$100,000 = $99,000 + $1/sf * 1,000sf + 0.

You’re solving via the “system of equations” method that you learned in high-school algebra, except that you’re solving in matrix form. You then use a computational matrix engine like EViews or R to solve the equation in a single step and find all the values, even e (the residuals, one per data point, left over after the fit).
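As a rough sketch of that single-step solve, here is the transpose-based calculation, (X'X)^-1 X'y, written out by hand in Python for the one-feature house example. The three data points are made up so that they fit the example solution above exactly (A = $99,000, B1 = $1/sf), and the function name solve_normal_equation is just for illustration. A real matrix engine does the same kind of thing for any number of features.

```python
def solve_normal_equation(sizes, prices):
    """Fit price = A + B1 * size by solving (X'X) [A, B1]' = X'y.

    X is the design matrix with a column of ones (for A) and a column
    of sizes (for B1). With one feature, X'X is only 2x2, so it can be
    solved directly with Cramer's rule.
    """
    m = len(sizes)
    # Entries of X'X
    sum_x = sum(sizes)
    sum_xx = sum(x * x for x in sizes)
    # Entries of X'y
    sum_y = sum(prices)
    sum_xy = sum(x * y for x, y in zip(sizes, prices))

    det = m * sum_xx - sum_x * sum_x
    if det == 0:
        raise ValueError("X'X is singular; sizes must not all be equal")
    a = (sum_y * sum_xx - sum_x * sum_xy) / det
    b1 = (m * sum_xy - sum_x * sum_y) / det
    return a, b1


# Made-up data chosen to match the example solution: price = 99,000 + 1 * sf
sizes = [1000, 1500, 2000]
prices = [100000, 100500, 101000]

a, b1 = solve_normal_equation(sizes, prices)
# e for each data point: actual price minus predicted price
residuals = [y - (a + b1 * x) for x, y in zip(sizes, prices)]
print(a, b1, residuals)
```

Because the data were invented to lie exactly on a line, the residuals come out as zero here. With real house prices they would not, and that leftover error is what the diagnostics work with.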

In econometrics, we have error diagnostic functions to determine if our solution was any good. We also have ways to use the error to solve some types of non-linearity in time series regressions, like auto-regression and moving averages. There’s a lot of philosophy behind this equation and when to use it and debug it. Both econometrics and ML are optimizing this equation when we talk about linear regressions.

In ML, some things change — the constant A ( alpha ) and the factors B1 … Bn ( betas ) are all called theta in ML. The ML equation is changed to this:

y = t0*x0 + t1*x1 + … + tn*xn, with no e term, and with x0 = 1.

Gradient Descent’s core method is this update rule (the part being subtracted comes from the partial derivative of the cost function with respect to Theta(j)).

Theta(j) := Theta(j) - alpha/m * Sum[(prediction - actual) * Xj]

where alpha is a magic number, usually called the learning rate, that the person solving the equation guesses will make the method converge (there are some rules of thumb for this number, but it’s just a guess, and it has nothing to do with the constant A that econometrics calls alpha), m is the number of data measurements, and Xj is the vector of data values for the feature matching that theta.
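Here is a minimal sketch of that update rule in plain Python, for one feature plus the x0 = 1 column. The data, the alpha of 0.1, and the 5000 iterations are all my own guesses for illustration, and the function name gradient_descent is made up. The data lie exactly on y = 1 + 1*x, so the thetas should both drift toward 1.

```python
def gradient_descent(xs, ys, alpha, iterations):
    """Fit y = theta0 * x0 + theta1 * x1 with x0 = 1, using the update
    Theta(j) := Theta(j) - alpha/m * Sum[(prediction - actual) * Xj].
    """
    m = len(xs)
    theta0, theta1 = 0.0, 0.0  # starting guess
    for _ in range(iterations):
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        # Compute both sums before changing either theta,
        # so the two updates happen simultaneously.
        grad0 = sum(errors) / m                             # Xj = x0 = 1
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m  # Xj = x1
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# Made-up data that fits y = 1 + 1 * x exactly
xs = [1.0, 2.0, 3.0]
ys = [2.0, 3.0, 4.0]

theta0, theta1 = gradient_descent(xs, ys, alpha=0.1, iterations=5000)
print(theta0, theta1)
```

The feature values are kept small on purpose. With raw square footage in the thousands, an alpha of 0.1 would make the guesses blow up instead of settling, which is part of why alpha feels like a magic number.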

You’re solving the equation by calling the error term a “cost value”, and creating a cost function to figure out the cost value, then minimizing this cost value using a system of updating guesses based on the derivative of the cost function.
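The cost function itself isn’t written out above. A common choice, and the one I’m assuming here because its derivative produces the update rule above, is half the mean squared error: J = 1/(2m) * Sum[(prediction - actual)^2]. The sketch below, with made-up data and an arbitrary starting guess, checks numerically that the Sum[(prediction - actual) * Xj] / m term really is the slope of that cost, and that one update step lowers the cost.

```python
def cost(theta, xs, ys):
    """Assumed cost: J = 1/(2m) * Sum[(prediction - actual)^2]."""
    m = len(xs)
    return sum((theta[0] + theta[1] * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient(theta, xs, ys):
    """Partial derivatives of J: 1/m * Sum[(prediction - actual) * Xj]."""
    m = len(xs)
    errors = [theta[0] + theta[1] * x - y for x, y in zip(xs, ys)]
    return [sum(errors) / m, sum(e * x for e, x in zip(errors, xs)) / m]


def numeric_gradient(theta, xs, ys, h=1e-6):
    """Central finite-difference estimate of the same partial derivatives."""
    estimate = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += h
        down[j] -= h
        estimate.append((cost(up, xs, ys) - cost(down, xs, ys)) / (2 * h))
    return estimate


# Made-up data and an arbitrary starting guess
xs = [1.0, 2.0, 3.0]
ys = [2.0, 3.0, 4.0]
theta = [0.5, 0.25]

analytic = gradient(theta, xs, ys)
numeric = numeric_gradient(theta, xs, ys)
print(analytic, numeric)

# One update step with alpha = 0.1 should lower the cost
alpha = 0.1
stepped = [t - alpha * g for t, g in zip(theta, analytic)]
print(cost(theta, xs, ys), cost(stepped, xs, ys))
```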

This difference in how you generate your matrix and how you treat error is the core philosophical difference between the two disciplines. Both are computationally optimizing an equation to fit data for unknown optimization constants, but econometrics and the other fields are concerned with “goodness of fit”, use error diagnostically, and have non-linear corrections available to them, whereas ML guesses its way to the solution based on alpha and spatial collapses of error to a single vector per feature. Both systems can work with non-linear functions. The format is linear in the unknowns and linear algebra is used for both systems, hence the name “Linear regression”.