Blog Archives

Often times, particularly in a regression framework, we are given a set of inputs (independent variables) and a set outputs (dependent variables) , and we want to devise a model function

that predicts the outputs given some inputs as best as possible. But what does it mean for a model to predict “as best as possible” exactly? In order to make the notion of how good a model is explicit, it is common to adopt a loss function

,

The loss function is some function of the model’s prediction errors (a.k.a. residuals) at predicting outputs given the inputs (the loss function is also often referred to as the cost function, as it makes explicit the “cost” of incorrect prediction). “Good” models of a dataset will have small prediction errors, and therefore produce small loss function values. Determining the “best” model is equivalent to finding model function that minimizes the loss function. A common choice for this loss function is the sum of squared of the errors (SSE) loss. If there are input-output pairs, the SSE Loss function is formally:

This formula states that, for each output predicted by the model, we determine how far away the prediction is from the actual value (i.e. subtraction). Each of the individual distances are then squared and added to give a single number indicating how well (or badly) the model function captures the structure of the data across all the datapoints. The “best” model under this loss is called the least sum of squares (LSS) solution.

But why square the errors before summing them? At first, this seems somewhat unintuitive (or even ad hoc!). Surely there are other, more straight-forward loss functions we can devise. An initial notion of just adding the errors leads to a dead end because adding many positive and negative errors (i.e. resulting from data located below and above the model function) just cancels out; we want our measure of errors to be all positive (or all negative). Therefore, another idea would be to just take the absolute value of the errors before summing. Turns out, this is a known loss function, called the sum of absolute errors (SAE) or sum of absolute deviations (SAD) loss function. Finding the “best” SAE/SAD model is called the least absolute error LAE/LAD solution and such a solution was actually proposed decades before LSS. Though LAE is indeed used in contemporary methods (we’ll talk more about LAE later), the sum of squares loss function is far more popular in practice. Why does sum of squares always make the cut?

Useful Interpretations of the Sum of Squares Loss for Linear Regression

Areas of squares

Figure 1 demonstrates a set of 2D data (blue dots) and the LSS linear function (black line) of the form

,

where the parameters (the offet of the line from ) and (the slope) have been estimated (LSS) to “best” fit the data.

Figure 1 – Linear model fit to some data points

One helpful interpretation of SSE loss function is demonstrated in Figures 2. Each red square is a literal interpretation of the squared error for linear function fit in Figure 1. Here we see that no matter if the errors occur from predictions being greater than or less than the actual output values, the error term is always positive.

Figure 2 — Least squares loss function represented as areas

In this interpretation, the goal of finding the LSS solution is to find the line that results in the smallest red area. This interpretation is also useful for understanding the important regression metric known as the coefficient of determination , which is an indicator of how well a linear model function explains or predicts a dataset. Imagine that instead of the line fit in Figures 1-2, we instead fit a simpler model that has no slope parameter, and only a bias/offset parameter (Figure 3). In this case the simpler model only captures the mean value of the data along the y-dimension.

Figure 3 — Least squares loss for a linear function that only captures the mean of the dependent variable

The squared error in this model corresponds to the (unscaled) variance of the data. Lets denote the total area of the green squares in Figure 3 as the total sum of squares (TSS) error, and the area of the red squares in Figure 2 as the residual sum of squares (RSS) of the linear model fit. The metric is related to the red and green areas as follows

If the linear model is doing a good job of fitting the data, then the variance of the model errors/residuals (RSS) term will be small compared to the variance of the dataset (TSS), and the metric will be close to one. If the model is doing a poor job of fitting the data, then the variance residuals will approach that of the data itself, and the metric will be close to zero (Note too that the value of can also take negative values, in the case when the RSS is larger than the TSS, indicating a very poor model).

A bar suspended by springs

We can gain some important insight to the importance of the least squares loss by developing concepts within the framework of a physical system (Figure 4). In this formulation, a set of springs (red, dashed lines, our errors ) suspend a bar (solid black line, our linear function to a set of anchors (blue datapoints, our outputs ). Note that in this formulation, the springs are constrained to operate only along the vertical direction (y-dimension). This constraint is equivalent to saying that there is only error in our measurement of the dependent variables, and is often an assumption made in regression frameworks.

Figure 4 — Least squares interpreted in terms of a physical system of a bar suspended by springs

From Hooke’s Law, the force created by each spring on the bar is proportional to the distance (error) from the bar (linear function) to its corresponding anchor (datapoint) :

Further, there is a potential energy associated with each spring (datapoint). The total potential energy for the entire system is as follows:

(assuming a spring constant of ). This demonstrates that the equilibrium state of this system (i.e. the arrangement of the bar that minimizes the potential energy of the system) is analogous to the state that minimizes the sum of the squared error (distance) between the bar (line function) and the anchors (datapoints).

The physical interpretation given above can also be used to derive how linear regression solutions are related to the variances of the independent variables and the covariance between and . When the bar is in the equilibrium position (optimal solution), the net force exerted on the bar zero. Because , this first zero-net-force condition is formally described as:

The second condition that is fullfilled during equilibrium is that there are no torquing forces on the bar (i.e. the bar is not rotating about an axis). Because torque created about an axis is the force times distance away from the origin (average x-value; the origin), this second zero-net-torque condition is formally described by:

From the equation corresponding to the first zero-net-force condition, we can solve for the bias parameter of the linear function that describes the orientation of the bar:

Here the (pronounced “bar”) means the average value. Plugging this expression into the second second zero-net-torque condition equation, we discover that the slope of the line has an interesting interpretation related to the variances of the data:

The expressions for the parameters and tell us that, under the least squares linear regression framework, The average of the dependent variables is equal to a scaled version of the average of independent variables plus an offset:

Further, the scaling factor (the slope) is equal to the ratio of the covariance between the dependent and independent variables to the variance of the independent variable. Therefore if and are positively correlated, the slope will be positive, if they are negatively correlated, the slope will be negative.

Because of these relationships, the LSS solution has a number of useful properties:

The sum of the residuals under the LSS solution is zero(this is equivalent to the first zero-net-force condition above).

Because of 1., the average residual of the LSS solution is zero

The covariance between the independent variables and the residuals is zero.

The LLS solution always passes through the mean (center of mass) of the sample

The LSS solution minimizes the variance of the residuals/model errors.

Therefore, the least squares loss function directly relates model residuals to how the independent and dependent variables co-vary. These relationships are not available with other loss functions such as the least absolute deviation.

What’s interesting, is that the two physical constraint equations derived from the physical system above are also obtained through other analytic analyses of linear regression including defining the LSS problem using both maximum likelihood estimation and method of moments.

Wrapping up

There are many other reasons, albeit suggestions, as to why squared errors are often preferred to other rectifying functions of the errors (i.e. making all errors be positive or zero):

The Least Squares solution can be derived in closed form, allowing simple analytic implementations and fast computation of model parameters.

Unlike the LAE loss, the SSE loss is differentiable (i.e. is smooth) everywhere, which allows model parameters to be estimated using straight-forward, gradient-based optimizations

Squared errors have deep ties in statistics and maximum likelihood estimation methods (as mentioned above), particularly when the errors are distributed according to the Golden Boy of statistical distributions, the Normal distribution.

There are a number of geometric and linear algebra theorems that support using least squares. For instance the Gauss-Markov theorem states that if errors of a linear function are distributed Normally about the mean of the line, then the LSS solution gives the best unbiased estimator for the parameters .

Squared functions have a long history of facilitating calculus calculations used throughout the physical sciences.

The SSE loss does have a number of downfalls as well. For instance, because each error is squared, any outliers in the dataset can dominate the parameter estimation process. For this reason, the LSS loss is said to lack robustness. Therefore preprocessing of the the dataset (i.e. removing or thresholding outlier values) may be necessary when using the LSS loss.

MATLAB Code

The code to produce the figures in this post is below. The code can be directly copied to your clipboard using the toolbar at the top right of the code display.