Excel's LINEST() Function Deconstructed

Microsoft Excel has for many years included a worksheet function called LINEST(), which returns a multiple regression analysis of a single outcome or predicted variable on one or more predictor variables. LINEST() returns a regression equation, standard errors of regression coefficients, and goodness-of-fit statistics. In the first of three articles, Excel expert Conrad Carlberg, author of Predictive Analytics: Microsoft Excel, discusses issues regarding LINEST() that have not been covered sufficiently, or even accurately, in the past.

From the author of

From the author of

Microsoft Excel's LINEST() worksheet function has a long and checkered history. It is capable of returning a multiple regression analysis with up to 64 predictor variables and one outcome or "predicted" variable. (Early versions permitted up to 16 predictor variables.)

LINEST() performs quite well in most situations. It returns accurate regression coefficients and intercepts, the standard errors of the coefficients and of the intercept, and six summary statistics regarding the regression: R2, the standard error of estimate, the F ratio for the full regression, the degrees of freedom for the residual, and the sums of squares for the regression and for the residual.

But LINEST has some drawbacks, ranging from the inconvenient to the potentially disastrous.

An Inconvenient Problem

One difficulty is that the regression coefficients and their standard errors are shown in reverse order in which their associated underlying variables appear on the worksheet. See Figure 1.

Figure 1LINEST() returns coefficients in reverse order of the worksheet.

In Figure 1, the predictor variables are years of education and years of age. Education data is in column A, and Age data is in column B. The predicted variable, Income, is in column C.

The formula that uses the LINEST() function is array-entered (with Ctrl+Shift+Enter) in the range E5:G9. The formula in this example is:

=LINEST(C2:C21,A2:B21,TRUE,TRUE)

Note

LINEST()'s third argument, called const, is set to TRUE in the example just given. You can also omit the argument and Excel regards that as setting it to TRUE:

=LINEST(C2:C21,A2:B21,,TRUE)

Only by setting the third argument to FALSE can you force LINEST() to remove the constant from the regression equation.

The problem is that the regression coefficient for Age is in cell E5, and the coefficient for Education is in cell F5: in left-to-right order, the coefficient for Age comes before the coefficient for Education. But in the underlying data set, the Education data (column A) precedes the Age data (column B).

(The intercept, in cell G5 in Figure 1, always appears rightmost in the LINEST() results.)

So if you wanted to use the regression equation to estimate the income of the first person in Row 2, you would need to use this formula (parentheses included for clarity only):

=(E5*B2)+(F5*A2)+G5

instead of the more natural and more easily interpreted:

=(E5*A2)+(F5*B2)+G5

With just two variables, this is a really minor issue. But with 5, 10, perhaps 20 variables, it becomes exasperating. To complete the regression equation, you need to proceed left-to-right for the variables and right-to-left for the coefficients. With 20 of each, it's tedious and error-prone.

And there is absolutely no good reason for it—statistical, theoretical or programmatic. I recognize that one could use the TREND() function instead of assembling the regression formula, coefficient by coefficient and variable by variable, but there are often times when you need to see the result of modifying one variable or coefficient; the only way to do that is to call them out separately in the full equation.

Nevertheless, this is principally a matter of convenience. The issues that I'm going to discuss in subsequent papers are more serious, particularly if you're still using a version of Excel prior to 2003.

This paper continues with a discussion of how the results provided by LINEST() can be calculated, and how you can replicate those results using Excel's native worksheet functions. A little matrix algebra is needed and it will be necessary for you to be familiar with the concepts behind the worksheet functions MMULT(), MINVERSE(), and TRANSPOSE().

Once you've seen how to replicate the LINEST() results using straightforward matrix algebra, you'll be in a position to see how Microsoft got it badly wrong when it offered LINEST()’s third option, const. That option calculates regression statistics "without the constant," also known as "forcing the intercept through zero." While the associated problems have been fixed, anyone who is still using a version of Excel prior to 2003 is in trouble if that option is selected, whether in LINEST(), TREND(), or the Regression tool in the Data Analysis add-in.

Note

In fairness, I should note that Microsoft was in good company. In 1986, well before LINEST() came along, Lee Wilkinson wrote in the manual for Systat, in its discussion of the MGLH program, "The total sum of squares must be redefined for a regression model with zero intercept. It is no longer centered about the mean of the dependent variable. Other definitions of sums of squares can lead to strange results like negative squared multiple correlations." Alas, Microsoft's code developers were not expert in statistical theory, any more than were the other developers Wilkinson was referring to.

You will see in a subsequent paper how Microsoft has changed its algorithm to avoid returning a negative R2, and how it came about in the first place. This is necessary information for anyone needing to migrate a regression analysis from, say, Excel 2002 to Excel 2010, or to understand how Excel 2002's results can be so different from Excel 2010's. Even if you're using a version subsequent to Excel 2003, the problems still show up in the R2 values associated with chart trendlines.

Microsoft has also included in the code for LINEST() a method for dealing with severe multi-collinearity in the X matrix. (Multi-collinearity is just a hifalutin word for two or more predictor variables that are perfectly correlated, or virtually so.) Microsoft deserves kudos for recognizing that the problem existed. But the way that the solution is manifested in the results of LINEST() since Excel 2003 is potentially disastrous. With the information in this paper, you'll be in a position to avoid that particular LINEST()—well, call it a feature.

Assembling LINEST() Results from Other Functions

In this section, I'm going to show you how to assemble the different results you get from LINEST() using other worksheet functions. Some of these methods will be clear, even obvious. Others will seem unclear, and they aren't at all intuitively rich. But by taking things apart, I think you'll find it much easier to understand the way they work together.

Getting the Regression Coefficients

The first step is to lay out the data as shown in Figure 2.

Figure 2 Add a column that contains nothing but 1's to the range of predictor variables.

Figure 2 shows that a column containing 1's is included with the other predictor, or X, values. This column enables the matrix operations described below to calculate an intercept and its standard error. Although you don't see that column of 1's when you run LINEST() directly on your input data, Excel adds it (invisibly) on your behalf.

Note

If you add the column of 1's and then call LINEST() without the constant (setting LINEST()’s third argument to FALSE), Excel doesn't add the 1's for you, and you'll get the same regression coefficients and standard errors as you would if you omitted the 1's and used LINEST() with the third argument set to TRUE.

Getting the Sum of Squares and Cross Products (SSCP)

You'll need access to what's called the transpose of the data in B3:E22. You can do that explicitly on the worksheet using Excel's TRANSPOSE() function. In Figure 2, the range H2:AA5 contains this array formula:

=TRANSPOSE(B3:E22)

(Recall that you enter an array formula using Ctrl+Shift+Enter instead of simply Enter.)

With those two matrices set up, you can get what's called the sum of squares and cross-products matrix, often called the SSCPmatrix. Use this array formula:

=MMULT(H2:AA5,B3:E22)

Note

In the notation used by matrix algebra, it's conventional to show in boldface a symbol such as "X" that represents a matrix. Matrix transposition is denoted with an apostrophe, so X' means the transposition (or simply the transpose) of X. And the inverse of a matrix is indicated by the "-1" superscript. The inverse of the matrix Y is indicated by Y-1.

If you don't want to bother putting the transpose of the X matrix directly on the worksheet, you could use this array formula instead to get the SSCP matrix:

=MMULT(TRANSPOSE(B3:E22),B3:E22)

Excel's MMULT() function performs matrix multiplication. Here, the transpose of the X matrix (B3:E22) is post-multiplied by the X matrix.

Note

Unlike regular algebra, matrix multiplication is not commutative. If X and Y are both matrices, XY does not necessarily give the same result as YX.

Getting the Inverse of the SSCP Matrix

The next step is to get the inverse of the SSCP matrix. A matrix's inverse is analogous to an inverse in simple arithmetic. The inverse of the number 4 is 1/4: When you multiply a number by its inverse, you get 1.

Similarly, when you multiply a matrix by its inverse, you get a new matrix with 1's in its main diagonal and 0's everywhere else. Figure 3 shows the SSCP matrix in G3:J6, its inverse in G10:J13, and the result of the multiplication of the two matrices in L10:O13.

Calculating the Regression Coefficients and Intercept

I mentioned earlier that much of the derivation of the results that LINEST() returns is not intuitively rich. The inverse of the SSCP matrix is an example of that. There's much information buried in the matrix inverse, but no flash of intuition will tell you that it's hidden there, or even why it's there. For example, see Figure 4.

Figure 4 The SSCP matrix and its inverse, combined with the X and Y matrices, return the regression coefficients and the intercept.

In Figure 4, notice the range G18:J18. It contains this array formula:

=TRANSPOSE(MMULT(G10:J13,MMULT(TRANSPOSE(B3:E22),A3:A22)))

In words, the formula uses matrix multiplication via the MMULT() function to combine the transposed X matrix (B3:E32) with the Y matrix (A3:A32) with the inverse of the SSCP matrix (G10:J13). The result in G18:J18 is the intercept (G18) and the regression coefficients (H18:J18). The coefficients are in the same order that the underlying values appear on the worksheet—that is, columns C, D, and E contain the values for variables X1, X2, and X3, respectively, and cells H18, I18, and J18 contain the associated regression coefficients.

Cells G21:J21 contain the first row of the LINEST() results for the same underlying data set (except that the 1's in column B are omitted from the LINEST() arguments because LINEST() supplies them for you). Notice that the values for the intercept and the coefficients are identical to those in row 18. The only difference is that LINEST() has returned them out of order.

In sum, to get the intercept and regression coefficients using matrix algebra instead of using LINEST(), take the following general steps:

Get the SSCP matrix using X'X. Use MMULT() and TRANSPOSE() to postmultiply the transpose of the X matrix by the X matrix.

Use MINVERSE() to calculate the inverse of the SSCP matrix.

Use the array formula given above and repeated here to calculate the intercept and coefficients:

=TRANSPOSE(MMULT(G10:J13,MMULT(TRANSPOSE(B3:E22),A3:A22)))

Getting the Sum of Squares Regression and Residual

It probably seems a little perverse to go from the calculation of regression coefficients to sums of squares, skipping over standard errors, R2, F tests, and so on. But you need the sums of squares to calculate those other statistics.

Before getting to the matter of calculating the sums of squares, it's helpful to review the meaning of the sum of squares regression and the sum of squares residual.

A sum of squares, in most statistical contexts, is the sum of the squares of the differences (or deviations) between individual values and the mean of the values. So if our values are 2 and 4, the mean is 3. 2 – 3 is -1, and the squared deviation is +1. 4 – 3 is 1, and the squared deviation is +1. Therefore, the sum of squares is 1 + 1 or 2.

Note

The term "sum of squares" dates to the early part of the 20th century and is something of a misnomer. The term suggests that the task is to find the sum of the squared values, not the sum of the squared deviations from the mean. In this case, Excel's function names are more descriptive than the statistical jargon. Excel uses the function DEVSQ() to sum the squared deviations, and the function SUMSQ() to sum the squares of the raw values.

Our purpose in calculating those two sums of squares is to divide the total sum of squares into two parts:

The sum of squares regression is the sum of the squared deviations of the Y values that are predicted by the regression coefficients and intercept, from the mean of the predicted values.

The sum of squares residual is the sum of the squared deviations of the differences between the actual Y values and the predicted Y values, from the mean of those deviations.

Calculating the Predicted Values

Those two definitions of sums of squares are fairly dense when written in English. It’s usually easier to understand what's going on if you think about them in the context of an Excel worksheet. See Figure 5.

In Figure 5, I have repeated the regression coefficients and the intercept, as calculated using the matrix algebra discussed earlier, in the range G3:J3. Because they appear in the correct order, you can easily use them to calculate the predicted Y values as shown in the range L3:L22. This is the formula that's used in cell L3:

=$G$3+SUMPRODUCT(C3:E3,$H$3:$J$3)

The intercept and coefficients in G3:J3 are identified using dollar signs and therefore absolute addressing. The X values in C3:E3 are identified using relative addressing. Therefore, you can drag and drop or copy and paste from cell L3 into the range L4:L22.

Just as a check, Figure 5 also shows the predicted Y values in M3:M22, using this array formula in that range:

=TREND(A3:A22,C3:E22)

You'll note that the predicted values using matrix algebra are identical to the predicted values using TREND(). There are actually slight differences, but they do not begin to show up until the 14th decimal place. (For example, the difference between cell L8 and cell M8 is 0.000000000000057.)

Note

These minuscule differences are due to the fact that Excel 2003 and subsequent versions do not use traditional matrix algebra in LINEST(), but instead use an approach called QR Decomposition. The third article in this series has a brief discussion of that approach and the rationale for its usage.

Calculating the Prediction Errors

The values shown in Figure 5, in the range O3:O22, are the errors in the predicted values. They are simply the differences between the actual Y values in A3:A22 and the predicted values in L3:L22. So, for example, the formula in cell O3 is =A3-L3.

Calculating the Sums of Squares

With the predicted values and the errors of prediction, we're in a position to calculate the sums of squares. The sum of squares regression is found with this formula in cell G24:

=DEVSQ(L3:L22)

and the sum of squares residual is found with a similar formula in cell H24:

=DEVSQ(O3:O22)

Notice that the two sums of squares total to 21612.95. This is the same value as appears in cell G26. The formula in G26 is:

=DEVSQ(A3:A22)

which is the sum of the squared deviations of the original Y values. So, the process described in this section has accomplished the following:

Predicted Y values on the basis of the combination of the X values and the regression coefficients and intercept.

Obtained the sum of squared deviations of the predicted Y values (the sum of squares regression).

Calculated the errors of prediction by subtracting the predicted Y values from the actual Y values.

Obtained the sum of squared deviations of the errors of prediction (the sum of squares residual).

Demonstrated that the total sum of squares of the actual Y values has been divided into two portions: the sum of squares regression and the sum of squares residual.

Calculating the Regression Diagnostics

Now that we have the sum of squares regression and the sum of squares residual, it's easy to get the results that help you diagnose the accuracy of the regression equation.

Calculating R2

The R2 is simply the proportion of variability in the Y values that can be attributed to variability in the best combination of the X variables. That best combination is the result of applying the regression coefficients to the X variables—that is, the best combination is represented by the predicted Y values.

Therefore, the R2 is calculated by this ratio:

(Sum of Squares Regression) / (Sum of Squares Total)

Because the sum of squares total is the sum of the regression and the residual sums of squares, you can easily calculate R2 on the worksheet as shown in Figure 6.

which returns the ratio of the regression sum of squares to the total sum of squares.

Calculating the Standard Error of Estimate

At this point, you need to keep in mind the way that you’ve set up your inputs. In Figure 6, I’ve set things up so that the column of 1's is shown explicitly on the worksheet. That's because the column is needed if you're going to obtain the value of the regression equation’s intercept by means of matrix algebra—the instances of MMULT(),TRANSPOSE(), and MINVERSE() that I’ve discussed in this paper.

In that case—if you're showing the column of 1's explicitly—you get the degrees of freedom for the sum of squares residual by subtracting the number of X variables on the worksheet from the number of observations, or rows, in the matrix of X values.

In the example shown in Figure 6, the number of observations is 20, found in rows 3 through 22. The number of variables is 4, found in columns B through E. Therefore, the number of degrees of freedom for the sum of squares residual is 16: 20-4. You can confirm this from the LINEST() results in Figure 6, cells G6:J10, where the degrees of freedom shows up in cell H9.

On the other hand, if you want to use LINEST() directly, you don't need to supply the column of 1's on the worksheet: Excel supplies the 1's for you and you never see them. But if you're going to determine the degrees of freedom residual for yourself, then subtract the number of X variables (in this case, 3) from the number of observations (20) and then subtract 1 from the result to get 16.

In fact, you'll find that most intermediate statistics texts tell you that the degrees of freedom for the residual sum of squares is N-k-1, where N is the number of observations, k is the number of predictor variables, and 1 is for the column of 1's that you never see unless you arrange for them yourself.

So, to get the standard error of estimate, divide the sum of squares residual by the degrees of freedom for the residual, and take the square root of the result. The formula used in cell G15 of Figure 6 is:

=SQRT(H12/16)

The result is identical to that provided in the LINEST() results in cell H8.

Calculating the F Ratio for the Regression

There are a couple of ways to go about calculating the F ratio for the full regression. Both involve using the degrees of freedom for the residual and the degrees of freedom for the regression.

The prior section discussed how to get the degrees of freedom for the residual. The degrees of freedom for the regression is the number of X variables minus 1. So, if you have supplied the column of 1's explicitly on the worksheet, as in Figure 6, there are four X variables, and the degrees of freedom for the regression is 3.

If you use LINEST() and do not supply a column of 1's to it as an X variable—because Excel does that on your behalf—you still have four X variables; it's just that you're not looking at one of them. So the degrees of freedom for the regression is still 4-1, or 3, in this example.

One way to calculate the F ratio is to use the R2 value. Figure 6 does that in cell G17, where the formula is:

=(G14/3)/((1-G14)/16)

In words, the numerator is the R2 value divided by the regression degrees of freedom. The denominator is (1 – R2) divided by the residual degrees of freedom.

Another way uses the sums of squares instead of the R2 value. It's mathematically equivalent because we use the sums of squares to calculate the R2 value. The formula used in cell G18 of Figure 6 is:

=(G12/3)/(H12/16)

The numerator is the sum of squares regression divided by its degrees of freedom. The denominator is the sum of squares residual divided by its degrees of freedom.

You may know that a sum of squared deviations divided by its degrees of freedom is a variance, often termed a mean square. That's what we have in cell G18: one variance divided by another. And the ratio of two variances is an F ratio.

Here, we have the variance of the Y scores as predicted by the regression equation, divided by the variance of the errors in those predictions. If the resulting ratio is meaningfully larger than 1.0, we regard the regression as a reliable one: an outcome that we expect to be similar if we repeat this research with a different but similarly obtained sample of observations. And you can test the reliability of the observed F ratio by using Excel's F.DIST() function.

Getting the Standard Errors

The final task in deconstructing the LINEST() function is to calculate the values of the standard errors of the intercept and the regression coefficients. These values are returned in the second row of the LINEST() results. Figure 7 shows the required calculations.

Figure 7 shows the SSCP matrix and its inverse, shown earlier in Figure 4. To get the standard errors of the regression coefficients and the intercept, we need to multiply the inverse of the SSCP matrix by the mean square for the residual.

Figure 7 shows the inverse of the SSCP matrix in cells G12:J15.

The prior section showed how to calculate the mean square residual: simply divide the sum of squares residual by the residual degrees of freedom. Figure 7 does that for this example in cell M14, using this formula:

=L14/16

Note that L14 contains the sum of squares residual, and 16 is the degrees of freedom for the residual.

Note

Cell L14 in Figure 7 calculates the sum of squares residual in a more concise fashion than is done in Figures 5 and 6, where the errors of prediction (the residuals) are shown explicitly and the DEVSQ() function is used to get the sum of squares. Cell L14 in Figure 7 uses this array formula instead:

=SUM(((A3:A22)-(MMULT(B3:E22,TRANSPOSE(G3:J3))))^2)

which accomplishes the same result within the formula instead of showing the intermediate calculations on the worksheet.

The matrix shown in Figure 7, cells G18:J21, is the result of multiplying the inverse of the SSCP matrix by the mean square residual. The array formula is:

=G12:J15*M14

The square roots of the elements in the main diagonal of the matrix in G18:J21 are the standard errors for the regression equation. They are shown in Figure 7, in cells G24:J24. The formulas are as follows:

G24: =SQRT(G18)

H24: =SQRT(H19)

I24: =SQRT(I20)

J24: =SQRT(J21)

The relevant portion of the LINEST() results is also shown in Figure 7, in cells L24:O24. Note that the values in that range are identical to those in G24:J24, but of course LINEST() returns them in reverse of the order in which the original variables are entered on the worksheet.