

Terms

regression: an analytic method to measure the association of one or more independent variables with a dependent variable.


Full Text

The coefficient of determination (denoted r2 and pronounced "r squared") is a statistic used in the context of statistical models whose main purpose is either the prediction of future outcomes or the testing of hypotheses on the basis of other related information.
It provides a measure of how well observed outcomes are replicated by the model, expressed as the proportion of the total variation in the outcomes that is explained by the model.
Values for r2 can be calculated for any type of predictive model, which need not have a statistical basis.

The Math

A data set will have observed values yi and modelled values fi, sometimes known as predicted values.
The "variability" of the data set is measured through different sums of squares, such as:

the total sum of squares, SStot = Σ(yi − ȳ)², where ȳ is the mean of the observed data, and

the residual sum of squares, SSerr = Σ(yi − fi)².

The most general form of the coefficient of determination is then:

r2 = 1 − SSerr / SStot
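The sums-of-squares computation can be sketched in Python (a minimal illustration of the general formula, not code from the original text):

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SSerr / SStot."""
    mean_y = sum(observed) / len(observed)
    # Total sum of squares: spread of the observed data about its mean.
    ss_tot = sum((y - mean_y) ** 2 for y in observed)
    # Residual sum of squares: spread of the data about the model's predictions.
    ss_err = sum((y - f) ** 2 for y, f in zip(observed, predicted))
    return 1 - ss_err / ss_tot


# A model that reproduces the data exactly gives r2 = 1.
print(r_squared([1, 2, 3], [1, 2, 3]))        # 1.0
# A model with small residuals gives r2 close to 1.
print(r_squared([1, 2, 3], [1.1, 2.0, 2.9]))  # 0.99
```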

Properties and Interpretation of r2

In simple linear regression, the coefficient of determination is the square of the correlation coefficient r.
It is usually stated as a percent, rather than in decimal form.
In context of data, r2 can be interpreted as follows:

r2, when expressed as a percent, represents the percent of variation in the dependent variable y that can be explained by variation in the independent variable x using the regression (best fit) line.

1 - r2, when expressed as a percent, represents the percent of variation in y that is NOT explained by variation in x using the regression line.
This can be seen as the scattering of the observed data points about the regression line.

So r2 is a statistic that gives some information about the goodness of fit of a model.
In regression, the r2 coefficient of determination is a statistical measure of how well the regression line approximates the real data points.
An r2 of 1 indicates that the regression line perfectly fits the data.

In many (but not all) instances where r2 is used, the predictors are calculated by ordinary least-squares regression: that is, by minimizing SSerr.
In this case, r2 never decreases as we increase the number of variables in the model.
This illustrates a drawback to one possible use of r2, where one might keep adding variables to increase the r2 value.
For example, if one is trying to predict the sales of a car model from the car's gas mileage, price, and engine power, one can include such irrelevant factors as the first letter of the model's name or the height of the lead engineer designing the car because the r2 will never decrease as variables are added and will probably experience an increase due to chance alone.
This leads to the alternative approach of looking at the adjusted r2.
The explanation of this statistic is almost the same as r2 but it penalizes the statistic as extra variables are included in the model.
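The penalty can be illustrated with the conventional adjusted-r2 formula, 1 − (1 − r2)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors (a sketch using the standard formula, which is not spelled out in the text above):

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted r2: penalizes r2 as predictors are added.

    n: number of observations; p: number of predictors.
    Uses the conventional formula 1 - (1 - r2) * (n - 1) / (n - p - 1).
    """
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# With r2 = 0.44 from 11 observations and 1 predictor,
# the adjusted value drops below the raw r2:
print(round(adjusted_r_squared(0.44, n=11, p=1), 4))  # 0.3778
# Adding a second predictor with the same raw r2 increases the penalty:
print(round(adjusted_r_squared(0.44, n=11, p=2), 4))  # 0.3
```

The raw r2 can only go up when a variable is added, but the adjusted statistic falls unless the new variable explains enough variation to offset the penalty.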

Note that r2 does not indicate whether:

the independent variables are a cause of the changes in the dependent variable;

there is collinearity present in the data on the explanatory variables; or

the model might be improved by using transformed versions of the existing set of independent variables.

Example

Consider the third exam/final exam example introduced in the previous section.
The correlation coefficient is r = 0.6631.
Therefore, the coefficient of determination is r2 = 0.66312 = 0.4397.

The interpretation of r2 in the context of this example is as follows.
Approximately 44% of the variation (0.4397 is approximately 0.44) in the final exam grades can be explained by the variation in the grades on the third exam.
Therefore approximately 56% of the variation (1 - 0.44 = 0.56) in the final exam grades can NOT be explained by the variation in the grades on the third exam.
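The arithmetic in this example can be checked directly (a quick sketch of the calculation above):

```python
r = 0.6631          # correlation coefficient from the example
r2 = r ** 2         # coefficient of determination

print(round(r2, 4))                  # 0.4397
print(f"explained: {r2:.0%}")        # explained: 44%
print(f"unexplained: {1 - r2:.0%}")  # unexplained: 56%
```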
