Tidbit: Correlation and Simple Linear Regression

In business, “correlation” is used generically to mean a mutual relationship or connection between two or more things; statistically speaking, correlation is the interdependence of variable quantities. I often overhear end users request information on the correlation of variables for prediction purposes, when what they are actually referring to is simple linear regression. I don’t mean to outline all the math behind either function; rather, I’d like to differentiate the fundamental reasoning for the business user.

Whether you are examining the data in Excel via CORREL(), R via cor(), or MATLAB via corrcoef(x,y), correlation is best used when X and Y are two variables you measure but do not control. Simple linear regression would be used if you control X and are measuring Y. Time allowed to bake or grams of baking soda used are variables you might control (X), whereas the height or density of the resulting cake would be the output variable (Y).
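As a sketch of the two computations side by side, here is one way to get both the correlation coefficient and the fitted regression line in Python with numpy (the baking-soda numbers below are hypothetical, made up purely for illustration):

```python
import numpy as np

# Hypothetical cake-baking data: X is controlled (grams of baking soda),
# Y is measured (height of the resulting cake in cm).
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

# Pearson correlation coefficient (the same quantity CORREL() in Excel
# or cor() in R would return):
r = np.corrcoef(x, y)[0, 1]

# Simple linear regression: fit y = alpha + beta * x by least squares.
beta, alpha = np.polyfit(x, y, 1)

print(r)            # strength of the linear relationship
print(alpha, beta)  # intercept and slope of the fitted line
```

Correlation yields a single symmetric number describing how tightly the points cluster around a line; regression yields an intercept and slope you can actually plug new X values into.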

Similarities:

The standardized regression coefficient is the same as Pearson’s correlation coefficient (as opposed to Kendall’s or Spearman’s).

The square of Pearson’s correlation coefficient is the same as the R² in simple linear regression. R² provides information about the goodness of fit of a model: in regression, it is a statistical measure of how well the regression line approximates the real data points. For example, if R² were equal to 1.0 (its maximum value), the regression line would perfectly fit the data.
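Both identities above can be checked numerically. The following Python sketch (on hypothetical data) standardizes the slope using the sample standard deviations and computes R² from the residual and total sums of squares:

```python
import numpy as np

# Illustrative data (hypothetical values).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

r = np.corrcoef(x, y)[0, 1]        # Pearson's correlation coefficient
beta, alpha = np.polyfit(x, y, 1)  # simple linear regression slope, intercept

# Standardized regression coefficient: the slope after rescaling by the
# sample standard deviations of x and y. It equals Pearson's r.
beta_std = beta * x.std(ddof=1) / y.std(ddof=1)

# R-squared: 1 - SS_residual / SS_total. It equals r squared.
y_hat = alpha + beta * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(np.isclose(beta_std, r))        # True
print(np.isclose(r_squared, r ** 2))  # True
```

These equalities hold for any simple (one-predictor) least-squares fit, which is exactly why the two tools feel interchangeable for describing a linear relationship.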

Neither correlation nor simple linear regression directly answers questions of causality.

Differences:

The regression equation (y=α+βx) can be used to make predictions on Y based on values of X.

Correlation usually refers to linear relationships, but it can refer to other forms of dependence such as polynomial or truly nonlinear relationships.
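The prediction use mentioned under Differences can be sketched as follows; the baking-time data here is again hypothetical, and `predict` is simply an illustrative helper wrapping the fitted equation y = α + βx:

```python
import numpy as np

# Fit y = alpha + beta * x on hypothetical data, then predict Y for a new X.
x = np.array([10.0, 20.0, 30.0, 40.0])  # e.g. minutes of baking time
y = np.array([2.5, 4.8, 7.6, 9.9])      # e.g. cake height in cm

beta, alpha = np.polyfit(x, y, 1)

def predict(x_new):
    """Predict Y from the fitted regression line y = alpha + beta * x."""
    return alpha + beta * x_new

print(predict(25.0))  # interpolated prediction between observed X values
```

Correlation alone offers no such prediction machinery: r tells you the two variables move together, but only the regression equation gives you a value of Y for a chosen X.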


To leave a comment for the author, please follow the link and comment on his blog: Kevin Davenport » R.