MySQL Correlation Computations

Updated 4 years ago

Probably more bunk has been written about correlation than about any other statistic. We'll keep this short and straight. At its simplest, correlation is a statistical measure of non-random, linear association between pairs of values in a dataset. It's denoted by r, and varies from -1 through +1, where -1 indicates perfect inverse correlation (the regression line goes down left to right), 0 indicates no correlation (there is no regression line; it's just a scatterplot), and +1 indicates perfect direct correlation (the regression line goes up left to right).

-- ---------------------------------------------------------------------------
-- Correlation
-- from the Artful Common Queries page
-- http://www.artfulsoftware.com/infotree/qrytip.php?id=837
-- ---------------------------------------------------------------------------
-- Probably more bunk has been written about correlation than about any other
-- statistic. We'll keep this short and straight. At its simplest, correlation
-- is a statistical measure of non-random, linear association between pairs of
-- values in a dataset. It's denoted by r, and varies from -1 through +1,
-- where -1 indicates perfect inverse correlation (the regression line goes down
-- left to right), 0 indicates no correlation (there is no regression line;
-- it's just a scatterplot), and +1 indicates perfect direct correlation (the
-- regression line goes up left to right).
-- For an example we'll use a bit of imaginary data:
DROP TABLE
IF EXISTS t;
CREATE TABLE t (id INT, x INT, y FLOAT);
INSERT INTO t
VALUES
(1, 68, 4.1),
(2, 71, 4.6),
(3, 62, 3.8),
(4, 75, 4.4),
(5, 58, 3.2),
(6, 60, 3.1),
(7, 67, 3.8),
(8, 68, 4.1),
(9, 71, 4.3),
(10, 69, 3.7),
(11, 68, 3.5),
(12, 67, 3.2),
(13, 63, 3.7),
(14, 62, 3.3),
(15, 60, 3.4),
(16, 63, 4.0),
(17, 65, 4.1),
(18, 67, 3.8),
(19, 63, 3.4),
(20, 61, 3.6);
-- If you like to think about such problems concretely, you can think of
-- id as a subject's id, x as a subject's height, and y as a subject's
-- score on a self-confidence questionnaire, so we would be computing a
-- correlation between height and self-confidence.
-- There are many correlation formulas. Most commonly used is the
-- Pearson product-moment correlation coefficient, which is valid only
-- for normally distributed data (data which roughly fits a bell curve).
-- A good Wikipedia page (http://wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient)
-- offers several formulas for the Pearson coefficient.
-- Note that when the data skews significantly from a normal distribution,
-- you very likely need a different formula. That's one problem with
-- correlation: it isn't always a valid number to calculate.
-- Another problem is that computing just the correlation coefficient is
-- never enough. You also need to find the probability that the coefficient
-- you calculated is significantly different from 0. The Wikipedia
-- page on correlation describes several ways of testing the significance
-- of correlations.
-- You will probably also want the slope of the relationship or
-- regression line, its intercept with the Y axis, and the coefficient
-- of regression, which is the proportion of observed variation
-- due to correlation.
-- For this example we restrict ourselves to calculating correlation
-- assuming a normal distribution and no missing values.
-- Calculation of the correlation coefficient needs two passes: a first to
-- calculate basic statistical quantities, then a second to calculate the slope,
-- intercept and correlation coefficient from those basic quantities.
-- Calculating the coefficient of regression requires a third pass:
-- 1. Calculate the required basic statistics.
-- 2. Use the results of #1 to calculate slope, intercept and r.
-- 3. Use the slope and intercept from #2 to calculate the coefficient of regression.
-- 4. Collect and display the results.
-- If we're computing correlation in a database like MySQL, then,
-- you'd expect Views will be useful. Unfortunately we're immediately
-- bitten by two limitations in the MySQL implementation of Views: subqueries
-- in the FROM clause of a View are not supported, and neither are parameters.
-- Then do we need to encapsulate correlation in a stored procedure?
-- We could, but we needn't because SQL implementations like MySQL provide
-- the required basic statistical quantities, permitting us to do it all in one four-layer query ...
-- 1. The innermost query does step 1,
-- 2. The query that references the inner query does step 2,
-- 3. The next outer query calculates the regression coefficient by
-- cross-joining the step 2 query result, which is one logical row,
-- with the original table.
-- 4. The outermost query displays the result:
-- Read the query inside out:
SELECT -- Step 4
N, Slope, avgY - slope*avgX AS Intercept,
Correlation, CoeffOfReg
FROM (
SELECT -- Step 3
N, avgX, avgY, slope, intercept, Correlation,
FORMAT( 1 - SUM((y - intercept - slope*x)*(y - intercept - slope*x))/
((N-1)*varY), 5 ) AS CoeffOfReg
FROM t AS t2
JOIN (
SELECT -- Step 2
N, avgX, avgY, varY, slope,
Correlation, avgY - slope*avgX AS intercept
FROM (
SELECT
N, avgX, avgY, varY,
FORMAT(( N*sumXY - sumX*sumY ) /
( N*sumsqX - sumX*sumX ), 5 ) AS slope,
FORMAT(( sumXY - n*avgX*avgY ) /
( (N-1) * SQRT(varX) * SQRT(varY)), 5 ) AS Correlation
FROM (
SELECT -- Step 1
COUNT(x) AS N,
AVG(x) AS avgX,
SUM(x) AS sumX,
SUM(x*x) AS sumsqX,
VAR_SAMP(x) AS varX,
AVG(y) AS avgY,
SUM(y) AS sumY,
SUM(y*y) AS sumsqY,
VAR_SAMP(y) AS varY,
SUM(x*y) AS sumXY
FROM t
) AS sums
) AS calc
) stats
) combined;
-- +----+---------+---------------------+-------------+------------+
-- | N | Slope | Intercept | Correlation | CoeffOfReg |
-- +----+---------+---------------------+-------------+------------+
-- | 20 | 0.07066 | -0.8661640047683719 | 0.73064 | 0.53383 |
-- +----+---------+---------------------+-------------+------------+