Calculating the Pearson product-moment correlation coefficient

The Pearson product-moment correlation coefficient (PMCC) is a quantity
between -1.0 and 1.0 that measures the strength and direction of the
linear relationship between two random variables.

The PMCC in its usual form is somewhat cumbersome to calculate. Using
simple algebra, I have rearranged it to form an expression that should have
better numerical
stability and require fewer calculations.

Disclaimer: This page is primarily for my own reference. I am a
programmer without formal training in statistics, and I don't even
feel like I know what I'm doing. There are probably a ton of
assumptions that I am unwittingly making, I am almost certainly misusing the
terminology, and I could simply be flat-out wrong here. My apologies to
statisticians and to people like Zed Shaw who
have a far greater understanding of this stuff than I do. If you blindly
trust this page while building something important---even after what I just
told you---then the blame is all yours. Use your brain. Don't believe
everything you read. This is not even a very interesting article: it's
mostly just algebra. Don't read this page; it's a waste of your time.

On a more serious note: In an attempt to make this article less
cringe-worthy, I made an effort to find the original peer-reviewed article(s)
where the Pearson correlation might be defined precisely, but nothing I read
cited primary references (MathWorld
just cited textbooks, for example) and I don't have the money to buy
expensive journal articles for every little web page I write. After
searching for most of a day, I finally gave up in frustration and decided to
post this article anyway, flaws and all. If this article makes you cringe,
please consider doing something to advance the principle of open access.

The Math

Product-Moment Correlation Coefficient (PMCC)

Imagine we have two populations X and Y. Then
ρ_{X,Y} represents the product-moment coefficient of correlation between
them.

Various websites and textbooks describe the correlation coefficient in several equivalent ways:

As the ratio between the covariance of X and Y and the product of their standard deviations:

    ρ_{X,Y} = cov(X, Y) / (σ_X · σ_Y)

As the sum of the products of each pair of standard scores of the X and Y values, all divided by the number of degrees of freedom (n − 1 for a sample):

    r = ( Σ_{i=1..n} ((x_i − x̄)/s_X) · ((y_i − ȳ)/s_Y) ) / (n − 1)
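The two forms above are algebraically equivalent. As a quick check, here is a minimal Python sketch (the sample data are made up purely for illustration) that computes both for the same sample:

```python
import math

# Hypothetical sample data, purely for illustration.
xs = [1.0, 2.0, 4.0, 5.0]
ys = [1.0, 3.0, 4.0, 6.0]
n = len(xs)

xbar = sum(xs) / n
ybar = sum(ys) / n

# Sample standard deviations (dividing by n - 1 degrees of freedom).
sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))

# Form 1: sample covariance divided by the product of standard deviations.
cov = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
r1 = cov / (sx * sy)

# Form 2: sum of products of standard scores, divided by degrees of freedom.
r2 = sum(((x - xbar) / sx) * ((y - ybar) / sy)
         for x, y in zip(xs, ys)) / (n - 1)
```

Both computations yield the same value of r, up to floating-point round-off.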

We often can't work with populations directly, so we can't determine the
exact value of ρ_{X,Y}. However, we can estimate it by
selecting a random sample of (x,y) pairs. This estimate is often labelled
r. Since we can use the same formula for either case, I call the
general formula PMCC(X,Y).

Definitions

Let x̄ and ȳ be the arithmetic
means of the elements in X and Y, respectively.

Let sX and sY be the standard deviations of
X and Y, respectively.

Then, the following relations apply. Note how we define N in
order to avoid having to do two separate analyses for population and
sample data:

    N = n        for population data,
    N = n − 1    for sample data

    PMCC(X, Y) = Σ_{i=1..n} (x_i − x̄)(y_i − ȳ) / (N · s_X · s_Y),

    s_X = √( Σ (x_i − x̄)² / N ),    s_Y = √( Σ (y_i − ȳ)² / N )

Simple sums

Given the vectors X and Y, there are a few things we can calculate right away:

The sums of all the elements in each vector:

    S_x = Σ_{i=1..n} x_i ,    S_y = Σ_{i=1..n} y_i

The squares of the sums of all the elements in each vector:

    (S_x)² = (Σ x_i)² ,    (S_y)² = (Σ y_i)²

The sums of the squares of all the elements in each vector:

    S_xx = Σ x_i² ,    S_yy = Σ y_i²

The sum of the products of the corresponding elements from each vector:

    S_xy = Σ x_i y_i

Arithmetic means

We can now express the arithmetic means in terms of our previous calculations:

    x̄ = S_x / n ,    ȳ = S_y / n

Standard deviations

To simplify the standard deviations, we first reduce the sum of the squared deviations:

    Σ (x_i − x̄)² = Σ x_i² − 2·x̄·Σ x_i + n·x̄²
                 = S_xx − 2·(S_x/n)·S_x + n·(S_x/n)²
                 = S_xx − (S_x)²/n

Then, we simplify the variance:

    s_X² = Σ (x_i − x̄)² / N = ( S_xx − (S_x)²/n ) / N

Therefore, the reduced standard deviations are:

    s_X = √( (S_xx − (S_x)²/n) / N ) ,    s_Y = √( (S_yy − (S_y)²/n) / N )
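To gain some confidence in the reduction of the sum of squared deviations, here is a small numerical check in Python (the data values are arbitrary):

```python
# Arbitrary data, purely for illustration.
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(xs)
mean = sum(xs) / n

# Definition: sum of squared deviations from the mean.
direct = sum((x - mean) ** 2 for x in xs)

# Reduced form: S_xx - (S_x)^2 / n.
s_x = sum(xs)
s_xx = sum(x * x for x in xs)
reduced = s_xx - s_x * s_x / n
```

Both expressions agree (here, 32.0), so dividing either by N yields the same variance.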

Reducing the PMCC formula, Part 1

Recall the PMCC formula:

    PMCC(X, Y) = Σ (x_i − x̄)(y_i − ȳ) / (N · s_X · s_Y)

and the results we derived:

    s_X = √( (S_xx − (S_x)²/n) / N ) ,

    s_Y = √( (S_yy − (S_y)²/n) / N )

Performing the substitution, we get:

    PMCC(X, Y) = Σ (x_i − x̄)(y_i − ȳ) / ( N · √((S_xx − (S_x)²/n)/N) · √((S_yy − (S_y)²/n)/N) )
               = Σ (x_i − x̄)(y_i − ȳ) / √( (S_xx − (S_x)²/n) · (S_yy − (S_y)²/n) )

Notice how the N's cancel: the formula no longer depends on whether the data is from a population or from a sample.

Reducing the PMCC formula, Part 2

Let's reduce the summation from the previous section:

    Σ (x_i − x̄)(y_i − ȳ) = Σ x_i y_i − x̄·Σ y_i − ȳ·Σ x_i + n·x̄·ȳ
                          = S_xy − (S_x/n)·S_y − (S_y/n)·S_x + n·(S_x/n)·(S_y/n)
                          = S_xy − (S_x · S_y)/n

Substituting, we get:

    PMCC(X, Y) = ( S_xy − (S_x · S_y)/n ) / √( (S_xx − (S_x)²/n) · (S_yy − (S_y)²/n) )
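As a numerical sanity check of the covariance-sum reduction above, here is another small Python sketch (again with arbitrary data):

```python
# Arbitrary data, purely for illustration.
xs = [1.0, 2.0, 4.0, 5.0]
ys = [2.0, 1.0, 5.0, 4.0]
n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

# Definition: sum of products of deviations from the means.
direct = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))

# Reduced form: S_xy - (S_x * S_y) / n.
s_xy = sum(x * y for x, y in zip(xs, ys))
reduced = s_xy - sum(xs) * sum(ys) / n
```

Both expressions give the same result (here, 8.0).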

Conclusion

Given two populations or samples X = {x_1, x_2, ..., x_n} and
Y = {y_1, y_2, ..., y_n} (and subject to some assumptions about the distributions of the data), the Pearson product-moment correlation coefficient of the two is given by:

    PMCC(X, Y) = ( S_xy − (S_x · S_y)/n ) / √( (S_xx − (S_x)²/n) · (S_yy − (S_y)²/n) )

where the following variables are defined:

    S_x = Σ_{i=1..n} x_i ,     S_y = Σ_{i=1..n} y_i ,

    S_xx = Σ_{i=1..n} x_i² ,   S_yy = Σ_{i=1..n} y_i² ,

    S_xy = Σ_{i=1..n} x_i y_i

Alternatively, we can use the expanded form (which is obtained by applying the previous substitutions and multiplying the numerator and denominator by n):

    PMCC(X, Y) = ( n·Σ x_i y_i − Σ x_i · Σ y_i ) / √( (n·Σ x_i² − (Σ x_i)²) · (n·Σ y_i² − (Σ y_i)²) )
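The expanded form translates directly into code, since all five sums can be accumulated in a single pass over the data. Here is a sketch in Python (the function name and structure are my own, not from any particular library):

```python
import math

def pmcc(xs, ys):
    """Pearson correlation via the expanded single-pass formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples of size >= 2")
    n = len(xs)
    s_x = s_y = s_xx = s_yy = s_xy = 0.0
    for x, y in zip(xs, ys):        # accumulate all five sums in one pass
        s_x += x
        s_y += y
        s_xx += x * x
        s_yy += y * y
        s_xy += x * y
    num = n * s_xy - s_x * s_y
    den = math.sqrt((n * s_xx - s_x * s_x) * (n * s_yy - s_y * s_y))
    return num / den

# Perfectly correlated data should give r = 1; anti-correlated, r = -1.
print(pmcc([1, 2, 3], [2, 4, 6]))
print(pmcc([1, 2, 3], [6, 4, 2]))
```

Note that den is zero whenever either sample has zero variance, so a real implementation should handle that case explicitly. Also, subtracting large, nearly equal sums can itself lose precision, which ties into the implementation notes below.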

Implementation issues

Computing sums of many terms

When computing the sum of many floating-point values that vary widely in
magnitude, you can obtain a better approximation (less round-off error) by
sorting the terms in ascending order of absolute value before adding them
together. That way, the smaller numbers are accumulated together before
being added to larger numbers, rather than being immediately lost to
round-off against a much larger running total. The Wikipedia article on
numerical stability has a bit more information about this.
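A minimal Python sketch of the effect (the values are contrived to make the round-off obvious):

```python
# One large value and ten small ones. Added largest-first, each 1.0
# falls below the precision of the huge running total and is lost.
terms = [1e16] + [1.0] * 10

naive = 0.0
for t in terms:                    # big value first
    naive += t

careful = 0.0
for t in sorted(terms, key=abs):   # smallest magnitudes first
    careful += t

print(naive)    # the ten 1.0 terms vanished
print(careful)  # the ten 1.0 terms were preserved
```

Sorting by absolute value (rather than plain ascending order) also behaves sensibly when positive and negative terms are mixed. Python's math.fsum performs a correctly rounded summation and would give the same protected result here.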