Linear Correlation and The Covariance Matrix

Introduction

In many exploratory analyses, understanding how variables correlate is a necessary step. Measuring correlation builds intuition about the problem at hand and can reveal simple multicollinearity if it is present.

The covariance matrix is the basis for understanding linear correlation. Each element of the matrix is defined as the following, where E[·] is the expectation value and < Xi > is the mean of the vector Xi.

Σij = Cov(Xi, Xj) = E[(Xi - <Xi>) (Xj - <Xj>)]

Generate Dataset

Below we generate a dataset of two random vectors, each of length 10.

import numpy as np
X = np.random.normal(0, 1, 20).reshape(2, 10)

Calculating Covariance

Let’s begin by calculating the Σij ourselves using the equation above…
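A minimal sketch of that calculation, dividing by n as the equation suggests. The seed is an assumption added here for reproducibility; the specific values quoted below came from a different random draw.

```python
import numpy as np

np.random.seed(0)  # assumption: seed added so results are reproducible
X = np.random.normal(0, 1, 20).reshape(2, 10)

n = X.shape[1]
centered = X - X.mean(axis=1, keepdims=True)  # Xi - <Xi> for each row
cov_biased = centered @ centered.T / n        # divide by n

print(cov_biased[0, 0])  # our naive value
print(np.cov(X)[0, 0])   # numpy's default value, slightly larger
```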

One should immediately notice that the covariance values from numpy and pandas disagree with our initial calculation: Σ00 = 1.44 for numpy and pandas, but Σ00 = 1.29 for our calculation. What causes this difference? Both numpy and pandas calculate the unbiased estimate of the covariance, while we naively calculated the biased estimate. The covariance describes a population, but the data we have is only a sample of that population. For large sample sizes, the sample covariance approaches that of the population. For small sample sizes, the biased estimate will, on average, be smaller than the true population covariance, hence the term biased.

How do we calculate the unbiased covariance? In our calculation above, we used the mean as the expectation value and divided by n, which yielded the biased covariance values. The unbiased value is obtained by dividing by n-1 instead of n…

Σij = Cov(Xi, Xj) = 1/(n-1) · Σn (Xi,n - <Xi>) · (Xj,n - <Xj>)
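A sketch of the n-1 correction, again using a stand-in for the dataset generated earlier; numpy's np.cov applies this correction by default.

```python
import numpy as np

np.random.seed(0)  # assumption: seed added for reproducibility
X = np.random.normal(0, 1, 20).reshape(2, 10)

n = X.shape[1]
centered = X - X.mean(axis=1, keepdims=True)
cov_unbiased = centered @ centered.T / (n - 1)  # divide by n - 1

# This now matches numpy's (and pandas') default behavior
print(np.allclose(cov_unbiased, np.cov(X)))
```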

More information on biased and unbiased estimators can be found on Wikipedia.

Interpretation of Covariance

The covariance matrix Σ is a symmetric matrix that describes how one variable varies with another, i.e. correlation. If two variables are truly independent, their covariance will be close to zero. If the two variables tend to increase together, the covariance will be positive. Likewise, if one variable tends to decrease as the other increases, the covariance will be negative.
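This sign behavior can be illustrated with three synthetic variables; the construction and seed here are assumptions for the demonstration.

```python
import numpy as np

np.random.seed(1)  # assumption: seed for reproducibility
t = np.random.normal(0, 1, 1000)
up = t + np.random.normal(0, 0.1, 1000)     # tends to increase with t
down = -t + np.random.normal(0, 0.1, 1000)  # tends to decrease as t increases
indep = np.random.normal(0, 1, 1000)        # unrelated to t

print(np.cov(t, up)[0, 1])     # positive
print(np.cov(t, down)[0, 1])   # negative
print(np.cov(t, indep)[0, 1])  # near zero
```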

The absolute size of a particular covariance element Σij depends both on the variability of the individual variables and on the degree of correlation between them. This obscures our ability to glance at the covariance values and quickly judge the degree of correlation.

Correlation via Pearson’s R Value

Linear correlation is measured by normalizing the covariance value by the standard deviations of the two variables. This metric is referred to as Pearson’s R.

Rij = Cov(Xi, Xj) / ( σi · σj )

The value is now bounded between -1 and 1. Again, uncorrelated variables have values near zero, and strongly correlated variables approach +1 or -1 depending on whether the correlation is positive or negative.

In Python we can calculate Pearson’s R value with the help of the numpy library. Notice that we pass the ddof parameter a value of 1; this ensures that numpy returns the unbiased estimate of the standard deviation.
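A sketch of that calculation, assuming the 2×10 array X from earlier (the seed is an addition for reproducibility); numpy's built-in np.corrcoef gives the same result.

```python
import numpy as np

np.random.seed(0)  # assumption: seed for reproducibility
X = np.random.normal(0, 1, 20).reshape(2, 10)

# Pearson's R = Cov(Xi, Xj) / (sigma_i * sigma_j), all unbiased estimates
cov = np.cov(X)  # unbiased by default
r = cov[0, 1] / (np.std(X[0], ddof=1) * np.std(X[1], ddof=1))

print(r)
print(np.corrcoef(X)[0, 1])  # numpy's built-in agrees
```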

Alternatively, pandas makes life much easier, assuming the data are represented as a pandas.DataFrame object. The corr method supports several correlation measures; Pearson’s method calculates linear correlation. Here we explicitly pass the method parameter the string value 'pearson'.

print(df_X.corr(method='pearson'))

Output:

         0        1
0  1.00000  0.08746
1  0.08746  1.00000

Practical Calculations of Correlation and Covariances

With most of the examples below, distinguishing biased and unbiased estimators becomes less important. As datasets grow larger and larger, n and n-1 are for all practical purposes equal.
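This convergence is easy to check numerically; the sample size here is an arbitrary choice for illustration.

```python
import numpy as np

np.random.seed(0)  # assumption: seed for reproducibility
big = np.random.normal(0, 1, (2, 100_000))

biased = np.cov(big, bias=True)[0, 0]  # divide by n
unbiased = np.cov(big)[0, 0]           # divide by n - 1

print(abs(biased - unbiased))  # negligible at this sample size
```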