Tag Archives: covariance

Covariance is the less understood sibling of correlation. While correlation is commonly used in reporting, covariance provides the mathematical underpinnings to a lot of different statistical concepts. Covariance describes how two variables change in relation to one another. If variable X increases and variable Y increases as well, Both X & Y will have positive covariance. Negative covariance will result from having two variables move in opposite directions, and zero covariance will result from the variables have no relationship with each other. Variance is also a specific case of covariance where both input variables are the same. All of this is rather abstract, so let’s look at more concrete definitions.

Covariance — Summation Notation

The definition you will find in most introductory stat books is a relatively simple equation using the summation operator (Σ). This shows covariances as the sum of the product of a paired data point relative to its mean. First, you need to find the mean of both variables. Then take all the data points and subtract the mean from its respective variable. Finally, you multiply the differences together

N is the number of data points in the population. n is the sample number. μX is the population mean for X; μY for Y. X̄ and Ȳ are the mean as well but this notation designates it as a sample mean rather than a population mean. Calculating the covariance of any significant data set can be tedious if done by hand, but we can set-up the equation in R and see it work. I used modified version of Anscombe’s Quartet data set.

Python

1

2

3

4

5

6

7

8

9

10

#get a data set

X&lt;-c(anscombe$x1,6,4,10)

Y=c(anscombe$y1,10,8,6)

#get the means

X.bar=mean(X)

Y.bar=mean(Y)

#calculate the covariance

sum((X-X.bar)*(Y-Y.bar))/(length(X))#manually population

sum((X-X.bar)*(Y-Y.bar))/(length(X)-1)#manually sample

cov(X,Y)#built-in function USES SAMPLE COVARIANCE

Obviously, since covariance is used so much within statistics, R has a built-in function cov(), which yields the sample covariance for two vectors or even a matrix.

Covariance — Expected Value Notation

[Trying to explain covariance in expected value notation makes me realize I should back up and explain the expected value operator, but that will have to wait for another post. Quickly and oversimplified, the expect value is the mean value of a random variable. E[X] = mean(X). The expected value notation below describes the population covariance of two variables (not sample covariance):

The above formula is just the population covariance written differently. For example, E[X] is the same as μx. And the E[] acts the same as taking the average of (X-E[X])(Y-E[Y]). After some algebraic transformations you can arrive at the less intuitive, but still useful formula for covariance:

This formula can be interpreted as the product of the means of variables X and Y subtracted from the average of signed areas of variables X and Y. This probably isn’t very useful if you are trying to interpret covariance. But you’ll see it from time to time. And it works! Try it in R and compare it to the population covariance from above.

Python

1

mean(X*Y)-mean(X)*mean(Y)#expected value notation

Covariance — Signed Area of Rectangles

Covariance can also be thought of as the sum of the signed area of the rectangles that can be drawn from the data points to the variables respective means. It’s called the signed area because we will get two types of rectangles, ones with a positive value and ones with negative values. Area is always a positive number, but these rectangles take on a sign by virtue of their geometric position. This is more of an academic exercise, in that it provides an understanding of what the math is doing and less of a practical interpretation and application of covariance. If you plot paired data points, in this case we will use the X and Y variables we have already used, you can tell just be looking there is probably some positive covariance because it looks like there is a linear relationship in the data. I’ve chosen to scale the plot so that zero is not included. Since this is a scatter plot including zero isn’t necessary.

First, we can draw the lines for the means of both variables as straight lines. These lines effectively create a new set of axes and will be used to draw the rectangles. The sides of the rectangles will be the difference between a data point and it’s mean [Xi - X̄]. When that is multiplied by [Yi - Ȳ], you can see that gives you an area of a rectangle. Do that for every point in your data set, add them up and divide by the number of data points, and you get the population covariance.

The following is a plot has a rectangle for each data point, and it is coded red for negative and blue for positive signs.

The overlapping rectangles need to be considered separately so the opacity is reduced so that all the rectangles are visible. For this data set there is much more blue area than there is red area, so there is positive covariance, which jives with what we calculated earlier in R. If you were to take the areas of those rectangles and add/subtract according to the blue/red color then divide by the number of rectangles, you would arrive the population covariance: 3.16. To get the sample covariance you’d subtract one from the number of rectangles when you divide.

Correlation is one the most commonly [over]used statistical tool. In short, it measures how strong the relationship between two variables. It’s important to realize that correlation doesn’t necessarily imply that one of the variables affects the other.

Basic Calculation and Definition

Covariance also measures the relationship between two variables, but it is not scaled, so it can’t tell you the strength of that relationship. For example, Let’s look at the following vectors a, b and c. Vector c is simply a 10X-scaled transformation of vector b, and vector a has no transformational relationship with vector b or c.

a

b

c

1

4

40

2

5

50

3

6

60

4

8

80

5

8

80

5

4

40

6

10

100

7

12

120

10

15

150

4

9

90

8

12

120

Plotted out, a vs. b and a vs. c look identical except the y-axis is scaled differently for each. When the covariance is taken of both a & b and a & c, you get different a large difference in results. The covariance between a & b is much smaller than the covariance between a & c even though the plots are identical except the scale. The y-axis on the c vs. a plot goes to 150 instead of 15.

To account for this, correlation is takes covariance and scales it by the product of the standard deviations of the two variables.

$latex
cor(X, Y) = \frac{cov(X, Y)}{s_X s_Y}
&s=2$

$latex
cor(a, b) = 0.8954
&s=2$

$latex
cor(a, c) = 0.8954
&s=2$

Now, correlation describes how strong the relationship between the two vectors regardless of the scale. Since the standard deviation in vector c is much greater than vector b, this accounts for the larger covariance term and produces identical correlations terms. The correlation coefficient will fall between -1 and 1. Both -1 and 1 indicate a strong relationship, while the sign of the coefficient indicates the direction of the relationship. A correlation of 0 indicates no relationship.

Here’s the R code that will run through the calculations.

Python

1

2

3

4

5

6

7

8

9

10

11

12

#covariance vs correlation

a<-c(1,2,3,4,5,5,6,7,10,4,8)

b<-c(4,5,6,8,8,4,10,12,15,9,12)

c<-c(4,5,6,8,8,4,10,12,15,9,12)*10

data<-data.frame(a,b,c)

cov(a,b)#8.5

cov(a,c)#85

cor(a,b)#0.8954

cor(a,c)#0.8954

Caution | Anscombe’s Quartet

Correlation is great. It’s a basic tool that is easy to understand, but it has its limitations. The most prominent being the correlation =/= causation caveat. The linked BuzzFeed article does a good job explaining the concept some ridiculous examples, but there are real-life examples being researched or argued in crime and public policy. For example, crime is a problem that has so many variables that it’s hard to isolate one factor. Politicians and pundits still try.

Another famous caution about using correlation is Anscombe’s Quartet. Anscombe’s Quartet uses different sets of data to achieve the same correlation coefficient (0.8164 give or take some rounding). This exercise is typically used to emphasize why it’s important to visualize data.

The graphs demonstrates how different the data sets can be. If this was real-world data, the green and yellow plots would be investigated for outliers, and the blue plot would probably be modeled with non-linear terms. Only the red plot would be consider appropriate for a basic, linear model.

I created this plot in R with ggplot2. The Anscombe data set is included in base R, so you don’t need to install any packages to use it. Ggplot2 is a fantastic and powerful data visualization package which can be download for free using the install.packages('ggplot2') command. Below is the R code I used to make the graphs individually and combine them into a matrix.

This tutorial is a continuation of making a covariance matrix in R. These tutorials walk you through the matrix algebra necessary to create the matrices, so you can better understand what is going on underneath the hood in R. There are built-in functions within R that make this process much quicker and easier.

The correlation matrix is is rather popular for exploratory data analysis, because it can quickly show you the correlations between variables in your data set. From a practical application standpoint, this entire post is unnecessary, because I’m going to show how to derive this using matrix algebra in R.

First, the starting point will be the covariance matrix that was computed from the last post.

This matrix has all the information that’s needed to get the correlations for all the variables and create a correlation matrix [V — variance, C — Covariance]. Correlation, we are using the Pearson version of correlation, is calculated using the covariance between two vectors and their standard deviations [s, square root of the variance]:

$latex
cor(X, Y) = \frac{cov(X,Y)}{s_{X}s_{Y}}
&s=2$

The trick will be using matrix algebra to easily carry out these calculations. The variance components are all on the diagonal of the covariance matrix, so in matrix algebra notation we want to use this:

Since R doesn’t quite work the same way as matrix algebra notation, the diag() function creates a vector from a matrix and a matrix from a vector, so it’s used twice to create the diagonal variance matrix. Once to get a vector of the variances, and a second time to turn that vector into the above diagonal matrix. Since the standard deviations are needed, the square root is taken. Also the variances are inverted to facilitate division.

Python

1

2

#pulls out the standard deviations from the covariance matrix

S<-diag(diag(C)^(-1/2))

After getting the diagonal matrix, basic matrix multiplication is used to get the all the terms in the covariance to reflect the basic correlation formula from above.

Understanding what a covariance matrix is can be helpful in understanding some more advanced statistical concepts. First, let’s define the data matrix, which is the essentially a matrix with n rows and k columns. I’ll define the rows as being the subjects, while the columns are the variables assigned to those subjects. While we use the matrix terminology, this would look much like a normal data table you might already have your data in. For the example in R, I’m going to create a 6×5 matrix, which 6 subjects and 5 different variables (a,b,c,d,e). I’m choosing this particular convention because R and databases use it. A row in a data frame represents represents a subject while the columns are different variables. [The underlying structure of the data frame is a collection of vectors.] This is against normal mathematical convention which has the variables as rows and not columns, so this won’t follow the normal formulas found else where online.

The covariance matrix is a matrix that only concerns the relationships between variables, so it will be a k x k square matrix. [In our case, a 5×5 matrix.] Before constructing the covariance matrix, it’s helpful to think of the data matrix as a collection of 5 vectors, which is how I built our data matrix in R.]

Python

1

2

3

4

5

6

7

8

9

#create vectors -- these will be our columns

a<-c(1,2,3,4,5,6)

b<-c(2,3,5,6,1,9)

c<-c(3,5,5,5,10,8)

d<-c(10,20,30,40,50,55)

e<-c(7,8,9,4,6,10)

#create matrix from vectors

M<-cbind(a,b,c,d,e)

The data matrix (M) written out is shown below.

Python

1

2

3

4

5

6

7

abcde

[1,]123107

[2,]235208

[3,]355309

[4,]465404

[5,]5110506

[6,]6985510

Each value in the covariance matrix represents the covariance (or variance) between two of the vectors. With five vectors, there are 25 different combinations that can be made and those combinations can be laid out in a 5×5 matrix.

There are a few different ways to formulate covariance matrix. You can use the cov() function on the data matrix instead of two vectors. [This is the easiest way to get a covariance matrix in R.]

Create a difference matrix (D) by subtracting the matrix of means (M_mean) from data matrix (M).

$latex {\bf D = M – M\_mean} &s=2$

Create the covariance matrix (C) by multiplying the transposed the difference matrix (D) with a normal difference matrix and inverse of the number of subjects (n) [We will use (n-1), since this is necessary for the unbiased, sample covariance estimator. This is covariance R will return by default.

It represents the various covariances (C) and variance (V) combinations of the five different variables in our data set. These are all values that you might be familiar with if you’ve used the var() or cov() functions in R or similar functions in Excel, SPSS, etc.