One of the most common and basic techniques for analyzing the relationships between variables is zero-order correlation. This tutorial shows how to calculate zero-order correlations in R.

Tutorial Files

Before we start, you may want to download the sample data (.csv) used in this tutorial. Be sure to right-click and save the file to your R working directory. This dataset contains pretest and posttest scores for 66 subjects on a series of reading comprehension tests (Moore & McCabe, 1989). Note that all code samples in this tutorial assume that the data have already been read into an R variable and that the variable has been attached.

Correlation Between Two Variables

The most fundamental way to calculate correlations is to directly operate on two variables. In R, this can be done using the cor() function. The cor() function accepts the following arguments ("Correlation, Variance...", n.d.).

x: the first variable to correlate

y: the second variable to correlate

use (optional): determines how missing values are handled; accepts "all.obs", "complete.obs", or "pairwise.complete.obs" (current versions of R also accept "everything", which is the default, and "na.or.complete")

In most cases, x and y are the only arguments that you will use when running the cor() function. The basic format for calculating a correlation is cor(VAR1, VAR2), where VAR1 and VAR2 are the variables that you would like to correlate.
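The effect of the use argument is easiest to see on data with a missing value. The following is a minimal sketch on made-up vectors (x and y here are illustrative, not columns from the tutorial dataset), assuming a current version of R in which the default is use = "everything".

```r
# Hypothetical toy vectors (not the tutorial dataset); y has one missing value
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, NA, 8, 10)

# With the default use = "everything", a missing value makes the result NA
r_default <- cor(x, y)

# "complete.obs" drops the pairs that contain a missing value before correlating
r_complete <- cor(x, y, use = "complete.obs")

print(r_default)
print(r_complete)
```

With complete data, as in the tutorial dataset, the use argument makes no difference and can be left out.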

cor(VAR1, VAR2) Example

Suppose that our research question is: "How does a subject's pretest 1 score relate to his or her posttest 1 score?" The following example demonstrates how to use the cor() function to calculate the correlation between pretest 1 (PRE1) and posttest 1 (POST1).

> #use cor(VAR1, VAR2) to calculate the correlation between variable 1 and variable 2

> cor(PRE1, POST1)

[1] 0.5659026

Correlations Between Multiple Variables

When beginning to analyze a dataset, researchers often want to get a complete picture of all correlations, rather than just a single one. Conveniently, the cor() function can also be run on an entire set of data. The format for this operation is cor(DATAVAR), where DATAVAR is the name of the R variable containing the data.

cor(DATAVAR) Example

Note that the underlying code for cor(DATAVAR) has changed in recent versions of R. The function no longer accepts datasets that contain non-numeric values. If your dataset contains such values, you will receive an error to the effect of "x must be numeric," and you should ensure that all of your data are in numeric form before using the function.

Suppose now that our research question is: "How do all of the test scores in the dataset relate to each other?" The following example demonstrates how to use the cor() function to calculate all of the correlations in a dataset.
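A minimal sketch of this operation follows. It uses a small made-up data frame rather than the tutorial dataset, and the name numericCols is an illustrative assumption. Selecting the numeric columns first is one way to avoid the "x must be numeric" error described above.

```r
# Hypothetical stand-in for the tutorial data: one text column, three score columns
datavar <- data.frame(
  Group = c("Basal", "Basal", "DRTA", "DRTA", "Strat", "Strat"),
  PRE1  = c(4, 6, 7, 12, 11, 7),
  PRE2  = c(3, 5, 2, 4, 7, 6),
  POST1 = c(5, 9, 7, 13, 11, 4)
)

# Keep only the numeric columns so that cor() does not reject the text column
numericCols <- datavar[sapply(datavar, is.numeric)]

# Correlation matrix of every numeric column with every other numeric column
corMatrix <- cor(numericCols)
print(round(corMatrix, 3))
```

The result is a symmetric matrix with ones on the diagonal, since each variable correlates perfectly with itself.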

It sounds like your 'x' variable is not numeric and therefore R is unable to correlate it. Try making your 'x' data numeric following the dummy coding technique demonstrated here: http://rtutorialseries.blogspot.com/2010/02/r-tutorial-series-regression-with.html

The funny thing is that when I checked all the fields with the is.numeric() function, the answers were all TRUE, and I could actually calculate the correlations pairwise. Any other suggestions would be greatly appreciated.

Regards,
Ruben

I see what is happening now. You have created a variable named "Group" that contains the numeric version of the Group column from the dataset. However, this does not modify the original Group column in the dataset. So, when you try to run cor() on the datavar, it still sees the original text values for the Group column and cannot form a correlation. Instead, use your new Group variable and the original dataset inside the cor() function. Here is an example:

This will get you a correlation between Group (numeric) and all of the other columns in the dataset. Of course, you will still get an NA for the Group-Group correlation, since the original dataset still contains text values.

Hi John,

I executed the same commands using R 2.10.1 for Windows in my virtual machine and everything worked as expected. Now that I know it's related to OS X, do you have any ideas how to solve it?

Many thanks in advance,
Ruben

Another reader commented that attach() can cause unexpected console errors, although I have never experienced problems with it up to this point. So, one other thing to try might be typing out the entire column name without using attach(). You could do this:

> #read in the data
> datavar <- read.csv("dataset_readingTests.csv")
> #create a variable containing the numeric version of the Group column
> numericGroup <- as.numeric(datavar$Group)
> #correlate the numeric Group variable with the original dataset
> cor(datavar, numericGroup)

Otherwise, I'm not sure what to do at this point. I have never encountered the error message that you have posted. For the record, I am using Mac OS X 10.6.3 and R 2.10.0 GUI 1.30 Leopard build 64-bit (5511).

Hi John,

I tried your suggestion, but unfortunately I got the same error message. I'm really at a loss as to what is causing the problem, so I will try to ask the R community. Anyway, thanks a lot for your help and for creating such great tutorials.

Regards,
Ruben

Hi John,

I think I found the problem. The error message only appears with R version 2.11.0 for OS X. I tried to execute the code with R 2.10.1 for OS X and it worked perfectly. I'm going to report the issue in the R user groups.

Regards,
Ruben

Hi, I have a question about correlation in R. I am trying to compare time-varying correlations between an asset and the S&P 500. I want to find the correlation for each date that I have data for. For example, say X = 1, 4, 6, 7, 8, 3, 2, 9, 1, 2, 3, 3 and Y = 5, 2, 3, 4, 4, 8, 3, 5, 9, 10, 3, 4. How can I find the correlation between X and Y at every point, starting from the point where X = 4 and Y = 2?
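One possible reading of this question is an expanding-window correlation: at each point, correlate all observations from the first one up to that point. The sketch below assumes that reading and also shows a fixed-width rolling window as an alternative. The names expandingCor, rollingCor, and width are illustrative, and the window width of 5 is an arbitrary choice.

```r
# The X and Y values given in the question
X <- c(1, 4, 6, 7, 8, 3, 2, 9, 1, 2, 3, 3)
Y <- c(5, 2, 3, 4, 4, 8, 3, 5, 9, 10, 3, 4)

# Expanding window: correlation of all observations from the first up to point i.
# At least two points are needed, so the series starts at i = 2 (X = 4, Y = 2).
expandingCor <- sapply(2:length(X), function(i) cor(X[1:i], Y[1:i]))

# Rolling window of fixed width: correlation of the most recent `width` points
width <- 5
rollingCor <- sapply(width:length(X), function(i) {
  idx <- (i - width + 1):i
  cor(X[idx], Y[idx])
})

print(round(expandingCor, 3))
print(round(rollingCor, 3))
```

Note that the first expanding value is based on only two points and is therefore always exactly 1 or -1, so the early values of such a series carry little information.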

Hello,

I had the same problem as Ruben above, but I solved it by uploading the file below. As you can see, I replaced Basal with 0, DRTA with 1, and Strat with 2. It did not like having text there. If you upload the CSV below, it will work.

-----
Subject,Group,PRE1,PRE2,POST1,POST2,POST3
1,0,4,3,5,4,41
2,0,6,5,9,5,41
3,0,9,4,5,3,43
4,0,12,6,8,5,46
5,0,16,5,10,9,46
6,0,15,13,9,8,45
7,0,14,8,12,5,45
8,0,12,7,5,5,32
9,0,12,3,8,7,33
10,0,8,8,7,7,39
11,0,13,7,12,4,42
12,0,9,2,4,4,45
13,0,12,5,4,6,39
14,0,12,2,8,8,44
15,0,12,2,6,4,36
16,0,10,10,9,10,49
17,0,8,5,3,3,40
18,0,12,5,5,5,35
19,0,11,3,4,5,36
20,0,8,4,2,3,40
21,0,7,3,5,4,54
22,0,9,6,7,8,32
23,1,7,2,7,6,31
24,1,7,6,5,6,40
25,1,12,4,13,3,48
26,1,10,1,5,7,30
27,1,16,8,14,7,42
28,1,15,7,14,6,48
29,1,9,6,10,9,49
30,1,8,7,13,5,53
31,1,13,7,12,7,48
32,1,12,8,11,6,43
33,1,7,6,8,5,55
34,1,6,2,7,0,55
35,1,8,4,10,6,57
36,1,9,6,8,6,53
37,1,9,4,8,7,37
38,1,8,4,10,11,50
39,1,9,5,12,6,54
40,1,13,6,10,6,41
41,1,10,2,11,6,49
42,1,8,6,7,8,47
43,1,8,5,8,8,49
44,1,10,6,12,6,49
45,2,11,7,11,12,53
46,2,7,6,4,8,47
47,2,4,6,4,10,41
48,2,7,2,4,4,49
49,2,7,6,3,9,43
50,2,6,5,8,5,45
51,2,11,5,12,8,50
52,2,14,6,14,12,48
53,2,13,6,12,11,49
54,2,9,5,7,11,42
55,2,12,3,5,10,38
56,2,13,9,9,9,42
57,2,4,6,1,10,34
58,2,13,8,13,1,48
59,2,6,4,7,9,51
60,2,12,3,5,13,33
61,2,6,6,7,9,44
62,2,11,4,11,7,48
63,2,14,4,15,7,49
64,2,8,2,9,5,33
65,2,5,3,6,8,45
66,2,8,3,4,6,42
------

I suppose it would work if I substituted a number for the Group text, but I see the advantage in getting the results you initially achieved. Have there been any clues as to why this is the case, or is there some way to force Group to be numeric?

I recommend reading the above posts, as others have also experienced this problem. Ruben reported it as happening on a particular version of R for OSX, while Luca offered a modified CSV file which replaced the text terms with numeric ones.
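As an alternative to editing the CSV file by hand, the same replacement can be made inside R. The sketch below is one way to do it, under the assumption that the Group column holds the text labels Basal, DRTA, and Strat. The small data frame and the name groupCodes are illustrative, not part of the tutorial files.

```r
# Hypothetical stand-in for the tutorial data with a text Group column
datavar <- data.frame(
  Group = c("Basal", "DRTA", "Strat", "Basal", "DRTA", "Strat"),
  PRE1  = c(4, 7, 11, 6, 12, 7),
  POST1 = c(5, 7, 11, 9, 13, 4),
  stringsAsFactors = FALSE
)

# Map each label to the code used in the modified CSV: Basal = 0, DRTA = 1, Strat = 2
groupCodes <- c(Basal = 0, DRTA = 1, Strat = 2)
datavar$Group <- unname(groupCodes[as.character(datavar$Group)])

# Every column is now numeric, so cor() accepts the whole data frame
print(round(cor(datavar), 3))
```

Keep in mind that these codes are only labels for the three groups, so correlations involving Group should be interpreted with caution.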

Hi John,

I have a lot of correlations to run, but I need significance values. How do I get a table with the r-values for the correlations between my variables but also p-values? (The cor.test() function doesn't work for multiple variables like cor() does.)

P.S. I was also notified that the rcorr function in the Hmisc package may be useful for what you are doing, since it accepts a matrix X and returns a correlation matrix with p-values. I haven't used the function myself.
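If you would rather stay with base R, one option is to call cor.test() on each pair of columns and collect the p-values in a matrix. The following is a minimal sketch of that idea, not a tested replacement for rcorr. The data are the PRE1, PRE2, and POST1 values for the first eight subjects in the CSV posted above, and the names scores, rMatrix, and pMatrix are illustrative.

```r
# PRE1, PRE2, and POST1 for the first eight subjects of the reader-posted CSV
scores <- data.frame(
  PRE1  = c(4, 6, 9, 12, 16, 15, 14, 12),
  PRE2  = c(3, 5, 4, 6, 5, 13, 8, 7),
  POST1 = c(5, 9, 5, 8, 10, 9, 12, 5)
)

vars <- names(scores)

# Matrix of r-values, as returned by cor() on the whole data frame
rMatrix <- cor(scores)

# Matrix of p-values, filled in pair by pair with cor.test();
# the diagonal is left as NA because a variable is not tested against itself
pMatrix <- matrix(NA_real_, nrow = length(vars), ncol = length(vars),
                  dimnames = list(vars, vars))
for (i in seq_along(vars)) {
  for (j in seq_along(vars)) {
    if (i != j) {
      pMatrix[i, j] <- cor.test(scores[[i]], scores[[j]])$p.value
    }
  }
}

print(round(rMatrix, 3))
print(round(pMatrix, 3))
```

These p-values are not adjusted for multiple comparisons, which may matter when many correlations are tested at once.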

I want to assign different shapes to 4 groups of data that I'm plotting together. I know how to put in the shapes but not how to assign them to a specific group (e.g., C2 = square, C3 = circle, etc.). Thanks so much for your help.
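One common approach in base R graphics is to build a vector of plotting symbols with one entry per observation and pass it to the pch argument of plot(). The sketch below assumes four groups labelled C1 to C4 (only C2 and C3 are named in the question) and uses randomly generated data. The names plotData, shapes, and pointShapes are illustrative.

```r
# Hypothetical data: 20 random points split evenly across four groups
set.seed(1)
plotData <- data.frame(
  x = rnorm(20),
  y = rnorm(20),
  group = factor(rep(c("C1", "C2", "C3", "C4"), each = 5))
)

# One plotting symbol per group:
# 2 = triangle, 0 = square, 1 = circle, 5 = diamond
shapes <- c(C1 = 2, C2 = 0, C3 = 1, C4 = 5)

# Look up the symbol for each observation by its group label
pointShapes <- shapes[as.character(plotData$group)]

plot(plotData$x, plotData$y, pch = pointShapes, xlab = "x", ylab = "y")
legend("topright", legend = names(shapes), pch = shapes)
```

Because the lookup is by group label, the assignment of shapes does not depend on the order of the rows in the data.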