Correlation is a statistical procedure used to test the relationship between two quantitative or categorical variables. In other words, it describes the degree of relation between two variables. It is one of the most commonly used statistical techniques. The present article is based on selected statistical textbooks, a review of the literature, and our own research experience.

The concept of correlation was first proposed by Sir Francis Galton in 1894 and was later described mathematically by Karl Pearson in 1896.[1] Correlation analysis is a method of statistically evaluating the strength of the relationship between two numerically measurable continuous variables.

In biostatistics, univariate statistical tests such as Chi-square test, Fisher's exact test, t-test, and analysis of variance do not allow taking into account the effect of other covariates/confounders during analyses.[2] However, a technique called partial correlation allows the researcher to control the effect of confounders/covariates in understanding the relation between the two selected variables.[3] Partial correlation looks at the relationship between two variables while removing the effects of other variables.

In statistical terms, correlation is a method of assessing a probable two-way linear association between two measurable continuous variables. The extent of “correlation” is measured by a statistic called the correlation coefficient, which represents the strength of the putative linear association between the two selected variables. In other words, it is a statistic representing how closely two variables co-vary; it is a dimensionless quantity whose value can vary from −1 (perfect negative correlation) through 0 (no correlation) to +1 (perfect positive correlation).[4] A positive coefficient of correlation indicates that the variables are directly related, i.e., as the value of one variable increases, the value of the other variable also tends to increase. On the contrary, a negative coefficient indicates that the selected variables are negatively related, i.e., as the value of one variable increases, the value of the other tends to decrease. In statistical terms, any form of relation between two continuous variables that is not linear is not considered correlation.[5]

In biological research, the relation between the independent (predictor) variables and the dependent (outcome) variable is explored. This explains how the risk factors or predictor variables account for the occurrence of a disease or the presence of a phenotype. The disease outcome or dependent variable is associated with biological factors (such as age and gender), lifestyle variables (such as physical activity, smoking, and alcohol consumption), physiological variables (such as blood pressure and pulse rate), and genetic factors (such as genetic mutations). To understand such “risk factors–disease” relationships, two tests may be used, i.e., correlation and regression (Gaddis and Gaddis, 1990). Correlation provides a quantitative measure of the degree or strength of the relation between the selected variables, whereas regression describes this relation mathematically by predicting the value of the outcome from the value of the independent predictor.[6]

Types of Correlation

Pearson's r correlation

Pearson's correlation “r” is used when the data are normally distributed, i.e., “parametric,” and the relation between the variables is linear. Pearson's r correlation is calculated using the following formula:

$$r = \frac{N\sum xy - \sum x \sum y}{\sqrt{\left[N\sum x^{2} - \left(\sum x\right)^{2}\right]\left[N\sum y^{2} - \left(\sum y\right)^{2}\right]}}$$

where r = Pearson's r correlation coefficient

N = number of observations

Σxy = sum of the products of paired scores

Σx = sum of x scores

Σy = sum of y scores

Σx² = sum of squared x scores

Σy² = sum of squared y scores.

For Pearson's r correlation, both variables should be normally distributed (bell-shaped curve) and linearly related. Linearity assumes a straight-line relationship between the two variables.
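The raw-score quantities defined above (N, Σxy, Σx, Σy, Σx², Σy²) can be turned into a short calculation. The following is a minimal sketch in Python, assuming the standard computational form of Pearson's r; the function name `pearson_r` and the sample values are hypothetical and are not data from this article.

```python
import math


def pearson_r(x, y):
    """Pearson's r from the raw-score sums N, Σxy, Σx, Σy, Σx², Σy²."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))   # sum of products of paired scores
    sum_x2 = sum(a * a for a in x)              # sum of squared x scores
    sum_y2 = sum(b * b for b in y)              # sum of squared y scores
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    if denominator == 0:
        raise ValueError("r is undefined when either variable is constant")
    return numerator / denominator


# Hypothetical illustrative values (not taken from the article's dataset)
wc = [78.0, 85.0, 92.0, 88.0, 101.0, 95.0]
sbp = [118.0, 121.0, 135.0, 124.0, 142.0, 130.0]
r = pearson_r(wc, sbp)
print(f"r = {r:.3f}")
```

The sketch computes only the coefficient; the normality and linearity assumptions described above still need to be checked separately before r is interpreted.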

Spearman's rank correlation

Spearman's rank correlation is a nonparametric test used to measure the degree of association between two variables. It may be used when the selected variables are not normally distributed, i.e., are “skewed.” This test does not carry any assumptions about the distribution of the data. It is best used when the variables are measured on a scale that is at least ordinal and the scores on one variable are monotonically related to the scores on the other.

Spearman's rank correlation is calculated using the following formula:

$$\rho = 1 - \frac{6\sum d_i^{2}}{n\left(n^{2} - 1\right)}$$

where ρ = Spearman's rank correlation

d_i = the difference between the ranks of corresponding variables

n = number of observations.
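The rank-difference calculation can be sketched in a few lines of Python. This is an illustrative sketch, assuming the standard rank-difference formula ρ = 1 − 6Σd_i² / [n(n² − 1)], which is exact only when there are no tied ranks; the helper names `average_ranks` and `spearman_rho` and the sample values are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho from rank differences d_i (exact only without ties)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    rank_x = average_ranks(x)
    rank_y = average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical illustrative values: a monotonic but non-linear relation
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(spearman_rho(x, y))
```

When many ties are present, as is common with ordinal categories, an alternative is to compute Pearson's r on the average ranks instead of using the rank-difference shortcut.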

Statistical Simulations to Understand the Relationship between Correlation Coefficient and Scatterplots

A scatterplot of the selected variables presents their relationship visually. The higher the correlation between the selected variables, the stronger the linear association between them, and hence the more obvious the trend seen in the scatterplot [Figure 1].

The scatterplot in [Figure 2] shows a linear association between the variables x and y, but the trend is not clear because the coefficient of correlation is low, i.e., 0.20. The trend is clearer in [Figure 3], where the coefficient of correlation is 0.50. [Figure 4] and [Figure 5] show that the higher the correlation in either direction, i.e., positive or negative, the more visible the linear association is in the scatterplot. The strength of the correlation between x and y in [Figure 4] and [Figure 5] is the same, but the direction is opposite. In [Figure 4], when x increases, y also increases, whereas in [Figure 5], when x increases, y decreases, and vice versa.
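A simulation of this kind can be sketched in Python using only the standard library. The sketch below assumes standard normal variables and builds y as a weighted sum of x and independent noise, which gives a population correlation equal to the chosen target. The targets 0.20 and 0.50 are the values quoted for [Figure 2] and [Figure 3]; the values ±0.80 are hypothetical, since the coefficients for [Figure 4] and [Figure 5] are not stated in the text. The function names are also hypothetical.

```python
import math
import random


def simulate_pair(target_r, n, seed=0):
    """Simulate n (x, y) pairs whose population correlation is target_r."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # y = r*x + sqrt(1 - r^2)*noise has unit variance and corr(x, y) = r
    scale = math.sqrt(1.0 - target_r ** 2)
    y = [target_r * xi + scale * ei for xi, ei in zip(x, noise)]
    return x, y


def sample_r(x, y):
    """Sample Pearson correlation from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


for target in (0.20, 0.50, 0.80, -0.80):
    x, y = simulate_pair(target, n=5000, seed=42)
    print(f"target r = {target:+.2f}, sample r = {sample_r(x, y):+.3f}")
```

Plotting each simulated (x, y) set as a scatterplot would be expected to reproduce the pattern described above: a diffuse cloud at r = 0.20 and an increasingly tight band as the absolute value of r grows.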

Interpretation of the size of correlation coefficient

The correlation coefficient value may be interpreted from negligible to high positive/negative as shown in [Table 1] (Hinkle et al., 2003).

The coefficient of correlation (r) is the degree of relationship between two variables, i.e., x and y, whereas the coefficient of determination (R²) shows the percentage of variation in y that is explained by all the x variables together. The value of “r” may vary from −1 to +1, whereas the value of “r²” lies between 0 and +1.
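The interpretation of r and its relation to the coefficient of determination can be illustrated with a short Python sketch. The cut-offs used here (0.30, 0.50, 0.70, 0.90) are an assumption based on the commonly cited rule of thumb attributed to Hinkle et al.; they should be checked against [Table 1], which is not reproduced in this text. The function name `interpret_r` is hypothetical, and r = 0.395 is the value reported for Example 1.

```python
def interpret_r(r):
    """Label the size of r using assumed Hinkle-style cut-offs."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)
    if size < 0.30:
        return "negligible"
    direction = "positive" if r > 0 else "negative"
    if size < 0.50:
        label = "low"
    elif size < 0.70:
        label = "moderate"
    elif size < 0.90:
        label = "high"
    else:
        label = "very high"
    return f"{label} {direction}"


r = 0.395                 # Pearson's r reported for WC and SBP in Example 1
r_squared = r ** 2        # coefficient of determination for a single predictor
print(interpret_r(r))     # size label under the assumed cut-offs
print(f"r squared = {r_squared:.3f} ({r_squared:.1%} of variation explained)")
```

Squaring removes the sign, which is why r² lies between 0 and +1 even though r itself ranges from −1 to +1.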

Use of Correlation Analysis in Biological Data

In biological research, correlation analysis is used to understand the relation between the independent variables (or risk factors) and the dependent variable (or disease outcome). The selected variables may be continuous or ordinal. For example, to know the relation between systolic blood pressure (SBP) (continuous dependent) and risk factors/independent variables such as age (continuous) and weight (continuous), Pearson's correlation analysis would be used. On the contrary, to understand the relation between maternal age (continuous) and parity (ordinal) or between number of hospitalizations (ordinal) and history of stroke (ordinal), Spearman's correlation analysis would be used.

How to Perform Correlation in SPSS?

Correlation can be tested using the SPSS statistical software (IBM SPSS Statistics for Windows, IBM Corp., Released 2011, Version 20.0, Armonk, NY, USA) in five steps. The procedure is illustrated in the examples below [Table 1], [Table 2], [Table 3], [Table 4].

Example 1: Data (n = 967) on waist circumference (WC) and SBP were collected, and bivariate correlation was tested to understand the relation between the two.

Since both the selected variables are continuous, bivariate correlation analysis is performed using Pearson's correlation coefficient after checking the normality assumptions for both variables. Pearson's correlation coefficient, i.e., r = 0.395, P < 0.001 [Table 2], implies that a low positive, yet statistically significant, linear relation is present between WC and SBP. The coefficient of determination, i.e., R², is 0.156 (0.395²), which implies that WC accounts for only 15.6% of the variation in SBP.

Example 2: Data (n = 936) on WC and body mass index (BMI) status were collected. BMI status was categorized into underweight, normal, overweight, and obese. Bivariate correlation was tested to understand the relation between the two.

Since one of the selected variables is continuous (WC) while the other is ordinal (BMI status), bivariate correlation analysis is performed using Spearman's correlation coefficient, which does not require the variables to be normally distributed. Spearman's correlation coefficient, i.e., ρ = 0.398, P < 0.001 [Table 3], implies that a low positive, yet statistically significant, monotonic relation is present between WC and BMI status. The coefficient of determination, i.e., R², is 0.158 (0.398²), which implies that BMI status explains 15.8% of the variation in WC.

Correlation analysis can also be used for calculating independent correlation between variables adjusting for the effect of other variables. Such analysis can be done using partial correlation analysis in SPSS. The following command is given:

Example 3: Data (n = 940) on WC and SBP were collected, and partial correlation was tested to understand the relation between the two while controlling for confounding factors such as smoking status and education.

Since both the selected variables are continuous, partial correlation analysis is performed using Pearson's correlation coefficient after checking the normality assumptions for both variables. The partial correlation coefficient, i.e., r = 0.381, P < 0.001 [Table 4], implies that a low positive, yet statistically significant, linear relation is present between WC and SBP after controlling for the effect of the confounders, i.e., smoking and education.
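The idea behind partial correlation can be sketched outside SPSS as well. The Python sketch below assumes the standard recursive formula, r_xy.z = (r_xy − r_xz r_yz) / √[(1 − r_xz²)(1 − r_yz²)], applied once per control variable. It is not the SPSS procedure used for [Table 4]; the function names, the simulated data, and the single continuous confounder z are all hypothetical.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_r(x, y, controls):
    """Partial correlation of x and y given a list of control variables."""
    if not controls:
        return pearson(x, y)
    *rest, z = controls
    # Each coefficient below is itself adjusted for the remaining controls
    r_xy = partial_r(x, y, rest)
    r_xz = partial_r(x, z, rest)
    r_yz = partial_r(y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical simulated data: x and y are related only through confounder z
rng = random.Random(7)
z = [rng.gauss(0.0, 1.0) for _ in range(2000)]
x = [zi + rng.gauss(0.0, 1.0) for zi in z]
y = [zi + rng.gauss(0.0, 1.0) for zi in z]

print(f"unadjusted r      = {pearson(x, y):.3f}")
print(f"partial r given z = {partial_r(x, y, [z]):.3f}")
```

In this simulated case the unadjusted correlation is driven entirely by the shared confounder, so the partial coefficient falls toward zero. In Example 3, by contrast, the coefficient changes little after adjustment (0.381), which suggests that smoking and education explain only a small part of the WC and SBP relation.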

Correlation is the technique for testing the strength of the linear relationship between two variables. It can be used for continuous or ordinal variables and can also assess the independent relation between variables while controlling for the effect of confounders or other variables.

Pearson K. Mathematical contributions to the theory of evolution. III. Regression, heredity, and panmixia. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 1896;187:253-318.