In this notebook, I show how to determine a correlation coefficient within the Bayesian framework both in a simply and a robust way. The correlation can be seen as a direct alternative to the traditional Pearson correlation coefficient.

The model and code shown here is motivated by the following blog posts and the book "Bayesian Cognitive Modeling":

First, let us start with defining our model both for the classic way of using a multivariate normal distribution as well as for the robust way that utilizes a multivariate student t-distribution.

In [2]:

importpymcaspymcfrompymcimportNormal,Uniform,MvNormal,Exponentialfromnumpy.linalgimportinv,detfromnumpyimportlog,pi,dotimportnumpyasnpfromscipy.specialimportgammalndef_model(data,robust=False):# priors might be adapted here to be less flatmu=Normal('mu',0,0.000001,size=2)sigma=Uniform('sigma',0,1000,size=2)rho=Uniform('r',-1,1)# we have a further parameter (prior) for the robust caseifrobust==True:nu=Exponential('nu',1/29.,1)# we model nu as an Exponential plus one@pymc.deterministicdefnuplus(nu=nu):returnnu+1@pymc.deterministicdefprecision(sigma=sigma,rho=rho):ss1=float(sigma[0]*sigma[0])ss2=float(sigma[1]*sigma[1])rss=float(rho*sigma[0]*sigma[1])returninv(np.mat([[ss1,rss],[rss,ss2]]))ifrobust==True:# log-likelihood of multivariate t-distribution@pymc.stochastic(observed=True)defmult_t(value=data.T,mu=mu,tau=precision,nu=nuplus):k=float(tau.shape[0])res=0forrinvalue:delta=r-muenum1=gammaln((nu+k)/2.)+0.5*log(det(tau))denom=(k/2.)*log(nu*pi)+gammaln(nu/2.)enum2=(-(nu+k)/2.)*log(1+(1/nu)*delta.dot(tau).dot(delta.T))result=enum1+enum2-denomres+=result[0]returnres[0,0]else:mult_n=MvNormal('mult_n',mu=mu,tau=precision,value=data.T,observed=True)returnlocals()defanalyze(data,robust=False,plot=True):model=pymc.MCMC(_model(data,robust))model.sample(50000,25000)printifplot:pymc.Matplot.plot(model)returnmodel

So the mean correlation (rho) is around 0.13. However, when we take a look at the histogram of the marginal posterior for rho, we can see that the frequency of distinct values for rho are pretty wide. We can characterize this with the 95% HDP (highest probability density) interval---also called credible interval---which is [-0.42 0.64]. Thus, with this HDP, we can get a very thorough view regarding the distribution of the parameters for rho. While the mean correlation is slightly positive, we cannot rule out a negative correlation or a non-existing correlation (rho=0). Actually, we can just count the number of times rho is positive by looking at the trace of rho. This can be seen as a Bayesian way of accepting or rejecting hypotheses (so-called ROPEs). A discussion about this can be found in one of John Kruschke's blog posts.

In [6]:

(model.trace('r')[:]>0.05).mean()

Out[6]:

0.61136000000000001

As a result, we see that only around 61% of all values of the posterior are above 0.05. Thus, we only see mediocre evidence for a positive correlation in the data.

Let us now relax this assumption and just determine what the probability is that there is no correlation at all (regardless negative or positive) with a ROPE of [0.05,0.05].

In [7]:

((model.trace('r')[:]<0.05)&(model.trace('r')[:]>-0.05)).mean()

Out[7]:

0.12236

We can only see week evidence that there is no correlation at all. Thus, the probability that there is a negative correlation is around 27%. Overall, the inference and our inspection of the marginal posterior reveals that there might be a slight tendency towards a positive correlation, but we are pretty unsure about it. We might want to gather further data to supplement our inference.

Let us now compare these results to a classic Pearson correlation coefficient:

In [8]:

fromscipy.stats.statsimportpearsonrpearsonr(x,y)

Out[8]:

(0.1746680664120466, 0.58716519218300223)

Not surprisingly, our Bayesian rho values is very similar to the one determined by the Pearson correlation coefficient. The p-value states that we cannot reject the null hypothesis that there is no correlation (two-sided).

We can see a slight negative correlation, however the 95% HDP interval again is very wide and also includes a correlation zero and positive correlation. The Pearson correlation shows similar results, but claims no statistical significance.

In [12]:

fromscipy.stats.statsimportpearsonrpearsonr(x,y)

Out[12]:

(-0.10965237912730059, 0.64537479933386654)

What happens, if we add an outlier to the data as elaborated in the follow-up blog post by Rasmus Bååth.

Suddenly, only this one data point, changed the results drastically. The inference now indicates a strong negative correlation and the 95% HDP interval contains only negative correlations. Here, we would for sure claim a strong belief in negative correlation according to our analysis.

In [16]:

fromscipy.stats.statsimportpearsonrpearsonr(x,y)

Out[16]:

(-0.56990140833523373, 0.0069939604634016196)

Not surprisingly, the Pearson correlation again agrees with our Bayesian inference and states the correlation as significant. This is not surprising, as the Pearson correlation is very non-robust to non-normality of the data as it is a measure of linear dependence. The same happends of course with our Bayesian model that models the data with a normal distribution.

Thus, we make our model more robust as elaborated in cited blog post. What we actually want, is a model that assumes that the majority of the data is normally distributed, but still allows outliers to exist. Within our Bayesian framework, we can choose any model for the likelihood. In that case, a robust model is the multivariate student t-distribution. For a detailed discussion please refer to corresponding blog post.