R, Julia and genome wide selection

What was this about? He read my post on R, Julia and the shiny new thing, which prompted him to assume that I was the proverbial old dog unwilling (or was it unable?) to learn new tricks. (Incidentally, with friends like this who needs enemies? Hi, Gus.)

Having a look at different—statistical—horses (Photo: Luis, click to enlarge).

I decided to tackle a small—but hopefully useful—piece of code: fitting/training a Genome Wide Selection model, using the Bayes A approach put forward by Meuwissen, Hayes and Goddard in 2001. In that approach the breeding values of the individuals (the response) are expressed as a function of a very large number of random predictors (2000 molecular markers, in our case). The dataset (csv file) is a simulation of 2000 bi-allelic markers (aa = 0, Aa = 1, AA = 2) for 250 individuals, followed by the phenotypes (column 2001) and breeding values (column 2002). These models are frequently fitted using MCMC.
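For reference, the model behind Bayes A can be sketched as follows (my notation, following Meuwissen et al. 2001; the marker-variance prior matches the scaled inverse chi-squared draws that appear in the code below):

```latex
y_i = \mu + \sum_{j=1}^{p} x_{ij} b_j + e_i, \qquad
b_j \mid \sigma^2_j \sim N(0, \sigma^2_j), \qquad
\sigma^2_j \sim \nu S^2 / \chi^2_\nu \ (\nu = 4), \qquad
e_i \sim N(0, \sigma^2_e)
```

Each marker effect gets its own variance, which is what lets a few markers have large effects while most shrink towards zero.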

In 2010 I attended this course in Ames, Iowa, where Rohan Fernando passed us the following R code (pretty much a transliteration from C code; notice the trailing semicolons, for example). P.S. 2012-04-26: Please note that this is teaching code, not production code:


```r
nmarkers = 2000;     # number of markers
startMarker = 1981;  # set to 1 to use all
numiter = 2000;      # number of iterations
vara = 1.0/20.0;

# input data
data = matrix(scan("trainData.out0"), ncol = nmarkers + 2, byrow = TRUE);
nrecords = dim(data)[1];

beg = Sys.time()

# x has the mean followed by the markers
x = cbind(1, data[, startMarker:nmarkers]);
y = data[, nmarkers + 1];
a = data[, nmarkers + 2];

# initial values
nmarkers = nmarkers - startMarker + 1;
mean2pq = 0.5;                          # just an approximation
scalea = 0.5*vara/(nmarkers*mean2pq);   # 0.5 = (v-2)/v for v = 4
size = dim(x)[2];
b = array(0.0, size);
meanb = b;
b[1] = mean(y);
var = array(0.0, size);

# adjust y
ycorr = y - x %*% b;

# MCMC sampling
for (iter in 1:numiter) {
  # sample vare
  vare = (t(ycorr) %*% ycorr)/rchisq(1, nrecords + 3);

  # sample intercept
  ycorr = ycorr + x[, 1]*b[1];
  rhs = sum(ycorr)/vare;
  invLhs = 1.0/(nrecords/vare);
  mean = rhs*invLhs;
  b[1] = rnorm(1, mean, sqrt(invLhs));
  ycorr = ycorr - x[, 1]*b[1];
  meanb[1] = meanb[1] + b[1];

  # sample variance for each locus
  for (locus in 2:size) {
    var[locus] = (scalea*4 + b[locus]*b[locus])/rchisq(1, 4.0 + 1)
  }

  # sample effect for each locus
  for (locus in 2:size) {
    # unadjust y for this locus
    ycorr = ycorr + x[, locus]*b[locus];
    rhs = t(x[, locus]) %*% ycorr/vare;
    lhs = t(x[, locus]) %*% x[, locus]/vare + 1.0/var[locus];
    invLhs = 1.0/lhs;
    mean = invLhs*rhs;
    b[locus] = rnorm(1, mean, sqrt(invLhs));
    # adjust y for the new value of this locus
    ycorr = ycorr - x[, locus]*b[locus];
    meanb[locus] = meanb[locus] + b[locus];
  }
}
Sys.time() - beg

meanb = meanb/numiter;
aHat = x %*% meanb;
```

Thus, we just need to define a few variables, read the data (marker genotypes, breeding values and phenotypic data) into a matrix, create loops, do matrix and vector multiplication, and generate random numbers (from Gaussian and chi-squared distributions). Not much if you think about it, but I didn't have enough time to explore Julia's features to go for something more complex.
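The single-locus update inside the sampling loop is the whole algorithm in miniature. Here it is as a language-neutral sketch in plain Python (the function name and toy data are mine, purely for illustration; variable names follow the R code):

```python
import random

def sample_locus_effect(xj, ycorr, bj, vare, varj):
    """Gibbs update for one marker effect, mirroring the loop body above."""
    # unadjust y for this locus
    ycorr = [yc + xij * bj for yc, xij in zip(ycorr, xj)]
    # right- and left-hand sides of the mixed-model equation for this locus
    rhs = sum(xij * yc for xij, yc in zip(xj, ycorr)) / vare
    lhs = sum(xij * xij for xij in xj) / vare + 1.0 / varj
    inv_lhs = 1.0 / lhs
    # draw the new effect from its full conditional (a normal distribution)
    new_bj = random.gauss(inv_lhs * rhs, inv_lhs ** 0.5)
    # re-adjust y for the new value of this locus
    ycorr = [yc - xij * new_bj for yc, xij in zip(ycorr, xj)]
    return new_bj, ycorr

random.seed(2012)
new_b, yc = sample_locus_effect([1.0, 2.0, 0.0], [0.5, -0.2, 0.1], 0.0, 1.0, 0.1)
```

The other Gibbs steps (intercept, variances) have the same shape: un-adjust, compute rhs/lhs, draw, re-adjust.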

And my Julia version (shown with MATLAB highlighting in the original post, as the plugin doesn't support Julia):


```julia
nmarkers = 2000     # number of markers
startmarker = 1981  # set to 1 to use all
numiter = 2000      # number of iterations

data = dlmread("markers.csv", ',')
(nrecords, ncols) = size(data)

tic()

# this is the mean and markers matrix
X = hcat(ones(Float64, nrecords), data[:, startmarker:nmarkers])
y = data[:, nmarkers + 1]
a = data[:, nmarkers + 2]

nmarkers = nmarkers - startmarker + 1
vara = 1.0/nmarkers
mean2pq = 0.5
scalea = 0.5*vara/(nmarkers*mean2pq)   # 0.5 = (v-2)/v for v = 4

ndesign = size(X, 2)
b = zeros(Float64, ndesign)
meanb = zeros(Float64, ndesign)
b[1] = mean(y)
varian = zeros(Float64, ndesign)

# adjust y
ycorr = y - X*b

# MCMC sampling
for i = 1:numiter
  # sample vare
  vare = dot(ycorr, ycorr)/randchi2(nrecords + 3)

  # sample intercept
  ycorr = ycorr + X[:, 1]*b[1];
  rhs = sum(ycorr)/vare;
  invlhs = 1.0/(nrecords/vare);
  mn = rhs*invlhs;
  b[1] = randn()*sqrt(invlhs) + mn;
  ycorr = ycorr - X[:, 1]*b[1];
  meanb[1] = meanb[1] + b[1];

  # sample variance for each locus
  for locus = 2:ndesign
    varian[locus] = (scalea*4 + b[locus]*b[locus])/randchi2(4.0 + 1);
  end

  # sample effect for each locus
  for locus = 2:ndesign
    # unadjust y for this locus
    ycorr = ycorr + X[:, locus]*b[locus];
    rhs = dot(X[:, locus], ycorr)/vare;
    lhs = dot(X[:, locus], X[:, locus])/vare + 1.0/varian[locus];
    invlhs = 1.0/lhs;
    mn = invlhs*rhs;
    b[locus] = randn()*sqrt(invlhs) + mn;
    # adjust y for the new value of this locus
    ycorr = ycorr - X[:, locus]*b[locus];
    meanb[locus] = meanb[locus] + b[locus];
  end
end
toc()

meanb = meanb/numiter;
aHat = X*meanb;
```

The code looks remarkably similar and there are four main sources of differences:

The first trivial one is that the original code reads a binary dataset and I didn't know how to do that in Julia, so I read a csv file instead (this is also why I start timing only after reading the file).

The second trivial one is avoiding name conflicts between variables and functions; for example, in R the user is allowed to have a variable called var that will not interfere with the variance function. Julia is picky about that, so I needed to rename some variables.

Julia passes variables by reference, while R copies when assigning matrices, which tripped me up because in the original R code there was something like: b = array(0.0, size); meanb = b;. This works fine in R, but in Julia changes to the b vector also changed meanb.
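Python happens to behave the same way Julia does here, so the gotcha can be shown in a couple of lines (toy example of mine, not code from the post):

```python
# Assignment binds a second name to the SAME list -- no copy is made.
b = [0.0] * 5
meanb = b
b[0] = 1.0
print(meanb[0])   # 1.0 -- meanb "changed" too

# An explicit copy behaves like R's assignment.
meanb = list(b)
b[0] = 2.0
print(meanb[0])   # still 1.0
```

In Julia the fix is the same idea: use copy(b) instead of plain assignment.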

The distinction between scalar and array created some problems in Julia. For example, y' * y (t(y) %*% y in R) is numerically equivalent to dot(y, y); however, the first version returns an array, while the second one returns a scalar. I got an error message when trying to store the 'scalar like an array' into an array. I find that confusing.
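NumPy makes the same split, so the distinction can be illustrated there (toy data of mine, not from the post):

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0])
as_array = y[None, :] @ y[:, None]   # matrix product: a 1x1 array, like y' * y
as_scalar = np.dot(y, y)             # dot product: a plain scalar, like dot(y, y)
print(as_array.shape)    # (1, 1)
print(float(as_scalar))  # 14.0
```

Both hold the same number, but only the scalar can be stored into an array slot without an explicit extraction.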

One interesting point in this comparison is that I'm using rough code, not really optimized for speed; in fact, the only thing I can say about the Julia code is that 'it runs' and that it probably is not very idiomatic. Testing runs with different numbers of markers shows that R needs roughly 2.8x the time used by Julia. The Julia website claims better results in benchmarks, but in real life we work with, well, real problems.

In 1996-7 I switched from SAS to ASReml for genetic analyses because it was 1-2 orders of magnitude faster and opened a world of new models. Today a change from R to Julia would deliver (in this particular case) a much more modest speed-up (~3x), which is OK but not worth changing languages (yet). This, together with Julia's embryonic graphical capabilities and its still-to-develop ecosystem of packages, means that I'm still using R. Nevertheless, the Julia team has achieved very impressive performance in very little time, so it is worth keeping an eye on their progress.

P.S.1 Readers are welcome to suggest ways of improving the code.
P.S.2 WordPress does not let me upload the binary version of the simulated data.
P.S.3 Hey WordPress guys; it would be handy if the sourcecode plugin supported Julia!
P.S.4 2012-04-26 Following AL’s recommendation in the comments, one can replace in R:

```r
rhs = t(x[, locus]) %*% ycorr/vare;
lhs = t(x[, locus]) %*% x[, locus]/vare + 1.0/var[locus]
```

by

```r
rhs = crossprod(x[, locus], ycorr)/vare
lhs = crossprod(x[, locus], x[, locus])/vare + 1.0/var[locus]
```

reducing execution time by roughly 20%, making the difference between Julia and R even smaller.

Use crossprod and tcrossprod instead of %*%, which is slower. This is indeed one of the main problems of R (why do I have to remember which functions are efficient and which are not?).

This: t(x[,locus]) %*% x[,locus]/vare is extremely slow (and repeated many times per iteration). It is much better to compute the cross-products at the beginning and use xpx[i]. For instance, I use this in a Gauss-Seidel solver, but it can be easily modified to Bayes A.

You are right, there are several calculations in both the R and Julia codes that are repeated, so it makes sense to compute them once and reuse the results. Doing that will reduce the time for both languages; still, Julia is faster, but nowhere near the speed improvements claimed on its website.
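AL's precomputation idea can be sketched like this (illustrative Python with made-up toy data, mine rather than the post's):

```python
# Each column's cross-product with itself never changes across iterations,
# so compute it once before the MCMC loop instead of on every pass.
nloci = 3
x = [[1.0, 0.0, 2.0],
     [1.0, 1.0, 0.0],
     [1.0, 2.0, 1.0],
     [1.0, 0.0, 2.0]]
xpx = [sum(row[j] ** 2 for row in x) for j in range(nloci)]
print(xpx)  # [4.0, 5.0, 9.0]

# Inside the sampling loop the left-hand side then becomes a cheap lookup:
# lhs = xpx[locus] / vare + 1.0 / var[locus]
```

The same trick applies verbatim in R (a vector of crossprod values) and Julia (a vector of dot products).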

We were visiting a friend who had a trainer breaking in around 20 horses. The original picture was taken in RAW format (to keep more information from the sensor); I then applied a graduated neutral-density filter so I could increase exposure in the foreground without totally blowing out the sky, and finally added a B&W red filter. All processing done in Adobe Lightroom.

Incidentally, though I haven't attended his course, I am in the process of porting this code from Rohan Fernando to Python and the QTLMAS XVI dataset. I managed to get some speedup, mainly due to things like caching indexed/sliced vectors and values. Still, I get about 0.8 s per iteration, and yesterday I ran two chains in parallel for 12 hours to get 50,000 samples. I plotted the mean feature weights against each other, and while there is some correlation, and the two chains got the same "general ideas" QTL-wise, they don't really look converged. Back to the drawing board, I guess.

I am now exploring pyopencl in order to run like 12-48 chains in parallel. That doesn’t really provide a huge speedup iteration-wise (1000 individuals is too low for that), but I hope I can somehow make the chains communicate with each other. This might mean faster convergence, and maybe enough samples to do p-values.

Another idea would be to use cloud computing. This would still take days, but at least I could run all three traits with enough replications and iterations in a low multiple of 24 hours, whereas my lowly desktop computer gets cranky when I run three or four of the chains in parallel…

I didn't try the R or Julia implementation for this dataset yet, but I doubt it would be a pleasant experience with the 10,000 × 1,000 genotypes and 100,000 × 10,000 sample traces. And then they are providing the alleles in haplotyped fashion, so I guess I'd actually want to double or triple that number of features eventually.

This is teaching not production code by the way. As pointed out by AL above, there are several ways to improve speed. If you want to discuss more details for very large problems you would be wise to contact people like Rohan Fernando and Dorian Garrick, who have implemented heavy-duty versions of several genome wide selection algorithms in a system that can be accessed over the net.

Ah, thanks, good ideas. I am viewing this as an educational effort anyway. Parsing the literature on past QTLMAS workshop datasets, I get a feeling that a lot of people ran into the same issues with time and convergence.
Bayes B beats the competition almost always, but with the number of iterations that they cite I just don’t believe they were using chains that actually converge.
An obvious way to improve the speed would be to use C++, or at least to inline portions of C++ within Python, but even then I don't expect more than a two-fold improvement; that's why I think I will try to implement a population-based approach to make the chains converge quicker…
Also I hardly believe I would be allowed to upload marker data to a non-involved researcher’s server. I’m not much for mistrust, but that would be violating agreements I think.

I see the tic()/toc() in the Julia code outside a loop… In general the first run is slower, because everything has to be compiled. Try more than one run in the same session before deciding on the speed of the code 😉 Also, in Julia define your global constant variables as constants; this helps the compiler generate faster code:

```julia
const startmarker = 1981 # Set to 1 to use all
const numiter = 2000     # Number of iterations
```