This proposal is part of an enduring collaboration between the INRIA team Mistis and the team of computational and mathematical biology (BCM) from the TIMC-IMAG laboratory. The collaboration is devoted to the development of statistical and machine-learning algorithms that are needed to analyze large-scale biological data. As part of this research effort, the objective of the project will be to develop statistical methods that can scale with the massive dimension of population genomic data. The INRIA Mistis team is developing statistical methods for dealing with complex stochastic systems and the BCM team at TIMC-IMAG develops algorithms and software for analyzing biological data. The candidate is expected to interact with both teams during the project. The supervisors will be Florence Forbes (Mistis) and Michael Blum (BCM).

Job offer description

In the context of evolutionary biology, genomic data can be used to detect the genes involved in Darwinian selection. An archetypal example of Darwinian selection in humans is the genetic adaptation of Tibetans to high altitude [1].

The detection of genes involved in adaptation makes use of genome scans, in which thousands or millions of genomic markers are scanned for statistical signatures of Darwinian selection [2]. However, current statistical approaches suffer from several drawbacks, including the a priori clustering of individuals into populations and the computational burden implied by the common use of MCMC algorithms.

To overcome these limitations, the objective will be to implement a purely statistical method that does not explicitly model the mechanistic and evolutionary processes acting on genetic variation. The method will be based on robust PCA [3], which decomposes a given data matrix into a low-rank component and a sparse component containing the outlier elements. Robust PCA will be used to detect atypical genomic markers, which are likely to have been involved in biological adaptation. Robust PCA has already been used successfully in video surveillance [4], face recognition, and collaborative filtering. The objective of the project will be to develop scalable Bayesian algorithms that can cope with the massive dimension of genomic data. Variational Bayesian methods will be investigated in particular, because they can provide solutions of comparable accuracy to MCMC algorithms with a substantially reduced computational burden [5]. The proposed algorithm is expected to be implemented as open-source software.
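
To make the low-rank plus sparse decomposition concrete, the following is a minimal sketch of robust PCA in its classical convex form (principal component pursuit, solved by alternating soft-thresholding steps). It is not the Bayesian variational algorithm that the project aims to develop, and it is only an illustration under stated assumptions: the function names, the default values of lam and mu, and the stopping rule are common choices picked for this example, and the input is assumed to be a numeric individuals-by-markers matrix.

```python
import numpy as np


def soft_threshold(X, tau):
    """Shrink every entry of X towards zero by tau (entrywise)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)


def svd_threshold(X, tau):
    """Shrink the singular values of X by tau, which lowers its rank."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt


def robust_pca(M, lam=None, mu=None, tol=1e-7, max_iter=1000):
    """Split M into a low-rank part L and a sparse part S with M ~ L + S.

    Illustrative sketch only: lam and mu defaults are conventional
    choices (assumptions), not values prescribed by the project.
    """
    M = np.asarray(M, dtype=float)
    n, p = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(n, p))       # weight of the sparse term
    if mu is None:
        mu = n * p / (4.0 * np.abs(M).sum())  # penalty parameter
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                      # Lagrange multipliers
    norm_M = np.linalg.norm(M)
    for _ in range(max_iter):
        # Update the low-rank part, then the sparse part, then the multipliers.
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)
        S = soft_threshold(M - L + Y / mu, lam / mu)
        R = M - L - S
        Y = Y + mu * R
        if np.linalg.norm(R) <= tol * norm_M:
            break
    return L, S
```

In a genome-scan setting, one might call robust_pca on a centred genotype matrix and rank markers by the column norms of S, so that markers with the largest sparse contribution are flagged as atypical. This ranking rule is a hypothetical choice made for illustration. Each iteration requires a full SVD, which is one reason why more scalable algorithms are needed for data sets with millions of markers.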

Skills and profile

We are looking for candidates who are strongly motivated by challenging research topics. The applicant should have a good background in applied statistics and computer science, including good programming skills. Knowledge of Bayesian statistics and machine learning is ideally expected. A background in biological data analysis will be appreciated but is not required. Programming skills in C or in other languages well suited to analyzing large-scale data will be appreciated. The successful candidate should have good oral and written communication skills in English.