SFBI prize for the two best posters at Jobim 2017

Since the release of the Oxford Nanopore Technologies (ONT) MinION sequencer in 2014, the number of reads produced with this new sequencing technology has kept increasing. All the protocols remain in very active development (ONT provides updates of its chemistry and bioinformatics tools every 2-3 months). Hence, bioinformatics tools must be kept up to date as the MinION technology evolves. The IBENS (Institut de Biologie de l'École normale supérieure) genomic facility is currently developing a new data analysis workflow for RNA-Seq experiments using ONT sequencing output: Toullig. This pipeline is based on Eoulsan and its bundled RNA-Seq pipeline for Illumina reads.

The final goals of Toullig are to perform differential expression analysis from ONT long reads and to produce a reference transcriptome by combining data from both Illumina and ONT technologies. In this poster, we present an RNA-Seq analysis of long reads through Toullig steps such as long-read mapping and quality control of the mapping. The new Eoulsan modules for Toullig and the toolbox for manipulating ONT data are available on GitHub.

The aim of biological data ranking is to help users who face huge amounts of data choose between alternative pieces of information. This is particularly important when querying biological data integration systems, where even very simple queries can return thousands of answers. For instance, searching for the set of human genes involved in breast cancer returns thousands of answers in the reference database EntrezGene without any ranking in terms of importance. Ranking solutions able to order answers are crucial for helping scientists organize their time and prioritize the new experiments that might be conducted. However, ranking biological data is a difficult task for various reasons: biological data are usually annotation files which reflect expertise, and may thus be associated with various degrees of confidence; the needs expressed by scientists must also be taken into consideration, such as whether the most well-known data should be ranked first, or the most recent, etc. As a consequence, although several ranking methods have been proposed in recent years within the bioinformatics community, none of them has been deployed on systems currently in use.

The approach we propose is to rank biological data in two steps. First, several ranking methods are applied to biological data (results are ordered using alternative ranking criteria). Second, we use consensus ranking methods that reflect the input rankings’ common points while not putting too much importance on elements classified as “good” by only one or a few rankings. The problem, known as the median problem for a set of rankings, is NP-hard. Since providing a consensus ranking is a crucial need for big biological data sets, scalable algorithms are required, and designing them is highly challenging. Besides, the problem has mainly been studied in the case of permutations, where elements are strictly ordered, while in real applications some elements may be placed at the same position (considered as equally important). The challenge is then to design an algorithm computing one consensus ranking from a set of rankings with ties.
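To make the median problem concrete, here is a minimal Python sketch of consensus ranking with ties by exhaustive search. It is an illustration under stated assumptions, not the algorithm presented in this poster: the abstract does not name the distance used, so the sketch assumes a generalized Kendall-tau distance in which a pair costs 1 when the two rankings order it in opposite ways or when it is tied in exactly one of them. All function names are hypothetical. The exhaustive search is only feasible for a handful of elements, which is why scalable algorithms are needed.

```python
from itertools import combinations


def positions(ranking):
    """Map each element to the index of its bucket (tied elements share an index)."""
    return {elem: i for i, bucket in enumerate(ranking) for elem in bucket}


def generalized_kendall_tau(r1, r2):
    """Count pairs on which two rankings with ties disagree.

    Assumption: an inversion costs 1, and a pair tied in exactly one
    of the two rankings also costs 1.
    """
    p1, p2 = positions(r1), positions(r2)
    dist = 0
    for a, b in combinations(sorted(p1), 2):
        d1 = p1[a] - p1[b]
        d2 = p2[a] - p2[b]
        if d1 * d2 < 0:                 # strictly opposite order
            dist += 1
        elif (d1 == 0) != (d2 == 0):    # tied in exactly one ranking
            dist += 1
    return dist


def kemeny_score(candidate, rankings):
    """Sum of the distances from a candidate consensus to every input ranking."""
    return sum(generalized_kendall_tau(candidate, r) for r in rankings)


def all_rankings_with_ties(elements):
    """Yield every ordered partition of the elements into buckets."""
    elements = list(elements)
    if not elements:
        yield []
        return
    first, rest = elements[0], elements[1:]
    for sub in all_rankings_with_ties(rest):
        # Tie `first` with an existing bucket...
        for i in range(len(sub)):
            yield sub[:i] + [sub[i] | {first}] + sub[i + 1:]
        # ...or place it alone in a new bucket at any position.
        for i in range(len(sub) + 1):
            yield sub[:i] + [{first}] + sub[i:]


def brute_force_consensus(rankings):
    """Return a ranking with ties that minimizes the Kemeny score (tiny inputs only)."""
    elements = sorted(set().union(*rankings[0]))
    return min(all_rankings_with_ties(elements),
               key=lambda cand: kemeny_score(cand, rankings))


# Three toy input rankings over the same elements; each set is a bucket of ties.
toy_rankings = [
    [{"A"}, {"B", "C"}],
    [{"A"}, {"B"}, {"C"}],
    [{"B"}, {"A"}, {"C"}],
]
toy_consensus = brute_force_consensus(toy_rankings)
```

On this toy input the search returns `[{'A'}, {'B'}, {'C'}]` with a score of 2. Three elements already give 13 candidate rankings with ties, and the count grows faster than the factorial of the number of elements, so this approach cannot scale to real biological data sets.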

We introduce a new algorithm computing a consensus ranking from a set of rankings with ties. The originality of our approach lies in providing an efficient solution (i) based on a graph decomposition of the data sets to partition them efficiently and (ii) having several interesting and fundamental properties, which make it possible to evaluate the relevance of a given solution and to provide the exact consensus in many cases. A set of experiments has been conducted on several hundred biological and synthetic data sets. First results appear very promising: our algorithm competes with the best currently available algorithms while being efficient enough to be used in real settings, in particular as the algorithm used on http://conqur-bio.lri.fr/.