Results of the 2015 call for projects

The analysis of deep sequencing data is both a bottleneck and a major issue for the life sciences. Deep sequencing is commonly used to study genomic mechanisms as different as structural variation or the fitness contribution of genes, among many others. Sequencing technologies are evolving fast, and the third generation is now available on the market. Third-generation sequencing provides longer reads at the expense of higher error rates, which makes them difficult to incorporate into a classical analysis pipeline. Their length carries the promise of overcoming two major problems in genome and transcriptome assembly: the presence of (i) long genomic repeats and (ii) distinct alleles in diploid or polyploid genomes. We propose to set up a service performing de novo genome and transcriptome assembly of third-generation reads, possibly combined with accurate second-generation reads. This service will be developed, evaluated and evolved as both the technology and the bioinformatics solutions improve. Long-read sequencing will be increasingly used in assembly projects, but to our knowledge no such service is available in an academic environment. Our group has acquired expertise in error correction of third-generation reads (see LoRDEC) and aims at exploiting high-performance computing devices for this assembly service.

BioMAJ is an open-source tool for the management of biological databanks on core bioinformatics facilities. It is commonly used by many bioinformatics facilities, including for example the IFB-core cloud infrastructure, to provide all users with the main public biological data collections directly available in their virtual machines. Since the original publication of the tool in 2008, the GenOuest bioinformatics core facility has supported its development through various improvements, notably of the interface, leading to a new version released in 2011 and presently in use.

It is necessary to start a new phase of development to 1) improve the code and make it more maintainable, and 2) adapt the tool to the new needs of biology. For example, with the arrival of new sequencing technologies, data production has become massive, and the analysis phases, including metagenomics, could benefit from more precise management of the data retrieved by BioMAJ.

Capitalizing on previous work carried out by GenOuest on the indexing of databanks, as well as on ongoing work on the storage of bank metadata in a graph-oriented database, we propose extra functionalities allowing users to build their own banks in order to optimize the processing time of their computations.
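To illustrate what such user-built banks could look like, the sketch below filters a reference collection down to the entries a given analysis actually needs. This is a hypothetical illustration in Python; the bank format (FASTA) and the header-keyword criterion are our own assumptions, not BioMAJ's actual API:

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict {header: sequence}."""
    records, header, chunks = {}, None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line.strip())
    if header is not None:
        records[header] = "".join(chunks)
    return records

def build_custom_bank(records, keyword):
    """Keep only entries whose header mentions the keyword (e.g. a taxon)."""
    return {h: s for h, s in records.items() if keyword.lower() in h.lower()}

bank = parse_fasta(">seq1 Escherichia coli\nATGC\n>seq2 Homo sapiens\nGGCC\n")
custom = build_custom_bank(bank, "coli")
```

A smaller, taxon-specific bank directly shortens downstream search and alignment times, which is the optimization the proposal targets.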

The new BioMAJ version will be a tool able to create a custom data infrastructure. To carry out this project, we therefore wish to recruit a software developer for 24 months.

The ability to profile DNA methylation levels at genome-wide scale and at single-base resolution has significantly advanced our knowledge of the role of epigenomics in disease onset and progression. Currently, bisulfite-sequencing technologies are the gold standard for these studies. Whole-genome bisulfite sequencing provides the most comprehensive approach, but is still very expensive and resource-consuming. Alternatively, target-enrichment techniques offer a trade-off between genome coverage and cost, and are the preferred solution for epidemiological and clinical studies built on the analysis of large populations and predefined target regions. However, existing pipelines for bioinformatics analysis of bisulfite-sequencing data focus on whole-genome approaches, where no enrichment of target regions is considered, and most of them are limited to methylation level calling or differential analysis.
In this context, we propose the development of a new service, called BISTAR, aiming to provide the first pipeline for the analysis of targeted bisulfite-sequencing data. BISTAR will cover all steps of the analysis, from sequencing reads to differential methylation, allele-specific methylation and SNP analyses. Technically, BISTAR will offer a practical solution for biomedical research, since it will combine time-efficient execution, by parallelising the stages of the pipeline, with user-friendly deployment using virtual appliances that run locally and through the IFB infrastructure; BISTAR will thus provide cloud access with elastic resource provisioning, and both command-line and graphical (Galaxy) user interfaces.
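The stage-level parallelisation can be pictured as fanning samples out over a pool of workers, each running the full chain of stages. The sketch below is purely illustrative: the stage names and their placeholder implementations are invented, not BISTAR code:

```python
from concurrent.futures import ThreadPoolExecutor

def trim_reads(sample):       # stage 1: adapter/quality trimming (placeholder)
    return sample + ".trimmed"

def align_bisulfite(sample):  # stage 2: bisulfite-aware alignment (placeholder)
    return sample + ".bam"

def call_methylation(sample): # stage 3: methylation calling (placeholder)
    return sample + ".meth"

def run_pipeline(sample):
    """Chain the stages for one sample."""
    return call_methylation(align_bisulfite(trim_reads(sample)))

# Samples are processed concurrently; results come back in input order.
samples = ["patient_01", "patient_02", "patient_03"]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_pipeline, samples))
```

In a real targeted bisulfite pipeline each stage would wrap an external tool, but the fan-out pattern is the same.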
BISTAR will fill the gap in bisulfite-sequencing analyses and will provide a flexible and user-friendly tool for a broad spectrum of researchers in the life science community.

CRISPRs are genetic loci present in bacteria and archaea which, associated with cas genes, provide defense against foreign sequences. The CRISPR-Cas system is a highly successful biotechnology tool, and CRISPR sequences are used in genotyping pathogenic bacteria. The CRISPR database and services developed at I2BC are leading international resources for CRISPR sequence analysis. These resources now need to be strengthened, and new tools and services should be added to meet growing user demands. This proposal is a collaboration between a senior member of the initial CRISPRdb team, two labs with strong expertise in cas gene sequence analysis and bacterial genotyping, and three bioinformatics platforms providing engineering support. The project is divided into two parts. First, we will improve the existing CRISPR tools in terms of database structure, search engines and interfaces, and develop a standalone version of the CRISPR finding tool to support large-scale analyses. Second, we will incorporate new features: a Cas sequence database and analysis tool and a bacterial genotyping tool. Involving key players such as Institut Pasteur and the French Bioinformatics Institute (which will host the future web site) should provide a strong foundation to this new CRISPR-Cas resource and promote its international status.
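The core task of a CRISPR-finding tool — detecting arrays of short direct repeats separated by unique spacers — can be sketched naively as a search for k-mers that recur at regular intervals. The k-mer length, threshold and toy sequence below are illustrative choices, not those of the actual CRISPRdb tools:

```python
def find_crispr_like_repeats(seq, k=8, min_copies=3):
    """Report k-mers occurring at least min_copies times: candidate direct repeats."""
    positions = {}
    for i in range(len(seq) - k + 1):
        positions.setdefault(seq[i:i + k], []).append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) >= min_copies}

# Toy array: the repeat "GTTTTAGA" separated by three unique 10-nt spacers.
repeat = "GTTTTAGA"
spacers = ["ACGTACGTAC", "TTGGCCAATT", "CCAAGGTTCC"]
array = repeat + "".join(s + repeat for s in spacers)
hits = find_crispr_like_repeats(array, k=8, min_copies=3)
```

A production tool additionally checks repeat lengths (roughly 23–47 bp in real CRISPRs), spacing regularity and spacer uniqueness, but the repeated-k-mer scan is the starting point.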

Provide ready-to-use Galaxy analysis environments for life sciences communities

Keywords:

Abstract:

With virtualization technologies, the way we consider accessibility and reproducibility (A/R) in computing science has shifted. From the classical approach where A/R was achieved through the distribution of bioinformatics tools, we are now ready to use appliances available on marketplaces hosted in a cloud. Such appliances represent an important shift because not only the tools become accessible for reproducibility, but also all the components contributing to the analysis environment. As virtualization and cloud computing technologies expand, we expect that the ability to build such containers and to deploy them on heterogeneous infrastructures (desktop, cloud, medium-sized infrastructure hosted in a lab) will become a major topic in the provision of services to scientists. On the other hand, the Galaxy platform has met with great success in several scientific communities and is becoming an important layer of environments dedicated to biological analysis. In this project, we plan to provide ready-to-use Galaxy environments for analysis to several scientific communities. The project will be organized around two axes. In the first axis, partners from several scientific communities will design representative use cases. In the second axis, these use cases will be implemented as containers and made accessible on the IFB cloud infrastructure. We expect the technical solutions and expertise developed during the project to be reusable and useful for wider scientific communities, especially the European life science community (ELIXIR).

MatrixDB is the only interaction database reporting protein-protein and protein-glycan interactions involving extracellular molecules. The database can be searched with a wide range of keywords and provides tools for visualising and mining interacting partners and interactions, but the actual specificity of protein-glycan interactions is still underexplored, mainly due to the lack of a formalised description of glycosaminoglycan (GAG) sequences. This prevents, for example, the description of binding sites on glycosaminoglycans in a standard form, which in turn prevents searches aiming at identifying proteins sharing common binding sites on glycosaminoglycans. The common concern in the glycobioinformatics community for glycan structure encoding gave rise to several standards (GlycoCT, GlycoRDF) that can be adapted to GAGs. This project builds on recent advances in glycan-related ontologies and structural encoding to implement new search tools for glycan-protein interactions and to extract new knowledge from publications, to be integrated in MatrixDB by manual curation. These tools will be useful to characterize the GAG sulfation patterns involved in protein interactions and to visualize their 3D structures, which will be cross-referenced with Glyco3D. Adopting current glycobioinformatics standards will also facilitate cross-referencing with other well-established glycan-related resources such as SugarBindDB, UniCarbKB and Glyco3D. This project will provide further services for exploring protein-glycosaminoglycan interactions. The concomitant expansion of stable and integrated databases, cross-referenced with popular bioinformatics resources, should contribute to connecting glycomics with other -omics, following the recommendations of “A roadmap for Glycoscience in Europe” (European Science Foundation) and “Transforming Glycoscience: A Roadmap for the Future” (The National Academies, USA).

Protein structure determination is crucial for understanding protein function, as it paves the way to the discovery of new drugs and of new approaches to control pathological biological processes. Recent advances in structural biology now allow collecting structural information from a variety of techniques at various resolutions. Integrating such heterogeneous data to determine hybrid structures is currently a computational challenge in molecular modelling, both in terms of computing efficiency and of the availability of bioinformatics tools. The widely used ARIA software developed at Institut Pasteur has proven very efficient in automatically determining protein structures from NMR data. In this project, we will expand the repertoire of input data types that can be used with ARIA for hybrid structure determination. In parallel, it is necessary to bridge the data analysis modules of ARIA with other relevant structure generation engines able to handle these data types. To ultimately provide a fully transparent service to the end user, we will design a web interface for ARIA and make hybridARIA freely available to the scientific community, notably through the cloud deployed by the IFB.

MicroScope platform raised in the Cloud: toward a Software as a Service for on-demand analyses of microbial genomes

Keywords:

Microbial genome annotation, Cloud computing, Software as a service

Abstract:

MicroScope is an integrated platform supporting microbial genome (re)annotation and comparative analysis. The current project aims at designing a version of the MicroScope platform using Cloud technologies, to progressively switch to a Software as a Service (SaaS) distribution mode. This technical evolution will require several adaptations of the current architecture: (i) the integration of the MicroScope components into a single appliance; (ii) the adaptation of workflows for dynamic provisioning of cluster workers; (iii) the setup of a service providing and handling updates of the required reference databanks for the different MicroScope instances. Additional functionalities will also be developed: (i) user interfaces for data and workflow management; (ii) a central repository of MicroScope genomes allowing users to share their data within the community of microbiologists. The main purpose of the project is to provide biologists with an on-demand MicroScope solution requiring no specific computational skills and minimal user support. Furthermore, it should increase the flexibility in scale and cost of computation and storage to face the challenge of Big Data in genomics. These technological developments will be made in collaboration between the CEA/Genoscope (LABGeM), the Pasteur Institute (C3BI/CIB) and the Institut Français de Bioinformatique (IFB), and could be the starting point of a new ELIXIR pilot project (www.elixir.eu) in the domain of microbial genomics.

Our objective is to provide support to biologists from the IBPS and beyond for their computational biology analyses, and to develop accessible tools for reproducible and transparent analyses in our fields of expertise. A major milestone in our roadmap for the coming two years is the release of a set of Galaxy-compliant tools and workflows to study miRNAs, siRNAs and piRNAs of animals and viruses.

Seven tools are already available in Galaxy tool sheds. This mississippi tool suite makes it possible to analyze small RNA sequencing datasets in order to annotate, align and visualize small RNAs and their meta-properties.
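One meta-property central to such analyses is the read-length distribution, which distinguishes miRNA-sized reads (~21–23 nt) from longer piRNA-sized ones (~24–31 nt). A minimal sketch of that profile, with made-up reads:

```python
from collections import Counter

def size_distribution(reads):
    """Count reads per length -- the classic small RNA size profile."""
    return Counter(len(r) for r in reads)

reads = ["TGAGGTAGTAGGTTGTATAGTT",   # 22 nt, miRNA-sized
         "TGAGGTAGTAGGTTGTATAG",     # 20 nt, a trimmed variant
         "TGAGGTAGTAGGTTGTATAGTT"]   # 22 nt
profile = size_distribution(reads)
```

A real tool computes this per annotation class and strand, but the histogram itself is this simple.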

In addition to upgrading existing tools through a continuous development process, we propose here to extend the Galaxy mississippi tool suite with tools and workflows providing further support to small RNA biology. Thus, we will develop and release tools and workflows to (i) analyse small RNA phasing and editing, (ii) profile miRNAs and their differential expression, (iii) discover new miRNAs from sequencing datasets, and (iv) diagnose/discover viruses through metagenomic analyses of viral siRNAs.

Our service deployment plan includes the release of high-quality tools and workflows in Galaxy tool sheds, in Galaxy server instances at the IBPS bioinformatics platform, as well as in Docker containers as an additional option for accessibility and reproducibility. In addition, we wish to benefit from the IFB cloud infrastructure to deploy and provide access to our small RNA-oriented Galaxy server instances.

Our project will benefit both the small RNA and Galaxy communities.

The DNA barcoding initiative, proposed in 2003, represented a big step forward in standardized DNA-based species identification. It relies on the use of one or a few small portions of the genome (standard barcodes) as a discrete taxonomic character for identifying unknown specimens by comparison with a reference database. This initiative was very successful and led to the collaboration of teams from almost all countries around the world, producing extensive reference databases. However, the standard barcodes were designed in the context of Sanger sequencing, and the recent development of next-generation sequencing allows further extension of the initiative's discriminating power. We propose to complement the standard barcode with an approach taking advantage of the power of next-generation sequencing: an extended barcode, composed simply of one or two gigabases of sequence reads obtained by a shotgun approach on genomic DNA. The data production of the PhyloAlps project, which aims to sequence the whole Alpine flora following this strategy and was funded by a France Genomique grant in 2014 and by the Genoscope, can be considered a large pilot experiment for this new DNA barcoding strategy. After the four-year sampling effort, the 6,000 sequence datasets will be produced by the end of 2015. It is now time to elaborate the third step of this ambitious project, which consists of developing a web platform dedicated to distributing these data. Beyond the scope of the PhyloAlps project, our aim is to design a prototype for a database like BOLD but dedicated to next-generation DNA barcodes.

With 50,000 data analyses per month and more than 1,500 citations (Google Scholar), the phylogenetic analysis pipeline Phylogeny.fr [1] is one of the most visible French IT resources at both the national and international levels. Phylogenetic analysis is performed by chaining (selected) programs together. Today, users' needs have evolved: they may use Phylogeny.fr for teaching, possibly involving hundreds of users at the same time, or employ it in batch mode, leading to the submission of large numbers of requests to the same server. These practices have led to several overloads of our servers. In this project, we thus plan to increase the robustness of Phylogeny.fr. The originality of the new version lies in coupling a scientific workflow environment (Galaxy) with a web interface allowing visualization of and interaction with phylogenetic objects. More precisely, this project will provide (i) a large set of phylogenetic analysis bricks and, for each brick, access to diverse programs, all encapsulated in Galaxy, thus making the system able to deal with large groups of users and/or large sets of data, (ii) a set of optimized, robust and expressive workflows extending the basic phylogenetic workflow to various and rich contexts of phylogenetic analysis, and (iii) an easy-to-install environment equipped with a new visualization layer, on top of the Galaxy system, dedicated to phylogenetic analyses.
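The chaining of programs can be pictured as a simple composition of workflow steps, each step consuming the previous step's output. In the sketch below the steps are pure-Python placeholders; a real run would instead invoke external programs (for instance an aligner, an alignment curator and a tree inference tool, as in the published one-click pipeline):

```python
def align(seqs):         # placeholder for a multiple aligner (e.g. MUSCLE)
    return {"alignment": seqs}

def curate(aln):         # placeholder for alignment curation (e.g. Gblocks)
    aln["curated"] = True
    return aln

def infer_tree(aln):     # placeholder for tree inference (e.g. PhyML)
    return "(" + ",".join(sorted(aln["alignment"])) + ");"

def workflow(data, steps):
    """Run the steps in order, feeding each one the previous output."""
    for step in steps:
        data = step(data)
    return data

tree = workflow({"human": "ATG", "mouse": "ATG", "fly": "ATA"},
                [align, curate, infer_tree])
```

Encapsulating each brick behind a uniform interface is exactly what makes the Galaxy wrapping and the program-swapping described above possible.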

We propose to extend and improve the robustness of an innovative information system to formally represent, model, explore and visualize the molecular and anatomical developmental programs of animal embryos. This system currently forms the basis for the model organism database for the worldwide community of tunicate biologists, and is being adopted by three other communities, working with sea urchins, cephalochordates and the cnidarian Clytia hemispherica. The application covers the following three specific aims:

• Specific aim 1: Improvement of the robustness of the system. We will set up automated backup procedures, and automated testing procedures of system functionality. We will extend the user management system to define user groups and private spaces, allowing the sharing of data between collaborators. We will introduce an archiving/tracking procedure for successive versions of gene models across assemblies.

• Specific aim 2: Extension to new data types. We will adapt the schema to host and represent new types of genomic data, including RNA-seq, ChIP-seq, ATAC-seq, SELEX-seq, and transgenic lines. We will develop corresponding back-end management and biocuration interfaces.

• Specific aim 3: Development of new user interfaces. We will increase the flexibility of the search interfaces by: i) supporting complex sequential queries, ii) setting up a Biomart server for the extraction of large datasets, iii) developing an API for programmatic access to the database. We will extend the display interfaces, and in particular introduce reasoning engines to compute and display the genetic regulation relationships taking place in each embryonic territory.
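The planned API could, for instance, expose REST-style endpoints; the sketch below merely shows how a client might compose such a query URL. The base URL, endpoint, gene identifier and parameter names are hypothetical placeholders, not the actual interface:

```python
from urllib.parse import urlencode

BASE = "https://example.org/api/v1"   # hypothetical endpoint

def gene_query_url(gene_id, stage=None, data_type=None):
    """Build the URL for retrieving data attached to a gene model."""
    params = {"gene": gene_id}
    if stage:
        params["stage"] = stage
    if data_type:
        params["type"] = data_type
    return BASE + "/expression?" + urlencode(params)

url = gene_query_url("gene_00042", stage="gastrula", data_type="RNA-seq")
```

Such an interface would let scripts retrieve large result sets without going through the graphical search forms.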

The complex, rapidly evolving field of mass spectrometry-based proteomics calls for collaborative infrastructures where the large number of algorithms for proteomics data analysis and annotation can be readily integrated whatever the language, evaluated on reference datasets, and chained to build ad hoc workflows for users. Currently, the exploitation of data delivered by proteomics platforms is still restricted owing to limited dedicated in-house bioinformatics capabilities. The aim of this project is to provide the life science community with a collaborative online research platform enabling end-users to further explore their proteomics data by sharing workflows and experiments. This proteomics research environment (ProteoRE) will be built upon the Galaxy framework, a software platform that gives experimentalists simple interfaces to powerful tools while automatically managing the computational details. It enables ergonomic integration, exchange, and running of individual modules and workflows. Three modules for downstream analysis of proteomics data are foreseen: i. quality control of proteomics data, ii. differential expression analysis, and iii. proteomics data annotation. In addition, ProteoRE will allow users to select one or more tools or resources to annotate their data, while automatically tracking the provenance of data and tool usage and enabling users to selectively run (and rerun) particular analyses. The development of this research environment may involve corrective or evolutionary maintenance and testing, and a helpdesk will be set up. Finally, a thematic school and tutorials will be considered for end-user training purposes.

Regulatory genomics is an active field of research enabled by high-throughput sequencing, which allows genome-wide systematic analyses of (long-range) cis-regulatory elements, including their potential implication in diseases. This proposal aims at providing a new service to help users, mostly biologists dealing with genome-wide datasets, to rapidly narrow down to the best candidate cis-regulatory elements and regulated genes for further experimental testing. To this end, we will develop new functionalities and graphical displays at the interface between two well-established web servers: Genomicus and RSAT. In their own fields, both tools represent advanced and well-established resources actively used by a large community. Genomicus is a graphical browser to perform comparative genomics analyses including predicted cis-regulatory interactions, while RSAT offers a suite of tools to identify and manipulate enriched motifs in genomic DNA. We will specifically target three types of use cases: (i) how to identify enhancers of co-expressed genes and study their motif enrichments, (ii) how to narrow down to interesting candidates from genome-wide ChIP-seq data, and (iii) how to infer the impact of a variant on an enhancer through potential perturbation of a predicted transcription factor binding site. Developments will moreover increase the user-friendliness and interpretability of the user interfaces, which will provide tight communication channels between the two servers. The project will also emphasise training of novice users (through web media, tutorials and dedicated courses involving the IFB Cloud). RSAT is available on the IFB Cloud but requires regular maintenance; a virtual machine for Genomicus will be tested.
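Inferring the impact of a variant on a predicted binding site (use case iii) classically amounts to rescoring the site with a position weight matrix (PWM) before and after the substitution and comparing the log-odds scores. A toy sketch with an invented 4-bp motif (the matrix values are made up for illustration):

```python
import math

# Toy position frequency matrix for a 4-bp motif (one dict per position).
PFM = [
    {"A": 8, "C": 1, "G": 1, "T": 1},
    {"A": 1, "C": 8, "G": 1, "T": 1},
    {"A": 1, "C": 1, "G": 8, "T": 1},
    {"A": 1, "C": 1, "G": 1, "T": 8},
]
BACKGROUND = 0.25  # uniform background base frequency

def pwm_score(site):
    """Sum of per-position log2 odds of the site under the motif vs background."""
    score = 0.0
    for pos, base in enumerate(site):
        freq = PFM[pos][base] / sum(PFM[pos].values())
        score += math.log2(freq / BACKGROUND)
    return score

# A G>T substitution at the third motif position weakens the predicted site.
ref_site, alt_site = "ACGT", "ACTT"
delta = pwm_score(alt_site) - pwm_score(ref_site)
```

A strongly negative score difference flags the variant as potentially disrupting the transcription factor binding site.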

As systems-level approaches and functional genomics methods become mainstream for tackling most biological questions, the need for know-how in the integrative analysis of high-throughput data is becoming prominent. For this reason, bioinformaticians nowadays have a pivotal role in an increasing number of research projects. However, except for a minority of systems biology-oriented labs (like the TAGC), which benefit from internal expertise in computational biology, most biological labs express a crucial need for bioinformatics services going beyond the routine analysis of raw data to embrace a systems-level interpretation. This involves, among other things, identifying relevant tools among the variety of existing possibilities, tuning parameters according to the particularities of the research project, designing custom workflows to address domain-specific questions, integrating various data types revealing complementary aspects of the systems, and synthesizing the multitude of result files into human-interpretable reports.

Here, we propose to convert the TAGC's internal expertise in bioinformatics analyses into a service offered to customers, coupled to our already existing TGML next-generation sequencing facility. The T5 project therefore consists in (1) gathering and installing all the in-house developed tools on an integrative server, (2) generalizing programmatic access to all tools, and (3) developing ready-to-use “backbone pipelines” for the analysis of datasets, which can be tailored to project specificities.

Facing the emergence of new technologies in the field of metabolomics, the software solutions adopted so far (UNIX, R packages, etc.) clearly show their limits. Bottlenecks affect unified access to core applications as well as computing infrastructure and storage. In the context of a collaboration between the two national infrastructures in metabolomics and bioinformatics, we have developed a Virtual Research Environment (VRE) for data analysis based on the Galaxy framework: workflow4metabolomics.org (W4M). This modular and extensible VRE includes existing components (XCMS functions, etc.) but also a whole suite of complementary statistical and annotation tools. The implementation is accessible through a web interface, which guarantees the completeness of parameter settings. The advanced features of Galaxy have made it possible to integrate components from different environments and in different languages. Finally, an extensible environment is offered to the metabolomics community, enabling the sharing of preconfigured workflows for new users as well as experts in the field. The aim of this proposal is to build new functionalities taking into account the interactive user experience (e.g. visualization) and to extend system interoperability with external data resources (e.g. reference databases, external repositories, web sites). These developments will address the requirements of the experimental community and position W4M as the key resource for open-source computational metabolomics in Europe.
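As an example of the kind of statistical building block such a suite contains, a common preprocessing step in metabolomics is normalising each sample to its total signal intensity before comparison. A minimal sketch with invented intensities; this illustrates the general technique, not a specific W4M module:

```python
def total_intensity_normalise(matrix):
    """Scale each sample (row) so its feature intensities sum to 1."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([x / total for x in row])
    return out

# Rows: samples; columns: metabolite features (arbitrary units).
intensities = [[10.0, 30.0, 60.0],
               [ 5.0,  5.0, 10.0]]
normalised = total_intensity_normalise(intensities)
```

Normalisation of this kind makes samples acquired at different overall signal levels comparable in downstream statistics.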

The use of fluorescent reporter gene technologies has become widespread in biology and has created a demand for bioinformatics tools to analyze the large amounts of data produced. Currently, few such tools are available to the life sciences community. We recently developed WellInverter, a web application based on generic measurement models and sound computational procedures for reconstructing growth rate, promoter activity, and protein concentrations from the primary fluorescence and absorbance data. Starting from this existing application, we propose to develop a scalable and user-friendly web service providing a guaranteed quality of service in terms of availability and response time. We plan to optimize the algorithms underlying the inference of gene expression profiles from the primary data, put into place a parallel computational architecture with a load balancer to distribute the analysis queries over several back-end servers, and improve the graphical user interface to make the tool accessible to a broad user community. The resulting new version of WellInverter will be deployed on the IFB platform and accompanied by extensive user documentation, online help, and a tutorial. We expect this web application to become a widely used, general-purpose bioinformatics tool providing an original service that is currently in high demand.
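The measurement models rest on relations such as μ(t) = d ln A(t)/dt for the growth rate, where A is the absorbance. A naive finite-difference sketch of that relation on synthetic data; WellInverter itself uses more sophisticated estimation procedures designed to be robust to measurement noise:

```python
import math

def growth_rate(times, absorbance):
    """Central-difference estimate of mu(t) = d ln A / dt at interior points."""
    mus = []
    for i in range(1, len(times) - 1):
        dlog = math.log(absorbance[i + 1]) - math.log(absorbance[i - 1])
        mus.append(dlog / (times[i + 1] - times[i - 1]))
    return mus

# Synthetic exponential growth: A(t) = 0.05 * exp(0.5 * t), so mu(t) = 0.5.
times = [0.1 * k for k in range(11)]
absorbance = [0.05 * math.exp(0.5 * t) for t in times]
mus = growth_rate(times, absorbance)
```

On real, noisy well-plate data such a direct derivative amplifies noise, which is precisely why regularised inversion methods are needed.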