Tools for Biocuration and Data Visualization

Odd-numbered posters will be presented on Monday, 8th April, and even-numbered posters on Tuesday, 9th April.

Posters 50 - 93.

50 Validation in the Protein Data Bank

Berrisford, John
EBI

The Protein Data Bank (PDB) is one of the oldest biological archives. Having started with 7 entries in 1971, it has grown to over 145,000 entries in 2018. The way the quality of archival entries is understood and evaluated has evolved over the years, and consistent validation helps users draw more reliable conclusions from the PDB archive. In 2008-2009, the wwPDB, the worldwide organisation which manages the PDB, set up community expert task forces to provide recommendations on how to judge the quality of models in the PDB archive, the supporting experimental data and the correspondence between them. These recommendations, using software packages provided by the structural biology community, were implemented in the wwPDB validation reports. The wwPDB validation reports summarise the overall quality of each PDB entry in a simple slider image. The sliders allow non-expert users to judge the quality of each PDB entry at a glance and select the best-quality model for their work, without having to understand all the validation metrics that were used to create the sliders. All wwPDB partners put the slider image in a prominent position on their webpage for each PDB entry. The wwPDB validation service also produces complete reports, in PDF and XML format, which contain details of every outlier in the PDB entry. The wwPDB validation service is freely available via the wwPDB website (validate.wwpdb.org) and through an API, and users are encouraged to use these services prior to deposition to the PDB. Following curation of a PDB entry, an official wwPDB validation report is produced; depositors are encouraged to provide this report to the referees of their publication, and an increasing number of journals require it.
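
To illustrate how a slider value might be computed, the toy sketch below ranks a single quality metric (clashscore, where lower is better) against an archive-wide distribution as a percentile. This is a simplification for illustration only, not the actual wwPDB implementation, which combines several community-defined metrics computed by dedicated software.

```python
from bisect import bisect_left

def percentile_rank(value, archive_values, lower_is_better=True):
    """Rank one entry's quality metric against an archive-wide
    distribution, as a 0-100 slider position (100 = best). Toy
    illustration only; the actual wwPDB reports combine several
    community-defined metrics computed by dedicated software."""
    ranked = sorted(archive_values)
    strictly_better = bisect_left(ranked, value)  # entries with a lower score
    frac = strictly_better / len(ranked)
    return 100 * (1 - frac) if lower_is_better else 100 * frac

# Hypothetical clashscores (lower is better) across five archive entries.
clashscores = [2.0, 5.0, 10.0, 20.0, 40.0]
```

An entry with the best clashscore in the archive would sit at 100 on such a slider, while an entry worse than every archived value would sit at 0.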

51 Maintaining Balance in a Data Ecosystem

Bolton, Elizabeth
Rat Genome Database

The Rat Genome Database (RGD, http://rgd.mcw.edu) uses both existing and internally developed ontologies to give a comprehensive view of normal and aberrant gene, pathway and phenotype data for rat as a model organism in physiology, toxicology and disease. In addition to serving as a resource for rat, RGD also transfers data to orthologs in human and additional animal models, providing a genomic database for disease researchers using less frequently studied models. Some of the biggest questions in our field today revolve around how to appropriately leverage data to present the best view of the current scope of knowledge around the research model of interest. To address this, RGD staff routinely monitor data trends to maintain balance in our annotation strategies, as well as identify new data sources to build more comprehensive and personalized data models for researchers. We intend to infer relationships between previously unlinked objects using queries and comparative analyses of gene sets in different ontologies such as the Pathway Ontology, Gene Ontology and BioPAX, as well as comparisons with external databases such as Reactome, Rhea or PID. We additionally identify gaps in information on newly characterized genes or genes presumed to play a role in disease pathways, and add new data types for our additional animal models. Through the adoption or refinement of pipelines, tools, ontologies and querying strategies, RGD continues to look to the future while maintaining 20 years’ worth of legacy data.

52 Optimizing Collaboration and Workflow with BioAssay Express

Bunin, Barry
Collaborative Drug Discovery

BioAssay Express (BAE) technology streamlines the conversion of human-readable assay descriptions to computer-readable information. BioAssay Express uses semantic standards and ontologies to mark up bioprotocols, which unleashes the full power of informatics technology on data that could previously only be organized by crude text searching (https://peerj.com/articles/cs-61/). One of several annotation-support strategies within BAE is the use of machine learning models to provide statistically backed "suggestions" to the curator. New data can be curated using a web-based interface, and legacy text-based data is annotated on the fly with the support of text mining and machine learning methods. The initial focus has been small molecule in vivo assays, for which we designed the Common Assay Template (CAT), which we have been continuously improving. We are also developing the BioLogics Assay Template (BLAT), which introduces additional categories and vocabulary necessary to describe this new flavour of assay protocol.

The Genome Reference Consortium (GRC) was founded in 2007 to safeguard and further improve the reference genome assemblies of human, mouse and zebrafish. Curators are responsible for providing ‘genomic care’ for chromosomes in order to remove errors and add sequence to the current reference assemblies through manual intervention. Examples of this work include resolution of alignment discrepancies with transcripts, identification and repair of misassembled sequence, gap closure, retiling of problematic regions and adding alternate loci. Assembly improvement is achieved using standardised operating procedures to ensure consistency between curators at different institutions. Issues requiring assessment are reported either by the partners or the scientific community, with curators assuming responsibility for the management of these issues. A variety of genome analysis tools are employed by curators in order to reach a satisfactory resolution of the issues raised; these can be classified as tracking systems, sequence evaluators and genome browsers. Examples of those used include JIRA, Genome Workbench, gEVAL and Ensembl. Regions under review and in progress are reported at genomereference.org. The curation results are released as infrequent, coordinate-changing major releases and also as frequent minor releases in the form of genome patches, which either correct errors in the assembly or add additional alternate loci. Advances in technology have increased the number of sequenced genomes available for data mining; however, de novo assembly of next-generation sequencing reads is still problematic, resulting in the continued need for manual curation.

Data integration promises to be one of the main catalysts in enabling new insights to be drawn from the wealth of biological data available publicly. However, the heterogeneity of the different data sources, at both the syntactic and the semantic level, still poses significant challenges for achieving interoperability among biological databases. Ontology-based data access (OBDA) has emerged as a promising approach to solving some of these challenges, by providing homogeneous access to data structured with distinct data models. Here, we propose a federated ontology-driven data integration approach, applied to three heterogeneous data stores. These span different areas of biological knowledge: 1) Bgee, a gene expression relational database; 2) OMA, a hierarchical orthology data store; and 3) UniProtKB, an RDF store containing protein sequence and functional information. We enable domain specialists to benefit from the integrated data by providing a web search interface based on natural language templates. These allow users to answer complex queries across the three sources without requiring knowledge of a technical query language such as SPARQL. To be able to perform federated queries, we define a new semantic model for gene expression called GenEx. We then show how the relational data in Bgee can be expressed as a virtual graph by applying an OBDA approach, through dedicated relational-to-RDF mappings. This solution avoids data duplication, while allowing data exchange between the original Bgee database and other RDF stores. Furthermore, we make the RDF view of Bgee available via a public SPARQL 1.1 endpoint. Similarly, the materialized RDF view of OMA, expressed in terms of the existing Orthology (ORTH) ontology, is made available in a public SPARQL endpoint. In addition, we identify intersection points among the three data sources to perform joint queries across them. Finally, our experiments show that representative queries can be answered within seconds.
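
As a sketch of how such a federated query could look, the snippet below composes a SPARQL 1.1 query that joins a local Bgee virtual graph (exposed via OBDA mappings) with remote OMA and UniProtKB endpoints using SERVICE clauses. The prefixes, property names (genex:isExpressedIn, orth:hasOrtholog) and endpoint URLs are illustrative placeholders, not the exact GenEx/ORTH schema.

```python
def federated_query(gene_label, oma_endpoint, uniprot_endpoint):
    """Build a SPARQL 1.1 federated query joining a local Bgee virtual
    graph with remote OMA and UniProtKB endpoints. Prefixes, property
    names and endpoint URLs are illustrative placeholders."""
    return f"""
PREFIX orth:  <http://purl.org/net/orth#>
PREFIX genex: <http://purl.org/genex#>
PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?gene ?ortholog ?anat WHERE {{
  ?gene rdfs:label "{gene_label}" ;          # local Bgee virtual RDF graph
        genex:isExpressedIn ?anat .
  SERVICE <{oma_endpoint}> {{                # remote orthology store
    ?gene orth:hasOrtholog ?ortholog .
  }}
  SERVICE <{uniprot_endpoint}> {{            # remote protein annotation
    ?ortholog rdfs:seeAlso ?xref .
  }}
}}
"""
```

The natural language templates described above would fill in the gene label and dispatch the generated query to the federated endpoints on the user's behalf.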

55 Efficient Curation of Genome Annotations through Collaboration with Apollo

Dunn, Nathan
Lawrence Berkeley National Lab

Accurately annotated genomes are vital to understanding the biological function contributed by each genomic element. Researchers must review diverse information such as transcriptome alignments and predictive models based on sequence profiles, over potentially many iterations, and then integrate these into a unified model for each genomic element. Tools that aid in this review, evaluation and integration need to be simple to install, configurable, efficient to use, and able to accommodate additional analyses, genomes, workflows, and researchers (wherever they may be geographically located). To this end, the Apollo genome annotation editor provides a collaborative graphical platform for researchers to review and revise the predicted features on genome sequences in real time (similar to Google Docs). Apollo can be downloaded directly to run locally (or via Docker) for individual users, and can also be set up so that a single web server concurrently supports multiple research teams with hundreds of researchers and genomes. The most recent focus of Apollo has been to provide users with the ability to add and share genomes and genomic evidence (JBrowse tracks) directly through the interface using standard formats (e.g. GFF3, FASTA, BAM, CRAM, VCF), eliminating the need for an administrator to run additional scripts to load these onto the server. We are also focusing on enabling genome publishing as a browsable, graphical resource: when project researchers decide to make their genomic annotations publicly available, they can generate snapshots of these in JBrowse archival hubs. Finally, we are enhancing our variant annotation capabilities, including the ability to visualize the impact a variant would have on the annotated isoforms it intersects. Apollo is used in over one hundred genome annotation projects around the world, ranging from annotation of a single species to lineage-specific efforts supporting the annotation of dozens of genomes. https://github.com/GMOD/Apollo/

56 IMGT/mAb-DB and IMGT/2Dstructure-DB for IMGT standard definition of an antibody: from receptor to amino acid changes

Duroux, Patrice
IMGT, IGH, CNRS

IMGT®, the international ImMunoGeneTics information system®, http://www.imgt.org, is the global reference in immunogenetics and immunoinformatics. IMGT® is a high-quality integrated knowledge resource specialized in the immunoglobulins (IG), T cell receptors (TR) and major histocompatibility (MH) of vertebrates, and in the immunoglobulin superfamily (IgSF), MH superfamily (MhSF) and related proteins of the immune system (RPI) of vertebrates and invertebrates. Annotated data of IMGT/mAb-DB and IMGT/2Dstructure-DB are being used to generate the IMGT standard definition of an antibody, from receptor to amino acid changes. Therapeutic proteins found in IMGT/mAb-DB and IMGT/2Dstructure-DB include the IG, the fusion protein for immune application (FPIA), the composite protein for clinical application (CPCA) and related protein of the immune system (RPI). IMGT/2Dstructure-DB online contains 5,437 entries and was implemented on the model of IMGT/3Dstructure-DB in order to manage AA sequences of multimeric receptors. Chain and domain annotation includes the IMGT gene and allele names (CLASSIFICATION), region and domain delimitations (DESCRIPTION) and domain AA positions according to the IMGT unique numbering (NUMEROTATION). The closest IMGT® genes and alleles and the complementarity determining region (CDR)-IMGT lengths are identified with the integrated IMGT/DomainGapAlign tool, which aligns the AA sequences with the IMGT/DomainDisplay AA domain reference sequences. The IMGT reference sequences are the product of all the upstream work of manual biocuration. IMGT/mAb-DB, the IMGT database created as an interface for therapeutic proteins, contains 852 entries. IMGT/mAb-DB provides the receptor identification in one of the categories (IG, FPIA, CPCA, RPI, TR and MH), links to IMGT/2Dstructure-DB and to IMGT/3Dstructure-DB, the target name with the HGNC nomenclature, clinical indications, authority decisions and links related to them. From these, the IMGT standard definition of an antibody can be generated.

57 GEO data on Xenbase: A pipeline to curate, process and visualize genomic data for Xenopus.

Fisher, Malcolm
Cincinnati Children's Hospital Medical Center

As high-throughput sequencing grows in popularity and the volume of data continues to increase almost exponentially, it is increasingly hard for researchers without bioinformatics training to sift through and parse such data. Xenbase, the Xenopus model organism database, has developed a semi-automated bioinformatics pipeline to curate, process, and visualize the RNA-Seq and ChIP-Seq Xenopus data in NCBI’s Gene Expression Omnibus (GEO) data repository, using standard literature-supported tools (RSEM, MACS2, deepTools, etc.) alongside new software (CSBB-v3.0). Data collection involves automated data import from GEO via E-Utils, followed by manual curation of experimental details (i.e. reagents, manipulations, stage manipulated and assayed). Replicate samples are manually grouped into control and experimental sample groups, with differential expression (DE) comparisons defined where appropriate. The ‘experiment type’ (RNA-Seq or ChIP-Seq), ‘replicate group’, ‘control group’, and other associated metadata are then used as inputs for our bioinformatics pipeline. Raw sequence data is downloaded from the SRA and quality checked. Quality-controlled reads are mapped to the latest Xenopus genome build; for RNA-Seq, quantification and differential expression analysis are performed, while for ChIP-Seq, mapped reads are used to call peaks. The pipeline generates files compatible with most genome visualization programs, such as IGV or Xenbase’s JBrowse instance. Processed data can be downloaded from the Xenbase FTP site. Heatmaps of expression values and DE fold change values for RNA-Seq data are also available as interactive visualizations. Results from the DE analysis produce a searchable list of differentially expressed genes (which pass statistical criteria for significance) in the form of gene expression as phenotype statements, such as ‘pax6 expressed in eye - absent’. These automatically inferred phenotype statements will be compatible with our manually curated expression phenotypes.
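
The replicate-grouping step can be sketched as follows. The field names and group labels are hypothetical, not Xenbase's actual schema, but they show how curated metadata turns into pipeline inputs: one DE comparison per experimental group against its matched controls.

```python
def build_comparisons(samples):
    """Group curated GEO samples by assay type and replicate group, then
    emit one control-vs-experimental comparison per experimental group.
    Field names ('assay', 'group', 'sra_run') are illustrative, not the
    actual Xenbase schema."""
    groups = {}
    for s in samples:
        groups.setdefault((s["assay"], s["group"]), []).append(s["sra_run"])
    comparisons = []
    for (assay, group), runs in groups.items():
        control = groups.get((assay, "control"), [])
        if group != "control" and control:
            comparisons.append(
                {"assay": assay, "control": control, "experimental": runs}
            )
    return comparisons

# Toy curated metadata: two control replicates and one manipulated sample.
samples = [
    {"assay": "RNA-Seq", "group": "control", "sra_run": "SRR001"},
    {"assay": "RNA-Seq", "group": "control", "sra_run": "SRR002"},
    {"assay": "RNA-Seq", "group": "pax6-depleted", "sra_run": "SRR003"},
]
```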

The development of efficient text-mining tools promises to boost the curation workflow by significantly reducing the time needed to process the literature into biological databases. We have developed a curation support tool, neXtA5, that provides a search engine coupled with an annotation system directly integrated into a biocuration workflow. neXtA5 assists curation with modules optimized for the various curation tasks: document triage, entity recognition and information extraction. Here, we describe the evaluation of neXtA5 by expert curators. We first assessed the annotations of two independent curators to provide a baseline for comparison. To evaluate the performance of neXtA5, we submitted requests and compared the neXtA5 results with the manual curation. The analysis focuses on the usability of neXtA5 to support the curation of two types of data: biological processes (BPs) and diseases (Ds). We evaluated the relevance of the papers proposed as well as the recall and precision of the suggested annotations. The evaluation of document triage precision showed that both curators agree with neXtA5 for 67% (BP) and 63% (D) of abstracts, while curators agree with each other on accepting or rejecting an abstract ~80% of the time. Hence, the precision of the triage system is satisfactory. Contributors: Aurore Britan, Isabelle Cusin, Valérie Hinard, Luc Mottin, Emilie Pasche, Julien Gobeill, Valentine Rech de Laval, Anne Gleizes, Daniel Teixeira, Pierre-André Michel, Patrick Ruch and Pascale Gaudet

The mission of the Gene Expression team at EMBL-EBI is the development of tools to facilitate submission, archival, reprocessing and visualisation of functional genomics data. Our tools and resources are continuously updated to incorporate data from new technologies, most recently with the release of resources for single-cell RNA-sequencing (scRNA-seq) datasets, which investigate the transcriptome at the single-cell level. To enable capture of rich experimental metadata for scRNA-seq studies, we have created new templates for our submission tool, Annotare, that represent the minimal technical information required to reproduce and reprocess single-cell experiments. A set of minimal information typically includes details describing the cell isolation protocol, cell quality measurements, library construction process and data file content. Annotations are chosen from a controlled vocabulary or mapped to Experimental Factor Ontology (EFO) terms to ensure consistency. Once submitted, datasets are reviewed by curators for accuracy before raw data is archived at the European Nucleotide Archive (ENA). Sample metadata and processed data are made available in our functional genomics archive ArrayExpress, currently hosting 120 single-cell datasets. Datasets are then reprocessed using our in-house standardised pipelines and visualised in the newest component of our added-value resource Expression Atlas, the Single Cell Expression Atlas, which contains 50 datasets across 9 species. Users can explore the expression of a specific gene of interest across different species and experiments. Results can be filtered by tissues and cell types, and indicate whether the gene was identified as a "marker gene" in a particular cell population. Data points are presented in a t-SNE plot which showcases the variability of gene expression at the single-cell level.
Alongside the expression levels, the plots display metadata such as the cell cluster defined by the SC3 algorithm and any experimental variables.

60 The Bio-Entity Recognition for SABIO-RK Database

Ghosh, Sucheta
HITS

SABIO-RK (http://sabiork.h-its.org) has been developed as an expert-curated database for biochemical reactions and their kinetic properties, with the aim of supporting computational modelling to create models of biochemical reaction networks, and also allowing experimentalists to acquire further knowledge about enzymatic activities and reaction properties. This work is an initial attempt to use text-mining tools to assist annotators of the database. The most fundamental task in biomedical text mining is the recognition of named entities (NER), or simply bio-entities, such as proteins, species, diseases, chemicals or mutations. Current state-of-the-art NER methods rely on pre-defined features that try to capture the specific surface properties of entity types, properties of the typical local context, background knowledge, and linguistic information, which is a data- and computation-intensive process. In this work we explore and compare two modes of bio-entity recognition: entity-agnostic and entity-specific. The entity-specific method is a fully supervised method that depends entirely on SABIO-RK entity definitions, whereas the entity-agnostic method can use other biochemical databases in addition to SABIO-RK, at the cost of more feature engineering than the entity-specific method. We used the state-of-the-art LSTM-CRF approach for both methods. The overall performance of the entity-specific method is significantly better than that of the entity-agnostic method. We also found that precision is lower in the entity-agnostic method, whereas recall is lower in the entity-specific method. In future work we may combine the two into a hybrid method in order to use more data sources and enrich our system.

IMGT®, the international ImMunoGeneTics information system®, http://www.imgt.org, is the global reference in immunogenetics and immunoinformatics. IMGT® is a high-quality integrated knowledge resource specialized in the immunoglobulins (IG), T cell receptors (TR) and major histocompatibility (MH) of vertebrates, and in the immunoglobulin superfamily (IgSF), MH superfamily (MhSF) and related proteins of the immune system (RPI) of vertebrates and invertebrates. The study of the IG and TR repertoires in normal and pathological conditions is a challenge due to the huge diversity of the variable domains (10^12 specificities per individual). Since 2010, IMGT® has developed IMGT/HighV-QUEST, so far the only online tool available on the Web for the direct analysis of complete IG and TR variable domains from NGS nucleotide (nt) rearranged sequences of vertebrate species. IMGT/HighV-QUEST analyzes up to 500,000 sequences per run, with the same degree of resolution and high-quality results as IMGT/V-QUEST and IMGT/JunctionAnalysis, against the IMGT reference directories. IMGT/HighV-QUEST uses the IMGT unique numbering, identifies the V, D and J genes in rearranged IG and TR sequences and, for the IG, characterizes the nt mutations and amino acid (AA) changes resulting from somatic hypermutations. The tool integrates IMGT/JunctionAnalysis for the characterization of V-D-J or V-J junctions, and IMGT/Automat for a complete sequence annotation with the delimitation of the IMGT description labels. The IMGT/HighV-QUEST statistical analysis, which allows the identification and characterization of clonotypes, can analyse up to 1 million IMGT/HighV-QUEST results. In IMGT, the clonotype, designated as ‘IMGT clonotype (AA)’, is defined by a unique V-(D)-J rearrangement and a unique CDR3-IMGT AA junction sequence. Several ‘IMGT clonotypes (nt)’ may correspond to one ‘IMGT clonotype (AA)’. A new optional functionality, "Analysis of single chain Fragment Variable (scFv)" sequences, has been added.
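
The clonotype definition lends itself to a simple sketch: group annotated rearrangements by V gene, J gene and AA junction, collecting the distinct nt junctions beneath each key. This toy example only mirrors the definition above, not the IMGT/HighV-QUEST implementation.

```python
def aa_clonotypes(rearrangements):
    """Group annotated rearranged sequences into AA-level clonotypes.
    Each clonotype is keyed by V gene, J gene and the CDR3 junction at
    the amino-acid level; the distinct nucleotide junctions collected
    under each key correspond to the nt-level clonotypes. Toy sketch
    only, not the IMGT/HighV-QUEST implementation."""
    clonotypes = {}
    for r in rearrangements:
        key = (r["v_gene"], r["j_gene"], r["junction_aa"])
        clonotypes.setdefault(key, set()).add(r["junction_nt"])
    return clonotypes

# Two reads with synonymous nt junctions: one AA clonotype, two nt clonotypes.
reads = [
    {"v_gene": "IGHV1-2", "j_gene": "IGHJ4",
     "junction_aa": "CARDYW", "junction_nt": "tgtgcaagagattactgg"},
    {"v_gene": "IGHV1-2", "j_gene": "IGHJ4",
     "junction_aa": "CARDYW", "junction_nt": "tgtgctagagattactgg"},
]
```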

The ultimate aim of all scientific endeavours in the biological sciences is to completely understand an organism as a ‘system’, mandating efficient access to, and integration of, the ever-increasing data published in research articles. However, due to its inherent nature, this data is not readily amenable to indexing in a database wherein it can be efficiently queried and correlated. Here, we attempt to demonstrate the flexibility of the data curation models developed for experimental data digitization in the Manually Curated Database of Rice Proteins (MCDRP). The manual literature-curation workflow adopted exploits various in-house developed logical data models, which split the experimental details of each data point into small logical units. Every data point of experimental data is associated with several pieces of information, such as gene ID, growth conditions, plant type, etc., depicted by a structured collection of standard notations in MCDRP. These models have been utilized to digitize >90,000 data points from over 15,000 published experimental data sets originating from over 150 different types of experimental techniques. The digitized data has high granularity and ease of access while being amenable to semantic integration. In addition, integration of digitized data contained in ~500 research articles identified several traits that are regulated by one or more rice proteins, resulting in a complex network of associations. The current release of data has around 840 different traits mapped onto ~394 rice proteins, out of which around 286 traits are associated with more than one rice protein. Of these 394 trait-regulatory proteins, physical interaction data has been digitized for 76 proteins in MCDRP. Integration of the digitized protein-trait/function association data and protein-interaction data into a single model provides probabilistic functional gene networks. Analysis of these networks indicates several putative and as yet unknown functional associations between rice proteins.

UniProtKB is a well-known hub providing free access to protein sequences along with their functional annotation. There are two major sections of UniProtKB: Swiss-Prot, comprising manually annotated (reviewed) proteins, and TrEMBL, containing unreviewed proteins. Manual annotation involves searching for and analysis of relevant scientific literature by a trained expert curator, assisted by bioinformatics tools. This is the source of high-quality information enriching our knowledge of proteins and their function in living systems. Even though expert curation is an essential part of the UniProtKB database release cycle, with the exponential increase in the number of available sequences it cannot cover each and every protein record. As of release 2018_11, UniProtKB contained more than 137 million proteins, of which only about 550,000 (0.4%) were manually reviewed. To bridge the gap we have designed computational approaches and developed a stack of pipelines allowing us to propagate annotations to unreviewed protein entries, based on the similarity of their features with those of corresponding reviewed protein sets. We use two primary types of pipeline: fully automatic and semi-automatic. Fully automatic pipelines do not require manual intervention, and either use pre-configured sets of rules and conditions to be checked before propagating annotation, or generate rules on the fly (SAAS). The semi-automatic UniRule pipeline, by contrast, requires an expert curator to prepare sets of rules in advance before the pipeline is run. UniProt aims to share not only pre-applied annotations, but also annotation strategies. To this end we are developing an open-source rule engine, UniFire, enabling external researchers and commercial companies to run rules on their own protein sequences, in order to add extra knowledge to either newly discovered or privacy-protected protein sequences.
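
The rule-based propagation idea can be sketched as follows, assuming a rule is a set of conditions plus the annotations to propagate when all conditions hold. The rule structure, field names, and the example rule itself are illustrative, not the actual UniRule format.

```python
def apply_rule(rule, entry):
    """Propagate a rule's annotations to an unreviewed entry when all of
    the rule's conditions hold. A minimal sketch of rule-based annotation
    propagation; the rule structure and field names are illustrative,
    not the actual UniRule format."""
    if all(condition(entry) for condition in rule["conditions"]):
        return {**entry, "annotations": entry["annotations"] | rule["annotations"]}
    return entry

# Hypothetical rule: eukaryotic entries matching the Pfam protein kinase
# domain signature (PF00069) receive the 'protein kinase activity' GO term.
kinase_rule = {
    "conditions": [
        lambda e: "PF00069" in e["signatures"],
        lambda e: e["taxon"] == "Eukaryota",
    ],
    "annotations": {"GO:0004672"},
}
```

The real pipelines check far richer conditions (taxonomic scopes, sequence features, signature combinations), but the propagate-when-conditions-hold shape is the same.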

64 Linking chemical mentions to Medical Subject Headings in Full Text

Islamaj, Rezarta
National Center for Biotechnology Information

The increased rate of published biomedical research in PubMed has resulted in a pressing need for automated methods to improve accessibility through accurate information retrieval and extraction. Chemicals and drugs are an important class of Medical Subject Headings (MeSH) terms that are assigned to each article during indexing at the National Library of Medicine (NLM). Accurate identification of chemical names has significant medical applications, helping scientists better understand the usage and interactions of chemicals with other molecular entities, e.g. in drug development research. Moreover, correct identification of chemical entities can help link information retrieved from publications in disparate disciplines, e.g. from chemistry to medicine, biology and pharmacology. Chemical names found in the biomedical literature do not follow a common format, however. While the International Union of Pure and Applied Chemistry (IUPAC) has clear rules for chemical nomenclature, authors frequently refer to chemical compounds using names which do not follow the naming standards. To develop a tagger able to accurately identify chemicals both in article abstracts and in full text, we need a high-quality corpus that is representative of the chemical terminology in PubMed Central. In this work, we present a study of annotating chemicals in full-text articles. Our articles are selected from the PubMed Central Open Access subset and were doubly annotated for chemical entities by 10 expert NLM indexers. The annotators had high inter-annotator agreement, 82.9% and 77.4% on chemical mention and concept mapping, respectively, and all discrepancies were discussed until 100% agreement was achieved. The corpus currently contains 45 articles and ~2,000 unique chemical names, mapped to approximately 800 MeSH identifiers. Our evaluation of chemical entity recognition using the new dataset has shown improvements of 18% in precision and 20% in recall in TaggerOne.
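
One common way to compute such mention-level agreement is an F1-style overlap between the two annotators' span sets, sketched below. The exact metric used for this corpus is not specified in the abstract, so treat this as an illustration; the span tuples are hypothetical.

```python
def agreement(ann_a, ann_b):
    """F1-style overlap between two annotators' mention sets, where each
    mention is a (start, end, text) span. A common proxy for pairwise
    inter-annotator agreement; the exact metric used for the corpus is
    not specified in the abstract."""
    matched = len(ann_a & ann_b)  # spans both annotators marked identically
    return 2 * matched / (len(ann_a) + len(ann_b))

# Toy example: two annotators agree on two of their chemical mentions.
annotator_a = {(0, 7, "aspirin"), (20, 29, "ibuprofen"), (40, 47, "ethanol")}
annotator_b = {(0, 7, "aspirin"), (20, 29, "ibuprofen"), (50, 58, "caffeine")}
```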

The Mouse Models of Human Cancer Database (MMHCdb; http://tumor.informatics.jax.org; formerly known as the Mouse Tumor Biology (MTB) database) provides users with online access to data regarding mouse models of human cancer, encompassing genetically engineered mouse models, inbred strain models, and Patient-Derived Xenograft (PDX) models. Information in MMHCdb is obtained from curation of peer-reviewed scientific publications and from direct data submissions from individual investigators and large-scale programs. Enforcement of standard gene and strain nomenclature and use of controlled vocabularies within MMHCdb enables complete and accurate searching of the published literature for relevant mouse models. MMHCdb was established in 1997 and, over the years, has undergone several revisions of both its database structure and web interface. We recently undertook a comprehensive web redesign to provide better visualization tools and simplify the presentation of search results, focusing on reducing textual redundancy. Reorganized displays for tumor frequency and strain cohort results now provide the user with cleaner, more informative layouts. Users now also have the option to send their found data set to the faceted browser for further data exploration. Addition of a tumor spectrum summary for each strain cohort enhances the utility of the search results page. Granular details for each tumor frequency and strain cohort record are still available by clicking through to the detail pages. The Quick Search functionality, shown at the top of each web page, was simplified and now allows users to query by tumor, strain cohort, genetic, or reference information in one simple field. MMHCdb is supported by NCI grant CA089713.

66 Finding the data in research publications

Levchenko, Maria
EMBL-EBI

Capturing relevant information and data from publications is central to the biocuration workflow. But finding the specific details needed for expert curation can be a challenge. Supporting data can be found in Data Availability statements, as accession numbers from community databases, or attached as supplemental files. Cross-checking these sources can be time-consuming but is a vital part of the job. Here we present new developments in the open research literature database Europe PMC that improve access to scientific data. Europe PMC contains over 5M full-text articles, as well as 35M abstracts, for journal articles, preprints and other documents [1]. To simplify access to the research data pertaining to a study, Europe PMC gathers all data linked to a paper in a single view. It includes links to knowledgebases that have curated the article, as well as to the data on which the paper is based. We discover data mentions by daily text mining for accession numbers for over 40 life science databases and for data DOIs [2]. For such articles, and those with supplemental data, we generate a BioStudies record [3]. The BioStudies database links together all the data behind the paper, making it into a citable stand-alone record, and we are working with journals to make it a simple way to cite disparate data early in the publication process. The literature-data links in Europe PMC can be explored using advanced filters that enable searches for particular datasets or data types, like protein structures or clinical trials. In addition, granular search allows users to restrict searches to particular article sections, like figure legends or results. By combining these functionalities users can get the most relevant results for complex queries, like finding accession numbers for proteomic studies cited in the data availability statements of research publications. In this presentation we plan to show how various tools within the Europe PMC suite can help map out the way to supporting data and assist curation practices.
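
A fielded query of the kind described can be composed as below. The field names follow the style of Europe PMC's advanced-search syntax, but treat them as illustrative and consult the Europe PMC search documentation for the exact supported fields.

```python
def epmc_query(terms):
    """Compose a fielded Europe PMC search query string from a list of
    (field, value) pairs. The field names used here follow the style of
    Europe PMC's advanced-search syntax but are illustrative; check the
    search documentation for the exact supported fields."""
    return " AND ".join(f'{field}:"{value}"' for field, value in terms)

# e.g. papers citing a PDB accession whose methods mention X-ray diffraction
query = epmc_query([
    ("ACCESSION_TYPE", "pdb"),
    ("METHODS", "X-ray diffraction"),
])
```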

67 A deterministic algorithm to lay out reactions with nested compartments

Lorente, PascualEuropean Bioinformatics Institute

Reactome (https://reactome.org) is a free, open-source, open-data, open-graphics, curated and peer-reviewed knowledgebase of biomolecular pathways. Reactions are the basic building blocks and pathways are the result of concatenating two or more reactions. Both pathways and reactions are considered events and are organised hierarchically. Navigating through different events is a common exercise to further understand the studied phenomena. Previously, we placed existing regular pathway diagrams and textbook-style illustrations [1] on the pathway detail pages, leaving single reactions as the only events without a self-contained image. Therefore, an automatic algorithm to deterministically lay out reactions was developed. It uses data directly from the database to generate images without human intervention. The algorithm's strategy places inputs on the left, outputs on the right, catalysts on top and regulators at the bottom. Each element is placed in its corresponding compartment, and compartments are nested following the Gene Ontology hierarchy [2]. The main initial requirements were to (i) support disease reactions, where some participants have to be crossed out, (ii) minimize space and (iii) avoid lines crossing elements unnecessarily in the display. Finally, the result complies with the Systems Biology Graphical Notation (SBGN) [3].

References:
1. Sidiropoulos, K., Viteri, G., Sevilla, C., Jupe, S., Webber, M., Orlic-Milacic, M. et al. (2017). Reactome enhanced pathway visualization. Bioinformatics. 33(21): 3461-3467.
2. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M. et al. (2000). Gene ontology: tool for the unification of biology. Nat Genet. 25(1): 25-9.
3. Le Novère, N., Hucka, M., Mi, H., Moodie, S., Schreiber, F., Sorokin, A. et al. (2009). The Systems Biology Graphical Notation. Nat Biotechnol. 27(8): 735-741.

Acknowledgement: Research reported in this publication was supported by the NIH/NHGRI under Award Number U41HG003751.
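The placement strategy (inputs left, outputs right, catalysts top, regulators bottom) can be sketched as a toy function. This is a simplification under stated assumptions: the real Reactome algorithm also handles nested compartments, crossed-out disease participants and line-crossing avoidance, none of which appear here, and the coordinates are illustrative:

```python
def layout_reaction(inputs, outputs, catalysts, regulators,
                    width=400, height=300):
    """Toy deterministic layout: inputs on the left edge, outputs on the
    right, catalysts on top and regulators at the bottom."""
    def spread(names, fixed, axis):
        # Distribute participants evenly along one edge of the canvas.
        step = (height if axis == "y" else width) / (len(names) + 1)
        coords = {}
        for i, name in enumerate(names, start=1):
            pos = i * step
            coords[name] = (fixed, pos) if axis == "y" else (pos, fixed)
        return coords

    pos = {}
    pos.update(spread(inputs, 0, "y"))           # left edge
    pos.update(spread(outputs, width, "y"))      # right edge
    pos.update(spread(catalysts, 0, "x"))        # top edge
    pos.update(spread(regulators, height, "x"))  # bottom edge
    return pos

coords = layout_reaction(["ATP", "glucose"], ["ADP", "G6P"],
                         ["hexokinase"], [])
```

Because the inputs are ordered and the arithmetic is fixed, the same reaction always yields the same picture, which is the essence of a deterministic layout.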

Elucidating transcriptional regulatory interactions between transcription factors (TFs) and genes in bacteria is essential for understanding the mechanisms the cell uses to survive in different environmental conditions. These interactions are extracted from the literature by traditional manual curation and then stored in biological databases. However, the biomedical literature grows every day, making traditional curation time-consuming and demanding. Thus, the development of automatic approaches to assist the curation of these interactions is relevant. Here, we present an automatic method to extract regulatory interactions from the literature based on the following patterns associated with regulatory verbs: active and passive voice ("regulates", "is regulated by"), attributive expressions ("TF-regulated"), and deverbal nouns ("regulation", "regulator"). To extract regulatory interactions from active and passive voice we employed Open Information Extraction. This technique generates a set of triplets of related syntactic elements, such as (subject, verb, object), which is used to extract regulatory interactions. For attributive and deverbal expressions, we used manually defined textual patterns to extract interactions. Our method extracts regulatory interactions, including the regulatory effect (activation, repression, regulation), with good performance (F-score: 0.70) on a data set of curated sentences on Escherichia coli provided by the RegulonDB team. An advantage of our approach is the reliability of the extracted regulatory interactions (Precision: 0.81). However, we recovered only 63% of the expected interactions (Recall: 0.63). We anticipate our work to be a starting point for assisting the curation of an article collection on Salmonella to expand the curation work of the RegulonDB team.
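A minimal sketch of the pattern-based side of such an approach, with regular expressions standing in for the full Open Information Extraction pipeline (the verb lists, entity handling and sentence splitting are all simplified assumptions):

```python
import re

# Toy patterns for the active-voice, passive-voice and attributive forms
# described in the text. Entity recognition is faked with \w+ groups.
PATTERNS = [
    (re.compile(r"(\w+) (activates|represses|regulates) (\w+)"), "active"),
    (re.compile(r"(\w+) is (activated|repressed|regulated) by (\w+)"), "passive"),
    (re.compile(r"(\w+)-regulated (\w+)"), "attributive"),
]

def extract_interactions(sentence):
    """Return (TF, effect, gene) triplets found in one sentence."""
    found = []
    for pattern, kind in PATTERNS:
        for m in pattern.finditer(sentence):
            if kind == "active":
                tf, effect, gene = m.group(1), m.group(2), m.group(3)
            elif kind == "passive":
                gene, effect, tf = m.group(1), m.group(2), m.group(3)
            else:  # attributive: "TF-regulated gene"
                tf, effect, gene = m.group(1), "regulated", m.group(2)
            found.append((tf, effect, gene))
    return found
```

For example, `extract_interactions("lacZ is repressed by crp")` normalises the passive construction so that the TF always comes first in the triplet.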

The MetaboLights database is an international metabolomics repository recommended by many leading journals, including Nature, PLOS and Metabolomics. The service's unique manual curation maintains quality, provides helpful support for users and ensures accessibility for secondary analysis of studies. MetaboLights hosts a wealth of cross-species, cross-technique, open-access experimental research. As part of our ongoing efforts to streamline the study submission and curation process, the MetaboLights team at EMBL-EBI has developed a new tool to edit and submit studies online. The tool provides MetaboLights users and curators with an intuitive and easy-to-use interface to create, edit and annotate their studies online. The convenient, context-aware editor guides curators and users through the study to define a rich description of the experimental metadata, including study characteristics, protocols, technology and related factors. Metadata descriptions are enhanced by mapping this information to controlled ontology repositories using ZOOMA. Capturing such a complete data set benefits the community by making results findable, reproducible and reusable. Going forward, we plan to incorporate text-mining tools such as Named Entity Recognition (NER) to annotate metadata, enabled by the robust architecture of the online editor. Other plans include offline editing support and direct channels for curators to contact and communicate with submitters, making the whole process of data curation more submitter-friendly.

70 Leveraging curation efforts about discarded data: proposal of a new resource to report discarded data, stemming from the case of Bgee transcriptomics annotations

Niknejad, AnneSIB Swiss Institute of Bioinformatics - Department of Ecology and Evolution, University of Lausanne

When annotating samples or papers, biocurators are experienced at assessing the quality of the data and spotting inconsistencies in the available information, in order to include only reliable annotations in their workflow. A large amount of the information reviewed by biocurators is thus discarded for various reasons, from human errors when submitting data to low-quality experiments producing untrustworthy results. An enormous part of biocurators' work on assessing data quality and consistency is thus never made public, and is lost to the life science community. For the development of the gene expression database Bgee (https://bgee.org), we have manually curated thousands of transcriptomics datasets. We have decided to make publicly available the information we produced about discarded samples. Assessing the validity of such datasets requires studying the related papers and raw data with great scrutiny; as a result, a typical user of transcriptomics resources will most likely never be aware of some of the flaws we discovered. In this talk, we will present this newly public information, covering the errors we discovered in transcriptomics datasets, e.g., samples discarded by the authors in their paper but submitted to public repositories, or inconsistent metadata raising doubts about the submitted information. More importantly, we want to address the question of how to make available this vast amount of work done by any biocurator. We would like to develop a resource, in collaboration with the biocuration community, allowing users to retrieve information about raw data or papers that have already been reviewed by a biocurator, so that they can be aware of the limitations or errors identified. In addition to presenting our resource on discarded transcriptomics samples, this talk would be a great platform to present our ideas, obtain feedback, and launch such a collaborative tool for biocuration.

The Biological General Repository for Interaction Datasets (BioGRID, see thebiogrid.org) is an open-access database resource for protein, genetic and chemical interaction data, as manually curated from the literature for human and other major model organisms. As of December 2018, BioGRID contains over 1,650,600 interactions captured from more than 57,650 publications. The recent development of genome-wide knockout libraries based on CRISPR/Cas9 technology has enabled many high-throughput genetic screens in cell lines. To capture these results, a newly developed aspect of BioGRID captures gene-phenotype relationships from genome-wide CRISPR/Cas9 screens, as well as CRISPR/Cas9-derived genetic interactions from either screens or focused experiments. This new resource, called the Open Repository of CRISPR Screens (ORCS, see orcs.thebiogrid.org) currently houses over 500 curated genome-wide screens performed in 417 human or mouse cell lines. A minimal information about CRISPR screens (MIACS) record structure was developed to represent key CRISPR/Cas9 screen parameters. ORCS serves as a unified resource for CRISPR/Cas9 datasets and provides a flexible interface for searching, filtering and comparing screen datasets. To maintain consistency with the original publications, ORCS reports published screen scores according to original scoring algorithms. Results are displayed at the publication-, screen- and gene-level with original scores and significance thresholds, along with associated analytical methods and other metadata. Current screen formats in ORCS include negative and positive selection based on viability and other phenotypes in conjunction with knockout (CRISPRn), transcriptional activation (CRISPRa) or transcriptional inactivation (CRISPRi) library designs. All data are freely available for download in various standardized formats and also as the original supplementary files associated with the publication. This project is supported by NIH R01OD010929 to MT, KD.
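A "minimal information" record structure such as MIACS can be pictured as a small typed record; the sketch below is purely illustrative (the field names and example values are invented, not the published MIACS schema):

```python
from dataclasses import dataclass, field

@dataclass
class CrisprScreen:
    """Sketch of a minimal-information record for a CRISPR screen.
    Field names are illustrative assumptions, not the MIACS spec."""
    publication: str
    cell_line: str
    library_type: str        # e.g. "CRISPRn", "CRISPRa" or "CRISPRi"
    selection: str           # e.g. "negative" or "positive"
    phenotype: str
    scoring_method: str      # original algorithm from the publication
    significance_threshold: float
    hits: dict = field(default_factory=dict)  # gene -> published score

screen = CrisprScreen("example-publication", "HeLa", "CRISPRn",
                      "negative", "viability", "example-score", 0.05)
screen.hits["GENE1"] = -3.2
```

Keeping the published score and threshold alongside the scoring method mirrors ORCS's policy of reporting results as the original authors computed them.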

72 Visualizing protein residue conservation in InterPro and PDBe

Paysan-Lafosse, TyphaineEBI

The goal of the Genome3D project is to help biologists understand how a protein functions by providing predicted macromolecular structures for hundreds of thousands of structurally uncharacterised protein sequences. InterPro provides functional analysis of proteins by classifying them into families and predicting domains, repeats and important sites. The Protein Data Bank in Europe (PDBe) manages the worldwide macromolecular structure archive for the Protein Data Bank (PDB). As part of the project, we have developed mechanisms to display structural data in a way that allows biologists to view sequence features, such as conserved residues, in a structural context. The residue conservation scores are obtained from multiple sequence alignments, produced by searching the sequence from the structure against the UniProtKB database using HMMER. When the pattern of amino acid conservation is overlaid on a structure, the often-disparate regions of high conservation may be brought together to form 'hot spots' of conservation indicating a functionally important region, such as a ligand binding site. The visualization of protein residue conservation will be available on the InterPro 7 website and as part of PDBe-KB, a community-driven resource managed by the PDBe team that provides functional annotations and predictions for structure data in the PDB archive, via REST API and web pages.
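As an illustration of per-column conservation scoring from a multiple sequence alignment, here is a Shannon-entropy-based sketch; this is a common choice, but the actual scoring used in the Genome3D/InterPro pipeline from HMMER alignments may differ:

```python
from collections import Counter
from math import log2

def column_conservation(column):
    """Conservation of one alignment column, computed as 1 minus the
    normalised Shannon entropy over the residues present (gaps ignored)."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    entropy = -sum((n / total) * log2(n / total) for n in counts.values())
    return 1.0 - entropy / log2(20)  # 20 amino acids -> max entropy

def conservation_profile(alignment):
    """Score every column of a list of equal-length aligned sequences."""
    return [column_conservation(col) for col in zip(*alignment)]

msa = ["ACDE", "ACDF", "ACEG"]
profile = conservation_profile(msa)  # first two columns fully conserved
```

Columns scoring near 1.0 are the candidates that, once mapped onto the 3D structure, may cluster into the conservation 'hot spots' described above.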

The Vertebrate Genomes Project (VGP, vertebrategenomesproject.org) aims to generate near error-free reference genome assemblies of species from all 260 vertebrate orders and, ultimately, all 66,000 vertebrates. These genomes will be used to address fundamental questions in biology and disease, to identify the species most genetically at risk of extinction, and to preserve the genetic information of life. Reference genomes are differentiated from draft genomes by their low number of gaps, low number of errors, and high percentage of sequence assembled into chromosomes. The advance of long-read sequencing and high-throughput mapping technologies, together with the development of appropriate assembly algorithms, has vastly improved the generation of de novo assemblies at this scale. However, these assemblies still suffer from major structural issues caused by artificial duplications, misjoins and missed joins. Manual curation is used to correct these errors. With the help of the genome evaluation browser gEVAL (http://vgp-geval.sanger.ac.uk/index.html), issues can be detected and resolved to produce significantly improved assemblies in a manual yet efficient, high-throughput manner within a very limited time-scale. The tools used and the resulting assembly improvements will be presented.

Archival curation of the scientific literature is a key source of information for bioinformatics repositories. However, manual curation is a costly process that cannot keep pace with the number of papers continuously published. Biocurators face the challenges of identifying publications containing relevant data among millions of publications and of prioritizing data extraction to maximize impact and relevance for the scientific community. The IntAct database (www.ebi.ac.uk/intact), along with fellow members of the IMEx Consortium (www.imexconsortium.org), extracts physical molecular interaction (PMI) data from the literature and makes it publicly available. IMEx curation follows a highly detailed curation model to manually record experimental evidence of PMIs. This strategy provides a rich and accurate representation of the evidence, but it is particularly time-consuming, making the publication identification and prioritization tasks extremely important. To support IMEx curation efforts, we have developed a strategy to explore the 'Dark Space' of the literature: non-curated publications that may contain relevant PMIs. Using as a basis text-mining data sets generated in house or provided by other resources such as EVEX or STRING, we integrated related PMI resources, such as GO or BioGRID, and non-PMI data from pathway databases, such as Reactome or OmniPath. We defined features using annotations from these data sources and then used a random forest-based algorithm to infer scores that help to identify how likely a publication is to contain curatable PMIs and how well characterized the interacting molecules reported in it are. This has allowed us to generate a triaged list of publications used by IMEx to prioritize curation. Our scoring can identify publications containing PMIs with close to 90% accuracy. We plan to extend and refine the algorithm, making it an integral part of curation coordination at the IMEx Consortium.
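A toy version of this triage idea, sketched with scikit-learn's random forest (the feature names, counts and labels below are invented for illustration; the real feature set is derived from the annotation sources named in the abstract):

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-publication features:
# [GO interaction-term annotations, BioGRID mentions, pathway-db overlap]
X_train = [
    [5, 3, 2], [4, 2, 3], [6, 5, 1],   # publications curated with PMIs
    [0, 0, 0], [1, 0, 1], [0, 1, 0],   # publications without PMIs
]
y_train = [1, 1, 1, 0, 0, 0]

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Score an uncurated "Dark Space" publication: the probability of the
# positive class acts as the curation-priority score.
score = model.predict_proba([[4, 1, 2]])[0][1]
```

Ranking uncurated publications by this score yields a triaged list of the kind used to prioritize IMEx curation.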

75 MANE Select: a set of matched representative transcripts from NCBI-RefSeq and EMBL-EBI GENCODE gene sets for every human protein-coding gene.

Punar, ShashikantNational Center for Biotechnology Information, National Library of Medicine, National Institutes of Health

Over the last two decades, the RefSeq project at NCBI and the Ensembl GENCODE project at EMBL-EBI have provided independent, high-quality reference datasets describing the human gene complement, which serve as a reliable foundation for genomic research. To address the need for convergence on key high-value annotations, we have initiated a new collaborative project called Matched Annotation from NCBI and EBI (MANE), which provides a matched set of transcripts (same coding region, UTRs, and start and end coordinates) for human protein-coding genes. As a first step in this project, we introduce the MANE Select set, which aims to define a representative transcript per gene that is identical in RefSeq and Ensembl GENCODE. In the MANE Select methodology, independent pipelines at NCBI and EBI, complemented by expert curation, first pick a representative transcript per protein-coding gene from their respective gene sets. We utilized various datasets such as RNA-Seq expression levels, conservation data from PhyloCSF, prior curation, transcript and protein length, and concordance with the UniProt canonical isoform. In the second step, the selected transcripts were updated to the same start and end coordinates, which were determined from FANTOM5 CAGE data and PolyA-seq data from conventional and next-generation sequencing methods. The MANE Select set is expected to be integrated across different public genome resources and to simplify transcript choice for researchers who use transcript data for comparative genomics, clinical reporting, and basic research. All transcripts in this set will align perfectly to the GRCh38 assembly. This presentation will describe the methodology and the challenges of leveraging various types of data to pick the MANE Select transcript. This work was supported in part by the intramural research program of the National Library of Medicine (NIH), and by grants from the Wellcome Trust and the National Human Genome Research Institute.
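The matching criterion (same coordinates, exon structure and coding region) can be sketched as a simple record comparison; the field names and coordinates below are illustrative assumptions, not the actual RefSeq/GENCODE data model:

```python
def transcripts_match(refseq, ensembl):
    """Toy MANE-style comparison: two transcripts 'match' when their
    genomic span, exon structure and coding region are identical.
    Field names here are illustrative, not an official schema."""
    keys = ("chrom", "start", "end", "exons", "cds")
    return all(refseq[k] == ensembl[k] for k in keys)

nm = {"chrom": "chr7", "start": 100, "end": 900,
      "exons": ((100, 300), (500, 900)), "cds": (150, 850)}
enst = dict(nm)               # identical annotation -> a MANE-style match
enst_alt = dict(nm, end=950)  # different 3' end -> no match
```

In the real methodology, transcripts that fail such a comparison are reconciled in the second step by updating both records to shared end coordinates before the pair enters the MANE Select set.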

Bioinformatics data resources, such as the European Nucleotide Archive (ENA; https://www.ebi.ac.uk/ena), provide a key foundation for global bioinformatics operations. With ever-growing flows of data, quality control and curation of content rely on the engagement of data providers, community standards developers and those who build data exploration tools. Here, in collaboration with the developers of the BlobToolKit genome assembly contamination visualisation framework, we provide a connection between incoming ENA data and visualisation tools that support the community in understanding cross-species contamination in assemblies. Contamination in public genome assembly data represents a major challenge for those wishing to exploit in full the wealth of information available from these data. The presence of additional species during genome sequencing and production can lead to the inclusion of non-target sequences, causing severe errors in analysis, such as claims of horizontal gene transfer when the real issue is contamination. Without the removal or masking of this contamination, such errors will continue to be made, potentially affecting many later steps in the interpretation of the natural world. Through the integration into the ENA of workflows from BlobToolKit, developed by Mark Blaxter and Richard Challis at the University of Edinburgh, we aim to give data providers and consumers access to the visualisation tools needed to discover contamination and to control, or control for, its occurrence in public genome assembly data. BlobToolKit separates the different organisms within assembly data by providing an interactive visualisation, facilitating the exploration of assembly data and thus the production of high-quality assemblies. BlobToolKit interactive reports will be publicly hosted by ENA alongside assembly data, providing users, researchers and other affiliates with simple access to assembly quality control.

77 Bgee 14.1: providing full access to curated expression calls and their source data for 29 species

Robinson-Rechavi, MarcUniversity of Lausanne and SIB

Bgee provides gene expression expertise through a database, a webserver, and an R package.

Bgee 14.0 integrates RNA-Seq, Affymetrix, in situ hybridization, and EST data from 29 animal species. Bgee 14.1 adds 2882 new RNA-Seq libraries for these 29 species. This notably allows us to cover a large anatomical diversity in non-model organisms, for example with 250 (up from 15) libraries in chimpanzee, or 233 (up from 8) in horse. This in turn increases, for these species, the power of our gene expression ranking over organs and of TopAnat, our enrichment test of gene lists over anatomy. The update also includes new libraries for model organisms, e.g., 666 for human and 249 for fruit fly. All these data are manually curated for healthy wild-type conditions and annotated to the Uberon anatomy and to developmental ontologies. The major new functionality in Bgee 14.1 is that, for each gene and condition, we provide access to all the underlying primary data, especially the calls from each data type and data source (e.g., RNA-Seq library, Affymetrix probe and experiment), with external links to all the source data in NCBI GEO, EBI ArrayExpress, or Model Organism Databases. This is an important functionality for expert users, such as curators from other databases who reuse information from Bgee, and allows us to provide the appropriate citation for each data source at a fine granularity. Bgee is available at http://bgee.org

The quantity and complexity of three-dimensional (3D) volume-data depositions from electron cryo-microscopy (Cryo-EM) and related techniques in the Electron Microscopy Data Bank (EMDB) and the Electron Microscopy Public Image Archive (EMPIAR) are soaring. High-resolution single-particle analysis and sub-tomogram averaging studies usually yield maps that are interpretable through atomistic models, which can be archived in the Protein Data Bank (PDB). However, at lower resolution, and especially in whole-cell tomograms, this is typically not possible. Instead, components of the sample can be delineated (segmented) and identified (e.g. outer membrane, actin filaments, polysomes). To enable proper integration of such data with other bioinformatics resources (e.g. PDB, UniProt, GO), segmentations and their biological identifications need to be captured. To make this possible, we have developed a number of web-based interactive tools designed to work with an open file format called the EMDB Segmentation File Format (EMDB-SFF). EMDB-SFF captures EM map segmentations and biological annotations from public resources. The Segmentation Annotation Tool (SAT) and its associated toolkit enable depositors to add structured biological annotation to EMDB-SFF segmentations, linking segments to ontologies and other authoritative bioinformatics resources. In the future, it will be possible to deposit such annotated segmentations in EMPIAR or EMDB and to view them in the Volume Browser, allowing users to explore semantic relationships between entries. EMDB-SFF also supports transforms between sub-tomogram averages and tomograms, providing further information on the organisation of, and interactions within, the cellular environment.

79 ImexCentral: A platform for coordinated curation of interaction data within The IMEx Consortium.

Salwinski, Lukasz
UCLA

The International Molecular Exchange (IMEx) Consortium, a long-term collaboration between the major public protein interaction data providers, coordinates the curation, integration and dissemination of a high-quality, non-redundant set of experimentally demonstrated protein interactions. The ImexCentral platform tracks the progress of individual publications through the entire curation pipeline, from the selection of potential curation targets all the way to the release of the curated interaction records. Apart from preventing redundant curation of the same publication by more than one consortium partner and assigning a unique, consortium-wide identifier to each interaction record, the ImexCentral site provides a set of tools aimed at improving the efficiency of selecting and screening potential curation targets. These include (a) a public, browser-based interface that any member of the research community can use to request curation of a specific publication, (b) a flexible scoring framework that can be used to incorporate and combine external curation priority scores, (c) publication record attachments that can be used to exchange or archive supplementary data files related to the curation process, and (d) a publication record comment and watch system that provides a means to document curation progress and simplifies collaborative curation.

ImexCentral: https://imexcentral.org/icentral
IMEx Consortium: https://imexconsortium.org/about-imex

80 Feature-Viewer, a visualization tool for positional annotations on sequences

SCHAEFFER, MathieuCALIPHO group - SIB

As a web platform and database of human protein annotations, neXtProt aims to help make sense of the wealth of data relevant to these proteins. We therefore provide our users with viewers to visualize some of our data in a more user-friendly way. These are developed as modular, generic components so as to make them reusable by the life sciences community, independently of neXtProt. With the large spectrum of positional features annotated on protein sequences (e.g., disulfide bonds, post-translational modifications, topological domains, variants) comes the need for a global and understandable graphical view. We developed an interactive feature viewer that can display all kinds of features along a protein. This viewer is generic, lightweight, and interactive. It includes different options, an intuitive zoom system, multiple shapes to represent features, and tooltips. Furthermore, this viewer is not limited to proteins but can also be used to display other kinds of sequences, such as DNA sequences. Currently used in neXtProt but also in other projects (e.g., https://cancer.sanger.ac.uk/), it can easily be integrated into any website. This component is open source and the source code is available on GitHub (http://github.com/calipho-sib/feature-viewer).

81 Reactome Icon Library

Sevilla, CristofferEMBL-EBI

Reactome (https://reactome.org) is a free, open-source, open-data, open-graphics, curated and peer-reviewed knowledge base of biomolecular pathways. In Reactome, pathways are organised hierarchically, and a series of scalable, interactive textbook-style diagrams in SVG format is provided for the higher-level pathways [1]. For graphic consistency, the Icon Library (https://reactome.org/icon-lib) was created in March 2017 and, as of December 2018, it has grown considerably to 1,250 components. Its content ranges from simple protein labels to representations of organelles, receptors and cell types. It is fully integrated into the main search, and its icons can be found by identifier, name, description, designer and/or contributor. A recent addition to the library is cross-references from icons to external resources (e.g. GO or UniProt), which (i) enable tighter integration of the library within Reactome's content and (ii) make possible the creation of mapping files from external identifiers to icon identifiers, available in the “Icons” section at https://reactome.org/download-data. The Icon Library is freely accessible (CC-BY 4.0 licence) and is suitable for a broad range of purposes, from schematic pathway sketches in scientific presentations and publications to grant proposal illustrations. Detailed guidelines are provided at https://reactome.org/icon-info to help third-party contributors grow this community resource; contributors are acknowledged as authors through a metadata file linked to a portfolio and/or ORCID iD.

References:
1. Sidiropoulos, K., Viteri, G., Sevilla, C., Jupe, S., Webber, M., Orlic-Milacic, M. et al. (2017). Reactome enhanced pathway visualization. Bioinformatics, 33(21), 3461-3467.

Acknowledgments: EMBL-EBI Core funding; research reported in this publication was supported by the National Institute Of General Medical Sciences of the National Institutes of Health under Award Number U54GM114833 (BD2K).

82 Curation and integration of single-cell RNA-Seq data for cross-study analysis and interpretation

Sponarova, JanaNEBION AG

The recent exponential increase in publicly available single-cell RNA-Seq studies, and the power of this technology for de novo discovery of tissue and disease biomarkers, led us to explore the possibility of building a manually curated and globally normalized single-cell RNA-Seq compendium. Effective utilization of such massive amounts of data is challenging and requires a thorough understanding of the experimental and computational workflows between the preparation of input cells and the output of interpretable data. We summarize our approaches to quality control, curation, data integration and user-friendly visualization of the data in GENEVESTIGATOR, a high-performance knowledge base and analysis tool for gene expression data. The tool integrates thousands of professionally curated public transcriptomic studies from human and model organisms and visualizes gene expression across different biological contexts such as diseases, drugs, tissues, cancers, cell lines or genotypes. We show that manual curation, combined with the deep integration of microarray, bulk RNA-Seq and single-cell RNA-Seq data, enhances the analysis and interpretation of one's own results, taking advantage of the wealth of experimental conditions contained in the world's transcriptomic data.

Biomedical knowledge integration could be immensely facilitated by automated text mining for relation extraction from the biomedical literature. Substantial datasets annotated with ground truth are required for benchmarking and for developing relation extraction algorithms. To assist collaborative manual curation efforts, efficient annotation tools that incorporate the annotation standard and workflow are needed to ensure the quality of the results. Here, we describe a web-based annotation tool, the Chinese Biomedical Semantic Annotation System (CBSAS), which follows a standardized annotation schema. CBSAS supports the annotation of both entities and relations, and the resulting annotated corpora can be exported in a standard format for wide adoption. CBSAS differs from the few existing tools in three ways. First, it supports the annotation of biomedical text in Chinese through a PubTator-like interface. Second, it provides annotations of both disease-centered entities and the relations among those entities. Third, the collaborative annotation roles are characterized with different functions, providing the superior annotator with detailed comparisons of the results of the two groups of independent annotators. Through a trial annotation of a Chinese biomedical semantic relation corpus with two external user groups, CBSAS was shown to improve both the efficiency and the accuracy of manual curation, especially for the superior annotator. CBSAS is available at http://hprs.imicams.ac.cn/datamark_abs_CDR/demo.

Over the last 20 years, the Gene Ontology (GO) has provided a rich and consistent vocabulary to describe and annotate biological functions across the species of the tree of life. The unique proximity between knowledge representation experts and biological curators in the GO Consortium has continuously shaped the requirements and development of the GO. A recent outgrowth of that close collaboration, Gene Ontology Causal Activity Modeling (GO-CAM), is a new semantic framework (RDF/OWL) designed to: (i) improve the expressivity and precision of biological annotations, (ii) improve the searchability of such annotations and (iii) allow the description of larger processes and pathways by linking together standard GO annotations. GO-CAMs are constructed from standard GO annotations (BP, MF and CC terms from the GO ontology) linked to one another using relations from the Relation Ontology (RO). Physiologically relevant contextual information is then added using additional ontologies such as Uberon (anatomical parts) and ChEBI (molecular entities). To support the curation of biological functions through this new semantic framework, the GO Consortium has developed Noctua, a web-based curation platform that includes a simple form interface as well as a more advanced graph editor. As of December 2018, over 2400 GO-CAMs have been produced with Noctua, while 600 are currently in development or review. In GO-CAMs, any biological statement is stored as a set of triples, “subject relation object” (e.g. ubiquitin-protein transferase activity enabled by NEDD4). GO-CAMs currently represent over 11,000 triples, and many more can be inferred through semantic reasoning. GO-CAMs can be browsed, searched, visualized and downloaded at http://www.geneontology.org/go-cam. GO now also offers a SPARQL endpoint (http://rdf.geneontology.org) to improve the interoperability of both standard GO annotations and GO-CAMs with other knowledgebases.
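The "subject relation object" storage model can be illustrated with a minimal in-memory triple store. The two triples below use the example from the text plus one assumed `part_of` link; the storage and query code are illustrative, not Noctua's or the SPARQL endpoint's implementation:

```python
# GO-CAM-style statements as (subject, relation, object) triples.
# The second triple is an assumed example for illustration.
triples = {
    ("ubiquitin-protein transferase activity", "enabled_by", "NEDD4"),
    ("ubiquitin-protein transferase activity", "part_of",
     "protein ubiquitination"),
}

def query(subject=None, relation=None, obj=None):
    """Return triples matching the given (possibly partial) pattern,
    in the spirit of a SPARQL basic graph pattern."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (relation is None or t[1] == relation)
            and (obj is None or t[2] == obj)]

hits = query(relation="enabled_by")
```

Leaving a position unbound, as SPARQL variables do, is what makes cross-model questions ("which activities does NEDD4 enable?") a single pattern match.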

85 A visualization system for navigating neurotoxic effects of chemicals in ChemDIS: Fipronil as a case study

Tung, Chun-WeiKaohsiung Medical University

The assessment of the neurotoxic effects of chemicals on human beings requires labor- and resource-intensive experiments. Since experimental assessment of the huge number of chemicals is impractical, it is desirable to develop a cost-effective and fast method for screening chemicals with potential neurotoxicity. Previously, we developed the unique ChemDIS system (https://cwtung.kmu.edu.tw/chemdis), integrating several databases including STITCH, Gene Ontology, KEGG, Reactome, SMPDB and Disease Ontology for the inference of chemical-protein-disease associations. However, an inferred effect could be either toxic or therapeutic, and this cannot be distinguished by the inference method alone. In this study, we present a novel visualization system for navigating the neurotoxic effects of chemicals. A database was developed by collecting disease-associated chemicals annotated with therapeutic or toxic effects. Chemicals were encoded as eight-dimensional feature vectors using ChemGPS-NP and plotted in a three-dimensional coordinate system. Currently, more than 500 chemicals and 15 neurological disorders have been collected. Fipronil is used as an example to demonstrate the function of the proposed system for discriminating neurotoxic effects from therapeutic ones.
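One simple way such feature vectors can discriminate toxic from therapeutic chemicals is nearest-neighbour comparison in descriptor space. The sketch below uses invented 8-dimensional vectors and labels (not values from ChemGPS-NP or the ChemDIS data), and nearest-neighbour labelling is an assumed stand-in for the system's visual navigation:

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Toy 8-dimensional ChemGPS-NP-style descriptors; vectors and labels
# are invented for illustration.
reference = {
    "toxic_example_1":       ([1.0, 0.2, 3.1, 0.5, 1.2, 0.1, 2.2, 0.4], "toxic"),
    "toxic_example_2":       ([1.1, 0.3, 3.0, 0.6, 1.1, 0.2, 2.1, 0.5], "toxic"),
    "therapeutic_example_1": ([4.0, 2.5, 0.2, 3.3, 0.8, 2.9, 0.3, 1.8], "therapeutic"),
}

def predict_effect(vector):
    """Label a query chemical by its nearest annotated neighbour."""
    nearest = min(reference.values(), key=lambda rv: dist(vector, rv[0]))
    return nearest[1]

label = predict_effect([1.05, 0.25, 3.05, 0.55, 1.15, 0.15, 2.15, 0.45])
```

In the actual system, the same intuition is delivered visually: a query chemical plotted near annotated toxic chemicals in the 3D projection suggests a toxic rather than therapeutic effect.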

86 LIPID MAPS: Lipidomics Gateway

Valdivia-Garcia, Maria (Cardiff University)

LIPID MAPS is the world's largest public, open-access lipid database, containing unique lipids along with a diverse set of tools for mass spectrometry analysis and chemical structure drawing. The LIPID MAPS platform includes three main databases: the LIPID MAPS Structure Database (LMSD), containing chemical structures for over 40,000 biologically relevant lipids; the LIPID MAPS Gene/Proteome Database (LMPD), with over 8,500 genes and 12,500 proteins; and the LIPID MAPS In-Silico Structure Database (LMISSD), computationally generated from the headgroups and chains of common lipids, which currently holds over 1.1 million structures. Funded by a Wellcome Trust Biomedical Resources Grant, LIPID MAPS is currently developing and implementing new tools for the lipidomics community. These include statistical and lipid search software (LipidFinder) and the integration of these tools with other publicly available data resources. This work is augmented by manual curation of lipids into the database from the scientific literature. The first stage of curation is the search process, which uses web search engines with keywords targeting new natural lipids found in diverse biological organisms. The annotation stage includes verification and insertion of the new chemical structure. Each lipid in the database is allocated an ID according to its category and class. The pipeline then produces a unique mol file, SMILES, InChIKey, exact mass, formula, chain composition and exact structure for each new compound, which are incorporated into the LMSD. Curation is a laborious activity that involves going through many thousands of scientific publications to select the few that provide the required information. A new text-mining tool is being developed to facilitate the search process. Its incorporation will allow the volume of content within LIPID MAPS chemical entities to scale, by providing more rapid content updates and identification of key reference materials.
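One of the derived fields mentioned above, the exact (monoisotopic) mass, can be computed directly from a molecular formula. The following sketch is not the LIPID MAPS pipeline; it is a minimal stdlib-only illustration covering a handful of elements common in lipids.

```python
import re

# Monoisotopic masses (u) for elements common in lipids.
MONO = {"C": 12.0, "H": 1.007825, "O": 15.994915, "N": 14.003074, "P": 30.973762}

def exact_mass(formula):
    """Monoisotopic exact mass from a simple formula string like 'C16H32O2'.
    Handles one- and two-letter element symbols with optional counts."""
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONO[element] * (int(count) if count else 1)
    return mass

# Palmitic acid, C16H32O2 -> approximately 256.2402 u
palmitic = exact_mass("C16H32O2")
```

A production pipeline works from the full structure (mol file/SMILES) rather than the formula, so it can also derive chain composition and the InChIKey, but the mass arithmetic is the same.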

87 Laying the foundation for an infrastructure to support biocuration

Venkatesan, Aravind (EMBL-EBI)

Curation is a high-precision task that is essential to the maintenance of public biological databases. The tremendous growth in research papers being published creates challenges for curators in finding and assimilating the scientific conclusions described in the literature. Consequently, there is an urgent need to harness the potential of recent technological advancements to support various curation workflows. Among a large suite of approaches, text mining offers solutions to enhance ranked reading lists, the classification of articles, and the identification of assertions with their biological context. When text-mining outputs are made openly available for searching, linking of underlying data and access to articles, then we have the beginnings of an infrastructure that supports automated approaches to curation challenges. As part of Europe PMC (www.europepmc.org), and with the support of ELIXIR-Excelerate, we have laid the foundation for an infrastructure that facilitates literature-data integration. The infrastructure includes elements such as a platform to automatically ingest and aggregate text-mined outputs from various sources, APIs to redistribute them, and an application called SciLite (www.europepmc.org/Annotations) to display the text-mined annotations on articles. The infrastructure provides a mechanism to make deep links between the literature and data for clear provenance of curatorial statements. Furthermore, to ensure that the infrastructural components are indeed useful and meet the needs of curators, we have recently carried out: (1) an observational study to understand and identify common workflow patterns and practices; (2) a curator community survey to more specifically understand which entity types, sections of a paper and tools are of top priority.
Here, we will provide an overview of the infrastructure, present the results of the user research and the curator survey, and discuss how the outcomes can feed into optimising the current infrastructure.
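The redistribution APIs mentioned above can be queried per article. The sketch below only builds a request URL for the Europe PMC Annotations API; the endpoint path and parameter names reflect the public documentation at the time of writing and should be treated as assumptions to check against the current API reference. No network call is made.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Europe PMC Annotations API (check current docs).
BASE = "https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"

def annotations_url(source, ext_id, annotation_type="Gene_Proteins"):
    """Build the URL that returns text-mined annotations for one article,
    identified by source (e.g. 'MED' for PubMed) and external ID."""
    params = {
        "articleIds": f"{source}:{ext_id}",
        "type": annotation_type,
        "format": "JSON",
    }
    return f"{BASE}?{urlencode(params)}"

url = annotations_url("MED", "28585529")
```

Fetching that URL with any HTTP client returns the annotations that SciLite renders as highlights on the article page, with provenance back to the text-mining provider.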

For every new curation project, a new curation platform needs to be developed, and for every new requirement that then emerges, its interface and database need to be updated. To alleviate this, we developed a universal curation interface that enables scientists to easily capture any kind of information. VSM (Visual Syntax Method) allows scientists to formulate information in a simple but powerful 'sentence'-like format, where all terms are linked to ontology identifiers. The syntax of how terms relate to each other is defined by a small set of intuitive rules, usable no matter how long or complex the sentence is. VSM also supports 'template' sentences where users only need to fill out a number of terms/IDs, accessible via autocomplete, and any 'sentence' can still be extended without changes to the interface. This gives users an intuitive and flexible tool to capture semantics-based information with any amount of context details. (See also: scicura.org/vsm.) We present the result of a large programming, design, and documentation effort: the VSM curation interface is now available as an open-source web-component on GitHub and NPM. The 'vsm-box' module (github.com/vsmjs/vsm-box) is the main curation interface, and builds on a collection of supporting modules in the 'vsmjs' GitHub project. Each sub-module is highly customizable: the 'vsm-dictionary' module provides a scaffold for connecting a vsm-box to any term provider (e.g. BioPortal); the 'vsm-autocomplete' module allows customization of the content of autocomplete items; etc. All code has automated tests with nearly 100% coverage. Extensive documentation is available at https://vsmjs.github.io, describing how to configure and embed a vsm-box in a new project, and enabling interested individuals to contribute to next versions of the software. We encourage the community to make use of this open-source web component, and look forward to providing assistance in implementing new VSM-based curation projects.

89 Triaging PubMed literature to discover novel mutations for the Catalogue of Somatic Mutations in Cancer (COSMIC) with PubTator

Ward, Sari (Wellcome Trust Sanger Institute)

The COSMIC database catalogues somatic mutation data from the scientific literature for all known human genes across all human cancers. For the 723 recognised cancer genes in the Cancer Gene Census, mutation data and the related metadata are curated in substantial depth by expert data curators. One of the first 4 genes curated for COSMIC when the database was released publicly 14 years ago was KRAS. Over the last decade, KRAS has become one of the most clinically important and most sequenced oncogenes in cancer. Accordingly, the related scientific literature in PubMed has exploded to a level that is impossible to curate exhaustively. To find papers that report new KRAS mutations not currently represented in COSMIC, we have utilised two computational tools, PubTator and LitVar, to triage high-value publications from PubMed abstracts as well as full-text articles in PubMed Central. Specifically, we first scanned the literature to obtain all mutation mentions that co-occurred with the KRAS gene in the same sentences. To narrow down the scope to those that are novel to COSMIC, previously COSMIC-curated mutations were filtered out. As a result, the tools automatically returned a list of potentially novel KRAS mutations for further human examination. 14 new substitution mutations, 1 nonsense mutation and 4 new deletions/insertions were found in 14 publications and curated in COSMIC with their patient- and sample-related metadata. Most of the mutations were found outside the well-known oncogenic hotspots of exon 1 codons 12 and 13 and exon 2 codon 61, expanding the number of potentially relevant mutations in oncology. PubTator and LitVar have provided COSMIC curators with a high-value tool to help triage large numbers of publications in a cost-effective way. The tools are now being applied as part of routine practice and will be deployed to support curation alongside other triage methods for a balanced search strategy.
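The filtering step described above (sentence-level co-occurrence, then subtracting already-curated mutations) can be sketched in plain Python. The sentences, the "known" set, and the simple protein-change regex below are all invented for illustration; PubTator and LitVar use far more sophisticated named-entity recognition and variant normalisation.

```python
import re

# Invented examples: mutations already curated in COSMIC, and candidate sentences.
KNOWN_IN_COSMIC = {"G12D", "G12V", "G13D", "Q61H"}
MUTATION_RE = re.compile(r"\b([ACDEFGHIKLMNPQRSTVWY]\d+[ACDEFGHIKLMNPQRSTVWY])\b")

sentences = [
    "We identified a KRAS G12D mutation in 40% of samples.",
    "A novel KRAS A146T substitution was detected in two patients.",
    "BRAF V600E was the most frequent alteration.",
]

def novel_kras_mutations(sentences, known):
    """Return mutation mentions that co-occur with KRAS in the same sentence
    and are not already in the curated set."""
    hits = set()
    for sentence in sentences:
        if "KRAS" in sentence:
            hits.update(MUTATION_RE.findall(sentence))
    return hits - known

candidates = novel_kras_mutations(sentences, KNOWN_IN_COSMIC)
```

Note that V600E is excluded because its sentence mentions BRAF, not KRAS, and G12D is excluded because it is already curated, leaving only the genuinely new candidate for human review.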

90 Nightingale: a library of re-usable data visualisation components

Watkins, Xavier (European Bioinformatics Institute)

With the exponential growth of biological data and its increase in complexity, it is necessary to develop visualisation components so that users of resources like UniProt can efficiently extract valuable information from large amounts of data. Developed initially through a collaboration between UniProt and InterPro and now an EBI-wide initiative, “Nightingale” is a library of reusable data visualisation components providing tools to display protein features (ProtVista), protein interaction and 3D structure, with many more components to come. Implemented using established web standards (web components) and designed with flexibility in mind, these components can easily be added to any web resource allowing users to display data from the Proteins API as well as their own API. The components are also able to communicate with each other, allowing the creation of rich dashboard-like user interfaces.

91 Expanding the reach of data with new visualisation tools in GigaDB

Xiao, SiZhe (GigaScience)

Objective: GigaScience (www.gigasciencejournal.com) aims to revolutionize publishing by promoting reproducibility of analyses and data dissemination, organization, understanding and use. As an open-access and open-data journal, we publish ALL research objects from 'big data' studies across the entire spectrum of life and biomedical sciences. The journal's affiliated database, GigaDB (www.gigadb.org), serves as a repository to display the data and tools associated with GigaScience publications. Here we present a variety of tools that help display and visualise different kinds of data in GigaDB, furthering the transparency of research and reducing the burden on the user.

Methods: GigaDB added this functionality by including widgets in individual dataset pages. For datasets that have content available in the relevant formats, users can access them directly within the dataset page. The following tools and widgets are already installed in GigaDB:

Sample Map Browser: the sample map browser (http://gigadb.org/site/mapbrowse) allows users to explore samples across any dataset by geographic location.

SketchFab 3D Viewer: incorporation of the SketchFab widget enables interactive visualisation of surface-rendered reconstructions of 3D images, such as microCT (e.g. DOI:10.5524/100364). The 3D viewer allows users to interact with and explore image data prior to download.

Code Ocean Widget: Code Ocean widgets display the interactive code of datasets, allowing users to easily rerun the published code/pipeline (e.g. DOI:10.5524/100308).

JBrowse: the JBrowse genome assembly viewer is available as a widget for datasets that provide chromosome-level assemblies with comprehensive annotations (e.g. DOI:10.5524/100240).

Conclusion: GigaDB is addressing data visualization, transparency and reproducibility of research with the integration of the above tools, making it easier for reviewers, readers and users to view and access big data.

92 Exploring neXtProt data and beyond: A SPARQLing solution

Zahn, Monique (SIB Swiss Institute of Bioinformatics)

The neXtProt platform (www.nextprot.org), developed at the SIB Swiss Institute of Bioinformatics, is a one-stop shop for human proteins, offering solutions to select, explore and reuse available genomic, transcriptomic, mass-spectrometry- and antibody-based proteomics data. The neXtProt team manually curates data from the literature (post-translational modifications, variant phenotypes, protein-protein interactions, etc.) and combines it with high-quality omics data generated by systems biology projects, using a single interoperable format. neXtProt data are FAIR (Findable, Accessible, Interoperable, and Reusable), with full traceability ensured by extensive use of metadata. In the last four years, neXtProt has been promoting the use of SPARQL, a semantic query language for databases, to check, explore, and visualize its data. SPARQL queries are used to check the quality and consistency of the data loaded at each release. To date, over 450 queries have been written such that non-zero results trigger investigation. In an effort to automate these tests, all of the queries or a particular subset can be launched and the results written to a file. Semantic technologies can help generate innovative hypotheses where classical data mining tools have failed (protein function prediction, drug repositioning...). In order to promote the use of semantic technologies as data mining tools for the life sciences, neXtProt provides over 140 pre-built queries and documentation of its data model to guide users in their first steps. The use of SPARQL allows users to run federated queries across resources relevant for human biology or build customized views. All our SPARQL queries are open source and available on GitHub.
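The "non-zero results trigger investigation" pattern described above can be sketched as a small test harness. Everything here is hypothetical: the query names, the placeholder SPARQL text, and run_query, which stands in for a real SPARQL client call to an endpoint and instead returns canned rows.

```python
# Hypothetical release-time QC checks: each named query is expected to return
# zero rows; any non-empty result flags the check for investigation.
QC_QUERIES = {
    "variant_without_position": "SELECT ?v WHERE { ... }",  # placeholder query text
    "ptm_on_missing_residue":  "SELECT ?p WHERE { ... }",
}

def run_query(name):
    """Stand-in for executing a SPARQL query against an endpoint;
    returns canned result rows for this sketch."""
    canned = {
        "variant_without_position": [],
        "ptm_on_missing_residue": [("NX_P01308", 110)],  # invented offending row
    }
    return canned[name]

def failing_checks(queries):
    """Run every QC query and report those with non-zero results."""
    return sorted(name for name in queries if run_query(name))

flagged = failing_checks(QC_QUERIES)
```

Writing the flagged names and their rows to a file, as the abstract describes, then gives curators a concrete worklist for each release.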