Functional Annotation

Odd-numbered posters will be presented on Monday, 8th April and even-numbered posters on Tuesday, 9th April.

Posters 1 - 49

1 FlyBase: A Valuable Source of Molecular Interaction Data

Agapite, Julie
FlyBase, Harvard University

The knowledge of a molecule’s binding partners can provide insights into that molecule’s function and/or its involvement in a particular biological process. FlyBase curation of molecular interactions is primarily focused on capturing protein-protein, RNA-protein and miRNA-mRNA interactions from low-throughput studies, in which interactions are typically supported by multiple independent forms of evidence. The FlyBase molecular interaction dataset consists of a total of 42,548 interactions which represent 28,969 distinct pairwise interactions and involve the products of 5,817 genes. Importantly, the vast majority of the low-throughput studies curated by FlyBase are not curated by other interaction databases. Only 7.6% and 5.5% of publications with FlyBase curated interactions have also been curated by BioGRID and the IMEx Consortium, respectively, making FlyBase an essential source of these well-supported interactions.

2 The SwissLipids knowledge resource for lipid biology

Aimo, Lucila
Swiss-Prot, SIB

SwissLipids (www.swisslipids.org) is a freely available knowledge resource for lipid biology. In SwissLipids, targeted expert biocuration of lipid metabolism using Rhea (www.rhea-db.org) and UniProt (www.uniprot.org) provides the knowledge needed to generate a library of more than 550,000 annotated lipid structures from over 300 lipid classes. This lipid library is fully mapped to ChEBI, Rhea and UniProt. It is organized in a hierarchical classification that maps lipidomics data to possible lipid structures and associated biological knowledge, for example: PC(38:4) -> PC(18:0/20:4(5Z,8Z,11Z,14Z)) -> PLA2G4A. The SwissLipids website provides a range of search and browse options as well as an identifier mapping service for resources such as LIPID MAPS and HMDB. SwissLipids is used for lipidomic data interpretation and annotation by projects such as the Innovative Medicines Joint Initiative (IMI-JU) for Diabetes (IMIDIA, http://www.imidia.org/) and the EU H2020 project METASPACE (http://metaspace2020.eu/) on bioinformatics for spatial metabolomics, by commercial lipidomics service providers such as LipoType (www.lipotype.com), and by many individual research laboratories. The SwissLipids biocuration effort focuses on human lipids and those of major model organisms from vertebrates to yeast, experimental systems used by our collaborators in the LipidX project of the Swiss Initiative in Systems Biology SystemsX.ch. Here we provide an overview of some of our latest biocuration work, including a targeted biocuration effort that covers hundreds of classes of complex glycosphingolipids – building on the work of expert resources such as SphinGOMAP.
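The first step of the hierarchy in the example above, from the species-level name PC(38:4) to the structure PC(18:0/20:4(5Z,8Z,11Z,14Z)), can be illustrated with a small consistency check: the acyl chains of a candidate structure must add up to the carbon and double-bond totals of the species-level name. The following is only a minimal sketch. The function names are hypothetical, the parser handles just the narrow shorthand shown in this abstract, and it does not represent how SwissLipids itself is implemented.

```python
import re

# Species-level shorthand such as "PC(38:4)": class, total carbons, total double bonds.
SPECIES = re.compile(r"^(?P<cls>[A-Za-z]+)\((?P<carbons>\d+):(?P<bonds>\d+)\)$")
# One acyl chain such as "18:0" or "20:4(5Z,8Z,11Z,14Z)"; bond positions are optional.
CHAIN = re.compile(r"(\d+):(\d+)(?:\([^)]*\))?")


def parse_species(name):
    """Return (lipid_class, total_carbons, total_double_bonds)."""
    m = SPECIES.match(name)
    if m is None:
        raise ValueError(f"not a species-level name: {name!r}")
    return m["cls"], int(m["carbons"]), int(m["bonds"])


def parse_subspecies(name):
    """Return (lipid_class, [(carbons, double_bonds), ...]) for a chain-resolved name."""
    lipid_class, sep, rest = name.partition("(")
    if not sep or not rest.endswith(")"):
        raise ValueError(f"not a chain-resolved name: {name!r}")
    chains = []
    for part in rest[:-1].split("/"):  # drop the closing ")" and split the chains
        m = CHAIN.fullmatch(part)
        if m is None:
            raise ValueError(f"unrecognised chain {part!r} in {name!r}")
        chains.append((int(m.group(1)), int(m.group(2))))
    return lipid_class, chains


def is_consistent(species_name, subspecies_name):
    """True if the chains of the structure sum to the species-level totals."""
    cls, carbons, bonds = parse_species(species_name)
    sub_cls, chains = parse_subspecies(subspecies_name)
    return (
        cls == sub_cls
        and carbons == sum(c for c, _ in chains)
        and bonds == sum(b for _, b in chains)
    )
```

With these definitions, `is_consistent("PC(38:4)", "PC(18:0/20:4(5Z,8Z,11Z,14Z))")` holds because 18 + 20 = 38 carbons and 0 + 4 = 4 double bonds.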

3 Adding knowledge to the UniProt resource by proteomics and genomics integration

Alpi, Emanuele
EMBL-EBI

UniProt provides a broad range of Reference protein data sets for a large number of species, specifically tailored for effective coverage of sequence space while maintaining high-quality sequence annotations and mappings to genomics and proteomics information. Proteomics data are key for understanding, and consequently annotating, the existence, functions (whether or not mediated by PTMs) and localization of proteins. With respect to publicly available bottom-up proteomics data from MS proteomics repositories, UniProt provides mappings to its Reference proteomes via the website, the FTP site and programmatically by means of a new Proteins RESTful API (https://www.ebi.ac.uk/proteins/api/doc/), which also provides many additional types of data, such as genomic coordinates and variants mapped from Ensembl, 1000 Genomes, ESP (Exome Sequencing Project), ExAC (Exome Aggregation Consortium), COSMIC (Catalogue Of Somatic Mutations In Cancer), ClinVar (Clinical significance of Variants), dbSNP and TCGA (The Cancer Genome Atlas), together with manually reviewed ones. We currently have 17,595 human reference canonical proteins and 62,851 additional human isoforms mapped to proteomics experiments from EPD (Encyclopedia of Proteome Dynamics), MaxQB (MaxQuant DataBase) and PeptideAtlas, together with data for ten additional species. Collaborating MS proteomics repositories are cross-referenced from within UniProt, and work is ongoing to add further proteomics and variation data providers in order to expand the covered species and to focus on PTM- and localization-related proteomics studies and the corresponding data sets. Ongoing collaborations with MS proteomics resources include the Consortium for Top Down Proteomics (CTDP), ProteomicsDB, jPOST (Japan Proteome Standard Repository/Database), MassIVE (Mass Spectrometry Interactive Virtual Environment), GPMDB (Global Proteome Machine Database) and iProX (Integrated Proteome Resources).

4 An evidence-based model for representing signaling pathways in FlyBase

Antonazzo, Giulia
FlyBase, University of Cambridge

While there are many resources listing members of molecular signaling pathways across species, defining the boundaries of a pathway can be a challenging task. Such pathway resources are often the product of a one-pass compilation effort, and although they tend to be good at presenting the core pathway members, they also tend to be static and not necessarily up to date with the current literature. Drosophila melanogaster has served as a central model system for the molecular characterization of many important pathways. FlyBase, the primary knowledgebase for D. melanogaster genetic research, has examined how to best represent up-to-date, comprehensive data on pathways, which extends to cover less well characterized pathway modulators. We have pursued a curation model in which the extent of supporting experimental evidence for each pathway member is captured using the Gene Ontology (GO), following principles agreed upon by the GO consortium. Genes annotated as a pathway member or regulator are presented in dedicated ‘Pathway report’ pages together with literature corroborating pathway membership status, links to other curated data and to external resources. As of the October 2018 FlyBase release we have completed a first pass curation of 10 major pathways. As signaling is a very active area of fly research, the pages will be revised as part of on-going paper curation at FlyBase, reflecting the current knowledge. These pathway member lists can support and fuel further biological discoveries, and we are currently exploring approaches that use these alongside other annotations in FlyBase to find biological patterns and make functional predictions.

5 Collaborative curation of antigen presentation and recognition in UniProtKB/Swiss-Prot with IMGT®

Argoud-Puy, Ghislaine
SIB Swiss Institute of Bioinformatics

Each individual carries a highly diverse T cell receptor (TR) repertoire able to recognize a wide variety of foreign peptides presented on major histocompatibility (MH) proteins. The TR repertoire is shaped during T cell maturation in the thymus, so that each T cell clone expresses a unique pair of TR chains, resulting from somatic V-(D)-J recombination of one of each of the variable (V), diversity (D) and joining (J) genes spliced to a constant (C) gene. By contrast, MH proteins expressed on antigen presenting cells are the products of unique genes which are highly polymorphic in the human population, each individual carrying a specific set of allele groups/haplotypes. Here we describe the curation of human TR and MH catalogues in UniProtKB/Swiss-Prot. The TR catalogue was built in collaboration with IMGT® and provides representative sequences for germline-encoded V-, D-, J- and C TR chains. The 112 UniProtKB/Swiss-Prot entries in this set are identical to GRCh38, use official nomenclature from IMGT/GENE-DB and are directly linked to the IMGT resource. The MH catalogue provides representative sequences of 102 serologically distinct HLA allele groups, and their coding allelic variants described with the Nomenclature for Factors of the HLA System. In the future we plan to perform a more comprehensive curation of MH alleles associated with different phenotypic traits and to improve the representation of MH-peptide-TR interactions.

6 Automated generation of modular and standardized gene descriptions using structured data at the Alliance of Genome Resources

Arnaboldi, Valerio
WormBase

Summarizing gene function into short human-readable text from an ever-growing literature corpus is a time- and labor-intensive task. In order to automate the writing of a gene description, we developed an algorithm that uses curated gene data provided by member databases of the Alliance of Genome Resources (Alliance; www.alliancegenome.org). Data include gene associations to Gene Ontology (GO) terms [1], Disease Ontology (DO) terms [2] and human orthologs. This project has resulted in modular and standardized gene descriptions for the seven species at the Alliance portal (yeast, worm, fly, zebrafish, mouse, rat and human), including those that lacked them, for purposes of display and use at the Alliance and at the relevant Model Organism Databases, thus facilitating discoverability and interspecies comparison. The problem of generating a text summary from multiple ontology terms annotated to a gene was formulated as a set-covering problem and solved with a ‘greedy’ algorithm, which is the best known approach to solve the problem in polynomial time. The algorithm is designed to balance readability of a gene description with the amount of information it provides. The software is provided as an open source Python package and integrated into the build process of the Alliance website; it can be downloaded and integrated into other pipelines wherever text descriptions are needed. For example, the software is used in a custom WormBase pipeline to generate gene descriptions for C. elegans and nine other nematode species. We are currently refining the algorithm by methods that include weighting ontology terms by information content and using data-type specific rules.
[1] The Gene Ontology Consortium. Expansion of the Gene Ontology knowledgebase and resources. Nucleic Acids Res. 2017 Jan 4;45(D1):D331-D338.
[2] Schriml LM et al. Human Disease Ontology 2018 update: classification, content and workflow expansion. Nucleic Acids Res. 2018 Nov 8.
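To make the set-covering formulation concrete, the sketch below shows a generic greedy set-cover routine in Python. It is not the Alliance package's code: the function name, the `max_terms` cap and the way candidate summary terms are mapped to the annotations they subsume are all assumptions made for illustration, and the term labels are illustrative rather than taken from any real gene.

```python
def greedy_cover(candidates, max_terms=None):
    """Greedy set cover.

    candidates maps each candidate summary term to the set of annotated terms
    it would subsume. At each step the term covering the most still-uncovered
    annotations is chosen. max_terms optionally caps the length of the result,
    trading completeness for readability.
    """
    uncovered = set().union(*candidates.values())
    chosen = []
    while uncovered and (max_terms is None or len(chosen) < max_terms):
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(candidates), key=lambda t: len(candidates[t] & uncovered))
        gain = candidates[best] & uncovered
        if not gain:
            break
        chosen.append(best)
        uncovered -= gain
    return chosen


# Illustrative input: one broader term subsumes two specific annotations.
example = {
    "kinase activity": {
        "protein serine kinase activity",
        "protein tyrosine kinase activity",
    },
    "protein serine kinase activity": {"protein serine kinase activity"},
    "protein tyrosine kinase activity": {"protein tyrosine kinase activity"},
    "DNA binding": {"DNA binding"},
}
summary_terms = greedy_cover(example)
```

Here `summary_terms` is `['kinase activity', 'DNA binding']`: two terms cover all three annotations, which is the kind of shortening a readable description needs. Weighting terms by information content, as mentioned in the abstract, would change the `key` function rather than the overall loop.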

7 Enhanced enzyme annotation in UniProtKB using Rhea

Axelsen, Kristian
SIB Swiss Institute of Bioinformatics

The UniProt Knowledgebase (http://www.uniprot.org) is a large reference resource of protein sequences and functional annotation. More than 45% of UniProtKB/Swiss-Prot entries are enzymes, which were traditionally annotated using EC (Enzyme Commission) numbers, the hierarchical four-digit enzyme classification based on the recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB). Here we describe our work on the enhancement of enzyme annotation in UniProtKB using Rhea (https://www.rhea-db.org). Rhea is a comprehensive expert-curated knowledgebase of biochemical reactions that uses the ChEBI ontology to describe reaction participants, their chemical structures, and chemical transformations – a computationally tractable description of reaction chemistry. UniProt has recently adopted Rhea as the reference vocabulary for enzyme annotation in UniProtKB, and now describes all enzymatic reactions using Rhea where possible. Rhea provides improved consistency and precision of enzyme annotation in UniProtKB and allows UniProt users to search, browse, and mine enzyme data in new ways, combining approaches from the fields of cheminformatics and bioinformatics. UniProt curators are now using Rhea in their daily work: major curation efforts are focusing on improved coverage of human and microbial metabolism in health and disease (some of which will be described here) as well as biosynthetic pathways for natural products (see the poster “Diverse taxonomies for diverse chemistries: enhanced plant and fungal metabolic pathway annotation for natural product biosynthesis in UniProtKB/Swiss-Prot”). You can learn more about Rhea in the poster “Rhea, an expert curated resource of biochemical reactions for enzyme annotation”, which also describes how we aim to improve the alignment of Rhea with the Gene Ontology (GO) and other knowledge resources such as Reactome.

8 Using Wikidata for community engagement and display of curated Plasmodium genomes

Böhme, Ulrike
Wellcome Sanger Institute

The genomes of 11 malaria parasite species (Plasmodium spp.) are currently being curated, including those of the rodent-malaria parasites, P. chabaudi, P. yoelii and P. berghei; the human-infective species, P. falciparum, P. vivax, P. malariae, P. ovale and P. knowlesi; the chimpanzee parasite P. reichenowi and two avian malaria parasites, P. gallinaceum and P. relictum. For all of them, Gene Ontology terms and products are captured based on publications. A curation workflow has been established. Artemis, an annotation tool that can read and write to a Chado relational database, is being used for structural and functional annotation. A transfer annotation tool that has been implemented in Artemis can transfer GO terms and products to other curated Plasmodium genomes. As part of a collaborative effort with PlasmoDB (http://www.plasmodb.org), every 4 to 6 months the annotated and curated genomes are sent to PlasmoDB to be integrated with a wide variety of functional genomics data sets. To facilitate community structural annotation, genomes currently present in Chado are being loaded into Apollo (http://genomearchitect.github.io). The plan is to give interested members of the community log-in details so that they can improve structural annotation. In addition, Plasmodium genomes have been added to Wikidata (https://www.wikidata.org) with the aim of facilitating community participation in functional annotation and making it much easier to share annotation between databases. Based on Wikidata, a new website has been set up that presents all the information available on Wikidata in a single overview.

Genome and proteome annotation pipelines are generally custom built and therefore not easily reusable by other groups, which leads to duplication of effort and suboptimal results. One cost-effective way to increase the data quality in public databases is to encourage the adoption of annotation standards and technological solutions that enable the sharing of biological knowledge and tools for genome and proteome annotation. Here we use the HAMAP system (https://expasy.hamap.org) to demonstrate technical solutions that facilitate the combination and reuse of functional genome annotation systems from any provider. HAMAP classifies protein sequences using a collection of expert-curated protein family signatures and annotation rules that provide the same level of detail and quality of annotation as expert curated UniProtKB/Swiss-Prot records. HAMAP provides a major part of the content of the UniProt UniRule pipeline for the annotation of UniProtKB/TrEMBL. The current implementation of HAMAP, however, uses a custom rule format and annotation engine that are not easy to integrate into external pipelines. We have translated the rules of our HAMAP proteome annotation pipeline to queries in the W3C standard SPARQL 1.1 syntax (SPARQL is a recursive acronym for the SPARQL Protocol and RDF Query Language). This allows users to apply HAMAP rules to annotate protein sequences expressed as RDF using off-the-shelf SPARQL engines to achieve UniProtKB/Swiss-Prot levels of detail and quality, without any need for a custom pipeline. If other annotation projects adopt the same approach, it will be possible to share the rules of different projects, execute them with any SPARQL engine, and compare the results. HAMAP SPARQL rules and documentation are freely available for download from the HAMAP FTP site ftp://ftp.expasy.org/databases/hamap/hamap_sparql.tar.gz under a CC-BY-NC license.

10 InterPro2COGs: A mapping between InterPro and Clusters of Orthologous Genes (COGs)

Chang, Hsin-Yu
EMBL-EBI

InterPro and Clusters of Orthologous Genes (COGs) are widely used protein family resources for the annotation of protein sequences. While InterPro (an EMBL-EBI resource) combines 13 different protein family databases providing coverage of sequences from all domains of life, COGs (an NCBI resource, not part of InterPro) is based on bacterial and archaeal sequences. To understand the complementarity of the InterPro and COGs databases, related entries need to be identified. To achieve this, Jaccard indices (JI) and Jaccard containment (JC) indices were calculated for sequence sets matched by each database entry. This analysis revealed that 92% of 4,873 COGs could be mapped to InterPro, with 950 COGs having equivalent entries. Meanwhile, 3,317 COGs represented more specific subfamilies of broader InterPro families, and 224 represented larger families of their closest InterPro counterparts. On the basis of this analysis, we created a mapping between the InterPro and COGs entries and used this to assess the feasibility of automatically producing one annotation set based on the other. Here we show the results of such annotation cross-mapping using a freshwater metagenomic dataset. Our results raise the intriguing possibility of accurately mapping between COGs and InterPro annotation, but also show more work is required in order to resolve some of the more complex mappings.
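The two measures used above are simple to state on sets of matched sequences: the Jaccard index is the size of the intersection divided by the size of the union, and the Jaccard containment of A in B is the size of the intersection divided by the size of A. The sketch below shows how the two could be combined to sort a COG/InterPro pair into the three relationships reported in the abstract. The 0.75 threshold, the function names and the category labels are assumptions for illustration; the abstract does not state the cut-offs actually used.

```python
def jaccard_index(a, b):
    """|A ∩ B| / |A ∪ B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_containment(a, b):
    """Fraction of A that is also in B: |A ∩ B| / |A|; 0.0 when A is empty."""
    return len(a & b) / len(a) if a else 0.0


def classify(cog_seqs, interpro_seqs, threshold=0.75):
    """Classify a COG / InterPro pair from the sequence sets each entry matches.

    The threshold is a hypothetical value chosen for this example.
    """
    if jaccard_index(cog_seqs, interpro_seqs) >= threshold:
        return "equivalent"
    if jaccard_containment(cog_seqs, interpro_seqs) >= threshold:
        return "COG is a subfamily of the InterPro entry"
    if jaccard_containment(interpro_seqs, cog_seqs) >= threshold:
        return "COG is a larger family than the InterPro entry"
    return "unresolved"
```

For example, a COG matching sequences {1, 2} and an InterPro entry matching {1, 2, 3, 4, 5, 6} have a Jaccard index of 2/6 but a containment of 2/2, so the COG is treated as a more specific subfamily.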

11 MetaboLights: Open Access Metabolomics Resource

Cochrane, Keeva
EBI

Metabolism has been studied in depth and utilised in medical diagnostics from the early examples of the observational urine wheel through to modern clinical chemistry testing. The major pathways involved in metabolism have been well defined, yet the study of the metabolome, with a view to understanding the entire biochemical landscape of a particular biological system, is a field still in its infancy. With instrumentation and informatic analysis methods being continually modified and developed, we look to encourage the community to develop high standards of data capture from the outset. The Metabolomics group at EMBL-EBI has created MetaboLights to support the growing community and to provide a cross-species, cross-technique, open access experimental research repository. The success of MetaboLights to date can be attributed to the standards of its data deposition requirements, the quality of its manual curation process and its continued support as a recommended repository for journals including Nature, PLOS and Metabolomics. MetaboLights strives to become the model resource for metabolomics and is therefore eager to develop and integrate with others where possible. As such, the Metabolomics team has worked hard to develop a streamlined online submission portal and works closely with companies and core facilities to ensure easy integration of data through standardised systems or API integration. MetaboLights also actively collaborates in the development of tools that allow convenient discovery of metabolomics and multi-omics research, such as MetabolomeXchange and OmicsDI.

12 Capturing phenotypes for inclusion in a multi-species interaction database

Cuzick, Alayne
Rothamsted Research

The pathogen-host interactions database PHI-base (www.phi-base.org) is a knowledge database. It contains expertly curated molecular and biological information on genes proven to affect the outcome of pathogen-host interactions reported in peer reviewed research articles. The recent release of PHI-base version 4.6 contains information from >3000 manually curated references. The data provide information on 6438 genes from 263 pathogens tested on 194 hosts in 11340 interactions. Pro- and eukaryotic pathogens are represented in almost equal numbers. Viruses are not included. About 65% of host species are plants and 35% are other species of medical and/or environmental importance. Phenotypes are assigned to each interaction. Genes not affecting the disease interaction phenotype are also curated. Historically, nine high level controlled language phenotypes have been used in PHI-base. These are now being extended into a formal, logically defined, precomposed ontology named phipo (pathogen host interaction phenotype ontology) using the ODK (ontology development kit) and registered with the OBO foundry (http://www.obofoundry.org/ontology/phipo.html). PHI-base is a multi-species database, therefore we are aiming to develop a species neutral vocabulary for maximum data comparability. We are also developing a community curation tool called PHI-Canto to enable authors to curate their own data and annotate genotypes with phipo terms. PHI-base continues to use and openly provide data with a variety of resources, encouraging maximum data interoperability. The development of a community author curation tool is explained in Urban et al. (2017), doi: 10.1093/nar/gkw1089, and in our other poster on PHI-Canto. This work is supported by the UK Biotechnology and Biological Sciences Research Council (BBSRC) (BB/I/001077/1, BB/K020056/1). PHI-base receives additional support from the BBSRC as a National Capability (BB/J/004383/1).

13 Annotation of tRNA modification genes in model organisms

de Crecy-Lagard, Valerie
University of Florida

tRNAs are post-transcriptionally modified by chemical modifications that affect all aspects of translation. Depending on the genome, between 10 genes (in intracellular bacteria or symbionts) and over 130 genes (in higher eukaryotes) are dedicated to this process. Near complete sets of tRNA modification genes are currently available for only one organism per domain of life: Saccharomyces cerevisiae for Eukarya, Escherichia coli for Bacteria and Haloferax volcanii for Archaea. In this study, we set out to analyze the current status of annotation of these genes in UniProt. We also predicted the tRNA modification genes in three other model organisms: the model gram-positive Bacillus subtilis 168, the model methanogen Methanocaldococcus jannaschii and Homo sapiens. In all cases, additional experimental analyses (LC/MS or tRNA-Seq) were required to confirm the identity and location of many modifications. These analyses revealed major differences between organisms from the same kingdom, not only in the modifications found but also in the choice of enzymes catalyzing the same reactions. In addition, as an increasing number of mutations underlying human genetic diseases map to genes encoding tRNA modification enzymes and our knowledge of human tRNA modification genes remains fragmentary, this work provides a comprehensive up-to-date compilation of human tRNA modifications and their enzymes that can be used as a resource for further studies.

The Gene Ontology Consortium relies on both curator-driven development and workshops of domain experts for revision of the Gene Ontology (GO) to enable better GO term representation for diverse areas of biology. One of the earliest workshops was held in 2005, and brought together expert immunologists and GO developers to improve the representation of immunological processes. The workshop and follow-up efforts resulted in the addition of over 700 new terms, improvements in existing terms, and overall rearrangement of the ontology hierarchy. While many of these terms have been used in annotation of gene products associated with the immune system, it is apparent after 12 years that a number of the terms have never been used in annotation, and many well studied immune system proteins are inadequately annotated. In order to get a better picture of the state of annotation for immunology, we have studied annotation patterns for GO terms that are subclasses of ‘immune system process’ for gene products from human, mouse, and rat. We find an overall trend of more annotations in all three species for GO terms that are subclasses of ‘innate immune response’ rather than ‘adaptive immune response’. Annotation of human gene products is less skewed towards subclasses of ‘innate immune response’, whereas annotation of mouse and rat gene products is strongly skewed. We believe these differences are the result of both differing types of data available in different species and differing annotation practices of curators working on these species. Furthermore, annotations in all three species tend toward less granular terms in general. GO annotation for immunology can be quite complex and demands a degree of domain expertise to select the most appropriate terms. Based on our results, we are planning a focused effort in GO annotation for immunology, in order to improve the utility of the GO for term enrichment and other downstream analyses for immunological data.

15 Functional annotation of the population-specific variations from whole genome analyses of a Chinese population

Advances in genome sequencing technology provide an opportunity to investigate human genetic diversity across different population groups. By whole-genome sequencing of a Chinese population, as a part of the CAS Precision Medicine Initiative (CASPMI) project, we generated a comprehensive variation map and identified population-specific variations, including 55,271 SNPs and 6,774 indels. To further study phenotype associations, we checked the variants that were also present in the GWAS catalog, and found 253 SNPs that are associated with 213 traits or diseases. Among the SNPs correlated with various metabolic-related traits and diseases (Fig. 1a), the frequency of the T allele of rs1549293 in KAT8 (associated with susceptibility to a wider waist) varies dramatically among populations and is high in the CASPMI population (Fig. 1b). In an association analysis, this SNP showed a significant association with larger waist circumference in the 246 males of our population (p-value=0.0093). Within the population, northern men who carried this SNP with the TT genotype had significantly larger waist measurements than did southern males (Fig. 1c). Considering the general discrepancy in body build between northern and southern Chinese, the association of the T allele with a wider waist measurement implies its possible link to population-based differentiation in physique. Moreover, this work also provides a method to select potential functional sites from the vast variations derived from whole genome sequencing.

The Chinese Genomic Variation Database (CGDB) is a public database designed to collect curated genomic variations in Chinese populations broadly and to comprehensively integrate annotations on biological function, phenotypes and disease associations. As a part of the CAS Precision Medicine Initiative (CASPMI) project, CGDB integrates curated high-quality variations in whole genomes of 597 Chinese individuals, including SNPs and small indels. CGDB also collects public resources of genomic variations in Chinese populations, such as the Chinese individuals’ data in the well-known 1000 Genomes Project. The current version of CGDB holds about 36.2 million (M) SNPs and 2.0 M indels in total. Moreover, comprehensive annotations of variations are provided in CGDB, including genomic locations, related genes, allele frequencies in worldwide populations and phenotype or disease associations. CGDB also provides user-friendly web interfaces for browsing, searching, visualization and data retrieval. Taken together, CGDB will continue providing novel, population-specific data sets of genetic variants for use in personalized and precision medicine.

17 Pfam and MGnify: using metagenomics to improve the Pfam coverage of microbial sequence space

El-Gebali, Sara
EMBL-EBI

MGnify is EMBL-EBI’s metagenomics resource that includes over a billion non-redundant protein sequences identified from their assemblies, with less than 1% of sequences being in common with UniProtKB. Of these non-redundant metagenomics sequences, 58% have at least one match to a Pfam entry. This level of Pfam annotation demonstrates the sensitivity of profile hidden Markov models (HMMs) to annotate novel sequences, but also means that there is a large proportion of known metagenomics proteins that are not annotated by the Pfam database. The metagenomic protein sequences from the MGnify resource have been clustered using LinClust into 248 million sequence clusters, ranging in size from 1 sequence to 25,653 sequences. Although most sequences from metagenomic sources are not found in UniProtKB, there are ~150,000 sequence clusters in the MGnify protein database that contain two or more sequences that are found in UniProtKB, and are therefore tractable starting points for constructing new Pfam entries. Of these clusters, 1,417 contain at least 1,000 sequences. We have used these clusters as a starting point to build new Pfam entries in an effort to improve the coverage of microbial sequence space, and thus better represent the diversity of protein sequences from environmental sources. Of the potential families created from the clusters, 699 (49%) did not overlap with any entries in the Pfam database, meaning that they covered novel sequence space within Pfam. A further 714 (50%) overlapped with existing Pfam entries, indicating that these are outliers of known families. In these cases we either expanded the existing Pfam entry to include the new members, or created a new entry and grouped the related entries into a clan. The Pfam entries generated from this approach will contribute to a better understanding of the evolution and function of microbes, and the role that microbes play in biogeochemical nutrient cycling and ecosystem function.

IMGT®, the international ImMunoGeneTics information system®, http://www.imgt.org, is the global reference in immunogenetics and immunoinformatics. IMGT® is a high-quality integrated knowledge resource specialized in the immunoglobulins (IG), T cell receptors (TR), major histocompatibility (MH) of vertebrates, and in the immunoglobulin superfamily (IgSF), MH superfamily (MhSF) and related proteins of the immune system (RPI) of vertebrates and invertebrates. The goat (Capra hircus) and the sheep (Ovis aries) are two ruminant species of economic and scientific interest. The goat genome was sequenced in 2016; the IGK locus is situated on chromosome 11 in forward (FWD) orientation. An IMGT flat file, IMGT000009, was created, which comprises the region extracted from chromosome 11. To localize the sheep IGK locus, the sequences of the most 5’ V gene and of the most 3’ C gene from the goat locus were used to BLAST the sheep genome. The sheep IGK locus is in reverse (REV) orientation. An IMGT flat file, IMGT000010, was created, which comprises the region extracted from chromosome 3. The internal annotation tool IMGT/LIGMotif was used for the identification of the variable (V), joining (J) and constant (C) genes. The biocuration of these genes was performed using IMGT® tools (IMGT/Automat, IMGT/NTItoVALD). All the labels (V-REGION, J-REGION, C-REGION…) of each type of gene (V, J, C genes) were characterized. Subsequently, a multiple alignment and a phylogenetic tree were constructed from the V-REGION sequences of the goat and sheep IGKV genes and of a representative of each human IGKV subgroup. The goat IGK locus on chromosome 11 (FWD) spans 447 kb and consists of 21 IGKV genes (5 functional, 3 ORF, 13 pseudogenes), 4 IGKJ genes (1 functional, 3 ORF) and 1 IGKC gene (functional). The sheep IGK locus on chromosome 3 (REV) spans 150 kb and consists of 18 IGKV genes (5 functional, 1 ORF, 12 pseudogenes), 4 IGKJ genes (1 functional, 3 ORF) and 1 IGKC gene (functional).

19 The challenges of annotation and integration of scRNA-Seq into Bgee

Fonseca Costa, Sara
University of Lausanne

The Bgee database integrates expression data from diverse sources and techniques, from in situ hybridization to bulk RNA-Seq. We are now integrating single cell RNA-seq (scRNA-Seq) transcriptomes, and we will present here the specific challenges which this new data type raises. Integrating such data should allow us to provide information on expression per cell type, or on the stochasticity of gene expression. One challenge is to provide well supported calls of presence / absence of expression. We are both adapting the methods we use for bulk RNA-Seq, and benchmarking new specific methods. A second challenge is to represent information in a consistent manner, using existing ontologies. Specific cell types can be lacking, and cells of the same type can be sampled in different organs, thus necessitating term composition. A third challenge is that our annotation procedure has always relied on the expertise of the scientists doing the sampling, as reflected in submission information or in the literature, but for scRNA-Seq cell types are often determined a posteriori, after analysis of the data. We need to capture this data-derived information in a distinct manner from the expert-derived information of, e.g., which organ was dissected. Moreover, new cell types can be proposed based on these analyses, and it remains to be decided how these are captured in ontologies and annotations. We will present our solutions to these challenges, based on manual annotation and expert analysis of a few chosen datasets of RNA-Seq which present different characteristics.

The Animal Transcription Factor DataBase (AnimalTFDB) is a resource that aims to provide the most comprehensive and accurate information for animal transcription factors (TFs) and cofactors. The AnimalTFDB has been maintained and updated for seven years and we will continue to improve it. Recently, we updated the AnimalTFDB to version 3.0 (http://bioinfo.life.hust.edu.cn/AnimalTFDB/) with more data and functions. AnimalTFDB contains 125,135 TF genes and 80,060 transcription cofactor genes from 97 animal genomes. Besides the expansion in data quantity, some new features and functions have been added: (i) more accurate TF family assignment rules; (ii) classification of transcription cofactors; (iii) TF binding site information; (iv) GWAS phenotype information for human TFs; (v) TF expression in 22 animal species; (vi) a TF binding site prediction tool to identify potential binding TFs for nucleotide sequences; (vii) a separate web interface for human TFs (HumanTFDB), designed to make better use of the human TF data. The new version of AnimalTFDB provides a comprehensive annotation and classification of TFs and cofactors, and will be a useful resource for studies of TFs and transcription regulation.

New genetic techniques developed for Drosophila allow transgene expression to be targeted to the cells at the intersection between expression patterns of two ‘hemi-driver’ transgenes. These ‘split’ driver combinations are particularly useful for neurobiologists as they allow the isolation and manipulation of individual or specific groups of neurons. The FlyLight project has produced several thousand hemi-driver lines and has published 3D images and detailed expression descriptions for several hundred combinations of these. This expression data is also curated by FlyBase (FB) – the Drosophila model organism database – using ontology terms as part of a semi-formalised annotation system. Virtual Fly Brain (VFB) – an interactive toolkit and visualisation resource for Drosophila neurobiologists – aims to integrate images from FlyLight with formal descriptions of expression from FlyBase to provide researchers with a single, focused set of tools to explore and use these resources. The FlyBase Chado database allows rich descriptions of gene and transgene expression patterns to be captured. However, the database schema was originally designed to capture the expression patterns of single genes or transgenes. We have developed a curation system that allows us to unambiguously link both hemi-drivers to expression pattern annotations. With this in place, hemi-driver combinations can be presented as hyperlinked text on FlyBase pages and parsed by VFB to link FlyBase annotations to identifiers for split-driver expression patterns. These identifiers are in turn linked to 3D images on VFB, allowing users to browse annotations and view 3D images together. We will present recent work to semi-automate the curation and integration of FlyLight split expression data and to improve its representation on the FB and VFB websites. This will allow the Drosophila neurobiology research community to make full use of this rich dataset.

There are several computational tools and services in the literature that assist experimental biomedical research. However, they have shortcomings, especially in terms of data connectivity, which limit their application to real-world problems. Here we aim to develop a comprehensive resource that addresses these shortcomings by connecting various biomedical resources, focusing on the fields of drug discovery and precision medicine. The CROssBAR system will contain 3 modules: (1) a novel computational method for the comprehensive prediction of unknown drug/compound - target protein interactions to reveal novel on- and off-target effects. We have developed a method called DEEPScreen using deep convolutional neural networks which predicts drug-target interactions based on 2D structural compound representations; (2) multi-partite networks where different types of nodes will represent compounds/drugs, genes/proteins, pathways and diseases, and the edges will represent the known and predicted pairwise relations between them; and (3) an open access database and a web-service to provide access to the resultant networks and their components. We have developed data pipelines for the heavy lifting of data from different data sources like UniProt, ChEMBL, PubChem, DrugBank and EFO. Using logic rules, we persist only the specific data attributes required for the learning. The CROssBAR database of attributes is hosted in self-sufficient, easy to access collections in MongoDB. The CROssBAR system will provide a database of connected information currently dispersed in different biological resources as well as access to known and predicted drug-target interactions. It will also provide a network of gene to disease associations to help researchers in the interpretation of biomedical data.

23 ViralZone: recent updates to the virus knowledge resource.

Le Mercier, Philippe, SIB Swiss Institute of Bioinformatics

ViralZone (viralzone.expasy.org) is a web resource that links virus sequence data with biological knowledge. It contains 702 viral fact sheets that provide an overview of virus biology, illustrations of virions, genomes, and viral biological processes such as viral replication cycles and host-virus interactions, and links to external resources. Virus diversity is still under active exploration and ViralZone is continually updated to reflect the current knowledge in the field. In 2018, 80 new fact sheets were created to comply with the addition of new genera/families by the International Committee on Taxonomy of Viruses (ICTV). Other recent developments in ViralZone include the addition of data on host receptors for viral entry into target cells, including 257 host-virus interactions. Human viral receptors (attachment or entry receptors) comprise 56 human proteins and 12 kinds of carbohydrates targeted by 58 different viruses.

SIGNOR (http://signor.uniroma2.it) -- the SIGnaling Network Open Resource -- is a manually curated database that captures, organizes and displays signaling information as binary causal relationships between biological entities (proteins, chemicals, protein families, complexes, small molecules, phenotypes and stimuli). These relationships are displayed as signed directed graphs in a viewer application that places entities in specific compartments (extracellular, membrane, cytoplasm, nucleus). SIGNOR annotates about 20,000 interactions between 5,000 biological entities, maintaining the link to the published experiments that support each interaction. The data in SIGNOR can be freely explored in the web interface or downloaded for local analysis. Users can upload a user-defined list of proteins and query the database for causal relationships that link the proteins in the query list. A similar approach is implemented in DISNOR (https://disnor.uniroma2.it/), a new resource that uses a comprehensive collection of disease-associated genes, as annotated in DisGeNET, to interrogate SIGNOR in order to assemble disease-specific logic networks linking disease-associated genes by causal relationships. DISNOR is an open resource where more than 4000 disease networks, linking ~2800 disease genes, can be explored. For each disease curated in DisGeNET, DISNOR links disease genes through manually annotated causal relationships and the inferred 'patho-pathways' can be visualised at different levels of complexity.
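The list-based query described above can be pictured with a small sketch. It is only an illustration under stated assumptions: the `Relation` record, the `relations_linking` function and the entity names are hypothetical and do not reflect SIGNOR's actual data model or API. Each causal relationship is treated as a signed, directed edge, and the query keeps the edges whose two endpoints both appear in the user's list.

```python
from typing import Iterable, List, NamedTuple


class Relation(NamedTuple):
    """One binary causal relationship (hypothetical record layout)."""
    source: str      # regulating entity
    target: str      # regulated entity
    sign: int        # +1 for a positive effect, -1 for a negative effect
    reference: str   # placeholder for the supporting publication


def relations_linking(query: Iterable[str],
                      relations: Iterable[Relation]) -> List[Relation]:
    """Return the relations whose source and target are both in the query list."""
    wanted = set(query)
    return [r for r in relations if r.source in wanted and r.target in wanted]


# Toy data with made-up entity names and references.
TOY_RELATIONS = [
    Relation("A", "B", +1, "ref-1"),
    Relation("B", "C", -1, "ref-2"),
    Relation("C", "D", +1, "ref-3"),
]

linked = relations_linking(["A", "B", "C"], TOY_RELATIONS)
```

Here `linked` holds the A to B and B to C relations, while C to D is dropped because D is not in the query list. A fuller treatment might also search for indirect paths through entities outside the list, which this sketch does not attempt.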

The UniProt Knowledgebase (UniProtKB, https://www.uniprot.org) is a comprehensive, high-quality and freely accessible resource of protein sequences and functional information. Here we describe our work on the enhanced annotation of plant and fungal pathways for natural product biosynthesis in UniProtKB using the Rhea resource of biochemical reactions (https://www.rhea-db.org). Plants and fungi produce an enormous variety of natural products with extremely diverse molecular structures and activities. These natural products may have interesting medicinal properties (as antibiotics, anti-cancer treatments, analgesics or immune-suppressive drugs) as well as applications in the agronomy (as insecticides, fungicides and more), food (as flavors or pigments, for example) and energy (as biofuels) sectors. Linking these chemicals to their natural biosynthetic pathways in UniProt via Rhea will facilitate efforts to study their biology, and to produce them and their derivatives at industrial scales. We will describe examples of curated pathways for compounds ranging from the first natural antibiotic isolated, the terpenoid mycophenolic acid of Penicillium species, through anti-cancer drugs such as vincristine and vinblastine in Catharanthus roseus, an exotic plant found only in Madagascar, to lycopadiene, a tetraterpenoid biofuel of the microalga Botryococcus braunii. These and many other examples highlight the incredible value that the careful study and curation of non-model organisms from all branches of the tree of life can provide.

26 Functional annotation of dementia-related miRNAs using the Gene Ontology

Lovering, Ruth, University College London

To understand the basis of disease it is crucial to know the functions of the genes involved and the pathways they act in. MicroRNA regulation of cellular processes is a relatively new field of study, but there is intense interest in this field, due to the potential use of microRNAs as therapeutic agents and biomarkers. Unfortunately, most resources available for microRNA are limited in scope and quality. The association of Gene Ontology (GO) terms with gene products has proven to provide a highly effective resource for large-scale analysis of biomedical datasets, but until recently there has been no substantial effort dedicated to applying GO terms to microRNAs. We have recognised this gap and for the past 4 years we have been curating microRNAs using Gene Ontology. We are now focused on the annotation of dementia-relevant microRNAs and we have made considerable progress towards creating a publicly accessible bioinformatic resource for these RNAs. Following the review of over 130 peer reviewed papers we have created 670 annotations for 140 microRNAs. For example, we have captured the role of 6 individual microRNAs that regulate the levels of TNF, a cytokine which has been shown to activate microglial cells. We will illustrate how our functional annotations can be used to visualise the roles of individual microRNAs in a dementia-relevant molecular interaction network, thereby demonstrating that this resource will be a valuable addition to the advancement of microRNA research and may be used to predict proteins with a role in dementia.

27 LncBook: a curated knowledgebase of human long non-coding RNAs

Ma, Lina, Beijing Institute of Genomics, CAS

Long non-coding RNAs (lncRNAs) have significant functions in a wide range of important biological processes. Although the number of known human lncRNAs has dramatically increased, they are poorly annotated, which makes it difficult to understand their functional significance and to elucidate their complex molecular mechanisms. Here, we present LncBook (http://bigd.big.ac.cn/lncbook), a curated knowledgebase of human lncRNAs that features a comprehensive collection of human lncRNAs and systematic curation of lncRNAs by multi-omics data integration, functional annotation and disease association. In the present version, LncBook houses 270 044 lncRNAs and includes 1867 featured lncRNAs with 3762 lncRNA-function associations. It also integrates an abundance of multi-omics data from expression, methylation, genome variation and lncRNA-miRNA interaction. In addition, LncBook incorporates 3772 experimentally validated lncRNA-disease associations and further identifies a total of 97 998 lncRNAs that are putatively disease-associated. Collectively, LncBook is dedicated to the integration and curation of human lncRNAs as well as their associated data and thus bears great promise to serve as a valuable knowledgebase for worldwide research communities.

28 NLM’s Conserved Domain Database (CDD): current curation efforts

Marchler-Bauer, Aron, National Institutes of Health

NLM’s Conserved Domain Database (CDD) annotates protein sequences with the positions of conserved domain footprints, and the functional sites inferred from such footprints. It is a well-established and widely used resource. CDD maintains an archive of evolutionarily conserved protein domain models, as multiple sequence alignments (MSAs) converted into position-specific score matrices (PSSMs). Sequences are matched to these PSSMs using RPS-BLAST. CDD’s collection of domain and protein family models includes those imported from external providers (Pfam, SMART, COG, PRK, TIGRFAMs), and those developed in-house. For the latter, curation staff use 3D structure to confirm distant evolutionary relationships and to refine MSAs. They iteratively recruit and incorporate quality sequence data, and further refine MSAs for phylogenetic analysis to produce hierarchical classifications of functionally distinct families and sub-families, enriched with functional site annotation obtained from literature and 3D structure data. These hierarchical classifications undergo two rigorous validations by independent curators before being released into the public domain. CDD supports both a live search service for protein and nucleotide queries and pre-computed domain and site annotation for most of the protein sequences tracked by NCBI’s Entrez system. In-house curators have more recently embarked on curating specific domain architectures (SDAs, the sequential order of conserved domains in a protein sequence), using SPARCLE: Subfamily Protein Architecture Labeling Engine, to provide functional characterization and labeling of protein sequences that have been grouped by their characteristic SDA. CDD can be accessed at https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml. This work was supported by the Intramural Research Program of the National Library of Medicine at the National Institutes of Health/DHHS.

29 Yeast Complexome - The Complex Portal rising to the challenge

Meldal, Birgit, EMBL-EBI

The Complex Portal (www.ebi.ac.uk/complexportal) is an encyclopedia of macromolecular complexes from 20 model organisms that is manually curated based on literature evidence. In 2018 we completed the first draft of all Saccharomyces cerevisiae complexes, also known as the ‘yeast complexome’. Throughout the project we worked in close collaboration with SGD and UniProt curators to build a list of all known and predicted complexes in yeast. It contains nearly 600 well defined complexes including super-structures like polymerases and ribosomal processomes where complexes act as participants of larger complexes. In total, approximately 30% of the yeast proteome is represented in these stable complexes. Additionally, we have roughly 30 potential complexes on a ‘watch list’ for which we could not find enough evidence for their existence in vivo. During the course of this project, 1382 new, manually assigned Gene Ontology (GO) annotations specific to complexes as entities have been produced. This dataset is now being used as a gold standard to validate complexes predicted by topological analyses of networks derived from large-scale experiments and for GO enrichment analysis of participant proteins and genes. Additional complexomes are in preparation, including human, mouse and C. elegans, with that of Escherichia coli scheduled to be completed in 2019.

Michaloud, Joumana, IMGT, the international ImMunoGeneTics information system, Institut de Génétique Humaine IGH, UMR 2009 UM-CNRS

IMGT®, the international ImMunoGeneTics information system®, http://www.imgt.org, is the global reference in immunogenetics and immunoinformatics.

32 Experimental tools: a new way to categorise transgenic alleles in FlyBase.

Millburn, Gillian

The long history of fly research plus the sophisticated range of applicable genetic engineering techniques mean that a large number of increasingly complex transgenic fly lines have been generated and described in the literature. While this rich genetic tool-kit helps to make Drosophila melanogaster an ideal model organism to answer a wide range of biological questions, it also creates a potential problem - how to find the most appropriate fly line for a particular experiment from the large set that are available. To help address this issue, FlyBase has recently introduced the 'experimental tool' data class. Reports have been generated for commonly used sequences with useful properties that are exploited to study the biological function of another gene product or a biological process. These include tools that enable a gene product to be detected (e.g. the FLAG tag, EGFP, mCherry), target a gene product somewhere specific within a cell (e.g. nuclear localisation signal, signal sequence), drive expression in a binary system (e.g. UAS, GAL4) or enable clonal/conditional expression (e.g. FLP, FRT). Controlled vocabulary terms are used to describe the common uses for each tool (e.g. epitope tag, green fluorescent protein, recombinase). By linking the appropriate experimental tools to transgenic constructs and insertions, researchers can more easily identify constructs and fly stocks that have the particular characteristics they are interested in. Experimental tools have been linked to existing transgenic construct alleles in the database and this curation is now included as a standard part of the curation of new genetic reagents. As of the FB2018_06 FlyBase release, entries have been made for 415 tools, and these have been linked to 110,752 transgenic alleles.

Rhea (http://www.rhea-db.org) is an expert curated resource of biochemical reactions that uses the ChEBI (Chemical Entities of Biological Interest) ontology of small molecules to describe reaction participants, their chemical structures, and transformations. Rhea currently includes over 11,500 reactions (curated from a similar number of publications), covering reactions of the Enzyme Classification of the IUBMB and thousands more. Rhea is now the reference vocabulary to annotate enzyme-catalysed reactions in UniProtKB (see accompanying poster “Enhanced enzyme annotation in UniProtKB using Rhea”). It also provides reaction data for many other resources such as the metabolomics repository MetaboLights, the Enzyme Portal, the SwissLipids knowledgebase for lipid biology (described in the poster “The SwissLipids knowledge resource for lipid biology”) and the metabolic modelling platform MetaNetX. Here we describe some of our latest work in Rhea, including continued curation of the reaction dataset, work on improving the alignment of Rhea with the Gene Ontology (GO) and Reactome, and the development of an RDF representation of Rhea and a SPARQL endpoint to serve it (https://sparql.rhea-db.org/sparql). Together these developments will further enhance the utility of Rhea as a means to link biology and chemistry and knowledge resources in both domains.
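As a rough illustration of how the SPARQL endpoint mentioned above might be queried, the sketch below builds a standard SPARQL-protocol GET request. Only the endpoint URL comes from the abstract. The `rh:` namespace and the `rh:Reaction` and `rh:equation` terms are assumptions about the Rhea RDF vocabulary and should be checked against the Rhea documentation, and the function names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Endpoint as given in the abstract.
RHEA_SPARQL_ENDPOINT = "https://sparql.rhea-db.org/sparql"


def build_query(limit: int = 5) -> str:
    """Build a query listing reactions with their equations.

    The rh: prefix and predicates are assumed, not taken from the abstract.
    """
    return (
        "PREFIX rh: <http://rdf.rhea-db.org/>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?reaction ?equation WHERE {\n"
        "  ?reaction rdfs:subClassOf rh:Reaction .\n"
        "  ?reaction rh:equation ?equation .\n"
        f"}} LIMIT {int(limit)}"
    )


def build_request_url(limit: int = 5) -> str:
    """Encode the query as the 'query' parameter of a SPARQL-protocol GET."""
    return RHEA_SPARQL_ENDPOINT + "?" + urlencode({"query": build_query(limit)})


def fetch_reactions(limit: int = 5) -> dict:
    """Send the request and parse a JSON result set (requires network access)."""
    request = urllib.request.Request(
        build_request_url(limit),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Calling `fetch_reactions(3)` would contact the live service, so it is left uncalled here. `build_request_url(3)` can be inspected offline to see the encoded query.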

34 DDBJ (DNA Data Bank of Japan) Activity

OKIDO, Toshihisa, Bioinformation and DDBJ Center, National Institute of Genetics

DDBJ (DNA Data Bank of Japan) at the National Institute of Genetics (NIG) has been a member of the International Nucleotide Sequence Database Collaboration (INSDC) for over 30 years. In addition to the INSDC activity, DDBJ manages additional databases in collaboration with the National Bioscience Database Center (NBDC) and the Database Center for Life Science (DBCLS) in Japan. Since 2013, the Japanese Genotype-phenotype Archive (JGA) has provided human genotype and phenotype data with signed consent agreements authorizing data usage for specific research. The JGA is access-controlled and contains raw data from NGS technology or array-based platforms, images, and metadata for research projects. In 2018, the Genomic Expression Archive (GEA) was launched to serve functional genomics data including gene expression, epigenetics and genotyping SNP arrays in the MAGE-TAB format, compliant with the MIAME and MINSEQE guidelines. For the submission of prokaryote genomic sequences, DDBJ recommends using DFAST (DDBJ Fast Annotation and Submission Tool, https://dfast.nig.ac.jp) as the annotation pipeline to prepare ready-to-submit data. DDBJ also offers Japanese researchers the NIG supercomputer system, with useful bioinformatics tools, to analyze massive data. Private information can be stored in the DDBJ Group Cloud (a paid service), where data can be shared among restricted users only. One instance of this service is the AMED Genome group sharing Database (AGD). In 2018, we started an additional service that provides a secure analysis environment composed of 16 computation nodes (AMD Epyc, 512 GB memory each). A user can download controlled-access data from JGA/AGD through a high-speed network and analyze them together with private data. In summary, DDBJ continues developing databases and the computer system for the progress of life sciences.

35 Building non coding RNA networks in IntAct: from yeast to human

Panni, Simona, DiBEST University of Calabria

In the last decade, an increasing number of studies have reported on the involvement of ncRNAs in various physiological processes and demonstrated their crucial role as epigenetic regulators. NcRNAs have been suggested as promising targets for the treatment of many human pathologies; the availability in public repositories of the molecular interaction networks that involve ncRNAs would therefore give researchers the opportunity both to design better experiments and to investigate therapeutic interventions. However, in comparison to PPI networks, no standardized representation has been developed for RNA interactions, and data from different sources are difficult to compare. Since 2002, the HUPO Proteomics Standards Initiative (HUPO-PSI) has provided a well-defined annotation system for molecular interactions, standardizing the minimal information requirements to describe an interaction experiment and defining the syntax of terms used for protein interaction annotation, to allow the sharing of data from different resources to build better defined networks. The IntAct team has recently started a project focused on the development of similar standards for the capture and annotation of ncRNA interactions. In this public resource, the knowledge about RNA, proteins or genes involved in the interaction is integrated with a detailed description of the cell types, tissues, experimental conditions and effects of mutagenesis, providing a computer-interpretable summary of the published data integrated with the huge amount of protein interactions already gathered in the database (database website https://www.ebi.ac.uk/intact/). In order to accomplish this, ncRNAs are annotated with RNAcentral identifiers, which allow ncRNA sequences to be identified unambiguously. This effort will provide high-quality, reliable networks for the advancement of ncRNA research, for example to identify specific hubs to engineer to modulate gene expression or to predict off-target effects.

Molecular interaction (MI) networks provide maps to explore cellular processes from a systems perspective. Combining them with genomic variation data can bring in-depth insight into the challenge of understanding the effector mechanisms of amino acid variation. The IMEx Consortium (www.imexconsortium.org) is an international collaboration between databases that curate MI data from the scientific literature, represent it with full experimental detail and make it freely available for the scientific community. Over the last 14 years, IMEx curators have collectively annotated 900,000 physical binary interactions, assigning details such as kinetic parameters, variable experimental conditions or construct details, including binding interfaces and mutations that affect interactions. Leveraging the IMEx detailed curation model, we have compiled a data set of over 40,000 annotations of protein mutations affecting interaction and made it freely available at www.ebi.ac.uk/intact/resources/datasets#mutationDs. The data set features information about the amino acid changes, their effect on the interaction and a full reference to the experimental interaction evidence from which it was extracted. Over 22,000 unique sequence changes, affecting 4500 proteins from 300 different species, are reported. Around 75% of the annotations are mapped to human proteins, providing high-quality experimental evidence of sequence change effects which directly relate to existing variation data. We present the latest updates of the data set, along with future perspectives and developments such as its integration within existing variation annotation tools, like ENSEMBL’s VEP; its extension to include mutations in DNA/RNA as interacting partners; and our plans to increase accessibility. This openly available resource is an invaluable tool with immediate applications in the study of variation impact on the interactome, interaction interfaces and previously un-annotated variants, among other key questions.

Genome Properties (GPs) is a resource that predicts functional features such as biochemical pathways and complexes in sequenced genomes. It utilises InterPro to determine the presence or absence of constituent proteins for each property, and as such functions as an InterPro companion resource. Using GPs to compare, for example, bacterial genomes, it is possible to rationalize the function of thousands of individual genes down to a few hundred relevant properties, thus streamlining the process of phylogenetic profiling. As a result, GPs lends itself to the large-scale analysis of uncharacterised genomes, such as those generated from metagenomic studies. All data in GPs are freely available in a way that facilitates collaborative curation of GPs, and web interfaces are provided for interrogating the GP data. Comparing GPs coverage of genomes with other pathway resources, such as KEGG and SEED Systems, reveals complementarity of coverage, although each resource provides unique coverage. The strength of GPs comes from the scalability of the approach and the ability to annotate lesser-studied species, underlining its utility in broadly annotating diverse and emerging organisms, such as those being identified using metagenomics.
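The presence/absence logic described above can be sketched as follows. This is a simplified illustration and not the resource's actual rule set: the `Step` structure, the YES / PARTIAL / NO outcome labels, the `evaluate_property` function and the signature accessions are all assumptions made for the example. A property is modelled as a list of steps, each satisfied if the genome has a match to at least one of the step's signatures.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Step:
    """One constituent of a property (hypothetical structure)."""
    name: str
    evidence: FrozenSet[str]   # signature accessions; any one match satisfies the step
    required: bool = True      # optional steps do not affect the outcome here


def evaluate_property(steps: List[Step], genome_matches: Set[str]) -> str:
    """Classify a property from the signatures found in one genome."""
    required = [s for s in steps if s.required]
    satisfied = [s for s in required if s.evidence & genome_matches]
    if required and len(satisfied) == len(required):
        return "YES"       # every required step has supporting evidence
    if satisfied:
        return "PARTIAL"   # only some required steps are supported
    return "NO"


# Made-up property and placeholder accessions, for illustration only.
TOY_PROPERTY = [
    Step("step 1", frozenset({"SIG_0001"})),
    Step("step 2", frozenset({"SIG_0002", "SIG_0003"})),
    Step("optional step", frozenset({"SIG_0004"}), required=False),
]

complete = evaluate_property(TOY_PROPERTY, {"SIG_0001", "SIG_0003"})
partial = evaluate_property(TOY_PROPERTY, {"SIG_0001"})
absent = evaluate_property(TOY_PROPERTY, {"SIG_0004"})
```

Applying the same function to many genomes yields one outcome per genome and property, which is the kind of compact profile that the comparison of thousands of genes down to a few hundred properties relies on.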

FungiDB (http://FungiDB.org) is a free online database that enables data mining and analysis of the pan-fungal and oomycetes genomic sequences and functional data. FungiDB is part of the Eukaryotic Pathogen Bioinformatics Resource Center (http://EuPathDB.org) and contains genome sequence and annotation for over 130 species including pathogenic species from the Cryptococcus, Histoplasma, and Coccidioides genera. In addition to genomic sequence data and annotation, FungiDB includes whole genome polymorphism data, transcriptomic data based on RNA sequence, microarray experiments, MS-based proteomics data, ChIP-seq, metabolic pathways and all expressed sequence tag data from GenBank. All genomes in FungiDB are run through a standard analysis pipeline that generates additional data such as signal peptide and transmembrane domain predictions, GO term and EC number associations and orthology profiles. Selected genomes are actively curated at FungiDB and functional annotations including gene names/synonyms, product descriptions, EC numbers and GO terms are regularly integrated into FungiDB. We also have integrated several phenotypic datasets and implemented search pages to query these phenotypic data. Input from the community (images, files, PubMed records, etc.) can be added to FungiDB records (i.e. gene pages) via user comments; these comments become immediately visible and searchable. User comments are regularly reviewed for curated organisms, which results in improved functional annotations. An additional feature of FungiDB is the availability of a private 'User Workspace' which permits researchers to analyze their own data using Galaxy workflows and integrate the results back into FungiDB, where they can examine them in the context of publicly-accessible datasets. Analytical tools and sophisticated 'Search Strategies' support in silico queries against a wealth of integrated and automatically generated data, enabling scientists to ask their own questions and develop testable hypotheses.

Sinclair, Michael S., Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA

The Human Disease Ontology (DO) is a standardized representation of human disease based on etiology, affected anatomical site, genetic factors, cell type of origin, and clinically significant phenotypes and symptoms. We are enhancing the DO’s clinical utility by integrating the diverse and interconnected drivers of disease to better represent complex diseases, cancers and metabolic diseases by linking DO terms with those of related biomedical ontologies via highly structured design patterns. The DO contains logical definitions (axioms) to describe relevant disease drivers, constructed with a specific Relations Ontology (RO) term as an OWL object property to create a restriction between a DO term and another OBO Foundry ontology term. For example, a carcinoma is defined, with an Equivalent Classes axiom, as a ‘cell type cancer’ (DO) and ‘derives from’ (RO) some ‘epithelial cell’ (Cell Ontology). Thus the relatedness of each organ system cancer with an epithelial cell of origin (e.g. trachea carcinoma and colorectal carcinoma) is defined via an ‘inferred’ parentage to carcinoma. Defining these relationships in the DO thus enables enhanced querying to identify DO terms sharing a common disease driver as a taxonomy of linked diseases, allowing the user to investigate previously undocumented, indirect links between diseases. Here, we present the development and implementation of the DO’s structured design pattern for genetic factors of disease defined with the Sequence Ontology (SO) and the DO’s exploration of a transition to the Molecular Sequence Ontology (MSO, an SO companion ontology). The addition of highly structured design patterns for a comprehensive set of axioms linking DO and SO/MSO entities will enable powerful queries into diseases with common genetic etiology. When placed alongside other factors in disease such as environment, anatomy, and tissue of origin, such queries will facilitate a more in-depth understanding of complex disease.

40 The Drosophila nuclear pore complex - a one-stop shop for all we know about it, UniProtKB

Speretta, Elena, EMBL-EBI

The UniProtKB database provides the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information. It facilitates scientific discovery by organising biological knowledge and enabling researchers to rapidly comprehend complex areas of biology. The Drosophila melanogaster annotation program at UniProtKB focuses on the manual annotation of characterised proteins and contains 3,549 reviewed entries. This number continues to increase as new information becomes available. Here we present a systematic review of the Drosophila nuclear pore complex, one of the largest eukaryotic macromolecular assemblies, which forms channels through the nuclear membrane. Recent studies have shown that the functions of the nuclear pore complex go beyond regulating trafficking; several nucleoporins, which are components of the nuclear pore complex, are involved in chromatin organization, regulation of gene expression and DNA repair. These new functions give the nuclear pore complex an additional role as regulator of many pathways relevant to disease. In this presentation we will describe the annotation of 35 potential Drosophila nucleoporins as an example of how UniProtKB manual biocurators collect and integrate data from a range of resources such as scientific literature, protein sequence analysis tools, other databases and automatic annotation systems to provide a comprehensive and critical review of the most current biological information for a protein. This systematic effort has improved information consistency and searchability for nucleoporins within the database and, once more, makes UniProtKB a unique platform to review all publicly available information and a sketching board for data-driven hypothesis testing.

41 ChlamBase: a Wikidata-backed genome database for the Chlamydia research community

Su, AndrewScripps Research

The accelerating growth of genomic and proteomic information for Chlamydia species, coupled with unique biological aspects of these pathogens, necessitates bioinformatic tools and features that are not provided by major public databases. To meet these growing needs, we developed ChlamBase, a model organism database for Chlamydia that is built upon the WikiGenomes application framework, and Wikidata, a community-curated database. ChlamBase was designed to serve as a central access point for genomic and proteomic information for the Chlamydia research community. ChlamBase integrates information from numerous external databases, as well as important data extracted from the literature that are otherwise not available in structured formats that are easy to use. In addition, a key feature of ChlamBase is that it empowers users in the field to contribute new annotations and data as the field advances with continued discoveries. ChlamBase is freely and publicly available at chlambase.org.

42 Protein Structures and their features in UniProt

Tyagi, NidhiEMBL-EBI

Annotation of proteins based on structure-based analyses is an integral component of the UniProt Knowledgebase (UniProtKB). There are over 100,000 experimentally determined 3-dimensional structures of proteins deposited in the Protein Data Bank. UniProt works closely with the Protein Data Bank in Europe (PDBe) to map these 3D structural entries to the corresponding UniProtKB entries. SIFTS (Structure Integration with Function, Taxonomy and Sequences), which is a collaboration between PDBe and UniProt, facilitates the link between the structural and sequence features of proteins by providing correspondence at the level of amino acid residues. A pipeline combining manual and automated processes for maintaining up-to-date cross-reference information has been developed and is run with every weekly PDB release. Various criteria are considered to cross-reference PDB and UniProtKB entries, such as (a) the degree of sequence identity (>90%), (b) an exact taxonomic match (at the level of species, subspecies and specific strains for lower organisms), (c) preferential mapping to a curated Swiss-Prot entry, (d) mapping to proteins from a Reference/Complete proteome and (e) mapping to the longest protein sequence. Some cases are inspected manually by a UniProt biocurator using a dedicated curation interface to ensure accurate cross-referencing. These cases include short peptides, chimeras, synthetic constructs and de novo designed polymers. To date, UniProt has successfully completed the non-trivial and labour-intensive exercise of cross-referencing ~410,000 polypeptide chains (138,899 PDB entries) to 45,676 UniProtKB entries. Structural information in UniProtKB enables non-expert users to see protein entries in the light of relevant biological context such as metabolic pathways, genetic information, molecular functions, conserved motifs and interactions.
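As a rough illustration of how criteria (a) to (e) above might be combined, here is a minimal Python sketch. It is not the SIFTS pipeline. Treating sequence identity and taxonomic match as hard filters, and applying the remaining criteria as tie-breakers in the order listed, are assumptions made for this example, and the `Candidate` class, its field names and the accessions are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """A hypothetical UniProtKB entry considered for one PDB chain."""
    accession: str
    identity: float            # sequence identity as a fraction, 0.0 to 1.0
    same_taxon: bool           # exact taxonomic match with the PDB entry
    reviewed: bool             # curated Swiss-Prot entry
    reference_proteome: bool   # belongs to a Reference/Complete proteome
    length: int                # protein sequence length


def pick_cross_reference(
    candidates: Iterable[Candidate], min_identity: float = 0.90
) -> Optional[Candidate]:
    """Pick one candidate, or None if nothing passes the filters."""
    # Assumption: (a) and (b) act as hard filters.
    eligible = [
        c for c in candidates if c.identity > min_identity and c.same_taxon
    ]
    if not eligible:
        return None
    # Assumption: (c), (d), (e) break ties in that order of priority.
    return max(
        eligible,
        key=lambda c: (c.reviewed, c.reference_proteome, c.length),
    )


example = [
    Candidate("HYPO_A", 0.95, True, False, True, 500),
    Candidate("HYPO_B", 0.99, True, True, False, 300),
    Candidate("HYPO_C", 1.00, False, True, True, 800),  # wrong taxon
    Candidate("HYPO_D", 0.85, True, True, True, 900),   # identity too low
]
best = pick_cross_reference(example)
```

Here `best` is the reviewed entry `HYPO_B`, because the two higher-scoring alternatives are filtered out before ranking. Cases such as chimeras or synthetic constructs, which the abstract says go to manual inspection, are outside what a rule like this could decide.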

43 Developing a novel approach to characterize genes essential to the function of a tissue

Vallat, BastienUNIL

What are the genes responsible for the function of an organ? During evolution, are the same genes contributing to the development of similar organ functions, or are new genes recruited in a convergent evolution fashion? To answer these questions, we need to accurately identify the genes with a unique expression contribution to any given organ or condition. To do so, differential gene expression analyses are often used, but they are limited in the set of conditions that can be compared in an experiment. Top expressed genes can also be used, but they are often dominated by ubiquitous genes which are not informative on a specific organ. No bioinformatics resource exists which provides an organ or tissue-view of gene expression patterns, as opposed to a gene-centric view, over many species. Thanks to manual organ annotations based on the UBERON ontology, the Bgee database (https://bgee.org/) contains all information needed to provide this organ-centric view of gene expression. We present (i) a metric to identify genes with a specific expression contribution to an organ function or to any specific condition; and (ii) an “organ page” in Bgee. This page can be used in conjunction with other resources (e.g. organ specific cancers) to understand the healthy and pathological function of organs, and their evolution.

44 Functional annotations in the PDBe Knowledge Base (PDBe-KB)

Varadi, MihalyEMBL-EBI

New technologies are driving the expansion of structural data deposited in the Protein Data Bank, with over 145,000 structures referencing more than 45,000 unique UniProtKB entries. However, the inherent value of macromolecular structures can only be fully realised when they are examined in their biological context. PDBe-KB (Protein Data Bank in Europe - Knowledge Base; https://pdbe-kb.org), established in 2018, is an international, community-driven resource with the primary goal of collating and making available structural and functional annotations contributed by partner resources. It is managed by the Protein Data Bank in Europe team at EMBL-EBI. Partner groups provide computational predictions for ligand binding sites, catalytic sites, protein-protein interfaces and post-translational modification sites, as well as selected physico-chemical parameters (solvent accessibility, residue depth). Literature-based manual curations of functional sites and quaternary structure assemblies are also being added, while predictions of effects of residue mutations will be contributed in 2020. The annotations are stored in a highly interconnected graph database, which enables comparison between prediction methods, facilitates data exchange and allows insights into biological function by providing a comprehensive view of the functional context of the protein structure. Functional annotations are made available on PDBe-KB web pages, programmatically via API and as a distributed Neo4j graph database. As a first example, PDBe-KB is launching structure pages for a UniProtKB accession, referenced by PDB structures.

45 Autophagy Targeted Curation

Varusai, ThawfeekEuropean Bioinformatics Institute (EMBL-EBI)

Autophagy is the lysosomal degradative process of cellular components and is essential for cell survival, differentiation, development and homeostasis. Dysfunctional autophagy results in various pathologies including neurodegenerative diseases, infectious diseases and cancers. Several forms of autophagy have been identified to date, including Macroautophagy (MA), Chaperone Mediated Autophagy (CMA) and Microautophagy (MI), and this list is expanding. With growing knowledge of the subject and high clinical relevance, there is a need to consolidate information on autophagy mechanisms in a comprehensive, reliable and usable fashion. Reactome is a freely available, manually curated human pathway database that provides an interactive service to search, investigate and download data. Well-characterized mechanisms of MA and Mitophagy were curated and made available in Reactome a few years ago. With increasing mechanistic knowledge in the field, in 2018 we completely updated and revised the autophagy pathway. We have now dedicated a separate top-level pathway to autophagy with several new sub-pathways including CMA, MI and Lipophagy. This pathway contains a detailed map of autophagy events ranging from selective degradation of proteins to dynamic membrane reorganizations. Reactome provides thoroughly curated, annotated and reviewed information on the autophagy pathway. Every mechanistic detail in the process is curated and annotated with literature references that contain experimental confirmation. As molecular mechanisms are unveiled in future, we plan to add Glycophagy, Xenophagy and Reticulophagy to this list. Users can treat Reactome as a starting point to swiftly understand the complex biology of autophagy at a molecular level. Additionally, analysis tools in Reactome can be used to investigate empirical data. Taken together, Reactome provides a one-stop pathway guide and analysis platform of currently established autophagy mechanisms for researchers and clinicians.

46 Experiment-based computational method for proper annotation of the molecular function of enzymes

Veronique, de BerardinisCEA/Genoscope/UMR8030

The rate of protein functional elucidation lags far behind the rate of gene and protein sequence discovery, leading to an accumulation of proteins with no known function. In addition, only a tiny fraction of enzymes have experimentally established functions. The function is often extrapolated from a small number of characterized proteins to all members of a family, leading to over-annotation (de Crecy-Lagard et al., 2016; Schnoes et al., 2009). Here, two examples of an integrated strategy to highlight the functional diversity within protein families, which is underestimated most of the time, will be presented. This approach relies on high-throughput enzymatic screening of representatives, structural and modeling investigations based on the Active Site Clustering Method (de Melo-Minardi et al., 2010), and analysis of genomic and metabolic context. We investigated a protein family with no known function, the DUF849 Pfam family, and unearthed key residues for 14 potential new enzymatic activities including 4 physiological ones, leading to the designation of these proteins as β-keto acid cleavage enzymes (Bastard et al., 2014). The second study will illustrate that proteins with high sequence similarity might not have the same function. We determined the enzymatic activities of 100 representative O-acyl-L-homoserine transferases of the two unrelated families, MetX and MetA, involved in the first step of methionine biosynthesis and assumed to always use acetyl-CoA and succinyl-CoA, respectively. This strategy allowed us to identify the specificity-determining positions responsible for acyl-CoA specificity in the active sites of MetX and MetA enzymes, which are actually iso-functional for both activities. We then predict that >60% of the 10,000 sequences from these families currently in databases are incorrectly annotated. Finally, we uncovered a divergent subgroup of MetX that participates only in L-cysteine biosynthesis as O-succinyl-L-serine transferases (Bastard et al., 2017).

47 Making curated genome annotations available for expression calls of RNA-Seq

Wollbrett, JulienSIB, UNIL

Bgee (https://bgee.org/) is a database to retrieve and compare gene expression patterns in 29 animal species, produced from multiple data types (RNA-Seq, Affymetrix, in situ hybridization, and EST data), exclusively based on healthy wild-type samples. The Bgee pipeline (https://github.com/BgeeDB/bgee_pipeline) detects active signal of expression from RNA-Seq data, so that presence/absence expression calls can be integrated over different libraries and experiments into Bgee. It uses intergenic regions to estimate the background transcriptional noise in the sample analyzed, to define a non-arbitrary threshold for the presence/absence of expression. As gene annotation quality varies greatly between species, it is important to ensure that the intergenic regions used are not annotation artefacts. We define “true” intergenic as regions without gene annotation and with very low (background) expression over all RNA-Seq libraries analyzed. We define “false” intergenic as regions without gene annotation but with higher expression, presumably because they contain non-annotated genes, most likely non-coding genes. Only the true intergenic regions are taken into account to generate the presence/absence threshold. We have adapted this pipeline to create a new R package called BgeeCall. The purpose of BgeeCall is to allow users to analyse their own RNA-Seq data to generate accurate presence/absence expression calls, taking advantage of the set of true intergenic regions defined by Bgee. This is the first R package to provide rigorous presence/absence calls for RNA-Seq, providing an alternative to frequently used arbitrary cut-offs such as 1 FPKM or 2 TPM. Moreover, the set of intergenic regions labeled as probably incorrect could help refine genome annotations. This package will be made available on Bioconductor and through the Bgee GitHub.
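The idea of deriving a per-library threshold from background signal, rather than from a fixed cut-off, can be sketched in a few lines of Python. This is only an illustration of the general principle. The abstract does not state how the threshold is computed, so the quantile rule below is an assumption and is not the Bgee pipeline's or BgeeCall's actual algorithm. All function names, parameter names and values are hypothetical.

```python
from typing import Dict, Sequence


def presence_threshold(
    intergenic_tpm: Sequence[float], quantile: float = 0.95
) -> float:
    """Estimate a background level from 'true' intergenic regions.

    Assumption for this sketch: the threshold is an upper quantile of the
    abundance observed in true intergenic regions of the same library.
    """
    if not intergenic_tpm:
        raise ValueError("need at least one intergenic measurement")
    ordered = sorted(intergenic_tpm)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]


def call_expression(
    gene_tpm: Dict[str, float],
    intergenic_tpm: Sequence[float],
    quantile: float = 0.95,
) -> Dict[str, str]:
    """Label each gene 'present' if it exceeds the background threshold."""
    threshold = presence_threshold(intergenic_tpm, quantile)
    return {
        gene: "present" if tpm > threshold else "absent"
        for gene, tpm in gene_tpm.items()
    }


# Hypothetical abundances for one library.
background = [0.0, 0.1, 0.2, 0.3, 0.5, 0.1, 0.0, 0.4, 0.2, 0.6]
genes = {"geneA": 5.0, "geneB": 0.3}
calls = call_expression(genes, background)
```

With these made-up values the threshold is the largest background value, so `geneA` is called present and `geneB` absent. The point of the design is that the cut-off adapts to the noise level of each library instead of being fixed at, say, 1 FPKM or 2 TPM.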

48 NucMap: a database of genome-wide nucleosome positioning map across species

The dynamics of nucleosome positioning affect chromatin state, transcription and all other biological processes occurring on genomic DNA. As MNase-Seq has been used to map nucleosome positioning in eukaryotes in recent years, the amount of nucleosome positioning data is increasing dramatically. To facilitate the usage of published data across studies, we developed a database named nucleosome positioning map (NucMap, http://bigd.big.ac.cn/nucmap). NucMap includes 798 experimental datasets from 477 samples across 15 species. With a series of functional modules, users can search the nucleosome positioning profile at the promoter region of each gene across all samples and perform enrichment analysis on nucleosome positioning data in all genomic regions. A nucleosome browser was built to visualize the profiles of nucleosome positioning. Users can also visualize multiple sources of omics data with the nucleosome browser and make side-by-side comparisons. All processed data in the database are freely available. NucMap is the first comprehensive nucleosome positioning platform and it will serve as an important resource to facilitate the understanding of chromatin regulation.

49 Challenges in the annotation and identification of pseudoenzymes in UniProt Knowledgebase

Zaru, RossanaEMBL-EBI

The UniProt Knowledgebase (UniProtKB) collects and centralises functional information on proteins across a wide range of species. For enzymes, which represent between 20 and 40% of most proteomes, UniProtKB provides additional information on EC classification, catalytic activity, cofactors, enzyme regulation, kinetics and pathways, all based on critical assessment of published experimental data. UniProtKB has recently enhanced the way in which enzyme function is represented and has adopted Rhea as a vocabulary to annotate and represent enzyme-catalysed reactions. Computer-based analysis and structural data are used to enrich the annotation of the sequence with the identification of active sites and binding sites. While the annotation of enzymes is well defined, the curation of pseudoenzymes in UniProtKB has highlighted some challenges: how to identify them, how to assess the experimental evidence for their lack of catalytic activity, how to annotate their lack of catalytic activity in a consistent way and how much can be inferred and propagated from experimental data obtained from other species. Using various examples from the curation of the C. elegans kinome and phosphatome, I will illustrate some of these issues and discuss some of the changes we propose to implement to expand the annotation and discovery of pseudoenzymes. Ultimately, improving the curation of pseudoenzymes will provide the scientific community with valuable information to understand the evolution of these proteins, the aetiology of related diseases and the development of drugs.