Abstract

A report on the Genome Informatics conference, held at the Wellcome Genome Campus Conference Centre, Hinxton, United Kingdom, 19–22 September 2016.

We report a sampling of the advances in computational genomics presented at the most recent Genome Informatics conference. As in Genome Informatics 2014 [1], speakers presented research on personal and medical genomics, transcriptomics, epigenomics, and metagenomics; new sequencing techniques; and new computational algorithms for analyzing ever-larger genomic datasets. Two changes were notable. First, there was a marked increase in the number of projects involving single-cell analyses, especially single-cell RNA-seq (scRNA-seq). Second, while participants continued the practice of presenting unpublished results, many of the presenters had previously posted preprints describing their work on bioRxiv (http://www.bioRxiv.org) or elsewhere. Although earlier in 2016 Berg et al. [2] wrote that “preprints are currently used minimally in biology”, this conference showed that in genome informatics, at least, they are already used quite widely.

Personal and medical genomics

Several talks covered systems and new technologies that clinicians, patients, and researchers can use to understand human genomic variation. Jessica Chong (University of Washington, USA) described MyGene2 (http://mygene2.org), a website that allows families to share their de-identified personal data and find other families with similar traits. Jennifer Harrow (Illumina, UK) discussed using BaseSpace (https://basespace.illumina.com/) for the analysis of clinical sequencing data. Deanna Church (10x Genomics, USA) presented Linked-Reads, a technology that makes it easier to find variants in less accessible genomic regions such as the HLA locus. Several presenters showed new methods to identify the functional effects of sequence variants. Konrad Karczewski (Massachusetts General Hospital, USA) presented the Loss Of Function Transcript Effect Estimator (LOFTEE, https://github.com/konradjk/loftee). LOFTEE uses a support vector machine to identify sequence variants that significantly disrupt a gene and potentially affect biological processes. Martin Kircher (University of Washington, USA) discussed a massively parallel reporter assay (MPRA) that uses a lentivirus for genomic integration, called lentiMPRA [3]. He used lentiMPRA to predict enhancer activity, and to more generally measure the functional effect of non-coding variants. William McLaren (European Bioinformatics Institute, UK) presented Haplosaurus, a variant effect predictor that uses haplotype-phased data (https://github.com/willmclaren/ensembl-vep).
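The value of haplotype-phased prediction is easiest to see with a toy example. The sketch below is illustrative Python, not Haplosaurus itself, and the codon and variants are invented: two variants in the same codon that look like a stop-gain and a synonymous change when assessed one at a time combine, on a single haplotype, into an ordinary missense change.

```python
# Toy illustration (not Haplosaurus): why variant effect prediction
# can change when variants are assessed jointly per haplotype.
CODON_TABLE = {"TCG": "Ser", "TAG": "*", "TCC": "Ser", "TAC": "Tyr"}

def apply_variants(codon, variants):
    """Apply (position, alt_base) substitutions to a codon string."""
    bases = list(codon)
    for pos, alt in variants:
        bases[pos] = alt
    return "".join(bases)

ref = "TCG"    # encodes Ser
v1 = (1, "A")  # C>A at the second codon position
v2 = (2, "C")  # G>C at the third codon position

# Assessed one variant at a time:
print(CODON_TABLE[apply_variants(ref, [v1])])  # "*": predicted stop-gain
print(CODON_TABLE[apply_variants(ref, [v2])])  # "Ser": predicted synonymous

# Assessed jointly, because phasing places both on one haplotype:
print(CODON_TABLE[apply_variants(ref, [v1, v2])])  # "Tyr": missense, not stop
```

A per-variant predictor would flag this allele as loss-of-function; the phased interpretation shows the two changes together encode a single amino-acid substitution.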

In a keynote lecture, Elaine Mardis (Washington University, St Louis, USA) described computational tools and databases created to collect and process cancer-specific mutation datasets. A substantial increase in the amount of clinical sequencing performed as part of cancer diagnosis and treatment has necessitated the development of these tools. She emphasized a shift in the categorization of cancers: previously, oncologists classified cancers by tissue, but increasingly they classify cancers by which genes are mutated. Mardis suggested that we should instead describe cancers by the affected metabolic and regulatory pathways, which can provide insight even into previously unseen disruptions. Such disruption can take the form of genetic mutations, but it can also manifest as other changes to cellular state, which must be measured with other techniques, such as RNA-seq. The tools Mardis described help interpret the mutations identified by sequencing. They include the Database of Curated Mutations (DoCM); Personalized Variant Antigens by Cancer Sequencing (pVAC-seq), a tool for identifying tumor neoantigens from DNA-seq and RNA-seq data; and Clinical Interpretations of Variants in Cancer (CIViC), a platform for crowd-sourcing data on the clinical consequences of genomic variants. CIViC currently holds 1565 evidence items describing the interpretation of genetic variants, and Mardis announced a forthcoming Variant Curation Hackathon to identify more.

Variant discovery and genome assembly

Several speakers presented tools and methods for analyzing genome assemblies and exploring sequence variants. Jared Simpson (Ontario Institute for Cancer Research, Canada) started the second session with an overview of base calling for Oxford Nanopore sequencing data and his group’s contribution to this field, Nanocall (http://github.com/mateidavid/nanocall). Simpson also discussed Nanopolish, which can detect 5-methylcytosine directly from Oxford Nanopore sequencing data, without bisulfite conversion. Kerstin Howe (Wellcome Trust Sanger Institute, UK) presented her work with the Genome Reference Consortium on producing high-quality assemblies for different strains of mouse and zebrafish. Ideally, future work will integrate graph assemblies. Frank Nothaft (University of California, Berkeley, USA) described ADAM (https://github.com/bigdatagenomics/adam), a library for distributed computing on genomics data, and Toil, a workflow management system. These systems are about 3.5 times faster than standard Genome Analysis Toolkit (GATK) pipelines.

In a keynote lecture, Richard Durbin (Wellcome Trust Sanger Institute, UK) discussed genome reference assemblies and the pitfalls of using a single flat reference sequence. Genomicists use the reference genome for mapping sequencing reads, as a coordinate system for reporting and annotation, and as a framework for describing known variation. While the reference genome makes many analyses simpler, it biases these analyses toward what has been seen before. Durbin briefly discussed the advantages of the newest human reference assembly, GRCh38, which fixes many previous problems and includes alternate loci to capture complex genetic variation. But to work with this variation more effectively, Durbin said we need to switch from a flat reference to a “pan-genome” graph that includes much known variation [8]. To do this, we will need a new ecosystem of graph genome file formats and analysis software. Durbin discussed the work of the Global Alliance for Genomics and Health to evaluate proposed systems for working with graph genomes.
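As a conceptual sketch of the graph idea (not any actual graph-genome format such as GFA; the node sequences are invented), a graph reference stores alternative alleles as alternative nodes, and each haplotype becomes a path through the graph rather than a set of edits against one flat string:

```python
# Minimal sketch of a "pan-genome" graph: nodes hold sequence fragments,
# edges connect compatible neighbors, and nodes 2 and 3 represent the
# two alleles of a SNP site.
graph = {
    "nodes": {1: "ACGT", 2: "A", 3: "G", 4: "TTC"},
    "edges": [(1, 2), (1, 3), (2, 4), (3, 4)],
}

def path_sequence(graph, path):
    """Spell out the genome sequence along one path of node IDs."""
    return "".join(graph["nodes"][n] for n in path)

# The two alleles are simply two paths through the shared structure:
print(path_sequence(graph, [1, 2, 4]))  # ACGTATTC
print(path_sequence(graph, [1, 3, 4]))  # ACGTGTTC
```

Because both alleles are first-class parts of the reference, read mapping against such a graph is not biased toward one of them, which is the problem Durbin identified with a flat reference.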

A few speakers presented on curating data from the literature. Alex Bateman (European Bioinformatics Institute, UK) analyzed the feasibility of curating data on biomolecules from the literature. He determined that, despite a vast increase in the volume of biomedical literature, most of it does not need to be examined by curators. Benjamin Ainscough (Washington University, St Louis, USA) described DoCM (http://docm.genome.wustl.edu/), a database of known mutations in cancer. DoCM contains approximately 1000 mutations in 132 cell lines.

Transcriptomics, alternative splicing, and gene prediction

Speakers discussed several aspects of analyzing transcriptomic datasets. Hagen Tilgner (Weill Cornell Medicine, USA) described the use of long read technology to discover novel splice isoforms and long non-coding RNAs (lncRNAs) in the human transcriptome. Simon Hardwick (Garvan Institute of Medical Research, Australia) presented a set of spike-in standards for RNA-seq, called Sequins (http://www.sequin.xyz/). These standards act as a ground truth for measuring the accuracy and precision of transcriptome sequencing. Pall Melsted (University of Iceland, Iceland) presented Pizzly, a new tool that detects gene fusions, which often occur in cancer, from transcriptome data approximately 100 times faster than established methods. Annalaura Vacca (University of Edinburgh, UK) presented a meta-analysis of FANTOM5 cap analysis gene expression (CAGE) time-course expression datasets. Using these data, she identified both known immediate early genes and candidate novel immediate early genes.

Comparative, evolutionary, and metagenomics

Several speakers presented analyses of metagenomics datasets. Owen White (University of Maryland, USA) presented an update on the Human Microbiome Project, which ties together metagenomics data with phenotype data on host individuals. Curtis Huttenhower (Harvard University, USA) described using HUMAnN2 (http://huttenhower.sph.harvard.edu/humann2) to process metagenomic and metatranscriptomic data from the Human Microbiome Project (http://hmpdacc.org/).

A few speakers discussed comparative genomics and evolutionary approaches. James Havrilla (University of Utah, USA) presented a statistical model to identify differential constraint across domains within a protein. Sonja Dunemann (University of Calgary, Canada) described the caution necessary before claiming horizontal gene transfer. David Curran (University of Calgary, Canada) presented work on Figmop [13], a profile hidden Markov model method that identifies orthologs not detectable with the popular Basic Local Alignment Search Tool (BLAST).
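The underlying contrast can be illustrated with a simplified sketch: this is a position weight matrix rather than a full profile HMM (so it is not Figmop's actual model, and the counts are invented), but it shows the key idea that profile methods score a candidate against per-position preferences learned from a whole family, whereas BLAST aligns it against a single query sequence.

```python
import math

# Invented counts of bases observed at three motif positions across an
# aligned family; the family consensus here is "ACG".
counts = [
    {"A": 8, "C": 1, "G": 1, "T": 0},
    {"A": 0, "C": 9, "G": 0, "T": 1},
    {"A": 1, "C": 1, "G": 8, "T": 0},
]
BACKGROUND = 0.25  # uniform background base frequency

def log_odds(seq, counts, pseudo=1.0):
    """Sum per-position log-odds of seq under the profile vs background."""
    score = 0.0
    for base, col in zip(seq, counts):
        total = sum(col.values()) + 4 * pseudo  # pseudocounts avoid log(0)
        p = (col[base] + pseudo) / total
        score += math.log2(p / BACKGROUND)
    return score

print(log_odds("ACG", counts))  # consensus-like sequence: positive score
print(log_odds("TTT", counts))  # diverged sequence: negative score
```

A profile HMM extends this idea with insert and delete states, which is what lets it recognize divergent orthologs that fall below BLAST's pairwise similarity threshold.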

Several speakers described analyses of genetic traits in population-level datasets. Sriram Sankararaman (University of California, Los Angeles, USA) presented an analysis of human admixture with Neanderthal and Denisovan populations [14]. Alicia Martin (Massachusetts General Hospital, USA) presented work using the Sequencing Initiative Suomi (SISu, http://sisuproject.fi/) data to understand recent population history and migration in Finnish populations. Moran Gershoni (Weizmann Institute of Science, Israel) described sex-differentially expressed genes in common tissues, using Genotype-Tissue Expression (GTEx) [15] data. He identified 244 X-linked sex-differentially expressed genes, 16 of which are differentially expressed in multiple tissues.

Conclusion

The presentations described above were a major attraction of this conference. As at most conferences, of course, the ability to interact with other attendees provided another major benefit. Increasingly, these benefits accrue not just to the hundreds of in-person attendees but to thousands of scientists elsewhere. The meeting had an “open by default” policy that encouraged wide discussion of presentations on Twitter and elsewhere. By following the meeting via Twitter, reading preprints on bioRxiv, examining software on GitHub and Bitbucket, and viewing slide decks posted on the internet, many engaged with the advances presented in Hinxton without leaving their homes. Even those at the meeting enjoyed an enhanced ability to discuss new work both during and after talks. And those who participated on Twitter found new colleagues with whom to interact and collaborate long after the meeting ended.

While one can follow Genome Informatics from thousands of miles away, we cannot deny the importance of the meeting itself as a locus for bringing together new research and engaged researchers. Although results are now immediately available to all, there is no substitute for attending in person, which is also the only way to present work at the meeting. And it was the thematically balanced and high-quality program that attracted so much discussion in the first place. We hope this tradition of an interesting and excellent scientific program continues, and we look forward to Genome Informatics 2017.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.