Breadcrumbs

Computational genomics workflows are regularly used to not only rapidly accelerate R&D in the field of genomics, but they are also increasingly used to make clinical diagnoses tailored to individual genomes. As DNAnexus has grown to support a large network of industries and collaborators, we have noticed that these workflows are often developed and shared across users and organizations on a global scale.

DNAnexus workflows currently provide the core functionality of allowing users to create and execute a computational workflow within a DNAnexus project. However, for users and organizations collaborating on multiple private or public projects, these local workflows may be less suitable for long-term maintenance in the context of larger organizations and consortia. In recognition of the need to manage workflows with a truly global network of collaborators, we are excited to introduce an additional suite of features that can be applied to objects we call ‘global workflows’.

As with DNAnexus applications, global workflows are published to a global space accessible by authorized users across projects. Like a Github repository or Docker repository, global workflows are versioned and updated with a globally unique name. Global workflows can be tagged, associated with broad categories (e.g. ‘read mapping’, ‘germline variant calling’, ‘somatic variant calling’, ‘tumor-normal variant calling’), and defined to run across cloud regions and providers. They can also be developed by a specified set of users and subsequently published or released to a larger set of authorized users who can run but not modify the workflow. Together, these features empower workflow developers to better share and advertise their workflows to a broad set of users and organizations across multiple regions and clouds.

A user creates a global workflow in essentially the same way as a regular workflow (see this tutorial for more details on how to create a global workflow). In fact, existing workflows on the DNAnexus platform can be converted to global workflows in a straightforward way. Since workflows written in CWL or WDL can be directly converted into workflows on our platform, these workflows can also be easily converted to global workflows. As a result, portable workflows can also be imported to our platform and used in a way that meets organizational needs for access control and collaboration at scale.

To illustrate the use of global workflows, we have published a public workflow available to all users of our platform. For example, from our CLI, you can run:

Here, you can see that there is a GATK4 best practice pipeline available for you to use. You can treat this workflow name like any other global applications on the platform. Examples for how to use these features can be seen in more detail here.

Workflow release management features were built by the Developer Experience team at DNAnexus. Thanks to the DNAnexus Science team for contributions to the design of this feature. Please see our documentation for a tutorial on how to use these features and contact support@dnanexus.com if you have any feedback or questions.

In this blog we quantify Google Brain’s recent improvements to DeepVariant – detailing significant improvement in both exomes and PCR genomes. We reflect on how this improvement was achieved and what it suggests for deep learning within bioinformatics.

Subsequently, we realized that these difficult samples had a common feature – they were all prepared with PCR. Also, the initial release of DeepVariant had not been trained on exome data. Based on this, DNAnexus provided exome and Garvan Institute PCR WGS samples to Google Brain to train improved DeepVariant models, which are now released. We also recognize Brad Chapman, who independently collaborated with Google Brain.

Many factors make variant calling more complex in PCR+ samples and exomes. PCR amplification is biased by GC-content and other factors. Errors that accumulate in the PCR process are difficult to distinguish from true variants. Exomes have an additional complexity of uneven capture efficiency and coverage.

Knowledge of these facts is incorporated differently in human-programmed methods, traditional machine learning, and deep learning methods. Human-written methods require a programmer to carefully develop heuristics which capture the relevant properties without too much rigidity. Traditional machine learning approaches require a scientist to identify informative features that capture the information, encode those from the raw data, and train and apply models.

Deep learning based methods like DeepVariant attempt to represent the underlying data in as raw a manner as possible, relying on the deep learning models to learn the features. Here, the scientist’s role becomes to identify the examples that embody the diversity of conditions the tool must solve, and to organize how to present as many and varied of these examples to the training machinery as possible. How Google Brain made these improvements may be as interesting as the improvements themselves.

Exome-Trained Models Significantly Improve Performance

Google Brain’s initial release was DeepVariant v0.4. The v0.5 version added the ability to apply either a WGS model or an exome model. The v0.6 version specifically added PCR+ training for the first time.

On exomes, DeepVariant improved from approximate parity to a 2-fold error reduction relative to GATK. There are two ways to put this improvement in accuracy into context. DeepVariant has an absolute decrease of 231 errors and 121 false positives. For a callset of 500,000 exomes – as the UK BioBank will soon make available – the fact that false calls tend to distribute randomly while true variants are shared in the population could mean a reduction of 60 million false variants positions in the full callset (for reference, the 60,000 exome ExAC set contained 7 million variant positions).

Another way to frame the value is: does the improved accuracy allow similar performance as traditional methods at a lower coverage?Figure 2 demonstrates that DeepVariant on the same exome downsampled randomly to 50% coverage is more accurate than other methods at full coverage. This could allow a re-evaluation of required sequencing depth to either increase throughput, save cost, or refocus sequencing depth on complex but important regions.

Including PCR+ WGS Data Greatly Improves DeepVariant Performance

DeepVariant was not specifically trained on PCR-prepared WGS samples until the v0.6 release, when 5 PCR+ genomes were added to the training set. Even before this addition, DeepVariant was the most accurate SNP caller surveyed in our PCR+ data. However, as Figure 3A indicates, the inclusion of PCR training data improved SNP accuracy noticeably.

Indels in these samples were a problem for almost all of the callers. Indel error rates were so much higher in these samples that this is the dominant error mode across every method. The exception was Strelka2, which performed vastly better. Illumina clearly put great care into modeling PCR errors when developing Strelka2. Edico was also able to use this observation to improve their DRAGEN method, as detailed in our recent blog.

This observation was key to realizing that the performance penalty was due to PCR. When facing a problem with additional noise and complexity, it can be unclear whether the lack of signal makes the problem fundamentally less solvable or if the problem remains similarly tractable, but instead requires more effort. Strelka2’s performance was proof that current methods could improve significantly.

As Figure 3B shows, the inclusion of PCR+ data has a huge impact on the error rate of DeepVariant, with a 3-fold error reduction. Simply identifying the right training examples was sufficient to go from worse than GATK to 10% better than the prior leader, Strelka2.

How Do New Training Data Impact Other Applications?

By including PCR+ and exome data in training, DeepVariant is now being asked to encode more information within its networks. Asking methods to multi-task in this way can lead to interesting effects and can force trade-offs which decrease overall performance.

This could be understood in a few ways. First, harder problems may create pressure for network weights to identify subtle but meaningful connections in data. For example, a strong deep learning approach trained on novice chess opponents would only be so good. Another is that training examples that depend differentially on various aspects of a problem helps resist overfitting, allowing the model to discover and preserve subtle signals it could not otherwise. The result is a better generalist model.

There are many variables changed in the progression of DeepVariant from v0.4 to v0.6 – more overall training data has been added and tensors are changed – but if DeepVariant were to benefit from the complexity, one might hypothesize this would manifest in a more pronounced change on Indel performance, as that area is the most impacted in both the exome and PCR data. To test this, we ran each version on a “standard” benchmark – PCR-Free WGS HG002.

The new releases of DeepVariant show a trajectory of improvement, which is impressive given that DeepVariant led our benchmarks in accuracy even in its first release. SNP error rate is about 10% improved. The Indel error rate is improved more substantially, at a 40% reduction. Also, the greatest impact occurs with the addition of the PCR training data in v0.6. With this sort of uncontrolled experiment, we can’t conclusively say that this occurs because DeepVariant is learning something general about Indel errors from the PCR data. However, the prospect is enticing.

What Will Deep Learning Bring for the Field?

As deep learning methods begin to enter the domain of bioinformatics, it is natural to wonder how skills in the field will shift in response. Some may fear these methods will obsolete programming expertise or domain knowledge in genomics and bioinformatics. In theory, the raw methods to train DeepVariant on these new data types, confirm the improvement, and prevent regressions can be shockingly automatic.

However, to reach this point, domain experts had to understand the concepts of PCR and exome sequencing and identify their relevance to errors in the sequencing process. They had to understand where to get valid, labeled data of sufficient quantity and quality. They had to determine the best ways to represent the underlying data in a trainable format.

The importance of domain expertise – what is sequencing and what are its nuances – will only grow. How it manifests may shift from extracting features and hand-crafting weights to identifying examples that fully capture that nuance.

We also see these methods as enabling, not deprecating, programming expertise – these frameworks depend on large, complex training and evaluation infrastructures. The Google Brain team recently released Nucleus, a framework for training genomics models. Nucleus contains converters between genomics formats like BAM and VCF and TensorFlow. This may allow developers to tap into deep learning methods as specialized modules of a broader bioinformatics solution.

We hope that amongst the detail, this blog has communicated that deep learning is not a magic box. It remains essential for a scientist to carefully consider what the nuances of a problem are; how to rigorously evaluate performance; where blind spots in a method are; and which data will shine light on them.

Next week, DNAnexus will be at the Plant and Animal Genome conference (PAG) in San Diego (booth 431). As part of an ongoing effort to expand our visualization capabilities, we will present an open-source tool called Dot that helps scientists visualize genome-genome alignments through a rich, interactive dot plot.

In addition to its scientific contribution, Dot encourages community development of new visualization tools by providing a template that can be used for new visualization tools in other areas of bioinformatics. This would allow bioinformaticians to focus on the bioinformatics and visualization without needing to master web programming intricacies such as reading data from local and remote servers, which is all handled by Dot’s modular and reusable inner workings.

Importance of Dot Plots

Constructing a genome assembly is fundamental to studying the biology of a species. In recent years, advances in long-read sequencing and scaffolding technologies have led to unprecedented quality and quantity of genome assemblies. Better reference genomes contribute to better gene annotations, evolutionary understanding, and biotech opportunities.

Comparing new assemblies to existing genomes of related species is crucial to understanding differences between organisms across the tree of life. Genome assemblies are never perfect and always have to be evaluated critically by comparing against other assemblies or reference genomes, whether of the same or a closely related species. Comparative genomics is also how assemblies of two species’ genomes can be compared and contrasted to look for features that represent functional differences or inform the study of their evolution.

The classic method for visualizing genome-genome alignments is the dot plot, which provides an excellent overview of alignments from the perspective of both genomes. Dot plots place the reference genome on one axis and the query genome (that is aligned against the reference) on the other axis. Alignments between the two genomes are placed according to their coordinates on both genomes. Whereas genome browsers (such as IGV and the UCSC Genome Browser) plot data in one dimension on one genome, dot plots use two dimensions to show alignments in two genomes’ coordinates spaces simultaneously. This is necessary when representing large genome alignment data where the query coordinates matter just as much as the reference coordinates for a particular alignment.

However, dot plots have barely changed in the past decade and are still generated from the command-line as static images, limiting detailed investigation. We decided to tackle this problem as an open-source science project at DNAnexus.

Introducing Dot

Here we present Dot, an interactive dot plot viewer that allows genome scientists to visualize genome-genome alignments in order to evaluate new assemblies and perform exploratory comparative genomics.

Dot supports the output of MUMmer’s nucmer aligner the most commonly used software method for aligning genome assemblies. A quick script called DotPrep.py converts the delta file to a more streamlined coordinates file with an index that enables Dot to read in more alignments in certain regions on demand.

Interactivity and features

Dot adds a number of useful features on top of the classic dot plot concept. The index enables a quick plot of an overview that includes the longest 1000 alignments. From here, users can zoom in to look at particular regions and load all the alignments for regions of interest.

In addition to showing alignments, Dot allows scientists to load annotations for either or both genomes to show additional context (e.g. understanding how sequence differences map to gene differences). Annotation tracks are a common feature of one-dimensional genome browsers, but to translate this concept to the two-dimensional dot plot, we enable annotation tracks on both axes. This is a major benefit of Dot that makes it possible to compare gene annotations visually alongside the alignments of the DNA sequences.

Moreover, users can jump to the same region of the reference genome in the UCSC Genome Browser to quickly see additional context for a region of interest. This allows scientists to explore how known repetitive elements in the reference genome could potentially affect assembly quality in specific regions.

Details for developers

By leveraging D3 and canvas in JavaScript, Dot combines the benefits of interactivity with scalability, enabling scientists to explore large genomes. The UI on the right side panel is built using an open-source SuperUI.js [https://github.com/marianattestad/superui] plugin, and the input handling and basic page navigation is set up through a special VisToolTemplate [https://github.com/MariaNattestad/VisToolTemplate] plugin we developed to enable others to create new visualization tools more easily. We encourage developers to utilize and build on Dot and these open-source projects to create their own visualization tools. Dot is very modular and can be used as a template to build new visualization tools. The template handles complex and necessary components like reading input data files from various sources, thereby letting developers focus only on the visualization itself.

Dot is open source

Dot is free to use online at [https://dnanexus.github.io/dot/] and open source at [https://github.com/dnanexus/dot]. For DNAnexus users, there is a package available among the featured projects with (1) an applet for running MUMmer’s nucmer aligner that includes DotPrep.py, (2) a shortcut to Dot to send files from DNAnexus quickly, and (3) example data and results.

Categories

Subsidiary Sidebar

About DNAnexus

DNAnexus provides a global network for sharing and management of genomic data and tools to accelerate genomic medicine. The DNAnexus cloud-based platform is optimized to address the challenges of security, scalability, and collaboration, for organizations that are pursuing genomic-based approaches to health, in the clinic and in the research lab.

The DNAnexus team is made up of experts in computational biology and cloud computing who work with organizations to tackle some of the most exciting opportunities in human health, making it easier—and in many cases feasible—to work with genomic data. With DNAnexus, organizations can stay a step ahead in leveraging genomics to achieve their goals. The future of human health is in genomics. DNAnexus brings it all together.