
On January 3rd, a new class of security flaw was reported that impacts most processors, including those used by Cloud Service Providers (CSPs) such as Amazon AWS and Microsoft Azure. The issue exploits speculative execution optimizations in processors as a side-channel attack that leaks kernel memory (Meltdown, CVE-2017-5754) or user memory (Spectre, CVE-2017-5715, CVE-2017-5753).

At this point, we have no evidence that this flaw has been exploited at DNAnexus.

Patching Process and Status

We are actively working to address this flaw while minimizing any interruption in the DNAnexus service. We are working with our CSPs and vendors to receive, test, and deploy patches efficiently and reliably. Once available, patches are rapidly deployed in our staging environment where automated functional and scalability tests are performed. When the patch is verified, it is deployed into our production environment without any expected downtime for the DNAnexus service.

On January 3rd, the CSPs patched their hypervisors to prevent this class of flaw from leaking information between their cloud virtual instances. This required a reboot of all DNAnexus servers, which was completed that same day.

We have been working with Canonical, the organization that supports the Ubuntu operating system used at DNAnexus. Canonical has released a Meltdown patch that we are in the process of testing, and we will deploy it in two phases. To ensure Meltdown cannot be exploited by a malicious DNAnexus user app, we will patch the worker fleet across all regions and clouds first, followed by all supporting systems. Once the patch has been verified, it will be deployed to the worker fleet within one hour, and all new worker instances will receive it. All currently executing jobs will be allowed to complete to minimize disruption. We will then initiate the patching process for our supporting systems, which is expected to take one week.

To address Spectre, given the nature of the flaw, we expect to receive multiple patches in the future. We will work closely with our vendors to ensure the patches are deployed quickly while maintaining our high quality of service.

Profiling the Impact on Compute Performance for Standard Genomics Tools

The patches developed to mitigate this security flaw may cause certain applications to run slower. This will impact all patched work, whether conducted in DNAnexus, on local machines, or in other cloud environments.

Typical guidance from non-genomics areas is a slowdown of 5-30%, depending on the domain. The degree of impact depends on the type of computational operations involved, and the only way to determine it reliably is empirically. To assess the impact, we ran the exact same analyses with several popular genomic tools on Meltdown-patched machines.

Our initial analysis indicates that most genomic analyses require around 5% more compute with the Meltdown patch, with an observed range of 5%-10%. We expect this to generalize to the most common types of genomic analysis. Fortunately, this suggests genomic workflows are less impacted than some other reported areas.
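As a concrete illustration of what "X% more compute" means in practice, the comparison reduces to simple arithmetic on wall-clock runtimes before and after patching. The sketch below is illustrative only; the tool names and timings are hypothetical, not our measured results:

```python
# Sketch: estimate per-tool slowdown from pre- and post-patch runtimes.
# Tool names and timings below are illustrative, not measured results.

def percent_slowdown(baseline_seconds, patched_seconds):
    """Relative slowdown of the patched run versus the baseline run."""
    return 100.0 * (patched_seconds - baseline_seconds) / baseline_seconds

# Hypothetical wall-clock times (seconds) for the same analysis on
# unpatched vs. Meltdown-patched machines.
runs = {
    "aligner": (3600, 3790),
    "variant_caller": (5400, 5680),
}

for tool, (before, after) in runs.items():
    print(f"{tool}: {percent_slowdown(before, after):.1f}% slower")
```

Running the identical analysis on both configurations, rather than relying on published figures from other domains, is what lets the slowdown be attributed to the patch itself.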

At DNAnexus, we are excited about supporting open, portable, and reproducible ways to share not only these new best practices workflows, but also general bioinformatics workflows written in WDL. As such, to execute explicit GATK4 workflow definitions written in WDL and maintained by the Broad, we use a new utility we developed called dxWDL. With this tool, a GATK WDL workflow can be used just like any other workflow on the platform, with all of the additional benefits our platform provides (e.g., provenance tracking, reproducibility, organization management, project collaboration, and security). As we did with GATK3, we are in the process of optimizing the performance of GATK4 on our platform, and future posts will go into more detail about how it performs in terms of efficiency and accuracy. In the interim, we are pleased to announce the launch of the DNAnexus GATK4 Pilot Program, to be offered to a limited number of interested users, with broader access to the tool in the coming months. To request early access to GATK4 on DNAnexus, please sign up here.

After execution, the timeline of tasks for the Broad’s more complex production germline variant calling workflow can be easily visualized:

Using dxWDL for GATK4 marks a change in how we will be executing these and other workflows written to be portable across platforms. In contrast to our previous approach of maintaining our own GATK applications, we will directly support open and portable languages such as WDL and CWL. Portability through languages such as WDL not only enables research in our field to be better critiqued and improved upon, but also significantly reduces friction when communicating method details to collaborators and regulatory agencies. Oftentimes, while the details of a specific method in a workflow are unchanged, subtleties in the workflow definition do change, leading to reproducibility challenges, such as the changes we needed to make in the production pipeline described above. With our adoption of open workflow languages like WDL, we will more easily share these workflow-level differences with the community and work with one another towards a single representation that runs portably across a variety of execution platforms.

DNAnexus is proud to be one of the first genome informatics platforms to support WDL. As a member of the core team governing future developments in WDL, we look forward to continuing our work with the Broad and the broader community so that the best practices WDL workflows can be run as efficiently and portably as possible.

We’re excited by this new method and are making it available to our customers on the DNAnexus Platform. We’ve done an evaluation of DeepVariant to assess its performance relative to other variant calling solutions. In this post, we will present that evaluation as well as a brief discussion of deep learning and the mechanics of DeepVariant.

We are pleased to announce the launch of the DeepVariant Pilot Program, to be offered to a limited number of interested users, with broader access to the tool in the coming months. To request access to DeepVariant on DNAnexus, please sign up here.

What is Deep Learning?

Recent advancements in computing power and data scale have allowed complex, multi-layer – or “deep” – neural networks to demonstrate that their “learning plateau” is significantly higher than that of the other statistical methods that had previously supplanted them.

Generally, deep learning networks are fed relatively raw data. Early layers in the network learn “coarse” features on their own (for example, edge detection in vision). Later layers contain abstract/higher-level information. The ability of these deep networks to perform well is highly dependent on the architecture of the neural network – only certain configurations allow information to combine in ways that build meaning.

DeepVariant applies the Inception TensorFlow framework, which was originally developed to perform image classification. DeepVariant converts a BAM into images similar to genome browser snapshots and then classifies the positions as variant or non-variant. Conceptually, it uses the idea that if a person can leverage a genome browser to determine if a call is real, a sufficiently smart framework should be able to make the same determination.

The first part is to make examples that represent candidate sites. This involves finding all of the positions that have even a small chance of being variants with a very sensitive caller. In addition, DeepVariant performs a type of local reassembly which serves as a more thorough version of indel realignment. Finally, multi-dimensional pileup images are produced for the image classifier.
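As a rough illustration of what a pileup “image” is, the sketch below encodes reads over a reference window as a small integer matrix. This is a deliberately simplified, hypothetical encoding; DeepVariant’s actual tensors carry multiple channels (base identity, base quality, strand, and more):

```python
# Simplified sketch of a pileup "image": one row per read, one column per
# reference position, with bases encoded as small integers. The real
# DeepVariant encoding uses multiple channels (base, quality, strand, ...).

BASE_CODE = {"A": 1, "C": 2, "G": 3, "T": 4, "-": 0}  # 0 = no coverage

def pileup_matrix(reads, window_start, window_size):
    """Encode aligned reads over a reference window as a 2-D integer matrix.

    reads: list of (alignment_start, sequence) tuples (0-based starts).
    """
    matrix = []
    for start, seq in reads:
        row = [0] * window_size
        for i, base in enumerate(seq):
            col = start + i - window_start
            if 0 <= col < window_size:
                row[col] = BASE_CODE.get(base, 0)
        matrix.append(row)
    return matrix

# Three hypothetical reads over a 6-bp window beginning at position 100.
reads = [(100, "ACGT"), (101, "CGTA"), (98, "TTAC")]
image = pileup_matrix(reads, window_start=100, window_size=6)
for row in image:
    print(row)
```

Stacking reads this way is what makes the genome-browser analogy literal: the candidate site becomes a picture an image classifier can inspect.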

The second part is to call variants using the TensorFlow framework. This passes the images through the Inception architecture that has been trained to recognize the signatures of SNP and Indel variant positions.
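The last step of classification can be pictured as turning the network’s scores for the three genotype classes (hom-ref, het, hom-alt) into probabilities. The toy sketch below shows only that final softmax step, with made-up scores; in DeepVariant the scores come from the trained Inception model:

```python
import math

# Toy sketch of the final classification step: convert network scores
# (logits) for the three genotype classes into probabilities via softmax
# and pick the most likely class. The logits here are made up; real ones
# come from the trained Inception CNN.

CLASSES = ["hom-ref", "het", "hom-alt"]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.5, 3.2, 0.1]  # hypothetical scores for one candidate site
probs = softmax(logits)
call = CLASSES[probs.index(max(probs))]
print(call, [f"{p:.3f}" for p in probs])
```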

Both components are computationally intensive. Because care was taken to plug into the TensorFlow framework for GPU acceleration, the call variants step can be accomplished much faster if a GPU machine is available. When Google’s specially designed TPU hardware becomes available, this step may become dramatically faster and cheaper.

The make examples component uses several more traditional approaches that are also computationally intensive but more difficult to accelerate. As efficiency gains from GPU or TPU improve call variants, the make examples step may limit the ultimate speed and cost. However, given the attractiveness of a fully deep learning approach, the genomics team at Google Brain would not have included these steps lightly; the team includes some of the pioneers of indel realignment and haplotype construction from the development of GATK (Mark DePristo and Ryan Poplin).

The Inception framework is a “heavy-weight” deep learning architecture, meaning it is computationally expensive to train and to apply. It should not be assumed that all problems in genomics will require the application of Inception. Currently in the field of deep learning, building customized architecture to solve a problem is challenging and time consuming – so the application of a proven architecture makes sense. In the long term, custom-built architectures for genomics may become more prevalent.

To understand how DeepVariant performs on real samples, we compared it against several other methods in diverse WGS settings. To quickly summarize, its accuracy represents a significant improvement over current state of the art across a diverse set of tests.

Assessments on our standard benchmark sets

At DNAnexus, we have a standard benchmarking set on HG001, HG002, and HG005, built from the Genome in a Bottle truth sets. We use it internally to assess methods and to make the best recommendations on tool selection and use for our customers. In each case, we assess on the confident regions for the respective genomes. The assessment is done via the same app as on PrecisionFDA, using hap.py from Illumina. In all cases, except where explicitly mentioned, the reads used represent 35X-coverage WGS samples achieved through random downsampling.
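For readers unfamiliar with this style of benchmarking, the headline metrics derive from true-positive, false-positive, and false-negative counts. A minimal sketch, with illustrative counts rather than actual hap.py output for any sample shown here:

```python
# Sketch: precision, recall, and F-measure as computed by variant-calling
# benchmarking tools, from TP/FP/FN counts against a truth set.

def precision_recall_f(tp, fp, fn):
    precision = tp / (tp + fp)   # fraction of calls that are correct
    recall = tp / (tp + fn)      # fraction of true variants found
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Illustrative counts, not actual hap.py output.
p, r, f = precision_recall_f(tp=3_000_000, fp=1_200, fn=1_200)
print(f"precision={p:.4f} recall={r:.4f} F={f:.4f}")
```

Because the F-measure is the harmonic mean of precision and recall, a single figure like 0.9996 summarizes both error directions at once.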

The following charts show the number of SNP and indel errors on several samples (lower numbers are better in these graphs). *Samtools is not shown in the indel plots due to its high indel error rate.

DeepVariant dramatically outperforms other methods in SNPs on this sample, with almost a 10-fold error reduction. SNP F-measure is 0.9996. For indels, DeepVariant is also the clear winner.

When DeepVariant is applied to a different human genome – the Ashkenazim HG002 set from Genome in a Bottle – its performance is similarly strong.

Assessments on Diverse Benchmark Sets

Following our standard benchmarks, we sought to determine whether we could identify samples where DeepVariant would perform poorly. With machine-learning models, there is some concern that they may over-fit to their training conditions.

Early Garvan HiSeqX runs – In 2014, the Garvan Institute made the first public release of a HiSeqX Genome available through DNAnexus. As occurs with new sequencers, the first runs from HiSeqX machines were generally of lower quality compared to runs produced after years of improvements to experience, reagents, and process. In 2016, Garvan produced a PCR-free HiSeqX run as a high-quality data set for the PrecisionFDA Consistency Challenge.

To better assess the performance of DeepVariant on samples of varying polish, we applied it and other open-source methods to each of these genomes.

In the 2014 Garvan HiSeqX run, DeepVariant retains a significant advantage in SNP calling. However, it performs worse in indel calling. Note that all callers had difficulty calling indels in this sample, with more than 100,000 errors for each caller.

Low-Coverage NovaSeq Samples

To further challenge DeepVariant, we applied the method to data from the new NovaSeq instrument. We used the NA12878-I30 run publicly available from BaseSpace. The NovaSeq instrument uses aggressive binning of base quality values, and its 2-color chemistry is a departure from the HiSeq2500 and HiSeqX. To make the test harder still, we downsampled the data from 35X coverage to 19X coverage.
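Random downsampling of this kind amounts to keeping each read independently with probability 19/35. Production pipelines do this directly on BAMs (e.g., samtools supports subsampling), but the idea can be sketched in a few lines; the read names below are hypothetical:

```python
import random

# Sketch: downsample a read set by keeping each read independently with
# probability target_coverage / source_coverage (here, 19X from 35X).
# Real pipelines apply the same idea directly to BAM files.

def downsample(reads, source_cov, target_cov, seed=42):
    rng = random.Random(seed)  # fixed seed for a reproducible subsample
    keep_p = target_cov / source_cov
    return [r for r in reads if rng.random() < keep_p]

reads = [f"read_{i}" for i in range(100_000)]
kept = downsample(reads, source_cov=35, target_cov=19)
print(f"kept {len(kept)} of {len(reads)} reads "
      f"(expected fraction {19/35:.3f})")
```

For paired-end data the keep/drop decision would be made per read pair rather than per read, so mates stay together.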

Even in a sample as exotic as low-coverage NovaSeq, DeepVariant outperforms other methods. At this point, DeepVariant has demonstrated superior performance (often by significant margins) across different human genomes, different machines and run qualities, as well as different coverages.

Other Samples

In addition to the benchmarks presented here, we also ran on: 35X NovaSeq data, the high-quality 2016 HiSeqX Garvan Sample, and our HG005 benchmark. In the interest of space, we will skip these charts here. Qualitatively, they are similar to the other graphs shown.

How Computationally Intensive is DeepVariant?

As previously discussed, DeepVariant’s superior accuracy comes at the price of computational intensity. When available, GPU (and someday TPU) machines may ease this burden, but it remains high.

The following charts capture the number of CPU hours to complete the HG001 sample running the pipeline without GPUs (lower numbers are better):

Fortunately, the DNAnexus Platform enables extensive parallelism across cloud resources at a much lower cost. By spreading the work over many machines, 830 core-hours can be completed in a few hours of wall-clock time. The DeepVariant Pilot Program is currently offered to a limited number of interested users, with broader access to the tool in the coming months. To request access to DeepVariant on DNAnexus, please sign up here.
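The trade of core-hours for wall-clock time is simple arithmetic, sketched below with hypothetical machine sizes (real scaling also pays scheduling overhead and has non-parallelizable steps, so these are best-case figures):

```python
# Rough arithmetic: spreading a fixed core-hour budget across many
# machines shrinks wall-clock time. Ignores scheduling overhead and
# any non-parallelizable pipeline steps, so treat as a lower bound.

def wall_clock_hours(core_hours, machines, cores_per_machine):
    return core_hours / (machines * cores_per_machine)

total = 830  # core-hours for the CPU-only HG001 run described above
for machines in (1, 10, 25):
    h = wall_clock_hours(total, machines, cores_per_machine=16)
    print(f"{machines:>3} x 16-core machines -> {h:.1f} h wall clock")
```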

In Conclusion

Experts have spent a decade refining approaches to SNP and indel calling in NGS data. Through thoughtful application of a general deep learning framework, the authors of DeepVariant have managed to exceed the accuracy of traditional methods in only a few years’ time.

The true power of DeepVariant lies not in its ability to accurately call variants – the field is mature with solutions to do so. The true power is as a demonstration that with similar thoughtfulness, and some luck, we could rapidly achieve decades of similar progress in fields where the bioinformatics community is just beginning to focus effort.

We look forward to working with the field in this process, and hope to get the chance to collaborate with many of you along the way.


About DNAnexus

DNAnexus provides a global network for sharing and management of genomic data and tools to accelerate genomic medicine. The DNAnexus cloud-based platform is optimized to address the challenges of security, scalability, and collaboration, for organizations that are pursuing genomic-based approaches to health, in the clinic and in the research lab.

The DNAnexus team is made up of experts in computational biology and cloud computing who work with organizations to tackle some of the most exciting opportunities in human health, making it easier—and in many cases feasible—to work with genomic data. With DNAnexus, organizations can stay a step ahead in leveraging genomics to achieve their goals. The future of human health is in genomics. DNAnexus brings it all together.