
Last April, Google Research followed up on a topic covered at the Google Cloud Platform (GCP) Next conference in San Francisco, CA. The Broad Institute of MIT and Harvard announced they'd fully migrated their pipeline to GCP. Dr. Stacey Gabriel, Director of the Genomics Platform at the Broad Institute, detailed the scale of their genomics pipeline, adding to previous coverage by Kris Cibulskis of the Broad Institute during the conference.

Broad manages one of the largest genome sequencing centers in the world and has historically thought of itself as a hub for data generation, but now plans to expand into offering gene sequencing and data as a service. To give the bioinformatics, data science, and software engineering communities a sense of their data volume and growth rate, Broad noted that their

DNA sequencers produce more than 20 Terabytes (TB) of genomic data per day, and they run 365 days a year…. the output increased more than two-fold last year, and nearly two-fold the previous year.
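Taken at face value, those figures put Broad's raw output above seven petabytes per year. The back-of-envelope Python sketch below projects volumes from the quoted numbers; the sustained two-fold growth factor and the three-year horizon are illustrative assumptions, not Broad's projections.

```python
# Back-of-envelope projection of sequencing output from the figures quoted
# above: 20 TB/day, 365 days/year, roughly two-fold year-over-year growth.
# Sustained growth at that rate is an assumption for illustration only.

TB_PER_DAY = 20
DAYS_PER_YEAR = 365
ANNUAL_GROWTH = 2.0  # "nearly two-fold" per the quote

daily = TB_PER_DAY
for year in range(1, 4):
    annual_tb = daily * DAYS_PER_YEAR
    print(f"Year {year}: {daily:.0f} TB/day ≈ {annual_tb / 1024:.1f} PB/year")
    daily *= ANNUAL_GROWTH
```

At the quoted rate, year one alone works out to roughly 7.1 PB, before any growth is applied.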

Broad uses Nearline as a cost-effective medium for storing infrequently accessed DNA sequence segments, saving a reported $1.5M, or 50%, over their pre-GCP storage and access architecture (illustrated below). They also noted that their Whole Genome Sequencing Pipeline has been fully ported to GCP, and that features like preemptible VMs cut costs associated with idle CPU time. As part of plans to fully migrate to cloud services, Broad noted they're

migrating each of our own pipelines to the cloud to meet our own needs… and plan to make them available to the greater genomics community through a Software-as-a-Service model.
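To make the storage-class tradeoff concrete, the sketch below compares storage-only costs for Standard versus Nearline buckets. The per-GB prices and the 5,000 TB volume are hypothetical placeholders, not Google's published rates or Broad's actual footprint, and the reported 50% figure also reflects access and retrieval patterns this sketch ignores.

```python
# Rough illustration of the Standard-vs-Nearline tradeoff behind the
# reported savings. Prices and volume below are assumed placeholders,
# not actual Google Cloud Storage rates or Broad's real data footprint.

STANDARD_PRICE = 0.026  # assumed $/GB-month for Standard storage
NEARLINE_PRICE = 0.010  # assumed $/GB-month for Nearline storage

def monthly_cost(tb_stored: float, price_per_gb: float) -> float:
    """Storage-only monthly cost, ignoring retrieval and operation fees."""
    return tb_stored * 1024 * price_per_gb

tb = 5000  # hypothetical volume of infrequently accessed sequence segments
standard = monthly_cost(tb, STANDARD_PRICE)
nearline = monthly_cost(tb, NEARLINE_PRICE)
print(f"Standard: ${standard:,.0f}/mo  Nearline: ${nearline:,.0f}/mo  "
      f"storage-only savings: {1 - nearline / standard:.0%}")
```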

The scalability and data-access infrastructure Broad built on GCP has been open-sourced as the FireCloud platform. Cost savings and runtime optimizations are driven by the Genome Analysis Toolkit (GATK) and by the relative cost of the various steps in the pipeline. Broad noted that they parallelized computationally intensive steps, such as aligning DNA sequences against a reference genome, to reduce overall wall-clock runtime.
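The scatter-gather pattern behind that kind of parallelization can be sketched in a few lines of Python. This is an illustration of the general approach, not FireCloud or GATK code; align_shard is a hypothetical stand-in for invoking a real aligner on one slice of the input.

```python
# Illustrative scatter-gather sketch of parallel alignment: shard the
# reads, align each shard against the reference concurrently, then merge.
# align_shard is a hypothetical placeholder, not part of GATK/FireCloud.

from concurrent.futures import ProcessPoolExecutor
from typing import List

def align_shard(shard_id: int) -> str:
    # Placeholder for aligning one shard of reads against the reference
    # genome; in practice this would shell out to an aligner such as BWA.
    return f"aligned-shard-{shard_id}.bam"

def run_pipeline(num_shards: int = 8) -> List[str]:
    # Scatter: process shards concurrently so wall-clock time approaches
    # the cost of the slowest shard rather than the sum of all shards.
    with ProcessPoolExecutor() as pool:
        bams = list(pool.map(align_shard, range(num_shards)))
    # Gather: a real pipeline would merge the per-shard BAM files here.
    return bams

if __name__ == "__main__":
    print(run_pipeline())
```

In a production pipeline each shard would typically correspond to an interval of the reference genome, with the per-shard outputs merged and deduplicated downstream.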