Behind the scenes of The Cancer Genome Atlas: part 2

Yesterday on the blog, we introduced you to some of the Broad researchers who built tools, teams, and resources to generate and analyze a massive flood of data, and to manage the analytical code behind it, for The Cancer Genome Atlas (TCGA). Today we give you a look at the system they built to manage data analysis for the project: Firehose.

In addition to analyzing data generated here, the Broad serves as a Genome Data Analysis Center that coordinates data generated by other research centers in the TCGA. The center is led by Gaddy Getz, director of Cancer Genome Computational Analysis, and Broad senior associate member Lynda Chin. Even during the TCGA’s first pilot project on glioblastoma, the Broad team recognized the need for a robust analysis management system to handle the flood of data and algorithms. The phrase “drinking from the firehose” aptly describes the scale of the challenge, so the researchers named their solution “Firehose.”

“There are 20 centers in the TCGA, each with its own role and perspective,” says Mike Noble, a software engineering manager in the Cancer Program who oversees Firehose. The system was built during the ovarian cancer project by a team led by primary software developer Douglas Voet. “This leads to a pretty significant challenge for coordination and data standards. So we wrestle with this on a daily basis.”

Firehose addresses several issues, one of which Mike calls “the Babel problem.” On a massive, multi-institute effort like the TCGA, scientists often have trouble coordinating. “Everyone’s not really speaking the same language with respect to data,” he says. Firehose “versions” the analytical code and the data, keeping snapshots of each as they evolve so that researchers can efficiently collaborate and reproduce experiments. “We need to be able to say on Thursday, we ran version X of the code on version Y of the data, and this is the result. Because that’s the hallmark of all science: reproducibility.”

Firehose incorporates another Broad-built software package, GenePattern, which shuttles data through analytical modules. Firehose tells GenePattern which analytical codes to run on which set of data, serving as a bookkeeping system for the data. As Mike explains, “Firehose drives GenePattern.” But it’s more complex than that. “Really, the Firehose pipeline is a metapipeline of pipelines.” Some of the modules within the system are themselves pipelines that have taken years to develop and that sometimes continue to be developed. One example is GISTIC, which looks for recurrent copy-number alterations likely to drive cancer. “You can imagine this is a really complicated problem to solve,” Mike says. “The codes themselves are evolving underneath your feet as you’re trying to run it.” Software engineers on Mike’s team work to keep the pipeline stable while computational biologists tinker with analysis codes.
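The “metapipeline of pipelines” idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not how Firehose or GenePattern is implemented: the `Pipeline` class and the three step functions are hypothetical, and the only point is that a pipeline can serve as a single step inside a larger one.

```python
from typing import Callable, Sequence

# A step takes a list of values and returns a new list of values.
Step = Callable[[list], list]


class Pipeline:
    """Hypothetical pipeline: runs its steps in order, each feeding the next."""

    def __init__(self, name: str, steps: Sequence[Step]) -> None:
        self.name = name
        self.steps = list(steps)

    def __call__(self, data: list) -> list:
        # Being callable lets a Pipeline act as a step in another Pipeline.
        for step in self.steps:
            data = step(data)
        return data


def drop_missing(values: list) -> list:
    """Remove entries that have no measurement."""
    return [v for v in values if v is not None]


def center(values: list) -> list:
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def keep_large(values: list) -> list:
    """Keep values whose magnitude exceeds 1 (an arbitrary toy threshold)."""
    return [v for v in values if abs(v) > 1]


# A module that is itself a pipeline...
toy_module = Pipeline("toy_module", [drop_missing, center])
# ...used as one step of a larger pipeline.
toy_metapipeline = Pipeline("toy_metapipeline", [toy_module, keep_large])

result = toy_metapipeline([1, None, 2, 6])
```

Since the outer pipeline only sees a callable, the inner module’s steps can keep changing without the outer definition changing, which is one way to picture engineers keeping the overall pipeline stable while the analysis codes evolve.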

Firehose has been operational for less than a year; it went live near the end of the ovarian project. Before then, TCGA scientists analyzed their data in a more ad hoc manner, storing it in local files and relying on their note-taking skills to keep track of data and code versions. Scientific results that took two to three years to discover and iteratively refine for the ovarian work can now be replicated within two to three days through Firehose. That speedup greatly accelerates the pace of research for the TCGA’s full phase, which targets more than a dozen cancer types. “Firehose has really evolved its role in the TCGA,” Mike says. “It’s become something people are starting to really rely on, because it does this stuff reasonably well.”

For Mike and others on the TCGA team, the monumental effort serves a noble goal: to better understand cancer and pave the way for new treatments. “Hopefully the point of all this is you take raw data and turn it into discovery,” he says.