Crunch Time for the Arvados Project

By Aaron Krol

December 18, 2013 | Adam Berrey has an expansive vision for big data’s role in everyday medical care. Genetic sequencing and computer imaging techniques are edging closer and closer to routine clinical use, and he wants clinical centers to be prepared for the deluge of data coming their way. In the future, said Berrey in an interview with Bio-IT World, “if you run a hospital, if you run a major medical center, you will be running two core platforms. You’ll be running an electronic medical record system, which will be used to keep track of your patient trait data, reports, doctor notes, basic patient management and workflow. And then you’ll be running a biomedical big data platform. And that big data platform you’ll use to manage and store and analyze all of your sequencing data… You’ll also use it for imaging data that you want to analyze, and for sensor data.”

“Our vision,” he continues, “is that Arvados can be that infrastructure. It can become… the computational storage fabric for precision medicine.” Berrey is the CEO of Curoverse, a biomedical informatics company that is both an offshoot of, and the major contributor to, the Arvados data storage platform. Until today, Curoverse was known as Clinical Future – to the few who knew of the small, Boston-based company whose seven employees have been laboring on an ambitious project out of all proportion to their numbers. Today, however, Curoverse announced a new name and a first round of funding that has raised $1.5 million to expand the company and launch its first commercial products in 2014.

Although Curoverse is only a fledgling company, it has illustrious origins that helped attract this venture capital. Many of Curoverse’s founding members come from the Harvard Medical School lab of George Church, the renowned geneticist whose work has been foundational to both the next-generation sequencing techniques that dominate the genetics industry today, and the third-generation nanopore-based sequencing techniques that may succeed them. A centerpiece of Church’s lab is the Personal Genome Project (PGP), the ongoing effort to sequence the whole genomes of 100,000 volunteers and make their genetic codes publicly available. It was this effort that gave rise to the Arvados platform.

“The scale of that project is massive,” says Berrey, “and in order to make it a success, George realized that… you needed to build a software platform that could store those data, provide an API that informaticians and computational biologists could use to analyze those data, and make it possible to effectively share those data around the world.” Alexander Wait Zaranek, now scientific director of Curoverse, led the engineering of that platform at Harvard. The particular constraints of the PGP demanded that Arvados be not only a repository for massive amounts of data – 300 terabytes so far in the version used at the Church lab – but also a site for cross-institutional data sharing and a toolkit of analytical algorithms. Like many commercial platforms for storing and sharing genomic data (DNAnexus and Illumina’s BaseSpace come to mind), Arvados is cloud-based, allowing collaborators to access one another’s workflows remotely and take advantage of huge storage capacity and computational power even when working out of smaller facilities. However, Berrey sees two major distinctions that make Arvados unique among cloud platforms.

The first is its commitment to open access. Since the PGP’s ambitions include sharing its data with any researcher who wants to contribute to genomic knowledge, a proprietary platform would have been an unacceptable barrier between collaborators. Instead, Arvados is open source and freely available for download, and users are not required to host their data offsite. “You are never in a position where you are locked to a single vendor,” says Berrey. “The other cloud platform providers require you to move all of your data into their cloud. But with Arvados, you can run a private cloud in your own data center. So you can have private clouds on premise, and you can use public cloud resources, and you can do hybrid configurations. And that provides a great deal of flexibility.”

While individual open source tools are widely used by bioinformaticians, Curoverse will be one of the first companies to offer a core storage and analytics platform for genetic data on an open source model. Arvados is compatible with popular bioinformatics tools like GATK and BWA, and the company is actively encouraging other organizations to dig into the source code and develop new workflows. It’s a business model that Berrey sees as not only commercially viable, but also a spur to innovation in the industry as a whole. “It’s been so successful for infrastructure in the IT market,” he says, “and if you look at other industry segments, everybody is building on open source projects like Linux and OpenStack and Xen and Hadoop… It’s a model that has the potential to have a really big impact here, and the time is right for it to happen.”

Open Source, Open Science

A second standout feature of Arvados is its mechanism for delivering reproducible results. Arvados is divided into Keep, the storage layer where data sits in the cloud, and Crunch, the computational layer where analytics is performed. Crunch isn’t just a storehouse for analytical tools. It’s designed to keep fastidious track of every computational job performed in Arvados. Crunch individually labels every data file with a cryptographic hash to verify its origin. It also stores metadata with its files, retracing the complete history of workflows performed, how long they took, and what resources would be required to duplicate them. Says Berrey, “You can take any file in the Arvados system, and you can ask it, where did you come from?”

“Ideally,” he adds, “because Arvados is open source… when [users] do research, they can publish that research, and they can say, essentially, here’s a link that will take you to the Crunch scripts that were run, all the algorithms that were run, and the cryptographic hashes for all the data that were used.” Other researchers who want to verify the results, or modify the workflows, can easily look up the original data files and see exactly how they were manipulated, as long as the data files are publicly stored. This makes open science with Arvados remarkably simple: users don’t have to make any special effort to release their data or document the fine points of their computational methodology.
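To make the "where did you come from?" idea concrete, here is a minimal sketch of hash-labeled data with a recorded job history. It is an illustration only, not the Arvados API: the class and function names (ProvenanceStore, run_job, where_from) are hypothetical, the store lives in memory, and SHA-256 is an assumed choice of hash, since the article does not say which one Arvados uses.

```python
import hashlib
from dataclasses import dataclass, field


def content_hash(data: bytes) -> str:
    """Identify a blob by its content (SHA-256 is an assumption, not Arvados' documented choice)."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class ProvenanceStore:
    """Hypothetical in-memory store: blobs keyed by hash, plus the job that produced each one."""
    blobs: dict = field(default_factory=dict)
    origins: dict = field(default_factory=dict)

    def put(self, data: bytes) -> str:
        """Store a blob and return its hash."""
        key = content_hash(data)
        self.blobs[key] = data
        return key

    def run_job(self, script_name, func, input_keys):
        """Run func on the named inputs, store the output, and record how it was made."""
        output = func(*[self.blobs[k] for k in input_keys])
        out_key = content_hash(output)
        already_known = out_key in self.blobs
        self.put(output)
        # Record an origin only for new content, so lineage never loops back on itself.
        if not already_known:
            self.origins[out_key] = {"script": script_name, "inputs": list(input_keys)}
        return out_key

    def where_from(self, key):
        """Answer "where did you come from?" by walking back to the raw inputs."""
        record = self.origins.get(key)
        if record is None:
            return {"hash": key, "source": "raw input"}
        return {
            "hash": key,
            "script": record["script"],
            "inputs": [self.where_from(k) for k in record["inputs"]],
        }


store = ProvenanceStore()
reads = store.put(b"ACGTTGCA")
reversed_reads = store.run_job("reverse", lambda d: d[::-1], [reads])
lineage = store.where_from(reversed_reads)
```

In this sketch, publishing the hash of a result is enough for someone with access to the same store to look up the script name and input hashes behind it, which is the kind of verification Berrey describes.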

A beta version of Arvados is currently available for download by any organization that wants to run its own cloud on the platform. Alongside its first round of funding, Curoverse also unveiled a new website, www.curoverse.com, where interested parties can sign up for the private beta of a hosted version of Arvados. This service, for which Curoverse will provide both storage and computational capacity, will be the company’s first commercial product. Berrey anticipates that further products will follow in 2014, especially as the company’s new financing – provided by Hatteras Venture Partners, Point Judith Ventures, Common Angels, MassVentures, and Boston Global Ventures – will allow the hiring of new engineers.

For now, Arvados will mainly serve to foster research projects, like the PGP, that require the handling of extraordinary amounts of biomedical data. Says Berrey, “We see the first market as being the clinical research market… But we see a second market that’s coming on very quickly, which is diagnostic testing. And some of the same features in Arvados, like the reproducibility capabilities, are very valuable in diagnostic testing.” These extended applications, using patient data in real time to guide medical treatment, may be the ultimate future of Arvados, which, with its open access model and ability to quickly integrate third-party tools, could find it easier to build out to clinical use than similar proprietary platforms.

It’s a high ambition, but after cutting their teeth on the PGP, the founders of Curoverse are accustomed to ambitious aims. The coming year, as the first customers approach the platform, will be a key test of whether Arvados can fulfill that vision.