DNAnexus to Host Short Read Archive (SRA) Database in Google Cloud

By Kevin Davies

October 12, 2011 | MONTREAL: DNAnexus, a San Francisco-based software company offering web-based tools to manage next-generation genome sequence (NGS) data, will offer free hosting and access to the Short Read Archive (SRA), the funding-challenged trove of NGS read data hosted by the National Center for Biotechnology Information (NCBI).

The news was announced at the International Congress of Human Genetics, taking place this week in Montreal.

Earlier this year, NCBI announced that it would be phasing out funding for the SRA. “When we read about NCBI phasing out support, we realized this would be a huge loss for the community, but a great opportunity for us to step up and preserve it,” said DNAnexus co-founder and CEO, Andreas Sundquist.

DNAnexus will host a copy of the SRA repository on Google’s Cloud Storage infrastructure. This new community resource was made publicly available today at sra.dnanexus.com.

The relationship grew in part from investment interest in DNAnexus from Google Ventures, which separately announced a funding deal (see below). Sundquist says the relationship shows Google’s commitment to help “democratize DNA data.”

Sundquist stresses that access to the SRA data will be free: there is no registration or fee structure. “We just want to make sure people can access this resource,” he says. “We think ultimately, there’s value in helping promote data exchange, preserving these data sets, and growing this space faster.”

“We’re doing this to help NCBI with their mission. We’ve been in close communication and they’re very supportive,” says Sundquist. He projects the SRA repository to grow tenfold each year, “and that wasn’t in the mission of NCBI… The worst-case scenario is the datasets disappear, can’t be downloaded, the 1000 Genomes dataset [wouldn’t] be there.”

“The SRA has been an invaluable resource to the research community,” commented Rick Myers, president and director of the HudsonAlpha Institute for Biotechnology in Huntsville, Alabama. “However, the ever increasing size of datasets being submitted and the need to easily integrate them into downstream analyses has tested the limits of its utility. I am very pleased to see private entities such as DNAnexus step in to keep this resource freely accessible and provide a more intuitive and user-friendly portal for searching and retrieving these important genomic datasets.”

“No-one thinks the Government should be providing access in perpetuity,” says Sundquist. “When the SRA was originally built, it was a different era, a different volume of data.” Given the explosion in NGS data, Sundquist expects to see the archive swell to hundreds, possibly thousands of times its present size in the years ahead. “[SRA] will be a tiny bit of data compared to five years from now. Think what it will be like when we’re sequencing millions or tens of millions of genomes!”

All the data in the public SRA will be hosted in Google Cloud Storage. “The DNAnexus SRA website is an example of a ‘big data’ initiative that benefits from rethinking the interface in a 100% web-enabled world,” says Eric Morse, head of business development, Google Cloud Storage. “Combining Google’s massively scalable data storage infrastructure with DNAnexus’ expertise in web-based interfaces, genomics data analysis, and visualization, researchers can quickly access the world’s genomic information from any web browser.”

Sundquist says DNAnexus has also cleaned up the SRA interface as “it’s been a little cumbersome to use.” Researchers can submit data directly to DNAnexus to host in the SRA. “There is no sign up required for anyone who wants to use SRA, but if you want to do analysis, we’ll provide unlimited access” to DNAnexus tools for a limited time.

Fresh Funds

DNAnexus also announced $15 million in new funding, led by Google Ventures – “the best ‘big data’ investor out there,” says Sundquist – and TPG Biotech. Since the company’s first round of just $1.5 million, it has grown to 25 employees, and Sundquist hopes to double the headcount in the next 9-12 months.

The Google tie-in is interesting, as most of the DNAnexus infrastructure is built on Amazon’s EC2 Cloud. “Now we’re working with both Amazon and Google on providing access to large genomic datasets,” notes Sundquist.

Sundquist also announced a significant cut in pricing for academic customers. “In some ways, the academic community is the key to driving this space forward. Because of the great response, we’ve slashed our prices substantially by half for academia, effective immediately.”

Sundquist says DNAnexus is “absolutely focused” on genome interpretation, recognizing a huge opportunity for growth. “For us, it’s not just about the ‘$1 million interpretation’ for one genome, you have to think about this interpretation and scale it up to thousands of genomes. That’s a whole different domain, a huge space that no-one has built anything around.”