DNAnexus to Mirror SRA Database in Google Cloud

Mirror site for NCBI sequence repository, which is facing funding challenges.

By Kevin Davies

November 15, 2011 | DNAnexus will offer free hosting and access to the Sequence Read Archive (SRA), the funding-challenged trove of next-generation sequencing (NGS) read data hosted by the National Center for Biotechnology Information (NCBI).

Earlier this year, NCBI announced that it would be phasing out funding for the SRA. “We realized this would be a huge loss for the community, but a great opportunity for us to step up and preserve it,” said DNAnexus co-founder and CEO Andreas Sundquist.

DNAnexus will host a publicly available copy of the SRA repository on Google’s Cloud Storage infrastructure (see sra.dnanexus.com).

According to NCBI’s Jim Ostell, NCBI remains the primary archive for SRA, but he welcomed the offer from a reliable commercial source to provide an alternative hosting environment. “We agreed with them that there was certainly a need for nice packaged tool sets for people working with high-throughput sequence,” said Ostell. “If anyone finds it useful, either to explore and analyze the public data, or to work on pre-release data of their own, then that’s good.”

However, Ostell stressed that DNAnexus is not taking over SRA. “They are not an archive, they don’t issue accession numbers, and are not part of any official NIH data publishing process… It’s been a strictly technical issue of transferring data, working with Google, and getting their platform in place.”

While central NIH funding for the SRA is ending this month, NCBI will still accept certain classes of SRA submissions that don’t necessarily involve massive amounts of data, but are important for the scientific record. Some individual NIH institutes have agreed to fund NCBI directly to keep SRA going for their studies, says Ostell.

“We’re doing this to help NCBI with their mission,” explained Sundquist. “Our hope is that the hosted version of the SRA will provide a complementary way for researchers to access these data.” He projects that the SRA repository will grow tenfold each year. “The SRA has done a tremendous service to the research community by capturing these data and we want to help preserve it.”

Sundquist says DNAnexus has cleaned up the SRA interface as “it’s been a little cumbersome to use.” Eventually, researchers might be able to submit data directly to DNAnexus to host in the SRA. “There is no sign-up required for anyone who wants to use SRA.”

“No one thinks the government should be providing access in perpetuity,” says Sundquist. “When the SRA was originally built, it was a different era, a different volume of data.” Sundquist expects the archive to swell to hundreds, possibly thousands, of times its present size in the years ahead. “[SRA] will be a tiny bit of data compared to five years from now. Think what it will be like when we’re sequencing millions or tens of millions of genomes!”

Google Backing

DNAnexus also announced it had received funding from Google Ventures (see below) in a $15-million round, a relationship that Sundquist says shows Google’s commitment to help “democratize DNA data.” A copy of all the data in the public SRA will be hosted in Google Cloud Storage. “The DNAnexus SRA website is an example of a ‘big data’ initiative that benefits from rethinking the interface in a 100% web-enabled world,” said Eric Morse, head of business development, Google Cloud Storage.

The Google tie-in is interesting, as most of the DNAnexus infrastructure is built on Amazon’s EC2 Cloud. “Now we’re working with both Amazon and Google on providing access to large genomic datasets,” notes Sundquist.

Sundquist also announced a significant cut in pricing for academic customers. “In some ways, the academic community is the key to driving this space forward. Because of the great response, we’ve slashed our prices substantially by half for academia, effective immediately.”

Sundquist says DNAnexus is “absolutely focused” on genome interpretation, recognizing a huge opportunity for growth. “For us, it’s not just about the ‘$1 million interpretation’ for one genome, you have to think about this interpretation and scale it up to thousands of genomes. That’s a whole different domain, a huge space that no one has built anything around.”

This article also appeared in the November-December 2011 issue of Bio-IT World magazine.