The case for cloud computing in genome informatics

community and cloud computing

Dawn Field
(2011-01-20 15:05) Centre for Ecology and Hydrology, UK

Great paper and love the idea of the genomic informatics world as an ecosystem. Yes,
it will certainly change and perhaps the biggest benefit will be more of a chance
to share expertise/tools/data. The cloud should help naturally support collaboration
and community-led projects. For example, great to see JCVI Cloud Bio-Linux cited
- we provide NEBC Bio-Linux* upon which the cloud image is built: http://nebc.nox.ac.uk/biolinux.html
Very happy to see our project/packages being put into the cloud by a third party.
This is the vision of the new ecosystem.

Response to Andreas Sundquist

I wish to thank Andreas Sundquist and colleagues for identifying a careless and significant
error in my calculations of network transfer time for a 100 gigabyte sequencing file.
I have repeated the calculation and confirm Sundquist's estimates.

Network transfer miscalculation

Andreas Sundquist
(2010-08-05 17:06) DNAnexus

Genome Biology: Letter to Editor

Dear Editor,

Lincoln Stein's excellent article on Cloud Computing in the latest issue of Genome
Biology is a timely and insightful analysis of the promise of cloud computing for
bioinformatics. As founders of a cloud-based DNA sequence analysis service, DNAnexus.com,
we wholeheartedly agree that cloud bioinformatics is here to stay, as it translates
cost-effectiveness and scalability into real-world time and resource savings for anyone
dealing with large genomics datasets.

A key parameter for the viability of the cloud model for bioinformatics, for academic
and commercial efforts alike, is whether standard networks are fast enough to support
the upload of the large data files produced by sequencing machines. Dr. Stein's calculations
suggest that network speeds are the major obstacle to widespread adoption. He states:

"For genomics, the biggest obstacle to moving to the cloud may well be network bandwidth.
A typical research institution will have network bandwidth of about a gigabit/second
(roughly 125 megabytes/second). On a good day this will support sustained transfer
rates of 5 to 10 megabytes/second across the internet. Transferring a 100 gigabyte
next-generation sequencing data file across such a link will take about a week in
the best case."

We were struck by these numbers. For us, uploading a 1 gigabyte file, which corresponds
to a typical single-lane fastq file from an Illumina GAIIx machine, takes little more
than a minute. That 100 gigabytes should take a week seemed inconsistent with our
first-hand experience. Indeed, upon further examination, we noticed that the calculations
are off by a factor of about 60. To stay with the quoted example, transferring a
100 gigabyte next-generation sequencing data file across a network that supports real
speeds of 10 megabytes per second will take 10,000 seconds, or just under 3 hours, which
is about 1/60 of a week (604,800 seconds).
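
The arithmetic behind these figures can be checked with a short Python sketch. It is an illustration only, not part of the original letter: it assumes decimal units (1 gigabyte = 1,000 megabytes) and a constant sustained rate with no protocol overhead, and the function names are hypothetical.

```python
# Assumptions (not from the letter): decimal units, 1 gigabyte = 1,000 megabytes,
# and a constant sustained transfer rate with no protocol overhead.
MB_PER_GB = 1_000
SECONDS_PER_HOUR = 3_600
SECONDS_PER_WEEK = 7 * 24 * SECONDS_PER_HOUR  # 604,800 seconds


def transfer_seconds(size_gb: float, rate_mb_per_s: float) -> float:
    """Return the seconds needed to move size_gb at rate_mb_per_s."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_gb * MB_PER_GB / rate_mb_per_s


def link_capacity_mb_per_s(gigabits_per_s: float) -> float:
    """Convert a nominal link speed in gigabits/second to megabytes/second."""
    return gigabits_per_s * 1_000 / 8


if __name__ == "__main__":
    print(f"1 gigabit/s link: {link_capacity_mb_per_s(1):.0f} MB/s nominal")
    # The quoted sustained rates of 5 and 10 megabytes/second.
    for rate in (5, 10):
        seconds = transfer_seconds(100, rate)
        print(
            f"100 GB at {rate} MB/s: {seconds:,.0f} s "
            f"= {seconds / SECONDS_PER_HOUR:.1f} h "
            f"= 1/{SECONDS_PER_WEEK / seconds:.0f} of a week"
        )
```

Under these assumptions, even the slower quoted rate of 5 megabytes per second moves 100 gigabytes in well under a day.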

In our experience, network speed is the first issue potential users mention, usually
in a skeptical way, when they learn about DNAnexus. We hope that our clarification,
in the context of Dr. Stein's article, will help to dissipate this skepticism by making
users realize that bandwidth is not an issue in most settings, where a modest number
of sequencing machines feed data over a standard network.