Illumina – HPCwire
Since 1987 - Covering the Fastest Computers in the World and the People Who Run Them

The Revolution in the Lab is Overwhelming IT
https://www.hpcwire.com/2015/10/05/the-revolution-in-the-lab-is-overwhelming-it/
Mon, 05 Oct 2015

Sifting through the vast treasure trove of data spilling from modern life science instruments is perhaps the defining challenge for biomedical research today. NIH, for example, generates about 1.5PB of data a month, and that excludes NIH-funded external research. Not only have DNA sequencers become extraordinarily powerful, but they have also proliferated in size and type, from big workhorse instruments like the Illumina HiSeq X Ten, down to reliable bench-top models (MiSeq) suitable for small labs, and even USB-stick-sized devices, now in advanced development, that plug directly into a USB port.

“The flood of sequence data, human and non-human that may impact human health, is certainly growing and in need of being integrated, mined, and understood. Further, there are emerging technologies in imaging and high resolution structure studies that will be generating a huge amount of data that will need to be analyzed, integrated, and understood,”[i] said Jack Collins, Director of the Advanced Biomedical Computing Center at the Frederick National Laboratory for Cancer Research, NCI.

Here are just a few of the many feeder streams to the data deluge:

DNA Sequencers. An Illumina (NASDAQ: ILMN) top-of-the-line HiSeq X Ten can sequence a full human genome in just 18 hours (generating about 3TB of data) and deliver 18,000 genomes in a year. The file size for a single whole-genome sample may exceed 75GB.

Live cell imaging. High-throughput imaging, in which robots screen hundreds of millions of compounds on live cells, typically generates tens of terabytes weekly.

Confocal imaging. Scanning hundreds of tissue sections, sometimes with many scans per section, each with 20-40 layers and multiple fluorescent channels, can produce on the order of 10TB weekly.

Structural Data. Advanced investigation into form and structure is generating huge and diverse datasets drawn from many sources.

Broadly, the flood of data from the various LS instruments stresses virtually every part of most research computing environments (CPU, network, storage, and system and application software). Indeed, important research and clinical work can be delayed or never attempted because, although generating the data is feasible, the time required to perform the data analysis can be impractical. Faced with these situations, research organizations are forced to retool their IT infrastructure.

“Bench science is changing month to month while IT infrastructure is refreshed every 2-7 years. Right now IT is not part of the conversation [with life scientists] and running to catch up,” noted Ari Berman, GM of Government Services at the BioTeam consulting firm and a member of the Tabor EnterpriseHPC Conference Advisory Board.

The sheer volume of data is only one aspect of the problem. Diversity in files and data types further complicates efforts to build the “right” infrastructure. Berman noted in a recent presentation that life sciences generates massive text files, massive binary files, large directories (many millions of files), large files of ~600GB, and very many small files of ~30KB or less. Workflows vary just as widely: sequence alignment and variant calling offer one set of challenges; pathway simulation presents another; building 3D models, perhaps of the brain, and using them to guide detailed neurosurgery with real-time analytic feedback presents yet another.

“Data piles up faster than it ever has before. In fact, a single new sequencer can typically generate terabytes of data a day. And as a result, an organization or lab with multiple sequencers is capable of producing petabytes of data in a year. The data from the sequencers must be analyzed and visualized using third-party tools. And then it must be managed over time,” said Berman.

An excellent, though admittedly high-end, example of the growing complexity of computational tools being contemplated and developed in life science research is presented by the European Union Human Brain Project[ii] (HBP). Among its lofty goals are creation of six information and communications technology (ICT) platforms intended to enable “large-scale collaboration and data sharing, reconstruction of the brain at different biological scales, federated analysis of clinical data to map diseases of the brain, and development of brain-inspired computing systems.”

Among them is an HPC platform: the hardware and software needed to support the other platforms.

(Tellingly HBP organizers have recognized the limited computational expertise of many biomedical researchers and also plan to develop technical support and training programs for users of the platforms.)

There is broad agreement in the life sciences research community that there is no single best HPC infrastructure to handle the many LS use cases. The best approach is to build for the dominant use cases. Even here, said Berman, building HPC environments for LS is risky: “The challenge is to design systems today that can support unknown research requirements over many years.” And of course, this all must be accomplished in a cost-constrained environment.

“Some lab instruments know how to submit jobs to clusters. You need heterogeneous systems. Homogeneous clusters don’t work well in life sciences because of the varying use cases. Newer clusters are kind of a mix and match of things: we have fat nodes with tons of CPUs and thin nodes with really fast CPUs, [for example],” said Berman.

Just one genome, depending upon the type of sequencing and the coverage, can generate 100GB of data to manage. Capturing, analyzing, storing, and presenting the accumulating data requires a hybrid HPC infrastructure that blends traditional cluster computing with emerging tools such as iRODS (the Integrated Rule-Oriented Data System) and Hadoop. Unsurprisingly, the HPC infrastructure is always a work in progress.

Here’s a snapshot of two of the most common genomic analysis pipelines:

DNA Sequencing. DNA extracted from tissue samples is run through the high-throughput NGS instruments. These modern sequencers generate hundreds of millions of short DNA sequences (reads) for each sample, which must then be ‘assembled’ into proper order to determine the genome. Researchers use parallelized computational workflows to assemble the genome and perform quality control, fixing errors in the reassembled sequence.
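
The details differ from site to site, but many of these workflows follow a scatter-gather pattern: split the raw reads into shards, align or assemble each shard in parallel on the cluster, then merge the results for quality control. The sketch below is a minimal, hypothetical illustration of that pattern using the widely used open-source bwa and samtools tools for reference alignment; the reference name, shard count, thread counts, and file names are placeholders rather than any particular pipeline's configuration, and both tools are assumed to be installed.

```python
# Minimal scatter-gather alignment sketch (illustrative only).
# Assumes bwa and samtools are installed and that paired-end FASTQ shards
# (sample_R1.part0.fq, sample_R2.part0.fq, ...) already exist on disk.
import subprocess
from concurrent.futures import ProcessPoolExecutor

REFERENCE = "hg38.fa"   # pre-built bwa index assumed (bwa index hg38.fa)
SHARDS = range(4)       # number of read shards to align in parallel

def align_shard(i: int) -> str:
    """Align one shard of reads and produce a sorted, indexed BAM file."""
    sam = f"sample.part{i}.sam"
    bam = f"sample.part{i}.sorted.bam"
    with open(sam, "w") as out:
        subprocess.run(
            ["bwa", "mem", "-t", "4", REFERENCE,
             f"sample_R1.part{i}.fq", f"sample_R2.part{i}.fq"],
            stdout=out, check=True)
    subprocess.run(["samtools", "sort", "-o", bam, sam], check=True)
    subprocess.run(["samtools", "index", bam], check=True)
    return bam

if __name__ == "__main__":
    # Scatter: align shards concurrently (on one fat node or across cluster slots).
    with ProcessPoolExecutor() as pool:
        bams = list(pool.map(align_shard, SHARDS))
    # Gather: merge per-shard BAMs ahead of quality control and variant calling.
    subprocess.run(["samtools", "merge", "-f", "sample.merged.bam", *bams], check=True)
```

On a real cluster the per-shard step would typically be submitted through a job scheduler rather than a local process pool, but the shape of the workflow is the same.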

Variant Calling. DNA variations (SNPs, haplotypes, indels, etc.) for an individual are detected, often using large patient populations to help resolve ambiguities in the individual’s sequence data. Data may be organized into a hybrid solution that uses a relational database to store canonical variations, high-performance file systems to hold data, and a Hadoop-based approach for specialized data-intensive analysis. Links to public and private databases help researchers identify the impact of variations including, for example, whether variants have known associations with clinically relevant conditions.
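
To make the "hybrid" idea concrete, here is a toy sketch of just the relational piece, with SQLite standing in for the production database of canonical variants; the schema, positions, rsIDs, and annotations are invented for illustration. A real system would also keep the bulk call data (VCF/BAM files) on a high-performance file system and push heavy cohort-scale joins to a Hadoop-style framework.

```python
# Toy variant-annotation lookup against a relational store (illustrative only).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE canonical_variant (
        chrom TEXT, pos INTEGER, ref TEXT, alt TEXT,
        rsid TEXT, clinical_note TEXT,
        PRIMARY KEY (chrom, pos, ref, alt)
    )
""")
# Hypothetical canonical variants with (invented) clinical annotations.
conn.executemany(
    "INSERT INTO canonical_variant VALUES (?, ?, ?, ?, ?, ?)",
    [("chr1", 1234567, "A", "G", "rs0000001", "example: benign, well characterized"),
     ("chr2", 7654321, "C", "T", "rs0000002", "example: associated with a clinical condition")])

# Variants called for one individual (in practice parsed from a VCF file).
sample_calls = [("chr1", 1234567, "A", "G"), ("chr3", 1111111, "G", "A")]

for chrom, pos, ref, alt in sample_calls:
    row = conn.execute(
        "SELECT rsid, clinical_note FROM canonical_variant "
        "WHERE chrom=? AND pos=? AND ref=? AND alt=?",
        (chrom, pos, ref, alt)).fetchone()
    print(chrom, pos, f"{ref}>{alt}", "->", row if row else "no known annotation")
```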

The point is that life science research – and soon healthcare delivery – has been transformed by productivity leaps in the lab that now are creating immense computational challenges. (next Part 2: Storage Strategies)

Illumina Establishes $1,000 Genome
https://www.hpcwire.com/2014/01/15/illumina-establishes-1000-genome/
Wed, 15 Jan 2014

Genomics moves rapidly. Not long after the first human genome was sequenced in 2003 at a cost of $3 billion, the biotech industry set its sights on the $1,000 genome mark. Realizing this vision has been a primary goal of the sequencing community for nearly a decade, and now it seems that the $1,000 genome is finally within reach, thanks to a new high-end DNA supercomputer designed by Illumina for “factory scale” sequencing of human genomes.

Currently, it costs about $10,000 to sequence a human genome. San Diego-based Illumina says its HiSeq X Ten (pronounced “High Seek 10”) system can process 20,000 genomes per year at a cost of $1,000 each. The accomplishment rests on economies of scale and advanced features like faster chemistry and better optics that bring down costs.

Illumina CEO Jay Flatley announced the new sequencing machine, built specifically for human genomes, at the J.P. Morgan Healthcare Conference on Tuesday. The machine has been optimized to achieve high-throughput without sacrificing quality. The sequencer can identify DNA variants ten times faster than its predecessor, another Illumina model. And while there are faster machines out there, they don’t operate with the same quality standards, according to Flatley.

The CEO further explained that the system can work through roughly five human genomes’ worth of sequencing per day, although a complete run takes three days. In those three days, the machine can complete 16 high-quality human genomes.
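
Those figures are easy to sanity-check. The short, illustrative calculation below assumes an idealized ten-instrument HiSeq X Ten installation running back-to-back three-day runs with no downtime; it lands between the 18,000- and 20,000-genomes-per-year numbers quoted above.

```python
# Back-of-the-envelope throughput check for an idealized, fully loaded installation.
genomes_per_run = 16   # per instrument, per three-day run (per Flatley)
run_days = 3
instruments = 10       # a HiSeq X Ten installation comprises ten machines

runs_per_year = 365 / run_days
genomes_per_year = genomes_per_run * instruments * runs_per_year
print(f"{genomes_per_year:,.0f} genomes per year")  # ~19,500
```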

Proponents of personalized medicine who were waiting for the $1,000 genome to make customized medical testing and diagnostics possible may have to wait a little longer. The HiSeq X Ten System is intended for population-scale projects to further researchers’ understanding of human health. Also, the systems themselves aren’t cheap. They will be sold in sets of at least 10 machines at a base price of $10 million.

“With the HiSeq X Ten, we’re delivering the $1,000 genome, reshaping the economics and scale of human genome sequencing, and redefining the possibilities for population-level studies in shaping the future of healthcare,” says the CEO. “The ability to explore the human genome on this scale will bring the study of cancer and complex diseases to a new level. Breaking the ‘sound barrier’ of human genetics not only pushes us through a psychological milestone, it enables projects of unprecedented scale. We are excited to see what lies on the other side.”

At least three customers have already placed orders for the new sequencers. These include Macrogen, a next-gen sequencing service organization, based in Seoul, South Korea, with a laboratory in Rockville, Maryland; the well-regarded Broad Institute in Cambridge, Massachusetts; and the Garvan Institute of Medical Research in Sydney, Australia, also a leader in the biomedical research field. The systems are expected to ship in March.

Floating Genomics to the Cloud with AWS
https://www.hpcwire.com/2013/06/05/floating_genomics_to_the_cloud_with_aws/
Wed, 05 Jun 2013

As more institutions implement cloud strategies to supplement their existing HPC practices, it is worth considering the extent to which companies run HPC applications in the cloud and for which applications the cloud is particularly useful.

David Pellerin and Jafar Shameen, both of HPC Business Development at Amazon Web Services, gave a presentation at AWS Summit 2013 to discuss which industries and companies are using the cloud service to run HPC applications. Not surprisingly, the talk mostly centered on applications in genomics and the life sciences, as highlighted by a third speaker in Alex Dickinson, SVP of Cloud Genomics at Illumina.

“What you end up doing is building a cluster for the worst, nastiest problem you have,” said Pellerin on the risks and costs of building in-house HPC clusters. “You get this big, expensive cluster that for most of the workload, it doesn’t need to be there.” No company should know this better than Amazon, which became a cloud services provider in part because it had excess computing capacity that was fully used only at certain peak times.

Scientific disciplines such as genomics and high energy particle physics turn to cloud computing for certain HPC applications for a fairly basic reason: cloud computing is optimal for experimentation. For Pellerin, computing on AWS allows ‘the ability to fail fast.’ An in-house system is subject to job queue and scheduling limitations that generally prove both costly and time-consuming.

Again, ‘the ability to fail fast’ is an important one for a researcher looking to quickly test several hypotheses against a large dataset. This capability doesn’t exclusively help those in the sciences: financial services firms are running risk analytics on AWS, while engineering firms run CAD and CAE simulations for aerospace, according to Pellerin. However, terms like ‘risk analytics’ and ‘CAD simulations’ themselves imply an exploratory, experimental approach to computing, where the value of running multiple scenarios in a short amount of time is considerable.

The focus here, though, was on the life sciences and on genomics in particular. The advances over the last decade have turned genome sequencing from a problem of actually performing the procedure to one of storing the relevant data. As Dickinson explained, “When we ask our customers where do they spend their time…the actual time they spend sequencing is relatively small. What really kills them is the bioinformatics, which is comprised of a lot of computationally intensive processing and also now interpretation.”

Ten years ago, the Human Genome Project was completed after 13 years and a $4 billion investment. Today, that same process takes only about a day and a thousand dollars to complete.

As such, genomic sequencing has scaled faster than Moore’s Law over the last decade. This presents an obvious storage issue, especially when policy requires that the information be kept for several years.

Last week, we highlighted the work being done in BonFIRE to test angles of incidence that maximize the destruction of cancerous cells by radiation while harming as few healthy cells as possible. Illumina isn’t working on that problem exactly, but it is sequencing individual genomes to determine the causes of cancer. Dickinson argued that since everyone clearly has a different genome, and since tumor growth is sparked by a malfunction in the cells’ processing of genetic instructions, personalizing cancer treatment means sequencing and analyzing individual genomes.

“Our solution was to build something called BaseSpace,” Dickinson explained as he delved deeper into how Illumina works with AWS. “In the labs we connect the instruments to BaseSpace using standard internet connections. It turns out that even though they produce a lot of data, they do it at a relatively steady pace.”

Scientists like to keep the raw data for every genome that is sequenced, a commitment that requires approximately 120 GB per genome. One might expect a genome, which consists of about 3 billion bases, to require significantly more than 120 GB to capture. However, because humans are genetically quite similar to one another, with variation among individuals representing only about 0.1 percent of the genetic signature, the dataset can be pared down to that 120 GB level. Once that’s done, according to Dickinson, Illumina can comfortably transfer the data to AWS through BaseSpace at a rate of about 7 Mbps.
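
A rough calculation shows why that steady trickle is workable: streaming 120 GB at roughly 7 Mbps takes on the order of a day and a half per genome, comparable to the multi-day timescale of the runs that produce the data. The figures below come straight from the paragraph above and ignore protocol overhead and compression.

```python
# Illustrative transfer-time estimate for one genome's raw data.
genome_gb = 120   # raw data kept per genome
link_mbps = 7     # approximate sustained upload rate quoted by Dickinson

bits = genome_gb * 8 * 1000**3           # decimal gigabytes -> bits
seconds = bits / (link_mbps * 1000**2)   # megabits per second -> bits per second
print(f"~{seconds / 3600:.0f} hours per genome")  # roughly 38 hours
```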

Beyond storing genomes and running experimental analyses on them, cloud providers, and AWS in particular, hope to become facilitators of scientific collaboration. Today, the top method for sharing massive datasets is shipping physical hard drives through the mail, according to Dickinson. The hope is that the cloud will someday become the first choice for delivering massive datasets, such as those produced in genome sequencing, to other facilities, and Illumina is one of the life science companies pushing that paradigm.

Of course, there are more examples of institutions running HPC applications on AWS, as explained by Shameen. Among them is Pfizer, which uses Amazon Virtual Private Cloud to run pharmaceutical computational experiments in an extra-secure environment, according to Shameen. Globus Genomics, similar to Illumina, transfers its data to AWS, in this case through the Galaxy platform implemented on Amazon. Further, Shameen pointed to Harvard Medical School as an early adopter of AWS for excess and experimental HPC workloads.

As shown by Illumina, running experimental HPC applications in a cloud service like AWS is gaining traction, especially in the life sciences and genomics.