DNA: forensic tool need, cloud use & data deluge

Friday 2nd December 2011

Courtesy: geneticsandsociety.org

The National Policing Improvement Agency (NPIA) is to launch a new UK initiative to help police forces save the up to £3m spent yearly analysing crime scene forensic samples. In Finland, CSC is boosting its biomedical performance with cloud computing, while in China and the US sequencing is deluged with so much data that retaining raw data records once the analytical findings are concluded is under threat.

The Agency is seeking private sector partners to help develop cutting-edge technology that will enable crime scene investigators to identify quickly whether forensic evidence contains human DNA. This will enable forces to decide whether to send the sample to an approved forensic laboratory to produce a DNA profile that can be searched against the National DNA Database. The aim is to have the new technology operational in spring 2012.

Currently, forces spend millions of pounds sending crime scene samples to laboratories for screening only to learn that no human DNA is present.

More details about the NPIA project will be unveiled at a supplier event on 15 December at the agency's CSI training centre at Harperley Hall, in Durham. The NPIA is urging interested companies to attend the event.

Simon Bramble, Head of Police Science and Forensics at the NPIA, said: "This represents a great opportunity for private sector expertise to be involved in developing a major technical innovation that will help the police service dramatically save time and money in analysing crime scene evidence.

"One of the most important aspects of any crime scene investigation is to determine whether human DNA is present in forensic evidence collected at crime scenes so that it can then be searched against the National DNA Database if needed. This can provide a crucial lead in a crime investigation.

"The challenge for would-be suppliers will be to produce easy-to-use, portable technology that can produce results in less than an hour."

Companies interested in attending the meeting on 15 December, or who would like more information about the project, should contact the NPIA project team at adapt@npia.pnn.police.uk or on 0203 113 7177.

BOOSTING PERFORMANCE WITH BIOMEDICAL CLOUD

Biomedical sciences have become data- and computationally intensive disciplines. In Finland, CSC is expanding its capacity to deliver cluster computing through a cloud interface to a total of 2880 cores dedicated to cloud computing pilots in the biomedical sector.

The concept behind the service is to encapsulate the complex software environments built by individual laboratories and run the entire environment seamlessly on a remote compute cluster at CSC. In practice, maintenance staff can add or remove remote virtual nodes in the local cluster, and these appear to end users just like physical nodes.

The new nodes are Intel Xeon-based scale-out servers with 10GE interconnect and up to 96GB of memory per node. The installation of the new capacity is part of the emerging Finnish node of the European Life Science Infrastructure for biological information (ELIXIR).

CSC’s biomedical cloud service pilot is facilitating next-generation biomedical data analysis needs with high-performance computing. “The expansion will help us to meet the resource demands of the existing pilots as well as enabling us to grow the user base. Furthermore, the technical design has been further developed in order to satisfy the users’ requirements and to bring us one step closer to a production cloud computing service,” says Danny Sternkopf, systems specialist at CSC.

The cloud pilot demonstrates a distributed infrastructure solution targeted at the biomedical community. Sampsa Hautaniemi, Academy Research Fellow, is one of the researchers at the University of Helsinki whose group is piloting the CSC cloud cluster.

“Modern biomedical research is data intensive, requiring large memory and storage resources in addition to pure computing power. The CSC cloud pilot is a significant leap in harnessing CSC’s resources to analyze massive amounts of medical data. It helps us to develop and analyze hundreds of next-generation sequencing samples, which would have been impossible without this pilot.”

Kristoffer Rapacki, Head of System Administration of the Center for Biological Sequence Analysis at the Technical University of Denmark, is a service provider who has participated in the service pilot.

“The emerging Danish ELIXIR node will provide infrastructure for integration and interoperability of computational tools in life sciences, assisting both the users and the providers of tools in making them easier to discover, use and combine. Specifically, many popular and/or computationally demanding tools are likely to need resources extending beyond the capacity of the original tool provider.

"Therefore, it is important that mechanisms for on-demand network access to shared pools of configurable computing resources are put in place. Our collaboration with CSC in Finland aims at providing a fast-to-implement and easy-to-manage solution to this problem and constitutes an important part of our tool integration effort.”

The emerging Finnish ELIXIR node is an integrated part of the national Biomedinfra consortium, alongside bio-banking and translational research. The Ministry of Education and Culture of Finland funds Biomedinfra (http://www.biomedinfra.fi/) via an Academy of Finland research infrastructure grant to participate in building the European research infrastructures for biomedical research.

GENOMIC DATA OUTSTRIPS ANALYSIS
BGI in China, the world’s largest genomics research institute, runs 167 DNA sequencers producing the equivalent of 2,000 human genomes daily, reports the New York Times. DNA sequencing is being deluged by so much data that genome, metagenomics and microbiome analysis projects are turning to cloud computing and sophisticated analytical software for data analysis solutions, abandoning the retention of raw data once analyses are complete.

BGI, formerly known as the Beijing Genomics Institute, actually has so much data that it often cannot transmit its results to clients or collaborators over the Internet or other communications lines because it would take weeks. Instead, it has to resort to sending computer disks containing the data via FedEx: “an analog solution in a digital age,” agreed BGI’s head of cloud computing, Sifei He.
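A rough back-of-the-envelope estimate illustrates why couriering disks wins. The 2,000-genomes-a-day output is from the article; the per-genome data volume (~100 GB of raw data) and the sustained link speed (100 Mbit/s) are assumptions for the sketch:

```python
# Transfer-time estimate: one day of sequencer output over the network.
# Assumed figures (not from the article): ~100 GB raw data per genome
# and a sustained 100 Mbit/s link; 2,000 genomes/day is BGI's output.

GENOME_BYTES = 100e9       # ~100 GB raw data per genome (assumption)
GENOMES_PER_DAY = 2000     # BGI's reported daily output
LINK_BITS_PER_S = 100e6    # 100 Mbit/s sustained throughput (assumption)

daily_bytes = GENOME_BYTES * GENOMES_PER_DAY
transfer_days = daily_bytes * 8 / LINK_BITS_PER_S / 86400

print(f"One day's output: {daily_bytes / 1e12:.0f} TB")
print(f"Time to send it over the link: {transfer_days:.0f} days")
```

Under these assumptions a single day's output would take roughly half a year to transmit, so a courier carrying disks is faster by orders of magnitude.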

DNA sequencing is becoming faster and cheaper, so the ability to determine DNA sequences is starting to outrun researchers' ability to store, transmit and, especially, to analyse the data.

“Data handling is now the bottleneck,” David Haussler, director of the center for biomolecular science and engineering at the University of California, Santa Cruz, is quoted as saying. “It costs more to analyze a genome than to sequence [it].”

The cost of determining a person’s complete DNA blueprint is expected to fall below $1,000 within two years, but that excludes the price of making sense of the data, which becomes an ever bigger part of the total cost as sequencing costs drop.

The data challenge does create new manufacturing opportunities, as with the UK's need for a forensic field tool. Demand for people trained in bioinformatics, and in the convergence of biology and computing, can be seen in the numerous bioinformatics companies (Eagle Genomics, Alere Technologies GmbH, SoftGenetics, DNAStar, DNAnexus and NextBio) offering software and services to help analyze data, not to overlook the data storage equipment business and new publishers for data-heavy life science papers.

The cost of sequencing a human genome — all three billion bases of DNA in a set of human chromosomes — plunged to $10,500 last July from $8.9m in July 2007, according to the National Human Genome Research Institute.

The corresponding data explosion has threatened the survival of a federal online archive of raw sequencing data, which has more than tripled just since the beginning of the year, reaching 300 trillion DNA bases and taking up nearly 700 trillion bytes of storage. Talk of closure did not come to pass, but certain big sequencing projects will have to pay to store data there.

Compounding the problem is the field of metagenomics: sequencing the DNA found in an environment, such as a soil sample or the human gut, to take a census of the microbial species present. The Human Microbiome Project, sequencing the microbial populations in the human digestive tract, has generated about a million times as much sequence data as a single human genome, says C. Titus Brown, a bioinformatics specialist at Michigan State University: “Doing a comprehensive analysis of it is essentially impossible at the moment.”

Researchers are increasingly turning to cloud computing so they do not have to buy so many of their own computers and disk drives, and are trying to apply Google-style techniques to genomics data. Google’s venture capital arm recently invested in DNAnexus, a bioinformatics company, and together they plan to host their own copy of the federal sequence archive that had once looked as if it might be closed.

There is now so much raw data that it is not feasible to re-analyze it. Increasingly only the final results will be stored, and perhaps even less: just the differences between a particular genome and some reference genome.
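A minimal sketch of the store-only-the-differences idea (toy sequences and hypothetical helper names; real pipelines use variant callers and formats such as VCF, but the space-saving principle is the same):

```python
# Sketch: store a genome as its differences from a reference genome.
# Toy single-base substitutions only; real data also has insertions,
# deletions and structural variants.

def diff_against_reference(reference: str, genome: str):
    """Return (position, reference_base, variant_base) for every mismatch."""
    return [(i, r, g) for i, (r, g) in enumerate(zip(reference, genome)) if r != g]

def reconstruct(reference: str, variants):
    """Rebuild the full genome from the reference plus the stored variants."""
    seq = list(reference)
    for pos, _ref_base, variant_base in variants:
        seq[pos] = variant_base
    return "".join(seq)

reference = "ACGTACGTACGT"
genome    = "ACGTTCGTACCT"

variants = diff_against_reference(reference, genome)
print(variants)   # only the mismatching positions need storing

assert reconstruct(reference, variants) == genome
```

Since two human genomes typically differ in only a small fraction of their three billion bases, keeping the variant list instead of the full sequence cuts storage by orders of magnitude, at the cost of needing the shared reference to reconstruct.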

FOOTNOTE: IT'S A BRAIN CONSTRAINT

Professor Claudius Gros, Gregor Kaczor and Dimitrije Markovic of Goethe University, Frankfurt, Germany, researched signs of the Weber-Fechner law in the size distribution of files on the Internet, examining the factors underlying human information production on a global level by studying some 252-633m publicly available data files on the Internet, corresponding to an overall storage volume of 284-675 terabytes.

Analyzing the file size distribution for several distinct data types, they found indications that the neuropsychological capacity of the human brain to process and record information may constitute the dominant limiting factor for the overall growth of globally stored information, with real-world economic constraints having only a negligible influence.

This supposition draws support from the observation that the file size distributions follow a power law for data without a time component, like images, and a log-normal distribution for multimedia files, for which time is a defining quality.
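The distinction between the two distributions can be illustrated with synthetic data (a minimal sketch, not the study's dataset, with arbitrary parameter choices): on log-log axes the survival function of a power law has a constant slope, while that of a log-normal steepens as sizes grow.

```python
# Sketch: distinguishing a power law from a log-normal by the slope of
# the empirical survival function P(X > x) on log-log axes. Synthetic
# samples with arbitrary parameters, not the study's file-size data.

import math
import random

random.seed(42)
n = 200_000
power_law  = [random.paretovariate(1.5) for _ in range(n)]   # exponent 1.5 (assumption)
log_normal = [random.lognormvariate(0.0, 1.0) for _ in range(n)]

def survival(sample, x):
    """Empirical probability P(X > x)."""
    return sum(1 for v in sample if v > x) / len(sample)

def loglog_slope(sample, x1, x2):
    """Slope of the survival function between x1 and x2 on log-log axes."""
    return (math.log(survival(sample, x2)) - math.log(survival(sample, x1))) / (
        math.log(x2) - math.log(x1))

for name, sample in [("power law", power_law), ("log-normal", log_normal)]:
    s1 = loglog_slope(sample, 2, 4)
    s2 = loglog_slope(sample, 4, 8)
    print(f"{name}: slope 2->4 = {s1:.2f}, slope 4->8 = {s2:.2f}")
```

The power-law sample keeps a slope near -1.5 across both decades, while the log-normal sample's slope grows noticeably steeper, which is the kind of signature such a study can look for in real file-size data.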

Global information, it appears, cannot grow any faster than our ability to absorb or monitor it. This raises some interesting avenues for future research too, notes Technology Review. For example, it will be interesting to see how machine intelligence might change this equation; it may be that machines can be designed to distort this relationship with information.