Outline

Introduction: Advances in biomedical instruments are rapidly reducing acquisition times, increasing data accuracy, and consequently increasing the amount of data gathered significantly. This brings big data challenges into the domain of computational medicine and bioinformatics. In this manuscript we describe an example methodology for handling big data in biomedicine, using off-the-shelf open-source tools to outsource and manage computationally intensive tasks. We demonstrate this for high-throughput proteomics, bioinformatics, and image processing [1].

Material and Methods: Stepwise processing of data is a trait of many biomedical data analysis applications, often requiring multiple pieces of software, algorithms, and data formats. The different steps require varying computational capacities that can depend on results from previous steps. Grids have been used in the past to deal with big data, but their cumbersome usage patterns create difficulties in the agile big data environment [2]. Although solutions such as Galaxy [3] and the EBI services [4] offer scientists access to on-demand tools, it is not trivial to include one's own algorithms in their processing pipelines.

The combination of cloud computing and scientific workflow engines [5] enables the connection of modular processing steps, the automation of analysis pipelines, and, importantly, the sharing of analyses in a reproducible way. Workflow engines facilitate the orchestration of decomposition, processing, and re-composition of big data between local and cloud resources. We have been using cloud resources available for research based on OpenNebula [6], together with the scientific workflow engine Taverna [7].
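The decompose/process/recompose pattern described above can be illustrated with a minimal sketch. This is not the actual Taverna workflow; the function names and the doubling "processing" step are purely illustrative stand-ins for the real analysis steps, and the thread pool stands in for dispatch to remote resources.

```python
from concurrent.futures import ThreadPoolExecutor

def decompose(scans, n_chunks):
    """Split the input into roughly equal chunks for parallel processing."""
    size = (len(scans) + n_chunks - 1) // n_chunks
    return [scans[i:i + size] for i in range(0, len(scans), size)]

def process(chunk):
    """Placeholder for one remote processing step (e.g. a search-engine run)."""
    return [s * 2 for s in chunk]

def recompose(results):
    """Merge the per-chunk results back into a single, ordered result set."""
    return [r for chunk in results for r in chunk]

scans = list(range(10))
with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map preserves input order, so recomposition is a simple flatten.
    merged = recompose(pool.map(process, decompose(scans, 4)))
```

Because `map` returns results in submission order, re-composition stays trivial even when chunks finish out of order on heterogeneous resources.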

Results: We used workflows both to embed the logic for data analysis and to manage access to the different resources. Our cloud can include any machine to which the workflow engine has ssh access, e.g. a server, a desktop, Amazon EC2, or OpenNebula. Our scientific workflows for data analysis in proteomics include local format conversions using proprietary software, remote peptide identification using open-source search engines, and local or remote statistical analysis using R. For instance, by decomposing and recomposing the data, we searched 213,788 scans (1.3 GB mzXML) against the UniProt canonical sequence data for Homo sapiens, with an error window of -0.5 to +2.5 Da and dynamic Carbamidomethyl modification. Using only idle campus cloud resources, the analysis was completed in 12.5 minutes (a 26-fold speedup compared to a local run).
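The ssh-based dispatch of search jobs to whatever resources are reachable can be sketched as follows. The host names, chunk file names, and the `run_search` command are hypothetical placeholders, and the round-robin assignment is a simplification of the engine's actual scheduling; the sketch only builds the command lines rather than executing them.

```python
# Hypothetical resources reachable via ssh (server, desktop, EC2 instance).
resources = ["node1.campus", "node2.campus", "ec2-host"]

# Hypothetical chunk files produced by decomposing the mzXML input.
chunks = ["chunk_%02d.mzXML" % i for i in range(6)]

def ssh_command(host, chunk):
    """Build the ssh invocation the workflow engine would issue for one chunk."""
    return ["ssh", host, "run_search", chunk]

# Round-robin assignment of chunks to the available resources.
jobs = [ssh_command(resources[i % len(resources)], chunk)
        for i, chunk in enumerate(chunks)]
```

Each job is an ordinary command line, so any machine accepting ssh connections can join the pool without special middleware, which is what allows the cloud to mix campus machines with EC2 instances.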

Discussion: Computational methods are struggling to keep pace with the torrent of data produced by biomedical instruments. Big data demands faster, more agile, scalable, and reliable solutions than local workstations or Grids can offer. Data analysis is increasingly becoming problem-specific, and solutions cannot easily be generalized. Our experiments show that developing specific ad-hoc solutions based on open-source scientific workflow engines and cloud infrastructures is a flexible and convenient approach to big-data-driven science. In our setup, all data are encrypted during transport between the different campus cloud resources. We are currently conducting trials with homomorphic cryptography to compute directly on encrypted data [8]. This will allow us to extend the cloud to untrusted resources, since the data and the algorithms are never decrypted and thus remain safe throughout the entire process.