Cancer genomics resources are growing at an unprecedented pace. However, a comprehensive analysis of the cancer genome remains a daunting challenge. This is in part due to the difficulties in visualizing, integrating, and analyzing cancer genomics data with current technologies. We propose to develop a cloud-based platform to empower researchers with the ability to host, visualize, and analyze their own data. The platform is composed of a set of Cancer Analytics Virtual Machines (CAVMs). The main component of each CAVM is a data server that stores user data and serves it to applications, such as the UCSC Cancer Genomics Browser, for data visualization. The second component is a modified Galaxy workflow system that provides data analysis capability. UCSC's suite of tools for next-generation sequencing data analysis and pathway inference will be prepackaged with the system. The two components will be highly integrated to allow tightly coupled cycles of data visualization and analysis. The data server component will be modular, so that it can provide data independently to applications other than the Cancer Browser and Galaxy. We will deliver virtual machine images that can be easily launched in a commercial cloud such as Amazon, or installed within a user's own institution. The CAVM also serves as a way for users to integrate with external large-scale databases. We will deliver a UCSC CAVM to which other CAVM instances can connect, providing authorized access to data from the UCSC cancer genomics data repository. The system allows the dynamic formation of new datasets composed of data slices from multiple sources. This ability to combine data into larger samples will provide the statistical power to allow discoveries that would otherwise not be possible.
We aim to eliminate, or significantly reduce, the overhead of system configuration and software installation. Our tools will give users access to a cloud-based cluster computing environment, which will make sophisticated, computationally intensive analyses accessible to researchers who might not have access to compute servers. The software platform we develop can be used by individual bench biologists, and also by large projects to serve data to individual users or to other projects. This design has the potential to form an expansive federated database accessible through the same software interface.

Public Health Relevance

Clinicians and bench biologists typically depend on external collaborators for data analysis. The proposed system will provide these scientists with data analysis and visualization methods that are both powerful and easy to use. This will accelerate research in the understanding and treatment of cancer, the second-leading cause of death in the U.S.