Using one cancer to help defeat many: Mapping Cancer Markers makes progress

By: The Mapping Cancer Markers research team

12 Feb 2015

Summary: Results from the first stage of the Mapping Cancer Markers project are helping the researchers identify the markers for lung cancer, as well as improve their research methodology as they move on to analyze other cancers.

Once again, the Mapping Cancer Markers (MCM) team would like to extend a huge thank you to the World Community Grid members. Although we say this in every update, we remain truly grateful for your contribution to this project.

The MCM project has continued to process lung cancer data, exploring fixed-length random gene signatures. This long stage of the project is nearly over, and we are preparing to transition our focus to a narrower set of genes of interest. Target genes will be chosen by a process that combines statistics from the initial results with pathway and biological-network analysis.

Analytics

In our previous update, we reported the adoption of a new package, the IBM® InfoSphere® Streams real-time analytics platform, to process our World Community Grid data. The majority of our work since the last update has concentrated on continued development and expansion of our Streams system in order to handle the incoming data more robustly and efficiently.

There are two main reasons why a stream-processing design is better suited to processing MCM results than a batch-computing approach. The first relates to the nature of World Community Grid, which is a huge computing resource that continuously consumes work units and produces compute results. Data is best processed as it arrives, to avoid backlogs or storage limitations.
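To make the idea of processing data as it arrives concrete, here is a minimal Python sketch. It is not the project's Streams pipeline: the class name, the result format (a gene signature paired with a score), and the gene labels are all assumptions made for illustration. It keeps a running mean score per gene, updated one result at a time, so no backlog of raw results has to be stored.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Signature = Tuple[str, ...]


class RunningGeneStats:
    """Running mean score per gene, updated one result at a time.

    Hypothetical helper: the real MCM analysis is not described at this
    level of detail in the update.
    """

    def __init__(self) -> None:
        self.count: Dict[str, int] = defaultdict(int)
        self.mean: Dict[str, float] = defaultdict(float)

    def update(self, signature: Signature, score: float) -> None:
        # Incremental mean: no need to keep earlier results in memory.
        for gene in signature:
            self.count[gene] += 1
            self.mean[gene] += (score - self.mean[gene]) / self.count[gene]


def consume(results: Iterable[Tuple[Signature, float]],
            stats: RunningGeneStats) -> RunningGeneStats:
    """Fold each incoming (signature, score) result into the running stats."""
    for signature, score in results:
        stats.update(signature, score)
    return stats


# Toy stream of results with made-up gene labels and scores.
stream = [(("A", "B"), 0.8), (("B", "C"), 0.6), (("A", "C"), 0.4)]
stats = consume(iter(stream), RunningGeneStats())
```

A batch approach would instead collect all results first and compute the means afterwards, which requires storing everything until the batch is complete.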

The second reason, which becomes more important as we transition to the new focus, is that stream processing makes it more effective to design new work units based on partial results. MCM will soon focus on genes of interest revealed by our broad survey of gene-signature space in the first stage. To narrow the focus, we will take an iterative approach, where we design small batches of work units (e.g., 100,000 units), submit them to World Community Grid, analyze the results, and then incorporate the new analysis into designing the next batch. In this way, we will gradually converge towards the answers we are seeking. Because of the continuous nature of the MCM project, and the volume of data we receive on a daily basis, it is imperative that our analysis system processes results quickly enough to generate the next set of work units.
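The iterative approach described above can be sketched as a simple loop. This is only an illustration under stated assumptions: the function names, the weighting rule (add one to the weight of every gene that appears in the top 10% of a batch), and the toy scoring function are hypothetical, and the call to evaluate stands in for submitting work units to World Community Grid and collecting the results.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Signature = Tuple[str, ...]


def design_batch(genes: Sequence[str], weights: Sequence[float],
                 batch_size: int, sig_len: int,
                 rng: random.Random) -> List[Signature]:
    """Draw signatures of sig_len distinct genes, favouring higher weights."""
    batch = []
    for _ in range(batch_size):
        chosen = set()
        while len(chosen) < sig_len:
            chosen.add(rng.choices(genes, weights=weights, k=1)[0])
        batch.append(tuple(sorted(chosen)))
    return batch


def refine(genes: Sequence[str], evaluate: Callable[[Signature], float],
           rounds: int, batch_size: int, sig_len: int,
           seed: int = 0) -> Dict[str, float]:
    """Design a batch, score it, and feed the analysis into the next batch."""
    rng = random.Random(seed)
    weights = {g: 1.0 for g in genes}
    for _ in range(rounds):
        batch = design_batch(genes, [weights[g] for g in genes],
                             batch_size, sig_len, rng)
        # Stand-in for running the work units on the grid.
        results = sorted(((sig, evaluate(sig)) for sig in batch),
                         key=lambda r: r[1], reverse=True)
        # Assumed rule: genes in the top 10% of signatures gain weight.
        for sig, _ in results[: max(1, len(results) // 10)]:
            for g in sig:
                weights[g] += 1.0
    return weights


# Toy setup: three "informative" genes hidden among twenty (made-up labels).
genes = [f"G{i}" for i in range(20)]
informative = {"G0", "G1", "G2"}


def toy_evaluate(sig: Signature) -> float:
    """Hypothetical score: how many informative genes the signature holds."""
    return float(sum(g in informative for g in sig))


weights = refine(genes, toy_evaluate, rounds=5, batch_size=200, sig_len=3)
```

In this toy setting the weights of the informative genes grow from round to round, so later batches concentrate on the more promising region of signature space. The faster each round's analysis finishes, the sooner the next batch can be designed.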

New stage in lung cancer signature discovery

The MCM project has continued to process lung cancer data, exploring random fixed-length signatures of between 5 and 25 biomarkers. This computational component of the “landscape” stage is winding down, and we are preparing to transition our focus to a narrower set of genes of interest. Target genes will be selected by integrating results from several methods, carefully combining statistics from the initial results with pathway and biological-network analysis.

Network analysis/integration of pathway knowledge

One of the most exciting (and crucial) parts of this project is the integration of other research to help understand the results we are collecting. We already know that in most cancers no single biomarker is sufficient, that thousands of clinically relevant signatures can be found, and, most importantly, that many seemingly weak markers provide highly useful information when combined with others. Therefore, we have been trying to find these “best supporting actors” and then the best signatures through “integrative network analysis”.

Figure 1: An iterative strategy for biomarker discovery. Work units are processed on World Community Grid. The results are analyzed via a Streams pipeline. This generates a list of high-scoring genes, which, combined with biological network information (NAViGaTOR), is used to design new MCM work units targeting areas of interest in signature space.

We know that disease is more accurately described in terms of altered signaling cascades (pathways): higher-level patterns composed of multiple genes in a biological network. A pathway can be defined as a series of reactions (“steps”) that result in a certain biochemical process. For example, one could consider the electrical and mechanical systems in a car as a set of interrelated pathways. These systems are important for the overall function of the car; however, some are clearly more important than others. In the same way, a particular cancer occurrence could have a single catastrophic cause (a missing engine block) or multiple smaller causes affecting the same system (e.g., the bolts holding the exhaust system together).

Around the world, researchers are continually finding, publishing and curating biological pathways and their building blocks (protein interactions). We are taking this information and applying it to high-scoring genes and gene signatures identified from Mapping Cancer Markers results. For example, if the first part of our landscape study identified a certain gene as a potential target, we can check via our network analysis (NAViGaTOR), as well as other external sources, whether that same gene is involved in known pathways. We can then gather information about those pathways and refine our findings by resubmitting work units to World Community Grid. In essence, we are identifying genes of interest by combining top-scoring genes with pathway and network context. Those investigations will continue to refine our search space and converge on better and better solutions. Several publications illustrate this work. In particular, Kotlyar et al. (Nature Methods, 2015) provide comprehensive in silico prediction of these signaling cascades. Wong et al. (Proteomics, 2015) introduce a systematic approach to deriving important information about cancer-related structures in these networks. Fortney et al. (PLoS Computational Biology) use the results of this work to identify potential new treatment options for lung cancer.
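As a rough illustration of combining top-scoring genes with pathway context, the Python sketch below looks for pathways that contain several high-scoring genes and then widens the target set to the other members of those pathways. It does not use NAViGaTOR or any real pathway database: the function names, the min_hits threshold, and the gene and pathway labels are all placeholders assumed for the example.

```python
from typing import Dict, Set


def pathways_hit(top_genes: Set[str], pathways: Dict[str, Set[str]],
                 min_hits: int = 2) -> Dict[str, Set[str]]:
    """Map each pathway with at least min_hits top genes to that overlap."""
    hits = {}
    for name, members in pathways.items():
        overlap = top_genes & members
        if len(overlap) >= min_hits:
            hits[name] = overlap
    return hits


def expand_targets(top_genes: Set[str], pathways: Dict[str, Set[str]],
                   min_hits: int = 2) -> Set[str]:
    """Add all members of the hit pathways to the set of target genes."""
    targets = set(top_genes)
    for name in pathways_hit(top_genes, pathways, min_hits):
        targets |= pathways[name]
    return targets


# Placeholder data: labels are invented, not real genes or pathways.
top = {"GENE_A", "GENE_B", "GENE_X"}
known_pathways = {
    "pathway_1": {"GENE_A", "GENE_B", "GENE_C"},
    "pathway_2": {"GENE_X", "GENE_Y"},
}

hits = pathways_hit(top, known_pathways)
targets = expand_targets(top, known_pathways)
```

Here pathway_1 contains two top-scoring genes, so its remaining member GENE_C joins the target set, while pathway_2 contains only one and is left out. This is how a seemingly weak marker could still be selected for the next batch of work units because of its network context.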

Transition to the targeted stage

We expect a gradual and seamless transition to the new stage of MCM, with no interruption in the supply of work units, and no changes to the visualization or code. Both stages will overlap for a period as the last statistics from the first stage are gathered, and the initial, targeted work units are sent out. Average work unit run-time should remain the same. The consistency of run-times should remain the same or improve.

Some related published work

Hoeng J, Peitsch MC, Meyer P, Jurisica I. Where are we at regarding Species Translation? A review of the sbv IMPROVER Challenge. Bioinformatics, 2015. In press.