How Big Data Is Improving Breast Cancer Prediction Rates

Better prediction of health outcomes through the analysis of big data is the goal of healthcare analytics. Now, a new method for differentiating between “noisy” and predictive variables in big data has been shown to significantly improve the prediction rate of breast cancer.

Developed by a Princeton-led research team, the method, called the influence score (I-score), was found to improve the prediction rate in real-world disease data.

Results of their study, published last week in the journal Proceedings of the National Academy of Sciences, showed that the I-score improved the prediction rate in breast cancer data from 70 percent to 92 percent. To put it another way, the I-score reduced the error rate in predicting breast cancer from 30 percent to just 8 percent.

Applying the approach to real disease data “has not only been quite successful in finding variable sets (thus encompassing higher-order interactions, traditionally rather tricky in big data), but has also resulted in finding variable sets that are very predictive that do not necessarily show up as significant through traditional significance testing,” states the article.

In their study, the researchers emphasized the use of the I-score with genetic data, but they say the methods proposed “are easily tailored to other high-dimensional data in the natural and social sciences.”

Adeline Lo, a postdoctoral researcher in Princeton's Department of Politics and lead author of the article, contends that the I-score has a number of applications outside of healthcare, in areas such as terrorism, civil war, elections and financial markets. However, she adds, the fact that “it does very well with large datasets such as genetic data means it is a potentially very useful tool for the practitioner or research analyst to have.”

Lo says the team’s overarching research agenda was “trying to figure out how to predict better with particularly complex and big data” by identifying highly predictive variables. She asserts that the I-score fares especially well in high-dimensional data with many complex interactions between variables. Genetic data is one example, and it can help predict important health outcomes such as whether a patient might develop breast cancer or relapse.

“The number of genes you have to consider as candidates are in the millions,” according to Lo, who says “having methods that can handle that type of data is incredibly important” as the quantity and complexity of available data continue to grow.

“As a new field of inquiry, the search for measures that maximize predictivity may do much in the way of living up to the hopes of advancing predicting outcomes of interest, such as disease status,” conclude the researchers.

In addition, Lo sees I-score as being valuable for analyzing other big healthcare data such as those generated by the widespread use of electronic health records as well as wearable sensors/trackers, mHealth apps, and social media.

“We’ve been overwhelmed by big data, particularly in healthcare. The question is: how can we use it in the most effective way,” she adds. “The information that is being collected from these health gadgets is really great because there’s a lot of very rich data that’s constantly being gathered about individuals.”