Big data and new knowledge in medicine: The thinking, training, and tools needed for a learning health system

Abstract

Big data in medicine (massive quantities of health care data accumulating from patients and populations, and the advanced analytics that can give those data meaning) hold the prospect of becoming an engine for the knowledge generation that is necessary to address the extensive unmet information needs of patients, clinicians, administrators, researchers, and health policy makers. This article explores the ways in which big data can be harnessed to advance prediction, performance, discovery, and comparative effectiveness research to address the complexity of patients, populations, and organizations. Incorporating big data and next-generation analytics into clinical and population health research and practice will require not only new data sources but also new thinking, training, and tools. Adequately utilized, these reservoirs of data can be a practically inexhaustible source of knowledge to fuel a learning health care system.

Netflix, the popular entertainment company, is known for making useful movie suggestions to its customers. In 2006 the company embarked on a project to further improve its ability to predict which movies its customers would like. 1 Through an open competition, Netflix offered a $1 million prize to the group that most improved on Netflix's traditional approach, which was based on conventional statistics.

The Netflix strategy for improving service was interesting, in part, for what it did not do. Netflix did not hire psychologists to develop conceptual models of the factors that influence an individual's viewing experience. It did not test hypotheses about the theory of choice or the determinants of genre preference. It did not perform randomized controlled trials to compare ways of presenting information to customers.

Instead, Netflix chose to exploit its data. Netflix provided competitors for the prize with 100 million ratings submitted on almost 18,000 movie titles by almost 500,000 people.
The winning teams not only focused on how each person rated movies but also, importantly, discovered that an individual's ratings were influenced by factors such as whether the person rated many movies at a time (which tended to accentuate positive or negative preferences) or by the overall popularity of a movie across raters at a particular point in time. Ultimately, the winners produced an algorithm that increased the accuracy of predicting ratings by 10 percent.

The Netflix competition exemplifies a data-driven approach that is emerging from a new era of big data. Big data has been described as the rapidly increasing size of available data, the speed with which those data are produced, and the ways in which the data are represented. 2 The term can refer not only to the data but also to the possibilities of discovering new knowledge by leveraging massive data collections in novel ways. The analytic methods for big data typically depart from traditional statistics and hypothesis testing; they incorporate techniques such as