The MAGAZINE of the Johns Hopkins Bloomberg School of Public Health | Special Issue 2012 | www.jhsph.edu

Big Data Overload

The quest for knowledge in an era flooded with information

by Jim Schnabel

Every step you take, every move you make… Science can learn from you.

The tech revolution that has put iPhones in our pockets and a world of Google-able data at our fingertips has also been ushering in a golden age of health research. Take, for example, work being done by Thomas Glass, PhD, and Ciprian Crainiceanu, PhD, and their teams. They recently clipped accelerometers—smaller than iPhones—onto the hips of elderly research subjects. The devices can record people’s motions in detail, for indefinite periods and in real time if needed. The immediate aim, says Crainiceanu, a Biostatistics associate professor, is to devise a truer method of recording the physical activity of the elderly. But it’s the kind of approach that could turbocharge a lot of other health-related science. No more questionnaires, no more biased recollections, no more droopy-lidded grad students analyzing hours of grainy video. Just the cold, hard facts, folks. Just the data.

Blizzards of information in new studies yield great insights only after investigators solve the big problems of big data.


“In principle, we could take inputs from a wide variety of sensors—say, heat sensors, or portable heart monitors sending data by Wi-Fi or cell phones,” Crainiceanu says. “Our imagination is the limit.” And it’s not just portable gadgets that are making this possible. Brain imaging technology is still big and expensive, but its use is becoming more routine, and it now can deliver information on neural activity, density and connectivity at volumes on the order of a cubic millimeter. Next-gen genomics technologies can catalog DNA and gene-expression levels rapidly and with base-pair precision. Medical records are migrating to the digital and Web realms and contain ever more numeric and imagery-related detail. This gold rush of data gathering represents “an opportunity not just in terms of improving public health but also within biostatistics, for it gives us this tremendous new set of problems to work with,” says Karen Bandeen-Roche, PhD, MS, the Frank Hurley and Catharine Dorrier Professor and Chair of Biostatistics.

And the problems can be considerable. It’s not unusual for a public health study dataset nowadays to require a storage capacity on the order of 10 trillion bytes (10 terabytes)—the equivalent of tens of millions of 1970s-era floppy disks. Larger datasets are inherently better in the sense that they have greater statistical power to overcome random variations (known as noise) in data—just as 1,000 coin flips will be better than five coin flips at revealing the true 50/50 nature of a coin flip. In practice, though, large health-related datasets often contain a grab bag of information that isn’t always relevant and is distorted (biased) by hidden factors that may confound the savviest statistician. Moreover, traditional data collection, storage and analysis techniques can’t always be straightforwardly scaled up to terabyte levels. “How to design data collection properly, how to avoid bias, how best to represent a population of interest—these sorts of challenges may be even greater for the ultra-large datasets than for the more manageable ones with which we’ve traditionally dealt,” says Bandeen-Roche.
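The coin-flip intuition above is easy to check directly. The short simulation below (a hedged illustration, not anything from the researchers' own work) compares how far 5-flip and 1,000-flip experiments typically stray from the coin's true 50/50 rate; the larger sample's estimates cluster much more tightly around the truth:

```python
import random

random.seed(42)  # fixed seed so runs are reproducible

def avg_deviation(n_flips, n_trials=10_000):
    """Average absolute deviation from the true 50% heads rate,
    across many simulated experiments of n_flips each."""
    total = 0.0
    for _ in range(n_trials):
        heads = sum(random.random() < 0.5 for _ in range(n_flips))
        total += abs(heads / n_flips - 0.5)
    return total / n_trials

small = avg_deviation(5)     # noisy: each experiment sees only 5 flips
large = avg_deviation(1000)  # far steadier: noise averages away
print(f"typical error with    5 flips: {small:.3f}")
print(f"typical error with 1000 flips: {large:.3f}")
```

The 1,000-flip experiments land within a percentage point or two of 50%, while the 5-flip experiments routinely miss by 15 points or more—the statistical power that makes large datasets attractive, before the bias and scaling problems Bandeen-Roche describes enter the picture.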

For Crainiceanu and his team, the goal was to turn days of raw, wiggly, three-axis accelerometer voltage readouts into meaningful interpretations of human movements. Such a task essentially attempts to reproduce—with an artificial sensor system plus software processing—the ability of higher organisms like mice or people to recognize individual movements amid the vast, noisy streams of visual and somatosensory signals coming into their nervous systems. It’s a big-data-processing skill that took us mammals tens of millions of years to develop, and even in furry, small-brained ones it involves myriad wetware layers of filtering and logic.

Crainiceanu saw the parallels to neural processing right away, and chose speech perception as a guiding analogy. “Movement is essentially like speech,” he says. “It involves units like words, which combine into meaningful sequences that are like sentences and paragraphs. So we started by processing the accelerometer data into the smallest meaningful movement units, which we called movelets.”
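One way to picture the movelet idea is dictionary matching: build a small library of short, labeled signal snippets, then slide a window along the incoming stream and tag each position with the nearest library entry. The sketch below is only an illustration of that general scheme, not the team's actual pipeline—the window length, the toy one-axis signal, the activity labels, and the Euclidean distance metric are all assumptions made for the example:

```python
import math

# Toy single-axis accelerometer stream; real data would be three axes
# sampled many times per second. Values are illustrative only.
stream = [0.0, 0.1, 0.9, 1.0, 0.9, 0.1, 0.0, 0.0, 0.5, 0.5, 0.5, 0.0]

WINDOW = 3  # movelet length in samples (an assumption for this sketch)

# Hypothetical dictionary of labeled movelets, e.g. gathered during a
# short calibration period when the subject performs known activities.
dictionary = {
    "rest": [0.0, 0.0, 0.0],
    "step": [0.1, 0.9, 1.0],
    "sway": [0.5, 0.5, 0.5],
}

def distance(a, b):
    """Euclidean distance between two equal-length windows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(stream):
    """Slide a window over the stream and label each position
    with the nearest movelet in the dictionary."""
    labels = []
    for i in range(len(stream) - WINDOW + 1):
        window = stream[i:i + WINDOW]
        best = min(dictionary, key=lambda k: distance(window, dictionary[k]))
        labels.append(best)
    return labels

print(classify(stream))
```

Stringing the per-window labels together recovers a sequence of movement "words"—the sentences and paragraphs of Crainiceanu's speech analogy.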