Clinical discovery in the age of “Big Data”

Modern data processing tools, many of them open source, enable more clinical studies at lower cost

This guest posting was written by Yadid Ayzenberg (@YadidAyzenberg on Twitter). Yadid is a PhD student in the Affective Computing Group at the MIT Media Lab. He has designed and implemented cloud platforms for the aggregation, processing and visualization of bio-physiological sensor data. Yadid will speak on this topic at the Strata Rx conference.

A few weeks ago, I learned that the Framingham Heart Study would lose $4 million (a full 40 percent of its funding) from the federal government due to automatic spending cuts. This seminal study, begun in 1948, set out to identify the contributing factors to cardiovascular disease (CVD) by following a group of 5,209 men and women, tracking their lifestyle habits, and performing regular physical examinations and lab tests. The study was responsible for identifying the major risk factors for CVD, such as high blood pressure and lack of exercise. The costs associated with such large-scale clinical studies are prohibitive, putting them within reach only of organizations with sufficient financial resources or government funding.

One of the major cost drivers in these types of studies is data acquisition and management. These high costs, which inhibit the collection of data, create what I like to call the chicken-and-egg problem of clinical data collection: if you do not know whether a piece of data is clinically relevant, you do not collect it. But if you do not collect it, it will be difficult to determine whether it is clinically relevant. Cost and complexity prevent us from gathering all of the data, so we gather only data that is widely known to be relevant, which limits the usefulness of the collected data for discovering new types of correlations. This makes it very challenging to determine which data, beyond what is already scientifically established, is pertinent to a specific clinical state. Collecting additional data, on the other hand, would increase the statistical power of a study and enable the discovery of new correlations.
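To make the point about statistical power concrete, here is a minimal Python sketch. It uses a standard textbook approximation for the power of a two-sided, two-sample z-test; the function name, the effect size of 0.1, and the group sizes are all illustrative assumptions, not figures from any study.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is the standardized mean difference between the groups.
    The negligible contribution of the opposite tail is ignored.
    """
    z = NormalDist()  # standard normal distribution
    z_crit = z.inv_cdf(1 - alpha / 2)  # critical value for the chosen alpha
    return z.cdf(effect_size * sqrt(n_per_group / 2) - z_crit)


# Hypothetical small effect, evaluated at increasing (made-up) group sizes.
for n in (50, 500, 5000):
    print(n, round(approx_power(0.1, n), 2))
```

Under these assumptions, a small effect that is almost never detected with 50 participants per group is detected almost every time with 5,000, which is why cheaper data collection matters.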

Historically, clinical scientific discovery was mostly done in small, incremental steps: a hypothesis was formed, data was collected from a group, and the hypothesis was confirmed or rejected. This resulted in small data sets that offered limited statistical power. Data “silofication” compounded the problem. Relevant data may have been collected, but it was stored in separate systems that did not allow easy access for analytical purposes. Take the case of a hospital, for example: patient history would be stored in one system, imaging in another, and prescription refills in yet another. This barrier arose mostly from technological discrepancies: the nature of the data dictated the type of database that would be used to store it.

However, advances in sensor technology and the pervasiveness of mobile phones are decreasing the costs of data collection, making it possible to collect large data sets from the population and perform exploratory analysis. Modern data storage architectures can store different types of data in a single database. What’s more, the widely available big data computation infrastructures of today enable the analysis of enormous quantities of data in ways that were never possible before. By gathering and analyzing an individual's data and comparing it to the data of a population, we may be able to classify disease occurrences and predict health outcomes.
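As a rough illustration of comparing an individual to a population, the sketch below computes a z-score for one person's reading against a reference group. The readings are made up for illustration, and the function name and the cutoff of 2.0 are assumptions. A real study would use far richer models than a single z-score.

```python
from statistics import mean, stdev


def z_score(value, population):
    """Number of sample standard deviations between value and the population mean."""
    mu = mean(population)
    sigma = stdev(population)  # sample standard deviation
    return (value - mu) / sigma


# Hypothetical resting heart rates in beats per minute (invented data).
population_hr = [62, 70, 68, 75, 66, 71, 64, 73, 69, 72]
individual_hr = 88

score = z_score(individual_hr, population_hr)
flagged = abs(score) > 2.0  # assumed cutoff for "unusual" readings
print(f"z-score: {score:.2f}, flagged: {flagged}")
```

With larger and more varied populations, the same comparison could be made against people of similar age, activity level, or environment rather than against a single pooled group.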

Utilizing the plethora of technologies that are powering today’s Internet may reduce a significant portion of the costs associated with clinical studies:

Data collection

Biophysiological data can be collected using wearable sensors, and environmental and behavioral data using smartphone sensors.

Data aggregation

Aggregation can be performed using high-throughput distributed queuing systems such as Kafka, Kestrel, or RabbitMQ.

Data archives

Structured, semistructured, and unstructured data can be stored in NoSQL databases such as HBase, Cassandra, MongoDB, or Redis.

Real-time data analytics

Analytics can be performed in real time for intervention purposes, using a distributed processing system such as Storm, Spark Streaming, or Druid.
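The kind of computation involved can be sketched in a few lines of plain Python. This is a single-process stand-in for what the distributed systems named above do at scale, not an example of their APIs. The function name, window size, threshold, and readings are all hypothetical.

```python
from collections import deque


def rolling_alerts(readings, window=5, threshold=100.0):
    """Yield (index, window_mean) whenever the mean of the last
    `window` readings exceeds `threshold`."""
    recent = deque(maxlen=window)  # oldest reading drops off automatically
    for i, value in enumerate(readings):
        recent.append(value)
        if len(recent) == window:
            avg = sum(recent) / window
            if avg > threshold:
                yield i, avg


# Hypothetical heart-rate stream in beats per minute (invented data).
stream = [72, 75, 80, 98, 110, 120, 125, 118, 90, 76]
alerts = list(rolling_alerts(stream, window=3, threshold=100.0))
for index, avg in alerts:
    print(f"reading {index}: rolling mean {avg:.1f} exceeds threshold")
```

Averaging over a short window rather than reacting to single readings is one simple way to avoid triggering an intervention on an isolated noisy sample.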

Offline data analytics

Large amounts of data may be processed by one of the many computation platforms available today. Tools such as Hadoop, Spark, Sphinx, Elasticsearch, Ferret, and Solr may be used for text mining and time series analysis.

Developing new clinical data aggregation and collection platforms that rely on these underlying technologies could be the dawn of a whole new era of clinical discovery. Hopefully, we will soon see many more ground-breaking studies of the Framingham type, at a much lower cost. I will be discussing some of these technologies and their various use cases in my upcoming Strata Rx talk.