News

Distributed System Architectures for Healthcare and Life Sciences

The insideBIGDATA Guide to Healthcare & Life Sciences is a useful new resource directed toward enterprise thought leaders who wish to gain strategic insights into this exciting new area of technology. The guide provides an overview of the utilization of big data technologies as an emerging discipline in healthcare and life sciences. It explores the characteristics of this business strategy and the benefits of leveraging big data technologies within these sectors. It also touches on the challenges and future directions of big data and analytics in the healthcare and life sciences industries. The complete insideBIGDATA Guide to Healthcare & Life Sciences is available for download from the insideBIGDATA White Paper Library.

Distributed System Architectures

Many organizations in the healthcare and life sciences industries are just now exploring the opportunities surrounding the dominant distributed processing architectures, Hadoop and Spark, and are looking for ways to get started. Even though Hadoop has been on the market since 2011, it is still a relatively new technology in these sectors, while Spark is often reserved for phase 2: different use cases and more advanced users.

Difficult challenges and choices face today's healthcare industry: researchers, clinicians, and administrators have to make important decisions, often without sufficient data. Distributed systems like Hadoop and Spark offer open source platforms to make healthcare data available and actionable. Researchers explore the genetic architecture of cancer cells; nurses and physicians monitor intensive care patients; administrators submit reimbursement claims before patients leave the hospital. Distributed computing systems are transforming healthcare.

Hadoop is a strong example of a technology that allows healthcare organizations to store data in its native form. Without Hadoop, decisions would have to be made about what can be incorporated into the data warehouse or the electronic medical record (and what cannot). Now everything can be brought into Hadoop, regardless of data format or speed of ingest. If a new data source is found, it can be stored immediately; no data is left behind. By the end of 2017, the health records of millions of people are likely to grow into tens of billions of data points. Thus, the computing technology and infrastructure must be able to render a cost-efficient implementation of:

unconstrained parallel data processing

storage for billions or trillions of unstructured data sets

fault tolerance along with high availability of the system

Hadoop technology is successful in meeting these challenges faced by the healthcare industry because the MapReduce engine and the Hadoop Distributed File System (HDFS) have the capability to process thousands of terabytes of data. Hadoop makes use of highly optimized yet inexpensive commodity hardware, making it a budget-friendly investment for the healthcare industry.
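To make the MapReduce model concrete, here is a minimal sketch of its map, shuffle, and reduce phases in plain Python. The diagnosis-code records and field names are hypothetical, chosen only for illustration; a real Hadoop job would distribute these phases across a cluster over data stored in HDFS.

```python
from collections import defaultdict

# Toy patient records: (patient_id, diagnosis_code)
records = [
    ("p1", "E11"), ("p2", "I10"), ("p3", "E11"),
    ("p4", "I10"), ("p5", "E11"), ("p6", "J45"),
]

def map_phase(record):
    """Map: emit a (key, 1) pair for each diagnosis code observed."""
    patient_id, code = record
    return (code, 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts for one diagnosis code."""
    return (key, sum(values))

pairs = [map_phase(r) for r in records]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'E11': 3, 'I10': 2, 'J45': 1}
```

Because each map call touches only one record and each reduce call only one key's values, the work partitions cleanly across machines, which is exactly what lets Hadoop scale to thousands of terabytes.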

Hadoop Use Cases

In conventional IT environments, clinical, operational and financial data are managed in data silos. Meanwhile, with the movement from paper-based to electronic health records, and with the increase in usage of machines and medical devices that produce steady streams of data, the volume of data that healthcare institutions capture and analyze has skyrocketed, while the variety of that data has grown.

The Hadoop platform allows healthcare organizations to process and manage an ever larger influx of data in a secure and cost-effective manner to improve quality and affordability. They can leverage the platform to bring together large volumes of detailed data from diverse sources, in a variety of formats, and consolidate it into a single flexible and scalable system for long-term storage and analysis.

Cancer treatments and genomics – There has been an uptick in Hadoop adoption in the life sciences community, mostly targeting next-generation sequencing and simple read mapping, because developers discovered that a number of bioinformatics problems transfer very well to Hadoop, especially at scale.

Monitoring patient vitals – Several hospitals around the world use Hadoop to help their staff work efficiently with big data. Without Hadoop, most patient care systems could not work with unstructured data for analysis at all.
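A vitals-monitoring pipeline of this kind is, at its core, a rule check applied to every reading in a stream. The sketch below shows the idea in plain Python; the patient IDs, heart-rate values, and alert thresholds are all hypothetical, and a production system would run such checks continuously over device feeds rather than over an in-memory list.

```python
# Hypothetical vitals readings: (patient_id, heart_rate_bpm)
readings = [("icu-1", 72), ("icu-2", 140), ("icu-1", 75), ("icu-3", 38)]

HIGH, LOW = 120, 50  # assumed alert thresholds, for illustration only

def check(reading):
    """Return an alert tuple if a reading falls outside the safe range."""
    patient, hr = reading
    if hr > HIGH:
        return (patient, "tachycardia alert")
    if hr < LOW:
        return (patient, "bradycardia alert")
    return None

alerts = []
for r in readings:
    result = check(r)
    if result:
        alerts.append(result)

print(alerts)  # [('icu-2', 'tachycardia alert'), ('icu-3', 'bradycardia alert')]
```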

Hospital networks – Hadoop technology is used to help medical experts analyze high-velocity data in real time from diverse sources such as financial data, payroll data, and EHRs.

Fraud detection and prevention – Using Hadoop technology, insurance companies have successfully developed predictive models to identify fraud from real-time and historical data on medical claims, weather, wages, voice recordings, demographics, attorney costs, and call center notes. Hadoop's capability to store large unstructured data sets in NoSQL databases, combined with MapReduce analysis, helps detect patterns of fraud.
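One simple pattern such models build on is grouping claims by key and flagging suspicious repetition, which maps directly onto MapReduce. The sketch below groups hypothetical claims by (patient, procedure, date) and flags keys billed more than once; the claim IDs, procedure codes, and the duplicate-billing rule itself are illustrative assumptions, not any insurer's actual logic.

```python
from collections import defaultdict

# Hypothetical claims: (claim_id, patient_id, procedure_code, service_date)
claims = [
    ("c1", "p1", "99213", "2017-03-01"),
    ("c2", "p1", "99213", "2017-03-01"),  # same visit billed twice
    ("c3", "p2", "99214", "2017-03-02"),
]

# Map: key each claim by (patient, procedure, date).
# Reduce: collect claim IDs per key and flag keys seen more than once.
groups = defaultdict(list)
for claim_id, patient, proc, date in claims:
    groups[(patient, proc, date)].append(claim_id)

suspicious = {key: ids for key, ids in groups.items() if len(ids) > 1}
print(suspicious)  # {('p1', '99213', '2017-03-01'): ['c1', 'c2']}
```

In a Hadoop job, the grouping step is performed by the shuffle between map and reduce, so the same logic scales to billions of historical claims.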

Spark Use Cases

Precision medicine – The promise of precision medicine is a far-reaching goal that will require sweeping changes to the ways physicians treat patients, health data is collected, and global collaborative research is performed. Precision medicine typically describes an approach for treating and preventing disease that takes into account a patient's individual variation in genes, lifestyle, and environment. Achieving this mission relies on the intersection of several technology innovations and a major restructuring of health data to focus on the genetic makeup of the individual. The healthcare ecosystem has adopted a variety of tools and techniques for working with big data, but one tool that comes up again and again in the new architectures is Spark. Spark is already known as a major player in big data analysis, and it is also uniquely capable of advancing genomics algorithms given the complex nature of genomics research.

Genomics algorithms – Transitioning today's popular genomics algorithms to Spark is one path researchers are taking to exploit the distributed processing capabilities of the cloud, and many of these algorithms are already being built on top of Spark. While Spark provides many infrastructure advantages, it also speaks the languages popular with the research community: interfaces like SparkR make for an easy transition into the cloud.

Computational neuroscience – One example of a research project taking advantage of Spark comes from the Howard Hughes Medical Institute. The project's goal is to understand brain function by monitoring and interpreting the activity of large networks of neurons during behavior. An hour of brain imaging for a mouse can yield 50-100 gigabytes of data. The researchers developed a library of analytical tools called Thunder, which is built on Spark using the Python API along with existing libraries for scientific computing and visualization. The core of Thunder is expressing different neuroscience analyses in the language of RDD operations. Many computations, such as summary statistics, regression, and clustering, can be parallelized using MapReduce.
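The "summary statistics as RDD operations" idea reduces to mapping each neuron's activity trace to a small summary record, a computation that parallelizes trivially. The sketch below shows the shape of that map step in plain Python over made-up data; the neuron IDs and trace values are hypothetical, and Thunder itself distributes the equivalent map over a Spark RDD of (neuron, trace) records.

```python
import statistics

# Hypothetical imaging data: neuron id -> activity trace (one value per frame)
traces = {
    "n1": [0.1, 0.4, 0.3, 0.2],
    "n2": [0.9, 0.8, 1.0, 0.7],
}

# Map each (neuron, trace) record to its summary statistics. Each neuron's
# trace is processed independently, so the work partitions across a cluster.
summaries = {
    neuron: {"mean": statistics.fmean(trace), "peak": max(trace)}
    for neuron, trace in traces.items()
}
print(summaries)
```

Regression and clustering follow the same pattern, with a reduce step combining per-partition results, which is why they fit the MapReduce model the text describes.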

If you prefer, the complete insideBIGDATA Guide to Healthcare & Life Sciences is available for download in PDF from the insideBIGDATA White Paper Library, courtesy of Dell and Intel.