FOCUS on technology

Machine Learning in HEOR

Managing data volume with new methodologies

Damion Nero, PhD
Manager, HEOR

Editor's note: The 21st Century Cures Act emboldens real-world evidence researchers to increase their contributions to the healthcare landscape as the demands of value-based care require us to reach beyond the limitations of randomized controlled trials. What 21st century research methodologies will be needed by our health economists to succeed in meeting these objectives? This article postulates that machine learning may be just such a tool.

The past 20 years have seen a dramatic increase in the availability of data accessible to stakeholders within the healthcare industry, which has provided ever-increasing opportunities for healthcare economics and outcomes research (HEOR).1 This increase is reflected in the growing volume of data from payer claims as well as from electronic medical records (EMR) and other sources.2 Claims and EMR data are also being linked to sociodemographic and consumer data, which gives a broader perspective on patients and physicians and on how healthcare services are being provided. EMR data can be updated in near real time, whereas some claims data can be refreshed weekly. The speed at which these data can be accessed creates new opportunities for understanding the healthcare system and addressing questions from various stakeholders and vendors regarding how best to position products in order to serve consumers.

One of the challenges with these data is how they can best be analyzed. Traditionally, standard comparative statistical methods (eg, t tests, chi-square analyses) have been used to assess populations and to test for statistically significant differences within specific groups of patients or within segments of the data. However, with the wealth of data now available, new techniques developed from machine learning are starting to be leveraged in order to gain a better understanding of complex patterns within the data. Although machine learning emerged decades ago, its impact on healthcare remains to be determined. While these new techniques offer the possibility of gaining new insights into the healthcare environment, there are certain caveats that should be considered when using these methodologies.

What Is Machine Learning?

Machine learning refers to a large group of analytic and statistical techniques, including supervised methods (eg, co-training [training of regression models], Bayesian statistics, and decision trees) and unsupervised methods (eg, clustering, neural networks, and principal component analysis), that are focused on predicting outcomes or events based on the available data.3 These techniques give computers the ability to learn from the data at their disposal without being explicitly programmed to do so, using programs that can change on their own when exposed to new sets of data.

These techniques generally look for patterns within the data in order to understand what might be leading to a specific event or outcome. In theory, the more complete or “explanatory” the source data are, the more robust the prediction of outcomes will be and the easier it becomes to isolate specific factors (eg, age, race, weight, clinical history) that might be predictive of certain outcomes. However, causal relationships between these factors and the outcomes cannot always be determined, even when a factor is predictive of a specific outcome. The question is how to distinguish a mere association between a factor and an outcome from factors that actually lead to that outcome and can therefore provide actionable intelligence. Methodologies for making this distinction are still being developed; they incorporate clinical input and real-world evidence to guide the conclusions drawn from these analyses. A number of machine learning applications using HEOR data have been published, and there are certain caveats to applying these methodologies.

Machine Learning Applications in HEOR Data

Several recent studies have used varying techniques and data sources to try to understand patient populations.4-6 Some of these analyses are based on specific hypotheses, while others take the form of data mining, in which large datasets are used to generate new hypotheses based on patterns seen within the data. Both methods have produced valuable insights and have provided unique perspectives in research that have substantially increased our understanding of certain patient populations and disease-related outcomes.

Understanding a Patient Population

An example of a data-mining technique being used to understand a patient population was described in a recent study by Razavian et al.4 In this study, the researchers used administrative claims, pharmacy claims, healthcare resource utilization records, and laboratory results to build a model that would identify predictors of type 2 diabetes. This data-driven model was built using known as well as novel factors associated with type 2 diabetes. The data included 42,000 variables and 4.1 million individual patients over a 5-year period. By applying machine learning techniques, the researchers segmented the data in order to successively train the models and reduce the set of significant predictors of type 2 diabetes to a manageable number. They then compared their models with previous ones to arrive at a final model with 400 variables that performed better than the model with known factors alone. The superiority of the new model was determined using the area under the curve (AUC), which measures how well a model distinguishes patients who have the outcome from those who do not: the AUC was greater in the new model (0.80) than in the model with known factors alone (0.75).
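
To make the AUC comparison concrete, the following is a minimal sketch in Python, not the method or data used by Razavian et al. It computes the AUC as the probability that a randomly chosen patient with the outcome receives a higher risk score than a randomly chosen patient without it. The function name, the labels, and both sets of scores are hypothetical toy values chosen only for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Returns the share of (positive, negative) pairs in which the positive
    case has the higher score; ties count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = developed the outcome, 0 = did not.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
# Hypothetical risk scores from a "known factors" model and an "enhanced" model.
baseline_scores = [0.9, 0.4, 0.6, 0.5, 0.2, 0.7, 0.1, 0.3]
enhanced_scores = [0.9, 0.6, 0.7, 0.5, 0.2, 0.65, 0.1, 0.4]

print(auc(labels, baseline_scores))  # 0.6875
print(auc(labels, enhanced_scores))  # 0.8125
```

An AUC of 0.5 corresponds to scores that rank patients no better than chance, and 1.0 to perfect ranking. The pairwise loop is quadratic in the number of patients, so at the scale of millions of patients a rank-based implementation from a statistics library would be the practical choice.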

Understanding Treatment Options

Another study by Devinsky et al focused more specifically on treatment options within a disease population, with a specific hypothesis in mind.5 In this study, administrative claims data were collected for patients with epilepsy over a 5-year period. Claims for a subset of patients who switched antiepileptic drugs (AEDs) were used to train the prediction model by retrospectively evaluating the factors that contributed to the change in treatment. Based on the trained model, the model-predicted AED regimen with the lowest likelihood of treatment change was assigned to each patient in the group of test claims, and outcomes were evaluated to test model validity. The study produced a model with an AUC of 0.72 that was used to guide treatment options based on the factors found to lead to treatment change. Patients who received the model-predicted regimen with the lowest likelihood of change went longer before changing treatment than controls did, and they also had lower healthcare resource utilization.
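
The train-then-assign logic described above can be sketched as follows. This is an illustrative simplification under stated assumptions, not the model used by Devinsky et al: a simple frequency table stands in for the trained prediction model, and the patient groups, regimen names, and rows are all hypothetical.

```python
from collections import defaultdict


def fit_change_rates(training_rows):
    """Estimate the rate of treatment change for each (group, regimen) pair.

    Each row is (patient_group, regimen, changed), where changed is True if
    the patient later switched treatment. A frequency table stands in here
    for a real trained model.
    """
    counts = defaultdict(lambda: [0, 0])  # [number of changes, total rows]
    for group, regimen, changed in training_rows:
        counts[(group, regimen)][0] += int(changed)
        counts[(group, regimen)][1] += 1
    return {key: changes / total for key, (changes, total) in counts.items()}


def recommend_regimen(rates, group):
    """Return the regimen with the lowest estimated change rate for a group."""
    candidates = {reg: p for (g, reg), p in rates.items() if g == group}
    if not candidates:
        return None  # no training data for this group
    return min(candidates, key=candidates.get)


# Hypothetical training claims: (patient_group, regimen, changed)
training_rows = [
    ("adult", "drug_a", True),
    ("adult", "drug_a", False),
    ("adult", "drug_b", False),
    ("adult", "drug_b", False),
    ("senior", "drug_a", False),
    ("senior", "drug_b", True),
]

rates = fit_change_rates(training_rows)
print(recommend_regimen(rates, "adult"))   # drug_b
print(recommend_regimen(rates, "senior"))  # drug_a
```

In a real analysis the recommendation would be evaluated on held-out test claims, as the study did, rather than on the rows used for fitting.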

Evaluating Treatment Patterns and Outcomes

More recent studies have started to use relatively new data sources in order to understand treatment patterns and outcomes within patient populations. In a study by Freedman et al, social media networks, message boards, patient communities, and other online portals for patient interaction were examined to determine barriers to breast cancer treatment.6 Data were aggregated over a 1-year period and categorized by topic, race, and other categories where possible. Using machine learning software, the researchers identified nearly 400,000 out of 1,000,000 posts that were specifically related to barriers to treatment. They found that emotional issues, religious beliefs, physical limitations, limited resources, and healthcare perceptions were some of the key barriers to treatment. Race was also shown to present barriers to treatment. While this study did not provide any direct quantitative outcomes, it did show how actionable intelligence can be gleaned from machine learning and could provide the basis for future studies on this topic.

Considerations When Using Machine Learning

While the aforementioned examples show some of the advantages and applications of machine learning, there are certain considerations that researchers should keep in mind when using these tools in their research. First, there are the questions of data quality and data interpretation. While there are myriad datasets available from various sources that contain large amounts of data, not all of these data are usable for machine learning. Most datasets are not created for HEOR purposes, but rather to meet other specific needs. Administrative claims data, for example, are designed primarily to support reimbursement; the available fields are structured in a way that facilitates claims adjudication but is not necessarily conducive to research. Before using this type of data, one must take care to understand both the structure of the data and certain artifacts of the data that could skew or create erroneous results (eg, miscoding, order of diagnoses, upcoding).

These limitations also apply to EMR data. While these data provide a wide variety of information that was previously not available in claims data, and are refreshed frequently enough to support real-world analyses, universal standards for the information within an EMR are lacking; thus, data can vary dramatically even between treatment settings within the same area. Although some standardization has been implemented, it is difficult to create a standard form that covers the myriad healthcare providers and settings across the country.7 Additionally, there are still many fields that exist as free text, which cannot easily be incorporated into an analytic dataset for the purposes of research. There are also still concerns regarding the consistency of these data from patient to patient. Inconsistencies due to missing records and incorrect entries abound, and the data therefore require additional curation before being used for research purposes.
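
As one small illustration of the curation step described above, the sketch below reports, for each field of interest, the share of records that actually contain a value. It is a minimal example under stated assumptions: the record layout, field names, and values are hypothetical, and a real EMR extract would call for additional checks (eg, valid code sets, plausible ranges).

```python
def field_completeness(records, fields):
    """Share of records with a non-missing value for each field.

    A value counts as missing if the field is absent, None, or an
    empty string.
    """
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    result = {}
    for field in fields:
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        result[field] = filled / total
    return result


# Hypothetical records; field names and values are illustrative only.
records = [
    {"patient_id": "p1", "diagnosis_code": "E11.9", "weight_kg": 82.0},
    {"patient_id": "p2", "diagnosis_code": "", "weight_kg": None},
    {"patient_id": "p3", "diagnosis_code": "G40.909"},
    {"patient_id": "p4", "diagnosis_code": "E11.9", "weight_kg": ""},
]

report = field_completeness(records, ["patient_id", "diagnosis_code", "weight_kg"])
print(report)  # {'patient_id': 1.0, 'diagnosis_code': 0.75, 'weight_kg': 0.25}
```

A field with low completeness, such as the weight field in this toy data, may need to be excluded, imputed, or supplemented from another source before it is used as a predictor.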

Novel data sources, such as social media, pose additional problems. Some of the data from these sources are opinion-based and might not be useful beyond a qualitative assessment of a population unless care has been taken to follow rigorous methodology for the elicitation of information (eg, patient-reported outcomes measurement tools). It is also difficult to determine the accuracy of these data, especially in an online setting that often allows for anonymity.

Consideration of these limitations provides a clearer perspective on machine learning. In order to produce accurate results, most machine learning techniques rely on prior information as well as on the quality of that information in terms of completeness and accuracy. Without completeness and accuracy, any results must be viewed with some skepticism and might require additional input in the form of clinical expertise or additional data sources before conclusions can be drawn. Even with training and model-fitting techniques, a researcher must always be careful not to over-interpret results from machine learning, especially for techniques that require little human input.

Conclusion

The advent of new data sources and the increase in the volume and availability of up-to-date data have presented increasing opportunities to HEOR researchers for understanding various aspects of healthcare. Machine learning provides tools that can help interpret patterns within large datasets and identify factors that are associated with specific outcomes. However, it is still difficult to prove causal relationships from most results. Various methods have been developed to evaluate individual data sources and link data in novel ways in order to gain different perspectives on the data and on outcomes within specific patient populations.

Machine learning of this type has already generated valuable intelligence for the healthcare community and has the potential to yield new insights into patient populations with specific diseases and conditions. The aforementioned examples demonstrate the potential for machine learning to uncover new variables and provide better guidance on treatment practices and patient outreach using existing data.4-6 Furthermore, while most analytic approaches are hypothesis-driven, machine learning opens up new avenues for generating questions from the data rather than mapping the data to existing questions that are often generated from qualitative observations.

The current trend in machine learning is moving increasingly toward less human involvement in the generation of models and the resulting conclusions from an analysis. With the advent of artificial intelligence, more complex algorithms are being implemented to address specific questions within data as well as generate new hypotheses for research. The caveat remains that machines are limited in their ability to interpret data. Computers will often search for the simplest solution given the data at hand, which follows the principle of parsimony, but they may not take into account the complexity of the data or the inherent limitations on data that are still, for the most part, being generated by humans. It is, therefore, paramount that researchers continue to work to validate and assess models before drawing conclusions from them. This practice will better guide healthcare practitioners and provide useful insights for the healthcare community.