Authors

Abstract

In this paper, we present a heart disease prediction use case that shows how synthetic data can address privacy concerns and overcome the constraints inherent in small medical research data sets. Advanced machine learning algorithms, such as neural network models, can improve prediction accuracy, but they require very large data sets, which are often unavailable in medical or clinical research. We examine the use of surrogate data sets composed of synthetic observations for modeling heart disease prediction. We generate surrogate data based on the characteristics of the original observations and compare the prediction accuracy that traditional machine learning models achieve with the original observations against the accuracy they achieve with the synthetic data. We also use a large surrogate data set to build a neural network model (Perceptron) and compare its prediction results with those of the traditional machine learning algorithms (Logistic Regression, Decision Tree, and Random Forest). Using traditional machine learning models with surrogate data, we achieved improved prediction stability, with accuracy of around 81 percent and variation within 2 percent under ten-fold validation. Using the neural network model with surrogate data, we improved the accuracy of heart disease prediction by nearly 16 percent, to 96.7 percent, while maintaining stability at 1 percent. We find surrogate data to be a valuable tool for anonymizing sensitive data and improving classification prediction.
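
As a minimal illustration of the workflow summarized above, the following Python sketch generates surrogate observations from the per-class characteristics of a small original data set and then measures accuracy stability across ten folds. It is a sketch under stated assumptions, not the method used in this paper: the independent per-feature Gaussian sampler, the toy nearest-centroid classifier, the made-up two-feature data, and all function names are hypothetical choices made only for illustration.

```python
import random
import statistics


def fit_class_stats(rows, labels):
    """Per-class (mean, standard deviation) for each feature column."""
    stats = {}
    for label in set(labels):
        cols = list(zip(*[r for r, y in zip(rows, labels) if y == label]))
        stats[label] = [(statistics.mean(c), statistics.pstdev(c)) for c in cols]
    return stats


def generate_surrogate(stats, n_per_class, rng):
    """Draw synthetic rows from independent per-feature Gaussians.

    Assumption: features are treated as independent and normally
    distributed within each class; real data may need a richer model.
    """
    rows, labels = [], []
    for label, col_stats in sorted(stats.items()):
        for _ in range(n_per_class):
            rows.append([rng.gauss(mu, sd) for mu, sd in col_stats])
            labels.append(label)
    return rows, labels


def nearest_centroid_predict(model, row):
    """Toy classifier: choose the class whose feature means are closest."""
    def squared_distance(label):
        return sum((x - mu) ** 2 for x, (mu, _) in zip(row, model[label]))
    return min(model, key=squared_distance)


def k_fold_accuracies(rows, labels, rng, k=10):
    """Accuracy on each of k held-out folds after fitting on the rest."""
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accuracies = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit_class_stats([rows[i] for i in train],
                                [labels[i] for i in train])
        correct = sum(nearest_centroid_predict(model, rows[i]) == labels[i]
                      for i in fold)
        accuracies.append(correct / len(fold))
    return accuracies


rng = random.Random(0)

# Made-up "original" observations (two numeric features, two classes);
# these are illustrative values, not data from the study.
original_rows = ([[rng.gauss(50, 8), rng.gauss(200, 30)] for _ in range(30)]
                 + [[rng.gauss(60, 8), rng.gauss(250, 30)] for _ in range(30)])
original_labels = [0] * 30 + [1] * 30

# Build a larger surrogate set from the original per-class statistics.
stats = fit_class_stats(original_rows, original_labels)
surrogate_rows, surrogate_labels = generate_surrogate(stats, 500, rng)

# Ten-fold evaluation on the surrogate data; the spread between the best
# and worst fold is one simple way to express prediction stability.
accuracies = k_fold_accuracies(surrogate_rows, surrogate_labels, rng, k=10)
spread = max(accuracies) - min(accuracies)
print(f"mean accuracy: {statistics.mean(accuracies):.3f}, spread: {spread:.3f}")
```

Because the surrogate rows are sampled from summary statistics rather than copied from individual records, no synthetic row corresponds directly to an original observation, and the surrogate set can be made as large as a given model requires.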