Predicting advanced coronary calcium using machine learning

One goal of personalized medicine is to use data science tools to guide medical decision-making. Here, Cihan Oguz and colleagues describe, in an article published as part of the Systems Medicine thematic series in BMC Systems Biology, how they used machine learning tools to develop a predictive model of coronary artery disease.

The risk of developing coronary artery disease (CAD) varies greatly among individuals in the general population. Clinical variables like LDL cholesterol and systolic blood pressure do not always tell the whole story regarding an individual’s risk of developing CAD.

Past research has shown that a patient's level of coronary artery calcium (CAC) is a strong predictor of CAD, as well as of lethal cardiac events such as heart attacks. Identifying markers that predict high CAC levels can therefore help identify patients who are at greater risk and prevent accelerated progression of heart disease, especially at an early age.

How can one identify markers that predict which individuals are at high risk of advanced CAC? With the recent advances in genomics, one possible route is to use genomic information from a pool of patients that includes two subgroups representing the two extremes of the phenotypic distribution in the general population (i.e., no disease vs. advanced disease).

Single nucleotide polymorphisms (SNPs) represent a particularly rich source of genetic variation (about 10 million SNPs are present in the human genome), making them ideal for establishing links between genetic variation and complex diseases. A major challenge in building predictive models of complex diseases is their multifactorial nature, which involves interactions between several genes.

Recently, there has been increasing interest in applying machine learning tools to disease prediction. These methods make it easier to integrate multiple data sources (e.g., clinical, genotypic, and transcriptomic) while exploiting potential linear and non-linear interactions between disease predictors.

To this end, we integrated clinical data and SNP genotype data into machine learning models to identify SNPs that are predictive of advanced CAC levels. We found 56 highly predictive SNPs in a discovery cohort, which were then tested in an independent replication cohort.
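
To illustrate what integrating the two data types can look like in practice, here is a minimal Python sketch. It is not the code used in the study: it assumes an additive encoding in which each SNP genotype becomes a count of minor alleles (0, 1, or 2), and the function names, variable names, SNP identifiers, and patient values are all hypothetical.

```python
from typing import Dict, List


def encode_genotype(genotype: str, minor_allele: str) -> int:
    """Count copies of the minor allele in a two-letter genotype such as 'AG'."""
    if len(genotype) != 2:
        raise ValueError(f"expected a two-allele genotype, got {genotype!r}")
    return sum(1 for allele in genotype if allele == minor_allele)


def build_feature_row(
    clinical: Dict[str, float],
    genotypes: Dict[str, str],
    minor_alleles: Dict[str, str],
    clinical_order: List[str],
    snp_order: List[str],
) -> List[float]:
    """Concatenate clinical variables and encoded SNPs into one feature vector."""
    row = [float(clinical[name]) for name in clinical_order]
    row += [
        float(encode_genotype(genotypes[snp], minor_alleles[snp]))
        for snp in snp_order
    ]
    return row


# Hypothetical patient: two clinical variables and two made-up SNPs.
clinical = {"ldl_cholesterol": 130.0, "systolic_bp": 128.0}
genotypes = {"snp_1": "AG", "snp_2": "TT"}
minor_alleles = {"snp_1": "G", "snp_2": "T"}

row = build_feature_row(
    clinical,
    genotypes,
    minor_alleles,
    clinical_order=["ldl_cholesterol", "systolic_bp"],
    snp_order=["snp_1", "snp_2"],
)
print(row)  # [130.0, 128.0, 1.0, 2.0]
```

One such row per patient, stacked into a table, is the kind of input that a random forest or neural network can be trained on.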

These two cohorts, from ClinSeq® and the Framingham Heart Study, were composed of middle-aged Caucasian men because this group has a higher risk of advanced CAC than the rest of the population in the United States. The two extremes of the CAC distribution (i.e., no CAC vs. extremely high levels of CAC) were equally represented in both cohorts.

Twenty-one of the 56 SNPs identified from the discovery cohort generated optimal predictive performance in both cohorts with two machine-learning-based modeling approaches, namely random forests and neural networks. When we tested these SNPs with patients who had intermediate CAC levels, the predictive performance dropped significantly. Hence, the high performance was specific to advanced CAC.
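
As a rough illustration of how such a comparison can be quantified, the Python sketch below computes the area under the ROC curve (AUC) for two sets of model scores. The choice of AUC as the performance measure is an assumption made for this example, and the labels and scores are toy numbers invented for illustration, not results from the study.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random case outscores a random control (ties count half)."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy data: 1 = higher-CAC group, 0 = comparison group.
labels = [1, 1, 1, 0, 0, 0]
# Hypothetical scores when the two groups are well separated by the model.
separated_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
# Hypothetical scores when the model barely distinguishes the groups.
overlapping_scores = [0.6, 0.4, 0.5, 0.5, 0.6, 0.3]

print(auc(labels, separated_scores))              # 1.0
print(round(auc(labels, overlapping_scores), 2))  # 0.56
```

An AUC near 1.0 means the model ranks almost every case above every control, while a value near 0.5 is no better than chance, so a drop from the first kind of value toward the second is what a loss of predictive performance looks like in this measure.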

Finally, we used the GeneMANIA database to create a functional interaction network composed of the genes on which the 21 SNPs in the optimal subset were located, as well as additional genes previously reported to interact with these genes. Several genes involved in the production and inhibition of reactive oxygen species (a major driver of CAC and vascular aging) were present in this network.

Disclaimer: The views expressed in this blog post are those of the author and do not necessarily represent the views of the National Heart, Lung, and Blood Institute; National Human Genome Research Institute; the National Institutes of Health; or the U.S. Department of Health and Human Services.

Dr. Cihan Oguz is a research fellow in the Genomics of Metabolic, Cardiovascular and Inflammatory Diseases Branch of the National Human Genome Research Institute at the National Institutes of Health in Bethesda, Maryland. His research focuses on machine learning applications in genomics of advanced cardiovascular disease with an emphasis on developing predictive disease models and networks.