Background: Cardiovascular disease (CVD) is the leading cause of death in the developed world. Human genetic studies, including genome-wide sequencing and SNP-array approaches, promise to reveal disease genes and mechanisms representing new therapeutic targets. In practice, however, identification of the actual genes contributing to disease pathogenesis has lagged behind identification of associated loci, thus limiting the clinical benefits.

Results: To aid in localizing causal genes, we develop a machine learning approach, Objective Prioritization for Enhanced Novelty (OPEN), which quantitatively prioritizes gene-disease associations based on a diverse group of genomic features. This approach uses only unbiased predictive features and thus is not hampered by a preference towards previously well-characterized genes. We demonstrate success in identifying genetic determinants for CVD-related traits, including cholesterol levels, blood pressure, and conduction system and cardiomyopathy phenotypes. Using OPEN, we prioritize genes, including FLNC, for association with increased left ventricular diameter, which is a defining feature of a prevalent cardiovascular disorder, dilated cardiomyopathy or DCM. Using a zebrafish model, we experimentally validate FLNC and identify a novel FLNC splice-site mutation in a patient with severe DCM.

Fig1: A decision tree-based approach for causal gene prediction. (A) Mapping of SNPs to neighboring genes using a combination of linkage disequilibrium (LD) information and the location of recombination hotspots. (B) Workflow applying OPEN for causal gene prediction at GWA loci. GWA loci are represented by horizontal bars, with individual genes represented by vertical bars. The bar height represents the probability that a gene is causal for the phenotype of interest. Initially, all probabilities are equal. Probabilities are then preliminarily updated based on physical distance from the index variant and, optionally, on any prior experimental evidence linking a gene to the phenotype of interest. These probabilities are used in the sampling of positive training examples at each locus during the construction of decision trees. After a 'burn-in' phase, only genes meeting a probability threshold are used as positive training examples. Through cross-validation, the output of the analysis is the log-odds of disease association for all genes in the genome. GBM, gradient boosting machine. (C) Representation of a sample decision tree used for partitioning positive and negative training examples. A classifier consists of multiple decision trees combined in an additive manner.
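The probability bookkeeping described for panel B can be illustrated with a minimal Python sketch. It is not the authors' implementation. The exponential distance decay, the `scale_bp` and `prior_boost` parameters, the per-locus renormalization, and all function and gene names are assumptions made for illustration; the caption states only that probabilities start equal, are updated by distance and optional prior evidence, drive the sampling of positives, and are thresholded after burn-in.

```python
import math
import random


def initial_probabilities(genes):
    """Start with the same probability for every gene at a locus."""
    p = 1.0 / len(genes)
    return {g: p for g in genes}


def update_probabilities(probs, distance_bp, has_prior,
                         scale_bp=100_000.0, prior_boost=2.0):
    """Reweight by distance to the index variant and by prior evidence.

    The exponential decay and the boost factor are assumed forms,
    not taken from the source. Probabilities are renormalized per locus.
    """
    weighted = {}
    for gene, p in probs.items():
        w = math.exp(-distance_bp[gene] / scale_bp)
        if gene in has_prior:
            w *= prior_boost
        weighted[gene] = p * w
    total = sum(weighted.values())
    return {gene: w / total for gene, w in weighted.items()}


def sample_positive(probs, rng):
    """Draw one positive training example for a locus, weighted by probability."""
    genes = list(probs)
    return rng.choices(genes, weights=[probs[g] for g in genes], k=1)[0]


def positives_after_burn_in(probs, threshold):
    """After burn-in, keep only genes that meet the probability threshold."""
    return sorted(g for g, p in probs.items() if p >= threshold)


# Toy locus with three hypothetical genes.
probs = initial_probabilities(["G1", "G2", "G3"])
probs = update_probabilities(
    probs,
    distance_bp={"G1": 0, "G2": 100_000, "G3": 300_000},
    has_prior={"G3"},
)
drawn = sample_positive(probs, random.Random(0))
kept = positives_after_burn_in(probs, threshold=0.2)
print(kept)  # ['G1', 'G2']
```

In this toy locus the nearest gene dominates, and the distant gene falls below the threshold even with prior evidence; different assumed decay or boost values would shift that balance.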

Mentions:
To perform OPEN, we start with a list of genomic loci associated with a disease of interest, each represented by a single tag SNP (Figure 1), or, in the case of Mendelian disease, a list of genes previously implicated by linkage or whole-exome analysis. For GWA, we initially map each tag SNP to neighboring genes in two steps. First, we identify an associated ‘block’ of SNPs in linkage disequilibrium with the tag SNP (using a threshold r² value of 0.5). Second, we identify all genes overlapping this block (Figure 1A; Materials and methods). To account for the fact that enhancers may act at a great distance, we define genes by an inclusive interval extending 250 kbp to either side of a transcription start site. SNPs that reside within the gene body, as defined by transcription start and stop sites, are also assigned to the corresponding gene. Additional SNPs are included on the basis of linkage disequilibrium and nucleotide distance. For each phenotype we identify loci containing tag SNPs, and the gene(s) overlapping these loci represent our initial positive training examples. At each locus, we then apply a weight to each gene that reflects proximity to the tag SNP as well as prior experimental evidence that implicates it in a relevant biological process. Such evidence might include association with a Mendelian form of the disease, an ortholog in a mouse model that exhibits a cognate phenotype, or annotation with a GO term that is held by other genes associated with the disease of interest. We then apply two rounds of machine learning. The purpose of the first round is to limit and refine the list of training examples from among genes at disease-associated loci, while the second round aims to score candidate genes according to likelihood of disease association. The output of the first round is a reduced subset of positive training examples that stand out relative to their peers at each locus (see Materials and methods).
Genes are not selected from loci that contain many genes unless they are high-scoring outliers, while genes at sparse loci (loci containing a small number of genes within the linkage disequilibrium window) have a high likelihood of being included. The second round uses this enriched set of positive examples for training to derive a predictive model of disease association. For predictions related to Mendelian disease association, where there is no ambiguity in SNP-to-gene mapping, the first round is skipped, and all training examples are retained for the second round.
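The two-step SNP-to-gene mapping described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the published pipeline: the `Gene` class, the function names, and the toy coordinates are hypothetical, the LD block is taken as the span of SNPs meeting the r² threshold, and the recombination-hotspot information mentioned in the Figure 1 caption is omitted. Only the r² threshold of 0.5 and the 250 kbp window come from the text.

```python
from dataclasses import dataclass

FLANK_BP = 250_000   # window to either side of the TSS (from the text)
R2_THRESHOLD = 0.5   # LD threshold for block membership (from the text)


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    tss: int  # transcription start site
    tes: int  # transcription stop site (below tss on the minus strand)

    def interval(self):
        """Union of the TSS +/- 250 kbp window and the gene body."""
        lo = min(self.tss - FLANK_BP, self.tes)
        hi = max(self.tss + FLANK_BP, self.tes)
        return max(lo, 0), hi


def ld_block(tag_pos, ld_partners):
    """Step 1: span of SNPs with r^2 >= threshold to the tag SNP.

    ld_partners is an iterable of (position, r2) pairs on the same chromosome.
    """
    positions = [tag_pos] + [pos for pos, r2 in ld_partners if r2 >= R2_THRESHOLD]
    return min(positions), max(positions)


def genes_at_locus(chrom, tag_pos, ld_partners, genes):
    """Step 2: names of genes whose interval overlaps the LD block."""
    start, end = ld_block(tag_pos, ld_partners)
    hits = []
    for gene in genes:
        if gene.chrom != chrom:
            continue
        lo, hi = gene.interval()
        if lo <= end and start <= hi:  # closed-interval overlap test
            hits.append(gene.name)
    return hits


# Toy data: hypothetical genes and SNP positions.
genes = [
    Gene("GENE_A", "chr1", 1_000_000, 1_050_000),
    Gene("GENE_B", "chr1", 1_400_000, 1_380_000),  # minus strand
    Gene("GENE_C", "chr1", 2_000_000, 2_010_000),
    Gene("GENE_D", "chr2", 1_000_000, 1_010_000),
]
partners = [(1_090_000, 0.8), (1_160_000, 0.6), (1_500_000, 0.2)]
print(genes_at_locus("chr1", 1_100_000, partners, genes))
# ['GENE_A', 'GENE_B']
```

The genes returned for each locus would be the initial positive training examples; a locus returning many genes corresponds to the dense case in which only high-scoring outliers survive the first round.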

Bottom Line:
In practice, however, identification of the actual genes contributing to disease pathogenesis has lagged behind identification of associated loci, thus limiting the clinical benefits. Using a zebrafish model, we experimentally validate FLNC and identify a novel FLNC splice-site mutation in a patient with severe DCM. Our approach stands to assist interpretation of large-scale genetic studies without compromising their fundamentally unbiased nature.
