Abstract

Autism spectrum disorder (ASD) is a complex neurodevelopmental disorder with a strong genetic basis. Yet, only a small fraction of potentially causal genes (about 65 out of an estimated several hundred) are known with strong genetic evidence from sequencing studies. We developed a complementary machine-learning approach based on a human brain-specific gene network to present a genome-wide prediction of autism risk genes, including hundreds of candidates for which there is minimal or no prior genetic evidence. Our approach was validated in a large independent case-control sequencing study. Leveraging these genome-wide predictions and the brain-specific network, we demonstrated that the large set of ASD genes converges on a smaller number of key pathways and developmental stages of the brain. Finally, we identified likely pathogenic genes within frequent autism-associated copy-number variants and proposed genes and pathways that are likely mediators of ASD across multiple copy-number variants. All predictions and functional insights are available at http://asd.princeton.edu.

Genome-wide prediction of autism-associated genes. Our ASD-gene predictions are based on a machine learning approach that (1) uses a gold standard of known disease genes, those linked to autism with varying levels of evidence (E1–E4) as positives and other genes linked to non-mental-health diseases as negatives, in the context of (2) a human brain-specific functional interaction network to (3) build an evidence-weighted, network-based classifier capturing autism-specific gene interaction patterns and (4) predict the probability of autism association of each gene across the genome. We demonstrated the accuracy and utility of our genome-wide complement of autism-associated genes by (5) validating these predictions with de novo autism-associated mutations from an independent sequencing study, elucidating the spatiotemporal developmental gene-expression patterns of top-ranked autism-associated genes, laying out the landscape of autism-associated brain-specific functional modules (network clusters) and prioritizing candidate causal genes within large intervals of recurrent autism-associated copy-number variants.
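As a rough illustration of steps (1) to (4), the following sketch trains an evidence-weighted classifier on network-derived features and ranks every gene by its predicted probability. It is a minimal sketch under stated assumptions, not the published implementation: the toy network, the gene names, the numeric weights assigned to evidence levels E1 to E4, and the use of logistic regression as the classifier are all illustrative choices that this section does not specify.

```python
import math

# Hypothetical toy network: symmetric edge weights standing in for
# brain-specific functional interaction confidences (not real data).
EDGES = {
    ("A", "B"): 0.9, ("A", "C"): 0.8, ("B", "C"): 0.7, ("A", "G"): 0.5, ("C", "G"): 0.6,
    ("D", "E"): 0.9, ("D", "F"): 0.8, ("E", "F"): 0.7, ("D", "H"): 0.5, ("F", "H"): 0.6,
}
GENES = sorted({gene for pair in EDGES for gene in pair})

# Assumed per-example weights by evidence level; the values are illustrative.
EVIDENCE_WEIGHT = {"E1": 1.0, "E2": 0.5, "E3": 0.25, "E4": 0.25, "negative": 1.0}


def edge_weight(g1, g2):
    """Look up an undirected edge weight; absent edges count as 0."""
    return EDGES.get((g1, g2), EDGES.get((g2, g1), 0.0))


def features(gene):
    """Feature vector: the gene's edge weight to every gene in the network."""
    return [edge_weight(gene, other) for other in GENES]


def predict(model, gene):
    """Predicted probability of association under the fitted model."""
    weights, bias = model
    logit = bias + sum(w * x for w, x in zip(weights, features(gene)))
    return 1.0 / (1.0 + math.exp(-logit))


def train(labeled, epochs=500, learning_rate=0.5, l2=0.01):
    """Evidence-weighted logistic regression fitted by batch gradient descent.

    labeled: list of (gene, label, evidence_level); label 1 = positive, 0 = negative.
    """
    weights, bias = [0.0] * len(GENES), 0.0
    total = sum(EVIDENCE_WEIGHT[level] for _, _, level in labeled)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(GENES), 0.0
        for gene, label, level in labeled:
            # Each example's error is scaled by the weight of its evidence level.
            error = EVIDENCE_WEIGHT[level] * (predict((weights, bias), gene) - label)
            for j, x in enumerate(features(gene)):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * (g / total + l2 * w) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / total
    return weights, bias


# Toy gold standard: two positives with different evidence, two negatives.
gold_standard = [("A", 1, "E1"), ("B", 1, "E2"), ("D", 0, "negative"), ("E", 0, "negative")]
model = train(gold_standard)
scores = {gene: predict(model, gene) for gene in GENES}
ranking = sorted(GENES, key=scores.get, reverse=True)
print(ranking)
```

In this toy setting, unlabeled genes that interact mainly with the positives (C and G) receive higher probabilities than those that interact mainly with the negatives (F and H), which is the behavior a network-based classifier is meant to capture.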

Evaluation of autism-associated gene predictions. (a) Performance of autism-associated gene prediction with different training gold standards. Each boxplot corresponds to the distribution of the AUC obtained from 50 evaluations (10× five-fold cross-validation; box ends: 25th and 75th percentiles of AUC; center line: 50th percentile; notches: 95% confidence interval around the median; whiskers: 1.5× interquartile range below the 25th and above the 75th percentile; dots: outliers). The evidence-weighted classifier (purple) significantly outperforms all the other classifiers (Wilcoxon rank-sum test, U ≥ 2,399, P ≤ 1 × 10⁻¹⁴). (b) Rank-based enrichment of three gene sets from an independent sequencing study toward the top of our genome-wide ASD gene-ranking (summarized as z-scores; number of genes (n) given below; Online Methods). Enrichment P-values atop each bar were calculated using a permutation test (Online Methods). (c–e) Evaluation of the overlap of mutation and functional gene sets with the first decile (top 10%) of our predictions (one-sided binomial test). Each plot shows the fraction of genes in the gene set (y axis) that occurred within each decile of the genome-wide ranking (x axis; first decile colored red; number of genes in top decile/total and enrichment P-value in parentheses). (c) Decile enrichments of mutation data (used in b) were consistent with the trend observed in rank-based tests (in b). (d) Experimentally determined targets of the major ASD-associated regulatory proteins FMRP, CHD8 and TOP1 were significantly enriched among the top-decile predictions. (e) Members of major ASD-associated pathways and complexes (Wnt signaling, MAPK signaling, and the postsynaptic density complex) showed similar significant enrichments.
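The top-decile test in panels (c) to (e) can be sketched as a one-sided binomial tail probability: under the null hypothesis, each gene in a set lands in the first decile with probability 0.1. The example below is a minimal sketch under that assumption; the function names and the toy ranking of 100 placeholder genes are hypothetical and carry no real data.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the one-sided enrichment P-value."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def top_decile_enrichment(ranking, gene_set):
    """Count gene-set members in the first decile of a ranking and test enrichment.

    ranking: list of genes ordered from most to least likely associated.
    gene_set: genes to test; members absent from the ranking are ignored.
    Returns (hits, members_tested, p_value).
    """
    ranked = set(ranking)
    cutoff = len(ranking) // 10
    top_decile = set(ranking[:cutoff])
    members = [gene for gene in set(gene_set) if gene in ranked]
    hits = sum(1 for gene in members if gene in top_decile)
    p_value = binomial_upper_tail(hits, len(members), cutoff / len(ranking))
    return hits, len(members), p_value


# Toy example: 100 placeholder genes, 10 of them in the tested set,
# 4 of which fall within the top 10 ranks.
toy_ranking = [f"g{i}" for i in range(100)]
toy_gene_set = ["g0", "g1", "g2", "g3", "g50", "g60", "g70", "g80", "g90", "g95"]
hits, total, p_value = top_decile_enrichment(toy_ranking, toy_gene_set)
print(f"{hits}/{total} in top decile, P = {p_value:.4f}")
```

With 4 of 10 set members in the first decile, against an expectation of 1, the tail probability is about 0.013.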

ASD-associated genetic changes in the spatiotemporal development of the brain. The heat map shows the enrichment of spatiotemporal gene expression signatures toward the top of the genome-wide ranking of ASD genes. The 16 brain regions and 13 developmental stages considered label the rows and columns, respectively. The regions are further marked on illustrations of the human brain at the top right. Each cell (row, column) in the heat map corresponds to a spatiotemporal signature: a set of genes highly expressed specifically in that region (row) at that developmental stage (column). The intensity of the color in each cell (scale below) represents the log-transformed significance of ASD-association of that signature. A rank-based scoring followed by a permutation test was used to calculate P-values, which were then converted to Q-values to account for multiple hypothesis testing (Online Methods). The heat map shows a striking prenatal signal suggesting a major effect of ASD-associated mutations on fetal brain development.
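The scoring pipeline behind each heat-map cell (rank-based score, permutation P-value, then Q-value) can be sketched as follows. This is an illustrative sketch only: the mean-rank statistic, the number of permutations, and the Benjamini-Hochberg procedure for converting P-values to Q-values are assumptions made here, since the exact choices are given in the Online Methods rather than in this caption. The two signatures are placeholders.

```python
import random


def mean_rank(ranks, gene_set):
    """Rank-based score: a lower mean rank means the set sits nearer the top."""
    return sum(ranks[gene] for gene in gene_set) / len(gene_set)


def permutation_p_value(ranks, gene_set, n_perm=2000, seed=0):
    """Fraction of random same-size gene sets that score at least as well."""
    rng = random.Random(seed)
    universe = list(ranks)
    observed = mean_rank(ranks, gene_set)
    as_extreme = sum(
        mean_rank(ranks, rng.sample(universe, len(gene_set))) <= observed
        for _ in range(n_perm)
    )
    # Add-one correction keeps the estimate strictly above zero.
    return (as_extreme + 1) / (n_perm + 1)


def benjamini_hochberg(p_values):
    """Convert P-values to Q-values (Benjamini-Hochberg step-up adjustment)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotone Q-values.
    for position in range(m - 1, -1, -1):
        index = order[position]
        running_min = min(running_min, p_values[index] * m / (position + 1))
        q_values[index] = running_min
    return q_values


# Toy genome-wide ranking of 200 placeholder genes (rank 1 = top prediction).
toy_ranks = {f"g{i}": i for i in range(1, 201)}
signatures = {
    "signature_top": [f"g{i}" for i in range(1, 11)],         # concentrated at the top
    "signature_spread": [f"g{i}" for i in range(10, 200, 20)],  # spread evenly
}
p_values = [permutation_p_value(toy_ranks, genes) for genes in signatures.values()]
q_values = benjamini_hochberg(p_values)
for name, q in zip(signatures, q_values):
    print(name, round(q, 4))
```

In the full analysis there would be one such test per region-stage cell (16 × 13 signatures), which is why the multiple-testing correction matters.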

Autism-associated brain-specific functional modules. The network of brain-specific functional interactions among the top 2,500 ASD-associated genes was clustered using a shared-nearest-neighbor–based community-finding algorithm (Online Methods) to elucidate several modules of genes (left). Nine of the clusters that contained 10 or more genes, labeled C1 through C9, were tested for functional enrichment using genes annotated to Gene Ontology biological process terms. Representative processes and pathways enriched within each cluster are presented here alongside the cluster label. Since C6 and C7 shared a number of strong links across the clusters, they were merged before calculating functional enrichment. The enriched functions provide a landscape of cellular functions potentially dysregulated by ASD-associated mutations.
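The idea behind shared-nearest-neighbor (SNN) clustering can be sketched in a few lines: two genes are linked in the SNN graph when their sets of k strongest network neighbors overlap sufficiently, and modules are then read off that graph. In the sketch below, connected components are swapped in for the community-finding step purely to keep the example short; the values of `k` and `min_shared`, the toy edge weights, and the gene names are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_adjacency(edges):
    """edges: iterable of (gene, gene, weight) for an undirected weighted network."""
    adjacency = defaultdict(dict)
    for g, h, weight in edges:
        adjacency[g][h] = weight
        adjacency[h][g] = weight
    return adjacency


def k_nearest(adjacency, k):
    """Each gene's k strongest neighbors, plus the gene itself."""
    return {
        gene: set(sorted(nbrs, key=nbrs.get, reverse=True)[:k]) | {gene}
        for gene, nbrs in adjacency.items()
    }


def snn_edges(adjacency, k, min_shared):
    """Link two genes when their k-nearest-neighbor sets share enough members."""
    knn = k_nearest(adjacency, k)
    genes = sorted(adjacency)
    return [
        (g, h)
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
        if len(knn[g] & knn[h]) >= min_shared
    ]


def connected_components(nodes, edges):
    """Simplified stand-in for community finding on the SNN graph."""
    neighbors = defaultdict(set)
    for g, h in edges:
        neighbors[g].add(h)
        neighbors[h].add(g)
    seen, clusters = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], []
        while stack:
            node = stack.pop()
            members.append(node)
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        clusters.append(sorted(members))
    return clusters


# Toy network: two tightly connected groups joined by one weak edge.
toy_edges = [(g, h, 0.9) for g, h in combinations("abcd", 2)]
toy_edges += [(g, h, 0.9) for g, h in combinations("efgh", 2)]
toy_edges.append(("d", "e", 0.2))

adjacency = build_adjacency(toy_edges)
clusters = connected_components(adjacency, snn_edges(adjacency, k=3, min_shared=3))
print(clusters)
```

Because the weak edge between `d` and `e` never enters either gene's top-k neighbor set, the two groups share no neighbors and separate cleanly, which is the property that makes SNN similarity more robust than raw edge weights for module detection.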

Prioritization of genes within eight recurrent ASD-associated CNVs. Each plot corresponds to one of the eight CNVs, ordered by their observed frequency in persons with ASD (given above each plot). The plots show the genes in each CNV interval in their genomic order (y axis) and each gene’s ASD-association rank in our genome-wide predictions (x axis) from top (low) ranks on the left to bottom (high) ranks on the right. The points corresponding to the genes are colored based on whether there exist previously known (direct) genetic (red) or (indirect) functional (blue) links between the genes and autism, independently curated by an ASD expert. The detailed rankings and evidence for CNV genes are in . The dashed lines are visual aids for reading the gene names (in bold) of the colored points. Across CNVs, genes with independent genetic evidence (red) and those with functional evidence (blue) are more likely to be ranked near the top of our genome-wide predictions (toward the left of each plot) than other genes for which there is no such evidence (gray).

Convergence of cellular functions disrupted by multiple CNVs identified through key intermediate genes in the brain network. The network diagram illustrates intermediate genes linking the eight CNVs (gray rectangles) to the molecular phenotype of ASD. The dotted lines represent high-confidence functional links in the brain that mediate the linkage of top CNV genes to ASD genes; these linkages go through key intermediate genes (black circles; Online Methods). Enrichment analysis groups the intermediate genes into a small number of autism-related processes (illustrated as colored clouds). For visual clarity, only representative examples of processes associated with at least two CNVs are included. The functions of these intermediate genes illustrate the hypothesis that multiple CNVs might disrupt a core group of ASD-related biological processes and pathways.
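One simple way to picture the notion of an intermediate gene is as a gene that has a high-confidence network link to a top CNV gene on one side and to a known ASD gene on the other. The sketch below implements only that simplified reading; the actual selection procedure is described in the Online Methods and may differ. The confidence threshold, the function names, and every gene and CNV label are hypothetical.

```python
from collections import defaultdict


def strong_neighbors(edges, min_weight):
    """Neighbor sets restricted to high-confidence links (weight >= min_weight)."""
    neighbors = defaultdict(set)
    for g, h, weight in edges:
        if weight >= min_weight:
            neighbors[g].add(h)
            neighbors[h].add(g)
    return neighbors


def intermediate_genes(edges, cnv_genes, asd_genes, min_weight=0.5):
    """Map each intermediate gene to the set of CNVs it connects to ASD genes.

    cnv_genes: dict of CNV label -> list of top-ranked genes in that CNV.
    asd_genes: collection of known ASD genes.
    """
    neighbors = strong_neighbors(edges, min_weight)
    asd = set(asd_genes)
    excluded = asd | {gene for genes in cnv_genes.values() for gene in genes}
    linked = defaultdict(set)
    for cnv, genes in cnv_genes.items():
        for gene in genes:
            for candidate in neighbors[gene] - excluded:
                # Keep the candidate only if it also touches a known ASD gene.
                if neighbors[candidate] & asd:
                    linked[candidate].add(cnv)
    return dict(linked)


# Toy network with placeholder names (no real genes or CNVs).
toy_edges = [
    ("cnvA_g1", "m1", 0.9), ("cnvB_g1", "m1", 0.8), ("m1", "asd1", 0.9),
    ("cnvA_g2", "m2", 0.9), ("m2", "asd2", 0.7),
    ("cnvB_g2", "m3", 0.3), ("m3", "asd1", 0.9),  # weak CNV link: m3 is dropped
]
toy_cnvs = {"CNV_A": ["cnvA_g1", "cnvA_g2"], "CNV_B": ["cnvB_g1", "cnvB_g2"]}
toy_asd = ["asd1", "asd2"]

linked = intermediate_genes(toy_edges, toy_cnvs, toy_asd)
shared = sorted(gene for gene, cnvs in linked.items() if len(cnvs) >= 2)
print(linked, shared)
```

Filtering to intermediates reached from at least two CNVs, as in the last step, mirrors the caption's focus on processes associated with multiple CNVs; a functional enrichment analysis would then be run on those genes.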