Using prior knowledge and genome-wide association to identify pathways involved in multiple sclerosis

The efforts of the Human Genome Project are beginning to provide important findings for human health. Technological advances in the laboratory, particularly in characterizing human genomic variation, have created a new approach for studying the human genome: genome-wide association studies (GWAS). However, current statistical and computational strategies are taking only partial advantage of this wealth of information. In the quest for susceptibility genes for complex diseases in GWAS data, several different analytic strategies are being pursued. In a recent report, Baranzini and colleagues used a pathway- and network-based analysis to explore potentially interesting single-locus association signals in a GWAS of multiple sclerosis. This and other pathway-based approaches are likely to continue to emerge in the GWAS literature, as they provide a powerful strategy to detect important modest single-locus effects and gene-gene interaction effects.

In the search for susceptibility genes for common complex diseases, we are faced with enormous challenges. The past decade's paradigm of focusing a study on just one or a few candidate genes limits our ability to identify novel genetic effects associated with disease. In addition, many susceptibility genes can show effects that are partially or solely dependent on interactions with other genes and/or the environment. Genome-wide association studies (GWAS) have been proposed as a solution to these problems; however, the analysis of whole-genome data is problematic because we must separate the one or few true but modest signals from the extensive background noise. Moreover, with GWAS data alone, the ability to elucidate gene-environment interactions is limited. GWAS researchers must embrace the abundant clinical and environmental data available to complement the rich genotypic data, with the ultimate goal of revealing the genetic and environmental factors that are important for disease risk. So far, GWAS have taken a simplistic, 'one SNP at a time' analysis approach, which ignores the complexity of common complex diseases.

Recent technological advances enable the genotyping of hundreds of thousands of human single-nucleotide polymorphisms (SNPs) on thousands of samples. We are hindered in exploiting these laboratory advances because strategies for analyzing the data have not kept pace with technological progress. Even with these challenges, successful reports of GWAS have emerged in the literature [1–6]. In fact, the National Human Genome Research Institute (NHGRI) maintains an up-to-date GWAS catalog on its website [7], which lists more than 273 published GWAS so far. Unfortunately, as expected, only the strongest associations can be detected using these traditional approaches, and there are many more genes still to be found [8, 9].

The majority of these studies analyzed one SNP at a time, meaning that they have barely scratched the surface of interesting information within these datasets. Ultimately, supplementary data, replication datasets, or multiple analytical approaches must be used to filter the results down to a manageable number of the 'most likely' genes. In their recent report in Human Molecular Genetics, Baranzini et al. [10] developed and applied a pathway- and network-based analysis to exploit interesting association signals in the SNPs that fell between the thresholds of P = 0.05 and P = 10⁻⁸ in the original single-SNP association analysis. Methods such as this are likely to allow us to better characterize and exploit GWAS signals in the so-called 'gray region' between genome-wide significance (P = 10⁻⁸) and the typical P = 0.05. In the field, genome-wide significance is commonly set at about P = 10⁻⁸ because that is on the order of the Bonferroni correction of P = 0.05 for 1 million tests (0.05/10⁶ = 5 × 10⁻⁸).
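The threshold arithmetic can be checked directly. The following is a minimal sketch, assuming a nominal level of 0.05 and 1 million tests as stated above; the function names `bonferroni_threshold` and `classify` and the three example P-values are hypothetical and chosen only for illustration.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def classify(p_value, alpha=0.05, n_tests=1_000_000):
    """Place a single-SNP P-value relative to the two thresholds."""
    if p_value < bonferroni_threshold(alpha, n_tests):
        return "genome-wide significant"
    if p_value < alpha:
        # Nominally significant, but short of the corrected threshold.
        return "gray region"
    return "not nominally significant"


# Illustrative values only (not taken from any dataset).
threshold = bonferroni_threshold(0.05, 1_000_000)
labels = [classify(p) for p in (2e-9, 3e-5, 0.2)]
```

The exact Bonferroni value for these inputs is 5 × 10⁻⁸, the same order of magnitude as the 10⁻⁸ figure quoted in the text; SNPs labeled 'gray region' here are the ones a pathway- or network-based analysis would aim to recover.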

Because some of the replicating, positive results in GWAS fell below the level of genome-wide significance (that is, had P-values above 10⁻⁸), Baranzini and colleagues [10] propose a protein interaction and network-based analysis (PINBPA) for the study of a multiple sclerosis (MS) dataset. This approach is similar to those in microarray studies in which gene ontologies are used for analysis [11]. The idea of using prior knowledge for GWAS has been used successfully in studies of diseases such as Parkinson's disease, age-related macular degeneration, bipolar disorder, rheumatoid arthritis, and Crohn's disease [12–14].

The first step of PINBPA is to compute a gene-wise P-value by choosing the lowest P-value of all SNPs mapping to a given gene. These genes are then mapped onto a curated protein interaction network. Any markers that do not map to genes, or that map to unannotated genes, are eliminated from this analysis. Next, using a plug-in for the Cytoscape [15] software, searches are conducted to extract potentially meaningful sub-networks associated with the phenotype of interest. Finally, a test is performed to determine the extent to which significant network modules could be obtained by chance. Baranzini and colleagues [10] applied PINBPA to two MS datasets: (i) the International Multiple Sclerosis Genetics Consortium (IMSGC) GWAS [16], consisting of 334,923 SNPs passing quality control from the Affymetrix Human Mapping 500K Array in 931 family trios, and (ii) the GeneMSA study [17], with 551,642 SNPs passing quality control from the Illumina HumanHap550 bead chip in 978 cases and 883 controls. After single-locus analysis using logistic regression, 78 and 87 SNPs had a P-value of less than 1 × 10⁻⁴ in the IMSGC and GeneMSA datasets, respectively.
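The first and last steps described above can be sketched in a few lines. This is a minimal illustration, not the published implementation: the function names, the toy SNP and gene identifiers, the module score (mean of -log10 gene-wise P) and the null model (random gene sets of the same size) are all assumptions made here for clarity, and the network search itself, which the authors ran through a Cytoscape plug-in, is not reproduced.

```python
import math
import random


def gene_wise_p(snp_pvalues, snp_to_gene):
    """Return {gene: lowest P-value among the SNPs mapped to that gene}."""
    gene_p = {}
    for snp, p in snp_pvalues.items():
        gene = snp_to_gene.get(snp)
        if gene is None:
            # Marker does not map to an annotated gene: drop it.
            continue
        gene_p[gene] = min(p, gene_p.get(gene, 1.0))
    return gene_p


def module_score(genes, gene_p):
    """Assumed aggregate score: mean of -log10(gene-wise P)."""
    return sum(-math.log10(gene_p[g]) for g in genes) / len(genes)


def empirical_p(module, gene_p, n_perm=1000, seed=0):
    """Share of random same-size gene sets scoring at least as high."""
    rng = random.Random(seed)
    observed = module_score(module, gene_p)
    pool = sorted(gene_p)
    hits = sum(
        module_score(rng.sample(pool, len(module)), gene_p) >= observed
        for _ in range(n_perm)
    )
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Toy, invented data: rs7 maps to no gene and is discarded.
snp_pvalues = {"rs1": 1e-5, "rs2": 0.2, "rs3": 3e-4, "rs4": 0.6,
               "rs5": 0.04, "rs6": 0.5, "rs7": 0.9}
snp_to_gene = {"rs1": "GENE_A", "rs2": "GENE_A", "rs3": "GENE_B",
               "rs4": "GENE_C", "rs5": "GENE_D", "rs6": "GENE_E"}

gene_p = gene_wise_p(snp_pvalues, snp_to_gene)
module_p = empirical_p(["GENE_A", "GENE_B"], gene_p)
```

Taking the minimum SNP P-value favors genes covered by many SNPs, so a real analysis would likely need to account for gene size; the sketch ignores this for brevity.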

Using PINBPA analysis, 346 significant modules were identified on the basis of their aggregate degree of association with MS. Because of the nature of the algorithm, many modules overlap extensively; thus, the modules with the highest scores were selected. Module I included several human leukocyte antigen (HLA) genes, including HLA-DRB1, a known risk factor for MS. Interestingly, this module shows HLA-DRA as the most significant node. HLA-DRB1 and HLA-DRA are in high linkage disequilibrium and some SNPs in HLA-DRA serve as proxies for HLA-DRB1 with high sensitivity. Module II includes an extensive pattern of immunity-related genes, including several HLA genes as well as CD4, CD82, ITGB2, IL2RA, and CD58. Finally, modules III and IV suggest a neural component, including genes expressed in neurons and glia, such as NCK2, EPHA3, and EPHA4 (module III) and glutamate receptor genes (module IV), among many others. The results of this study [10] provide insights into the role of several immunological pathways, including cell adhesion, signaling, and communication, and, more importantly, neural pathways in MS. In particular, signals for axon guidance and synaptic potentiation were over-represented in MS. This is very exciting, as it is one of the first reports demonstrating genetic associations in a neural pathway contributing to susceptibility to MS. Because the pathophysiology of MS suggests that neural pathways are likely to have a role, these results provide enormous potential for follow-up research.

Baranzini et al. [10] demonstrate the utility of protein interaction network information in the analysis of MS data. Several GWAS [12–14, 18, 19] have proposed the use of prior knowledge in the form of pathway databases, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and Biocarta, or gene ontology databases. Baranzini et al. [10] suggest that the network-based approach not only reduces the number of relevant interactions found but also increases the likelihood that proteins that interact are part of the same biological pathway. This approach was certainly successful for MS.

Consistent with this line of thought, Bush et al. [20] have constructed the Biofilter as an alternative approach for detecting interactions in GWAS data. The Biofilter combines six sources of disease-independent information (information that is not related to the phenotype of interest) from the public domain: KEGG, Reactome, Gene Ontology, Database of Interacting Proteins (DIP), Protein Families Database (PFAM), and Netpath. It also includes disease-dependent information in the form of previous linkage regions, association studies, and microarray expression results. All these sources are combined specifically to prioritize the search for gene-gene interactions in GWAS data [20].
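One simple way to picture how several knowledge sources might be combined to prioritize gene pairs is to count how many sources place two genes in the same group. The sketch below is an assumption made for illustration and is not the Biofilter's implementation; the function name `rank_gene_pairs`, the gene identifiers, and the group contents are invented.

```python
from collections import defaultdict
from itertools import combinations


def rank_gene_pairs(sources):
    """Rank gene pairs by the number of sources that group them together.

    `sources` maps a source name to a list of gene groups (for example,
    pathways or ontology terms), each group being a list of gene names.
    """
    support = defaultdict(set)
    for source_name, groups in sources.items():
        for group in groups:
            # Every unordered pair inside a group is supported by this source.
            for pair in combinations(sorted(set(group)), 2):
                support[pair].add(source_name)
    # Most-supported pairs first; ties broken alphabetically.
    return sorted(support.items(), key=lambda kv: (-len(kv[1]), kv[0]))


# Invented groupings, used only to exercise the function.
sources = {
    "KEGG": [["G1", "G2", "G3"]],
    "Reactome": [["G1", "G2"]],
    "Gene Ontology": [["G2", "G3"], ["G1", "G2"]],
}
ranked = rank_gene_pairs(sources)
ranked_pairs = [pair for pair, _ in ranked]
```

Under this scheme, only the top-ranked pairs would be tested for interaction, which shrinks the search space relative to testing every pair of SNPs across the genome.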

Pathway-based approaches are continuing to emerge in the literature as a more comprehensive approach to the analysis of GWAS data. This trend is likely to continue as we learn more about the optimal strategies for incorporating prior knowledge into analyses. In fact, as the field moves to next-generation sequencing data, such approaches may expand to that setting, for example by looking for rare variants in a particular pathway that are present in a higher proportion of disease cases than healthy controls. As more biological knowledge and genomic data become publicly available and more easily accessible, we will continue to see methodological developments exploit this information to better dissect the genetic architecture of common, complex disease.
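The rare-variant idea mentioned above can be sketched as a carrier comparison: count the cases and controls that carry at least one rare variant in any gene of a pathway, then ask whether carriers are over-represented among cases. The code below is a minimal sketch under assumptions chosen here, namely a one-sided Fisher exact test and invented sample and gene identifiers; it is not a method proposed in the article.

```python
from math import comb


def carrier_count(carriers_by_gene, pathway_genes, samples):
    """Number of samples carrying a rare variant in any pathway gene."""
    carriers = set()
    for gene in pathway_genes:
        carriers |= carriers_by_gene.get(gene, set())
    return len(carriers & set(samples))


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact P-value for the 2x2 table
    [[a, b], [c, d]] = [[case carriers, case non-carriers],
                        [control carriers, control non-carriers]].
    Returns P(case carriers >= a) under the hypergeometric null.
    """
    n_cases, n_carriers, n_total = a + b, a + c, a + b + c + d
    upper = min(n_cases, n_carriers)
    tail = sum(
        comb(n_carriers, k) * comb(n_total - n_carriers, n_cases - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(n_total, n_cases)


# Invented data: five cases (c1..c5) and five controls (k1..k5).
cases = ["c1", "c2", "c3", "c4", "c5"]
controls = ["k1", "k2", "k3", "k4", "k5"]
carriers_by_gene = {
    "GENE_X": {"c1", "c2"},
    "GENE_Y": {"c2", "c3", "k1"},
    "GENE_Z": {"k2", "k3"},  # not in the pathway below
}
pathway = ["GENE_X", "GENE_Y"]

case_carriers = carrier_count(carriers_by_gene, pathway, cases)
control_carriers = carrier_count(carriers_by_gene, pathway, controls)
p_value = fisher_one_sided(
    case_carriers, len(cases) - case_carriers,
    control_carriers, len(controls) - control_carriers,
)
```

Collapsing variants to a single carrier indicator per sample treats all rare variants in the pathway as equivalent, which is a strong simplification that a real analysis would need to justify.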