From the Computational Genetics Laboratory at the University of Pennsylvania (www.epistasis.org)

Sunday, November 26, 2006

MDR 101 - Part 2 - Filtering

MDR has traditionally carried out an exhaustive search of all possible 2-way, 3-way, up to n-way combinations of attributes (e.g. SNPs). This works well for candidate gene studies with 100 or fewer SNPs because the combinations can be completely enumerated in a reasonable amount of time on a desktop computer. As a side note, it is important to remember that the MDR software is designed to take advantage of PCs with multiple processors. Thus, if you have a PC with two processors that have threading turned on MDR will run approximately 4x faster that a single CPU with no threading. Make sure the next PC you buy has one or more multi-core chips in it!

When you have more than 100 SNPs an exhaustive search may not be practical unless you are willing to wait days or even weeks for a run to finish. When the number of SNPs exceeds 10,000 an exhaustive search of all 3-way and higher combinations may be infeasible, even with a parallel computer. As GWAS and whole-genome sequence data become ubiquitous there will be an increasing interest in using MDR to carry out a genome-wide analysis of epistasis. As a side note, we have released a C library (libMDR) to help you write fast MDR software for high-performance computing. We have also released a version of MDR that will run on GPUs (MDR-GPU). This GPU method was described in a 2009 paper in BMC Research Notes and a 2010 Bioinformatics paper. We have also reviewed the general use of GPU computing for bioinformatics and computational biology here.

We are actively developing two different strategies for genome-wide analysis using MDR. The first general strategy is to select a subset of 100 or 1000 SNPs, for example, from the list of 1,000,000 or more variants. This is the filter approach. The second approach is to use a stochastic search algorithm such as simulated annealing or genetic programming to search for an optimal combination. This is the wrapper approach. In this essay we will cover the filter approach since it is more of a pre-processing step. Previous blog posts (e.g. March 14, June 12, Sept. 12) have talked about our wrapper approaches. We will discuss this in more detail in a later MDR 101 post.

There are many different filters that can be applied to SNP data to reduce the number of attributes that need to be systematically evaluated using MDR. The first approach is to use statistical and computational algorithms to select SNPs that are most likely to interact. We have been evaluating and developing modifications to the Relief family of algorithms for this purpose. Relief was originally developed by Kira and Rendell (1992) and later extended in the ReliefF algorithm by Kononenko (1994). A good review of these algorithms can be found in Robnik-Sikonja and Kononenko (2003). The nice thing about ReliefF is that it can assign a single quality value to each SNPs in a way that captures dependencies or interactions between SNPs. We have developed several extensions of ReliefF that improve the power to filter interacting SNPs. Our first approach is called Tuned ReliefF or TuRF. A paper describing this approach was published in Lecture Notes in Computer Science. A second approach is called Evaporative ReliefF. This approach was developed in collaboration Dr. Brett McKinney and was published in Bioinformatics. A newer and much more powerful approach is called Spatially Uniform ReliefF or SURF. SURF is described in a 2009 BioData Mining paper. A newer SURF algorithm, SURF*, was published in 2010 in Lecture Notes in Computer Science. The key is to find a fast algorithm that can assign high quality values to SNPs involved in non-additive interactions without explicitly modeling the interactions which leads to the combinatorial problem that MDR faces. You don't want to use univariate approaches such as chi-square unless you explicitly want to condition on main effects. Interactions in the absence of statistically significant main effects are likely to be missed. For a review of filters and wrapper for epistasis analysis see our 2010 Bioinformatics paper.

The MDR software currently allows you to select ReliefF, TuRF, SURF, SURF*, chi-square, and odds ratio as filters. For ReliefF, it is probably best to set the number of nearest neighbors to 25% of your total sample size. This a lot more power than the default that most people use of 10. SURF* will have the most power unless your case-control ratio is imbalanced. Feel free to play with the settings if you want. Note that the odds ratio filter reports the 'best' odds ratio using several different reference groups. It should thus be seen as a 'maximized' odds ratio. For the chi-square test, the results can be viewed as either the chi-square values or the p-values. For all the filters, you can select the size of the subset you want by specifying the number you want, the percentage you want, or the subset that meets some threshold criteria (e.g. p-value). As an exercise, compare the list of best SNPs returned by ReliefF and chi-square. Those SNPs that have high ReliefF scores but non-significant chi-square scores are candidates for nonlinear interactions. In the papers on atrial fibrillation by Tsai et al. (2004) and Moore et al. (2006), the ACE I/D and T174M polymorphisms detected by MDR have the highest ReliefF scores of any polymorphism in the dataset but have non-significant chi-square values and thus would have been missed by conditioning on main effects. The M235T polymorphism has a main effect and is detected by both ReliefF and chi-square.

In addition to the algorithmic approach, we think it will be very useful to reduce the number of SNPs examined by using domain-specific knowledge about biochemical pathways, Gene Ontology, chromosomal location, published relationships, etc.. Why not use the years of accumulated knowledge about the disease you are studying and likely genes that might be involved? We think an optimal filter will combine algorithms such as ReliefF with knowledge about biological systems. As a side note, we have previously used pathway information to help organize our MDR analyses. See Williams et al. (2004) for an example. We review the use of biological knowledge in GWAS in our 2010 Bioinformatics paper. Marylyn Ritchie's group at Penn State has created a bioinformatics tool called BioFilter that can select SNPs prior to MDR analysis. We have also recently published a paper in BioData Mining carrying out gene set enrichment analysis of MDR results in GWAS data for ALS. This study found replicable pathway effects that were missed by previous GWAS.

In summary, statistical and biological filters can help reduce the number of SNPs that need to be exhaustively analyzed by MDR. The advantage of the filter approach is that it is computationally tractable. The disadvantage is that your list of filtered SNPs is only as good as the filter that is being used. Surely some good SNPs will be excluded from the analysis.

1 Comments:

I could not understand what exactly is "seed" in the software. Why does it show different dendograms on changing the seed? It showed a redundant interaction between two SNPs in our dataset, when the seed was fed as 0, but with a significant p value <0.01. But if the seed value is equal to or more than 1, we see a positive/synergistic interaction between the same SNPs. Please help

About Me

Edward Rose Professor of Informatics,
Director of the Institute for Biomedical Informatics, Director of the Division of Informatics in the Department of Biostatistics and Epidemiology,
Senior Associate Dean for Informatics,
The Perelman School of Medicine,
University of Pennsylvania