Bottom Line:
A proper filtering procedure is critical to reduce the search space prior to the computationally intensive gene-gene interaction identification step. We propose a simple and effective ensemble approach in which the results from multiple runs of an unstable filter are aggregated based on the general theory of ensemble learning. The ensemble of TuRF achieved the highest success rate in comparison to many state-of-the-art algorithms as well as traditional χ²-test and odds ratio methods in terms of retaining gene-gene interactions.

Affiliation: School of Information Technologies, University of Sydney, NSW 2006, Australia. yangpy@it.usyd.edu.au

ABSTRACT

Background: Complex diseases are commonly caused by multiple genes and their interactions with each other. Genome-wide association (GWA) studies provide the opportunity to capture those disease-associated genes and gene-gene interactions through panels of SNP markers. However, a proper filtering procedure is critical to reduce the search space prior to the computationally intensive gene-gene interaction identification step. In this study, we show that two commonly used SNP-SNP interaction filtering algorithms, ReliefF and tuned ReliefF (TuRF), are sensitive to the order of the samples in the dataset, giving rise to unstable and suboptimal results. However, we observe that the 'unstable' results from multiple runs of these algorithms can provide valuable information about the dataset. We therefore hypothesize that aggregating results from multiple runs of the algorithm may improve the filtering performance.

Results: We propose a simple and effective ensemble approach in which the results from multiple runs of an unstable filter are aggregated based on the general theory of ensemble learning. The ensemble versions of the ReliefF and TuRF algorithms, referred to as ReliefF-E and TuRF-E, are robust to sample order dependency and enable a more informative investigation of data characteristics. Using simulated and real datasets, we demonstrate that both the ensemble of ReliefF and the ensemble of TuRF can generate a much more stable SNP ranking than the original algorithms. Furthermore, the ensemble of TuRF achieved the highest success rate in comparison to many state-of-the-art algorithms as well as traditional χ²-test and odds ratio methods in terms of retaining gene-gene interactions.
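The aggregation idea can be illustrated with a minimal sketch. The abstract does not state how the runs are combined, so averaging per-SNP ranks over runs on randomly re-ordered samples is an assumption here, not necessarily the authors' rule. The names `ensemble_ranking`, `scores_to_ranks`, and `toy_filter` are hypothetical, and `toy_filter` is only an order-dependent stand-in, not ReliefF or TuRF.

```python
import random
from typing import Callable, List, Sequence

# A filter takes (samples, labels) and returns one score per SNP.
ScoreFn = Callable[[Sequence[Sequence[int]], Sequence[int]], List[float]]


def scores_to_ranks(scores: Sequence[float]) -> List[int]:
    """Convert scores to ranks (1 = highest score; ties broken by SNP index)."""
    order = sorted(range(len(scores)), key=lambda j: (-scores[j], j))
    ranks = [0] * len(scores)
    for rank, j in enumerate(order, start=1):
        ranks[j] = rank
    return ranks


def ensemble_ranking(score_fn: ScoreFn,
                     samples: Sequence[Sequence[int]],
                     labels: Sequence[int],
                     n_runs: int = 10,
                     seed: int = 0) -> List[float]:
    """Average the per-SNP rank over n_runs runs, each on a shuffled sample order.

    Assumption: mean-rank aggregation; the source only says results are
    "aggregated".
    """
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rank_sums = [0.0] * len(samples[0])
    for _ in range(n_runs):
        rng.shuffle(idx)  # present the same samples in a new order
        shuffled_x = [samples[i] for i in idx]
        shuffled_y = [labels[i] for i in idx]
        ranks = scores_to_ranks(score_fn(shuffled_x, shuffled_y))
        for j, r in enumerate(ranks):
            rank_sums[j] += r
    return [s / n_runs for s in rank_sums]


def toy_filter(samples: Sequence[Sequence[int]],
               labels: Sequence[int]) -> List[float]:
    """Hypothetical order-dependent filter (NOT ReliefF): it only looks at the
    first half of the samples, so its scores change with sample order."""
    half = max(1, len(samples) // 2)
    scores = []
    for j in range(len(samples[0])):
        case = [samples[i][j] for i in range(half) if labels[i] == 1]
        ctrl = [samples[i][j] for i in range(half) if labels[i] == 0]
        if not case or not ctrl:
            scores.append(0.0)
        else:
            scores.append(abs(sum(case) / len(case) - sum(ctrl) / len(ctrl)))
    return scores
```

A call such as `ensemble_ranking(toy_filter, genotypes, labels, n_runs=50)` returns one mean rank per SNP, where a lower value means the SNP was ranked near the top more consistently. Any real filter with the same `(samples, labels) -> scores` shape could be substituted for `toy_filter`.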

Figure 1: Correlation comparison. The correlation between SNP rankings (log10 transformed) generated by two runs of ReliefF, TuRF, ReliefF-E, and TuRF-E using simulated datasets (400 and 800 samples) and the AMD dataset, where each run uses a different sample order.

Mentions:
We found that the SNP rankings generated by ReliefF and TuRF are sensitive to the order in which the samples are presented in the dataset. Figure 1(a) shows the Pearson correlation of the SNP rankings from two separate runs of ReliefF and TuRF using a dataset containing 1000 SNPs and 400 samples (200 controls and 200 cases). Figure 1(b) shows the result of the same analysis applied to a simulated dataset containing 800 samples. Both ReliefF and TuRF are clearly sensitive to the order of samples in the dataset, so the rank of each SNP is inconsistent between the original dataset and the randomly re-ordered dataset. While this inconsistency is relatively small for ReliefF, the problem is much more severe for TuRF. The Pearson correlation coefficient between two runs of TuRF is r = 0.43 for the dataset with 400 samples and r = 0.36 for the dataset with 800 samples. As we shall demonstrate later, this instability caused by sample order dependency leads TuRF to perform suboptimally.
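The stability measure described above, a Pearson correlation between the log10-transformed SNP ranks of two runs, can be sketched as follows. The function names `pearson` and `rank_stability` are hypothetical, and the inputs are assumed to be rank lists (1 = top-ranked SNP) of equal length from two runs on differently ordered samples.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rank_stability(ranks_run1: Sequence[int], ranks_run2: Sequence[int]) -> float:
    """Correlate log10-transformed ranks from two runs (ranks must be >= 1)."""
    log1 = [math.log10(r) for r in ranks_run1]
    log2 = [math.log10(r) for r in ranks_run2]
    return pearson(log1, log2)
```

Two runs that produce identical rankings give a value of 1, and lower values indicate that the filter's output depends more strongly on sample order.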
