Abstract

Background

The 27k Illumina Infinium Methylation Beadchip is a popular high-throughput technology
that allows the methylation state of over 27,000 CpGs to be assayed. While feature
selection and classification methods have been comprehensively explored in the context
of gene expression data, relatively little is known about how best to perform feature
selection or classification in the context of Illumina Infinium methylation data.
Given the rising importance of epigenomics in cancer and other complex genetic diseases,
and in view of the upcoming epigenome-wide association studies, it is critical to
identify the statistical methods that offer improved inference in this novel context.

Results

Using a total of 7 large Illumina Infinium 27k Methylation data sets, encompassing
over 1,000 samples from a wide range of tissues, we provide an evaluation of
popular feature selection, dimensionality reduction and classification methods on DNA
methylation data. Specifically, we evaluate the effects of variance filtering, supervised
principal components (SPCA) and the choice of DNA methylation quantification measure
on downstream statistical inference. We show that for relatively large sample sizes
feature selection using test statistics is similar for M-values and β-values, but that in
the limit of small sample sizes, M-values allow more reliable identification of true
positives. We also show that the effect of variance filtering on feature selection
is study-specific and dependent on the phenotype of interest and tissue type profiled.
Specifically, we find that variance filtering improves the detection of true positives
in studies with large effect sizes, but that it may lead to worse performance in studies
with smaller yet significant effect sizes. In contrast, supervised principal components
improves the statistical power, especially in studies with small effect sizes. We
also demonstrate that classification using the Elastic Net and Support Vector Machine
(SVM) clearly outperforms competing methods like LASSO and SPCA. Finally, in unsupervised
modelling of cancer diagnosis, we find that non-negative matrix factorisation (NMF)
clearly outperforms principal components analysis.
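
As an illustration of two of the quantities compared above, the following minimal sketch converts β-values to M-values using the standard logit relation M = log2(β / (1 − β)) and then applies a simple variance filter. The function names, the clipping constant `eps`, the synthetic data and the number of retained CpGs are assumptions made for this example only; they are not the procedure or thresholds used in the study.

```python
import math
import random
import statistics


def beta_to_m(beta, eps=1e-6):
    """Convert one beta-value in [0, 1] to an M-value, M = log2(beta / (1 - beta)).

    beta is clipped to [eps, 1 - eps] so that values of exactly 0 or 1
    do not produce infinities (eps is an illustrative choice).
    """
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def variance_filter(rows, n_keep):
    """Return the sorted row indices of the n_keep most variable CpGs.

    rows is a list of per-CpG profiles (one list of sample values per CpG).
    """
    variances = [statistics.variance(row) for row in rows]
    ranked = sorted(range(len(rows)), key=lambda i: variances[i], reverse=True)
    return sorted(ranked[:n_keep])


# Synthetic beta-values: 200 hypothetical CpGs measured in 20 samples.
random.seed(0)
beta_matrix = [[random.betavariate(0.5, 0.5) for _ in range(20)] for _ in range(200)]

# Transform to M-values, then keep the 50 most variable CpGs.
m_matrix = [[beta_to_m(b) for b in row] for row in beta_matrix]
kept = variance_filter(m_matrix, n_keep=50)
filtered = [m_matrix[i] for i in kept]
```

Because the filter ranks CpGs by variance without reference to the phenotype, the retained CpGs need not be the ones associated with it, which is consistent with the study-specific behaviour of variance filtering reported above.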

Conclusions

Our results highlight the importance of tailoring the feature selection and classification
methodology to the sample size and biological context of the DNA methylation study.
The Elastic Net emerges as a powerful classification algorithm for large-scale DNA
methylation studies, while NMF does well in the unsupervised context. The insights
presented here will be useful to any study embarking on large-scale DNA methylation
profiling using Illumina Infinium beadarrays.