Abstract

Conventional methods for differential gene expression analysis perform well when intra-class heterogeneity is low and inter-class heterogeneity is high. However, many problems in biology and biomedicine, such as drug resistance in cancer, contain samples that show both significant intra- and inter-class heterogeneity. Conventional methods, which use means and variances to compute test statistics, lack power to identify genes differentially expressed between classes in these cases. To address this challenge, we developed EMDomics, a new method for differential gene expression analysis designed to perform well in the setting of intra-class heterogeneity.

EMDomics uses the Earth mover's distance (EMD) to measure the overall difference between the distributions of a gene's expression in two classes of samples, and uses permutations to estimate false discovery rates and obtain a q-value for each gene. To evaluate the theoretical basis for EMDomics, we modeled heterogeneous mechanisms of drug resistance using simulated data and compared the performance of EMDomics with that of two commonly used conventional methods (SAM and Limma) in terms of sensitivity and specificity for identifying genes truly associated with drug resistance in the simulation. To test EMDomics on real biological data, we applied it to the challenging problem of identifying genes associated with drug response in ovarian cancer, using data from The Cancer Genome Atlas.
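The core idea can be illustrated with a minimal sketch. It is not the EMDomics implementation: the function names (`emd_1d`, `permutation_p_value`) and the toy expression values are hypothetical, the EMD is computed directly on the empirical distributions rather than on any binning the method may use, and the permutation step returns a single-gene p-value rather than the permutation-based false discovery rates and q-values described above. The toy data assume a gene whose expression is tightly clustered in one class and split into low and high subgroups in the other, so that the two class means coincide.

```python
import random
from bisect import bisect_right


def emd_1d(a, b):
    """1-D earth mover's distance between two empirical samples.

    Computed as the area between the two empirical CDFs.
    """
    a_sorted, b_sorted = sorted(a), sorted(b)
    grid = sorted(set(a_sorted) | set(b_sorted))
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        # Fraction of each sample that is <= left
        cdf_a = bisect_right(a_sorted, left) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, left) / len(b_sorted)
        total += abs(cdf_a - cdf_b) * (right - left)
    return total


def permutation_p_value(a, b, n_perm=1000, seed=0):
    """Permutation p-value for the observed EMD (class labels shuffled)."""
    rng = random.Random(seed)
    observed = emd_1d(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        permuted = emd_1d(pooled[:len(a)], pooled[len(a):])
        if permuted >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical expression values for one gene (illustrative only).
sensitive = [4.8, 4.9, 5.0, 5.1, 5.2, 5.0, 4.9, 5.1]   # one tight cluster
resistant = [3.0, 3.1, 2.9, 3.0, 7.0, 7.1, 6.9, 7.0]   # two subgroups

mean_diff = sum(sensitive) / len(sensitive) - sum(resistant) / len(resistant)
observed, p_value = permutation_p_value(sensitive, resistant)
print(f"mean difference = {mean_diff:.3f}")
print(f"EMD = {observed:.3f}, permutation p-value = {p_value:.4f}")
```

In this toy case the difference in means is essentially zero, so a statistic built on means has little to work with, while the EMD is large because the shapes of the two distributions differ.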

In both the simulated and real biological data, EMDomics outperformed the competing approaches for the identification of differentially expressed genes. Using simulated data, EMDomics yielded higher sensitivity and precision for highly heterogeneous data. Using real data, EMDomics was able to identify genes that are highly relevant to ovarian cancer biology and that were not identified by the conventional methods. In addition, gene set enrichment analysis showed that the most highly enriched gene sets include pathways known to play critical roles in ovarian cancer pathogenesis. The most enriched gene set identified by the EMDomics analysis is a set of genes down-regulated in cancer cell lines with mutated TP53. Other gene sets identified as highly enriched by the EMDomics analysis include gene sets related to LEF1, BMI1, KRAS, EZH2, and PTEN, and pathways related to cell-cell junction organization, cell-cell communication, WNT signaling, and extracellular matrix organization.

EMDomics represents a new approach for the identification of genes differentially expressed between heterogeneous classes. It is a robust non-parametric method that makes no assumptions about the distributions of, or the differences between, the two classes being compared, and thus has significantly more power than conventional approaches for identifying differential 'omics features between heterogeneous classes. The method can be applied in a wide variety of settings to compare distributions of 'omics data between two classes.