In this vignette, we explain the underlaying algorithmic details of our
implementation of the Wilcoxon-Mann-Whitney test. The source code used
to produce this document can be found in the github repository
BioQC.

BioQC is a R/Bioconductor package to detect tissue heterogeneity from
high-throughput gene expression profiling data. It implements an
efficient Wilcoxon-Mann-Whitney test, and offers tissue-specific gene
signatures that are ready to use 'out of the box'.

Algorithmic improvements {#algorithmic-improvements}

The Wilcoxon-Mann-Whitney (WMW) test is a non-parametric statistical
test to test if median values of two population are equal or not. Unlike
the t-test, it does not require the assumption of normal distributions,
which makes it more robust against noise.

We improved the computational efficiency of the Wilcoxon-Mann-Whitney
test in comparison to the native R implementation based on three
modifications:

The core algorithm is implemented in C instead of R as
programming language.

BioQC avoids futile expensive sorting operations.

While (1) and (2) are straightforward, we elaborate (3) in the
following.

Let Wa, b be the approximative WMW test of two gene
vectors a, b, where a is the gene set of interest, typically
containing less than a few hundreds of genes, and b is the set of all
genes outside the gene set (background genes) typically containing
>10000 genes. In the context of BioQC, the gene sets are referred to
as tissue signatures.

Given an m × n input matrix of gene expression data with m genes
and n samples s1, …, sn, and k gene sets
d1, …, dk, the WMW-test needs to be applied
for each sample si, i ∈ 1..n and each gene set
dj, j ∈ 1..k. The runtime of the WMW-test is
essentially determined by the sorting operation on the two input
vectors. Using native R wilcox.test, the vectors a and b are
sorted individually for each gene set. However, in the context of gene
set analysis, this is futile, as the (large) background set changes
insignificantly in relation to the (small) gene set, when testing
different gene sets on the same sample.

Therefore, we approximate the WMW-test by extending b to all genes in
the sample, keeping the background unchanged when testing multiple gene
sets. Like this, b has to be sorted only once per sample. The
individual gene sets still need to be sorted, which is not a major
issue, as they are small in comparison to the set of background genes.

Figure 1: BioQC speeds up the Wilcoxon-Mann-Whitney test by avoiding
futile sorting operations on the same sample.

Time benchmark {#time-benchmark}

To demonstrate BioQC's superior performance, we apply both BioQC and the
native R wilcox.test to random expression matrices and measure the
runtime.

We setup random expression matrices of 20514 human protein-coding genes
of 1, 5, 10, 50, or 100 samples. Genes are i.i.d distributed
following 𝒩(0, 1). The native R and the BioQC implementations of the
Wilcoxon-Mann-Whitney test are applied to the matrices respectively.

The numeric results of both implementations, bioqcNumRes (from
BioQC) and rNumRes (from R), are equivalent, as shown by the next
command.

expect_equal(bioqcNumRes, rNumRes)

The BioQC implementation is more than 500 times faster: while it takes
about one second for BioQC to calculate enrichment scores of all 155
signatures in 100 samples, the native R implementation takes about 20
minutes:

Figure 2: Time benchmark results of BioQC and R implementation of
Wilcoxon-Mann-Whitney test. Left panel: elapsed time in seconds
(logarithmic Y-axis). Right panel: ratio of elapsed time by two
implementations. All results achieved by a single thread on in a RedHat
Linux server.

Conclusion {#conclusion}

We have shown that BioQC achieves identical results as the native
implementation in two orders of magnitude less time. This renders
BioQC a highly efficient tool for quality control of large-scale
high-throughput gene expression data.