False discovery rate

The false discovery rate (FDR) is a method of conceptualizing the rate of type I errors in null hypothesis testing when conducting multiple comparisons. FDR-controlling procedures are designed to control the expected proportion of "discoveries" (rejected null hypotheses) that are false (incorrect rejections of the null).[1] FDR-controlling procedures provide less stringent control of Type I errors compared to familywise error rate (FWER) controlling procedures (such as the Bonferroni correction), which control the probability of at least one Type I error. Thus, FDR-controlling procedures have greater power, at the cost of increased numbers of Type I errors.[2]

The modern widespread use of the FDR is believed to stem from, and be motivated by, the development in technologies that allowed the collection and analysis of a large number of distinct variables in several individuals (e.g., the expression level of each of 10,000 different genes in 100 different persons).[3] By the late 1980s and 1990s, the development of "high-throughput" sciences, such as genomics, allowed for rapid data acquisition. This, coupled with the growth in computing power, made it possible to seamlessly perform hundreds and thousands of statistical tests on a given data set. The technology of microarrays was a prototypical example, as it enabled thousands of genes to be tested simultaneously for differential expression between two biological conditions.[4]

As high-throughput technologies became common, technological and/or financial constraints led researchers to collect datasets with relatively small sample sizes (e.g. few individuals being tested) and large numbers of variables being measured per sample (e.g. thousands of gene expression levels). In these datasets, too few of the measured variables showed statistical significance after classic correction for multiple tests with standard multiple comparison procedures. This created a need within many scientific communities to abandon FWER and unadjusted multiple hypothesis testing for other ways to highlight and rank in publications those variables showing marked effects across individuals or treatments that would otherwise be dismissed as non-significant after standard correction for multiple tests. In response to this, a variety of error rates have been proposed—and become commonly used in publications—that are less conservative than FWER in flagging possibly noteworthy observations.

The FDR concept was formally described by Yoav Benjamini and Yosef Hochberg in 1995[1] (BH procedure) as a less conservative and arguably more appropriate approach for identifying the important few from the trivial many effects tested. The FDR has been particularly influential, as it was the first alternative to the FWER to gain broad acceptance in many scientific fields (especially in the life sciences, from genetics to biochemistry, oncology and plant sciences).[3] In 2005, the Benjamini and Hochberg paper from 1995 was identified as one of the 25 most-cited statistical papers.[5]

Prior to the 1995 introduction of the FDR concept, various precursor ideas had been considered in the statistics literature. In 1979, Holm proposed the Holm procedure,[6] a stepwise algorithm for controlling the FWER that is at least as powerful as the well-known Bonferroni adjustment. This stepwise algorithm sorts the p-values and sequentially rejects the hypotheses starting from the smallest p-values.
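The stepwise idea behind the Holm procedure can be sketched in a few lines of Python. This is an illustrative sketch, not taken from the cited paper: the function name `holm` and the example p-values are hypothetical, and the thresholds $\alpha/(m-k+1)$ are the usual textbook statement of the procedure, which the paragraph above does not spell out.

```python
def holm(p_values, alpha=0.05):
    """Return a list of booleans: True where Holm's procedure rejects the null."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for step, idx in enumerate(order):
        # The (step + 1)-th smallest p-value is compared with alpha / (m - step).
        if p_values[idx] <= alpha / (m - step):
            rejected[idx] = True
        else:
            break  # stop at the first hypothesis that cannot be rejected
    return rejected


def bonferroni(p_values, alpha=0.05):
    """Reject every hypothesis whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# Hypothetical p-values for five tests.
p = [0.004, 0.012, 0.6, 0.03, 0.5]
holm_result = holm(p)
bonferroni_result = bonferroni(p)
```

In this made-up example Holm rejects the first two hypotheses while Bonferroni rejects only the first, which illustrates the "at least as powerful" claim.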

Benjamini (2010)[3] said that the false discovery rate, and the paper Benjamini and Hochberg (1995), had their origins in two papers concerned with multiple testing:

The first paper is by Schweder and Spjotvoll (1982),[7] who suggested plotting the ranked p-values and assessing the number of true null hypotheses ($m_0$) via an eye-fitted line starting from the largest p-values. The p-values that deviate from this straight line then should correspond to the false null hypotheses. This idea was later developed into an algorithm and incorporated the estimation of $m_0$ into procedures such as Bonferroni, Holm or Hochberg.[8] This idea is closely related to the graphical interpretation of the BH procedure.

The second paper is by Branko Soric (1989),[9] which introduced the terminology of "discovery" in the multiple hypothesis testing context. Soric used the expected number of false discoveries divided by the number of discoveries ($E[V]/R$) as a warning that "a large part of statistical discoveries may be wrong". This led Benjamini and Hochberg to the idea that a similar error rate, rather than being merely a warning, can serve as a worthy goal to control.

The BH procedure was proven to control the FDR for independent tests in 1995 by Benjamini and Hochberg.[1] In 1986, R. J. Simes offered the same procedure as the "Simes procedure", in order to control the FWER in the weak sense (under the intersection null hypothesis) when the statistics are independent.[10]

The FDR is the expected value of $Q = V/R$, the proportion of false discoveries among the discoveries, where $Q$ is defined to be 0 when $R = 0$.
One wants to keep the FDR below a threshold q. To include the case when $R = 0$, formally $\mathrm{FDR} = \mathrm{E}\left[V/R \mid R > 0\right] \cdot \mathrm{P}\left(R > 0\right)$.[1]
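The formal expression follows from splitting the expectation of $Q$ over the events $R > 0$ and $R = 0$. The short derivation below is a sketch that assumes the standard definition $\mathrm{FDR} = \mathrm{E}[Q]$ with $Q = V/R$ and $Q = 0$ when $R = 0$:

```latex
\mathrm{FDR} = \mathrm{E}[Q]
  = \mathrm{E}[Q \mid R > 0]\,\mathrm{P}(R > 0) + \mathrm{E}[Q \mid R = 0]\,\mathrm{P}(R = 0)
  = \mathrm{E}\!\left[\frac{V}{R} \,\middle|\, R > 0\right]\mathrm{P}(R > 0) + 0 \cdot \mathrm{P}(R = 0)
```

The second term vanishes because $Q$ is 0 whenever no hypothesis is rejected.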

The following table defines the possible outcomes when testing multiple null hypotheses.
Suppose we have a number m of null hypotheses, denoted by: H1, H2, ..., Hm.
Using a statistical test, we reject the null hypothesis if the test is declared significant. We do not reject the null hypothesis if the test is non-significant.
Summing each type of outcome over all Hi yields the following random variables:

| | Null hypothesis is true (H0) | Alternative hypothesis is true (HA) | Total |
|---|---|---|---|
| Test is declared significant | V | S | R |
| Test is declared non-significant | U | T | $m - R$ |
| Total | $m_0$ | $m - m_0$ | m |

m is the total number of hypotheses tested

$m_0$ is the number of true null hypotheses, an unknown parameter
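As an illustration of how the cells of the table are tallied, the following Python sketch counts V, S, U and T from two hypothetical inputs: a list saying which null hypotheses are actually true (unknown in practice, so this only makes sense in a simulation) and a list of test decisions. The function name `outcome_counts` and the example data are assumptions made for this sketch.

```python
def outcome_counts(null_is_true, rejected):
    """Tally the table cells V, S, U, T and the margins R, m0 and m."""
    if len(null_is_true) != len(rejected):
        raise ValueError("inputs must have the same length")
    pairs = list(zip(null_is_true, rejected))
    V = sum(1 for null, rej in pairs if null and rej)          # true null, declared significant
    S = sum(1 for null, rej in pairs if not null and rej)      # false null, declared significant
    U = sum(1 for null, rej in pairs if null and not rej)      # true null, declared non-significant
    T = sum(1 for null, rej in pairs if not null and not rej)  # false null, declared non-significant
    return {"V": V, "S": S, "U": U, "T": T, "R": V + S, "m0": V + U, "m": len(pairs)}


# Hypothetical example: 6 hypotheses, of which the first 4 nulls are true.
truth = [True, True, True, True, False, False]
decisions = [True, False, False, False, True, False]
counts = outcome_counts(truth, decisions)
print(counts)
```

Only R and m are observable in a real analysis; V, S, U, T and $m_0$ depend on the unknown truth.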

The setting for many procedures is such that we have $H_1 \ldots H_m$ null hypotheses tested and $P_1 \ldots P_m$ their corresponding p-values. We list these p-values in ascending order and denote them by $P_{(1)} \ldots P_{(m)}$. A procedure that goes from a small test statistic to a large one (that is, from the largest p-value to the smallest) will be called a step-up procedure. In a similar way, in a "step-down" procedure we move from a large corresponding test statistic to a smaller one.

For a given $\alpha$, the BH procedure finds the largest k such that $P_{(k)} \leq \frac{k}{m}\alpha$ and rejects the null hypotheses $H_{(1)}, \ldots, H_{(k)}$. Geometrically, this corresponds to plotting $P_{(k)}$ vs. k (on the y and x axes respectively), drawing the line through the origin with slope $\frac{\alpha}{m}$, and declaring discoveries for all points on the left up to and including the last point that is not above the line.
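The BH step-up rule (find the largest k with $P_{(k)} \leq \frac{k}{m}\alpha$ and reject the k hypotheses with the smallest p-values) can be sketched in Python as follows. The function name `benjamini_hochberg` and the example p-values are hypothetical; this is a minimal illustration rather than a reference implementation.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the BH procedure rejects the null."""
    m = len(p_values)
    # order[k - 1] is the index of the k-th smallest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value lies on or below the line k * alpha / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    # Reject the k_max smallest p-values, even those that sit above the line themselves.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for five tests.
p = [0.028, 0.6, 0.004, 0.035, 0.025]
bh_result = benjamini_hochberg(p, alpha=0.05)
```

In this example the second-smallest p-value (0.025) is above its own threshold of 0.02, but it is still rejected because the fourth-smallest (0.035) is below its threshold of 0.04. This is the "up to and including the last point" behaviour of the geometric description.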

The BH procedure is valid when the m tests are independent, and also in various scenarios of dependence, but is not universally valid.[11] It also satisfies the inequality $E(Q) \leq \frac{m_0}{m}\alpha \leq \alpha$.

If an estimator of $m_0$ is inserted into the BH procedure, it is no longer guaranteed to achieve FDR control at the desired level.[3] Adjustments may be needed in the estimator and several modifications have been proposed.[12][13][14][15]

Note that the mean $\alpha$ for these m tests is $\frac{\alpha(m+1)}{2m}$, the Mean(FDR $\alpha$) or MFDR, $\alpha$ adjusted for m independent or positively correlated tests (see AFDR below). The MFDR expression here is for a single recomputed value of $\alpha$ and is not part of the Benjamini and Hochberg method.
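The MFDR expression can be checked with a short calculation, under the assumption that "the mean $\alpha$" refers to the average of the m per-rank BH thresholds $\frac{k}{m}\alpha$ for $k = 1, \ldots, m$:

```latex
\frac{1}{m}\sum_{k=1}^{m}\frac{k\,\alpha}{m}
  = \frac{\alpha}{m^{2}}\cdot\frac{m(m+1)}{2}
  = \frac{\alpha(m+1)}{2m}
```

For example, with $m = 10$ and $\alpha = 0.05$ this gives $0.05 \cdot 11 / 20 = 0.0275$.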

A multiplicity procedure that controls the FDR criterion is adaptive and scalable: controlling the FDR can be very permissive (if the data justify it) or conservative (acting close to control of the FWER for sparse problems), depending on the number of hypotheses tested and the level of significance.[3]

The FDR criterion adapts so that the same number of false discoveries (V) will have different implications, depending on the total number of discoveries (R). This contrasts with the familywise error rate criterion. For example, if inspecting 100 hypotheses (say, 100 genetic mutations or SNPs for association with some phenotype in some population):

If we make 4 discoveries (R), having 2 of them be false discoveries (V) is often very costly.

By contrast, if we make 50 discoveries (R), having 2 of them be false discoveries (V) is often not very costly.

The FDR criterion is scalable in that the same proportion of false discoveries out of the total number of discoveries (Q) remains sensible for different numbers of total discoveries (R). For example:

If we make 100 discoveries (R), having 5 of them be false discoveries ($q = 5\%$) may not be very costly.

Similarly, if we make 1000 discoveries (R), having 50 of them be false discoveries (as before, $q = 5\%$) may still not be very costly.
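The arithmetic behind these examples is just the proportion V/R. The small Python sketch below (with a hypothetical helper name, `false_discovery_proportion`) puts the four cases side by side:

```python
def false_discovery_proportion(V, R):
    """Q = V / R, with Q defined as 0 when there are no discoveries."""
    return V / R if R > 0 else 0.0


# (V, R) pairs taken from the examples in the text.
examples = [(2, 4), (2, 50), (5, 100), (50, 1000)]
for V, R in examples:
    print(f"V={V:>3}, R={R:>4}: Q = {false_discovery_proportion(V, R):.0%}")
```

The same V = 2 gives a proportion of 50% among 4 discoveries but 4% among 50, while 5 out of 100 and 50 out of 1000 both give 5%.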

Controlling the FDR using the linear step-up BH procedure, at level q, has several properties related to the dependency structure between the test statistics of the m null hypotheses that are being corrected for. If the test statistics are:

If all of the null hypotheses are true ($m_0 = m$), then controlling the FDR at level q guarantees control over the FWER (this is also called "weak control of the FWER"): $\mathrm{FWER} = P\left(V \geq 1\right) = E\left(\frac{V}{R}\right) = \mathrm{FDR} \leq q$, simply because the event of rejecting at least one true null hypothesis $\{V \geq 1\}$ is exactly the event $\{V/R = 1\}$, and the event $\{V = 0\}$ is exactly the event $\{V/R = 0\}$ (when $V = R = 0$, $V/R = 0$ by definition).[1] But if there are some true discoveries to be made ($m_0 < m$) then FWER ≥ FDR. In that case there will be room for improving detection power. It also means that any procedure that controls the FWER will also control the FDR.
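The identity FWER = FDR under the complete null can be checked with a small Monte Carlo sketch. The code below is an illustration under stated assumptions: all m nulls are true, the p-values are independent Uniform(0, 1) draws, and the BH step-up rule is used as the FDR-controlling procedure. The function names, the seed and the simulation sizes are arbitrary choices made for this sketch.

```python
import random


def bh_num_rejections(p_values, q):
    """Number of rejections R made by the BH step-up rule at level q."""
    m = len(p_values)
    k_max = 0
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= rank * q / m:
            k_max = rank
    return k_max


def simulate_global_null(m=20, q=0.1, n_sim=5000, seed=0):
    """Estimate FWER and FDR when every null hypothesis is true."""
    rng = random.Random(seed)
    any_false_rejection = 0   # counts the event {V >= 1}
    sum_of_proportions = 0.0  # accumulates Q = V / R (0 when R = 0)
    for _ in range(n_sim):
        # Under a true null, an exact continuous test has a Uniform(0, 1) p-value.
        p_values = [rng.random() for _ in range(m)]
        R = bh_num_rejections(p_values, q)
        V = R  # every null is true, so every rejection is a false discovery
        any_false_rejection += 1 if V >= 1 else 0
        sum_of_proportions += V / R if R > 0 else 0.0
    return any_false_rejection / n_sim, sum_of_proportions / n_sim


fwer_hat, fdr_hat = simulate_global_null()
print(fwer_hat, fdr_hat)
```

Because V/R can only be 0 or 1 when $m_0 = m$, the two estimates coincide exactly in every run, and both should land near q up to simulation noise.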

$Q'$ is "the proportion of false discoveries among the discoveries", suggested by Soric in 1989,[9] and is defined as: $Q' = \frac{E[V]}{R}$. This is a mixture of expectations and realizations, and has the problem of control for $m_0 = m$.[1]

$\mathrm{FDR}_{-1}$ (or Fdr) was used by Benjamini and Hochberg,[3] and later called "Fdr" by Efron (2008) and earlier.[20] It is defined as: $\mathrm{FDR}_{-1} = \mathrm{Fdr} = \frac{E[V]}{E[R]}$. This error rate cannot be strictly controlled because it is 1 when $m = m_0$.

$\mathrm{FDR}_{+1}$ was used by Benjamini and Hochberg,[3] and later called "pFDR" by Storey (2002).[21] It is defined as: $\mathrm{FDR}_{+1} = \mathrm{pFDR} = E\left[\left.\frac{V}{R}\right| R > 0\right]$. This error rate cannot be strictly controlled because it is 1 when $m = m_0$.

FDCR (False Discovery Cost Rate). Stemming from statistical process control: associated with each hypothesis i is a cost $c_i$ and with the intersection hypothesis $H_{00}$ a cost $c_0$. The motivation is that stopping a production process may incur a fixed cost. It is defined as: $\mathrm{FDCR} = E\left(c_0 V_0 + \frac{\sum c_i V_i}{c_0 R_0 + \sum c_i R_i}\right)$
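To make the difference between the FDR, the pFDR and the Fdr concrete, the following Python sketch computes empirical analogues of the three quantities from a handful of (V, R) outcomes. The function name `error_rate_summaries` and the outcomes are made up for illustration; in practice V is not observable, so such a calculation only makes sense in a simulation where the truth is known.

```python
def error_rate_summaries(outcomes):
    """Empirical FDR, pFDR and Fdr from a list of (V, R) pairs.

    Assumes at least one outcome has R > 0; otherwise pFDR and Fdr are undefined.
    """
    n = len(outcomes)
    # FDR: average of V / R, counting runs with no discoveries as 0.
    fdr = sum((v / r if r > 0 else 0.0) for v, r in outcomes) / n
    # pFDR: average of V / R over the runs that made at least one discovery.
    with_discoveries = [(v, r) for v, r in outcomes if r > 0]
    pfdr = sum(v / r for v, r in with_discoveries) / len(with_discoveries)
    # Fdr: ratio of the averages, E[V] / E[R].
    fdr_minus_1 = sum(v for v, _ in outcomes) / sum(r for _, r in outcomes)
    return fdr, pfdr, fdr_minus_1


# Hypothetical outcomes of four repetitions of the same multiple-testing experiment.
outcomes = [(0, 0), (1, 2), (0, 4), (1, 1)]
fdr, pfdr, fdr_minus_1 = error_rate_summaries(outcomes)
print(fdr, pfdr, fdr_minus_1)
```

Here the empirical FDR is 0.375, the pFDR is 0.5 and the Fdr is 2/7. The first two differ by exactly the fraction of runs with at least one discovery (3 out of 4), mirroring the relation FDR = pFDR · P(R > 0).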

The false coverage rate (FCR) is, in a sense, the FDR analog to the confidence interval. FCR indicates the average rate of false coverage, namely, not covering the true parameters, among the selected intervals. The FCR gives a simultaneous coverage at a $1 - \alpha$ level for all of the parameters considered in the problem. Intervals with simultaneous coverage probability $1 - q$ can control the FCR to be bounded by q. There are many FCR procedures such as: Bonferroni-Selected–Bonferroni-Adjusted,[citation needed] Adjusted BH-Selected CIs (Benjamini and Yekutieli (2005)),[23] Bayes FCR (Yekutieli (2008)),[citation needed] and other Bayes methods.[24]

Colquhoun (2014)[31] used the term "false discovery rate" to mean the probability that a statistically significant result was a false positive. This was part of an investigation of the question "how should one interpret the P value found in a single unbiased test of significance". In subsequent work,[32][33] Colquhoun called the same thing the false positive risk, rather than the false discovery rate, in order to avoid confusion with the use of the latter term in connection with the problem of multiple comparisons. The methods for dealing with multiple comparisons described above aim to control the type I error rate. The result of applying them is to produce a (corrected) P value. The result is, therefore, subject to the same misinterpretations as any other P value.