LIGO Echoes, P-values and the False Discovery Rate

Today is our staff Christmas lunch so I thought I’d get into the spirit by posting a grumbly article about a paper I found on the arXiv. In fact I came to this piece via a News item in Nature. Anyway, here is the abstract of the paper – which hasn’t been refereed yet:

In classical General Relativity (GR), an observer falling into an astrophysical black hole is not expected to experience anything dramatic as she crosses the event horizon. However, tentative resolutions to problems in quantum gravity, such as the cosmological constant problem, or the black hole information paradox, invoke significant departures from classicality in the vicinity of the horizon. It was recently pointed out that such near-horizon structures can lead to late-time echoes in the black hole merger gravitational wave signals that are otherwise indistinguishable from GR. We search for observational signatures of these echoes in the gravitational wave data released by advanced Laser Interferometer Gravitational-Wave Observatory (LIGO), following the three black hole merger events GW150914, GW151226, and LVT151012. In particular, we look for repeating damped echoes with time-delays of 8MlogM (+spin corrections, in Planck units), corresponding to Planck-scale departures from GR near their respective horizons. Accounting for the “look elsewhere” effect due to uncertainty in the echo template, we find tentative evidence for Planck-scale structure near black hole horizons at 2.9σ significance level (corresponding to false detection probability of 1 in 270). Future data releases from LIGO collaboration, along with more physical echo templates, will definitively confirm (or rule out) this finding, providing possible empirical evidence for alternatives to classical black holes, such as in firewall or fuzzball paradigms.

I’ve highlighted some of the text in bold (the claim that a 2.9σ significance level corresponds to a “false detection probability of 1 in 270”) because, as written, it’s wrong.

I’ve blogged many times before about this type of thing. The “significance level” quoted corresponds to a “p-value” of 0.0037 (or about 1/270). If I had my way we’d ban p-values and significance levels altogether because they are so often presented in a misleading fashion, as it is here.

What is wrong is that the significance level is not the same as the false detection probability. While it is usually the case that the false detection probability (which is often called the false discovery rate) will decrease the lower your p-value is, these two quantities are not the same thing at all. Usually the false detection probability is much higher than the p-value. The physicist John Bahcall summed this up when he said, based on his experience, “about half of all 3σ detections are false”. You can find a nice (and relatively simple) explanation of why this is the case here (which includes various references that are worth reading), but basically it’s because the p-value relates to the probability of seeing a signal at least as large as that observed under a null hypothesis (e.g. detector noise) but says nothing directly about the probability of it being produced by an actual signal. To answer this latter question properly one really needs to use a Bayesian approach, but if you’re not keen on that I refer you to this (from David Colquhoun’s blog):
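Bahcall’s remark is easy to illustrate with a back-of-the-envelope calculation. Here is a minimal sketch in Python; the numbers for the power of the test and the prevalence of real effects are my own illustrative assumptions, not taken from any of the papers mentioned:

```python
# Why the false discovery rate is not the p-value: a toy calculation.
# Assumptions (illustrative, not from the paper): a 3-sigma detection
# threshold, a test with 80% power, and only 1 in 100 of the candidate
# effects being real.

p_threshold = 0.0013   # approximate two-sided p-value at 3 sigma
power = 0.80           # probability of detecting a real effect
prevalence = 0.01      # fraction of tested hypotheses that are real

true_positives = prevalence * power
false_positives = (1 - prevalence) * p_threshold

# Of everything that clears the 3-sigma bar, what fraction is noise?
fdr = false_positives / (true_positives + false_positives)

print(f"False discovery rate: {fdr:.1%}")  # about 13.9%, i.e. ~100x the p-value
```

Push the prevalence down to 1 in 1000 and the same arithmetic gives a false discovery rate of over 60%, which is exactly Bahcall’s “half of all 3σ detections are false” territory.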

One problem with all of the approaches mentioned above was the need to guess at the prevalence of real effects (that’s what a Bayesian would call the prior probability). James Berger and colleagues (Sellke et al., 2001) have proposed a way round this problem by looking at all possible prior distributions and so coming up with a minimum false discovery rate that holds universally. The conclusions are much the same as before. If you claim to have found an effect whenever you observe a P value just less than 0.05, you will come to the wrong conclusion in at least 29% of the tests that you do. If, on the other hand, you use P = 0.001, you’ll be wrong in only 1.8% of cases.

Of course the actual false detection probability can be much higher than these limits, but they provide a useful rule of thumb.
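The Sellke et al. bound that Colquhoun refers to is easy to compute for yourself. Here is a minimal sketch in Python (the function name is mine); it reproduces the 29% and 1.8% figures above, and applied to the paper’s p-value of about 1/270 it gives a minimum false positive risk of roughly 5%, more than ten times the quoted “1 in 270”:

```python
import math

def min_false_positive_risk(p):
    """Sellke et al. (2001) bound: over all possible priors, the Bayes
    factor in favour of the null is at least -e * p * ln(p) (valid for
    p < 1/e). With 50:50 prior odds this gives a lower bound on the
    probability that a 'detection' at p-value p is false."""
    bayes_factor = -math.e * p * math.log(p)   # minimum Bayes factor for H0
    return bayes_factor / (1 + bayes_factor)   # posterior probability of H0

for p in (0.05, 0.001, 1 / 270):
    print(f"p = {p:.4g}: minimum false positive risk = "
          f"{min_false_positive_risk(p):.1%}")
```

Note that these are lower bounds: with any realistic prior prevalence of real effects the false discovery rate is higher still.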

To be fair the Nature item puts it more accurately:

The echoes could be a statistical fluke, and if random noise is behind the patterns, says Afshordi, then the chance of seeing such echoes is about 1 in 270, or 2.9 sigma. To be sure that they are not noise, such echoes will have to be spotted in future black-hole mergers. “The good thing is that new LIGO data with improved sensitivity will be coming in, so we should be able to confirm this or rule it out within the next two years.”

Unfortunately, however, the LIGO background noise is rather complicated so it’s not even clear to me that this calculation based on “random noise” is meaningful anyway.

The idea that the authors are trying to test is of course interesting, but it needs a more rigorous approach before any evidence (even “tentative” evidence) can be claimed. This is rather reminiscent of the problems with interpreting apparent “anomalies” in the Cosmic Microwave Background, which is something I’ve been interested in over the years.

