Bayes Factors for ABX tests

2016-03-30 22:49:51

As we all know, the frequentist approach to hypothesis testing calculates a p-value: we assume the null hypothesis (H0) to be true and calculate the probability of obtaining a result as extreme as the one observed, or more extreme. For X ~ B(n, p):

H0: p = 0.5
H1: p > 0.5

p-value = P(X >= x | H0)
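This one-sided binomial p-value is easy to compute directly; a minimal Python sketch (the helper name `p_value` is mine, not from any particular tool):

```python
from math import comb

def p_value(x, n):
    # One-sided binomial test: P(X >= x) under H0: p = 0.5.
    return sum(comb(n, k) for k in range(x, n + 1)) * 0.5 ** n

# e.g. 9 correct out of 10 trials:
print(p_value(9, 10))  # 11/1024, about 0.011
```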

Now we can answer the question: how well, relative to each other, do the hypotheses explain the data?

I use log10(BF) so that negative-evidence results are easier to read. The categories* I will use are:

 = 0:   no support
 < 0.5: not worth more than a bare mention
 < 1:   moderate
 < 1.5: strong
 < 2:   very strong
>= 2:   decisive

Here are results for some common ABX trial counts including interpretation (according to Jeffreys 1961, Appendix B):

10 trials

Correct      5            6            7            8            9            10
P(x|M0)      2.461E-01    2.051E-01    1.172E-01    4.395E-02    9.766E-03    9.766E-04
P(x|M1)      9.091E-02    1.319E-01    1.612E-01    1.759E-01    1.808E-01    1.817E-01
log10(BF10)  -0.432       -0.192        0.139        0.602        1.267        2.270
Evidence     negative     negative     barely       moderate     strong       decisive
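These entries can be reproduced with a few lines of Python. The sketch below assumes, consistent with the table values, that M1 places a uniform prior on p over (0.5, 1) (density 2) and integrates it out numerically with a midpoint rule; the helper names are mine:

```python
from math import comb, log10

def p_x_m0(x, n):
    # Likelihood under M0: pure guessing, p = 0.5.
    return comb(n, x) * 0.5 ** n

def p_x_m1(x, n, steps=100_000):
    # Marginal likelihood under M1: p ~ Uniform(0.5, 1), density 2.
    width = 0.5 / steps
    total = 0.0
    for i in range(steps):
        p = 0.5 + (i + 0.5) * width  # midpoint of each sub-interval
        total += comb(n, x) * p ** x * (1 - p) ** (n - x)
    return 2 * total * width

def log10_bf10(x, n):
    return log10(p_x_m1(x, n) / p_x_m0(x, n))

print(round(log10_bf10(9, 10), 3))  # 1.267, matching the 10-trial row
```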

12 trials

Correct      6            7            8            9            10           11           12
P(x|M0)      2.256E-01    1.934E-01    1.208E-01    5.371E-02    1.611E-02    2.930E-03    2.441E-04
P(x|M1)      7.692E-02    1.091E-01    1.333E-01    1.467E-01    1.521E-01    1.536E-01    1.538E-01
log10(BF10)  -0.467       -0.248        0.043        0.437        0.975        1.720        2.799
Evidence     negative     negative     barely       barely       moderate     very strong  decisive

14 trials

Correct      7            8            9            10           11           12           13           14
P(x|M0)      2.095E-01    1.833E-01    1.222E-01    6.110E-02    2.222E-02    5.554E-03    8.545E-04    6.104E-05
P(x|M1)      6.667E-02    9.285E-02    1.132E-01    1.254E-01    1.310E-01    1.328E-01    1.333E-01    1.333E-01
log10(BF10)  -0.497       -0.295       -0.033        0.312        0.771        1.379        2.193        3.339
Evidence     negative     negative     negative     barely       moderate     strong       decisive     decisive

16 trials

Correct      8            9            10           11           12           13           14           15           16
P(x|M0)      1.964E-01    1.746E-01    1.222E-01    6.665E-02    2.777E-02    8.545E-03    1.831E-03    2.441E-04    1.526E-05
P(x|M1)      5.882E-02    8.064E-02    9.810E-02    1.092E-01    1.148E-01    1.169E-01    1.175E-01    1.176E-01    1.176E-01
log10(BF10)  -0.524       -0.335       -0.095        0.214        0.616        1.136        1.807        2.683        3.887
Evidence     negative     negative     negative     barely       moderate     strong       very strong  decisive     decisive

20 trials

Correct      10           11           12           13           14           15           16           17           18           19           20
P(x|M0)      1.762E-01    1.602E-01    1.201E-01    7.393E-02    3.696E-02    1.479E-02    4.621E-03    1.087E-03    1.812E-04    1.907E-05    9.537E-07
P(x|M1)      4.762E-02    6.364E-02    7.699E-02    8.623E-02    9.151E-02    9.397E-02    9.490E-02    9.517E-02    9.523E-02    9.524E-02    9.524E-02
log10(BF10)  -0.568       -0.401       -0.193        0.067        0.394        0.803        1.313        1.942        2.721        3.698        4.999
Evidence     negative     negative     negative     barely       barely       moderate     strong       very strong  decisive     decisive     decisive

*) The above categories may seem somewhat arbitrary, similar to significance levels. They are not strictly needed, however, since we can just look at the odds directly:

Posterior Odds = Bayes Factor * Prior Odds

Example: we have two files for which prior data tells us that about one in ten people can distinguish them.

Prior Odds = 0.1 / (1 - 0.1) = 0.111...

A person scores 9/10 in an ABX test, which gives a Bayes Factor of 10^1.267 = 18.5.

Posterior Odds = 18.5 * 0.111... = 2.056

So the odds for this person doing better than chance (M1 over M0) are about 2:1.

Let's say the person does another 9/10, so 18/20 in total for a Bayes Factor of 525.5, resulting in odds of about 58:1.
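The odds bookkeeping above is just a couple of multiplications; a short Python sketch using the Bayes Factors from the tables (variable names are mine):

```python
prior_odds = 0.1 / (1 - 0.1)     # about one in ten people can hear it

bf_9_of_10 = 10 ** 1.267         # about 18.5, from the 10-trial table
print(bf_9_of_10 * prior_odds)   # about 2.05, i.e. roughly 2:1

bf_18_of_20 = 10 ** 2.721        # about 526, from the 20-trial table
print(bf_18_of_20 * prior_odds)  # about 58.4, i.e. roughly 58:1
```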

Please consider that a high BF does not guarantee that a difference was heard. Again, we all know that various problems can creep into such a test and render the results meaningless. How much evidence should be demanded also depends on the claim: Evett (1991), for example, has argued that forensic evidence alone should carry a BF of at least 1000 to count against innocence in a criminal trial. And even a BF of 1000 can still be too little evidence for an extraordinary claim.

My 2¢ on this is that we want to see strong evidence or better for simple ABX tests. (Whether to take results seriously depends on much more than just this single number however.) Especially with higher trial counts this turns out to be more demanding than a 5% significance level.

Re: Bayes Factors for ABX tests

Thanks for a very interesting post. It's a shame I only saw this recently.

It is my opinion too that posterior odds are a more intuitive way to interpret ABX test results. These days I find that the more traditional p-value approach makes less sense to me the more I think about it.

Re: Bayes Factors for ABX tests

It's not just about the odds. You can convert any probability into odds if you like.

The important difference is that in the frequentist approach you calculate the probability of the data given/assuming that no difference was heard (the null hypothesis H0). If the data are sufficiently unlikely, we reject the hypothesis which we assumed to be true for the calculation in the first place.

But in the Bayesian approach you calculate the probability that no difference was heard given the data, or that a difference was correctly identified given the data. Furthermore, Bayes Factors tell you how strongly the data (the test results) support one model or hypothesis over another.
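The contrast is easy to see numerically for the 9/10 case, reusing the tabled marginal likelihoods and assuming (purely for illustration) 1:1 prior odds:

```python
from math import comb

n, x = 10, 9

# Frequentist: P(same or more extreme result | H0).
p_value = sum(comb(n, k) for k in range(x, n + 1)) * 0.5 ** n

# Bayesian: P(M0 | data) with 1:1 prior odds, using the tabled values.
p_x_m0 = 9.766e-3
p_x_m1 = 1.808e-1
posterior_m0 = p_x_m0 / (p_x_m0 + p_x_m1)

print(round(p_value, 3))       # 0.011
print(round(posterior_m0, 3))  # 0.051 -- a different quantity, a different value
```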

And lastly it incorporates prior knowledge: extraordinary claims require extraordinary evidence. A 10/10 result for the claim that a losslessly compressed file sounds different from an uncompressed one will not convince anyone, and no one should believe that claim based on that evidence alone. But if the claim is that there's an audible difference between a lossless file and a 64 kbps MP3, then there's no real need for further evidence. (Although more data is always nice...)

Re: Bayes Factors for ABX tests

Quote: "It's not just about the odds. You can convert any probability into odds if you like."

Yes. But with Bayes factors you can actually have a probabilistic statement about whether you were able to tell a difference.

When the Foobar ABX comparator says "probability you were guessing is x", this is not technically correct. As far as I can tell Foobar does not return the Bayesian "probability you were guessing", but rather a frequentist p-value. In a frequentist setting you cannot assign probabilities to fixed unknowns.

The Bayesian takes the simpler view that all unknowns are random. Therefore any unknown quantity can be assigned a probability.

Re: Bayes Factors for ABX tests

Remember that a high Bayes Factor alone doesn't tell you that the model is true. It just tells you that it's a much better fit for the data. But there's the simple relation: Posterior Odds = BF * Prior Odds.

In the case of total ignorance (which realistically is as good as never the case) you'd start with 1:1 prior odds, and then feed the resulting odds as prior odds into the same formula again for each repetition of the test. (If you plotted the probability distributions you'd also see the uncertainty decrease with each repetition.)
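A numeric sketch of this bookkeeping, using the tabled Bayes Factors. One subtlety worth a comment: naively squaring the 9/10 Bayes Factor (18.5^2, about 342) does not reproduce the 18/20 value of about 526, because under M1 the prior on p is itself updated by the first batch, so the second batch's BF must be computed conditional on the first:

```python
# Start from 1:1 prior odds (total ignorance).
bf_9_of_10 = 10 ** 1.267              # from the 10-trial table
odds_after_first = bf_9_of_10 * 1.0   # about 18.5

# After a second 9/10, the combined evidence equals the single
# 18/20 Bayes Factor:
bf_18_of_20 = 10 ** 2.721             # from the 20-trial table
odds_after_second = bf_18_of_20 * 1.0 # about 526

# Note: bf_9_of_10 ** 2 (about 342) != bf_18_of_20 (about 526),
# because M1's prior on p is updated between the two batches.
```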

You're right that what the ABX comparator says isn't right; there's another thread about it somewhere. What it reports is the probability of getting the same or a more extreme result given that no difference was heard, i.e. that the choices were made with random fair coin flips.

Re: Bayes Factors for ABX tests

If you want to be really (pedantically) Bayesian you can say that there is no objective truth when dealing with the unknowable and hence there is no underlying "true model". This is getting into troll territory though.

I'm guessing you know the concept of prior sample size. With the uniform prior in your post you have a prior sample size of two: one success and one failure. You can even have a prior sample size of one if you use an appropriately scaled Beta(0.5, 0.5) prior, which can be shown to minimize the influence of the prior on the posterior distribution. In this way you can maximize the objectivity of the test if you want to. https://en.wikipedia.org/wiki/Beta_distribution
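The prior-sample-size idea comes from Beta-binomial conjugacy: a Beta(a, b) prior plus x successes in n trials yields a Beta(a + x, b + n - x) posterior, so a and b act like a + b previously observed trials. A quick sketch (helper name is mine):

```python
def beta_update(a, b, x, n):
    # Beta(a, b) prior + x successes in n trials -> Beta posterior parameters.
    return a + x, b + (n - x)

# Uniform Beta(1, 1) prior (prior sample size 2), then 9/10 correct:
a, b = beta_update(1, 1, 9, 10)
print(a, b, a / (a + b))  # Beta(10, 2), posterior mean about 0.833
```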

Re: Bayes Factors for ABX tests

EekWit, pedantically we can never arrive at truth using such tests and evaluations. But we can get to odds "beyond reasonable doubt" either way. That's just life... we cannot easily prove things here the way we can in axiomatic systems.

On the uniform prior: yeah, I was not precise enough when I spoke about "total ignorance". The flat, or Beta(1, 1), prior contains the knowledge that a trial can both fail and succeed. I think that's a very reasonable and sensible assumption for an ABX test. It also follows the principle of indifference: from .45 to .55 you get the same probability as from .9 to 1, which is - big surprise - 10%.

Beta(0, 0) could be interpreted as: either the trials always fail or they always succeed. This would make more sense e.g. for testing whether a chemical reaction happens or not: 0% probability everywhere except for the 100% at the two extremes.

Beta(0.5, 0.5) could be interpreted as: we don't know whether it's possible for trials to both fail and succeed. But that gets you 6% from .45 to .55 and 20% from .9 to 1.

This prior could make sense in a situation where you didn't know what kind of proportion between 0 and 1 you're dealing with (it could be linear, could be logarithmic...) and want to minimize the effects of the prior. And there are many other attempts at "objective" or "uninformative" or "diffuse" priors, since the Jeffreys prior is not without problems and can even lead to inconsistent results, but that's a complex topic.
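The interval probabilities quoted above are easy to verify: Beta(0.5, 0.5) is the arcsine distribution, whose CDF has the closed form (2/pi) * asin(sqrt(p)). A small Python check (helper names are mine):

```python
from math import asin, pi, sqrt

def beta_half_cdf(p):
    # CDF of Beta(0.5, 0.5), the arcsine distribution.
    return (2 / pi) * asin(sqrt(p))

def mass(a, b):
    # Probability assigned to the interval [a, b] by Beta(0.5, 0.5).
    return beta_half_cdf(b) - beta_half_cdf(a)

print(round(mass(0.45, 0.55), 2))  # 0.06 -- vs. 0.10 under the flat prior
print(round(mass(0.90, 1.00), 2))  # 0.20 -- vs. 0.10 under the flat prior
```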

Re: Bayes Factors for ABX tests


Thank you. This is an excellent explanation of how to interpret "uninformative" beta priors. I agree that for ABX a uniform prior makes more sense, because we know it is always possible to respond correctly or incorrectly.