How to Weigh a Mountain of Evidence: Guest Blog by Professor Paul McKeigue (Part 1)

The first of two articles explaining how to assess different theories and applying these principles to the Ghouta and Khan Sheikhoun cases. These articles were written by Professor Paul McKeigue and have been reproduced with the permission of Professor Tim Hayward, who first published them on his blog.

In this first of a two-part special guest blog, my colleague Paul McKeigue will explain how the formal logic of probability theory can be used to evaluate the evidence for alternative explanations of an event like the Khan Sheikhoun chemical incident in April this year. As an epidemiologist and genetic statistician, Paul is expert in this approach.

Paul picks up on the recent debate between George Monbiot and myself (here and here) observing how we were somewhat at cross-purposes. George was insisting that I offer a competing explanation to the ‘Assad did it’ story, but I declined to speculate, having no independent way of knowing. Because George believed a “mountain” of evidence supported his belief, he found it vexing that I should question it without venturing a specific alternative explanation.

Paul points the way forward by arguing that the logic of probability theory implies that you cannot evaluate the evidence for or against a single hypothesis, but only the evidence favouring one hypothesis over another. He shows, with simple examples, how an observation that is consistent with a hypothesis does not necessarily support that hypothesis against an alternative; in fact, an observation that is highly unlikely under one hypothesis may still support that hypothesis if it is even more unlikely under an alternative.

In this framework, then, evidence for a claim that the Syrian government carried out a chemical attack in Khan Sheikhoun cannot be evaluated except by comparison with an alternative explanation. The problem for anyone who formulates such an alternative explanation is that, in the current climate, they are likely to be denounced as a “conspiracy theorist”. Paul shows, however, that you cannot evaluate evidence without envisaging what you would expect to observe if each of the alternative hypotheses were true. This inevitably requires you to ‘speculate’: it doesn’t mean that you endorse any of the alternative hypotheses.

In Part 1 posted here below, Paul explains the approach and demonstrates how it can be applied to evaluate the evidence for alternative explanations of the alleged chemical attack in Ghouta in 2013. We welcome discussions of the approach in the comments. In Part 2, he will examine the evidence for alternative explanations of the Khan Sheikhoun incident.

Paul McKeigue

Using probability calculus to evaluate evidence for alternative hypotheses, including deception operations

In this post I will try to show how the formal framework of hypothesis testing based on probability theory is able to separate subjective beliefs about the plausibility of alternative explanations, on which we can agree to differ, from the evaluation of the weight of evidence supporting each of these alternative explanations, on which it should be easier to reach a consensus. We can then begin to apply this to the Syrian conflict.

Although the mathematical basis for using evidence from observations to update the probability of a hypothesis was first set out by the 18th century clergyman Thomas Bayes, the first practical use of this framework was for cryptanalysis by Alan Turing at Bletchley Park. This was later elaborated by his assistant Jack Good as a general approach to evaluating evidence and testing hypotheses. This approach to testing hypotheses has been standard practice in genetics since the 1950s, and has spread into many other fields of scientific research, especially astronomy. It underlies the revolution in machine learning and artificial intelligence that is beginning to transform our lives. Although the practical usefulness of the Bayes-Turing framework is not in question, usefulness alone does not prove that it is the only logical way to evaluate evidence. The basis for that stronger claim was provided by the physicist Richard Cox, who showed that degrees of belief must obey the mathematical rules of probability theory if they satisfy simple rules of logical consistency. Another physicist, Edwin Jaynes, drew together the approach developed by Turing and Good with Cox’s proof to develop a philosophical framework for using Bayesian inference to evaluate uncertain propositions. In this framework, Bayesian inference is just an extension of the ordinary rules of logic to manipulating uncertain propositions; any other way of evaluating evidence would violate rules of logical consistency. There are too many names (not limited to Bayes, Turing, Good, Cox and Jaynes) attached to the development of this framework to name it after all of them, so I’ll follow Jaynes and just call it probability calculus.

The objective of this post and the one that follows is to show you, the reader, how to evaluate evidence for yourself using simple back-of-the-envelope calculations based on probability calculus.

Some fundamental principles of probability calculus can be expressed without using mathematical language:-

For two alternative hypotheses, H1 and H2, the evidence favouring H1 over H2 is evaluated by comparing how well H1 would have predicted the observations with how well H2 would have predicted the observations.

We cannot evaluate the evidence for or against a single hypothesis, only the evidence favouring one hypothesis over another.

The evidence favouring one hypothesis over another can be calculated without having to specify your prior degree of belief in which of these two hypotheses is correct. Two people may have different priors, but their calculations of the strength of evidence favouring one hypothesis over another should agree if they agree on what they would expect to observe if each of these hypotheses were true.

For a light-hearted tutorial in how to apply these principles in everyday life, try this exercise.

To take the argument further, I need to explain some simple maths. If you already have a basic grounding in Bayesian inference, you can skip to the next section. Otherwise, you can work through the brief tutorial below, or try an online tutorial like this one.

Before you have seen the evidence, your degree of belief in which of these alternatives is correct can be represented as your prior odds. For instance if you believe H1 and H2 are equally probable, your prior odds are 1 to 1, or even odds in everyday language. After you have seen the evidence, your prior odds are updated to become your posterior odds.

Bayes’ theorem specifies how evidence updates prior odds to posterior odds. The theorem can be stated in the form:-

prior odds × likelihood ratio = posterior odds

Your prior odds encode your degree of belief favouring H1 over H2, before you have seen the observations. Priors are subjective: one person may assign prior odds of 100 to 1 favouring H1 over H2, while another may believe that both hypotheses are equally probable.

The likelihood of a hypothesis is the conditional probability of the observations given that hypothesis. To evaluate it, we have to envisage what would be expected to happen if the hypothesis were true. We can think of the likelihood as measuring how well the hypothesis can predict the observation.

Likelihoods of hypotheses measure the relative support for those hypotheses; they are not the probabilities of those hypotheses.

The ratio of the likelihood of H1 to the likelihood of H2 is called the Bayes factor or simply the likelihood ratio. In recognition of his mentor, Good called it the “Bayes-Turing factor”.

It is only through the likelihood ratio that your prior odds are modified by evidence to posterior odds. All the evidence on whether the observations support H1 or H2 is contained in the likelihood ratio: this is the likelihood principle.

Examples

You have two alternative hypotheses about a coin that is to be tossed: H1 that the coin is fair, and H2 that the coin is two-headed. In most situations your prior belief would be that H1 is far more probable than H2. Given the observation that the coin comes up heads when tossed once, the likelihood of a fair coin is 0.5 and the likelihood of a two-headed coin is 1. The likelihood ratio favouring a two-headed coin over a fair coin is 2. This won’t change your prior odds much. If, after the first ten tosses, the coin has come up heads every time, the likelihood ratio is 2^10 = 1024, perhaps enough for you to suspect that someone has got hold of a two-headed coin.
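The coin example can be checked with a few lines of Python. This is only a minimal sketch: the function name is made up for illustration, and it assumes every toss came up heads, as in the example.

```python
from fractions import Fraction


def likelihood_ratio_two_headed_vs_fair(n_heads: int) -> Fraction:
    """Likelihood ratio favouring a two-headed coin over a fair coin,
    given that n_heads tosses in a row all came up heads."""
    likelihood_fair = Fraction(1, 2) ** n_heads  # P(all heads | fair coin)
    likelihood_two_headed = Fraction(1)          # P(all heads | two-headed coin)
    return likelihood_two_headed / likelihood_fair


lr_one_toss = likelihood_ratio_two_headed_vs_fair(1)
lr_ten_tosses = likelihood_ratio_two_headed_vs_fair(10)
print(lr_one_toss, lr_ten_tosses)  # 2 1024
```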

Hypothesis H1 is that all crows are black (as in eastern Scotland), and hypothesis H2 is that only 1 in 8 crows are black (as in Ireland where most crows are grey). The first crow you observe is black. Given this single observation, the likelihood of H1 is 1, and the likelihood of H2 is 1/8. The likelihood ratio favouring H1 over H2 is 8. So if your prior odds were 2 to 1 in favour of H1, your posterior odds, after this first observation, will be 16 to 1. This posterior will be your prior when you next observe a crow. If this next crow is also black, the likelihood ratio contributed by this observation is again 8, and your posterior odds favouring H1 over H2 will be updated to (16×8=128) to 1.
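The step-by-step updating in the crow example can be sketched as a loop in which each posterior becomes the prior for the next observation. The helper name `update_odds` is an assumption made for this illustration; the numbers are those of the example.

```python
from fractions import Fraction


def update_odds(prior_odds: Fraction, likelihood_h1: Fraction,
                likelihood_h2: Fraction) -> Fraction:
    """Posterior odds of H1 over H2 = prior odds x likelihood ratio."""
    return prior_odds * likelihood_h1 / likelihood_h2


odds = Fraction(2, 1)        # prior odds of 2 to 1 favouring H1
p_black_h1 = Fraction(1)     # P(crow is black | all crows are black)
p_black_h2 = Fraction(1, 8)  # P(crow is black | 1 in 8 crows are black)

history = []
for _ in range(2):           # two black crows observed in turn
    odds = update_odds(odds, p_black_h1, p_black_h2)
    history.append(odds)

print([int(o) for o in history])  # [16, 128]
```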

Bayes’ theorem can be expressed in an alternative form by taking logarithms. If your maths course didn’t cover logarithms, don’t be put off. To keep things simple, we’ll work in logarithms to base 2. The logarithm of a number is defined as the power to which 2 must be raised to equal the number. So for instance the logarithm of 8 is 3 (2 to the power of 3 equals 8). The logarithm of 1/8 is minus 3, and the logarithm of 1 is zero. Taking logarithms replaces multiplication and division by addition and subtraction, which is why if you went through secondary school before the arrival of cheap electronic calculators you were taught to use logarithms for calculations. However, logarithms are not just for calculations but fundamental to using maths to solve problems in the real world, especially those that have to do with information. I was dismayed to find that here in Scotland, where logarithms were invented, they have disappeared from the national curriculum for maths up to year 5 of secondary school.

The logarithm of the likelihood ratio is called the weight of evidence favouring H1 over H2. As taking logarithms replaces multiplying by adding, we can rewrite Bayes’ theorem as

prior weight + weight of evidence = posterior weight

where the prior weight and posterior weight are respectively the logarithms of the prior odds and posterior odds. If we use logarithms to base 2, the units of measurement of weight are called bits (binary digits).

So we can rewrite the crow example (prior odds 2 to 1, likelihood ratio 8, posterior odds 2×8=16) as

prior weight = 1 bit (2^1 = 2)

weight of evidence = 3 bits (2^3 = 8)

posterior weight = 1 + 3 = 4 bits

One advantage of working with logarithms is that it gives us an intuitive feel for the accumulation of evidence: weights of evidence from independent observations can be added, just like physical weights. Thus in the coin-tossing example above, after one toss of the coin has come up heads the weight of evidence is one bit. After the first ten coin tosses have come up heads, the weight of evidence favouring a two-headed coin is 10 bits. As a rule of thumb, 1 bit of evidence can be interpreted as a hint, 2 to 3 bits as weak evidence, 5 to 6 bits as modest evidence, and anything more than that as strong evidence.
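The additive form can be checked numerically. The sketch below uses a hypothetical helper, `weight_of_evidence_bits`, and assumes, as the text does, that the coin tosses are independent, so that their weights of evidence simply add.

```python
import math


def weight_of_evidence_bits(likelihood_a: float, likelihood_b: float) -> float:
    """Weight of evidence (in bits) favouring hypothesis A over hypothesis B."""
    return math.log2(likelihood_a / likelihood_b)


# Crow example: prior odds 2 to 1, likelihood ratio 8.
crow_prior_bits = math.log2(2)
crow_evidence_bits = weight_of_evidence_bits(1, 1 / 8)
crow_posterior_bits = crow_prior_bits + crow_evidence_bits

# Coin example: ten independent heads, each worth one bit
# in favour of a two-headed coin over a fair coin.
coin_bits = sum(weight_of_evidence_bits(1, 0.5) for _ in range(10))

print(crow_prior_bits, crow_evidence_bits, crow_posterior_bits)  # 1.0 3.0 4.0
print(coin_bits)  # 10.0
```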

Hempel’s paradox

An observation that is consistent with a hypothesis is not necessarily evidence in favour of that hypothesis.

Good showed that this is not a paradox, but a corollary of Bayes’ theorem. To explain this, he constructed a simple example (I have changed the numbers to make it easier to work in logarithms to base 2). Suppose there are two Scottish islands denoted A and B. On island A, there are 2^15 birds of which 2^6 are crows and all these crows are black. On island B, there are 2^15 birds of which 2^12 are crows and 2^9 of these crows (that is, one eighth of all crows) are black. You wake up on one of these islands and the first bird that you observe is a black crow. Is this evidence that you are on island A, where all crows are black?

You can’t do inference without making assumptions. I’ll assume that on each island all birds, whatever their species or colour, have equal chance of being seen first. The likelihood of island A, given this observation, is 2^6/2^15 = 2^-9. The likelihood of island B is 2^9/2^15 = 2^-6. The weight of evidence favouring island B over island A is [−6−(−9)] = 3 bits. So the observation of a black crow is evidence, albeit weak by the rule of thumb above, against the hypothesis that you are on island A where all crows are black. So, when two hypotheses are compared, an observation that is consistent with a hypothesis can nevertheless be evidence against that hypothesis.
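The island calculation can be reproduced directly from the bird counts given above. This is a minimal sketch under the same assumption that every bird is equally likely to be seen first; the variable names are chosen only for illustration.

```python
import math
from fractions import Fraction

# Bird counts as stated for each island.
birds_a, black_crows_a = 2**15, 2**6  # island A: all 2^6 crows are black
birds_b, black_crows_b = 2**15, 2**9  # island B: 2^9 of the 2^12 crows are black

# Likelihood of each island = P(first bird seen is a black crow | island).
likelihood_a = Fraction(black_crows_a, birds_a)
likelihood_b = Fraction(black_crows_b, birds_b)

# Weight of evidence (bits) favouring island B over island A.
bits_b_over_a = math.log2(likelihood_b / likelihood_a)

print(likelihood_a, likelihood_b, bits_b_over_a)
```

A positive result means that seeing a black crow favours island B, even though the observation is entirely consistent with island A.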

The converse applies: an observation that is highly improbable given a hypothesis is not necessarily evidence against that hypothesis. As an example, we can evaluate the evidence for a hypothesis that most readers will consider an implausible conspiracy theory: that the Twin Towers of the World Trade Center were brought down not by the hijacked planes that crashed into them but by demolition charges placed in advance, with the objective of bringing about a “new Pearl Harbour” in the form of a catastrophic event that would provoke the US into asserting military dominance. We’ll call the two alternative hypotheses for the cause of the collapses (plane crashes, planned demolitions) H1 and H2 respectively. The proponents of this hypothesis attach great importance to the observation that a nearby smaller tower (Building 7) collapsed several hours after the Twin Towers for reasons that are not obvious to non-experts. I have no expertise in structural engineering, but I’m prepared to go along with their assessment that the collapse of a nearby smaller tower has low probability given H1. However, I also assess that the probability of this observation given H2 is equally low. If the planners’ objective in destroying the Twin Towers was to create a catastrophic event, why would they have planned to demolish a nearby smaller tower several hours later, with the risk of giving away the whole operation? For the sake of argument, I’ll put a value of 0.05 on both these likelihoods. Note that it doesn’t matter whether the observation is stated as “collapse of a nearby tower” for which the likelihoods of H1 and H2 are both 0.05, or as “collapse of Building 7” for which (if there were five such buildings all equally unlikely to collapse) the likelihoods of H1 and H2 would both be 0.01. For inference, all that matters is the ratio of the likelihoods of H1 and H2 given this observation. If this ratio is 1, the weight of evidence favouring H1 over H2 is zero.

The conditional probabilities in this example are my subjective judgements. I make no apology for this; the logic of probability calculus says that you can’t evaluate evidence without making these subjective judgements, that these subjective judgements must obey the rules of probability theory, and that any other way of evaluating evidence violates axioms of logical consistency. If your assessment of these conditional probabilities differs from mine, that’s not a problem as long as you can explain your assessments of these probabilities in a way that makes sense to others. The general point on which I think most readers will agree is that although the collapse of a nearby smaller tower would not have been predicted from H1, it would not have been predicted from H2 either. The likelihood of a hypothesis given an observation measures how well the hypothesis would have predicted that observation.

We can see from this example that to evaluate the evidence favouring H1 over H2, you have to assess, for each hypothesis in turn, what you would expect to observe if that hypothesis were true. Like a detective solving a murder, you have to “speculate”, for each possible suspect, how the crime would have been carried out if that individual were the perpetrator. This requirement is imposed by the logic of probability calculus: complying with it does not imply that you are a “conspiracy theorist”. The principle of evaluating how the data could have been generated under alternative hypotheses applies in many other fields: for instance, medical diagnosis, historical investigation, and intelligence analysis. A CIA manual on intelligence analysis sets out a procedure for ‘analysis of competing hypotheses’ which ‘demands that analysts explicitly identify all the reasonable alternative hypotheses, then array the evidence against each hypothesis – rather than evaluate the plausibility of each hypothesis one at a time.’ I am not trying to tell people who are expert in these professions that they don’t know how to evaluate evidence. However, it can still be useful to work through the formal framework of probability calculus to identify when intuition is misleading. For instance, where two analysts evaluating the same observations disagree on the weight of evidence, working through the calculation will identify where their assumptions differ, and how the evaluation of evidence depends on these assumptions.

An interesting argument about the use of Bayesian evidence in court can be found in this judgement of the Appeal Court in 2010. In a murder trial, the forensic expert had given evidence that there was “moderate scientific support” for a match of the defendant’s shoes to the shoe marks at the crime scene, but had not disclosed that this opinion was based on calculating a likelihood ratio. The judges held that where likelihood ratios would have to be calculated from statistical data that were uncertain and incomplete, such calculations should not be used by experts to form the opinions that they presented to the court. However, the logic of probability calculus implies that you cannot evaluate the strength of evidence except as a likelihood ratio. Calculating this ratio makes explicit the assumptions that are used to assess the strength of evidence. In this case, the expert had used national data on shoe sales to assign the likelihood of the hypothesis that the shoe marks were made by someone else, given the observation that the marks were made by size 11 trainers. The conditional probability of size 11 trainers, given that the marks were made by someone else, should have been based on the frequency of size 11 trainers among people present at similar crime scenes. It was because the calculations were made available at the appeal that the judges were able to criticize the assumptions on which they were based and to overturn the conviction.

We next consider an example of Hempel’s paradox from the Syrian conflict.

Rockets used in the alleged chemical attack in Ghouta in 2013: evidence for or against Syrian government responsibility?

To explain the alleged chemical attack in Ghouta in 2013, two alternative hypotheses have been proposed, which we’ll denote H1 and H2.

H1 states that a chemical attack was carried out by the Syrian military, under orders from President Assad. The proponents of this hypothesis include the US, UK and French governments.

H2 states that a false-flag chemical attack was carried out by the Syrian opposition, with the objective of bringing about a US-led attack on the Syrian armed forces. A leading proponent of this hypothesis was the blogger “sasa wawa”, who set up a crowd-sourced investigation of the Ghouta incident. The evidence generated during this investigation was later set out in the framework of probability calculus by the Rootclaim project, founded by Saar Wilf, an Israeli entrepreneur (and noted international poker player) with a background in the signals intelligence agency Unit 8200. I think we can tentatively identify “sasa wawa”, who seemed “to have unlimited time and energy and to be some sort of polymath”, as Wilf.

Other hypotheses are possible: for instance we can define hypothesis H3 that an unauthorized chemical attack was carried out by a rogue element in the Syrian military, and hypothesis H4 that there was no chemical attack but a massacre of captives, in which rockets and sarin were used to create a false trail of evidence for a chemical attack. But for now we’ll just consider H1 and H2.

You may have strong prior beliefs about the plausibility of these two hypotheses: for instance you may believe that H1 is highly implausible because the Syrian government had no motive to carry out such an attack when OPCW inspectors had just arrived, or you may take the view that H2 is an absurd conspiracy theory requiring us to believe that the opposition carried out a large-scale chemical attack on themselves. Whatever your prior beliefs, to evaluate the evidence you must be prepared to envisage for each of these hypotheses what would be expected to happen if that hypothesis were correct, in order to compute the likelihood of that hypothesis. This requires for H1 that you put yourself in the shoes of a Syrian general ordered to carry out a chemical attack, or for H2 that you put yourself in the shoes of an opposition commander planning a false-flag chemical attack that will implicate the regime.

A key observation is credited to Eliot Higgins, who showed that the rockets examined by the OPCW inspectors at the impact sites were a close match to a type of rocket that the Syrian army had been using as an improvised short-range siege weapon. This “Volcano” rocket consisted of a standard artillery rocket with a 60-litre tank welded over the nose, giving it a heavier payload but a very short range (about 2 km).

In Higgins’s interpretation, which has been widely disseminated, this observation is evidence for hypothesis H1. Let’s apply the framework of probability calculus to compute the weight of evidence favouring H1 over H2 given this observation.

First, we compute the likelihood of H1. The Syrian military had large stocks of munitions specifically designed to deliver nerve agent at medium to long range, including missiles and air-delivered bombs together with equipment for safely filling them with sarin. Given that they had been ordered to carry out a chemical attack, I assess the probability that they would have used these purpose-designed munitions as at least 0.9. The probability that they would have used an improvised short-range siege weapon, which to reach the target would have had to be fired from the front line or from within opposition-controlled territory, is rather low: I assess this as about 0.05. This is the likelihood of H1 given the observation.

Second, we compute the likelihood of H2. Given that under this hypothesis the objective of the attack was to implicate the Syrian government, the opposition had to be able to show munitions at the sites of sarin release that could plausibly be attributed to the Syrian military. They had two possible ways to do this: (1) to fake an air strike, with fragments of air-delivered munitions matching something in the Syrian arsenal; or (2) to use rockets or artillery shells matching something in the Syrian military arsenal. Volcano rockets, either captured from Syrian army stocks or copied, would have been ideal for this. With no other reason to choose between options (1) and (2), we assign equal probabilities to them under H2. The likelihood of H2 given the observation is therefore 0.5.

The likelihood ratio favouring H2 over H1 is 10, corresponding to a weight of evidence of 3.3 bits. Your assessment of the conditional probabilities may vary from mine, but I think the general point is clear: from hypothesis H1 we would not have predicted that the Syrian military would have chosen to use an improvised chemical munition rather than their stocks of purpose-designed chemical munitions, but from hypothesis H2 we would have expected the opposition to use any munition available to them that would implicate the Syrian army. So this is a classic example of Hempel’s paradox: an observation consistent with hypothesis H1 does not necessarily support H1, but instead contributes (under a plausible specification of the conditional probabilities) weak evidence favouring the alternative hypothesis H2.
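The arithmetic behind these figures is a one-line calculation. The sketch below uses the conditional probabilities assessed in the text (0.05 under H1 and 0.5 under H2); the variable names are illustrative, and you can substitute your own assessments to see how the weight of evidence changes.

```python
import math

# Subjective conditional probabilities from the text:
# P(Volcano rockets observed | hypothesis).
likelihood_h1 = 0.05  # Syrian military attack
likelihood_h2 = 0.5   # false-flag attack by the opposition

likelihood_ratio = likelihood_h2 / likelihood_h1
weight_bits = math.log2(likelihood_ratio)

print(f"likelihood ratio favouring H2 over H1: {likelihood_ratio:.1f}")
print(f"weight of evidence: {weight_bits:.1f} bits")
```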

This also shows how, by using the framework of probability calculus, we are able to separate prior beliefs from evaluation of the weight of evidence. Your evaluation of the weight of evidence depends only on the ratios of the conditional probabilities that you specify for the observation given H1 or given H2; it does not depend on your prior odds.
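To illustrate the separation of priors from evidence, the sketch below takes two hypothetical readers with very different prior odds (100 to 1 favouring H1, and even odds; both values are assumptions chosen purely for illustration) and applies the same likelihoods from the text to each. Their posteriors differ, but the shift contributed by the evidence is identical.

```python
import math

# Conditional probabilities of the observation, as assessed in the text.
likelihood_h1 = 0.05
likelihood_h2 = 0.5
evidence_bits = math.log2(likelihood_h2 / likelihood_h1)  # favouring H2 over H1

# Hypothetical prior odds of H2 over H1 for two readers (illustrative only).
prior_odds_h2_over_h1 = {"reader_a": 1 / 100, "reader_b": 1.0}

shift_bits = {}
for reader, prior_odds in prior_odds_h2_over_h1.items():
    prior_bits = math.log2(prior_odds)
    posterior_bits = prior_bits + evidence_bits  # Bayes' theorem in log form
    shift_bits[reader] = posterior_bits - prior_bits
    print(f"{reader}: prior {prior_bits:+.2f} bits, "
          f"posterior {posterior_bits:+.2f} bits")
```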

Comparison with Rootclaim’s evaluation of the weight of evidence contributed by the rockets

As discussed above, where two analysts disagree on evaluating the weight of evidence contributed by the same observations, using probability calculus allows us to identify exactly where their assumptions differ. For the weight of evidence favouring H2 over H1 given the observation of Volcano rockets, I assigned a value of 3.3 bits. When I made this assessment I had not looked at Rootclaim’s assessment, which assigns a value of minus 0.5 bits for the weight of evidence favouring H2 over H1.

Let’s see how Rootclaim’s assumptions differ from mine. For the probability of observing Volcano rockets given H1, Rootclaim assigns the same value (0.05) as I have. However Rootclaim assigns a value of only 0.036 to the probability of observing Volcano rockets under H2. Rootclaim obtains this value by multiplying together a probability of 0.4 that an opposition group would capture Volcano rockets, a probability of 0.3 that another opposition group with access to sarin would find this group, and a probability of 0.3 that these two groups would figure out how to fill the munition with sarin.
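Rootclaim’s figure can be reconstructed from the three probabilities quoted above. This is a sketch of the arithmetic only, using the values as reported in the text; the variable names are illustrative and are not Rootclaim’s own.

```python
import math

# Rootclaim's component probabilities under H2, as quoted in the text.
p_capture_rockets = 0.4    # an opposition group captures Volcano rockets
p_groups_connect = 0.3     # a group with access to sarin finds that group
p_fill_with_sarin = 0.3    # the two groups work out how to fill the munition

rootclaim_likelihood_h2 = p_capture_rockets * p_groups_connect * p_fill_with_sarin
likelihood_h1 = 0.05       # same value assigned by both analyses

rootclaim_bits = math.log2(rootclaim_likelihood_h2 / likelihood_h1)

print(f"P(Volcano rockets | H2), Rootclaim: {rootclaim_likelihood_h2:.3f}")
print(f"weight of evidence favouring H2 over H1: {rootclaim_bits:.1f} bits")
```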

I think Rootclaim’s assignment of the conditional probability of observing Volcano rockets given H2 does not correctly condition on what is implied by H2. Under hypothesis H2, the purpose of the operation is to implicate the Syrian government. The conditional probability of observing Volcano rockets given H2 is the probability that these rockets would be found given that the opposition plans to release sarin and to leave a false trail of evidence implicating the Syrian military. To release sarin the opposition has to figure out some way to fill munitions (rockets or improvised devices) with it. To implicate the Syrian military, the opposition has to use a munition (captured or copied) that matches something in the Syrian military’s arsenal. The only choice for the opposition, given hypothesis H2, is whether to use a munition fired from the ground, like the Volcano rocket, or to use remnants of an air-dropped bomb with an improvised chemical-releasing device. With nothing to choose between these two options given H2, I have assigned them equal probabilities.

Without these calculations being stated explicitly, there would have been no way for you, the reader, to evaluate the difference between my assessment that the rockets contribute weak evidence in favour of hypothesis H2 and Rootclaim’s assessment that the rockets contribute practically no evidence favouring either hypothesis. By working through the formal framework of probability calculus, you can see that this difference arises because my assignment of the likelihood of H2 is based on assumptions about the purpose of the deception operation that is implied by hypothesis H2.

This example illustrates a more general principle: to evaluate the likelihood of a hypothesis that implies a deception operation, we must condition on what that deception would entail.

Evidence contributed by the non-occurrence of an expected event

To evaluate all relevant evidence, we must include the non-occurrence of events that would have been expected under at least one of the alternative hypotheses. This is the principle set out in the case of “the curious incident of the dog in the night-time”: Holmes noted that the observation that the dog did not bark had low probability given the hypothesis of an unrecognized intruder, but high probability given the hypothesis that the horse was taken by someone that the dog knew.

From the alleged chemical attack in Ghouta, a “dog did not bark” observation is that despite the mass of stills and video clips uploaded that showed victims in hospitals or morgues, no images have appeared that showed victims being rescued in their homes or bodies being recovered from affected homes. The only images from Ghouta purporting to show victims being found where they had collapsed were obviously fraudulent, showing nine alleged victims of chemical attack dead in the stairwell of an unfinished building named the “Zamalka Ghost House” by researchers.

As an exercise, you can assess the likelihoods of each of the following hypotheses, given the observation that no images showing rescue of victims in their homes, or recovery of bodies of people who had died in their homes were made available. To put this observation in context, this page lists more than 150 original videos uploaded, most showing victims in hospitals or morgues, attributed to 18 different opposition media operations.

H1: a chemical attack was carried out by the Syrian military, authorized by the government

H2: a false-flag chemical attack was carried out by the Syrian opposition to implicate the government

H3: an unauthorized chemical attack was carried out by a rogue element in the Syrian military

H4: there was no chemical attack but a managed massacre of captives, with rockets and sarin used to create a trail of forensic evidence that would implicate the Syrian government in a chemical attack.

Given each of these hypotheses in turn, what do you assess to be the conditional probability that none of the uploaded videos would show the rescue of victims in their homes or the recovery of bodies of people who had died in their homes?

In the next post we shall explore how to apply the formal framework of probability calculus to evaluate the weight of evidence for alternative explanations of the alleged chemical attack in Khan Sheikhoun.
