Statistics and the Test of Natasha

June 7, 2005

This commentary is a supplement to my previous report on CSICOP’s test of Natasha Demkina’s claim (Hyman, 2005). It is also a response to the many criticisms of our test. Some criticisms were reactions to the airing of the Discovery Channel’s The Girl with the X-ray Eyes. At the time of this writing, this television program had aired a few times in Europe and Asia, but not in the United States. Nevertheless, it generated debate in the media and on the Internet. The reports of the test by Andrew Skolnick (2005) and me provoked further emails and letters. Many focused on the statistical aspects of our test.

The initial drafts of my report on testing Natasha described the reasoning behind the statistics. My colleagues persuaded me to omit this statistical discussion. They said that such details would confuse the reader. Ironically, the majority of the critics focus on the statistics. The commentators make the following assertions: 1) our test set the critical value for “success” unreasonably high; 2) our test lacked sufficient power to detect a real effect; 3) Natasha’s four correct matches out of seven should have sufficed to reject the null hypothesis (i.e., to justify a conclusion that the outcome was not due to chance).

Before I respond to these statistical issues, I will put our test in its proper context. A few critics said that we should never have conducted the test. The conditions were such that the test was bound to be flawed. I sympathize with this viewpoint. However, I would emphasize the following points:

When I was asked to help design and conduct the test, CSICOP had already accepted the producer’s invitation to supervise the test. Cancelling the test was not an option.

My colleagues, Andrew Skolnick and Richard Wiseman, agreed with me that, given the circumstances, we could not conduct a “definitive” test of Natasha’s claim.

Instead, we could use the test for a preliminary screening. This screening could tell us whether we should continue investigating Natasha’s claim. If the outcome showed that continuing was worthwhile then we would have gone to the next step. We would have advocated investigating Natasha’s claim with more sophisticated and costly procedures.

Although Natasha apparently can “see” through many kinds of cloth (her subjects are fully clothed when she diagnoses them) she would not let us test her with an opaque screen between her and the subject. Within the parameters of our test, the best we could do would be to reduce possible external clues. We could do this by selecting subjects who were similar in age, apparent health, and other potential external clues about their internal state. In addition, we could try to keep the subjects from behaving in ways that might provide indications about their conditions. Such precautions, however, would not exclude the possibility of her picking up external, possibly subtle, indications of the subjects’ internal states.

Ideally, our test would allow us to decide whether the number of correct matches excluded the possibility both of chance and of the use of external clues. We recognized that this was not possible in our situation. So we devised our test to decide between two alternatives: 1) Natasha could get a sufficient number correct to make it worthwhile to pursue her claim with more adequate procedures; 2) Natasha could not get a sufficient number correct even in a task in which we stacked the deck in her favor. In the latter case, we would have no reason to pursue her claim further.

We made it clear to the producer what our test could and could not achieve. We expected that the program would make this clear to the viewers. When the producer interviewed me after the test, I emphasized that the test was simply a preliminary probe to see if Natasha’s claim is worth additional investigation. To our dismay, the television program did not include our warnings about the limitations of our test. The creators of the program did many things well. By failing to include our cautions about the limitations of the test, however, they added to the confusions and misunderstandings.

The critics of our test, for the most part, claim that the outcome of our test does provide justification for taking her claim seriously. These critics believe that getting four correct matches in our test is impressive. So I will explain why this is not so.

Establishing the Test Criterion

Some critics said that we had used the wrong probabilities. These critics apparently failed to realize that our test involved the matching problem. Consequently, their suggested alternatives were simply wrong. My commentary deals with those criticisms that recognized that our test involved the matching problem. The calculations of the probabilities for the matching problem are tricky. My main sources were Feller (1950) and Mosteller (1965).1 Richard Wiseman confirmed the results of my calculations with tables of probabilities for the matching problem that he found in the Journal of the Society for Psychical Research.

The most difficult problem for me was finding a way to specify the matching distribution for the case where the expected mean was other than one. In our case, I wanted the distribution for the case where the expected value would be five. Fortunately, Persi Diaconis and Susan Holmes, both professors in the Department of Statistics at Stanford University, came to my rescue. They calculated the Bayes factors I required.

Conducting a good test involves two phases. The first is the design of the test. The second is the execution of the test. Although we had only a month to prepare, Andrew Skolnick, Richard Wiseman and I thought very carefully about the design. We agreed that the matching procedure was appropriate. Initially, we planned to use five subjects. Circumstances made it much harder to use a larger number. The requirements of the television show would not allow us to have Natasha evaluate many subjects. The logistics of finding a suitable set of subjects further restricted the number we could use. We also wanted to avoid the possibility of overworking Natasha.

However, I quickly realized that five subjects would not be enough to provide a reasonable opportunity for Natasha to display her powers. I insisted that we needed at least seven subjects for our test to have sufficient power. This placed even more demands upon Andrew and the people who were helping him to obtain subjects. Getting this number of suitable subjects within the available time frame required heroic efforts. It also clashed with our desire to find a group that differed in internal conditions, yet looked similar in outward appearance.

Andrew, Richard and I each independently agreed that the critical value for our test should be five correct matches. Interestingly, we each chose the same critical value, but for different reasons. Given the limitations of our test and the nature of Natasha’s claim, Richard wanted the outcome to be large before he would recommend continued investigation. Andrew was concerned with the practical implications. He wanted Natasha’s accuracy to be sufficiently high to justify her making medical diagnoses. I arrived at the critical value of five using Bayesian considerations which I will describe later in this commentary.

I believe the design was adequate. The execution of the test was less so. We had to rely on third parties to arrange crucial aspects of the test. These included finding a location for the test and assembling a suitable group of subjects. Despite the best efforts of our volunteers, we encountered last-minute problems. Two of the subjects dropped out of the test a few hours before it was to begin. Adding two subjects at the last moment aggravated our problem of getting a group that was homogeneous in outward appearance.

The Matching Test

The table below lists the probabilities for each possible outcome of our test. These outcomes range from 0 to 7. Note that six correct matches cannot occur in this test. I will leave it to the reader to figure out why. For our test, the third column is the most relevant. It gives the probability, for a given outcome, of getting that number or more correct if the matches are due to random guessing. We chose five as the critical value. The table indicates that the probability of getting five or more correct just by guessing is .0044. This is approximately one out of 227. Some of our critics presumably consider this overly stringent.

Natasha got four correct matches. The table shows that the probability of getting four or more correct is .0183. This is roughly two out of 100. Our critics argue that the probability of getting four or more correct is sufficiently low that we should have rejected the null hypothesis. These critics probably are operating under one or more of the following assumptions: 1) an outcome that has a probability less than .05 of occurring just by chance should be declared “significant” (this is a conventional procedure within the null hypothesis testing framework); 2) the probabilities listed in the last column of this table are the appropriate probabilities for judging if Natasha’s claim is true; and 3) the outcomes are due either to chance or to Natasha’s X-ray vision. The third assumption, as I have already discussed, is questionable. So are the other two.

Number of Correct    Probability    Probability of y or
Matches (y)          of y           More Correct Matches
0                    .3679          1.0000
1                    .3681           .6322
2                    .1833           .2641
3                    .0625           .0808
4                    .0139           .0183
5                    .0042           .0044
6                    0               .0002
7                    .0002           .0002
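
These probabilities can be checked by brute force: with seven subjects there are only 7! = 5,040 possible match-ups, so we can enumerate them all and count correct matches directly. The following sketch is not part of the original analysis; it simply reproduces the table (to rounding):

```python
from itertools import permutations

n = 7
perms = list(permutations(range(n)))  # all 5,040 ways to pair cards with subjects

# Count permutations with exactly y fixed points (correct matches)
counts = [0] * (n + 1)
for p in perms:
    fixed = sum(1 for i, v in enumerate(p) if i == v)
    counts[fixed] += 1

prob = [c / len(perms) for c in counts]        # P(exactly y correct)
tail = [sum(prob[y:]) for y in range(n + 1)]   # P(y or more correct)

for y in range(n + 1):
    print(f"{y}  {prob[y]:.4f}  {tail[y]:.4f}")
```

Note that the enumeration confirms that six correct matches cannot occur: if six cards are matched correctly, the seventh must be as well.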

The Null Hypothesis Test

The critics of our test seem to be operating within the Null Hypothesis Test (NHT) framework. This is understandable, because this framework has dominated the testing of hypotheses in the social and biological sciences for the past 75 years. Throughout this history, the NHT has been controversial. You can find an accessible introduction to this matter in Pigliucci (2002) and Stenger et al. (2003). I did not use the NHT framework to choose the criterion for our test. However, I will use this framework to discuss the rationale for choosing the critical value.

The NHT framework is the one most likely to be familiar to the critics. Our choice of five correct matches for the critical value makes sense in this framework and in others. As the name suggests, the NHT involves setting up a hypothesis to be tested. Usually the hypothesis is that the outcome will be consistent with chance. This null hypothesis has been called a “straw man” because the investigator usually wants to knock it down. In our test, the null hypothesis is that the outcome comes from a distribution that would result if the matches were just random guesses. The distribution on the null hypothesis for our test is the one given in the preceding table. If the null hypothesis is true in our situation, then, on average, we would expect Natasha to get one correct match. However, even if the null hypothesis is true, she could achieve any one of the possible outcomes from zero to seven. Some of these outcomes are much less likely to occur than others. Outcomes of zero, one, or two correct matches have high probabilities of occurring. Outcomes of five or seven correct are highly unlikely to occur.

R.A. Fisher introduced NHT in 1925. The underlying logic involved computing the probability of an observed result given that the null hypothesis is true. If this probability is sufficiently small, then the researcher can reject the null hypothesis. I want to emphasize two points about the NHT. Usually, the experimenter has framed the test so that he or she hopes to reject the null hypothesis. If the researcher can reject the null hypothesis, he or she states that the outcome is “significant.”2 The second point involves the critical region or “level of significance.” Fisher suggested using the .05 level for significance, and subsequent investigators have followed his advice.

If we had used the .05 level of significance for our test, we would have chosen four as the critical value. An outcome of three correct would be too low because when the null hypothesis is true the probability of three or more correct matches is .08. This value exceeds the .05 level. When the null hypothesis is true, the probability of four or more correct matches is .018. Because this value is less than the .05 level, this would warrant choosing four as the critical value. So we would choose four as the critical value if we were conducting our test at the conventional .05 level of significance. This is the reason that the critics are arguing that we should have declared Natasha’s performance “significant.”

However, the proponents of the NHT traditionally insisted that the use of the .05 level was warranted only if the alternative hypothesis was plausible compared with the existing knowledge in a domain. They recognized that not all hypotheses or claims are equal. The alternative hypothesis, as contrasted with the null hypothesis, is the one the investigator is typically trying to confirm. The vast majority of such hypotheses are highly plausible and consistent with the existing body of theory and data in a given area of inquiry. It was for these plausible hypotheses that the pioneers of NHT advocated using the .05 level of significance.

These pioneers recognized that implausible hypotheses needed a stronger degree of evidence. For such implausible hypotheses, the recommendation was to use a stricter level of significance such as the .01 or the .001 level of significance. Indeed, statistical textbooks contain tables not only for doing significance tests at the .05 level, but also at the .01 and the .001 level. The idea is the familiar one: extraordinary claims require extraordinary proof. In a way, these advocates of the NHT were implicitly recognizing principles that are explicit in the Bayesian approach. Proponents of NHT not only recognized that the plausibility of the alternative hypothesis had to be taken into account. They also realized that the consequences of accepting the alternative hypothesis had to be considered. Again, this implicit recognition of the utility of the test decision is consistent with its explicit recognition in Bayesian approaches. Both these considerations relate to our test.

A paranormal claim, by definition, is one that is implausible or highly unlikely given the accepted scientific framework. Even J.B. Rhine acknowledged that ESP claims need to be tested at a more stringent level than the traditional .05. (Unfortunately, contemporary parapsychologists have departed from Rhine’s advice and routinely test their paranormal claims at the .05 level.) Except for some parapsychologists, the vast majority of scientists would demand a level of significance of .001 or less. (Many physical scientists argue that such claims should be tested at an even more stringent level.) This implies that we should have used the .001 level. Unfortunately, to achieve this level of significance, we would have had to set the critical value at seven successful matches. As a compromise, I chose to use the .01 level. Setting the critical value at five or more correct matches achieves a significance level less than .01 (the probability of getting five or more correct matches given the null hypothesis is .0044). This is greater than the desired .001 level but consistent with the .01 level.
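
The rule for choosing a critical value at a given significance level can be read directly off the null distribution: take the smallest number of correct matches whose tail probability does not exceed the chosen level. A sketch using the exact matching probabilities (the `critical_value` helper is mine, for illustration only):

```python
from math import comb, factorial

n = 7

# Derangement counts via the recurrence D(m) = (m - 1) * (D(m - 1) + D(m - 2))
D = [1, 0]
for m in range(2, n + 1):
    D.append((m - 1) * (D[m - 1] + D[m - 2]))

# P(exactly y correct) = C(n, y) * D(n - y) / n!
prob = [comb(n, y) * D[n - y] / factorial(n) for y in range(n + 1)]
tail = [sum(prob[y:]) for y in range(n + 1)]

def critical_value(alpha):
    """Smallest y whose tail probability P(>= y) does not exceed alpha."""
    return min(y for y in range(n + 1) if tail[y] <= alpha)

print(critical_value(0.05))  # four correct matches, as the critics would have it
print(critical_value(0.01))  # five correct matches, the value we used
# At the .001 level, since six correct matches cannot occur,
# the only attainable qualifying outcome is a perfect seven.
```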

I decided against setting the critical level at seven because this would require Natasha to be 100% accurate in our test. We wanted to give her some leeway. More important, setting the critical value at seven would make it difficult to detect a true effect. On the other hand, I did not want to set the critical value at four because this would be treating the hypothesis that she could see into people’s bodies as if it were highly plausible. The compromise was to set the value at five. This provides reasonable protection against falsely rejecting the null hypothesis. It also provides a reasonable level of power to detect evidence in favor of the alternative hypothesis. This is our next topic.

Type I and Type II Errors

Fisher’s prescription for testing the null hypothesis recognizes only one type of error. The investigator can reject the null hypothesis when, in fact, it is true. By choosing a critical value, the experimenter can control the probability of making such an error. In typical tests of the null hypothesis, as I have discussed, this “level of significance” is set at .05. The statisticians Neyman and Pearson openly clashed with Fisher on this and other issues regarding testing hypotheses. Although the two approaches are logically incompatible, the NHT framework includes both Fisher’s and Neyman-Pearson’s procedures. Despite the continuing attacks on this hybrid of discrepant assumptions, the NHT method of testing hypotheses persists.

Neyman and Pearson argued that the testing of hypotheses could result in two types of errors. Type I errors occur when the investigator falsely rejects the null hypothesis when it is true. It is the probability of this error that the investigator controls by choosing a critical value for the test. Type II errors occur when the investigator fails to reject the null hypothesis when it is false. In most situations, the researcher has little direct control over the size of a Type II error. To determine the power of a test, the researcher needs to know the expected value and the distribution under the alternative hypothesis.

Our test uses the matching procedure. The preceding table shows the probabilities that I calculated for the null hypothesis. To estimate the power of our test, we need to specify the expected value of the outcome if the alternative hypothesis is true. The alternative hypothesis, remember, is the one that we are comparing to the null hypothesis. We also have to specify the critical value.

We set the critical value at five. This means that Natasha would have had to make five or more correct matches for the outcome to be declared “significant.” In calculating power, I assumed that if Natasha’s claim is true we could expect her to get at least five correct matches. (I explain my rationale for this in more detail below.) Under these conditions, the power is approximately .75. That is, if, in fact, the alternative hypothesis is true, the odds are better than 3:1 that our test will detect it. These odds are not as great as we would like, but they are adequate. They are much better than some critics claim.

Of course, we could have increased the power of our test even more by reducing the critical value to four. While such a reduction would have decreased the probability of a Type II error, it would have done so by increasing the probability of a Type I error. In any test of a hypothesis, the investigators must strike a balance between these two possible errors.
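
The power figure depends on the distribution assumed under the alternative hypothesis, which this commentary does not reproduce (it was supplied by Diaconis and Holmes). Purely to illustrate how such a power estimate can be computed, here is a Monte Carlo sketch under a toy model of my own devising: the claimant “sees” each condition independently with some probability (tuned here so that the expected number of correct matches is about five) and matches the remaining cards at random. The power this toy model yields depends entirely on the assumed alternative and need not equal the .75 figure quoted above:

```python
import random

random.seed(0)

N_SUBJECTS = 7
CRITICAL = 5          # critical value used in the actual test
P_SEE = 4 / 7         # hypothetical per-subject hit rate, chosen so the mean
                      # number of correct matches comes out near five
TRIALS = 200_000

def simulate_one():
    # Conditions she genuinely "sees" are matched correctly.
    seen = [random.random() < P_SEE for _ in range(N_SUBJECTS)]
    correct = sum(seen)
    # The remaining conditions are matched by a random permutation;
    # its fixed points count as lucky extra matches.
    unseen = [i for i, s in enumerate(seen) if not s]
    shuffled = unseen[:]
    random.shuffle(shuffled)
    correct += sum(1 for a, b in zip(unseen, shuffled) if a == b)
    return correct

hits = sum(simulate_one() >= CRITICAL for _ in range(TRIALS))
power = hits / TRIALS
print(f"estimated power under the toy model: {power:.2f}")
```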

Effect Size and Power

Researchers increasingly emphasize “effect size,” especially because of the recent popularity of meta-analysis. The effect size refers to the difference between the value expected on the null hypothesis and that expected given the alternative hypothesis. In the NHT framework the focus is upon the probability of the outcome given that the null hypothesis is true. If the probability is high or moderate, then the investigator cannot justify rejecting the null hypothesis. If the probability is low, especially if it is equal to or less than the specified significance level, the investigator rejects the null hypothesis.

A criticism of NHT is that it fosters erroneous beliefs. One is that the lower the probability of the observed outcome, the more meaningful or important is the finding. Investigators can dismiss a large effect as unimportant because the probability of the outcome is large. The same investigators can hail a small and trivial outcome as important because the probability level is very low. The problem is that the probability of an outcome depends upon the sample size. A large effect can be non-significant if the sample size is small. Even a very small effect can produce a very small probability if the sample size is large.
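
The dependence of the p-value on sample size is easy to demonstrate with exact binomial tails: the same observed effect (a 60 percent hit rate against a 50 percent chance baseline) is far from significant with ten trials but overwhelmingly significant with a thousand. The numbers below are illustrative and have nothing to do with the Natasha test itself:

```python
from math import comb

def binom_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Same effect size (60% hits vs. 50% chance), very different p-values
p_small = binom_tail(6, 10)      # 6 correct out of 10
p_large = binom_tail(600, 1000)  # 600 correct out of 1,000

print(f"n = 10:   p = {p_small:.3f}")
print(f"n = 1000: p = {p_large:.2e}")
```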

Researchers measure effect size in a variety of ways. However, all of them correct for sample size. Beyond removing the influence of sample size, the measures are standardized so that they are comparable across different studies. Focusing upon effect sizes can be helpful, but it is not a panacea. Precisely because the measures are standardized, they become detached from the specific context from which they arose. In many situations, it is this specific context that provides the basis for a realistic assessment of when an effect size is large or meaningful. Failure to consider the original context can result in such errors as treating effect sizes as equivalent just because they are the same “size,” or using an arbitrary, one-size-fits-all scale for deciding when an effect is small or large. At least some of our critics used such arbitrary, context-free measures to conclude that our test lacked adequate power.

Mindless application of “effect size” to our test can result in the false belief that four correct matches in our test corresponds to a large effect. The same reasoning leads to the equally false belief that five correct matches corresponds to a very large effect. If we carefully examine the context of our situation, I argue that five correct matches do not constitute a large effect. Further, I claim that four correct matches, given the context of our test, would represent a weak and, even if real, trivial effect. Let me explain.

To begin, I must emphasize that the matching test we gave Natasha was a highly simplified version of what she does during her typical diagnoses. Consider the following points:

In her typical consultations with patients, Natasha allegedly has no prior knowledge about the client’s problems. She has to scan the entire body, look at every organ, and even look for problems at the cellular level. In our task, we tell her exactly what the condition is that she should be looking for. On each trial, she not only knows what she is looking for, she also knows where to look for it.

In her typical consultations, she not only does not know where to look and what to look for, but she often has to diagnose conditions whose detection involves very subtle cues such as slight changes in texture or discoloration. In our task, we chose conditions whose detection was non-problematic. If her X-ray vision operates as she claims, she did not have to rely on subtle indications. We presented her with conditions that were clear-cut and unambiguous. We were not asking her to look for changes in cells, slight alterations in size or shape, or malfunctioning processes. Instead we presented her with conditions that should stand out boldly for a person with X-ray vision—a large hole in the skull; a sizeable portion of a lung missing; metal surgical staples in the chest; a hip replacement; etc.

In her typical consultations she has to make an absolute judgment about each condition she is evaluating. An absolute judgment is one in which she has to decide, say, if an organ deviates from its normal state without the benefit of a comparison example. Our task allowed her to make comparative judgments. On each trial, she had the benefit of six normal examples to contrast with the one deviant case for which she was looking. Perceptual psychologists have shown that comparative judgments are several orders of magnitude easier than absolute judgments.

Her supporters vouch for the accuracy of her typical diagnoses (her mother even claimed that she never errs). Natasha informed the producer that our proposed test would be much less demanding than her typical reading. This is because she would not have to scan the entire body for each subject. So if her claim is correct, any one of the reasons listed above should ensure many correct matches on our test. Taken together, they should guarantee close to perfection.

These are the reasons why I felt justified in expecting almost perfect performance on our test if she really has the claimed X-ray vision. I softened this expectation to allow for less than perfection. I also lowered this expectation to guarantee adequate power for our test. Setting the expectation at five correct matches is the lowest value I could justify in our test. Four or fewer correct matches would simply be inconsistent with her claim.

The preceding observations assume an ideal situation. As everyone now realizes, our situation was not ideal. When we were planning the test, we knew it could not be ideal. We could not exclude the possibility of her picking up external clues with her normal vision. Realistically, the “null” distribution, as a result, would have an expected mean greater than one. Beyond getting one or more correct matches just by chance, we could expect her to get a few additional matches from external clues. These clues would be subtle if we had achieved our goal of making our group of subjects homogeneous in all external aspects. My previous report makes it clear, however, that the situation provided obvious clues that might have given her information about the subjects’ conditions. These possibilities provide additional reasons why getting four correct is not enough to show that Natasha has X-ray vision.

A Bayesian Perspective

Until now, I have discussed our test within the NHT framework. Within that framework, I found it easy to justify the criteria for our test. However, I chose the criterion for the test within a Bayesian framework. Within this framework, the probabilities for the matching procedure given in the preceding table are only part of the story. We have to consider explicitly the prior odds of both the null and the alternative hypotheses. The information provided by the outcome does not, in itself, provide us the probabilities that the null or the alternative hypothesis is correct. Instead, the information from the outcome is used to revise the prior odds.

A logical problem with the NHT was recognized by Fisher. For both the null and the alternative hypotheses we can calculate the probabilities for each possible outcome. For example, given the null hypothesis, the preceding table gives the probability of four correct matches as .0139. The probability of four correct matches given the alternative hypothesis is .1562. The problem is that the investigator is not interested in these probabilities. Rather, he or she wants to know the probability that the null or the alternative hypothesis is true given the outcome of the test. This is a subtle, but crucially important difference. The experiment or the test provides us with data (an outcome). We can compute the probability of this outcome given the hypotheses.

What we want, however, is the probability of the hypotheses given the outcome. This is the problem of “inverse probabilities.” Philosophers and statisticians engage in complicated and never-ending debates about whether such inverse probabilities can be justified. The difficulty is that we need to know the prior probabilities of null and alternative hypotheses before we can get the probabilities for these hypotheses after we have observed the outcome.

In our case, the Bayesian context requires us to specify two hypotheses to compare. In addition, we have to specify a prior probability that each is true. Consider the claim that Natasha has X-ray vision and can use this ability to diagnose medical conditions. What are the odds that this claim is true? The Bayesian approach is often criticized because the assignment of prior odds to hypotheses is subjective and arbitrary. This article is not the place to debate this matter. I only need to say that we have an empirical basis for assigning prior odds to Natasha’s claim. The assignment does not need to be exact. A crude approximation will do.

Natasha’s claim belongs to a large family of similar ones in which medical sensitives declared that they could diagnose illness by “seeing” inside patients. Such claims go back as far as the early 19th century, when mesmerized individuals allegedly displayed such abilities. Since then, and continuing into our time, thousands of individuals have made these claims. Yet not one of these claims has withstood a scientific test. Natasha’s claims and the anecdotes about her achievements place her in this class of medical sensitives. The probability that the claims of any individual in this class are true is obviously quite low. Indeed, given that not one of these claimants has produced scientific evidence in support of their ability, it would be reasonable to assign odds of several thousand to one against the truth of the claim.

I took a more conservative approach. I decided to assume that the prior odds in favor of the null hypothesis were 99:1. Equivalently, the prior odds against the alternative hypothesis were 99:1. The null hypothesis in our test is that the average number of correct matches will be one. The alternative hypothesis is that the average number of correct matches will be five. These two hypotheses are statistical hypotheses. The statistical procedure uses the outcome of the test as a basis for deciding between these two statistical hypotheses. We should distinguish these statistical hypotheses from conceptual or substantive hypotheses.3

In this context, the probabilities in the preceding table do not directly tell us how likely the null hypothesis is true given a particular outcome. In the Bayesian framework, we also need to compare the probability of the outcome on the null hypothesis with the probability of the same outcome on the alternative hypothesis. For an outcome of four correct matches this comparison yields odds of approximately 11:1 in favor of the alternative hypothesis. These odds are called a likelihood ratio. They tell us that an outcome of four correct matches provides evidence in favor of the alternative.

The Bayesian approach combines the likelihood ratio (determined by the evidence provided in the test) with the prior odds for each hypothesis. This combination yields posterior odds for each hypothesis. The posterior odds represent how the prior odds were revised because of the evidence provided by the test. An outcome of four correct matches does revise the original odds so they move closer to the alternative. The question is, do they revise the original odds enough to reject the null hypothesis in favor of the alternative?

The answer in our case is “no.” The four correct matches lower the odds against the alternative hypothesis from 99:1 to 9:1. This is a big reduction, but not enough to shift the odds in favor of the alternative hypothesis. However, if the outcome had been five correct matches, the revised odds would have been close to 1:1. In other words, this latter outcome would have resulted in the conclusion that the odds are even that the alternative hypothesis is true. Although such an outcome still does not favor the alternative, I was willing to conclude that an outcome that reduced the original odds against the alternative from 99:1 to 1:1 was impressive enough to justify additional investigation of her claim.
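
This Bayesian update can be reproduced from the figures already given: the likelihood of four correct matches is .0139 under the null hypothesis and .1562 under the alternative, and the prior odds against the alternative are 99:1. Bayes’ rule in odds form then gives:

```python
# Likelihoods of four correct matches, as given in the text
p4_null = 0.0139   # under the null (chance) hypothesis
p4_alt  = 0.1562   # under the alternative (mean of five) hypothesis

likelihood_ratio = p4_alt / p4_null   # evidence supplied by the outcome
prior_odds_against = 99               # 99:1 against the alternative

# Posterior odds = prior odds divided by the likelihood ratio
posterior_odds_against = prior_odds_against / likelihood_ratio

print(f"likelihood ratio: about {likelihood_ratio:.0f}:1 for the alternative")
print(f"posterior odds:   about {posterior_odds_against:.0f}:1 against the alternative")
```

These are the approximate 11:1 likelihood ratio and 9:1 posterior odds quoted above.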

Some critics of our test argued that we should have considered four correct matches “significant.” Within the Bayesian framework, such an argument implicitly assumes that the prior odds in favor of Natasha’s claim are 1:1. As I have explained, I think such an assumption is unreasonable. Before the test, I think all but her dedicated proponents would have placed the odds against her claim much higher than 99:1. Even with my setting the prior odds at this modest level, the evidence provided by the outcome still fell far short of swinging the odds in her favor.

As I previously mentioned, we designed our test to decide between two statistical hypotheses. The null hypothesis was that the outcome comes from a distribution with a mean of one. This is the distribution we would expect if the number of correct matches is due to chance. The alternative hypothesis was that the outcome comes from a distribution whose mean is five. I have discussed the many different reasons why we concluded in favor of the null hypothesis. However, if we had decided in favor of the alternative hypothesis, this would not be the same as confirming the hypothesis that Natasha has X-ray powers.

The statistical test enables us to decide (with a certain degree of confidence) between two statistical hypotheses. The alternative statistical hypothesis is that the observed outcome comes from a distribution whose mean is five. The major alternative conceptual hypothesis is that Natasha’s correct matches are the result of her alleged X-ray vision. When we reject the null hypothesis, we are deciding that the statistical alternative is more likely to be true. This is not the same thing as confirming the conceptual alternative. This is because many other conceptual alternatives might be consistent with the statistical alternative.

Some possible conceptual alternatives in our situation are: 1) Natasha’s correct matches are due to X-ray vision; 2) Natasha’s correct matches are due to external clues; 3) Natasha’s correct matches are due to a combination of external clues and X-ray vision. Because Natasha sees the subjects with her normal vision when she is allegedly using her X-ray powers, we cannot rule out the alternative conceptual hypothesis of reliance on external, normal clues. We designed our test to be the first step in a potentially sequential procedure. The first step would enable us to decide between two statistical hypotheses. If the outcome did not allow us to reject the null hypothesis, then it would provide no support for the alternative statistical hypothesis. Such an outcome would also provide no support for any of the conceptual hypotheses. Given such an outcome, we would have no reason to continue the investigation.

On the other hand, if the outcome allowed us to reject the null hypothesis, then it would provide support for the alternative statistical hypothesis. However, this would not be the same thing as supporting the conceptual hypothesis of paranormal X-ray powers. The alternative statistical hypothesis could be consistent with an array of possible conceptual alternatives. The most likely one would be that Natasha was using external clues to get her correct matches. Obviously, we would have to do further testing with extensive resources and clever procedures to eliminate many other conceptual possibilities before we could say that her matches were due to X-ray vision.

What is the Difference Between Four and Five Correct Matches?

Some critics, including Natasha herself, claim that her score of four correct matches is close enough to the criterion of five. They say we should give her credit for getting so close. The difference between four and five correct matches, however, is not trivial. This is because we are dealing with a discrete distribution with only seven possible values. For the situation as I set it up, an outcome of four correct guesses yields posterior odds of 9:1 in favor of the null hypothesis. An outcome of five correct guesses, on the other hand, yields posterior odds closer to 1:1.
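The seven-value discrete distribution can be computed exactly using derangement counts. The sketch below assumes the standard chance model for the matching problem, namely a uniformly random one-to-one pairing of the seven diagnoses with the seven subjects; the function names are mine, not from the original report.

```python
from math import comb, factorial

def derangements(n):
    """Count the permutations of n items that have no fixed points."""
    # Recurrence: D(n) = (n - 1) * (D(n - 1) + D(n - 2)), D(0) = 1, D(1) = 0.
    d = [1, 0]
    for i in range(2, n + 1):
        d.append((i - 1) * (d[i - 1] + d[i - 2]))
    return d[n]

def p_matches(k, n=7):
    """Chance probability of exactly k correct matches when n diagnoses
    are paired with n subjects purely at random."""
    return comb(n, k) * derangements(n - k) / factorial(n)

for k in range(8):
    print(f"P({k} correct) = {p_matches(k):.4f}")
# P(4) = 70/5040, about 0.014; P(5) = 21/5040, about 0.004.
```

By chance alone, five correct matches is more than three times rarer than four, which is why the jump from four to five is far from trivial even though the two scores look adjacent.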

Conclusions

This commentary has been somewhat technical and repetitious. I wanted to explain as fully as possible the reasons behind the planning and interpretation of our test. We devoted much thought to the planning of the test. My colleagues and I each agreed that the appropriate critical value was five correct matches. We each reached this conclusion for different reasons. The fact that we converged upon the same critical value suggests that this was a reasonable choice.

The limitations of our test were those of execution rather than design. All these limitations favored Natasha. The requirements of the test were much less demanding than those of her typical diagnoses. These and other factors probably worked to increase the number of correct guesses. Despite these flaws, Natasha still could not achieve the critical number of matches needed to pass the test. This was a number to which all parties had agreed in advance. All parties to the agreement were committed to the two possible conclusions. If she got five or more, we would have advocated further and more conclusive testing. If she got fewer than five, as was the case, we would drop any additional interest in her claim.

The outcome was insufficient to pass our criterion. Moreover, the specific correct matches and misses added additional evidence to weaken her claim. As I explained in my previous article, the pattern of her matches was inconsistent with the operation of X-ray vision. However, this pattern was fully consistent with the possibility that her matches relied upon external clues.

Notes

Diaconis and Holmes (2002) deal with the matching problem from a Bayesian perspective. This is interesting because I chose the critical value for our test based on elementary Bayesian reasoning. Their paper provides a basis for designing a test that could accommodate different probabilities for correctly matching each subject.

Almost from the beginning, critics of NHT have bemoaned the fact that the rejection of the null hypothesis is labeled a “significant” outcome. “Significance” implies importance. A statistically “significant” result simply means that the outcome was sufficiently different from the one expected on the null hypothesis that it equaled or surpassed a critical value. With small samples, the outcome has to be very different from the expected value to achieve “significance.” However, with sufficiently large samples, even a trivially small difference can achieve “significance.”
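This sample-size effect is easy to demonstrate with a one-sample z test for a proportion. The numbers below are purely illustrative and have nothing to do with the Natasha test: the same tiny effect (a 51 percent hit rate against a 50 percent chance rate) is nowhere near “significant” with 100 trials, yet overwhelmingly so with 100,000.

```python
from math import sqrt

def z_for_proportion(p_hat, p0, n):
    """One-sample z statistic for an observed proportion p_hat
    against a hypothesized chance rate p0, with n trials."""
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Identical effect size, very different "significance":
print(z_for_proportion(0.51, 0.5, 100))      # 0.2  (far below 1.96)
print(z_for_proportion(0.51, 0.5, 100_000))  # ~6.3 (far above 1.96)
```

Against the conventional two-tailed criterion of |z| ≥ 1.96, the first result is unremarkable while the second is highly “significant,” even though the underlying difference is identical and trivially small.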
