(Note: The "Conceptual Garbage Group" was a weekly group in the 1970s and 1980s that discussed technical and scientific topics outside the range of everyday work. This "Theme" was presented about 1984.)

Thesis: Significance tests have no
value other than to get papers past journal editors, and their use can damage
the progress of science.

In casual discussion with some members of the CGG, I have
expressed the opinion that tests of "statistical significance" have
no place in science. I have been surprised by the repeated comment "but I
thought statistics was important." The implication that statistics
consists of significance testing is profoundly disturbing. Statistics is, more
than anything, the formalized technique of description and measurement. Even if
significance tests had any philosophical basis, they would form only a small
part of the body of statistics; it suggests a dereliction of duty on the part
of teachers of experimental methods that students can come away with the
impression that statistics can be equated with
significance tests.

To use a significance test, one requires a so-called null
hypothesis, which is assumed to be true unless the test shows
"significance" at some predetermined probability level such as 0.05
or 0.01. There is often a background hypothesis in the tester's mind, which
will be acceptable (but not necessarily accepted) if the test shows
significance, or which will be rejected if the test shows non-significance, but
the background hypothesis does not figure in the significance test itself.
Inasmuch as one knows a priori that the null hypothesis is false (any
finite hypothesis about the world will be false as a total description of
the world), the significance test tells nothing about either the null
hypothesis or the background hypothesis. It tells only about the sensitivity of
the experiment in the context of how close the null hypothesis comes to being
an accurate description. It follows that one can, in principle, find
significant effects wherever they are sought, by making the
experiment sensitive enough.
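
The point can be made concrete with a small simulation (a hypothetical sketch in Python, with every number invented for the purpose): give the data a true mean shift so small that nobody would care about it, and a standard t-test will still declare it "significant" once the sample is made large enough.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.01   # a negligible mean shift, in standard-deviation units

# The same t-test at increasing sample sizes: the p-value shrinks not
# because the effect matters, but because the experiment grows sensitive.
for n in (100, 10_000, 1_000_000):
    sample = rng.normal(loc=true_effect, scale=1.0, size=n)
    t, p = stats.ttest_1samp(sample, popmean=0.0)
    print(f"n={n:>9,}: t = {t:6.2f}, p = {p:.4f}")
```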

Given the foregoing, one may ask why significance tests have
taken the hold they have on the community of experimental psychologists. I
suggest that the reasons are largely social (as our Chaos-related CGG
discussions would predict).

Significance tests are easy to perform using cookbook
methods and prepared programmes.

Most effects predicted by the background hypotheses of
interest are large enough to be shown as "significant" by an
experiment small enough to be practical.

Most experimenters have only one background hypothesis
in mind (or a class of related ones), and therefore can accept it without
competition from other ones if the test shows "significance."

The "null hypothesis" is a common-sense
description, or is based on a currently well-accepted theory that the
experimenter wishes to falsify.

The sensitivity of most
experiments varies over only a small range: there is a conventional wisdom
about how many subjects, how many trials, how much training, how many
conditions, etc. are required for finding "interesting" effects to be
significant. Therefore a finding of "significance" in one paper can be
roughly equated with a finding of "significance" in another, in respect of
the minimum effect magnitude that could have produced the significant result
(a numerical sketch of this point appears below).

Usually, "significance" is not the only
result reported. Tables and graphs of actual measured magnitudes are normally
included so that one can see how big the effects really are. Sometimes, these
measurements are omitted, which makes the paper useless for guiding further
research. Since significance tests do not have to stand alone in showing the
results of experiments, they need less philosophical backing than they might
otherwise require for survival.
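
To put a number on the earlier point about sensitivity (a sketch using a conventional design that I have invented for illustration, not figures from any particular study): with twenty subjects per group and a 0.05 threshold, one can solve for the smallest standardized effect that such an experiment will reliably show as "significant".

```python
from statsmodels.stats.power import TTestIndPower

# Solve for the smallest standardized effect (Cohen's d) that a
# conventional design (20 subjects per group, alpha = 0.05, 80% power)
# will reliably report as "significant".
d = TTestIndPower().solve_power(effect_size=None, nobs1=20,
                                alpha=0.05, power=0.8, ratio=1.0)
print(f"minimum reliably detectable effect: d = {d:.2f}")  # about d = 0.9
```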

I argue that if the motive for doing an experiment is to
compare the merits of a conventional (null) hypothesis with those of a
hypothesis favoured by the researcher, the test
should directly compare the two, rather than attempt to falsify the one so that
the other can be accepted by default. All scientific theories are approximate
descriptions, and all can be falsified, so it is only the relative merits of the
descriptions that are at issue. If a conventional theory fits better with an
accepted framework of belief than the new one does, then the data must be
substantially better described by the new one than by
the old before the new can be accepted. This is the main reason why
significance tests are made with a significance level of 1 in 20 or 1 in 100,
rather than the 50-50 which would seem more reasonable
on the face of it. Of course, a statistically valid reason for choosing such
low probabilities as thresholds for "significance" is that there is
actually an infinity of possible background hypotheses
rather than just the one favoured by the researcher.
But only hypotheses that have been thought of can be true competitors to the
null hypothesis, and so a finite significance level is chosen rather than the 1
in infinity level (0.0000 ... ) that truly would allow
one to accept the null hypothesis.
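
What a direct comparison could look like, in the simplest case (a hypothetical sketch; the two Gaussian hypotheses and all numbers are my invention): compute the likelihood of the same data under each hypothesis and report the ratio, so that neither hypothesis is accepted by default.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=0.3, scale=1.0, size=50)   # stand-in observations

# Two competing descriptions of the same data:
#   H0 (conventional/null): measurements centred on 0.0
#   H1 (favoured):          measurements centred on 0.3
loglik_h0 = stats.norm.logpdf(data, loc=0.0, scale=1.0).sum()
loglik_h1 = stats.norm.logpdf(data, loc=0.3, scale=1.0).sum()

# A positive log likelihood ratio favours H1; its magnitude says by how
# much the data are better described by H1 than by H0.
print(f"log likelihood ratio (H1 over H0): {loglik_h1 - loglik_h0:.2f}")
```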

Students are generally told "you never accept the null hypothesis; you just fail to
reject it." But a scientist should never accept any hypothesis,
except as a working description of part of the world; in this sense one does accept the null hypothesis most of the
time. Experimental science is a competition among hypotheses old and new. The
purpose of experiments is to provide information that can alter the relative
assessments of the usefulness of those hypotheses that claim to describe
all relevant data, old and new.

If philosophical error were the only problem with significance
testing, I would have no serious complaint other than a general distaste for
wrong and inefficient ways of doing things. But the situation is worse: the use
of significance tests can seriously distort the progress of science, leading
people to believe falsehoods. The classic situation in which this occurs is
when a theory suggests there should be a particular effect, but the theory is
not popular. Experiments are done which show a "non-significant"
result. There are several such experiments, all of which show a
"non-significant" effect in the predicted direction. If one were to
combine all the data, the effect would be seen clearly, but because several
experiments all showed "no effect" (which is how "not
significant" is usually quoted by the author who next writes about
the study), conventional wisdom now regards the proposed theory as disproved,
whereas the data in fact strongly support it.
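
A simulation of the classic situation (a hypothetical sketch; the effect size, study size, and number of studies are all invented): five small experiments on a real but modest effect will each, more often than not, come out "not significant", while the pooled data show the effect plainly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_effect, n_per_exp = 0.25, 20   # a real but modest effect; small studies

samples = [rng.normal(true_effect, 1.0, n_per_exp) for _ in range(5)]

# Each small experiment, tested alone, is badly underpowered and will
# usually be reported as "not significant" (read: "no effect").
for i, s in enumerate(samples, 1):
    t, p = stats.ttest_1samp(s, popmean=0.0)
    print(f"experiment {i}: p = {p:.3f}")

# Pooling all the data reveals the effect the separate tests missed.
pooled = np.concatenate(samples)
t, p = stats.ttest_1samp(pooled, popmean=0.0)
print(f"pooled (n = {pooled.size}): p = {p:.4f}")
```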

In a recent controversy over a review of "The
Psychology of Reading" (Taylor and Taylor, Academic Press, 1983) the problem of significance tests was important.
Both the reviewer and a critic of my comments on the review misused
significance in a way slightly different from the classic
"many-experiment" problem. Their misreading of the nature of
significance could have a (socially) significant impact on the way children are
taught to read. The issue therefore has an importance beyond scholarly debate.

I propose a CGG discussion, or preferably a debate, on the
merits of significance testing as opposed to descriptive or Bayesian statistics
(or other approaches as may be favoured by CGG
members). I hope that someone will support the contrary opinion.

I include with this note a copy of handwritten notes I used in a seminar
on signal detection and related topics in 1966. The first "batch" and the beginning of the second
"batch" are relevant, and the result proved starting on page 13
of the second batch may interest some people. The first batch introduces the
concept of an "Assessment Function" that describes how an experiment
affects what we can say about hypotheses that claim to describe the data. The
enquiry into assessment is continued in the second batch. There follows a
discussion of an appropriate measure of the distinctiveness of two hypotheses
(d'), which is not very relevant to the discussion at hand. The result on
pp. 13-14 of the second batch shows how much information one can gain about the
attributes of an object if one knows its detectability (or distinctiveness) from some null base. In the context of this discussion, it
suggests how much detail one can describe about competing hypotheses in respect
of their differences from a null hypothesis.