This is a verbatim transcript of a draft working paper
dated April 1972, privately circulated.

Introduction

Ockham's razor cuts into the body of hypotheses like a surgeon's knife,
excising the fat and leaving only the leanest. It must be one of the oldest
tools of Western philosophy, dating from the thirteenth century. Yet its use
has never received a clear rationale, since among the excised hypotheses
there is certainly at least one more nearly true than the hypothesis that
the operation leaves viable.

The use of Ockham's razor can be seen as an example of the proper use
of creditation in the theory of hypothesis testing. It does not have to
rely on metaphysical principles of "the simpler the better" or
"nature loves elegance." The discussion follows concepts defined
and developed in Watanabe (1969), although the ideas are not due to Watanabe.

The Universe of hypotheses

In principle, any hypothesis can be written as a linear assemblage of
words and symbols taken from a finite set of possible symbols. Inasmuch
as we can enumerate the letters used in the words, each hypothesis can be
given a number. They are countably many. In some manner, any hypothesis
can be given an index of simplicity. A convenient one, suitable from an
information-theoretic viewpoint, is simply the length of the symbol string
describing the hypothesis. This is, of course, not unique. The hypothesis
"F=Ma" is four symbols; the hypothesis "Force equals mass
times acceleration" is 36 symbols including spaces. The same hypothesis
can be written in many different ways, with different lengths. It is reasonable,
however, to judge that the longer statement is more complex than the shorter,
since there are more ways of saying things in five words than in four symbols;
the 36-character statement selects from a greater set of possibilities,
and is thus more informative conditional on there being at least 36 characters
in the statement.
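In modern terms, the character counts in this example can be checked directly:

```python
short = "F=Ma"
long = "Force equals mass times acceleration"

print(len(short))  # 4 symbols
print(len(long))   # 36 symbols, including spaces
```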

It is not necessary, on the other hand, to judge the complexity of the
two statements as different. There will be many sets of synonymous statements
among all the possible hypotheses in the universe, and these may be judged
to have the complexity of their shortest member.

Complexity is not ordinarily judged from the length of the statement.
The interrelations among the elements of the statement, and the extra assumptions
and prior knowledge that must be assumed in order that the statement be
intelligible both come into the ordinary assessment of complexity. However,
it can be argued (1) that the prior conditions and assumptions must be the
same for all competing hypotheses and that if any hypothesis requires assumptions
not required for the others, then these must be stated, and (2) that in
any case most reasonable measures of complexity will correlate fairly well
with statement length. While the judgment of complexity is a very subjective
matter, the judgment of the length of a statement that with existing
prior assumptions suffices to describe a mass of data is objective.
All statements can be given a length measure.

Let the set of all hypotheses be {H_i} and their lengths be
L_i. The number of hypotheses of length L goes up as N^L,
where N is the "effective" number of possible symbols. N takes
into account redundancy due to unequal probabilities of symbol usage and
inter-symbol relationships. There are approximately N^L hypotheses
of length L. The uncertainty associated with which particular statement
of length L will be made is thus L log N, and for large enough N the same
estimate applies if we include statements shorter than L as possibilities.
The number of statements of length L is very much larger than the number
of all the possible shorter statements.
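A small numerical sketch (Python; the values of N and L are invented for illustration) confirms that statements of length L dominate all shorter ones, so the uncertainty is close to L log N:

```python
import math

N = 10   # assumed "effective" number of symbols (illustrative)
L = 8    # statement length (illustrative)

# Number of statements of exactly length L.
exact = N ** L

# Number of all strictly shorter statements: N + N^2 + ... + N^(L-1).
shorter = sum(N ** p for p in range(1, L))

# Uncertainty about which statement of length <= L will be made:
# close to L * log N, since the shorter statements add little.
u_exact = L * math.log(N)
u_all = math.log(exact + shorter)

print(exact, shorter)   # the length-L statements outnumber all shorter ones
print(u_exact, u_all)   # the two uncertainty estimates nearly agree
```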

Watanabe shows that for any finite set of data the hypotheses can be
divided into three more or less distinct groups. The first group is of logically
refuted hypotheses, which have denied the possibility of some datum actually
observed. The second group is of strongly discredited hypotheses, whose
credibility is very much lower than that of the third group, which consists
of more or less equally credited hypotheses which all describe the data
equally accurately. For an infinite amount of data, the second group is
uniquely distinguishable from the third, since the credibility of any member
of the second group vanishes even though it is not logically refuted, whereas
that of any member of the third group converges to a finite value. Watanabe
defined these credibilities (posterior subjective probabilities assigned
to statements of the kind "this hypothesis describes the manner these
data arise") in terms of a finite number of hypotheses, but we must
here consider an infinite number, and must therefore refer to direct measures
that cannot be normalized. Credibilities refer then to the measure of credibility
assigned to the hypothesis, and conditional credibility can be turned into
a probability when the condition is that only a finite number of hypotheses
are available for consideration. The credibility measure is the numerator
of the fraction that defines Watanabe's credibility probability.

In most, if not all, cases with a finite amount of data, there will be
an infinite number of hypotheses with relatively large credibility measure.
If L is chosen large enough, there will be a finite number with length L
or less. Without prior knowledge of the situation, the probability that
any one hypothesis will describe the data is as high as the corresponding
probability for any other hypothesis. If there are N^L (approximately)
possible hypotheses, and K acceptable ones of length L or less, then the
a priori probability that any one hypothesis will be acceptable is K/N^L.
There will be approximately K N^(P-L) acceptable hypotheses of length
P < L. In particular, there are almost K acceptable hypotheses of length
L, or, to a closer approximation, K(N-1)/N.
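These counts can be illustrated numerically. The following sketch (Python; the values of N, K and L are invented for illustration) computes the a priori probability K/N^L and the approximate count K N^(P-L) of acceptable hypotheses of length P:

```python
N = 10      # effective symbol count (assumed for illustration)
L = 6       # maximum length considered (assumed)
K = 500     # acceptable hypotheses of length L or less (assumed)

total = N ** L                 # approximate number of possible hypotheses
p_acceptable = K / total       # a priori probability of being acceptable

def acceptable_of_length(P):
    """Approximate number of acceptable hypotheses of length P <= L."""
    return K * N ** (P - L)

# Almost all acceptable hypotheses have length L itself: K(N-1)/N of them.
at_L = K * (N - 1) / N

print(p_acceptable, acceptable_of_length(L - 1), at_L)
```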

Given that a hypothesis belongs to the acceptable set, the probability
that it has length P is K N^(P-L)/K, or N^(P-L). We now turn
to Garner's (REF) notion of "good form," as relating to the number
of members of an equivalence set. To paraphrase this idea, we assume that
all members of a set of structures can be sorted into subsets whose members
are "like" one another. When people are asked to do this task,
they isolate some structures as being unlike any of the others, and gather
the remaining structures into large groups. Separately, when asked to judge
the "goodness" or "simplicity" of the forms, those that
were isolated in the sorting are judged to be "good" or "simple",
and the grouped ones to be complex. The converse of this is that complex
forms tend to be members of large groups, any one of which would serve as
a representative of the group, whereas simple forms cannot be well represented
by other simple forms. Informationally, the presentation of a simple form
reduces the uncertainty about what was presented from the initial level
of log N (N is the number of possible forms) to almost zero; presentation
of a complex form reduces the uncertainty only to log M (M is the number
of elements in the associated group).

Applying this idea to the hypothesis set, we can say that one complex
hypothesis is "as good as" another, whereas the simpler hypotheses
stand alone. The groupings do not necessarily mean that all equally complex
forms go together, but merely that the sizes of the relevant groups increase
with increasing complexity. If we presume that the logarithm of the group
size is some fraction (1/k) of the logarithm of the number of hypotheses
with the given complexity, G = N^(L/k), then the uncertainty reduction involved in finding
a credible hypothesis of length P will be from L log N to (P/k) log N, where
L is the maximum length within which one will entertain a hypothesis.
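Reading the group size for a hypothesis of length P as G = N^(P/k), so that the residual uncertainty log G equals (P/k) log N as in the text, a short numerical sketch (Python; the values of N, L and k are invented for illustration):

```python
import math

N = 10   # effective symbol count (assumed for illustration)
L = 20   # maximum length within which a hypothesis is entertained
k = 4    # group-size parameter (assumed)

def residual_uncertainty(P):
    """Uncertainty remaining once a credible hypothesis of length P is
    found, its group having G = N**(P/k) interchangeable members."""
    G = N ** (P / k)
    return math.log(G)   # equals (P/k) * log N

initial = L * math.log(N)
for P in (4, 12, 20):
    # shorter (simpler) hypotheses leave less residual uncertainty
    print(P, initial - residual_uncertainty(P))
```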

It is presumed in the foregoing analysis that the person gathering the
hypotheses finds only a small subset of the credible hypotheses. Among these
he must select that single hypothesis he is most willing to credit with
being able to explain the available data as well as similar data he may
collect later. Not all members of a complex group will be found, and complex
hypotheses have to be regarded as members of their group. The credibility
calculated on the usual basis of the explicitly tested hypotheses will be
more or less equal for all of the hypotheses we have called "credible."
However, the conditional credibility of actually tested hypotheses, with
the condition that one of the group is the wanted hypothesis, must also
be equal across the whole group associated with any one test hypothesis.
Hence we can argue that the creditation of any one hypothesis should be
"diluted" in proportion to the probable number of group members,
whether they are tested or not.

The dilution of creditation argument then suggests that if all credible
hypotheses have an overt credibility measure Q, and tested but discredited
hypotheses have credibility near or equal to zero, then the perceived credibility
of any one hypothesis should be Q/G, where G is the number of members of
a group of which a tested hypothesis is a representative. Perceived credibility
will then be Q_P = Q N^(-P/k). This number declines sharply
with P. It declines so sharply, in fact, that the total perceived credibility
of the infinite set of hypotheses is finite, which permits normalization
and the use of credibility as a probability measure. If the probability
that any randomly selected hypothesis remains credible is R, then the normalization
of perceived credibility depends on the sum over all lengths,
Sum_P Q_P = R Q Sum_{P=1 to infinity} N^(-P/k) = RkQ/ln N. The normalized
perceived credibility will then be q_P = (N^(-P/k) ln N)/Rk. If we express
R as an equivalent length, by writing R = N^(-S/k),
we can write q_P = (N^(-(P-S)/k) ln N)/k. The shorter and
hence less complex hypotheses are perceptually the more credible.
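The normalization can be checked numerically. The following sketch (Python; all parameter values are invented for illustration) compares the discrete sum of perceived credibilities R Q N^(-P/k) with the continuous approximation RkQ/ln N, and shows that normalized credibility declines with length:

```python
import math

N = 10     # effective symbol count (assumed for illustration)
k = 4      # group-size parameter (assumed)
Q = 1.0    # overt credibility measure of each credible hypothesis
R = 0.05   # probability a random hypothesis remains credible (assumed)

# Total perceived credibility: sum over lengths P of R * Q * N**(-P/k).
total = sum(R * Q * N ** (-P / k) for P in range(1, 10_000))

# Continuous (integral) approximation used in the text: R*k*Q / ln N.
approx = R * k * Q / math.log(N)

# The discrete sum and the integral are of the same order; the series
# converges, which is what permits normalization.
print(total, approx)

def q(P):
    """Normalized perceived credibility of a hypothesis of length P."""
    return N ** (-P / k) * math.log(N) / (R * k)
```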

The predictive power of hypotheses

A hypothesis is not supposed merely to describe a body of data. It has
the further function of predicting data yet to come. Indeed, this predictive
power is taught in experimental design classes as the only valid
test of a hypothesis. You are not supposed to test your hypothesis with
data gathered before you built (i.e. discovered) the hypothesis. Those data
are permitted in tests of hypotheses you had already invented, but not in
tests of new ones. This attitude has some justification, though not very
much. On the face of it, the attitude taught to students is ridiculous,
since the creditation of a hypothesis does not depend in the least on its
date of invention.

A hypothesis is supposed to describe a particular body of data. The conditions
under which the data are gathered is included in the background assumptions
common to all hypotheses. For example, Newton proposed F=Ma to describe
the mechanical motions of any solid bodies. Subsequently, the body of data
described by this simple formula had to be restricted to those obtained
at relatively low speeds. So long as the data base is thus restricted, Newton's
formula remains credible, but if all attainable speeds are allowed, it becomes
incredible. Einstein's more complex formula is the more credible, and is
as credible as Newton's even on the restricted data base. Only because Newton's
is the simpler is it used in any circumstances.

The data used to determine the credibility of hypotheses have a finite
number of degrees of freedom. This is to say that a finite number of statements
will suffice to describe the data exactly. The point of a theory is to describe
the same data in fewer statements. Intuitively, if we use a new statement
to describe every data point, we cannot expect to predict any new data not
yet collected. Conversely, if a statement like "the voltage is 117
volts" has proved to describe every measurement so far made, we can
legitimately expect it to describe measurements in the future. This intuition
can serve as the basis for a predictive rationalization for Ockham's razor.

The credible hypotheses have one thing in common. They each describe
the data almost equally adequately. The total error involved in the description
is the same for all the hypotheses. The hypotheses which involve an individual
statement for every datum, or equivalently permit the recovery of every
datum through logical combinations of the statements, should have no error.
We can omit hypotheses of this type as being uninteresting on the grounds
that their range of description encompasses only the data already gathered.
They do not claim to predict. On the grounds of the data already gathered,
they are the most credible, and it seems that sheer credibility is not a
measure of the value of a hypothesis. Value must depend on predictive credibility.

Just as a body of data has a number of degrees of freedom, so has a hypothesis.
The number of degrees of freedom in the data is determined by the number
of independent measurements that serve to describe the data, and the number
of degrees of freedom in the hypothesis by the number of independent statements
needed to complete the hypothesis. "Independence" of statements
may not be too clearly defined. Neither is it always clear what statements
are needed to complete a hypothesis. The required statements should not
include the common body of assumptions underlying all competing hypotheses,
but should include assumptions belonging to some but not others. Nevertheless,
it should be possible to provide a crude measure of the number of degrees
of freedom in a hypothesis. It is a measure of complexity, and at worst
can be submitted to the judgments of independent observers.

Suppose that a hypothesis has been found to have H degrees of freedom,
and that it describes a body of data with D degrees of freedom with a total
error E. After the hypothesis has been stated, the data has only D-H degrees
of freedom left which could contribute to the error. The hypothesis has
described H degrees of freedom exactly. Hence the goodness of the hypothesis
can be described in terms of the error per remaining degree of freedom.
This is the best estimate of the probable error if more degrees of freedom
were to be added by the accumulation of more data. The predictive error
of a hypothesis is then E/(D-H). The smaller H, the better the prediction
for a common E and a given body of data. When D is much larger than H, the
predictive error is almost E/D, and the value of a hypothesis is therefore
determined by how well it predicts the data--by the value of E--rather than
by changes in its complexity. Only when the number of degrees of freedom
in the hypothesis approaches that in the data does the predictive error
become more strongly dependent on H than on E. For most situations, D will
be much larger than H, and the hypothesis that most accurately describes
the data will be preferred. Only when two hypotheses describe the data almost
equally accurately will their relative complexity determine preference,
and that is what Ockham's razor states. It is interesting, however, that
an increase in simplicity can override an increase in total descriptive
error on some occasions.
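The criterion can be sketched numerically (Python; the values of D, E and H are invented for illustration) to show the case where an increase in simplicity overrides an increase in total descriptive error:

```python
# Compare two hypothetical hypotheses describing the same body of data:
# a simple one with slightly worse fit, a complex one with slightly better fit.
D = 100   # degrees of freedom in the data (assumed)

def predictive_error(E, H):
    """Error per remaining degree of freedom, E / (D - H)."""
    return E / (D - H)

simple = predictive_error(E=10.0, H=5)     # short hypothesis, larger total error
complex_ = predictive_error(E=9.0, H=40)   # long hypothesis, smaller total error

print(simple, complex_)
# 10/95 is less than 9/60: the simpler hypothesis predicts better
# despite its larger total descriptive error.
```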

References

Garner, W. R. (Work in the 1960's; specific references to be looked up).

Watanabe, S. (1969). Knowing and Guessing: A Quantitative Study of Inference
and Information. New York: Wiley.

Draft Working Paper dated April 1972. Transcribed into electronic form Jan 8 1993. HTML version March 22, 1997.