In the fourteenth century William of Occam stated
``plurality should not be assumed without necessity''.
This principle has since become known as Occam's razor. Occam's razor
was originally intended as a basis for determining one's ontology.
However, in modern times it has been widely reinterpreted and adopted
as an epistemological principle--a means of selecting between
alternative theories as well as ontologies. Modern reinterpretations
of Occam's razor are widely employed in classification learning.
However, the utility of this principle has been subject to widespread
theoretical and experimental attack. This paper adds to this debate by
providing further experimental evidence against the utility of the
modern interpretation of Occam's razor. This evidence takes the form
of a systematic procedure for adding non-redundant complexity to
classifiers in a manner that is demonstrated to frequently improve
predictive accuracy.

The modern interpretation of Occam's razor has been
characterized as ``of two hypotheses H and H', both of which
explain E, the simpler is to be preferred''
[Good, 1977]. However, this does not specify what aspect of a
theory should be measured for simplicity. Syntactic, semantic,
epistemological and pragmatic simplicity are all alternative criteria
that can and have been employed
[Bunge, 1963]. In practice, the common use of Occam's razor in machine
learning seeks to minimize surface syntactic complexity. It is this
interpretation that this paper addresses.

Presumably, Occam's razor is usually applied in the
expectation that its application will, in general, lead to some
particular form of advantage.
There is no widely accepted
articulation of precisely how Occam's razor should be applied or what
advantages are to be expected from its application in classification
learning. However, the literature does contain two statements that
seem to capture at least one widely adopted approach to the principle.
Blumer, Ehrenfeucht, Haussler, and Warmuth [1987] suggest that to wield Occam's razor is to adopt the
goal of discovering ``the simplest hypothesis that is consistent with
the sample data'' with the expectation that the simplest hypothesis will
``perform well on further observations taken from the same source''.
Quinlan [1986] states

``Given a choice between two decision trees, each of which is
correct over the training set, it seems sensible to prefer the simpler
one on the grounds that it is more likely to capture structure inherent
in the problem. The simpler tree would therefore be expected to
classify correctly more objects outside the training set.''

While these statements would not necessarily be accepted by all
proponents of Occam's razor, they capture the form of Occam's razor
that this paper seeks to address--a learning bias toward classifiers
that minimize surface syntactic complexity in the expectation of
maximizing predictive accuracy.

Both of the above statements of Occam's razor restrict themselves
to classifiers that correctly classify all objects in a training set.
Many modern machine learning systems incorporate learning biases that
tolerate small levels of misclassification of the training data
[for example, Clark and Niblett, 1989;
Michalski, 1984;
Quinlan, 1986;
Quinlan, 1990]. In this
context, and extending the scope of the definition beyond decision
trees to classifiers in general, it seems reasonable to modify
Quinlan's [1986] statement (above) to

Given a choice between two plausible classifiers that perform
identically on the training set, the simpler classifier is expected to
classify correctly more objects outside the training set.

This will be referred to as the Occam thesis.

The concept of identical performance on a training set could be
defined in many different ways. It might be tempting to opt for a
definition that requires identical error rates when two classifiers are
applied to the training set. A less strict interpretation
might allow two classifiers to have differing error rates so long as the
difference is within some statistical confidence limit. However, to
maximize the applicability of its results, this paper will adopt a very
strict interpretation of identical performance--that for every
object o in the training set, both classifiers provide the same
classification for o.
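This strict interpretation can be sketched as a simple check. The following is a hypothetical illustration (the function name, toy classifiers, and data are not from the paper): two classifiers may have equal error rates on the training set yet still fail the strict criterion, because they err on different objects.

```python
# Hypothetical sketch of the paper's strict notion of "identical
# performance": two classifiers must assign the same class to EVERY
# training object, not merely achieve equal error rates.

def identical_performance(clf_a, clf_b, training_set):
    """True iff clf_a and clf_b give the same classification for
    every object o in the training set."""
    return all(clf_a(o) == clf_b(o) for o in training_set)

# Toy training set of four objects with known true classes.
training_set = [1, 2, 3, 4]
true_class = {1: "a", 2: "a", 3: "b", 4: "b"}

# clf_a errs only on object 2; clf_b errs only on object 3.
clf_a = lambda o: {1: "a", 2: "b", 3: "b", 4: "b"}[o]
clf_b = lambda o: {1: "a", 2: "a", 3: "a", 4: "b"}[o]

errors_a = sum(clf_a(o) != true_class[o] for o in training_set)  # 1
errors_b = sum(clf_b(o) != true_class[o] for o in training_set)  # 1

# Equal error rates, yet not identical performance in the strict sense:
print(errors_a == errors_b)                            # True
print(identical_performance(clf_a, clf_b, training_set))  # False
```

Under the less strict, error-rate-based interpretation these two classifiers would count as performing identically; under the strict interpretation adopted here they do not.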

It should be noted that the Occam thesis does not claim that,
of any two classifiers with equal empirical support, the less
complex will always have greater predictive accuracy on previously
unseen objects. Rather, it claims that, more frequently than not,
the less complex will have higher predictive accuracy.

This paper first examines some arguments for and against the Occam
thesis. It then presents new empirical evidence against the thesis.
This evidence was acquired by using a learning algorithm that
post-processes decision trees learnt by C4.5. This post-processor was
developed by rejecting the Occam thesis and instead attending to the
assumption that similarity is predictive of class. The post-processor
systematically adds complexity to decision trees without altering their
performance on the training data. This is demonstrated to lead to an
increase in predictive accuracy on previously unseen objects for a range
of `real-world' learning tasks. This evidence is taken as incompatible
with the Occam thesis.
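The core mechanism can be illustrated in miniature. The sketch below is a hypothetical simplification, not Webb's actual post-processor: it grows a decision-tree leaf into a subtree whose test is satisfied by every training object reaching that leaf, so classifications on the training set are unchanged while unseen objects that fail the test are now classified differently. All names (`Node`, `specialize_leaf`) and the toy data are illustrative assumptions.

```python
# Hypothetical sketch: add syntactic complexity to a decision tree
# without altering its behaviour on the training data.

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    label: Optional[str] = None              # set for leaves
    test: Optional[Callable] = None          # boolean test on an object
    left: Optional["Node"] = None            # branch taken if test(o)
    right: Optional["Node"] = None           # branch taken otherwise


def classify(node: Node, o) -> str:
    """Follow tests from the root down to a leaf and return its label."""
    while node.label is None:
        node = node.left if node.test(o) else node.right
    return node.label


def specialize_leaf(leaf: Node, test: Callable, training_objects, other_label: str):
    """Replace a leaf by a subtree, but only if every training object
    reaching the leaf satisfies the test -- so training-set behaviour
    is preserved while unseen objects failing the test change class."""
    if all(test(o) for o in training_objects):
        leaf.test = test
        leaf.left = Node(label=leaf.label)       # training objects go here
        leaf.right = Node(label=other_label)     # only unseen objects go here
        leaf.label = None


# A leaf predicting "a", reached by training objects 2 and 4 (both even):
tree = Node(label="a")
training = [2, 4]
specialize_leaf(tree, lambda o: o % 2 == 0, training, other_label="b")

print([classify(tree, o) for o in training])  # ['a', 'a'] -- unchanged
print(classify(tree, 3))                      # 'b' -- an unseen object changes
```

The more complex tree performs identically on the training set (in the strict sense defined above) yet classifies some unseen objects differently; the paper's empirical claim is that, on real-world tasks, such added complexity frequently improves rather than degrades predictive accuracy.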