We begin by specifying a large collection of candidate features. We do
not require a priori that these features are actually relevant or useful.
Instead, we let the pool be as large as practically possible. Only a small
subset of this collection of features will eventually be employed in our final
model. In short, we would like to include in the model only a subset of
the full set of candidate features . We will call the set of
active features. The choice of must capture as much information about
the random process as possible, yet only include features whose expected values
can be reliably estimated.

With each one might associate a Bernoulli random variable
with the following (``objectivist'' or ``non-Bayesian'')
interpretation. Imagine that the set of pairs comprising were
drawn from an imaginary, infinite distribution of pairs. In this imaginary
distribution, f is active ( ) with some probability , and
inactive ( ) with probability .

(A Bayesian would prefer to put a distribution over the possible values of
, informed both by some empirical evidence--such as the fraction of
times in the sample --and also by some outside knowledge of
what one miht expect to be. This line of reasoning forms the basis of
``fuzzy maximum entropy,'' which is a quite intriguing line of investigation we
shall not pursue here.)

As a Bernoulli variable, f is subject to the law of large
numbers, which asserts that for
any :

That is, by increasing the number (N) of examples , the fraction
for which approaches the constant . More
succinctly, .
Unfortunately, is not infinite, and the empirical estimate
for each is not guaranteed to lie close to its ``true'' value
. However, a subset of will exhibit . And of these, a fraction will have a non-negligeable bearing on the
likelihood of the model. The goal of the inductive learning process is to identify
this subset and construct a conditional model from it.

Searching exhaustively for this subset is hopeless for all but the most
restricted problems. Even a priori fixing its size doesn't
help much: there are unique outcomes
of this selection process. For concreteness, a typical set of figures in
documented experiments involving maxent modeling are
and , meaning possible sets of 25 active
features. Viewed as a pure search problem, finding the optimal set of
features is computationally inconceivable.

To find , we adopt an incremental approach to feature selection. The idea
is to build up by successively adding features. The choice of feature to
add at each step is determined by the training data. Let us denote the set of
models determined by the feature set as . ``Adding'' a feature f
is shorthand for requiring that the set of allowable models all satisfy the
equality . Only some members of will satisfy this
equality; the ones that do we denote by .

Thus, each time a candidate feature is adjoined to , another linear
constraint is imposed on the space of models allowed by the features
in . As a result, shrinks; the model in with the
greatest entropy reflects ever-increasing knowledge and thus, hopefully,
becomes a more accurate representation of the process. This narrowing of the
space of permissible models was represented in figure 1 by a
series of intersecting lines (hyperplanes, in general) in a probability
simplex. Perhaps more intuitively, we could represent it by a series of nested
subsets of , as in figure 2.

As an aside, the intractability of the ``all at once'' optimization problem is
not unique to or an indictment of the exponential approach. Decision trees, for
example, are typically constructed recursively using a greedy algorithm. And in
designing a neural network representation of an arbitrary distribution, one
typically either fixes the network topology in advance or performs a restricted
search within a neighborhood of possible topologies for the optimal
configuration, because the complete search over parameters and topologies
is intractable.

Figure 2: A nested sequence of
subsets of
corresponding to increasingly large sets of features