In most of our experiments we used the rule learning system RIPPER
[2]. RIPPER has some advantages for this problem:
in particular, it handles the ``set-valued'' features (described
below) directly, and is efficient enough to use on problems of this
scale. (About 65,000 parse-tree node examples are generated from the
84 wrapper/page pairs.) The main reason for using RIPPER, however,
was that we were familiar with it; as we show later, other learning
systems achieve comparable results on this problem.

RIPPER, like most classification learners, requires an example to be
represented as a vector of relatively simple features. The value of
each feature is either a real number or a symbolic value: an
atomic symbol from a designated set, like $\{\mathit{true},
\mathit{false}\}$.
The primitive tests allowed for a real-valued feature are of the form
$f \le q$ or $f \ge q$, where $f$ is a feature and $q$ is a numeric
constant, and the primitive tests allowed for a symbolic feature are
of the form $f = a_i$, where $f$ is a feature and $a_i$ is a possible
value for $f$.
RIPPER also allows
set-valued features [3]. The value of a
set-valued feature is a set of atomic symbols, and tests on set-valued
features are of the form $a_i \in f$, where $f$ is the name of a
feature and $a_i$ is a possible value (e.g.,
$\mathrm{ul} \in \mathit{ancestorTagNames}$).
For two-class problems of this sort,
RIPPER uses a number of heuristics to build a disjunctive normal form
formula that generalizes the positive examples. This formula is
usually thought of as a set of rules, each rule having the form
``label an instance `positive' if $t_1$ and $t_2$ and $\ldots$'', where
each $t_i$ in the rule is a primitive test on some feature.
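To make this representation concrete, the following sketch shows the three test types and a one-rule hypothesis in Python. The feature names and values are invented for illustration; this is not RIPPER's actual syntax or implementation.

```python
# A sketch (not RIPPER itself) of the three feature types and the
# primitive tests described above. Names and values are illustrative.

example = {
    "recursiveTextLength": 142,                        # real-valued feature
    "tagName": "a",                                    # symbolic feature
    "ancestorTagNames": {"html", "body", "ul", "li"},  # set-valued feature
}

def test_le(f, q):   # f <= q, for a real-valued feature
    return lambda x: x[f] <= q

def test_ge(f, q):   # f >= q, for a real-valued feature
    return lambda x: x[f] >= q

def test_eq(f, a):   # f = a_i, for a symbolic feature
    return lambda x: x[f] == a

def test_in(f, a):   # a_i IN f, for a set-valued feature
    return lambda x: a in x[f]

# One rule: "label an instance 'positive' if t1 and t2 and ..."
rule = [test_eq("tagName", "a"),
        test_in("ancestorTagNames", "ul"),
        test_ge("recursiveTextLength", 100)]

# A hypothesis in disjunctive normal form is a set of such rules;
# an instance is positive if all tests of any one rule succeed.
def positive(x, rules):
    return any(all(t(x) for t in r) for r in rules)

print(positive(example, [rule]))  # True
```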

After some thought, we devised the following nineteen features to
describe a parse tree node. The tag name is the HTML tag name
associated with the node, such as a, p, br, and html. This
is an informative feature: some tags such as head are always
negative, while other tags such as the anchor tag a are often
positive. We measured the size of the string directly
associated with a node in two ways: the text
length is the total size of all the text associated with a node, and
the non-white text length is similar to text length but ignores
blanks, tabs, and line returns. We also measured the length of text
contained in the subtree rooted at the current node by the features
recursive text length and recursive non-white text length;
these features are important because they measure the size of the
string si that would be extracted if the node were marked as
positive. The features set of ancestor tag names, depth,
number of children, number of siblings, parent tag
name, set of child tag names and set of descendant tag
names are other natural and easily-computed features. Since the size
of parse trees varies considerably, we also normalize many of the
above features by the total number of nodes or by the maximal node
degree.
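The sketch below shows how these per-node features might be computed over a minimal parse-tree representation. The Node class and the feature names are our own illustrative choices made for this sketch; the paper does not commit to a particular parse-tree API.

```python
# Illustrative parse-tree node and feature extraction; an assumption,
# not the authors' actual data structures.

from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    text: str = ""              # text directly associated with the node
    children: list = field(default_factory=list)
    parent: "Node" = None

def features(n: Node) -> dict:
    # Collect all nodes in the subtree rooted at n.
    subtree = []
    def collect(m):
        subtree.append(m)
        for c in m.children:
            collect(c)
    collect(n)
    # Walk up to the root to get depth and ancestor tag names.
    depth, a, ancestors = 0, n.parent, set()
    while a is not None:
        depth += 1
        ancestors.add(a.tag)
        a = a.parent
    nonwhite = lambda s: "".join(s.split())   # drop blanks, tabs, newlines
    return {
        "tagName": n.tag,
        "textLength": len(n.text),
        "nonWhiteTextLength": len(nonwhite(n.text)),
        "recursiveTextLength": sum(len(m.text) for m in subtree),
        "recursiveNonWhiteTextLength": sum(len(nonwhite(m.text)) for m in subtree),
        "depth": depth,
        "numChildren": len(n.children),
        "numSiblings": (len(n.parent.children) - 1) if n.parent else 0,
        "parentTagName": n.parent.tag if n.parent else None,
        "ancestorTagNames": ancestors,
        "childTagNames": {c.tag for c in n.children},
        "descendantTagNames": {m.tag for m in subtree if m is not n},
    }
```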

The final features we designed are intended to detect and quantify
repeated structure in the parse tree; intuitively, this is important,
because positive nodes often lie on a structure that is frequently
repeated in the tree. The repetitive aspect of a structure can often be
detected by looking at the sequence of node tags that appear in paths
through the tree; for instance, in the parse tree for the bibliography
page of Figure , there are many paths labeled
li-a that originate at the ul node.

To measure this sort of repetition, let $tag(n)$ denote the tag
associated with a node $n$, and define the tag sequence position
of $n$, $p(n)$, as the sequence of tags encountered in traversing the
path from the root of the parse tree to $n$: that is,
$p(n) = \langle \mathrm{html}, \ldots, tag(n) \rangle$.
If $p(n_1) = \langle t_1, \ldots, t_k \rangle$ and
$p(n_2) = \langle t_1, \ldots, t_k, t_{k+1}, \ldots, t_m \rangle$,
then we say the tag sequence position $p(n_1)$ is a prefix of
$p(n_2)$; if additionally $p(n_1)$ is strictly shorter than $p(n_2)$,
then we say that $p(n_1)$ is a proper prefix of $p(n_2)$.
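In code, the tag sequence position and the two prefix relations are straightforward. The following sketch reuses the illustrative Node class defined above.

```python
# Tag sequence position and prefix relations, over the illustrative
# Node class sketched earlier.

def p(n):
    """p(n): tags on the path from the root of the tree down to n."""
    tags = []
    while n is not None:
        tags.append(n.tag)
        n = n.parent
    return tuple(reversed(tags))

def is_prefix(p1, p2):
    """p1 is a (not necessarily proper) prefix of p2."""
    return p2[:len(p1)] == p1

def is_proper_prefix(p1, p2):
    """p1 is a prefix of p2 and strictly shorter than p2."""
    return len(p1) < len(p2) and is_prefix(p1, p2)
```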

We use the node prefix count for $n$ as a way of measuring the
degree to which $n$ participates in a repeated substructure. The
node prefix count for $n$, $pcount(n)$, is the number of
leaf nodes $l$ in the tree such that the tag sequence position of $n$
is a prefix of the tag sequence position of $l$: more formally,
$pcount(n) = |\{l : p(n) \mbox{ is a tag sequence prefix of } p(l),\
l \mbox{ is a leaf}\}|$.
The node suffix count for $n$, $scount(n)$, is closely
related; it is defined as the number of leaf nodes $l$ with tag
sequence positions of which $p(n)$ is a proper prefix. We normalize
both $pcount(n)$ and $scount(n)$ by the total number of
paths in the tree.
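Continuing the same sketch, $pcount(n)$ and $scount(n)$ can be computed by enumerating leaves, with the normalization by the number of root-to-leaf paths (one per leaf).

```python
# pcount(n), scount(n), and their normalization, continuing the
# illustrative sketch above (Node, p, is_prefix, is_proper_prefix).

def leaves(root):
    """All leaf nodes in the subtree rooted at root."""
    if not root.children:
        yield root
    for c in root.children:
        yield from leaves(c)

def pcount(n, root):
    """Number of leaves l such that p(n) is a prefix of p(l)."""
    pn = p(n)
    return sum(1 for l in leaves(root) if is_prefix(pn, p(l)))

def scount(n, root):
    """Number of leaves l such that p(n) is a proper prefix of p(l)."""
    pn = p(n)
    return sum(1 for l in leaves(root) if is_proper_prefix(pn, p(l)))

def normalize(count, root):
    """Divide by the number of root-to-leaf paths, one per leaf."""
    return count / sum(1 for _ in leaves(root))
```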

These last features are clearly engineered, but they are based on
plausible intuitions: the tree-rewriting language specifies nodes to be
rewritten in terms of their tag sequence positions, and tools for
writing programs in the tree-rewriting language are also based on
looking for repeated tag sequences. Most of the other features were
introduced in a fairly non-selective ``brainstorming'' process; here
we were looking for features that were easy to compute and plausibly
related to the classification task. No effort was made to select
an optimal set of features, or to include any sort of strong knowledge
of the type of extraction programs we expected to see.

As an illustration of the way in which these features are used in
learning, Figure shows some representative rules that
appear in the hypothesis obtained by training RIPPER on all of the 84
wrapper/page pairs. Rule 1 says that a node should be labeled as
``positive''--that is, part of a list to be extracted--if it is an
anchor (a) element that repeats frequently, as measured by the
normalized suffix count for the node. Rule 2 is similar; it marks
anchor elements for extraction if they repeat somewhat less
frequently, but are close to the root of the parse tree. Rule 3 marks
paragraph (p) elements for extraction; since this type of element
is less likely to be in a list, the rule requires very strong
repetition, as measured both by the node prefix count and by the
number of siblings of the node. Rule 4 marks table entry elements
(td) for extraction if they are part of a repeating structure
that comprises most of the page.
The complete hypothesis learned by RIPPER from this data contains a
total of 36 such rules.
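Purely as an illustration, the four rules described above might be rendered as predicates over the feature dictionary sketched earlier. The feature names (normalizedScount, normalizedPcount) and all numeric thresholds below are invented placeholders, not the constants actually learned by RIPPER.

```python
# Hypothetical rendering of Rules 1-4; thresholds are placeholders
# chosen for illustration, not the learned values from the figure.

def rule1(x):  # frequently repeating anchor elements
    return x["tagName"] == "a" and x["normalizedScount"] >= 0.5

def rule2(x):  # less frequent anchors, close to the root
    return (x["tagName"] == "a" and x["normalizedScount"] >= 0.2
            and x["depth"] <= 3)

def rule3(x):  # paragraphs, requiring very strong repetition
    return (x["tagName"] == "p" and x["normalizedPcount"] >= 0.7
            and x["numSiblings"] >= 10)

def rule4(x):  # table entries in a structure covering most of the page
    return x["tagName"] == "td" and x["normalizedPcount"] >= 0.8

def positive(x):
    # The learned hypothesis is a disjunction of such rules.
    return any(r(x) for r in (rule1, rule2, rule3, rule4))
```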