The Internet provides access to numerous sources of useful information
in textual form -- telephone directories, event listings, product
catalogs, etc. Recently, there has been much interest in building
systems that gather such information on a user's behalf. But because
these information resources are formatted for use by people,
mechanically extracting their content is difficult. Systems that use
such resources typically rely on hand-coded wrappers: customized
procedures for information extraction.

We make three contributions. First, we introduce wrapper
induction, a technique for automatically constructing wrappers
from labeled examples of a resource's content. Second, we identify a
class of wrappers that is efficiently learnable, yet expressive enough
to handle 48 percent of a recently surveyed sample of actual Internet
resources. Finally, we describe a method for heuristically labeling
the examples used by the induction algorithm. We demonstrate, both
empirically and analytically (using the PAC computational learning
model), that automatic wrapper induction is feasible, and that the
system degrades gracefully with imperfect labeling heuristics.
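
To make the idea of learning a wrapper from labeled examples concrete,
here is a minimal sketch in Python. It is not the algorithm from the
paper. It assumes a deliberately simple wrapper class in which each
attribute is bracketed by one fixed left and one fixed right delimiter
string, and every name in it (label_by_search, induce_wrapper,
apply_wrapper) is hypothetical. Induction takes the longest common
suffix of the text before each labeled value as the left delimiter, and
the longest common prefix of the text after it as the right delimiter.

```python
from os.path import commonprefix  # character-wise longest common prefix


def common_suffix(strings):
    """Longest string that every input ends with."""
    return commonprefix([s[::-1] for s in strings])[::-1]


def label_by_search(page, value_rows):
    """Toy labeler (an assumption): find known values left to right."""
    rows, pos = [], 0
    for values in value_rows:
        row = []
        for value in values:
            start = page.index(value, pos)
            pos = start + len(value)
            row.append((start, pos))
        rows.append(row)
    return rows


def induce_wrapper(examples):
    """Learn one (left, right) delimiter pair per attribute.

    examples: list of (page, rows); each row holds one (start, end)
    character span per attribute, marking a labeled value in the page.
    """
    n_attrs = len(examples[0][1][0])
    wrapper = []
    for k in range(n_attrs):
        befores, afters = [], []
        for page, rows in examples:
            for row in rows:
                start, end = row[k]
                befores.append(page[:start])
                afters.append(page[end:])
        left, right = common_suffix(befores), commonprefix(afters)
        if not left or not right:
            raise ValueError(f"no consistent delimiters for attribute {k}")
        wrapper.append((left, right))
    return wrapper


def apply_wrapper(wrapper, page):
    """Extract rows by scanning for each attribute's delimiters in turn."""
    rows, pos = [], 0
    while True:
        row = []
        for left, right in wrapper:
            start = page.find(left, pos)
            if start < 0:
                return rows  # no further rows on this page
            start += len(left)
            end = page.find(right, start)
            if end < 0:
                return rows
            row.append(page[start:end])
            pos = end
        rows.append(row)


train = ("<html><b>Congo</b> <i>242</i><br><b>Egypt</b> <i>20</i><br>"
         "<b>Belize</b> <i>501</i><br></html>")
spans = label_by_search(
    train, [["Congo", "242"], ["Egypt", "20"], ["Belize", "501"]])
wrapper = induce_wrapper([(train, spans)])
print(apply_wrapper(
    wrapper, "<html><b>Spain</b> <i>34</i><br><b>Peru</b> <i>51</i><br></html>"))
```

With too few examples the learned delimiters can overfit: if every
labeled value happened to begin with the same character, that character
would be absorbed into a delimiter. This is one reason the number of
examples matters.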

We tested our system on several actual Web sites. The graphs
below indicate the number of examples needed to learn a satisfactory
wrapper, as a function of increasing oracle noise, for two sites (OKRA
and BigBook); see the paper for details.