Abstract:
Previous studies have highlighted the rapidity with which new content
arrives on the web. We study the extent to which this new content can
be efficiently discovered in the crawling model. Our study has two
parts. First, we employ a maximum cover formulation to study the inherent
difficulty of the problem in a setting in which we have perfect
estimates of likely sources of links to new content. Second,
we relax the assumption of perfect estimates and consider a more
realistic setting in which algorithms must discover new content using
historical statistics to estimate which pages are most likely to yield
links to new content.

We measure the overhead of discovering new content, defined as
the average number of fetches required to discover one new page. We
show first that with perfect foreknowledge of where to explore for
links to new content, it is possible to discover 50% of all new
content with under 3% overhead, and 100% of new content with 28%
overhead. But actual algorithms, which do not have access to perfect
foreknowledge, face a more difficult task: 26% of new content is
accessible only by recrawling a constant fraction of the entire web.
Of the remaining 74%, 80% may be discovered within one week at a
discovery cost equal to 1.3 times the cost of gathering the new
content itself, in a model with full monthly recrawls.
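As a rough illustration of the maximum-cover formulation mentioned above, the sketch below applies the standard greedy set-cover heuristic: given perfect knowledge of which old pages link to which new pages, repeatedly fetch the old page exposing the most not-yet-discovered new pages, and report the overhead as fetches per new page discovered. All page names and link sets are hypothetical, and this is an illustration of the general technique, not the paper's exact algorithm.

```python
def greedy_discovery(links, target_fraction=1.0):
    """Greedy maximum-cover sketch for new-content discovery.

    links: dict mapping an old page -> set of new pages it links to
           (a hypothetical, perfectly known link structure).
    Returns (fetch order, covered new pages, overhead), where overhead
    is the average number of fetches per new page discovered.
    """
    all_new = set().union(*links.values())
    goal = target_fraction * len(all_new)
    covered, order, fetches = set(), [], 0
    remaining = dict(links)
    while len(covered) < goal and remaining:
        # Fetch the page that uncovers the most new pages so far.
        best = max(remaining, key=lambda p: len(remaining[p] - covered))
        covered |= remaining.pop(best)
        order.append(best)
        fetches += 1
    overhead = fetches / len(covered) if covered else float("inf")
    return order, covered, overhead

# Hypothetical link structure: three old pages pointing at six new pages.
links = {
    "hub.example/index": {"n1", "n2", "n3", "n4"},
    "blog.example/feed": {"n3", "n4", "n5"},
    "site.example/home": {"n6"},
}
order, covered, overhead = greedy_discovery(links)
```

On this toy input the greedy heuristic fetches the hub page first (it covers four of the six new pages), then the two remaining pages, for an overhead of 3 fetches / 6 new pages = 0.5.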