Re: lynx-dev The traversal limitation

From: David Woolley
Subject: Re: lynx-dev The traversal limitation
Date: Tue, 29 Sep 1998 07:36:08 +0100 (BST)

> Most sites do contain much more material than I ever want to download over
> my slow link.
Many sites also object strongly to being crawled, because they expend
bandwidth on pages that are never read. IMDB is a case in point. Please never crawl
that site with Lynx or you will find that Lynx gets permanently barred from
it. (The other issue is that mirrored copies breach the copyright and
deny them the ability to obtain the advertising revenue that pays for the
site.)
> So I try to filter the searches from the start file via the reject.dat
> file and -realm. But in this case, the interesting pages is in a /cgi-bin/
Pages aren't normally in /cgi-bin; rather, a program is run to create
the page on the fly when you reference URLs of this form. That's particularly
expensive for the site, and most sites will use the robots.txt file to bar
well-behaved crawlers from such URLs. Unfortunately, Lynx is NOT well behaved.
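To illustrate what a well-behaved crawler does before fetching a /cgi-bin URL, here is a minimal Python sketch using the standard library's robots.txt parser. The robots.txt content, the "ExampleCrawler" user agent, and the www.example.com URLs are all made up for illustration; a real crawler would fetch the site's actual /robots.txt instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, of the kind a site might publish to keep
# crawlers away from pages generated on the fly by CGI programs.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network access is involved here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleCrawler" and the URLs below are assumptions for the example.
cgi_url = "http://www.example.com/cgi-bin/search?q=lynx"
static_url = "http://www.example.com/index.html"

print(allowed(EXAMPLE_ROBOTS_TXT, "ExampleCrawler", cgi_url))
print(allowed(EXAMPLE_ROBOTS_TXT, "ExampleCrawler", static_url))
```

A crawler that skips this check, as Lynx does when traversing, will request the CGI URLs regardless of what the site has asked for.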
> directory and not below the startfile. And even if I was able to make a
> complete filter, the "noise factor" among the valid files would be around
> 95% of all pages.
>
> So I tried to build an own local html-coded startfile to pull the
> interesting
wget is designed for this purpose and is available in win32 versions. It is
a well-behaved crawler, but it can be given an explicit list of URLs, on
the command line or in a file, and will then bypass robots.txt. Because it
is well behaved, it is less likely to be barred, although excessive use
for mirroring a set of pages could still have this effect.
It doesn't render the HTML, but it can fix up internal URLs so that they work
from the local filesystem, allowing you to use Lynx, or another browser, on
that copy.
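As a rough sketch of the approach described above, the following Python helper builds a wget command line that reads an explicit list of URLs from a file and rewrites links for local viewing. The file name urls.txt and the two-second pause are assumptions chosen for the example; --input-file, --convert-links and --wait are standard wget options, but check the manual for the version you have installed.

```python
import shlex


def build_wget_command(url_list_file, wait_seconds=2):
    """Build a wget argument list for fetching an explicit list of URLs.

    url_list_file is a text file with one URL per line (hypothetical name
    supplied by the caller). wait_seconds spaces out the requests so the
    site is not hit in a tight loop.
    """
    return [
        "wget",
        "--input-file=" + url_list_file,  # read the URLs to fetch from a file
        "--convert-links",                # rewrite links to work from local disk
        "--wait=" + str(wait_seconds),    # pause between retrievals
    ]


# Print the command rather than running it, so it can be inspected first.
command = build_wget_command("urls.txt")
print(shlex.join(command))
```

The printed command can be pasted into a shell, or the list can be passed to subprocess.run. The resulting local copy can then be read with Lynx or any other browser.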