From owner-robots Thu Oct 12 14:39:19 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA20349; Thu, 12 Oct 95 14:39:19 -0700
Message-Id: <9510122139.AA20341@webcrawler.com>
To: robots
Subject: The robots mailing list at WebCrawler
From: Martijn Koster
Date: Thu, 12 Oct 1995 14:39:19 -0700
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Welcome to our new home...
This mailing list is now open for traffic.
For details see:
http://info.webcrawler.com/mailing-lists/robots/info.html
-- Martijn
__________
Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html
From owner-robots Thu Oct 12 16:09:58 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA25602; Thu, 12 Oct 95 16:09:58 -0700
Message-Id:
Date: Thu, 12 Oct 95 16:09 PDT
X-Sender: a07893@giant.mindlink.net
X-Mailer: Windows Eudora Pro Version 2.1.2
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: robots@webcrawler.com
From: Tim Bray
Subject: Something that would be handy
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
It might be nice to enhance robots.txt to include a hint as to how
long the file ought to be cached by a Robot driver. People who don't
understand why probably ought to ignore this message. People who do
might want to suggest (a) reasons why this is a silly idea, (b) a
syntax/method for doing it, or (c) any implementation difficulties
that could ensue.
My suggestion, expressed in the form of perl code that could be used
to implement it:
if (/^\s*CacheHint:\s+(\d+)\s*([dhm])\s*$/)
{
    # $1 is the count, $2 the unit: d(ays), h(ours) or m(inutes)
    $SecondsToCache = $1;
    if    ($2 eq 'd') { $SecondsToCache *= 60*60*24; }
    elsif ($2 eq 'h') { $SecondsToCache *= 60*60; }
    else              { $SecondsToCache *= 60; }
}
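For example, a /robots.txt using this (purely hypothetical) field might
read:

    # robots may cache this file for two days
    CacheHint: 2 d
    User-agent: *
    Disallow: /cgi-bin/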
Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com)
From owner-robots Fri Oct 13 18:03:54 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA29927; Fri, 13 Oct 95 18:03:54 -0700
Message-Id:
Date: Sat, 14 Oct 95 11:07:39 0000
From: James
Organization: Tourist Radio Pty Ltd
X-Mailer: Mozilla 1.1N (Macintosh; I; 68K)
Mime-Version: 1.0
To: robots@webcrawler.com
Subject: Site Announcement
X-Url: http://info.webcrawler.com/mailing-lists/robots/info.html
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=us-ascii
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
We wish to advise those with a robot-seeking facility that we have two
sites, at http://www.com.au/aaa and http://www.world.net/touristradio.
We would be grateful if you would ask your robots to visit and announce
our sites where possible.
If this is bad net etiquette, we apologise; there are huge backlogs
with manual services.
James
From owner-robots Mon Oct 16 08:25:16 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA00957; Mon, 16 Oct 95 08:25:16 -0700
Message-Id: <9510161525.AA00951@webcrawler.com>
To: robots
Subject: Re: Site Announcement
In-Reply-To: Your message of "Sat, 14 Oct 1995 11:07:39."
Date: Mon, 16 Oct 1995 08:25:16 -0700
From: Martijn Koster
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Hi,
You've asked me to add a link. The best way to get a link added
to the WebCrawler is to submit it at
http://www.webcrawler.com/WebCrawler/SubmitURLS.html
Regards,
-- Martijn
__________
Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html
From owner-robots Mon Oct 16 18:36:43 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA29862; Mon, 16 Oct 95 18:36:43 -0700
Message-Id:
Date: 16 Oct 1995 18:40:48 -0800
From: "Roger Dearnaley"
Subject: How do I let spiders in?
To: " "
X-Mailer: Mail*Link SMTP-QM 3.0.2
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Is there any currently supported way of providing spiders access to our (soon
to be launched) username & password authenticated site? (Of course, if a
customer followed a link generated by this spider search, they would be asked
for authentication, but when they can't provide it we will redirect them to a
registration page.)
The security on our site is not meant to be high: it is there primarily so
that the form CGI scripts have a unique user name to figure out who is doing
what. Thus for our site we would probably be happy to just place a user name
and password in robots.txt, or some similar low-security solution. However, I
can see that for other sites this might not be acceptable, so spider
maintainers might want to consider adding fields for the username and password
to use to their 'Please index this URL' submission forms. Then, ideally, it
should be possible to submit these forms securely.
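To make the low-security option concrete, a robots.txt extension along
these lines would suffice for us (completely hypothetical; no robot
understands these fields today):

    User-agent: *
    Disallow: /private/
    # made-up fields for illustration only
    Auth-User: spider
    Auth-Password: letmein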
--Roger Dearnaley (roger_dearnaley@intouchgroup.com)
From owner-robots Wed Oct 18 08:32:24 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA12938; Wed, 18 Oct 95 08:32:24 -0700
X-Sender: narnett@hawaii.verity.com
Message-Id:
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Wed, 18 Oct 1995 08:31:05 -0700
To: robots@webcrawler.com
From: narnett@Verity.COM (Nick Arnett)
Subject: Unfriendly robot at 205.177.10.2
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
One of my Web servers (http://asearch.mccmedia.com/) was attacked last night
by a very unfriendly robot that requested many documents per second. The
robot was originating from 205.177.10.2. I've tried to resolve that IP
address, but have been unable to thus far. However, a traceroute shows that a
cais.net router was the last hop before the domain in which the offending
robot lives, so I sent an e-mail to the postmaster there, hoping that he or
she will know whose host that is and will forward it (assuming that whoever
owns this thing is a CAIS customer).
Has anyone else encountered this one? It doesn't identify itself at all.
Nick
From owner-robots Wed Oct 18 08:58:47 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA14082; Wed, 18 Oct 95 08:58:47 -0700
Message-Id:
Date: Wed, 18 Oct 95 08:58 PDT
X-Sender: a07893@giant.mindlink.net
X-Mailer: Windows Eudora Pro Version 2.1.2
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: robots@webcrawler.com
From: Tim Bray
Subject: Re: Unfriendly robot at 205.177.10.2
Cc: robots@webcrawler.com
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
At 08:31 18/10/95 -0700, Nick Arnett wrote:
>One of my Web servers (http://asearch.mccmedia.com/) was attacked last night
>by a very unfriendly robot that requested many documents per second. The
>robot was originating from 205.177.10.2.
That resolves to 'murph.cais.net' - no idea who they are, never heard
of 'em. - Tim
From owner-robots Wed Oct 18 09:06:44 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA14459; Wed, 18 Oct 95 09:06:44 -0700
X-Sender: narnett@hawaii.verity.com
Message-Id:
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Wed, 18 Oct 1995 09:05:20 -0700
To: robots@webcrawler.com
From: narnett@Verity.COM (Nick Arnett)
Subject: CORRECTION -- Re: Unfriendly robot
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Whoops -- I pasted the wrong IP address into this message. The unfriendly
robot was at 205.252.60.50.
Nick
From owner-robots Wed Oct 18 09:32:08 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA15587; Wed, 18 Oct 95 09:32:08 -0700
X-Sender: narnett@hawaii.verity.com
Message-Id:
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Wed, 18 Oct 1995 09:30:32 -0700
To: robots@webcrawler.com
From: narnett@Verity.COM (Nick Arnett)
Subject: Re: Unfriendly robot at 205.177.10.2
Cc: tbray@opentext.com
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
At 8:58 AM 10/18/95, Tim Bray wrote:
>That resolves to 'murph.cais.net' - no idea who they are, never heard
>of 'em.
As you may have seen in my correction, that was a mistake on my part. I
copied that address from the traceroute -- it's the last router before the
address space in which the misbehaving robot lives. It is Capital Area
Internet Service, and under the assumption that the owner of the robot is
one of their customers, I sent a message to the CAIS postmaster.
The correct address of the owner of the robot is 205.252.60.50, which won't
resolve. Tight security, apparently. Ironically.
Nick
From owner-robots Wed Oct 18 09:43:26 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA16066; Wed, 18 Oct 95 09:43:26 -0700
From: reinpost@win.tue.nl (Reinier Post)
Message-Id: <199510181643.RAA22167@wsinis11.win.tue.nl>
Subject: Re: Unfriendly robot at 205.177.10.2
To: robots@webcrawler.com
Date: Wed, 18 Oct 1995 17:42:55 +0100 (MET)
In-Reply-To: from "Nick Arnett" at Oct 18, 95 08:31:05 am
X-Mailer: ELM [version 2.4 PL23]
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Content-Length: 921
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
You (Nick Arnett) write:
>
>One of my Web servers (http://asearch.mccmedia.com/) was attacked last night
>by a very unfriendly robot that requested many documents per second. The
>robot was originating from 205.177.10.2. I've tried to resolve that IP
>address, but have been unable to thus far. However, a traceroute shows that a
>cais.net router was the last hop before the domain in which the offending
>robot lives, so I sent an e-mail to the postmaster there, hoping that he or
>she will know whose host that is and will forward it (assuming that whoever
>owns this thing is a CAIS customer).
Here you are:
% host 205.177.10.2
Name: murph.cais.net
Address: 205.177.10.2
Aliases:
>Has anyone else encountered this one? It doesn't identify itself at all.
No accesses here from 205.177.10.2 or cais.net.
>Nick
--
Reinier Post reinpost@win.tue.nl
a.k.a. me
From owner-robots Wed Oct 18 11:32:15 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA21768; Wed, 18 Oct 95 11:32:15 -0700
Message-Id: <9510181831.AA06646@ai.iit.nrc.ca>
Date: Wed, 18 Oct 95 14:31:39 EDT
From: Alain Desilets
To: robots@webcrawler.com
Subject: Looking for a spider
Cc: alain@ai.iit.nrc.ca
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Dear spider developers,
My name is Alain Desilets. I am a researcher in the Interactive
Information Group of the National Research Council of Canada.
We are a small group (6 people) developing tools for interactive
access to information. Our technological angle on this problem is AI
based approaches, in particular Machine Learning and Agents. You can
find more about our work at http://ai.iit.nrc.ca/II_public/.
In order to test our methods we need to acquire a large corpus of
full HTML files from the Web. We plan to use a spider for that task.
We are aware of the controversy surrounding the creation of new
spiders and therefore do not plan to develop one. That
would not only be a duplication of effort but would also introduce a
new, possibly buggy spider in Koster's already vast list of Web
critters. Instead, we would like to use a publicly available, well
behaved and proven spider.
Is there such a spider available for serious research purposes?
Or maybe the corpus we need already exists? Is there a CD-ROM or .zip
file that would give us the whole of the web in full HTML?
Thanks for your help.
Alain Desilets
Institute for Information Technology
National Research Council of Canada
Building M-50
Montreal Road
Ottawa (Ont)
K1A 0R6
e-mail: alain@ai.iit.nrc.ca
Tel: (613) 990-2813
Fax: (613) 952-7151
From owner-robots Wed Oct 18 12:28:54 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA23934; Wed, 18 Oct 95 12:28:54 -0700
Date: Wed, 18 Oct 1995 15:34:04 -0400
Message-Id: <199510181934.PAA12177@maple.sover.net>
X-Sender: Leigh.D.Dupee@neinfo.net
X-Mailer: Windows Eudora Version 1.4.4
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: robots@webcrawler.com
From: Leigh.D.Dupee@neinfo.net (Leigh DeForest Dupee)
Subject: Re: Unfriendly robot at 205.177.10.2
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Query:All records (ALL):2.10.177.205.in-addr.arpa
Authoritative Answer
2.10.177.205.in-addr.arpa PTR murph.cais.net
10.177.205.in-addr.arpa NS cais.com
cais.com A 199.0.216.4
Complete: 2.10.177.205.in-addr.arpa
Query:All records (ALL):murph.cais.net
Authoritative Answer
Name does not exist
Complete:NO_DATA murph.cais.net
Best I can come up with!
>One of my Web servers (http://asearch.mccmedia.com/ last night was attacked
>by a very unfriendly robot that requested many documents per second. This
>robot was originating from 205.177.10.2. I've tried to resolve that IP
>address, but I'm unable thus far. However, a traceroute shows that a
>cais.net router was the last hop before the domain in which the offending
>robot lives, so I sent an e-mail to the postmaster there, hoping that he or
>she will know whose host that is and will forward it (assuming that whoever
>owns this thing is a CAIS customer).
>
>Has anyone else encountered this one? It doesn't identify itself at all.
>
>Nick
>
>
>
---------------------------------------------------------------
Leigh DeForest Dupee
Help Me Learn, Inc., Administrator for NEInfo.Net
South Stream Road RR3 Box 4203, Bennington, VT 05201
(802) 447-2905
---------------------------------------------------------------
From owner-robots Wed Oct 18 12:49:50 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA24697; Wed, 18 Oct 95 12:49:50 -0700
Message-Id: <9510181951.AA08164@pluto.sybgate.sybase.com>
X-Sender: dbakin@pluto
X-Mailer: Windows Eudora Version 2.1.1
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Wed, 18 Oct 1995 12:49:14 -0700
To: robots@webcrawler.com
From: David Bakin
Subject: Is it a robot or a link-updater?
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
As the subject implies, I'm curious if there is a difference, in the impact
on the serving site, between a true robot and someone running an automatic
link updater? Can they even be told apart by the serving site? -- Dave
--
Dave Bakin How much work would a work flow flow if a #include
415-872-1543 x5018 work flow could flow work?
From owner-robots Wed Oct 18 13:16:38 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA25902; Wed, 18 Oct 95 13:16:38 -0700
From: amonge@cs.ucsd.edu (Alvaro Monge)
Message-Id: <9510182013.AA10642@dino>
Subject: Re: Looking for a spider
To: robots@webcrawler.com
Date: Wed, 18 Oct 1995 13:13:55 -0700 (PDT)
In-Reply-To: <9510181831.AA06646@ai.iit.nrc.ca> from "Alain Desilets" at Oct 18, 95 02:31:39 pm
X-Mailer: ELM [version 2.4 PL23]
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Content-Length: 1865
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
A colleague of mine and I are also doing AI-based research, and we are
in need of a large corpus. We would like to use anything that is already
available that keeps the structure of the real WWW and does not take
anything away, so that we can run realistic experiments with our
approaches.
Thanks in advance for any pointers,
--Alvaro
Computer science and engineering department
University of California, San Diego
>
> Dear spider developers,
>
>
> My name is Alain Desilets. I am a researcher in the Interactive
> Information Group of the National Research Council of Canada.
>
> We are a small group (6 people) developing tools for interactive
> access to information. Our technological angle on this problem is AI
> based approaches, in particular Machine Learning and Agents. You can
> find more about our work at http://ai.iit.nrc.ca/II_public/.
>
> In order to test our methods we need to acquire a large corpus of
> full HTML files from the Web. We plan to use a spider for that task.
>
> We are aware of the controversy surrounding the creation of new
> spiders and therefore do not plan to develop one. That
> would not only be a duplication of effort but would also introduce a
> new, possibly buggy spider in Koster's already vast list of Web
> critters. Instead, we would like to use a publicly available, well
> behaved and proven spider.
>
> Is there such a spider available for serious research purposes?
>
> Or maybe the corpus we need already exists? Is there a CD-ROM or .zip
> file that would give us the whole of the web in full HTML?
>
>
> Thanks for your help.
>
> Alain Desilets
>
> Institute for Information Technology
> National Research Council of Canada
> Building M-50
> Montreal Road
> Ottawa (Ont)
> K1A 0R6
>
> e-mail: alain@ai.iit.nrc.ca
> Tel: (613) 990-2813
> Fax: (613) 952-7151
>
>
From owner-robots Wed Oct 18 14:13:35 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA28102; Wed, 18 Oct 95 14:13:35 -0700
Message-Id:
Date: 18 Oct 1995 15:13:44 -0700
From: "Xiaodong Zhang"
Subject: Re: Looking for a spider
To: robots@webcrawler.com
X-Mailer: Mail*Link SMTP-QM 3.0.2
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Reply to: RE>>Looking for a spider
7/24/95 - Frontier Technologies licenses Lycos Internet Catalog software

MEQUON, WIS. (July 24) BUSINESS WIRE - July 24, 1995 - Frontier
Technologies Corp. today announced it has signed an agreement to license
the Lycos(TM) Internet Catalog. The Lycos Catalog has been incorporated
into Frontier Technologies' new SuperHighway Access product, called
SuperHighway Access CyberSearch(TM), which allows users to perform a
Lycos search offline via CD-ROM, connecting to the Internet only once
relevant Internet resources have been identified.

The Lycos technology was developed at Carnegie Mellon University, and was
recently transferred to Lycos Inc., a newly-created subsidiary of CMG
Information Services Inc. Lycos is a software system which contains a
robot that searches the World Wide Web and catalogs the documents it
finds. It also includes an information search engine that helps users
access information quickly and easily when they type in key words or
topics. The Lycos exploration robot locates new and changed documents and
builds abstracts, which consist of title, headings, subheadings, 100 most
significant words and the first 20 lines of the document. The catalog is
continually updated by the Lycos exploration agent. Frontier will receive
regular updates from Lycos Inc., allowing it to produce monthly issues of
SuperHighway Access CyberSearch.

"It's now widely understood that one of the primary barriers to users'
productivity on the Internet is finding information," said Dennis
Freeman, Frontier Technologies' marketing director. "That's why Internet
search services like Lycos are among the Internet's most popular sites."

"Lycos Inc. is pleased to partner with Frontier as they contribute to our
continued position as the most widely used and most comprehensive catalog
product on the Web," said Bob Davis, CEO of Lycos Inc.

The product, now shipping, consists of a 608-megabyte subset of the Lycos
catalog, indexing about half a million web pages, integrated with
Frontier's multi-session, multi-protocol Internet browser software. The
product is shipped on CD-ROM and is available through Frontier's reseller
channel. The CD will be updated monthly (bi-monthly initially).

Frontier is offering the first issue of CyberSearch at $14.95. A charter
subscription for 6 issues is priced at $6.75 per month. Subscribers
should call 1-800/879-0075 (+1-414/571-0190 outside the U.S.) or access
Frontier's web server, http://www.frontiertech.com, for further
information.

Lycos Inc., with offices in Wilmington, Mass. and Pittsburgh, Penn., is
the newly formed corporation based upon technology developed at Carnegie
Mellon University. Frontier Technologies Corp., based in Mequon, is a
leading supplier of TCP/IP and Internet-based products that make
businesses more competitive in a global market.

CONTACT:
Frontier Technologies Corp., Mequon
Nicole Rogers, 414/241-4555 x293
or
Lycos Inc.
Mike Olfe, 508/657-5050 x3124
From owner-robots Wed Oct 18 14:55:47 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA29718; Wed, 18 Oct 95 14:55:47 -0700
X-Sender: narnett@hawaii.verity.com
Message-Id:
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Wed, 18 Oct 1995 14:54:22 -0700
To: robots@webcrawler.com
From: narnett@Verity.COM (Nick Arnett)
Subject: Unfriendly robot owner identified!
Cc: aleonard@well.com
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Got 'em.
Using whois, I found that the IP address belongs to Library Corp. in
Virginia. They're the providers of the "NlightN" search service at:
http://www.nlightn.com/
Anybody know anything about their robot? I know that they've licensed the
Lycos data.
Their background information says, "NlightN, a division of The Library
Corporation, was formed to develop and market a Universal Index to the
world's electronically stored information."
I guess their robot has to work fast to build a universal index... ;-)
Nick
From owner-robots Wed Oct 18 15:19:02 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA01014; Wed, 18 Oct 95 15:19:02 -0700
Date: Wed, 18 Oct 1995 15:18:53 -0700 (PDT)
From: Andrew Leonard
Subject: Re: Unfriendly robot owner identified!
To: robots@webcrawler.com
In-Reply-To:
Message-Id:
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Hi, all.
I'm a reporter for Wired working on a story about bots, and I'm
personally following up on this NlightN robot episode. I've put a call
into their Reston VA headquarters asking to talk to someone about their
search robot, and I'll keep the list posted on whatever I find out.
Andrew Leonard
Wired Magazine
> Got 'em.
>
> Using whois, I found that the IP address belongs to Library Corp. in
> Virginia. They're the providers of the "NlightN" search service at:
>
> http://www.nlightn.com/
>
> Anybody know anything about their robot? I know that they've licensed the
> Lycos data.
>
> Their background information says, "NlightN, a division of The Library
> Corporation, was formed to develop and market a Universal Index to the
> world's electronically stored information."
>
> I guess their robot has to work fast to build a universal index... ;-)
>
> Nick
>
>
>
From owner-robots Wed Oct 18 15:38:57 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA02009; Wed, 18 Oct 95 15:38:57 -0700
From: amonge@cs.ucsd.edu (Alvaro Monge)
Message-Id: <9510182200.AA11857@dino>
Subject: Re: Looking for a spider
To: robots@webcrawler.com
Date: Wed, 18 Oct 1995 15:00:01 -0700 (PDT)
In-Reply-To: from "Xiaodong Zhang" at Oct 18, 95 03:13:44 pm
X-Mailer: ELM [version 2.4 PL23]
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Content-Length: 555
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Unfortunately, I cannot use most robots that I know of because they
DO NOT SAVE the entire document, or its hierarchical structure.
Lycos for example:
> The Lycos exploration robot locates new and changed documents and
> builds abstracts, which consist of title, headings, subheadings,
> 100 most significant words and the first 20 lines of the document.
For my research, this is not that useful. I need the entire document,
as it appears at the source -- not as saved by some robot, because I
want to follow the links within the document.
--Alvaro
From owner-robots Wed Oct 18 16:19:20 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA04137; Wed, 18 Oct 95 16:19:20 -0700
X-Sender: narnett@hawaii.verity.com
Message-Id:
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Wed, 18 Oct 1995 16:18:02 -0700
To: robots@webcrawler.com
From: narnett@Verity.COM (Nick Arnett)
Subject: Really fast searching
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
It's a bit off-topic, but I can't resist sharing something that one of our
sharp-eyed engineers found in a certain company's information page about
their search service:
> By transparently linking hundreds of data sources, ******* has
> created the world's largest integrated index, already comprised of
> more than 100 gigabytes and growing daily. A proprietary database
> engine provides immediate response time and actually increases speed
> as the size of the index grows.
We need this algorithm, our engineer says. It starts off with immediate
responses, then gets faster. Wowza! ("A meeting on time travel will be
held last week.")
Nick
From owner-robots Thu Oct 19 06:29:53 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA16844; Thu, 19 Oct 95 06:29:53 -0700
Message-Id: <9510191329.AA12490@ai.iit.nrc.ca>
Date: Thu, 19 Oct 95 09:29:15 EDT
From: Alain Desilets
To: robots@webcrawler.com
Subject: Re: Looking for a spider
Cc: alain@ai.iit.nrc.ca
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Dear Alvaro,
Thanks for responding. I'll let you know if I find something. I'm interested
to know more about your work. Do you have a Web page on it?
Thanks
Alain
From owner-robots Thu Oct 19 06:32:09 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA17037; Thu, 19 Oct 95 06:32:09 -0700
Message-Id: <9510191331.AA12583@ai.iit.nrc.ca>
Date: Thu, 19 Oct 95 09:31:31 EDT
From: Alain Desilets
To: robots@webcrawler.com
Subject: Re: Looking for a spider
Cc: alain@ai.iit.nrc.ca
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Dear Zhang,
Thank you for the info. Unfortunately, I am in the same position as Alvaro
Monge. I need the original HTML files, as opposed to some condensed version
produced by a robot.
Alain
From owner-robots Thu Oct 19 06:39:50 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA17600; Thu, 19 Oct 95 06:39:50 -0700
Message-Id: <9510191339.AA12691@ai.iit.nrc.ca>
Date: Thu, 19 Oct 95 09:39:13 EDT
From: Alain Desilets
To: robots@webcrawler.com
Subject: Sorry!
Cc: alain@ai.iit.nrc.ca
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Sorry about the previous messages. I intended to send them directly to
the people concerned, but they somehow got sent to this list.
- Alain
From owner-robots Thu Oct 19 07:53:29 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA22425; Thu, 19 Oct 95 07:53:29 -0700
From: reinpost@win.tue.nl (Reinier Post)
Message-Id: <199510191453.PAA06141@wswiop11.win.tue.nl>
Subject: Re: Unfriendly robot at 205.177.10.2
To: robots@webcrawler.com
Date: Thu, 19 Oct 1995 15:53:11 +0100 (MET)
Cc: tbray@opentext.com
In-Reply-To: from "Nick Arnett" at Oct 18, 95 09:30:32 am
X-Mailer: ELM [version 2.4 PL23]
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Content-Length: 989
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
>The correct address of the owner of the robot is 205.252.60.50, which won't
>resolve. Tight security, apparently. Ironically.
Well, on our site (www.win.tue.nl), it's causing no problems at all:
% grep '205\.252' /usr/www/logs/cern_access.log
205.252.60.50 - - [13/Oct/1995:12:30:13 +0100] "GET / HTTP/1.0" 302 381
205.252.60.50 - - [13/Oct/1995:20:58:55 +0100] "GET / HTTP/1.0" 302 381
% wc /usr/www/logs/cern_access.log
206422 2062250 22193056 /usr/www/logs/cern_access.log
That is, out of the last 206,422 requests, 2 were from this site.
Lycos wants to index as many documents on a site as it can find. This
robot has only made two requests, and it didn't even retrieve our home page
(/ is redirected to /win/, which is the actual home page). Perhaps it doesn't
follow redirections.
>Nick
--
Reinier Post reinpost@win.tue.nl
a.k.a. me
From owner-robots Thu Oct 19 07:57:03 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA22755; Thu, 19 Oct 95 07:57:03 -0700
From: reinpost@win.tue.nl (Reinier Post)
Message-Id: <199510191456.PAA06159@wswiop11.win.tue.nl>
Subject: Re: Looking for a spider
To: robots@webcrawler.com
Date: Thu, 19 Oct 1995 15:56:40 +0100 (MET)
In-Reply-To: <9510182200.AA11857@dino> from "Alvaro Monge" at Oct 18, 95 03:00:01 pm
X-Mailer: ELM [version 2.4 PL23]
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Content-Length: 1038
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
You (Alvaro Monge) write:
>Unfortunately, I cannot use most robots that I know of because they
>DO NOT SAVE the entire document, or its hierarchical structure.
>
>Lycos for example:
>
>> The Lycos exploration robot locates new and changed documents and
>> builds abstracts, which consist of title, headings, subheadings,
>> 100 most significant words and the first 20 lines of the document.
>
>For my research, this is not that useful. I need the entire document,
>as it appears at the source -- not as saved by some robot, because I
>want to follow the links within the document.
Lycos follows the links of documents; that's how robots work.
The summaries are built for indexing purposes. You can't save
the full text of all documents because of the disk space requirements
(perhaps OpenText can?) and because of legal considerations.
>--Alvaro
--
Reinier Post reinpost@win.tue.nl
a.k.a. me
From owner-robots Thu Oct 19 08:44:31 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA26046; Thu, 19 Oct 95 08:44:31 -0700
X-Sender: narnett@hawaii.verity.com
Message-Id:
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Thu, 19 Oct 1995 08:41:05 -0700
To: robots@webcrawler.com
From: narnett@Verity.COM (Nick Arnett)
Subject: Re: Unfriendly robot at 205.252.60.50
Cc: tbray@opentext.com
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
At 7:53 AM 10/19/95, Reinier Post wrote:
>>The correct address of the owner of the robot is 205.252.60.50, which won't
>>resolve. Tight security, apparently. Ironically.
>
>Well, on our site (www.win.tue.nl), it's causing no problems at all
In my e-mail to NlightN, I said that I assume it was unintentional. I
can't imagine that anyone would purposely request documents at the rate
they were hitting us. Of course, there's no way to know if that was the
robot or a human-controlled browser hitting your site from the same host...
Thanks!
Nick
From owner-robots Thu Oct 19 09:10:34 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA28065; Thu, 19 Oct 95 09:10:34 -0700
Message-Id: <9510191609.AA14728@ai.iit.nrc.ca>
Date: Thu, 19 Oct 95 12:09:49 EDT
From: Alain Desilets
To: robots@webcrawler.com
Subject: Re: Looking for a spider
Cc: alain@ai.iit.nrc.ca
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
In response to Alvaro's message,
> >
> >> The Lycos exploration robot locates new and changed documents and
> >> builds abstracts, which consist of title, headings, subheadings,
> >> 100 most significant words and the first 20 lines of the document.
> >
> >For my research, this is not that useful. I need the entire document,
> >as it appears at the source -- not as saved by some robot, because I
> >want to follow the links within the document.
Reinier Post writes:
>
> Lycos follows the links of documents; that's how robots work.
> The summaries are built for indexing purposes. You can't save
> the full text of all documents because of the disk space requirements
> (perhaps OpenText can?) and because of legal considerations.
>
Like Alvaro, I find that no robot-generated index of the whole web is
sufficient for my purpose. My group is working on developing new tools that
can process the web and "summarise" it in some novel way. For example:
- New and hopefully better keyword extraction algorithms
- Automatic generation of hierarchical indexes a la Yahoo
- Merging of small indexes into bigger ones
- etc.
In order to test these new approaches, we need the full HTML, not an index
of it.
- Alain
From owner-robots Thu Oct 19 09:18:30 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA28618; Thu, 19 Oct 95 09:18:30 -0700
Date: Fri, 20 Oct 1995 02:18:16 +1000
From: Murray Bent
Message-Id: <199510191618.CAA08466@wittgenstein.icis.qut.edu.au>
To: robots@webcrawler.com
Subject: re: Lycos unfriendly robot
Content-Length: 439
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
According to Reinier Post:
>Lycos wants to index as many documents on a site as it can find. This
>robot has only made two requests, and it didn't even retrieve our home page
>(/ is redirected to /win/, which is the actual home page). Perhaps it doesn't
>follow redirections.
>>Nick
>--
>Reinier Post reinpost@win.tue.nl
That may be fine if you have shares in Lycos or something. Do you?
mj
From owner-robots Thu Oct 19 11:01:14 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA05721; Thu, 19 Oct 95 11:01:14 -0700
From: reinpost@win.tue.nl (Reinier Post)
Message-Id: <199510191801.TAA19705@wsinis02.win.tue.nl>
Subject: Re: Lycos unfriendly robot
To: robots@webcrawler.com
Date: Thu, 19 Oct 1995 19:01:00 +0100 (MET)
In-Reply-To: <199510191618.CAA08466@wittgenstein.icis.qut.edu.au> from "Murray Bent" at Oct 20, 95 02:18:16 am
X-Mailer: ELM [version 2.4 PL23]
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Content-Length: 918
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
You (Murray Bent) write:
>
>
>According to Reinier Post:
>>Lycos wants to index as many documents on a site as it can find. This
>>robot has only made two requests, and it didn't even retrieve our home page
>>(/ is redirected to /win/, which is the actual home page). Perhaps it doesn't
>>follow redirections.
>
>>>Nick
>
>>--
>>Reinier Post reinpost@win.tue.nl
>
>That may be fine if you have shares in Lycos or something. Do you?
I don't follow your logic. *What* is fine if I have shares in Lycos?
The fact that this visit was made by something that doesn't follow
redirections, and therefore is unlikely to be a Lycos robot?
>mj
For some reason you seem to bear a grudge against Lycos. If my posting
did anything to tear open any old wounds, I apologise.
--
Reinier Post reinpost@win.tue.nl
a.k.a. me
From owner-robots Sat Oct 21 07:17:11 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA06960; Sat, 21 Oct 95 07:17:11 -0700
Date: Sat, 21 Oct 1995 07:17:03 -0700 (PDT)
From: Andrew Leonard
Subject: Re: Unfriendly robot at 205.252.60.50
To: robots@webcrawler.com
Cc: robots@webcrawler.com, tbray@opentext.com
In-Reply-To:
Message-Id:
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
I contacted NlightN, and their CEO said that their most junior hire was
testing a new robot. They were apparently unaware of the robot exclusion
protocol but plan to mend their ways.
Andrew Leonard
Wired Magazine
On Thu, 19 Oct 1995, Nick Arnett wrote:
> At 7:53 AM 10/19/95, Reinier Post wrote:
> >>The correct address of the owner of the robot is 205.252.60.50, which won't
> >>resolve. Tight security, apparently. Ironically.
> >
> >Well, on our site (www.win.tue.nl), it's causing no problems at all
>
> In my e-mail to NlightN, I said that I assume it was unintentional. I
> can't imagine that anyone would purposely request documents at the rate
> they were hitting us. Of course, there's no way to know if that was the
> robot or a human-controlled browser hitting your site from the same host...
>
> Thanks!
>
> Nick
>
>
>
From owner-robots Sat Oct 21 11:21:18 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA23944; Sat, 21 Oct 95 11:21:18 -0700
X-Sender: narnett@hawaii.verity.com
Message-Id:
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Sat, 21 Oct 1995 10:35:40 -0700
To: robots@webcrawler.com
From: narnett@Verity.COM (Nick Arnett)
Subject: Re: Unfriendly robot at 205.252.60.50
Cc: robots@webcrawler.com, tbray@opentext.com
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
At 7:17 AM 10/21/95, Andrew Leonard wrote:
>I contacted NlightN, and their CEO said that their most junior hire was
>testing a new robot. They were apparently unaware of the robot exclusion
>protocol but plan to mend their ways.
I haven't heard from them, but our server/spider product manager received a
telephone apology.
I can't resist pointing out the irony of a search services company that
apparently failed to find some critical information about robots on the
Internet. On the other hand, we've probably done equally silly things.
I hope they'll add a user-agent field, at least.
Nick
From owner-robots Sat Oct 21 17:47:17 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA20668; Sat, 21 Oct 95 17:47:17 -0700
Message-Id:
From: kimba@snog.it.com.au (Kim Davies)
Subject: Re: Unfriendly robot at 205.252.60.50
To: robots@webcrawler.com
Date: Sun, 22 Oct 1995 08:46:39 +0800 (WST)
In-Reply-To: from "Nick Arnett" at Oct 21, 95 10:35:40 am
X-Mailer: ELM [version 2.4 PL24 PGP2]
Content-Type: text
Content-Length: 554
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Hi,
> >I contacted NlightN, and their CEO said that their most junior hire was
> >testing a new robot. They were apparently unaware of the robot exclusion
> >protocol but plan to mend their ways.
>
> I haven't heard from them, but our server/spider product manager received a
> telephone apology.
Has someone invited them to join this list? If they discussed what they
were doing, it might be better for all concerned.
catchya,
--
Kim Davies | "Belief is the death of intelligence" -Snog
kimba@it.com.au | http://www.it.com.au/~kimba/
From owner-robots Sun Oct 22 13:14:28 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA01215; Sun, 22 Oct 95 13:14:28 -0700
X-Sender: narnett@hawaii.verity.com
Message-Id:
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Sun, 22 Oct 1995 13:13:12 -0700
To: robots@webcrawler.com
From: narnett@Verity.COM (Nick Arnett)
Subject: Re: Unfriendly robot at 205.252.60.50
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
At 5:46 PM 10/21/95, Kim Davies wrote:
>Hi,
>
>> >I contacted NlightN, and their CEO said that their most junior hire was
>> >testing a new robot. They were apparently unaware of the robot exclusion
>> >protocol but plan to mend their ways.
>>
>> I haven't heard from them, but our server/spider product manager received a
>> telephone apology.
>
>Has someone invited them to join this list? If they discussed what they
>were doing it might be better for all concerned..
I directed them to the robots pages on www.webcrawler.com, which should
lead them to this list.
What am I thinking -- the server that they were hammering with their robot
includes recent messages from this list (at
http://asearch.mccmedia.com/robots/). I suppose that means they might have
looked...
Nick
From owner-robots Mon Oct 23 07:50:14 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA03859; Mon, 23 Oct 95 07:50:14 -0700
Date: Mon, 23 Oct 95 10:50:03 EDT
From: wulfekuh@cps.msu.edu (Marilyn R Wulfekuhler)
Message-Id: <9510231450.AA10394@pixel.cps.msu.edu>
To: robots@webcrawler.com
Subject: Re: Looking for a spider
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Alain Desilets writes:
> In order to test our methods we need to acquire a large corpus of
> full HTML files from the Web. We plan to use a spider for that task.
>
and Alvaro Monge writes:
> A colleague of mine and I are also doing AI-based research, and we are
> in need of a large corpus. We would like to use anything that is already
> available that keeps the structure of the real WWW and does not take
> anything away, so that we can run realistic experiments with our
> approaches.
>
We are also doing research on AI-based approaches to processing the
web, and toward the goal of having a test bed of the web, we have a
text-only copy of a subset of the web (currently about 650 meg) which
we have been calling "the proving grounds". It is not possible to get
a complete snapshot of the web at any given time, but without images
and audio, we can at least have a large, known subset. It's also to
our collective advantage to all be working from the same subset.
It is our intention to make the proving grounds available to the public,
hopefully within the next two weeks.
We used a spider which was a modified htmlgobble, which takes a URL and
follows all the links, copying all the documents it finds except image,
audio, and video files. The URLs inside the documents have been modified
so that everything points to the local copy, enabling a spider (or human
browser) to traverse the database locally; a sketch of the idea follows.
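Roughly, the rewriting step works like this (a simplified sketch, not our
actual htmlgobble patch; the %local map from URL to mirror path is assumed
to be built up during the crawl):

    #!/usr/bin/perl
    # Sketch only: rewrite href/src attributes to point into a local mirror.
    # %local maps fetched URLs to local paths; here it is just a stub.
    my %local = ('http://example.com/a.html' => 'mirror/example.com/a.html');
    while (my $line = <STDIN>) {
        # swap in the local path whenever we hold a copy of the target URL
        $line =~ s{((?:href|src)=")([^"]+)(")}{
            $1 . (exists $local{$2} ? $local{$2} : $2) . $3
        }gie;
        print $line;
    }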
Before we go public, I have a few questions:
(1) We currently don't copy audio, video, image files and instead
create a file by the same name with a single character identifying
it as video, image, or audio. Would an empty file suffice? Is
there another identification scheme that would be more useful?
(2) We currently copy postscript, but are considering treating them as
we do image files. They take a LOT of space, and are of no utility
for the kind of analysis that we want to do. Would it be more useful
to keep the postscript, or treat it as we do images (which would then
allow us to use the space for a larger web subset)?
I appreciate any feedback and I'll announce to the list when it's ready
for public use.
Marilyn Wulfekuhler
Intelligent Systems Lab, Michigan State University
From owner-robots Mon Oct 23 15:27:34 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA05597; Mon, 23 Oct 95 15:27:34 -0700
X-Sender: narnett@hawaii.verity.com
Message-Id:
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Mon, 23 Oct 1995 15:26:16 -0700
To: Andrew Daviel , robots@webcrawler.com
From: narnett@Verity.COM (Nick Arnett)
Subject: Re: Proposed URLs that robots should search
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
At 1:51 PM 10/23/95, Andrew Daviel wrote:
>With my other hat on (admin@vancouver-webpages.com), I'm
>trying to build a database of URLs and other information for businesses
>on the Net.
I can't quite contain the urge to say, "Isn't everyone?"
>Some database registration robots (I believe) search submitted URLs for
>keywords, doing some natural language processing to discard modifiers and
>prepositions. However, the trend to graphics-dominated homepages makes
>such efforts of dubious utility.
I wouldn't be so quick to jump to that conclusion. I have seen few, if
any, business sites that don't offer text-only versions of their key pages.
Also, I'm utterly certain that a good relevancy-ranking engine will do a
better job at assigning categories than will an uncontrolled set of people,
especially when those people are out to maximize hits, rather than to
maximize relevancy.
Having said all of that, I'd like to agree that we need some additional
information for robots. Could we start simply by having a standard way to
set forth the name of the site? An icon for the site would be really nice.
It's very frustrating to build a search results list and have no
definitive way of describing the site on which the documents reside! Next,
I'd like to have the means to name groups of documents (Press releases,
product descriptions, as examples of typical business groupings). We guess
at these from directory names, but that's very haphazard. The secondary
naming problem is more difficult because there are many-to-many
relationships involved.
>In the spirit of /robots.txt, I would like to propose a set of files that
>robots would be encouraged to visit:
>
>/robots.htm - an HTML list of links that robots are encouraged to traverse
What does "encouraged" mean? How is it differnet from (not (robots.txt))?
Why HTML?
>/descript.txt - a text file describing what the site (or directory) is
> all about
Agreed.
>/keywords.txt - a text file with comma-delimited keywords relevant to the
> site (or directory)
Disagree greatly. This opens a giant can of worms. Keywords are never
enough, often confusing and difficult to maintain.
>/linecard.txt - for commercial sites, a text file with comma-delimited
> line items (brands) manufactured or stocked
This will drown in details.
>/sitedata.txt - a text file similar to the InterNIC submissions forms,
> with publicly-available site data such as
>
>Organization: organisation name
>Type: commercial/non/profit/educational etc.
>Admin: email of admininstration
>Webmaster: email of Web admininstration
>Postal: postal address
>ZIP: ZIP/postcode
>Country:
>Position: Lat/Long
>etc.
Yes to some of this at least. But there's an assumption that there's a
one-to-one relationship between the server and these field data. Often,
there isn't and no scheme that fails to deal with that is going to succeed.
I'm ready to adapt one of my prototype robots to parse this data for our
engine, so here's one hand up for "Yes, I'll implement it." I'm just doing
research, but my research does fall in front of our engineers at some
point.
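To show how little work the parsing side would be, here is a minimal
sketch of a reader for the proposed /sitedata.txt format (field names
taken from Andrew's list above; illustrative only, not our engine's code):

    #!/usr/bin/perl
    # sketch: read the proposed /sitedata.txt key/value format into a hash
    my %site;
    open(SD, "sitedata.txt") or die "can't open sitedata.txt: $!";
    while (<SD>) {
        chomp;
        next if /^\s*(#|$)/;                        # skip comments, blanks
        $site{lc $1} = $2 if /^([\w-]+):\s*(.*)$/;  # e.g. "Organization: ..."
    }
    close SD;
    print "Organization: $site{organization}\n" if exists $site{organization};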
By the way, today, Verity announced that NetManage and Purveyor have signed
up to use our search engine. They join Netscape, Quarterdeck and a few
others.
Nick
P.S. I've replied to the new list server address at webcrawler.com, rather
than the Nexor address.
From owner-robots Mon Oct 23 16:31:22 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA10352; Mon, 23 Oct 95 16:31:22 -0700
Message-Id: <9510232331.AA10338@webcrawler.com>
To: robots
Cc: Andrew Daviel
Subject: Re: Proposed URLs that robots should search
In-Reply-To: Your message of "Mon, 23 Oct 1995 15:26:16 PDT."
Date: Mon, 23 Oct 1995 16:31:17 -0700
From: Martijn Koster
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
In message , Nick Arnett writes:
> Also, I'm utterly certain that a good relevancy-ranking engine will do a
> better job at assigning categories than will an uncontrolled set of people,
> especially when those people are out to maximize hits, rather than to
> maximize relevancy.
Yeah, isn't that fun... :-/ Maybe we should have a shared spammer
blacklist :-)
> [want the name of the site]
> [groups of documents]
> >In the spirit of /robots.txt, I would like to propose a set of files that
> >robots would be encouraged to visit:
> >
> >/robots.htm - an HTML list of links that robots are encouraged to traverse
>
> What does "encouraged" mean? How is it differnet from (not (robots.txt))?
Because a robot may not want to traverse the whole site, and would
prefer to get "sensible" pages.
> Why HTML?
Yeah, bad news.
> [/keywords]
> Disagree greatly. This opens a giant can of worms. Keywords are never
> enough, often confusing and difficult to maintain.
Hmmm... yes, but it's not necessarily worse than straight HTML text,
which is the alternative.
> >/linecard.txt - for commercial sites, a text file with comma-delimited
> > line items (brands) manufactured or stocked
>
> This will drown in details.
Yup.
> >/sitedata.txt - a text file similar to the InterNIC submissions forms,
> > with publicly-available site data such as
> >
> Yes to some of this at least. But there's an assumption that there's a
> one-to-one relationship between the server and these field data. Often,
> there isn't and no scheme that fails to deal with that is going to succeed.
Well, I hate to repeat myself, but ALIWEB's /site.idx will give you all of
the above (OK, not the icon, but you could add that). It doesn't seem
to scale too well to large sites which want to describe every single page
or resource on their server, but that's not the goal here...
Note also that nobody is stopping you from pulling just the URLs from a
site.idx and doing your standard robot summarising on that...
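For those who haven't seen one, a site.idx is a set of IAFA-style
templates; a single entry looks roughly like this (field values made up,
and quoted from memory rather than from the spec):

    Template-Type: DOCUMENT
    Title: BloggCo Press Releases
    URI: /press/
    Description: Recent press releases and product announcements
    Keywords: press, announcements, BloggCo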
-- Martijn
__________
Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html
From owner-robots Mon Oct 23 17:06:25 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA12787; Mon, 23 Oct 95 17:06:25 -0700
Message-Id:
From: kimba@snog.it.com.au (Kim Davies)
Subject: Re: Proposed URLs that robots should search
To: andrew@andrew.triumf.ca (Andrew Daviel)
Date: Tue, 24 Oct 1995 08:03:58 +0800 (WST)
Cc: robots@webcrawler.com
In-Reply-To: from "Andrew Daviel" at Oct 23, 95 09:51:17 pm
X-Mailer: ELM [version 2.4 PL24 PGP2]
Content-Type: text
Content-Length: 1378
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Hi,
> /robots.htm - an HTML list of links that robots are encouraged to traverse
A plain text file would be much better suited, similar to the
existing robots.txt - reading in plain text and adding the URLs to the
stack to be processed is sure to be more effective than handing HTML
to the robot's reasoning engine to parse; see the sketch below.
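Something like this would do (a sketch, assuming a hypothetical
/robots.lst with one URL per line and # for comments):

    #!/usr/bin/perl
    # sketch: slurp a plain-text URL list onto the crawl stack
    my @stack;
    open(LST, "robots.lst") or die "can't open robots.lst: $!";
    while (<LST>) {
        chomp;
        next if /^\s*(#|$)/;          # skip comments and blank lines
        push @stack, $_ if m{^http://};
    }
    close LST;
    print "queued ", scalar(@stack), " URLs\n";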
> [snip]
>
> Organization: organisation name
> Type: commercial/non/profit/educational etc.
> Admin: email of admininstration
> Webmaster: email of Web admininstration
> Postal: postal address
> ZIP: ZIP/postcode
> Country:
> Position: Lat/Long
> etc.
How are you going to get a system administrator to implement all these
files? How many system administrators do you know who even know about
robots.txt? Assuming you want a large chunk of sites to adopt these
details, I'd propose it be implemented in the HTTP protocol somehow.
An "ADMIN" request, for example, could request the above details from
the site, just as "/admin" on IRC grabs the admin details of a server
from the lines in the configuration; a hypothetical exchange is
sketched below.
If a space was made in a server's configuration or makefile for these
details, web administrators would be far more likely to implement them.
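Purely hypothetically (no server implements anything like this today),
the exchange might look like:

    ADMIN / HTTP/1.0

    HTTP/1.0 200 OK
    Content-Type: text/plain

    Organization: BloggCo Pty Ltd
    Type: commercial
    Admin: admin@bloggco.example
    Webmaster: www@bloggco.example
    Country: AU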
catchya,
--
Kim Davies | "Belief is the death of intelligence" -Snog
kimba@it.com.au | http://www.it.com.au/~kimba/
From owner-robots Tue Oct 24 02:48:24 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA14601; Tue, 24 Oct 95 02:48:24 -0700
Date: Tue, 24 Oct 1995 02:48:19 -0700 (PDT)
From: Andrew Daviel
To: robots@webcrawler.com
Subject: Re: Proposed URLs that robots should search
In-Reply-To:
Message-Id:
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Let's see if I can reply to everyone without getting in a tangle ... :)=
>>I'm trying to build a database of URLs for business...
>I can't quite contain the urge to say, "Isn't everyone?"
Know any good ones? Nothing jumped out at me from CUSI, or Submit-It, etc.
>I have seen few .. business sites that don't offer text-only versions
I seem to keep seeing sites that say "Works best with Netscape 1.2 - get
it!"
>Could we start .. standard way to set forth the name of the site?
Having it in the <TITLE> of the document root is quite common, but you get
"BloggCo Home Page", "Welcome to BloggCo", and sometimes
"Welcome to B L O G G C O". I've tried looking for non-dictionary words
with some success.
>>/linecard.txt - for commercial sites, a text file with comma-delimited
>> line items (brands) manufactured or stocked
>This will drown in details.
>Yup.
This was a suggestion from a professional buyer. Sure, collecting these
for the whole world would get out of control, but with a small enough
scope it might be manageable. The buyers look up brand names in a huge
12-volume book to find distributors or manufacturers. Finding who stocks
Motorcraft in Tipperary can't produce that many records.
>Well, I hate to repeat myself, but ALIWEB's /site.idx will give you ..
Didn't know about it. Looks like what I was thinking of. I see it has
keywords ( >..Disagree greatly. This opens a giant can ... )
> >/robots.htm - an HTML list of links
> Why HTML?
A simplistic idea. I figured that if existing robots are written to
traverse HTML, then giving them an HTML file to start from would be
fairly easy.
Re. site.idx, is this a fairly open-ended list of fields? I had in mind some
fields relevant to larger businesses, like Sales-Email, Info-Email,
Tech-Email, Sales-FaxBack, etc. etc. for voice, fax, email where some places
may have separate hotlines for hardware, software, licenses, etc. How to
handle this for big concerns that have one website and hundreds of regional
offices is another problem.
I find the Lat/Long format in IAFA a bit strange; I use the "standard"
navigational format from navigation books, GPS and Loran, etc., e.g. 49D14.7N
123D13.6W, except that as there isn't a degree symbol in ASCII I've used "D",
which makes it similar to the NMEA0182 format. The current NMEA0183 standard
for navigation equipment would use something like:
$LCGLL,4001.74,N,07409.43,W for 40 degrees 1.74 minutes North, 74 degrees
9.43 minutes West. Anyway, it's just bits and easy enough to convert.
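For instance, converting my "D" notation to signed decimal degrees is
only a few lines of perl (a sketch; the notation is just my own
convention, not any standard):

    # sketch: "49D14.7N" -> signed decimal degrees (prints 49.245)
    sub dm_to_decimal {
        my ($s) = @_;
        my ($deg, $min, $hemi) = $s =~ /^(\d+)D([\d.]+)([NSEW])$/
            or return undef;
        my $dec = $deg + $min / 60;
        $dec = -$dec if $hemi eq 'S' or $hemi eq 'W';
        return $dec;
    }
    print dm_to_decimal("49D14.7N"), "\n";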
>How are you going to get a system administrator to implement all these
>files?
Well, one might assume that a good many HTML authors and Webmasters read
comp.infosystems.author.html, or whatever it's called. Or one could
just send them all mail ... 50,000 returned mail messages wouldn't make
too much of a dent in my disk ... :)=
>I'd propose it be implemented into the HTTP protocol ..
I'd think it might take a while for everyone to update their
servers - say, at least 2 years...
Andrew Daviel email: advax@triumf.ca
From owner-robots Wed Oct 25 15:49:09 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA05021; Wed, 25 Oct 95 15:49:09 -0700
Date: Thu, 26 Oct 1995 08:48:57 +1000
From: Murray Bent
Message-Id: <199510252248.IAA09980@wittgenstein.icis.qut.edu.au>
To: robots@webcrawler.com
Subject: lycos patents
Content-Length: 134
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
To add insult to injury, Lycos are patenting spiders and robots.
Anyone care to comment on what Lycos Inc. is up to these days?
mj
From owner-robots Wed Oct 25 15:56:03 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA05454; Wed, 25 Oct 95 15:56:03 -0700
Message-Id: <9510252256.AA05447@webcrawler.com>
Content-Type: text/plain
Mime-Version: 1.0 (NeXT Mail 3.3 v118.2)
From: Scott Stephenson
Date: Wed, 25 Oct 95 15:55:18 -0700
To: robots
Subject: Re: lycos patents
References: <199510252248.IAA09980@wittgenstein.icis.qut.edu.au>
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Hi,
What, Lycos is trying to patent spiders and robots? Got any more
information on this?!? How can this be possible? It is certainly
not technology that they developed.
ss
From owner-robots Wed Oct 25 15:58:36 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA05629; Wed, 25 Oct 95 15:58:36 -0700
Message-Id: <9510252258.AA05583@webcrawler.com>
To: robots
Cc: Murray Bent
Subject: Re: lycos patents
In-Reply-To: Your message of "Thu, 26 Oct 1995 08:48:57 +1000."
<199510252248.IAA09980@wittgenstein.icis.qut.edu.au>
Date: Wed, 25 Oct 1995 15:58:13 -0700
From: Martijn Koster
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
In message <199510252248.IAA09980@wittgenstein.icis.qut.edu.au>, Murray Bent writes:
> To add insult to injury, Lycos are patenting spiders and robots.
Can you elaborate? Where did you hear this, where can we find out more?
-- Martijn
__________
Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html
From owner-robots Wed Oct 25 16:09:34 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA06328; Wed, 25 Oct 95 16:09:34 -0700
Date: Wed, 25 Oct 1995 19:08:47 -0400 (EDT)
From: Matthew Gray
X-Sender: mkgray@bokonon
To: robots@webcrawler.com
Subject: Re: lycos patents
In-Reply-To: <199510252248.IAA09980@wittgenstein.icis.qut.edu.au>
Message-Id:
Organization: net.Genesis
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
> To add insult to injury, Lycos are patenting spiders and robots.
I assume he is referring to the comment:
> We have a patent pending on our spider technology, which makes it
> possible for us to both keep up with the exponential growth of the
> Internet, and still find the most popular sites.
which appears in the FAQ at http://lycos-tmp1.psc.edu/reference/faq.html
I hope when they refer to "our spider technology", they are referring to
something genuinely unique. If not, there are a great many cases of
prior art, notably my Wanderer which (while no longer the best) was the
first one around in spring of '93.
I agree that some comment or clarification from Lycos would be good.
Matthew Gray --------------------------------- voice: (617) 577-9800
net.Genesis fax: (617) 577-9850
56 Rogers St. mkgray@netgen.com
Cambridge, MA 02142-1119 ------------- http://www.netgen.com/~mkgray
From owner-robots Wed Oct 25 16:19:27 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA06783; Wed, 25 Oct 95 16:19:27 -0700
Date: Thu, 26 Oct 1995 09:16:39 +1000
From: Murray Bent
Message-Id: <199510252316.JAA10010@wittgenstein.icis.qut.edu.au>
To: robots@webcrawler.com
Subject: re: Lycos patents
Content-Length: 570
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
reference:
> From: "Alison O'Balle" (Alison O'Balle)
> Subject: Catalog of the Internet
> To: Multiple recipients of list
[...]
> A representative from Lycos made a presentation on campus Thursday morning
> in which he said a number of interesting things about the future of the
> internet, cataloging, and other topics.
[Interesting facts and figures deleted]
> They are patenting web spiders and robots. This was glossed over, but the
> lycos guy said the patent process was going well for them so far.
From owner-robots Wed Oct 25 16:22:14 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA06913; Wed, 25 Oct 95 16:22:14 -0700
Message-Id: <9510252322.AA06904@webcrawler.com>
To: fuzzy@cmu.edu
Cc: robots
Subject: Patents?
From: Martijn Koster
Date: Wed, 25 Oct 1995 16:22:18 -0700
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Hi Fuzzy,
I can't see you in the list of subscribers to the robots list,
(to which this is cc'ed) so maybe you missed a message regarding
patents there.
In http://www.lycos.com/reference/faq.html one reads:
> We have a patent pending on our spider technology, which makes it
> possible for us to both keep up with the exponential growth of the
> Internet, and still find the most popular sites.
Can you give any further details, either on the technical nature or
the patent application?
-- Martijn
__________
Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html
From owner-robots Wed Oct 25 16:45:53 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA08081; Wed, 25 Oct 95 16:45:53 -0700
Message-Id:
Date: 25 Oct 1995 16:47:13 -0800
From: "Roger Dearnaley"
Subject: Re: lycos patents
To: robots@webcrawler.com
X-Mailer: Mail*Link SMTP-QM 3.0.2
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
> What, Lycos is trying to patent spiders and robots. Got any more
> information on this?!? How can this be possible, as it is certainly
> not technology that they developed.
If this is so, then some interested parties should let the Patent Office (or
whatever the corresponding US body is called) know this. Particularly given
what a terrible job they have been doing judging software and algorithm
patents recently, it's a bad idea to just assume that the Patent Office will
get it right.
--Roger Dearnaley
From owner-robots Wed Oct 25 19:19:25 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA14858; Wed, 25 Oct 95 19:19:25 -0700
From: reinpost@win.tue.nl (Reinier Post)
Message-Id: <199510260219.DAA02026@wsinis02.win.tue.nl>
Subject: Re: lycos patents
To: robots@webcrawler.com
Date: Thu, 26 Oct 1995 03:19:08 +0100 (MET)
In-Reply-To: from "Matthew Gray" at Oct 25, 95 07:08:47 pm
X-Mailer: ELM [version 2.4 PL23]
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Content-Length: 1094
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Lycos's patents:
>I hope when they refer to "our spider technology", they are referring to
>something genuinely unique. If not there are a great many cases for
>prior art, notably my Wanderer which (while no longer the best) was the
>first one around in spring of '93.
Mmm ... I think I first saw JumpStation in January '93.
http://js.stir.ac.uk/jsbin/js
Simple spiders existed before; I used one in November '92 to fill a proxy
cache and fake a live Internet connection for a demo, but it wasn't used for
indexing purposes.
>I agree that some comment or clarification from Lycos would be good.
The author has been seen to post to this list, before it moved.
I should think the summaries may be patentable; in fact this thought first
occurred to me when I saw his short talk on Lycos at WWW'95 in Darmstadt,
in the workshop on Web indexing. But I haven't heard from Lycos since.
There may be some unusual tricks in running the spiders as well. If XOR-ing
bitmaps can be patented, why can't a bunch of details in spider technology?
--
Reinier Post reinpost@win.tue.nl
From owner-robots Tue Oct 31 06:58:02 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA03475; Tue, 31 Oct 95 06:58:02 -0800
From: davidmsl@anti.tesi.dsi.unimi.it (Davide Musella)
Message-Id: <9510311459.AA13828@anti.tesi.dsi.unimi.it>
Subject: meta tag implementation
To: robots@webcrawler.com (Mailing list su robot)
Date: Tue, 31 Oct 1995 15:59:26 +0100 (MET)
Organization: Dept. of Computer Science, Milan, Italy.
X-Mailer: ELM [version 2.4 PL23alpha2]
Content-Type: text
Content-Length: 772
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Hi to everybody!
I would like to know what you think about a possible implementation of the
META HTTP-EQUIV tag on an HTTP server. I'm working in this direction to build a
complete system to catalogue WWW docs, but I think the biggest problem is
that there isn't any HTTP server that handles this META tag (except maybe the
WN server).
Thanx
Davide
+--------------------------------------------------+
|Davide Musella |
|e-Mail musella@dsi.unimi.it Dept. of |
|Phone number +39.(0)2.4390821 Computer Science |
|Address: Via Montevideo, 25 University of |
| 20144 Milano ITALY Milan, Italy |
|http://www.dsi.unimi.it/Users/Tesi/musella |
+--------------------------------------------------+
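What Davide describes might look roughly like this on the server side: a
filter that promotes META HTTP-EQUIV tags to real HTTP headers ahead of the
body. A sketch only; the regexp parsing and the filter structure are
illustrative, not any existing server's code:

# Pull META HTTP-EQUIV tags out of a document and emit them as
# HTTP headers before the body.  Regexp parsing is for
# illustration only; a real server would use a proper parser.
sub meta_headers {
    my ($html) = @_;
    my @headers;
    while ($html =~ /<META\s+HTTP-EQUIV="([^"]+)"\s+CONTENT="([^"]+)"/gi) {
        push @headers, "$1: $2";
    }
    return @headers;
}

local $/;                        # slurp the whole document
my $doc = <>;
print "$_\r\n" for meta_headers($doc);
print "\r\n", $doc;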
From owner-robots Thu Nov 2 09:30:07 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA15340; Thu, 2 Nov 95 09:30:07 -0800
Message-Id:
Date: Thu, 2 Nov 1995 12:28:47 -0500 (EST)
From: "Jeffrey C. Chen"
To: robots@webcrawler.com (Mailing list su robot)
Subject: Re: meta tag implementation
Cc:
In-Reply-To: <9510311459.AA13828@anti.tesi.dsi.unimi.it>
References: <9510311459.AA13828@anti.tesi.dsi.unimi.it>
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Hi everybody!
I am a MS student at CMU. I am working on a software tool for
collecting full system traces on the Alpha. The tool will also gather
statistics by using the on-chip hardware event counters. I am
interested in using a web server and a client as my test workload. It
would be interesting to identify performance bottlenecks in a web server
as it runs over a period of time servicing requests. Does anyone have a
simple robot that I can use to exercise a web server?
Thanks,
Jeff
From owner-robots Thu Nov 2 10:40:02 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA20410; Thu, 2 Nov 95 10:40:02 -0800
From: Jaakko Hyvatti
Message-Id: <199511021835.UAA17200@krisse.www.fi>
Subject: Simple load robot
To: robots@webcrawler.com
Date: Thu, 2 Nov 1995 20:35:19 +0200 (EET)
In-Reply-To: from "Jeffrey C. Chen" at Nov 2, 95 12:28:47 pm
X-Mailer: ELM [version 2.4 PL22]
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Content-Length: 412
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
> Does anyone have a simple robot that I can use to exercise a web
> server?
Would this do the job, maybe run multiple times in parallel?
(Please replace the URLs.)
#!/bin/sh
# Fetch a fixed list of pages forever; run several copies in
# parallel for more load.
while true
do
  for i in \
    http://www.fi/ \
    http://www.fi/search.html \
    http://www.fi/index/ \
    http://www.fi/~jaakko/ \
    http://www.fi/sss/ \
    http://www.fi/www/ \
    http://www.fi/links.html
  do
    lynx -source $i > /dev/null    # fetch and discard the page
  done
done
From owner-robots Mon Nov 6 22:44:28 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA15194; Mon, 6 Nov 95 22:44:28 -0800
Date: Tue, 7 Nov 1995 00:43:47 -0600
Message-Id: <9511070643.AA120822@nic.smsu.edu>
X-Sender: kdf274s@nic.smsu.edu
X-Mailer: Windows Eudora Light Version 1.5.2
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: robots@webcrawler.com
From: Keith Fischer
Subject: Preliminary robot.faq (Please Send Questions or Comments)
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Archive-name: robot.faq
Posting-Frequency: variable
Last-modified: Nov. 6, 1995
This article is a description and primer for World Wide Web robots and spiders.
The following topics are addressed:
1) DEFINING ROBOTS AND SPIDERS
1.1) What is a ROBOT?
1.2) What is a SPIDER?
1.3) What is a search engine?
1.4) How many ROBOTS are there?
1.5) What can be achieved by using ROBOTS?
1.6) What harm can a ROBOT do?
2) THE THEORY BEHIND A ROBOT
2.1) Who can write one?
2.2) How is one written?
2.3) What is the Proposed Standard for Robot Exclusion?
2.4) What are the potential problems?
2.5) How do I use proper Etiquette?
3) THE REALITY OF THE WEB
3.1) Can I visit the entire web?
1) DEFINING ROBOTS AND SPIDERS
1.1) What is a ROBOT?
A Robot is a program that traverses the World Wide Web, gathering some
sort of information from each site it visits. This journey is accomplished
by visiting a web page and then recursively visiting all or some of its
linked pages.
1.2) What is a SPIDER?
Spiders are synonymous with Robots, as are Wanderers. These names,
however, have some misleading implications. For instance, many people think
that a spider or wanderer leaves the home site to work its magic, when in
reality it never leaves. Rather, the spider just acts as a sophisticated web
browser, automatically retrieving documents and/or images until it is told
to stop. I prefer the term Robot and will continue using it throughout this
document.
1.3) What is a search engine?
A search engine is not a robot, although some search engines rely heavily on
robots. A search engine is nothing more than a glorified index. It searches
the index, which resides on the host's computer, and returns the result. A
common misconception is that a search engine like Lycos or Yahoo actively
searches the web upon request. This is not true; all activity by the robot
is done ahead of time.
1.4) How many ROBOTS are there?
There are about 30 in existence. Martijn Koster maintains a list at:
http://info.webcrawler.com/mak/projects/robots/active.html
1.5) What can be achieved by using ROBOTS?
The possibilities are endless. Once you visit a page, you have free run of
the HTML. You can retrieve files or the HTML itself. Most robots retrieve
pieces of the HTML document. This is then used to build an index, which is
later used by a search engine.
1.6) What harm can a ROBOT do?
The robot can do no harm per se, but it can anger a lot of people. If your
robot acts irresponsibly it can fall into a black hole (a link that
dynamically makes new links), or worse, it can get stuck in a loop. Both of
these actions are certain to wreak havoc on a server. The goal in web
traversal is to never be on one server for too long.
The solution to the problem of bad HTML, or rather your robot's handling of
bad HTML, is to stay online. Simply put, never leave your robot unattended.
2) THE THEORY BEHIND A ROBOT
2.1) Who can write one?
Anyone can write a robot provided that they have web access. But, a word to
the wise, tell your system administrators because they WILL feel the system
drain and they WILL hear many complaints concerning your activities.
But, just because the possibility exists doesn't mean you should take on
this task half-cocked. Before even thinking about coding a robot: do your
research, have an intended goal, and read the following:
The Proposed Standard for Robot Exclusion located at:
http://info.webcrawler.com/mak/projects/robots/norobots.html
The Guidelines for Robot Writers located at:
http://info.webcrawler.com/mak/projects/robots/guidelines.html
Ethical Web Agents, located at:
http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Agents/eichmann.ethical/eichmann.html
2.2) How is one written?
A Robot is nothing more than an executable program. It can be in the form
of a script or a binary file. It makes a connection to a web server and
requests a document be sent, much the same way a web browser works. The
difference is in the automation provided by the robot.
2.3) What is the Proposed Standard for Robot Exclusion?
Martijn Koster explains the reason for a robot exclusion standard with the
following: "In 1993 and 1994 there have been occasions where robots have
visited WWW servers where they weren't welcome for various reasons.
Sometimes these reasons were robot specific, e.g. certain robots swamped
servers with rapid-fire requests, or retrieved the same files repeatedly. In
other situations robots traversed parts of WWW servers that weren't
suitable, e.g. very deep virtual trees, duplicated information, temporary
information, or cgi-scripts with side-effects (such as voting)."
The form the robot exclusion standard takes is given in more detail at:
The Proposed Standard for Robot Exclusion located at:
http://info.webcrawler.com/mak/projects/robots/norobots.html
2.4) What are the potential problems?
The potential problems can't all be listed; the list would be far too big and
unpredictable. The very nature of the World Wide Web is diversity, and this
very diversity makes robot writing both important and increasingly
difficult. There is no one right HTML; documents can be written in many ways
and in many formats. My suggestion is to get the spec sheet for HTML and
practice, practice, practice, making your robot robust.
2.5) How do I use proper Etiquette?
Etiquette is a very touchy subject. Many people stand in opposition to your
newly written robot. They don't like the idea that their server will be
overrun with seemingly pointless requests. The solution is simple: first
give them the results. Or rather, put up for public consumption the results
of your searches. This is the concept of giving back to the community that
provided for you. Not to mention, if a person can use your results, the
robot's requests may seem to have more merit.
Another form of etiquette is slow requests. You've heard the term rapid
fire: quick requests (a request every second or so) that, simply put, bring
a server to its figurative knees. The solution is to limit your requests to
any given server to one every minute (some say one every five minutes); a
sketch follows at the end of this section.
More information about etiquette is located at:
The Guidelines for Robot Writers located at:
http://info.webcrawler.com/mak/projects/robots/guidelines.html
Ethical Web Agents located at:
http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Agents/eichmann.ethical/eichmann.html
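A minimal sketch of the rate limiting described above, in Perl. This assumes
a single-process robot; the delay constant and the bookkeeping hash are
illustrative, not from any standard:

my %last_hit;      # host => time of our last request there
my $DELAY = 60;    # seconds to wait between requests to one host

sub polite_wait {
    my ($host) = @_;
    if (defined $last_hit{$host}) {
        my $elapsed = time - $last_hit{$host};
        sleep($DELAY - $elapsed) if $elapsed < $DELAY;
    }
    $last_hit{$host} = time;
}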
3) THE REALITY OF THE WEB
3.1) Can I visit the entire web?
No. So don't try. Set your goals at reasonable levels.
______________________________________________________________
I disclaim everything. The contents of this article might be totally
inaccurate, inappropriate, misguided, or otherwise perverse - except for my
name (you can probably trust me on that).
Copyright (c) 1995 by Keith D. Fischer, all rights reserved.
This FAQ may be posted to any USENET newsgroup, on-line service, or BBS as
long as it is posted in its entirety and includes this copyright statement.
This FAQ may not be distributed for financial gain.
This FAQ may not be included in commercial collections or compilations
without express permission from the author.
____________________________________________________________
Keith D. Fischer - kfischer@mail.win.org or kfischer@science.smsu.edu
Keith D. Fischer
kfischer@mail.win.org
kdf274s@nic.smsu.edu
"Misery loves company" By Anonymous
"Today is a good day to die." By Crazy Horse
"To be or not to be ..." Hamlet -- William Shakespeare
From owner-robots Tue Nov 7 02:37:01 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA27042; Tue, 7 Nov 95 02:37:01 -0800
Date: Tue, 7 Nov 95 10:32:55 GMT
Message-Id: <9511071032.AA09660@raphael.doc.aca.mmu.ac.uk>
X-Sender: steven@raphael.doc.aca.mmu.ac.uk
X-Mailer: Windows Eudora Pro Version 2.1.2
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: robots@webcrawler.com
From: Steve Nisbet
Subject: Re: meta tag implementation
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
At 12:28 PM 11/2/95 -0500, you wrote:
>Hi everybody!
>
>I am a MS student at CMU. I am working on a software tool for
>collecting full system traces on the Alpha. The tool will also gather
>statistics by using the on-chip hardware event counters. I am
>interested in using a web server and a client as my test workload. It
>would be interesting to identify performance bottlenecks in a web server
>as it runs over a period of time servicing requests. Does anyone have a
>simple robot that I can use to exercise a web server?
>
>Thanks,
>Jeff
>
>
Hi there Jeff, I know this sounds cheeky, but if you get any useful replies
for robots that have nothing to do with Perl, could you let me know? I tried
asking the same question you asked, but got no replies.
From owner-robots Tue Nov 7 04:05:00 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA02122; Tue, 7 Nov 95 04:05:00 -0800
From: davidmsl@anti.tesi.dsi.unimi.it (Davide Musella)
Message-Id: <9511071205.AA13152@anti.tesi.dsi.unimi.it>
Subject: Re: meta tag implementation
To: robots@webcrawler.com
Date: Tue, 7 Nov 1995 13:05:21 +0100 (MET)
In-Reply-To: <9511071032.AA09660@raphael.doc.aca.mmu.ac.uk> from "Steve Nisbet" at Nov 7, 95 10:32:55 am
Organization: Dept. of Computer Science, Milan, Italy.
X-Mailer: ELM [version 2.4 PL23alpha2]
Content-Type: text
Content-Length: 251
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
> Hi there Jeff, I know this sounds cheeky, but if you get any useful replies
> for robots that have nothing to do with Perl, could you let me know? I tried
> asking the same question you asked, but got no replies.
No replies until now....sigh!!!
Davide
From owner-robots Tue Nov 7 06:17:49 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA08070; Tue, 7 Nov 95 06:17:49 -0800
From: reinpost@win.tue.nl (Reinier Post)
Message-Id: <199511071417.OAA06656@wsinis11.win.tue.nl>
Subject: Re: meta tag implementation
To: robots@webcrawler.com
Date: Tue, 7 Nov 1995 15:17:26 +0100 (MET)
In-Reply-To: <9511071205.AA13152@anti.tesi.dsi.unimi.it> from "Davide Musella" at Nov 7, 95 01:05:21 pm
X-Mailer: ELM [version 2.4 PL23]
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 8bit
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
You (Davide Musella) write:
>
>> Hi there Jeff, I know this sounds cheeky, but if you get any useful replies
>> for robots that have nothing to do with Perl, could you let me know? I tried
>> asking the same question you asked, but got no replies.
>
>No replies until now....sigh!!!
You might use Lynx (2.4.FM); it has a -traverse switch now.
Experimental, and I don't think it supports the RES (Robot Exclusion
Standard) yet. We have a simple robot written in C, but it doesn't
follow the RES either.
What's your reason to stay away from Perl?
>Davide
--
Reinier Post reinpost@win.tue.nl
From owner-robots Tue Nov 7 06:54:36 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA09686; Tue, 7 Nov 95 06:54:36 -0800
Date: Tue, 7 Nov 95 14:41:39 GMT
Message-Id: <9511071441.AA11827@raphael.doc.aca.mmu.ac.uk>
X-Sender: steven@raphael.doc.aca.mmu.ac.uk
X-Mailer: Windows Eudora Pro Version 2.1.2
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: robots@webcrawler.com
From: Steve Nisbet
Subject: Re: meta tag implementation
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Hi Davide,
thanks very much for the info. I stay away from Perl here because it was
badly set up and I have to reinstall it. So it's more of a grudge :)
Other than that I think it's a good thing. I will do as you suggest. All the
best in your endeavours.
From owner-robots Tue Nov 7 07:11:12 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA10944; Tue, 7 Nov 95 07:11:12 -0800
Message-Id:
Date: Tue, 7 Nov 95 07:11 PST
X-Sender: a07893@giant.mindlink.net
X-Mailer: Windows Eudora Pro Version 2.1.2
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: robots@webcrawler.com
From: Tim Bray
Subject: Re: Preliminary robot.faq (Please Send Questions or Comments)
Cc: robots@webcrawler.com
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
>1.1) What is a ROBOT?
>
> A Robot is a program that traverses the World Wide Web, gathering some
>sort of information from each site it visits. This journey is accomplished
>by visiting a web page and then recursively visiting all or some of it's
>linked pages.
True but misleading; there are much better strategies for covering
the web than this kind of direct recursion.
Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com)
From owner-robots Wed Nov 8 01:30:52 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA21486; Wed, 8 Nov 95 01:30:52 -0800
Date: Wed, 8 Nov 1995 03:30:45 -0600
Message-Id: <9511080930.AA35454@nic.smsu.edu>
X-Sender: kdf274s@nic.smsu.edu
X-Mailer: Windows Eudora Light Version 1.5.2
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: robots@webcrawler.com
From: Keith Fischer
Subject: Re: Preliminary robot.faq (Please Send Questions or Comments)
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
>>1.1) What is a ROBOT?
>>
>> A Robot is a program that traverses the World Wide Web, gathering some
>>sort of information from each site it visits. This journey is accomplished
>>by visiting a web page and then recursively visiting all or some of it's
>>linked pages.
>
>True but misleading; there are much better strategies for covering
>the web than this kind of direct recursion.
>
>
>Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com)
1.1) What is a ROBOT?
A Robot is a program that traverses the World Wide Web, gathering some
sort of information from each site it visits. This journey is accomplished
by visiting a web page and then visiting some or all of its linked pages.
The method one follows, whether it's recursive or some sort of fuzzy logic,
determines the effectiveness of the search.
How is the above? If you like, this will be the new 1.1. Also, could you
please elaborate on better strategies? (I'm assuming you are talking about
the fuzzy logic that Yahoo and Lycos use.)
Keith
kfischer@mail.win.org
kdf274s@nic.smsu.edu
Keith D. Fischer
kfischer@mail.win.org
kdf274s@nic.smsu.edu
"Misery loves company" By Anonymous
"Today is a good day to die." By Crazy Horse
"To be or not to be ..." Hamlet -- William Shakespeare
From owner-robots Wed Nov 8 05:45:00 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA03365; Wed, 8 Nov 95 05:45:00 -0800
From: reinpost@win.tue.nl (Reinier Post)
Message-Id: <199511081344.NAA17571@wsinis02.win.tue.nl>
Subject: Re: Preliminary robot.faq (Please Send Questions or Comments)
To: robots@webcrawler.com
Date: Wed, 8 Nov 1995 14:44:43 +0100 (MET)
In-Reply-To: <9511080930.AA35454@nic.smsu.edu> from "Keith Fischer" at Nov 8, 95 03:30:45 am
X-Mailer: ELM [version 2.4 PL23]
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 8bit
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
You (Keith Fischer) write:
>1.1) What is a ROBOT?
>
> A Robot is a program that traverses the World Wide Web, gathering some
>sort of information from each site it visits. This journey is accomplished
>by visiting a web page and then visiting some or all of its linked pages.
>The method one follows whether it's recursive or some sort of fuzzy logic
>determines the effectivness of the search.
We have a robot which does 'fuzzy' searching, for which your description
is appropriate. But in general, the document collection process (= robot)
and the search process executed in response to a user query (on the resulting
collection) are completely separate. Besides, searching the contents of
document collections is not the only purpose of robots; robots can be used
to check the validity of hyperlinks, for example. Your description is
accurate, as applied to the robot process itself, but it may be confusing.
A minor quibble: robots must use some heuristics in determining which links
to follow. All robots are 'recursive', and most of them cut off the process
in a more or less arbitrary way, which could be called 'fuzzy'. There is no
either/or decision here.
--
Reinier Post reinpost@win.tue.nl
a.k.a. me
From owner-robots Wed Nov 8 08:38:48 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA13242; Wed, 8 Nov 95 08:38:48 -0800
Subject: Re: Preliminary robot.faq (Please Send Questions or Comments)
From: YUWONO BUDI
To: robots@webcrawler.com
Date: Thu, 9 Nov 1995 00:37:33 +0800 (HKT)
In-Reply-To: <9511080930.AA35454@nic.smsu.edu> from "Keith Fischer" at Nov 8, 95 03:30:45 am
X-Mailer: ELM [version 2.4 PL24alpha3]
Content-Type: text
Content-Length: 1603
Message-Id: <95Nov9.003740hkt.19032-3+260@uxmail.ust.hk>
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
>
> >>1.1) What is a ROBOT?
> >>
> >> A Robot is a program that traverses the World Wide Web, gathering some
> >>sort of information from each site it visits. This journey is accomplished
> >>by visiting a web page and then recursively visiting all or some of it's
> >>linked pages.
> >
> >True but misleading; there are much better strategies for covering
> >the web than this kind of direct recursion.
> >
> >
> >Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com)
>
>
> 1.1) What is a ROBOT?
>
> A Robot is a program that traverses the World Wide Web, gathering some
> sort of information from each site it visits. This journey is accomplished
> by visiting a web page and then visiting some or all of its linked pages.
> The method one follows whether it's recursive or some sort of fuzzy logic
> determines the effectivness of the search.
I am not sure I understand what the original comment is getting at,
but it seems to me that the word "recursive" is somewhat overloaded.
To those with a CS background, a "recursive" visit implies a "depth first"
tree traversal. Most robot implementations that I'm aware of use
"breadth first" traversals. Among the reasons is that you would want
to be able to limit the depth your robot digs into. Whether
depth limitation is more useful than breadth limitation is another
issue, IMHO. One thing is for sure: stopping the robot after it
reaches a certain depth is much simpler than deciding which links
to follow/ignore.
I don't know what would be the more general term in place of
"recursively"; "sequentially" perhaps?
-Budi.
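For what it's worth, the breadth-first scheme Budi describes fits in a few
lines of Perl. A sketch only: fetch() and extract_links() are placeholders
for a robot's real HTTP and HTML-parsing code, and the start URL and depth
limit are arbitrary:

# Placeholders for the real HTTP GET and link extraction.
sub fetch         { my ($url) = @_; return "" }
sub extract_links { my ($page) = @_; return () }

my $MAX_DEPTH = 3;                          # breadth-first cutoff
my $start_url = "http://www.example.com/";  # illustrative
my @queue = ([$start_url, 0]);              # FIFO of [url, depth]
my %seen  = ($start_url => 1);

while (@queue) {
    my ($url, $depth) = @{ shift @queue };
    my $page = fetch($url);
    next if $depth >= $MAX_DEPTH;           # the depth limit above
    for my $link (extract_links($page)) {
        next if $seen{$link}++;             # visit each URL once
        push @queue, [$link, $depth + 1];
    }
}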
From owner-robots Thu Nov 9 08:53:37 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA12795; Thu, 9 Nov 95 08:53:37 -0800
Resent-Message-Id: <9511091653.AA12783@webcrawler.com>
Resent-From: mak@beach.webcrawler.com
Resent-To: robots
Resent-Date: Thu, 9 Nov 1995 16:53:32
Date: Wed, 8 Nov 95 10:08:51 -0800
From:
Message-Id: <9511081808.AA19321@webcrawler.com>
To: owner-robots
Subject: BOUNCE robots: Admin request
X-Filter: mailagent [version 3.0 PL41] for mak@surfski.webcrawler.com
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
>From tbray@opentext.com Wed Nov 8 10:08:46 1995
Return-Path:
Received: from giant.mindlink.net by webcrawler.com (NX5.67f2/NX3.0M)
id AA19311; Wed, 8 Nov 95 10:08:46 -0800
Received: from Default by giant.mindlink.net with smtp
(Smail3.1.28.1 #5) id m0tDEv9-000343C; Wed, 8 Nov 95 10:08 PST
Message-Id:
Date: Wed, 8 Nov 95 10:08 PST
X-Sender: a07893@giant.mindlink.net
X-Mailer: Windows Eudora Pro Version 2.1.2
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: robots@webcrawler.com
From: Tim Bray
Subject: Re: Preliminary robot.faq (Please Send Questions or Comments)
Cc: robots@webcrawler.com
We're wasting too much time on this. All I meant to say was that the
original language strongly suggested that robots use the following
algorithm:
sub RetrievePage {
    my ($url) = @_;
    my $text = HttpGet($url);
    foreach my $sub_url (UrlsIn($text)) {   # UrlsIn: the link-extraction step
        RetrievePage($sub_url);
    }
}
Whereas lots of robots don't. Obviously it is recursive in that you
do pull urls out of pages and eventually follow them, but it doesn't
feel recursive. The 'fuzzy' stuff is a complete red herring - except
for the special case of 'fuzzy logic' (not what's being done here) the
word 'fuzzy' in the information retrieval context is a marketing term
without semantic content.
Cheers, Tim Bray, Open Text Corporation (tbray@opentext.com)
From owner-robots Fri Nov 17 09:12:34 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA21835; Fri, 17 Nov 95 09:12:34 -0800
Date: Fri, 17 Nov 1995 09:24:00 -0800 (PST)
From: Benjamin Franz
X-Sender: snowhare@ns.viet.net
To: robots@webcrawler.com
Subject: Bad robot: WebHopper bounch! Owner: peter@cartes.hut.fi
In-Reply-To: <95Nov9.003740hkt.19032-3+260@uxmail.ust.hk>
Message-Id:
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
I was checking my stats and this showed up with 1838 hits on the
9th of November. It tried to completely explore an infinite virtual space
in one run, with an average time between hits of 4.3 seconds. Its parser
has to be broken, because it was exploring a space defined by a
?cookie=number (used for shopping basket session tracking), but failing
to preserve the '=' (generating 'cookienumber' instead of
'cookie=number') between calls, causing a new cookie to be assigned
to every request. It went into an infinite loop over the same five base
pages as it tried to do a depth-first search of the site, for a
little over two hours.
Argh.
Anyone else hit by this rather broken robot?
--
Benjamin Franz, Webmaster, Net Images
From owner-robots Thu Nov 23 12:44:36 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA03420; Thu, 23 Nov 95 12:44:36 -0800
Date: Thu, 23 Nov 1995 12:42:51 -0800 (PST)
From: Andrew Daviel
To: libwww-perl@ics.UCI.EDU, /CN=robots/@nexor.co.uk
Cc: Daniel Terrer
Subject: wwwbot.pl problem
Message-Id:
Mime-Version: 1.0
Content-Type: text/PLAIN; charset="US-ASCII"
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
(I sent a request to libwww-perl-request just before my last message
to the list, so I might not be on yet. Please Cc any replies to me.)
I was having trouble with wwwbot from the libwww-perl-0.40 library.
I continued to work on the problem after posting to the perl list.
It seems that botcache is not well enough defined, so that
a site with "User-agent: *" / "Disallow: /" would kill subsequent GETs to a
site that was previously in the cache. I have made a patch which adds the
address to the cache, and fixes a couple of other odd cases, such as
where the address is not fully qualified when working within a domain
and there are host names such as ypsun, ypsun2, etc., which would
become confused with the path count.
See ftp://andrew.triumf.ca/pub/wwwbot.patch
Andrew Daviel email: advax@triumf.ca
TRIUMF voice: 604-222-7376
4004 Wesbrook Mall fax: 604-222-7307
Vancouver BC http://andrew.triumf.ca/~andrew
Canada V6T 2A3 49D14.7N 123D13.6W
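The gist of the fix, as I read it: key the exclusion cache on the full
host:port pair so one site's "Disallow: /" cannot leak onto another host.
A sketch only; the hash and function here are illustrative, not the actual
libwww-perl internals:

my %botcache;   # "host:port" => array ref of disallowed path prefixes

sub allowed {
    my ($host, $port, $path) = @_;
    my $rules = $botcache{"$host:$port"} || [];
    for my $prefix (@$rules) {
        return 0 if index($path, $prefix) == 0;   # disallowed
    }
    return 1;                                     # allowed
}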
From owner-robots Thu Nov 23 23:45:39 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA07952; Thu, 23 Nov 95 23:45:39 -0800
Date: Fri, 24 Nov 95 16:45:28 JST
From: francis@cactus.slab.ntt.jp (Paul Francis)
Message-Id: <9511240745.AA03918@cactus.slab.ntt.jp>
To: robots@webcrawler.com
Subject: yet another robot
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
For what it's worth, we have implemented a robot in order to
(surprise surprise) gather web resources to build a (distributed)
search database.
The robot is called Yobot, and
http://rodem.slab.ntt.jp:8080/home/robot-e.html
tells you who to complain to if Yobot misbehaves.
Thanks,
PF
From owner-robots Fri Nov 24 13:51:35 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA17245; Fri, 24 Nov 95 13:51:35 -0800
Date: Sat, 25 Nov 1995 07:53:43 +1000 (EST)
From: David Eagles
To: robots@webcrawler.com
Subject: yet another robot, volume 2
In-Reply-To: <9511240745.AA03918@cactus.slab.ntt.jp>
Message-Id:
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
We, too, have developed a robot to provide Web resource search facilities
to Australia and the South Pacific. The crawler engine will only follow
links to designated domains, and the search engine allows individual
selection of the search domain for queries.
Named after a famous Australian spider, the FunnelWeb, the service is
available at http://funnelweb.net.au
Enjoy.
Regards,
David Eagles
From owner-robots Fri Nov 24 15:20:08 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA22501; Fri, 24 Nov 95 15:20:08 -0800
Date: Sat, 25 Nov 95 09:29:44 +1100 (EST)
Message-Id:
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: robots@webcrawler.com
From: radio@mpx.com.au (James)
Subject: Re: yet another robot, volume 2
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
>We, too, have developed a robot to provide Web resource search facilities
>to Australia and the South Pacific. The crawler engine will only follow
>links to designated domains, and the search engine allows individual
>selection of the search domain for queries.
>
>Named after a famous Australian spider, the FunnelWeb, the service is
>available at http://funnelweb.net.au
>
>Enjoy.
David, we tried it out the other day. We lodged AAA and Tourist Radio (2 sites).
Great VISION
Keith Ashton
>Regards,
>David Eagles
AAA Australia Announce Archive / Tourist Radio
Home of the Australian Cool Site of the Day !
http://www.com.au/aaa
From owner-robots Fri Nov 24 16:13:17 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA25777; Fri, 24 Nov 95 16:13:17 -0800
Date: Sat, 25 Nov 95 11:13:05 +1100 (EST)
Message-Id:
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: robots@webcrawler.com
From: radio@mpx.com.au (James)
Subject: Re: yet another robot, volume 2
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
>>We, too, have developed a robot to provide Web resource search facilities
>>to Australia and the South Pacific. The crawler engine will only follow
>>links to designated domains, and the search engine allows individual
>>selection of the search domain for queries.
>>
>>Named after a famous Australian spider, the FunnelWeb, the service is
>>available at http://funnelweb.net.au
>>
>
>>Enjoy.
>>David, we tried it out the other day. We lodged AAA and Tourist Radio (2 sites).
>Great VISION
>
>Keith Ashton
____________________________________________________________________________
David,
We just got an Email back from you, but there was no content.
Keith
____________________________________________________________________________
>>Regards,
>>David Eagles
>
>AAA Australia Announce Archive / Tourist Radio
>Home of the Australian Cool Site of the Day !
>http://www.com.au/aaa
AAA Australia Announce Archive / Tourist Radio
Home of the Australian Cool Site of the Day !
http://www.com.au/aaa
From owner-robots Sat Nov 25 06:21:14 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA05034; Sat, 25 Nov 95 06:21:14 -0800
From: Byung-Gyu Chang
Message-Id: <199511251419.XAA02550@ktmp.kaist.ac.kr>
Subject: Q: Cooperation of robots
To: robots@webcrawler.com (Robot Mailing list)
Date: Sat, 25 Nov 1995 23:19:12 +0900 (KST)
X-Mailer: ELM [version 2.4 PL21-h4]
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-2022-kr
Content-Transfer-Encoding: 7bit
Content-Length: 378
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Hi, I am a newbie to this mailing list. If I make
some mistake, please let me know.
I have one question:
Is there any effort for robots to gather
information in a cooperative work style?
That is, sharing information gathered by other kinds of
robots, with some communication between robots like
that of intelligent agents in the Intelligent Agent area.
- Byung-Gyu Chang
From owner-robots Sat Nov 25 10:19:10 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA15907; Sat, 25 Nov 95 10:19:10 -0800
Date: Sat, 25 Nov 1995 13:19:03 -0500
Message-Id: <199511251819.NAA27702@moe.infi.net>
X-Sender: magi@infi.net (Unverified)
X-Mailer: Windows Eudora Light Version 1.5.2
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: robots@webcrawler.com
From: Michael Goldberg
Subject: Smart Agent help
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
I am developing sites for numerous large associations. I want to provide a
service to the members by which they can choose from selected topics, say
mortgage interest rates, and a robot goes out, searches selected sites,
and either provides a formatted "newsletter" by e-mail or returns a
"newsletter" in HTML.
Any suggestions?
<<< Media Access Group>>>
Local Access to electronic marketing
Triad member- Network Hampton Roads
2101 Parks Ave. Suite 606
Virginia Beach, VA 23451
804-422-4481
From owner-robots Sat Nov 25 15:22:58 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA01362; Sat, 25 Nov 95 15:22:58 -0800
Date: Sun, 26 Nov 1995 09:24:39 +1000 (EST)
From: David Eagles
To: robots@webcrawler.com
Subject: Re: Q: Cooperation of robots
In-Reply-To: <199511251419.XAA02550@ktmp.kaist.ac.kr>
Message-Id:
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
On Sat, 25 Nov 1995, Byung-Gyu Chang wrote:
> Hi, I am a newbie to this mailing list. If I make
> some mistake, please let me know.
>
>
> I have one question:
>
> Is there any effort for robots to gather
> information in a cooperative work style?
> That is, sharing information gathered by other kinds of
> robots, with some communication between robots like
> that of intelligent agents in the Intelligent Agent area.
>
> - Byung-Gyu Chang
>
I'm not sure if there is any official cooperation going on, but I'm
currently enhancing my web crawler (http://funnelweb.net.au) to include
support for this type of operation. Basically, here's what I'm planning:
The current web crawler, based in Australia, limits its searching and
collection to countries in the South Pacific. I'm planning to enhance
this such that any URLs found (during the crawling process) for non-South
Pacific countries will be forwarded to the web crawler responsible for
that domain (as determined by a simple config file; maybe an automated
registration process in the future). Similarly, the search engine will
allow ANY individual country(s) to be searched (as is the case now for
only South Pacific countries), and will fork the request off to the
appropriate engine.
Is this the type of info you were after?
Regards,
David Eagles
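The forwarding step David describes might look something like this. Purely
illustrative: neither the table nor forward_to() is FunnelWeb's actual code.

my %crawler_for = (
    au => 'local',             # we crawl these ourselves
    nz => 'local',
    fi => 'http://www.fi/',    # example peer for .fi URLs
);

sub forward_to { my ($peer, $url) = @_; }   # real hand-off goes here

# Look up the URL's country code; keep local ones, forward the rest.
sub route_url {
    my ($url) = @_;
    my ($host) = $url =~ m!^http://([^/:]+)! or return;
    my ($cc)   = $host =~ /\.(\w+)$/         or return;
    my $peer   = $crawler_for{$cc}           or return;  # unknown TLD
    return if $peer eq 'local';
    forward_to($peer, $url);
}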
From owner-robots Sun Nov 26 09:10:54 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA17874; Sun, 26 Nov 95 09:10:54 -0800
X-Sender: narnett@hawaii.verity.com
Message-Id:
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Sun, 26 Nov 1995 09:10:32 -0800
To: robots@webcrawler.com
From: narnett@Verity.COM (Nick Arnett)
Subject: Re: Q: Cooperation of robots
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
At 11:19 PM 11/25/95, Byung-Gyu Chang wrote:
>Is there any effort for robots to gather
>information in a cooperative work style?
>That is, sharing information gathered by other kinds of
>robots, with some communication between robots like
>that of intelligent agents in the Intelligent Agent area.
There are various efforts, but the most significant one is probably the
Harvest project at the University of Colorado. I can't remember their URL
at the moment, but I know we have a link to it from:
http://www.verity.com/customers.html
Nick
From owner-robots Sun Nov 26 16:57:32 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA11355; Sun, 26 Nov 95 16:57:32 -0800
Date: Mon, 27 Nov 95 09:57:15 JST
From: francis@cactus.slab.ntt.jp (Paul Francis)
Message-Id: <9511270057.AA12772@cactus.slab.ntt.jp>
To: robots@webcrawler.com
Subject: Re: Smart Agent help
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
>
> I am developing sites for numerous large associations. I want to provide a
> service to the members by which they can choose from selected topics, say
> mortgage interest rates, and a robot goes out, searches selected sites,
> and either provides a formatted "newsletter" by e-mail or returns a
> "newsletter" in HTML.
>
> Any suggestions?
A number of people are working towards the ability to search
selected sites, though I haven't heard of anyone trying to
put the result in a newsletter format.
Harvest allows the user to custom build his own database,
which is then locally accessed at search time.
(http://harvest.cs.colorado.edu/)
MetaCrawler, Silk, IBMinfoMarket, and no doubt many others
query multiple pre-configured search databases at search
time.
(http://metacrawler.cs.washington.edu:8080/home.html
http://services.bunyip.com:8000/products/silk/silk.html
http://www.infomkt.ibm.com/about.htm)
I'm looking forward to the day when two of these
"meta" search services point to each other and create
an infinite search loop....
PF
ps.
If you're going to the WWW conference in Boston, I'll
be chairing a BOF on distributed searching. Please see
http://rodem.slab.ntt.jp:8080/paulStuff/
From owner-robots Sun Nov 26 18:28:42 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA16185; Sun, 26 Nov 95 18:28:42 -0800
X-Sender: narnett@hawaii.verity.com
Message-Id:
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Sun, 26 Nov 1995 18:28:33 -0800
To: robots@webcrawler.com, owner-robots@webcrawler.com
From: narnett@Verity.COM (Nick Arnett)
Subject: Re: BOUNCE robots: Admin request
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
At 10:08 AM 11/8/95, wrote:
>Whereas lots of robots don't. Obviously it is recursive in that you
>do pull urls out of pages and eventually follow them, but it doesn't
>feel recursive. The 'fuzzy' stuff is a complete red herring - except
>for the special case of 'fuzzy logic' (not what's being done here) the
>word 'fuzzy' in the information retrieval context is a marketing term
>without semantic content.
Minor point -- let's not assume that no one on the list is using fuzzy
logic to decide which links to follow. After all, some of us have search
engines that use fuzzy logic operators. I'm fascinated by using evidential
reasoning to build agents that explore.
Nick
From owner-robots Sun Nov 26 19:43:06 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA20091; Sun, 26 Nov 95 19:43:06 -0800
Date: Mon, 27 Nov 95 12:42:56 JST
From: francis@cactus.slab.ntt.jp (Paul Francis)
Message-Id: <9511270342.AA14195@cactus.slab.ntt.jp>
To: robots@webcrawler.com
Subject: Re: Q: Cooperation of robots
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
>
> Is there any effort for robots to gather
> information in a cooperative work style?
> That is, sharing information gathered by other kinds of
> robots, with some communication between robots like
> that of intelligent agents in the Intelligent Agent area.
>
I haven't seen anything, but I only pay so much
attention to this list. I know that one problem is
that many robots run to support profit- (or planned
profit-) based services, so don't want to share their
info.
What do you see as the advantage to sharing information?
It is offhand not clear to me that much is to be gained
by it. For instance, given that each robot-running
organization usually has its own way of processing
the resources it finds, it has to go out and
retrieve the resources in any event. Thus, not much
may be saved by sharing information....
PF
From owner-robots Mon Nov 27 01:14:04 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA06867; Mon, 27 Nov 95 01:14:04 -0800
From: Jaakko Hyvatti
Message-Id: <199511270913.LAA29177@krisse.www.fi>
Subject: Re: Q: Cooperation of robots
To: robots@webcrawler.com
Date: Mon, 27 Nov 1995 11:13:46 +0200 (EET)
In-Reply-To: <9511270342.AA14195@cactus.slab.ntt.jp> from "Paul Francis" at Nov 27, 95 12:42:56 pm
X-Mailer: ELM [version 2.4 PL22]
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Content-Length: 1744
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
francis@cactus.slab.ntt.jp (Paul Francis):
> I haven't seen anything, but I only pay so much
> attention to this list. I know that one problem is
> that many robots run to support profit- (or planned
> profit-) based services, so don't want to share their
> info.
We at http://www.fi/ have good coverage of the WWW resources of
Finland. You are right, we are clearly not willing to share our
information base with other search engines in Finland (there is
another one). On the other hand, it might be possible to share the
database with some or all of the international search engines as a
promotion. We would not lose any markets here in Finland, because
our site would always be the fastest way for Finnish customers to
perform searches.
> What do you see as the advantage to sharing information?
> It is offhand not clear to me that much is to be gained
> by it. For instance, given that each robot-running
> organization usually has their own way of processing
> the resources they find, then they have to go out and
> retrieve the resources in any event. Thus, not much
> may be saved by sharing information....
If two co-operating parties agree on a common set of
information to store about each individual page, both could
modify their robots to comply with this. Possibly even
just a compressed .tar.gz archive of the pages could do.
Anyway, it saves bandwidth on international connections
and annoys the servers less.
I do not believe that our current database would suit anybody else's
needs, but maybe the next time we collect all the pages we could fetch
all the information necessary for someone else too.
Feel free to contact me at Jaakko.Hyvatti@www.fi if you are
interested. We cover almost all of Finland.
From owner-robots Mon Nov 27 08:27:10 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA04759; Mon, 27 Nov 95 08:27:10 -0800
Message-Id: <9511271626.AA04714@webcrawler.com>
Original-Received: from research by ns
Pp-Warning: Illegal Received field on preceding line
X-Mailer: exmh version 1.6.4 10/10/95
From: Fred Douglis
To: Andrew Daviel
Cc: libwww-perl@ics.UCI.EDU, /CN=robots/@nexor.co.uk,
Daniel Terrer
Subject: Re: wwwbot.pl problem
In-Reply-To: Your message of "Thu, 23 Nov 1995 12:42:51 PST."
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA17259; Mon, 27 Nov 95 12:29:54 -0800
Message-Id: <199511272029.PAA14228@lexington.cs.columbia.edu>
To: robots@webcrawler.com
Subject: harvest
Date: Mon, 27 Nov 1995 15:29:38 -0500
From: "John D. Pritchard"
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
there's been some mention of harvest.. the URL is
http://harvest.cs.colorado.edu/
this provides a ton of infrastructure for implementing robots on top of, in
the form of gatherers and/or brokers.
harvest sites cooperate so that once (with caching) a set of data (ftp,
http, gopher, wais, etc.) has been "harvested" (or gathered), the global
harvest database can reuse the gathered info without re-harvesting
(re-gathering) from the target data site.
this means "responsible"* robots that don't load up data sites with redundant
automated downloading, and cooperative robots, via brokering.
* or ethical:
http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Agents/eichmann.ethical/eichmann.html
see http://harvest.cs.colorado.edu/harvest/technical.html for more.
for linear robot cooperation, harvest provides the Summary Object Interchange
Format (SOIF): http://harvest.cs.colorado.edu/Harvest/brokers/soifhelp.html
arbitrary extensions to SOIF are on the object, object-attribute model.
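a SOIF record looks roughly like what this sketch emits (the attribute is
chosen arbitrarily, and the details are from memory; see the soifhelp URL
above for the authoritative definition):

# Emit a SOIF-style record; the {N} after each attribute name
# is the byte count of the value that follows.
sub soif_record {
    my ($url, %attr) = @_;
    my $rec = "\@FILE { $url\n";
    for my $name (sort keys %attr) {
        $rec .= sprintf "%s{%d}:\t%s\n",
                        $name, length($attr{$name}), $attr{$name};
    }
    return $rec . "}\n";
}

print soif_record("http://harvest.cs.colorado.edu/",
                  'Title' => 'Harvest Home Page');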
for nonlinear robot cooperation or interaction, brokers can be defined
arbitrarily.
i'm presently working on an associative AI which i had developed as a
standalone program, but am stripping my lame gathering and brokering code
for the sophistication of harvest.
-john
From owner-robots Mon Nov 27 14:39:00 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA18292; Mon, 27 Nov 95 14:39:00 -0800
Date: Mon, 27 Nov 95 15:55:32 EST
From: Jason_Murray_at_FCRD@cclink.tfn.com
Message-Id: <9510278175.AA817518051@cclink.tfn.com>
To: robots@webcrawler.com
Subject: Re: Smart Agent help
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Give me a call (617) 345-2465 or send email (netsoft@aol.com). We are
in the process of creating just such an agent.
Jason Murray
DataMarket
306 Union St
Rockland MA 02370
Fax 617-871-5816
From owner-robots Mon Nov 27 14:58:48 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA18458; Mon, 27 Nov 95 14:58:48 -0800
Message-Id: <30BA6C06.444C@infi.net>
Date: Mon, 27 Nov 1995 17:55:18 -0800
From: Michael Goldberg
Organization: Media Access Group
X-Mailer: Mozilla 2.0b2a (Windows; I; 16bit)
Mime-Version: 1.0
To: robots@webcrawler.com
Subject: Re: harvest
References: <199511272029.PAA14228@lexington.cs.columbia.edu>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Received your email through the robots listserv...
I need an application built for a site I am developing.
The application allows users of the site to tailor specified
areas of interest, say mortgages, and search specific WWW sites
and retrieve the information either by email or in a formatted newsletter.
Can Harvest do this?
From owner-robots Mon Nov 27 16:38:38 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA19231; Mon, 27 Nov 95 16:38:38 -0800
Message-Id: <199511280038.TAA14968@lexington.cs.columbia.edu>
To: robots@webcrawler.com
Subject: mortgages with: Re: harvest
In-Reply-To: Your message of "Mon, 27 Nov 1995 17:55:18 PST."
<30BA6C06.444C@infi.net>
Date: Mon, 27 Nov 1995 19:38:34 -0500
From: "John D. Pritchard"
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
> Received your email through the robots listserv...
> I need an application built for a site I am developing.
> The application allows users of the site to tailor specified
> areas of interest, say mortgages, and search specific WWW sites
this is the kind of thing that harvest provides for. basically, in
"tailoring" information dynamically (as opposed to going to a static menu
system) your user is faced with (recursively) traversing an association
graph. the user wants to see data with mortgage numbers. associativity is
the service we are providing. better associativity, however, classes data,
e.g., via SOIF, so that the user has more coherent domains to search through
than "every document with numeric strings and the string 'mortgage'".
presently, SOIF provides for arbitrary degrees of data classification, which
is a strong solution for most applications, and generally an optimal
solution for applications involving fairly regular data formats, e.g.,
reports or forms.
harvest provides for sites to cooperate or interoperate efficiently for
applications such as these since no one site could ever have space to
replicate the entire internet, or even a significant associative slice of
it, in providing a monolithic internet database.
basically the talent of harvest in linear interoperability, via SOIF, is
providing the architecture for this recursively infinite association graph
traversal in most forms of data, especially business data.
> and retrieve the information either by email or in a formatted newsletter.
> Can Harvest do this?
certainly you could put an email or such interface on the system, but your
users would probably be happier with something more responsive and flexible
like a web interface. an interactive interface provides the opportunity
for refining data collection, for discovering new sources of data, etc.
-john
From owner-robots Mon Nov 27 19:52:36 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA26675; Mon, 27 Nov 95 19:52:36 -0800
Date: Mon, 27 Nov 1995 22:52:30 -0500
From: Skip Montanaro
Message-Id: <199511280352.WAA24695@dolphin.automatrix.com>
To: robots@webcrawler.com
Subject: How frequently should I check /robots.txt?
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
I'm working on a specialized robot to identify Web sites with concert
itineraries (by scoring the contents of the file against expected patterns).
I will announce it here when I begin exercising it outside my local network.
I'm a bit confused about how often I should update my local copy of a site's
/robots.txt file. Clearly I shouldn't check it with each access, since that
would double the number of accesses my robot would make to a site.
I saw nothing in my server's access logs that would suggest that any of the
robots that visit our site ever perform a HEAD request for /robots.txt
(indicating they were checking for a Last-modified header).
So how about it? How often should /robots.txt be checked?
Thx,
Skip Montanaro skip@calendar.com (518)372-5583
Musi-Cal: http://www.calendar.com/concerts/ or mailto:concerts@calendar.com
Internet Conference Calendar: http://www.calendar.com/conferences/
>>> ZLDF: http://www.netresponse.com/zldf <<<
From owner-robots Mon Nov 27 20:31:52 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA00165; Mon, 27 Nov 95 20:31:52 -0800
Date: Mon, 27 Nov 1995 23:27:08 -0600 (CST)
From: gil cosson
To: robots@webcrawler.com
Cc: robots@webcrawler.com
Subject: Re: How frequently should I check /robots.txt?
In-Reply-To: <199511280352.WAA24695@dolphin.automatrix.com>
Message-Id:
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
How about adding an entry to the robots.txt file that specifies how
frequently the robots.txt file should be checked?
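Concretely, something along these lines might do (hypothetical syntax; the
field name is made up):

    # hypothetical: robots should re-fetch this file after 7 days
    Recheck-After: 7d
    User-agent: *
    Disallow: /cgi-bin/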
gil.
==========================================================================
"Everybody can be great because anybody can serve. You don't have to have a
college degree to serve. You don't have to make your subject and verb agree
to serve. You don't have to know the second theory of thermodynamics in
physics to serve. You only need a heart full of grace. A soul generated by
love." Martin Luther King Jr.
On Mon, 27 Nov 1995, Skip Montanaro wrote:
>
> I'm working on a specialized robot to identify Web sites with concert
> itineraries (by scoring the contents of the file against expected patterns).
> I will announce it here when I begin exercising it outside my local network.
>
> I'm a bit confused about how often I should update my local copy of a site's
> /robots.txt file. Clearly I shouldn't check it with each access, since that
> would double the number of accesses my robot would make to a site.
>
> I saw nothing in my server's access logs that would suggest that any of the
> robots that visit our site ever perform a HEAD request for /robots.txt
> (indicating they were checking for a Last-modified header).
>
> So how about it? How often should /robots.txt be checked?
>
> Thx,
>
> Skip Montanaro skip@calendar.com (518)372-5583
> Musi-Cal: http://www.calendar.com/concerts/ or mailto:concerts@calendar.com
> Internet Conference Calendar: http://www.calendar.com/conferences/
> >>> ZLDF: http://www.netresponse.com/zldf <<<
>
From owner-robots Mon Nov 27 23:22:57 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA03527; Mon, 27 Nov 95 23:22:57 -0800
Message-Id: <9511280722.AA03518@webcrawler.com>
To: robots@webcrawler.com
Subject: Re: How frequently should I check /robots.txt?
In-Reply-To: Your message of "Mon, 27 Nov 1995 23:27:08 CST."
Date: Mon, 27 Nov 1995 23:22:54 -0800
From: Martijn Koster
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
In message , gil
cosson writes:
> How about adding an entry to the robots.txt file that specifies how
> frequently the robots.txt file should be checked?
Hmm.. and then how often do you check if the checking frequency has
changed? :-)
Seriously, though, I don't think there'd be a lot of benefit; as an admin
you tend not to know when you'll make the next change.
From an HTTP point of view, robots could be smart and look at the Expires
header.
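For example (an illustrative HTTP/1.0 exchange; the dates are made up):

    GET /robots.txt HTTP/1.0
    If-Modified-Since: Mon, 20 Nov 1995 08:30:00 GMT

    HTTP/1.0 304 Not Modified
    Expires: Mon, 04 Dec 1995 23:00:00 GMT

A 304 means the cached copy is still good, and an Expires header, when
present, tells the robot how long it needn't even ask.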
Deciding how often to check /robots.txt depends highly on how you run your
robot: how many runs per week, how many documents per run, etc. I'd say a
week is a reasonable time. If your robot supports end-user submissions you
could of course be clever about people submitting their /robots.txt URL;
that would give them more influence.
-- Martijn
__________
Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html
From owner-robots Wed Nov 29 18:16:32 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA11717; Wed, 29 Nov 95 18:16:32 -0800
Message-Id: <9511300215.AA04718@grasshopper.ucsd.edu>
Content-Type: text/plain
Mime-Version: 1.0 (NeXT Mail 3.3 v118.2)
From: Christopher Penrose
Date: Wed, 29 Nov 95 18:15:27 -0800
To: robots@webcrawler.com
Subject: McKinley Spider hit us hard
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
A spider from magellan.mckinley.com hit us hard today and did a
deep recursive search of our web tree. Not very friendly, but their
spider did check /robots.txt which indicates that they may have
successfully implemented the robot exclusion protocol.
Christopher Penrose
penrose@ucsd.edu
http://www-crca.ucsd.edu/TajMahal/after.html
here is their internic info if anyone else wants to complain to them:
The McKinley Group (MCKINLEY-DOM)
85 Liberty Ship Way Suite 201
Sausalito, CA 94965
Domain Name: MCKINLEY.COM
Administrative Contact, Technical Contact, Zone Contact:
Cohen, Alexander J. (ASC2) xcohen@MCKINLEY.COM
415-331-1884 FAX
Record last updated on 21-Sep-95.
Record created on 14-Jul-94.
Domain servers in listed order:
NS1.NOC.NETCOM.NET 204.31.1.1
NS2.NOC.NETCOM.NET 204.31.1.2
From owner-robots Thu Nov 30 10:45:43 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA12365; Thu, 30 Nov 95 10:45:43 -0800
Date: Thu, 30 Nov 1995 13:43:58 -0500
From: alain@ai.iit.nrc.ca (Alain Desilets)
Message-Id: <9511301843.AA28288@ksl1000.iit.nrc.ca>
To: robots@webcrawler.com
Subject: Re: Looking for a spider
Cc: alain@ai.iit.nrc.ca
X-Sun-Charset: US-ASCII
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Dear Marilyn,
Just thought I'd check out the status of your robot testbed.
My ListSeeker software (http://ai.iit.nrc.ca/II_public/WebView/ListSeeker.html)
is now ready for testing. So if your robot testbed is ready for public use, I
am prepared to try it out.
Sincerely,
Alain Desilets
Institute for Information Technology
National Research Council of Canada
Building M-50
Montreal Road
Ottawa (Ont)
K1A 0R6
e-mail: alain@ai.iit.nrc.ca
Tel: (613) 990-2813
Fax: (613) 952-7151
From owner-robots Thu Nov 30 12:30:51 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA18231; Thu, 30 Nov 95 12:30:51 -0800
Date: Thu, 30 Nov 1995 21:29:30 +0100 (MET)
From: Karoly Negyesi
X-Sender: chx@turan
To: robots@webcrawler.com
Subject: Small robot needed
Message-Id:
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Hi!
I need a very small robot which downloads a given URL (most probably an
HTML page) and everything it directly references (HREFs, LINKs, SRCs).
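The link-extraction half, at least, is small. A rough, untested Perl 5
sketch (it reads the HTML on stdin and leaves the actual fetching to
whatever HTTP client is handy):

    # Untested sketch: print everything a page directly references.
    # The pattern covers A HREF, LINK HREF, IMG SRC, and the like.
    undef $/;                       # slurp the whole page
    my $page = <STDIN>;
    while ($page =~ /(?:href|src)\s*=\s*"?([^"\s>]+)/ig) {
        print "$1\n";               # candidate URL to fetch next
    }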
Thanks,
___ ___ Charlie Negyesi chx@cs.elte.hu ___ ___
{~._.~} {~._.~} (+361) 203-5962 (7pm-9pm) {~._.~} {~._.~}
_( Y )_ ( * ) Hungary, Budapest ( * ) _( Y )_
(:_~*~_:) ()~*~() H-1462, P.o.box 503 ()~*~() (:_~*~_:)
(_)-(_) (_)-(_) May the Bear be with you! (_)-(_) (_)-(_)
From owner-robots Thu Nov 30 13:15:32 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA20570; Thu, 30 Nov 95 13:15:32 -0800
Date: Thu, 30 Nov 1995 16:15:21 -0500
From: Skip Montanaro
Message-Id: <199511302115.QAA04958@dolphin.automatrix.com>
To: robots@webcrawler.com
Subject: New robot turned loose on an unsuspecting public... and a DNS question
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
No, it's not really another Godzilla movie. I started running the Musi-Cal
Robot today. It has the following properties:
1. Understands (and obeys!) the robots.txt protocol.
2. Doesn't revisit the same server more than once every 10 minutes
(sketched below, after this list).
3. Doesn't revisit the same URL more than once per month.
4. Only groks HTTP URLs at the moment.
5. Announces itself in requests as "Musi-Cal-Robot/0.1".
6. Gives my email ("skip@calendar.com") in the From: field of the
request.
7. It's looking for music-related sites, so you may never see it.
8. The HTML parser I'm using is rather slow, which helps avoid
network congestion.
9. You should only ever see it running from dolphin.automatrix.com,
a machine connected via 28.8k modem - again, a fine
network/server congestion avoidance tool.
10. It randomizes its list of outstanding URLs after every pass
through the list to minimize beating up a single server.
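The guts of (2), for the curious, is just a per-host timestamp table;
roughly this (a sketch of the idea, not the robot's verbatim code):

    my %last_hit;                   # host => time of last request

    sub ok_to_fetch {
        my ($host) = @_;
        return 0 if time() - ($last_hit{$host} || 0) < 600;  # 10 minutes
        $last_hit{$host} = time();
        return 1;
    }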
If there's anything I've forgotten to do (like announce it somewhere on
Usenet) or any parameter needs obvious tweaking, let me know.
I have been struggling with DNS resolution and was wondering if people could
give me some feedback. Ideally, I want to make sure I treat all aliases for
a server as the same server, so I was attempting to execute
gethostbyaddr((gethostbyname('www.wherever.com'))[4], AF_INET)
but that seemed terribly slow and tcpdump traces suggested that it would get
stuck banging on the same server. Then I tried just the gethostbyname(),
but that wasn't much better. For now, I just accept what I have for a host
name and map a couple places I know that do round-robin DNS back into the
canonical name.
What do other robot writers do about name resolution? Feedback appreciated.
Thanks,
Skip Montanaro skip@calendar.com (518)372-5583
Musi-Cal: http://www.calendar.com/concerts/ or mailto:concerts@calendar.com
Internet Conference Calendar: http://www.calendar.com/conferences/
>>> ZLDF: http://www.netresponse.com/zldf <<<
From owner-robots Thu Nov 30 17:40:52 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA09699; Thu, 30 Nov 95 17:40:52 -0800
Message-Id: <199512010140.RAA28005@fiji.verity.com>
X-Authentication-Warning: fiji.verity.com: Host localhost.verity.com didn't use HELO protocol
To: skip@calendar.com
Cc: robots@webcrawler.com
Subject: Re: New robot turned loose on an unsuspecting public... and a DNS question
In-Reply-To: Your message of "Thu, 30 Nov 1995 16:15:21 EST."
<199511302115.QAA04958@dolphin.automatrix.com>
Date: Thu, 30 Nov 1995 17:40:32 -0800
From: Thomas Maslen
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
> What do other robot writers do about name resolution?
In our case... cache the results of lookups so that we only do the
gethostbyname("foo") once for any particular "foo". This still gives pretty
evil behaviour on, say, a page of links to cool places where almost every
link points to a different host, but the average behaviour is much better
than not caching.
Also, if you're looking for a canonical representation for hosts so that you
can test "is this host the same as that one?", I'd suggest that you _not_
try matching the hostnames: rather, do the gethostbyaddr() and then look
for an intersection in the sets of IP addresses (but be prepared to rewrite
the code next year to deal with IPv6 addresses!). In other words, the
canonical representation for a host should be the set of IP addresses, not
the hostname strings.
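In Perl 5 terms, the caching plus the set-intersection test might look
roughly like this (an untested sketch of the idea, not our code):

    use Socket;                     # for inet_ntoa() and AF_INET

    my %addr_cache;                 # hostname => ref to list of dotted quads

    # Look up (and cache) all IP addresses for a host; a failed lookup
    # caches an empty list so we never retry it.
    sub addrs_for {
        my ($host) = @_;
        unless ($addr_cache{$host}) {
            my ($name, $aliases, $type, $len, @addrs) = gethostbyname($host);
            $addr_cache{$host} = [ map { inet_ntoa($_) } @addrs ];
        }
        return @{ $addr_cache{$host} };
    }

    # Two hosts are "the same server" iff their address sets intersect.
    sub same_server {
        my ($host_a, $host_b) = @_;
        my %seen;
        foreach my $ip (addrs_for($host_a)) { $seen{$ip} = 1; }
        foreach my $ip (addrs_for($host_b)) { return 1 if $seen{$ip}; }
        return 0;
    }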
Thomas
tmaslen@verity.com My opinions, not Verity's
From owner-robots Fri Dec 1 08:24:24 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA00727; Fri, 1 Dec 95 08:24:24 -0800
Date: Fri, 1 Dec 95 10:33:28 EST
From: wulfekuh@cps.msu.edu (Marilyn R Wulfekuhler)
Message-Id: <9512011533.AA14431@pixel.cps.msu.edu>
To: robots@webcrawler.com
Subject: Re: Looking for a spider
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Hi,
Sorry to say, we had a disk problem and lost the original data. In the
meantime, we have ordered a new (9 gig) disk, and also uncovered some more
bugs in htmlgobble, and are trying to get things back. The known bugs are
fixed, but the word on the new disk is still "any day now".
You've been patient so far: sorry I didn't let you know the status earlier.
I'll try to keep you informed, and when we have stuff (even before I announce
it to the list), I'll let you know.
Thanks for your patience,
Marilyn
From owner-robots Fri Dec 1 08:59:15 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA03564; Fri, 1 Dec 95 08:59:15 -0800
Date: Fri, 1 Dec 1995 17:20:47 +0200 (EET)
From: Cristian Ionitoiu
X-Sender: cristi@tempus5
To: robots@webcrawler.com
Subject: inquiry about robots
Message-Id:
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Hi to everybody,
I'm quite new on the list, and I'm interested in Internet navigating
robots.
I would like to know: is there any robot which offers an API for the
programmer? Or is there any publicly available robot together with its
source? I would prefer a non-Perl implementation.
Thank you in advance for all your information!
--Cristian
==============================================================================
CRISTIAN IONITOIU    Computer Science Department, "Politehnica" University of
teaching assistant   Timisoara
                     Email: cristi@utt.ro, cristi@ns.utt.ro, cristi@cs.utt.ro
WWW: http://www.utt.ro/~cristi
Office: Bdul. Vasile Parvan No. 2, 1900 Timisoara, Romania
Private: O.P. 5, C.P. 641, 1900 Timisoara, Romania
Fax&Phone: (office): +40 56 192 049
______________________________________________________________________________
Science is what happens when preconception meets verification.
==============================================================================
From owner-robots Fri Dec 1 09:26:06 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA06317; Fri, 1 Dec 95 09:26:06 -0800
Date: Fri, 1 Dec 1995 12:24:16 -0500
From: alain@ai.iit.nrc.ca (Alain Desilets)
Message-Id: <9512011724.AA00940@ksl1000.iit.nrc.ca>
To: robots@webcrawler.com
Subject: Re: Looking for a spider
Cc: alain@ai.iit.nrc.ca
X-Sun-Charset: US-ASCII
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
> Hi,
>
> Sorry to say, we had a disk problem and lost the original data.
That's a bummer...
> In the
> meantime, we have ordered a new (9 gig) disk, and also uncovered some more
> bugs in htmlgobble, and are trying to get things back. The known bugs are
> fixed, but the word on the new disk is still "any day now".
>
> You've been patient so far: sorry I didn't let you know the status earlier.
>
> I'll try to keep you informed, and when we have stuff (even before I announce
> it to the list), I'll let you know.
>
Don't worry about me. We have some data here that I can use to test my
approach on a small scale, and I am talking to some other people about
getting about 1G of additional data.
Your data would be a good addition to that (the more data the better).
Good luck with your work and let me know how it goes.
Alain
From owner-robots Fri Dec 1 09:41:33 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA07351; Fri, 1 Dec 95 09:41:33 -0800
Date: Fri, 1 Dec 1995 09:40:26 -0800
Message-Id: <199512011740.JAA05988@ix13.ix.netcom.com>
From: wessman@ix.netcom.com (Gene Essman )
Subject: Re: Looking for a spider
To: robots@webcrawler.com
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
You wrote:
>
>
>Hi,
>
>Sorry to say, we had a disk problem and lost the original data. In the
(snip)
Sorry to seem so ignorant, but I have just been hanging around the
Internet a short time. In that time, I have wondered about the
whole "robot/spider" thing and have a couple of questions. Perhaps
someone could take the time to help me out.
Are robots for sale, or can one "hire" someone who has one to do some
work? How does that whole thing work? Thanks,
Gene Essman
From owner-robots Fri Dec 1 10:28:36 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA10173; Fri, 1 Dec 95 10:28:36 -0800
X-Sender: narnett@hawaii.verity.com
Message-Id:
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Fri, 1 Dec 1995 10:28:27 -0800
To: robots@webcrawler.com
From: narnett@Verity.COM (Nick Arnett)
Subject: Re: Looking for a spider
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
At 9:40 AM 12/1/95, Gene Essman wrote:
>Are robots for sale or can one "hire" someone who has one to do some
>work, or how does that whole thing work.
Verity offers a couple of variations of its Web robot, but they are
designed specifically to build Verity search indexes, not as
general-purpose robots. The only generally available robot-ish code that I
know about is the Harvest Gatherer code. Its primary purpose is to index
the server on which it is running, but it's a fairly small step to make it
do the same over the wire.
I think there's a widespread reluctance to push robots hard in the
commercial space, since marketing success would fairly quickly breed
failure -- having lots of robots doing redundant work would be a huge
inefficiency.
Nick
From owner-robots Fri Dec 1 17:22:39 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA03387; Fri, 1 Dec 95 17:22:39 -0800
Message-Id:
Date: Sat, 2 Dec 1995 00:12:58 +0000
From: Ted Sullivan
Subject: Re: Looking for a spider
To: robots
X-Mailer: Worldtalk (NetConnex V3.50a)/MIME
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Not to make this a sales pitch, but if you need a really specialized
spider for commercial work, we can build one for you that interfaces with
ObjectStore (an object database) and any other applications you might have
around.
Ted Sullivan
----------
From: robots
To: robots
Subject: Re: Looking for a spider
Date: Friday, December 01, 1995 9:40AM
You wrote:
>
>
>Hi,
>
>Sorry to say, we had a disk problem and lost the original data. In the
(snip)
Sorry to seem so ignorant, but I have just been hanging around the
Internet a short time. In that time, I have wondered about the
whole "robot/spider" thing and have a couple of questions. Perhaps
someone could take the time to help me out.
Are robots for sale, or can one "hire" someone who has one to do some
work? How does that whole thing work? Thanks,
Gene Essman
From owner-robots Fri Dec 1 19:54:36 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA03896; Fri, 1 Dec 95 19:54:36 -0800
Date: Fri, 1 Dec 1995 20:52:46 -0700
Message-Id: <199512020352.UAA24347@web.azstarnet.com>
X-Sender: drose@azstarnet.com
X-Mailer: Windows Eudora Version 1.4.4
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: robots@webcrawler.com
From: drose@AZStarNet.com
Subject: Re: Looking for a spider
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Ted:
I very much need a specialized spider. Could you let me know something
about your capabilities? Assume that I want to research *everything* on the
Web about, say, stamp collecting (not) on a historical and contemporary
basis: how would your spider work?
I look forward to hearing from you.
-David M. Rose
>
>Not to make this a sales pitch, but if you need a really specialized
>spider for commercial work, we can build one for you that interfaces with
>ObjectStore (an object database) and any other applications you might have
>around.
>
>Ted Sullivan
> ----------
From owner-robots Fri Dec 1 20:47:03 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA04060; Fri, 1 Dec 95 20:47:03 -0800
Message-Id: <30BF86FE.183@mcc.tamu.edu>
Date: Fri, 01 Dec 1995 22:51:42 +0000
From: Lance Ogletree
X-Mailer: Mozilla 2.0b3 (Macintosh; I; PPC)
Mime-Version: 1.0
To: robots@webcrawler.com
Subject: MacPower
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Interested in Power Macintosh Computers?
Stop by a site on the web.
MacPower!!!!!!!!
http://mccnet.tamu.edu/MacPower/MacPower.html
From owner-robots Sat Dec 2 08:11:51 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA05472; Sat, 2 Dec 95 08:11:51 -0800
Message-Id:
Date: Sat, 2 Dec 1995 05:09:58 +0000
From: Ted Sullivan
Subject: Re: Looking for a spider
To: robots
X-Mailer: Worldtalk (NetConnex V3.50a)/MIME
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
Could you send your e-mail address to tsullivan@snowymtn.com so we can
have this discussion off the robots mailing list? I am sure the others
would appreciate it.
Ted
----------
From: robots
To: robots
Subject: Re: Looking for a spider
Date: Friday, December 01, 1995 7:52PM
Ted:
I very much need a specialized spider. Could you let me know something
about your capabilities? Assume that I want to research *everything* on the
Web about, say, stamp collecting (not) on a historical and contemporary
basis: how would your spider work?
I look forward to hearing from you.
-David M. Rose
>
>Not to make this a sales pitch, but if you need a really specialized
>spider for commercial work, we can build one for you that interfaces with
>ObjectStore (an object database) and any other applications you might have
>around.
>
>Ted Sullivan
> ----------
From i.bromwich@nexor.co.uk Mon Dec 4 02:24:00 1995
Return-Path:
Received: from lancaster.nexor.co.uk by webcrawler.com (NX5.67f2/NX3.0M)
id AA00398; Mon, 4 Dec 95 02:24:00 -0800
X400-Received: by /PRMD=NEXOR/ADMD= /C=GB/; Relayed;
Mon, 4 Dec 1995 10:23:23 +0000
X400-Received: by mta lancaster.nexor.co.uk in /PRMD=NEXOR/ADMD= /C=GB/;
Relayed; Mon, 4 Dec 1995 10:23:23 +0000
Date: Mon, 4 Dec 1995 10:23:23 +0000
X400-Originator: i.bromwich@nexor.co.uk
X400-Recipients: non-disclosure:;
X400-Mts-Identifier: [/PRMD=NEXOR/ADMD= /C=GB/;lancaster.ne:166150:951204102333]
Content-Identifier: XT-MS Message
Priority: Non-Urgent
From: "i.bromwich"
Message-Id:
To: robots-archive
Reply-To: mak
X-Mua-Version: XT-MUA 1.4 (dornier) of Tue Aug 22 03:03:53 BST 1995
// martijn, can't think of any other way to get these to you easily. Get
// in touch if you need more help
get
stop
From owner-robots Mon Dec 4 04:36:17 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA00646; Mon, 4 Dec 95 04:36:17 -0800
From: Jaakko Hyvatti
Message-Id: <199512041236.OAA16470@krisse.www.fi>
Subject: Re: MacPower
To: robots@webcrawler.com
Date: Mon, 4 Dec 1995 14:36:07 +0200 (EET)
In-Reply-To: <30BF86FE.183@mcc.tamu.edu> from "Lance Ogletree" at Dec 1, 95 10:51:42 pm
X-Mailer: ELM [version 2.4 PL22]
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Content-Length: 174
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
> Interested in Power Macintosh Computers?
> Stop by a site on the web.
> MacPower!!!!!!!!
> http://mccnet.tamu.edu/MacPower/MacPower.html
No, I am not very interested.
From owner-robots Mon Dec 4 04:47:22 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA00689; Mon, 4 Dec 95 04:47:22 -0800
From: Jaakko Hyvatti
Message-Id: <199512041247.OAA16694@krisse.www.fi>
Subject: Re: MacPower (an apology, I am very sorry)
To: robots@webcrawler.com
Date: Mon, 4 Dec 1995 14:47:14 +0200 (EET)
In-Reply-To: <199512041236.OAA16470@krisse.www.fi> from "Jaakko Hyvatti" at Dec 4, 95 02:36:07 pm
X-Mailer: ELM [version 2.4 PL22]
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Content-Length: 148
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
> > http://mccnet.tamu.edu/MacPower/MacPower.html
>
> No, I am not very interested.
I am very sorry this reply to the spam got into the list.
From owner-robots Tue Dec 5 12:57:01 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA19484; Tue, 5 Dec 95 12:57:01 -0800
From: Michael Van Biesbrouck
Message-Id: <199512052056.PAA24672@mobius07.math.uwaterloo.ca>
Subject: Re: McKinley Spider hit us hard
To: robots@webcrawler.com
Date: Tue, 5 Dec 1995 15:56:33 -0500 (EST)
In-Reply-To: <9511300215.AA04718@grasshopper.ucsd.edu> from "Christopher Penrose" at Nov 29, 95 06:15:27 pm
X-Mailer: ELM [version 2.4 PL23]
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Content-Length: 1110
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
> A spider from magellan.mckinley.com hit us hard today and did a
> deep recursive search of our web tree. Not very friendly, but their
> spider did check /robots.txt which indicates that they may have
> successfully implemented the robot exclusion protocol.
>
>
> Christopher Penrose
> penrose@ucsd.edu
> http://www-crca.ucsd.edu/TajMahal/after.html
>
> here is their internic info if anyone else wants to complain to them:
The spider in question is Wobot/1.00; the correct person to bother
with complaints is cedeno@mckinley.com.
They visited a site that I watch over on 21 Nov and did nothing after
reading /robots.txt. The robots.txt is somewhat long, but not very
restrictive.
However, it seems to have gone ballistic today on another machine. As a
result I will be complaining. In this case it came from
radar.mckinley.com.
I suggest that other people check their logs and complain if necessary.
--
"You're obviously on drugs, Michael Van Biesbrouck
but not the right ones." ACM East Central Winning Team
-- bwross about mlvanbie http://csclub.uwaterloo.ca/u/mlvanbie/
From owner-robots Tue Dec 5 22:02:14 1995
Return-Path:
Received: by webcrawler.com (NX5.67f2/NX3.0M)
id AA25735; Tue, 5 Dec 95 22:02:14 -0800
Date: Tue, 5 Dec 1995 22:02:01 -0800
X-Sender: julian @best.com
Message-Id:
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: robots@webcrawler.com
From: julian@ugorilla.com (Julian Gorodsky)
Subject: Re: Returned mail: Service unavailable (HELP HELP!)
Sender: owner-robots
Precedence: bulk
Reply-To: robots@webcrawler.com
>The original message was received at Mon, 4 Dec 1995 20:35:54 -0800
>from julian.vip.best.com [206.86.2.106]
>
> ----- The following addresses had delivery problems -----
> (unrecoverable error)
>
> ----- Transcript of session follows -----
>... while talking to surfski.webcrawler.com.:
>>>> RCPT To:
><<< 554 ... 550 User unknown
>554