* Dominique Hazaël-Massieux wrote:
>- for the greater plan (e.g. site validation with the new v.w.o), I
>think it would be cool to start thinking to a way for a site to indicate
>to a particular agent what kind of crawling it accepts; I agree that
>extending robots.txt doesn't seem very reasonable, so we should start
>thinking to another way of doing it...
We should first figure out what the actual problem is. Site validation
for the Markup Validator would likely be a subscription service: you
log in, start a site validation, and come back later to see the results.
This would enable us to sleep between requests as much as we like. It
would solve the heavy load problem.
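
To illustrate the "sleep between requests" idea, here is a minimal
sketch in Python. The function name, the `fetch` callback and the
five-second default are assumptions made for the example, not part of
any existing validator code.

```python
import time
from collections import deque


def crawl_politely(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests.

    `fetch` is a caller-supplied function (hypothetical here) that
    retrieves and validates one document. `sleep` is injectable so the
    delay can be replaced in tests.
    """
    results = {}
    queue = deque(urls)
    while queue:
        url = queue.popleft()
        results[url] = fetch(url)
        if queue:
            # Only wait when more work remains; no pause after the last URL.
            sleep(delay_seconds)
    return results
```

Since the user comes back later for the results, the delay can be made
as long as the target site needs without anyone waiting on it.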
There are, however, other problems that this would not solve. For
example, a webmaster might not want someone else to run site validation
on their site. If that is a problem we want to solve, we need to figure
out how to determine whether someone is authorized to request site
validation for a site.
We could, for example, let the webmaster@ address opt in to such a
service. robots.txt and similar mechanisms would not really help here:
while the real geocities.com webmaster might want to use the feature, he
might not want to allow any normal geocities.com user to use it.
An alternative approach would be to limit how many times site validation
validates an individual document in a week. This is also difficult. For
example, how would that work for example.org/?SID=1234567890ABCDEF,
whose query part might vary for every request? We could ignore query
parts when determining URI equality, but that would make the service
quite useless for people using things like example.org/page.php?page=1.
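
The trade-off with per-document weekly limits can be sketched as
follows. This is only an illustration: `uri_key`, `WeeklyLimiter` and
the one-per-week default are hypothetical names and values, and the
http:// scheme is added so the URIs parse.

```python
from urllib.parse import urlsplit, urlunsplit

WEEK_SECONDS = 7 * 24 * 60 * 60


def uri_key(uri, ignore_query=True):
    """Reduce a URI to the key used for counting validations."""
    parts = urlsplit(uri)
    query = "" if ignore_query else parts.query
    # The fragment is always dropped; the host is compared case-insensitively.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", query, "")
    )


class WeeklyLimiter:
    """Allow at most `max_per_week` validations per URI key."""

    def __init__(self, max_per_week=1, ignore_query=True):
        self.max_per_week = max_per_week
        self.ignore_query = ignore_query
        self.seen = {}  # key -> timestamps of validations in the last week

    def allow(self, uri, now):
        key = uri_key(uri, self.ignore_query)
        recent = [t for t in self.seen.get(key, []) if now - t < WEEK_SECONDS]
        allowed = len(recent) < self.max_per_week
        if allowed:
            recent.append(now)
        self.seen[key] = recent
        return allowed
```

With `ignore_query=True` the session-ID URIs collapse into one key, but
so do page.php?page=1 and page.php?page=2, so the second page is refused
for a week. With `ignore_query=False` every new SID counts as a new
document and the limit is never reached.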
Or, instead of using email as the verification system, we could require
users to encode their validator.w3.org user name in the robots.txt file,
like
User-Agent: W3C-Markup-Validator/bjoern
Allow: /
or
User-Agent: W3C-SiteValid/c7713a0f32cd6bfb57b6142d80f5d7a1c73e2402
(where the ID is my email address as sha1_hex). If there is no entry for
the Markup Validator, it assumes Disallow: /. Of course, this would not
work for all users either.
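
A rough sketch of how such a check could look, assuming a deliberately
simplified robots.txt reader rather than a complete parser. The
function names and the example address are hypothetical; only the
User-Agent naming scheme and the default-deny rule come from the
proposal above.

```python
import hashlib


def user_token(email):
    """Hex SHA-1 digest of an email address, as sha1_hex would give."""
    return hashlib.sha1(email.encode("utf-8")).hexdigest()


def site_validation_allowed(robots_txt, agent):
    """Return True only if robots_txt opts in the given agent.

    No record naming the agent means "Disallow: /" (default deny).
    A record that names the agent but contains "Disallow: /" is also
    treated as a refusal. Path-specific rules are ignored here.
    """
    found = False
    in_record = False
    prev_was_agent = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            in_record = False  # a blank line ends the current record
            prev_was_agent = False
            continue
        field, _, value = line.partition(":")
        field = field.strip().lower()
        value = value.strip()
        if field == "user-agent":
            if not prev_was_agent:
                in_record = False  # a new record starts here
            if value.lower() == agent.lower():
                in_record = True
                found = True
            prev_was_agent = True
            continue
        prev_was_agent = False
        if in_record and field == "disallow" and value == "/":
            return False
    return found


# Example with a made-up address:
agent = "W3C-SiteValid/" + user_token("webmaster@example.org")
robots = "User-Agent: *\nDisallow: /private\n\nUser-Agent: " + agent + "\nAllow: /\n"
print(site_validation_allowed(robots, agent))
```

Note that the generic "User-Agent: *" record does not count as an
opt-in, so a site that has never heard of the service is left alone.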