This specifically ruled out
special network-level protocols,
platform-specific solutions,
or changes to clients or servers.

Instead, the mechanism uses a specially formatted resource,
at a known location in the server's URL space.
In its simplest form the resource could be a text file
produced with a text editor, placed in the root-level server directory.
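
For concreteness, a minimal '/robots.txt' might look as follows;
the paths here are purely hypothetical:

    # keep all robots out of transient and script areas
    User-agent: *
    Disallow: /tmp
    Disallow: /cgi-bin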

This formatted-file approach satisfied the design considerations.
Administration was simple, because the format of the file was easy
to understand and required no special software to produce.
Implementation was simple, because the format was easy to parse
and apply.
Deployment was simple,
because no client or server changes were required.
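
To illustrate that simplicity, here is a minimal sketch in Python of
parsing and applying such a file; the function names and the handling
of records are my own illustration, not part of any specification:

    def parse_robots(text):
        """Map each user-agent token to its disallowed URL prefixes."""
        rules = {}
        agents = []
        for line in text.splitlines():
            line = line.split('#', 1)[0].strip()  # drop comments, whitespace
            if not line:
                agents = []                       # a blank line ends a record
                continue
            field, _, value = line.partition(':')
            field, value = field.strip().lower(), value.strip()
            if field == 'user-agent':
                agents.append(value)
                rules.setdefault(value, [])
            elif field == 'disallow' and agents:
                for agent in agents:
                    rules[agent].append(value)
        return rules

    def allowed(rules, agent, path):
        """Disallow the path if any prefix for the agent (or '*') matches."""
        prefixes = rules.get(agent, rules.get('*', []))
        return not any(p and path.startswith(p) for p in prefixes)

    rules = parse_robots('User-agent: *\nDisallow: /tmp\n')
    print(allowed(rules, 'MyRobot', '/tmp/scratch'))   # False

A robot would test each candidate URL path with allowed() before
retrieving it.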

Indeed the majority of robot authors rapidly embraced this proposal,
and it has received a great deal of attention in both Web-based
documentation and the printed press.
This in turn has promoted awareness and acceptance amongst users.

Problems and Feature Requests

In the years since the initial proposal,
a great deal of practical experience with the
SRE has been gained,
and a considerable number of suggestions for improvements or
extensions have been made.
They broadly fall into the following categories:

operational problems

general Web problems

further directives for exclusion

extensions beyond exclusion

I will discuss some of the most frequent suggestions
in that order,
and give some arguments in favour of or against them.

One main point to keep in mind is that it is difficult to gauge
how much of an issue these problems are in practice, and how
widespread support for extensions would be.
When considering further development of the SRE it is
important to guard against second-system syndrome.

Operational problems

These relate to the administration of the SRE, and as such
affect the effectiveness of the approach for its purpose.

Administrative access to the /robots.txt resource

The SRE specifies a location for the resource,
in the root level of a server's URL space.
Modifying this file generally requires administrative access
to the server, which may not be granted to a user who would
like to add exclusion directives to the file.
This is especially common in large multi-user systems.

It can be argued that this is not a problem with the SRE,
which after all does not specify how the resource is administered.
It is for example possible to programmatically collect individual users'
'~/robots.txt' files, combining them into a single '/robots.txt' file
on a regular basis.
How this could be implemented depends on the operating system,
server software, and publishing process.
In practice users find their administrators unwilling or
unable to provide such a solution.
This indicates again how important it is to stress simplicity:
even if the extra effort required is minuscule, requiring changes
to practices, procedures, or software is a major barrier to deployment.
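
As an illustration, a periodic job along the following lines could
rebase each user's private rules under their own part of the URL
space. The file locations and the '/~user' convention are assumptions
about one particular multi-user Unix setup, not part of the SRE:

    import glob

    def combine(pattern='/home/*/.robots.txt',
                target='/var/www/robots.txt'):
        with open(target, 'w') as out:
            out.write('User-agent: *\n')
            for path in sorted(glob.glob(pattern)):
                user = path.split('/')[2]   # '/home/alice/...' -> 'alice'
                with open(path) as f:
                    for line in f:
                        field, _, value = line.partition(':')
                        if (field.strip().lower() == 'disallow'
                                and value.strip()):
                            # rebase the prefix under the user's URL space
                            out.write('Disallow: /~%s%s\n'
                                      % (user, value.strip()))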

Suggestions to alleviate the problem include a CGI script
which combines multiple individual files on the fly, and listing multiple
referral files in the '/robots.txt' file, which the robot can retrieve
and combine.
Both options suffer from the same problem: some administrative access is
still required.

This is the most painful operational problem, and cannot be sufficiently
addressed in the current design.
It seems that the only solution is to move the robot policy closer to the user,
in the URL space they do control.

File specification

The SRE allows only a single method for specifying parts of the URL
space: by substring anchored at the front.
People have asked for substrings anchored at the end,
as in "Disallow: *.shtml",
as well as generalised regular expression parsing,
as in 'Disallow: *sex*'.

The issue with this extension is that it increases the complexity of both
administration and implementation. In this case I feel this may be justified.
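
As a sketch of what the extension might involve for implementors,
glob-style matching is readily available in most languages; in Python,
for example (the exact pattern semantics are my assumption of what such
an extension could specify):

    from fnmatch import fnmatchcase

    def disallowed(pattern, path):
        if '*' in pattern:
            return fnmatchcase(path, pattern)   # '*.shtml', '*sex*'
        return path.startswith(pattern)         # current SRE: prefix match

    print(disallowed('*.shtml', '/docs/page.shtml'))    # True
    print(disallowed('/users', '/users/m/index.html'))  # True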

Redundancy for specific robots

The SRE allows for specific directives for individual robots.
This may result in considerable repetition of rules common to all robots.
It has been suggested that an OO inheritance scheme could address this.
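
For example, in a file such as the following (robot names and paths are
merely illustrative), the shared rules must be written out once per
record:

    User-agent: WebCrawler
    Disallow: /tmp
    Disallow: /cgi-bin
    Disallow: /drafts

    User-agent: Lycos
    Disallow: /tmp
    Disallow: /cgi-bin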

In practice the per-robot distinction is not that widely used, and the
need seems to be sporadic. The increased complexity of both administration
and implementation seems prohibitive in this case.

Scalability

The SRE groups all rules for the server into a single file.
This doesn't scale well to thousands or millions of individually
specified URL's.

This is a fundamental problem, and one that can only be solved by
moving beyond a single file, and bringing the policy closer to the
individual resources.

Web problems

These are problems faced by the Web at large, which could be addressed
(at least for robots) separately using extensions to the SRE. I am
against following that route, as it fixes the problem in the wrong
place. These issues should be addressed by proper general solutions
separate from the SRE.

"Wrong" domain names

The use of multiple domain names sharing a logical network interface is
common practice (even without vanity domains), and it often leads to
problems with indexing robots, which may end up using an undesired
domain name for a given URL.

This could be addressed by adding a "preferred" address, or even by
encoding "preferred" domain names for certain parts of a URL space. This
again increases complexity, and doesn't solve the problem for non-robot
clients, which can suffer the same fate.

The issue here is that deployed HTTP software doesn't have a facility to
indicate the host part of the HTTP URL, and a server therefore cannot
use that to decide the availability of a URL. HTTP 1.1 and later address
this using a Host header and full URI's in the request line. This will
address this problem across the board, but will take time to be
deployed and used.

Mirrors

Some servers, such as "webcrawler.com", run identical URL spaces on several
different machines, for load balancing or redundancy purposes.
This can lead to problems when a robot uses only the IP address to uniquely
identify a server; the robot would traverse and list each instance of
the server separately.

It is possible to list alternative IP addresses in the /robots.txt
file, indicating equivalency.
However, in the common case where a single domain name is used for these
separate IP addresses, this information is already obtainable from the DNS.
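
A robot can recover that equivalence with a single lookup; a sketch in
Python (the host name merely illustrates a round-robin DNS setup):

    import socket

    # a name served round-robin resolves to several addresses at once
    name, aliases, addresses = socket.gethostbyname_ex('webcrawler.com')
    if len(addresses) > 1:
        print('%s has %d equivalent hosts: %s'
              % (name, len(addresses), ', '.join(addresses)))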

Updates

Currently robots can only track updates by frequent revisits. There
seem to be a few alternatives:
the robot could request a notification when a page changes,
the robot could ask for modification information in bulk,
or the SRE could be extended to suggest expiration times for URL's.

This is a more general problem, which ties in to caching and
link-consistency issues. I will not go into the first two options as they
do not concern the SRE. The last option would duplicate existing
HTTP-level mechanisms such as Expires, only because they are currently
difficult to configure in servers. It seems to me this is the wrong place
to solve that problem.

Further directives for exclusion

These concern further suggestions to reduce robot-generated problems
for a server. All of these are easy to add, at the cost of more
complex administration and implementation. They also bring up the
issue of partial compliance: not all robots may be willing or able
to support all of them. Given that the importance of these extensions
is secondary to the SRE's purpose, I suggest they be listed
as MAY or SHOULD options, not MUST.

Multiple prefixes per line

The SRE doesn't allow multiple URL prefixes on a single line,
as in "Disallow: /users /tmp". In practice people do this, so
the implementation (if not the SRE) could be changed to condone
this practice.
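
A tolerant parser need only split the directive's value on whitespace;
a minimal sketch of that lenient reading (illustrative, not specified
behaviour):

    # "Disallow: /users /tmp" -> ['/users', '/tmp'] under lenient parsing
    def prefixes(value):
        return value.split()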

Hit rate

This directive could indicate to a robot how long to wait between requests
to the server. Currently it is accepted practice to wait at least 30
seconds between requests, but this is too fast for some sites and too
slow for others.
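
Such a directive might look as follows; the name and units are
hypothetical, as no syntax has been agreed:

    User-agent: *
    HitRate: 60      # hypothetical: seconds between successive requests
    Disallow: /cgi-bin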

A limitation is that this would specify a value for the entire site,
whereas the value may depend on specific parts of the URL space.

Revisit frequency

This directive could indicate how long a robot should wait before revisiting
pages on the server.

A limitation is that this would specify a value for the entire site,
whereas the value may depend on specific parts of the URL space.

This appears to duplicate some of the existing (and future)
cache-consistency measures such as Expires.

Visit frequency for '/robots.txt'

This is a special version of the directive above, specifying how often
the '/robots.txt' file itself should be refreshed.

Again, Expires could be used to do this.

Visiting hours

It has often been suggested to list certain hours as "preferred hours"
for robot accesses. These would be given in GMT, and would typically
correspond to local low-usage times.

A limitation is that this would specify a value for the entire site,
whereas the value may depend on specific parts of the URL space.

Visiting vs indexing

The SRE specifies URL prefixes that are not to be retrieved. In practice
we find it is used both for URL's that are not to be retrieved and for
ones that are not to be indexed, and that the distinction is not explicit.

For example, a page with links to a company's employees' pages may not
be desirable to have in an index, whereas the employees' pages
themselves are; the robot should be allowed to recurse on the
parent page to get to the child pages and index them, without indexing
the parent.

This could be addressed by adding a "DontIndex" directive.
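
For the example above, that might look like this (the directive name and
paths are hypothetical):

    User-agent: *
    DontIndex: /people/index.html   # traverse and recurse, but don't index
    Disallow: /private              # don't retrieve at all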

Extensions beyond exclusion

The SRE's aim was to reduce abuses by robots, by specifying what is
off-limits.
It has often been suggested to add more constructive information.
I strongly believe such constructive information would be of immense
value, but I contest that the '/robots.txt' file is the best place for
it. In the first place, there may be a number of different schemes
for providing such information; keeping exclusion and "inclusion"
separate allows multiple inclusion schemes to be used, or the inclusion
scheme to be changed, without affecting the exclusion parts. Given the
broad debates on meta information this seems prudent.

Some readers may not be aware of ALIWEB, a separate pilot project I set
up in 1994, which used a '/site.idx' file in IAFA format as one way of
making such inclusive information available. A full analysis of ALIWEB
is beyond the scope of this document, but as it used the same concept as
'/robots.txt' (a single resource at a known URL), it shares many of the
problems outlined in this document. In addition there were issues with
the exact nature of the meta data, the complexity of administration, the
restrictiveness of the RFC822-like format, and internationalisation.
That experience suggests to me that this does not belong in the
'/robots.txt' file, except possibly in its most basic form: a list of
URL's to visit.

For the record, people's suggestions for inclusive information included:

list of URI's to visit

per-URL meta information

site administrator contact information

description of the site

geographic information

Recommendations

I have outlined the most common problems and missing features of the SRE.
I have also indicated that I am against most of the extensions to the
current scheme, because of increased complexity, or because the
'/robots.txt' file is the wrong place to solve the problem.
Here is what I believe we can do to address these issues.

Moving policy closer to the resources

To address the issues of scaling and administrative access, it is clear
we must move beyond a single resource per server. There is currently
no effective way in the Web for clients to consider collections
(subtrees) of documents together. Therefore the only option is to associate
policy with the resources themselves, i.e. the pages identified by a URL.

This association can be done in a few ways:

Embedding the policy in the resource itself

This could be done using the META tag, e.g.
<META NAME="robotpolicy" CONTENT="dontindex">.
While this would only work for HTML,
it would be extremely easy for a user to add this information
to their documents.
No software or administrative access is required for the user,
and it is easy to support in the robot.
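
To indicate how little robot-side support is required, here is a sketch
using Python's standard HTML parser; the "robotpolicy" name and
"dontindex" value are the proposed convention from the example above,
not an established standard:

    from html.parser import HTMLParser

    class PolicyScanner(HTMLParser):
        def __init__(self):
            super().__init__()
            self.policy = None

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == 'meta' and attrs.get('name', '').lower() == 'robotpolicy':
                self.policy = attrs.get('content', '').lower()

    scanner = PolicyScanner()
    scanner.feed('<HTML><HEAD>'
                 '<META NAME="robotpolicy" CONTENT="dontindex">'
                 '</HEAD></HTML>')
    print(scanner.policy)   # 'dontindex'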

Embedding a reference to the policy in the resource

This could be done using the LINK tag, e.g.
<LINK REL="robotpolicy" HREF="public.pol">.
This would give the extra flexibility of sharing a policy
among documents, and of supporting different policy encodings
which could move beyond RFC822-like syntax.
The drawbacks are increased traffic (though regular caching helps)
and complexity.

Using an explicit protocol for the association

This could be done using PEP, in a similar fashion to PICS.
It may even be possible or beneficial to use the PICS
framework as the infrastructure, and express the policy
as a rating.

Note that this can be deployed independently of, and used together
with, a site's '/robots.txt'.

I suggest the first option should be an immediate first step,
with the other options possibly following later.

Meta information

The same three approaches can be used for descriptive META information:

Embedding the meta information in the resource itself

This could be done using the META tag, e.g.
<META NAME="description" CONTENT="...">.
The nature of the META information could be the Dublin Core
set, or even just "description" and "keywords".
While this would only work for HTML,
it would be extremely easy for a user to add this information
to their documents.
No software or administrative access is required for the user,
and it is easy to support in the robot.

Embedding a reference to the meta information in the resource

This could be done using the LINK tag, e.g.
<LINK REL="meta" HREF="doc.meta">.
This would give the extra flexibility of sharing meta information
among documents, and of supporting different meta encodings
which could move beyond RFC822-like syntax (and which could even be
negotiated using HTTP content negotiation).
The drawbacks are increased traffic (though regular caching helps)
and complexity.

Using an explicit protocol for the association

This could be done using PEP, in a similar fashion to PICS.
It may even be possible or beneficial to use the PICS
framework as the infrastructure, and express the meta information
as a rating.

I suggest the first option should be an immediate first step,
with the other options possibly following later.

Extending the SRE

The measures above address some of the problems in the SRE in a more
scalable and flexible way than adding a multitude of directives to
the '/robots.txt' file.

I believe that of the suggested additions, the following would have the
most benefit without adding complexity:

PleaseVisit

To suggest relative URL's to visit on the site
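
In a '/robots.txt' file this might look as follows (hypothetical
syntax):

    User-agent: *
    Disallow: /cgi-bin
    PleaseVisit: /new/   # hypothetical: a good starting point for indexing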

Standards...

I believe any future version of the SRE should be documented
either as an RFC or as a W3C-backed standard.