This document represents some informal extensions that have yet to be agreed
upon. A preliminary version of this document was posted to the robots mailing
list (robots-request@webcrawler.com). This document is based upon that
preliminary version.

This is not an official standard backed by a standards body, or owned by any
commercial organization. It is not enforced by anyone, and there are no
guarantees that all current and future robots will use it. These are just
proposed extensions to the current robot
exclusion standard.

For some time now, it has been apparent that the current robot exclusion
standard is deficient in giving the administrators of web servers control
over where robots are and are not allowed to visit. There is also no
mechanism in place stating what times are good for robots to visit, nor a
mechanism stating how fast a robot can safely pull documents, in addition to
the reasons stated in the original robot exclusion standard.

This document proposes such extensions to the robot exclusion standard. The
original standard is referred to as Version 1.0, while the extensions
proposed here are referred to as Version 2.0 of the robot exclusion
standard.

More information about robots in general can be found on the World Wide Web
Robots, Wanderers, and Spiders page.

The following is taken verbatim from the original robot exclusion standard.
That standard covers Version 1.0 of the robot exclusion standard, while this
document covers Version 2.0. The method hasn't changed between Version 1.0
and Version 2.0, though.

The method used to exclude robots from a server is to
create a file on the server which specifies an access
policy for robots.
This file must be accessible via HTTP on the local URL
"/robots.txt".
The contents of this file are specified below.

This approach was chosen because it can be easily
implemented on any existing WWW server, and a robot can find
the access policy with only a single document retrieval.

A possible drawback of this single-file approach is that only a
server administrator can maintain such a list, not the
individual document maintainers on the server. This can be
resolved by a local process to construct the single file
from a number of others, but if, or how, this is done is
outside of the scope of this document.

The choice of the URL was motivated by several criteria:

The filename should fit in file naming restrictions of all
common operating systems.

The filename extension should not require extra server
configuration.

The filename should indicate the purpose of the file
and be easy to remember.

Any text following a "#" up to the end of a line is to be ignored. The "#"
character can appear anywhere in a line, but only after a blank (or
whitespace) character, or at the start of a line. Some
examples:

# this is a comment line
#so is this
User-agent: fredsbot # this is another comment
Disallow: * # we don't like this bot

<general> - Version 1.0 general string match format.

The general match is included for compatibility with Version 1.0 of the
robot exclusion standard. General matches do not contain regular expression
characters, but are treated as if the character "*", which matches zero or
more characters, appears at the end of the string. An example would be
/helpme, which is to be treated as /helpme*.
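
For instance (a sketch; the robot name is reused from the examples later in
this document, and the paths come from the fictitious site described there),
under Version 1.0 the following general match blocks /order.html,
/order.shtml and /order.cgi alike:

User-agent: fredsbot
Disallow: /order    # treated as /order* - matches /order.html, /order.shtml, /order.cgi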

This exists solely for Version 1.0 compatibility, and its usage can be
determined by context.

<explicit> - Version 2.0 explicit string match format.

An explicit match contains no regular expression characters, and any string
to be matched using an explicit string match must match all the characters
present exactly.
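
As a sketch (assuming, per the compatibility note above, that a plain string
in a Version 2.0 rule set is taken as an explicit match; the robot name is
hypothetical):

User-agent: examplebot
Robot-version: 2.0
Disallow: /order.html    # explicit match - blocks /order.html only, not /order.shtml or /order.cgi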

where <data> depends upon the directive, and items in "[" and "]" are
optional. Unless otherwise noted, each directive can appear more than once in
a given rule set. The following directives are defined for Version 2.0.

The version is a two part number, separated by a period.

The first number indicates major revisions to the
robots.txt standard. The second number represents
clarifications or fixes to the robots.txt standard.
Valid numbers for the first number are 1 and 2.

My intent is to follow the Linux Kernel numbering convention for the second
number and have even numbers be stable (or agreed upon) standards, and odd
numbers be experimental, with possible differing interpretations of headers.

This will follow the User-agent: header. If it does
not immediately follow, or is missing, then the robot
is to assume the rule set follows the Version 1.0
standard.

Only one Robot-version: header per
rule set is allowed.

A version number of 1.0 is allowed.

When checking the version number, a robot can assume (if the second digit is
even) that a higher version number than the one it is looking for is okay
(i.e. if a robot is looking for version 2.0 and comes across 2.2, then it can
still use the rule set).

If a robot comes across a lower version number, then it
will have to correctly parse the headers according to
that version.

If a robot comes across an experimental version number, it should probably
ignore that rule set and use the default.

It has been suggested that the version number present is more
for documentation purposes than for content negotiation. This
is still being decided, but a version number should
be included.
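
A sketch of the placement described above (the robot name is hypothetical):

User-agent: examplebot
Robot-version: 2.0    # must immediately follow the User-agent: header(s)
Disallow: *.cgi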

See the new RFC Draft for the Version 1.0 behavior, except to note that a
general match can be turned into a regular expression match by adding a "*"
to the end of the string.

Version 2.0

Pending discussion, Version 2.0 semantics
of this directive may not be implemented.

This directive (if included) and the Disallow:
directive are to be processed in the order they appear in the
rule set. This is to simplify the processing, avoid ambiguity
and allow more control over what is and isn't allowed.

If a URL is not covered by any allow or disallow rules, then
the URL is to be allowed (as per the Version 1.0
spec).

An explicit match string has the highest precedence and
grants the robot the explicit permission to retrieve the
URL stated.

A regular expression has the lowest precedence, and grants the robot
permission to retrieve a matching URL only if the disallow rules do not
filter out that URL (see Disallow).

If there are no disallow rules, then the robot is only
allowed to retrieve the URLs that match the explicit
and/or regular expressions given.
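
A sketch of these precedence rules (the robot name is hypothetical; the paths
come from the fictitious site used in the examples below):

User-agent: examplebot
Robot-version: 2.0
Allow: /blackhole/index.html  # explicit - always retrievable
Allow: *.html                 # regular expression - only where not disallowed
Disallow: /blackhole/*        # keeps the rest of /blackhole out of the *.html allow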

See the current robots.txt standard for the Version 1.0 behavior, except to
note that a general match can be turned into a regular expression match by
adding a "*" to the end of the string.

Version 2.0

Pending discussion, Version 2.0 semantics of
this directive may not be implemented.

This directive and the Allow:
directive (if included) are to be processed in the order they
appear in the
rule set. This is to simplify the processing, avoid ambiguity
and allow more control over what is and isn't allowed.

If a URL is not covered by any allow or disallow rules, then
the URL is to be allowed (as per the Version 1.0
spec).

Any URL matching the explicit match or the wildcard/regular expression is not
to be retrieved.

If there are no allow rules, then any URL not matching the
rule(s) can be retrieved by the robot.

If there are allow rules, then explicit allows have a higher
precedence than a disallow rule. Disallow rules have a
higher precedence than regular expression allow rules.
Any URL not matching the disallow rules has to then pass
(any) regular expression allow rules. If there are
no allow rules, then anything not covered
by the disallow rule set is allowed.
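
For example (a sketch; the robot name is hypothetical), a rule set with only
disallow rules lets the robot retrieve anything that does not match them:

User-agent: examplebot
Robot-version: 2.0
Disallow: *.cgi          # no CGI scripts
Disallow: /blackhole/*   # nothing under /blackhole - everything else is fine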

The robot is requested to only visit the site between the
given times. If the robot visits outside of this time,
it should notify its author/user that the site only
wants it between the times specified.

More than one can appear in a rule set, allowing several
windows of access to a robot.
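
A sketch with two access windows (times are given in UT, as in the examples
later in this document; the robot name is hypothetical):

User-agent: examplebot
Robot-version: 2.0
Visit-time: 0600-0845    # please visit only between 06:00 and 08:45 UT
Visit-time: 2300-0130    # or between 23:00 and 01:30 UT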

These are comments that the robot is encouraged to send
back to the author/user of the robot. All Comment:'s
in a rule set are to be sent back (at least, that's the
intention). This can be used to explain the robot policy
of a site (say, that one government site that hates
robots).
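
A sketch (the robot name and wording are illustrative):

User-agent: examplebot
Robot-version: 2.0
Comment: Robots are welcome here, but please honor the visit time below.
Visit-time: 0600-0845
Disallow: /blackhole/*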

An empty /robots.txt file has no associated semantics; it will be treated as
if it were not present, i.e. all robots will consider themselves welcome.

Please note that these examples use the Allow: and
Disallow: directives as defined in this document. These directives
may or may not be in the final draft as defined here.

The following examples are based upon a fictitious site called www.frommitz.biz
with the following structure:

/index.html
/images/
        index.html
        fromlogo.jpg
        navbar.jpg
        blueball.gif
        redball.gif
        usamap.gif
        portrait.jpg
/products.html
/order.html
/order.shtml
/order.cgi
/blackhole/
        index.html
        info98.html
        info98.shtml
        info99.html
        info99.gif
        info8.html
        page3.html
        info/
                index.html
                page1.html
                page2.shtml
                page4.html
                thankyou.html
/overview.html
/thankyou.html

Given the lack of the Robot-version: directive, the following rule set
automatically defaults to the Version 1.0 robot exclusion standard.
Therefore, the only files that fredsbot, pandabot, and chives will be able to
retrieve will be /index.html, /products.html, /overview.html and
/thankyou.html. Everything else is off limits to these robots.
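
A rule set along the following lines would produce that behavior (this is a
sketch; the exact directives are an assumption):

User-agent: fredsbot
User-agent: pandabot
User-agent: chives
Disallow: /images     # covers everything under /images/
Disallow: /order      # covers /order.html, /order.shtml and /order.cgi
Disallow: /blackhole  # covers everything under /blackhole/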

Again, due to the lack of the Robot-version: directive, the following rule
set follows the Version 1.0 robot exclusion standard. Note that if we only
want certain pages indexed, we need to explicitly exclude almost everything
that isn't allowed. As the site grows, this can create large rule sets which
may break certain robots.

#------------------------------------------------------------------
# The following robot only understands the 1.0 spec,
# but since the search engine it represents is really popular, we want more
# pages to be indexed. That means we need a longer rule set for this
# particular robot.
#-------------------------------------------------------------------
User-agent: popularsite
Disallow: /images
Disallow: /order.shtml
Disallow: /order.cgi
Disallow: /blackhole/info98.shtml
Disallow: /blackhole/info99.html
Disallow: /blackhole/info99.gif
Disallow: /blackhole/info8.html
Disallow: /blackhole/info/page2.shtml
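
The next rule set is for three robots that understand the Version 2.0 spec
(a sketch; the exact directives and their order are assumptions based on the
discussion that follows):

#------------------------------------------------------------------
# these robots understand the 2.0 spec, so a shorter, more general
# rule set will do
#------------------------------------------------------------------
User-agent: alfred
User-agent: newchives
User-agent: oscarbot
Robot-version: 2.0
Allow: /images/index.html # explicitly allow this one page under /images
Disallow: /images/*       # but nothing else under /images
Disallow: *.shtml         # no server side include files
Disallow: *.cgi           # no CGI scripts
Allow: *.html             # any other HTML document is fine
Disallow: *               # and nothing else (images included)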

While this is a slightly larger rule set than the last example, it reasonably
covers all the cases, so that as the site grows, this rule set doesn't have
to grow with it. Also note that alfred, newchives and oscarbot are allowed to
retrieve /images/index.html, since it is explicitly stated, but other
references under /images are not allowed. Also, any references to server side
include files are not allowed, nor are any images or CGI scripts.

The following robot is instructed to only retrieve HTML documents (and only
HTML documents) between the hours of 6:00 am and 8:45 am UT (GMT), which, in
this example, is 1:00 am to 3:45 am EST (the location of the fictitious web
site).

#------------------------------------------------------------------------
# the following robot also understands the 2.0 spec, but
# we want to limit when it can visit the site
#------------------------------------------------------------------------
User-agent: suckemdry
Robot-version: 2.0
Allow: *.html # only allow HTML pages
Disallow: * # and nothing else
Visit-time: 0600-0845 # and then only between 1 am to 3:45 am EST

The following robots can retrieve any HTML document but, depending upon the
time they visit, are limited in how fast they may retrieve the documents.
Also, a comment is given explaining why they're being limited the way they
are.

#-----------------------------------------------------------------------
# okay robots - but since they seem to keep trying over and over again,
# let's limit them and attempt to keep them accessing us during slow
# times.
#------------------------------------------------------------------------
User-agent: vacuumweb
User-agent: spanwebbot
User-agent: spiderbot
Robot-version: 2.0
Request-rate: 1/10m 1300-1659 # 8:00 am to noon EST
Request-rate: 1/20m 1700-0459 # noon to 11:59 pm EST
Request-rate: 5/1m 0500-1259 # midnight to 7:59 am EST
Comment: because you guys try all the time, I'm gonna limit you
Comment: to how many documents you can retrieve. So there!
Allow: *.html
Disallow: *

The next example states that no robot should visit anything on the site. This
follows Version 1.0 of the spec and all robots should understand this
format.
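
A rule set for this, in the standard Version 1.0 form, would be:

User-agent: *
Disallow: /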

Sean Conner <sean@conman.org>
is now an independent software programmer. Back in 1996 when this document
was being developed he was the Vice President of Research and Development for
Armigeron Information Services, Inc.,
which allowed him the time and resources to produce and host this
document.