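The three paragraphs discussed below refer to an example robots.txt file. Reconstructed from their descriptions, it would look something like this (an empty Disallow means nothing is disallowed):

User-agent: webcrawler
Disallow:

User-agent: googlebot
Disallow: /

User-agent: *
Disallow: /cgi-bin
Disallow: /log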

The first paragraph specifies that the robot called 'webcrawler' has nothing disallowed: it may go anywhere.

The second paragraph indicates that the robot called 'googlebot' has all relative URLs starting with '/' disallowed. Because every relative URL on a server starts with '/', this means the entire site is closed off to it.

The third paragraph indicates that all other robots should not visit URLs starting with /cgi-bin or /log. Note that '*' is a special token meaning "any other User-agent", i.e. any robot not matched by an earlier record; you cannot use wildcard patterns or regular expressions in either User-agent or Disallow lines.

Two common errors:

- Wildcards are not supported: instead of 'Disallow: /tmp/*', just write 'Disallow: /tmp'.
- You shouldn't put more than one path on a Disallow line (this may change in a future version of the spec).
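For instance, both errors appear in the first fragment below, and the second fragment shows the corrected form (the /tmp and /logs paths are only placeholders):

Wrong:

User-agent: *
Disallow: /tmp/* /logs

Right:

User-agent: *
Disallow: /tmp
Disallow: /logs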
Ultimately, without a robots.txt file on your servers/domains, you risk a variety of potential problems, including unauthorized access to your cgi directory, unauthorized viewing of your site stats, and accidental crawling of doorway pages that can look like spamming of the search engines.

One distinct advantage of having a robots.txt file on your server is that, quite simply, you will be able to tell when and where your site has been indexed or is about to be indexed: well-behaved robots automatically request the robots.txt file BEFORE any other page on your server, so as long as you keep an eye open for requests for this file in your logs, you can see who is knocking at your site for indexing purposes.
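One simple way to "keep an eye open" is to search your web server's access log for requests for robots.txt. This sketch assumes an Apache-style access log; the file name `access.log` and the sample entries are hypothetical, so adjust the path to wherever your server writes its logs:

```shell
# Create a small sample Apache-style access log (hypothetical entries, for illustration only)
cat > access.log <<'EOF'
66.249.66.1 - - [10/May/2024:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 52 "-" "Googlebot/2.1"
192.0.2.7 - - [10/May/2024:10:00:05 +0000] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"
EOF

# Show every request for robots.txt, i.e. every crawler announcing itself
grep "robots.txt" access.log
```

On a real server you would run the grep against your live log; each matching line tells you which IP and user agent came looking to index your site, and when.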

Below is a robots.txt example that you can copy and paste into a text document to use on your own server:

<!--Start Copy Below This Line-->

User-agent: *
Disallow: /cgi-bin
Disallow: /logs

<!--End Copy Above This Line-->

The above will allow all spiders to crawl your entire site except the subdirectories 'cgi-bin' and 'logs'; alter these paths to match any subdirectories you do not want the spiders to crawl on your server.

The robots.txt file is the main mechanism that tells web spiders and search robots whether they may crawl and cache your site. You can always use this file to allow bots to index your pages, or to keep them from doing so.