Getting Started

Introduction

Search engines generally go through two main stages to make content
available for users in search results: crawling and
indexing. Crawling is the process by which search engine crawlers access
publicly available webpages. In general, this involves fetching the pages
and following the links on them, just as a human user would. Indexing
involves gathering information about a page so that it can be made
available ("served") in search results.

The distinction between crawling and indexing is critical.
Confusion on this point is common and leads to webpages unexpectedly
appearing in, or missing from, search results. Note that
a page may be crawled but not indexed; and, in rare cases, it may be indexed
even if it hasn't been crawled. Additionally, in order to reliably prevent
indexing of a page, you must allow crawling or attempted crawling of the
URL, so that the indexing directive can be seen.

The methods described in this set of documents help you control
aspects of both crawling and indexing, so you can determine how
you would prefer your content to be accessed by crawlers as well as how you
would like it to be presented to users in search results.

In some situations, you may not want to allow crawlers to access areas
of a server. This could be the case if crawling those pages consumes
limited server resources, or if problems with the URL and linking structure
would create a practically infinite number of URLs were they all followed.

In other cases, it may be preferable to control how content is
indexed and made available in search results. For instance, you may not want
your pages to be indexed at all, or you may want them to appear without a
snippet (the summary of the page shown below the title in search results),
or you may not want users to be able to view a cached version of the page.

Warning:
Neither of these methods is suitable for controlling access
to private content. If content should not be accessible by the
general public, it's important that proper authentication mechanisms are in
place. Our Help Center has more information on blocking Google from accessing or showing private
content.

Note:
Pages may be indexed despite never having been crawled: the two
processes are independent of each other. If enough information is available
about a page, and the page is deemed relevant to users, search engine algorithms
may decide to include it in the search results despite never having had access
to the content directly. That said, there are simple mechanisms such as
robots meta tags to make sure that pages are not indexed.

Controlling crawling

The robots.txt file is a text file that allows you to specify how
you would like your site to be crawled. Before crawling a website,
crawlers will generally request the robots.txt file from the server.
Within the robots.txt file, you can include
sections for specific (or all) crawlers with instructions ("directives") that
let them know which parts can or cannot be crawled.

Location of the robots.txt file

The robots.txt file must be located at the root of the website host to
which it applies. For instance, in order to control crawling of all URLs
below http://www.example.com/, the robots.txt file must be
located at http://www.example.com/robots.txt. A robots.txt file
can be placed on subdomains (like
http://website.example.com/robots.txt) or on non-standard
ports (http://example.com:8181/robots.txt), but it cannot be
placed in a subdirectory (http://example.com/pages/robots.txt).
There are more details regarding the location in the
specifications.

Content of the robots.txt file

You can use almost any text editor to create a robots.txt file. The text
editor should be able to create standard ASCII or UTF-8 text files; don't use
a word processor (word processors often save files in a proprietary format and
can add unexpected characters, such as curly quotes, which may cause problems
for crawlers). As a general sketch, with placeholder directory and
crawler names, a robots.txt file might look like this:
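
# Allow all crawlers everywhere except the /internal/ directory
User-agent: *
Disallow: /internal/

# Additionally, keep one specific crawler out of a second directory
User-agent: ExampleBot
Disallow: /beta/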

Some sample robots.txt files

These are some simple samples to help you get started with robots.txt
files.

Allow crawling of all content

User-agent: *
Disallow:

or

User-agent: *
Allow: /

The sample above is valid, but in fact, if you want all of your content to
be crawled, you don't need a robots.txt file at all (and we recommend that
you don't use one). If you don't have a robots.txt file, verify that your
host returns a proper 404 "Not Found" HTTP status code when the robots.txt
URL is requested.
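
You can check this yourself with curl, for example (www.example.com is a
placeholder for your own host, and the response shown is illustrative):

curl -I "http://www.example.com/robots.txt"

HTTP/1.1 404 Not Found
...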

Disallow crawling of the whole website

User-agent: *
Disallow: /

Keep in mind that in some situations URLs from the website may still be indexed,
even if they haven't been crawled.

Disallow crawling of certain parts of the website

User-agent: *
Disallow: /calendar/
Disallow: /junk/

Remember that you shouldn't use robots.txt to block access to
private content: use proper authentication instead. URLs disallowed by the
robots.txt file might still be indexed without being crawled, and the robots.txt
file can be viewed by anyone, potentially disclosing the location of your
private content.

Using the robots meta tag
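
The robots meta tag is placed within the <head> section of an HTML page.
As a minimal sketch (the rest of the page is omitted), it might look like
this:

<!DOCTYPE html>
<html>
  <head>
    <meta name="robots" content="noindex">
    ...
  </head>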

In this example, the robots meta tag specifies that no search engine
should index this particular page (noindex). The name
robots applies to all search engines. If you want to block
or allow a specific search engine, you can specify its user-agent name in
place of robots.
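
For instance, assuming you want to address only Google's web crawler, you
could use its user-agent name, googlebot:

<meta name="googlebot" content="noindex">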

Using the X-Robots-Tag HTTP header

In some situations, non-HTML content (such as document files) can
also be crawled and indexed by search engines. In these cases, it's not
possible to add a meta tag to the individual pages; instead, an HTTP
header can be sent with the response. This header is not directly visible
to users, as it is not part of the page content itself.
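
How the header is added depends on your web server. As a sketch, on an
Apache server with the mod_headers module enabled, a configuration like
the following might add the header to all PDF files (the file pattern and
directive values are placeholders):

<Files ~ "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</Files>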

The X-Robots-Tag is included with the other HTTP response headers. You can
see these by checking the headers of a response, for example using
curl: