Google Crawling and Indexation 101

What Are Google Crawling and Indexing?
Google finds, explores, stores and sorts all the indexable pages of the Web, to make them findable through search. The process of discovery and exploration is called crawling. Google uses several robust programs called bots, or robots, to crawl the whole web. The main one is Googlebot, but there are other major ones, including the Google blog bot. The process of storing the web pages for referencing in search, and sorting them in some appropriate order, is called indexing.

Why Do They Matter?
If your site is not crawled properly or if its pages aren’t indexed, you will be unfindable on Google. An insufficient Google crawl rate and incomplete indexation are the scourge of many websites, especially large and new ones. In this forum and others, many members report indexation issues and ask how to solve them. The same is the case with SEO clients. My advice here is focused on Google, but the same or very similar general principles apply to crawling and indexation by Bing and Yahoo! as well.

Issues, Best Practices, Troubleshooting
First of all, Google indexation is hard to measure for a large site. False alarms are common when people rely on Google’s site: operator, which is supposed to report the site’s indexation count. It works well for small sites but is wildly unreliable for large ones and tends to severely underreport the count. Webmaster Tools is better for this, but possibly also unreliable. If your site is enormous, there is simply no certain way of knowing how many pages Google has indexed. For additional helpful data, check Google Analytics to see the total number of pages that have received visits. I also recommend that you manually run cache: checks of all your most important pages and of various random secondary pages to get a further idea of how your site is doing on the indexation front.
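
One rough way to combine these signals is sketched below in Python: compare the URLs you expect to be indexed (for example, those listed in your sitemap) with the URLs that received at least one visit according to an analytics export. The function name and the sample URLs are hypothetical. A page with no visits is only a candidate for a manual cache: check, not proof that it is missing from the index.

```python
def pages_without_visits(expected_urls, visited_urls):
    """Return expected URLs that never appear in the visited set, sorted."""
    # Ignore a trailing slash so that /widgets and /widgets/ count as one page.
    visited = {u.rstrip("/") for u in visited_urls}
    return sorted(u for u in set(expected_urls) if u.rstrip("/") not in visited)


# Hypothetical data: URLs from a sitemap and from an analytics export.
sitemap_urls = [
    "http://example.com/",
    "http://example.com/widgets",
    "http://example.com/widgets/blue",
]
analytics_urls = ["http://example.com", "http://example.com/widgets/"]

candidates = pages_without_visits(sitemap_urls, analytics_urls)
print(candidates)
```

The resulting list is a good starting point for the manual checks recommended above.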

The Google crawl rate cannot be reliably controlled, but it can be influenced by positive factors (listed here roughly in descending order of importance).
• Domain importance. Google’s Matt Cutts recently acknowledged, in an interview with Eric Enge, that your site’s crawl rate and depth of crawling are roughly proportional to its PageRank (PR). SEOs have long known this.
• Backlinks. PR is computed based on backlinks, which are absolutely central to indexation. If a site’s page count is growing fast but the site is not earning enough new links, this may suggest to Google that the content is of low quality (which is guaranteed to reduce your crawl and indexation rates).
• Deep Linking. Backlinks to individual pages (so-called “deep linking”) are an effective way to ensure the indexation of those pages and to keep them in the main Google index (as distinct from the supplementary index). Internal links to the same pages also help. Make sure that at least your most important pages get enough of both kinds of links. These need to be followed links (i.e. they should not contain the rel="nofollow" attribute).
• Site navigation and hierarchy. To the extent possible, a flat site hierarchy should be used. (An exemplary illustration is fanbase.com, with all the main categories appearing in the top-level navigation, enabling quick drilldown to individual pages.) This means (a) as few subdomains, subfolders and subdirectories as possible and (b) that all important pages must be reachable via the fewest clicks possible from the home page (more than 3-4 clicks is problematic).
• XML sitemaps. This is a must. One good tool for generating sitemaps is xml-sitemaps.com; there are others too. Submit your sitemaps to the search engines via webmaster tools. Further notes:
o Sitemaps generally support the <changefreq> and <priority> elements, whose use may influence the crawl, although the impact is likely to be minor.
o Check WMT for sitemap errors and fix them.
o Michael Gray has recommended creating small sitemaps (of 100 pages or fewer) to supplement your regular sitemaps, which can help get new content indexed faster. He has found using a dedicated sitemap for fresh content to be highly effective.
• In addition to sitemaps, you can use the “Fetch as Googlebot” feature in your Google Webmaster Tools: its effect on indexing can be similar to that of submitting a sitemap.
• Duplicate content reduction. In general, duplicate content on a site is not a significant problem and does not entail “Google penalties” even after the Panda updates, unless that duplicate content is spammy. Nevertheless, you should maintain a healthy economy and minimize duplicate content on your site. Especially on very large sites, high-volume duplicated content (identical pages sitting under different URLs) can confuse Google and impede proper indexing. One classic example of duplication occurs under different forms of site URLs: those that include the www. subdomain and those that don’t (e.g. www.example.com/file1.html and example.com/file1.html typically have the same content). The way to handle this and other kinds of duplication is via some form of URL canonicalization (see next item).
• URL canonicalization means creating a single SEO-friendly and user-friendly URL for each page and letting Google know that that URL is canonical. The SEO reasons for canonicalization are various and go beyond indexation issues: (1) Google, in spite of occasional denial, may assign less importance to pages whose URLs contain extra slashes (subdirectories); (2) Google may sometimes have difficulties with URLs that are parameter-laden; (3) long ugly URLs are a turnoff for site visitors; (4) a clear, well-structured, consistent URL convention is best for the user, for branding and for SEO; (5) canonicalization consolidates PageRank and link equity to the canonical version of the page, giving it a better chance to rank. Depending on your platform, various rewrite engines (see en.wikipedia.org/wiki/Rewrite_engine) can be used to automate the rewriting of URLs from “ugly” into friendly ones. URL canonicalization can be performed in any of 3 different ways:
o 301-redirect (“moved permanently”) of all duplicate URLs to the canonical. IMHO this is the most reliable method of canonicalization, but it may have certain overheads.
o rel="canonical": Place a link of the form <link rel="canonical" href="http://example.com/canonical-url-example.html"> in the <head> of each duplicate page. (Yes, it’s OK for the canonical version to include this link to itself; and no, there is no limit on how many canonical links you can have.)
o “Display URLs as”: the effect of this setting in the Google Webmaster Tools is similar to that of rel=”canonical” and is the easiest option if you prefer not to write any code.
• URL stability and page uniqueness. While the issues surrounding duplicate content are fairly well known, one potential problem that is rarely discussed is the opposite. The term I have coined for it is multitasking URLs. Some applications may display different dynamically generated content under the same URL (for example, content specific to the user’s geographical location). Additionally, the title tags for such pages may also be generated on the fly and may contradict one another. I have seen this lead to a variety of indexation and search issues. For best results, the content of each page, whether dynamic or static, must be unique and must appear under its proper, unique and stable URL and title tag.
• Unique title tags. If you use the same title tags across multiple pages, Google may assume that those pages are duplicate and be reluctant to index them. Make your titles unique.
• Manual crawl rate setting. Google’s Webmaster Tools offer a choice between letting Google determine the crawl rate automatically and setting it manually via a slide bar. Although setting it manually to max is unlikely to boost the crawl rate dramatically, it may bring about a marginal improvement.
• Original content. It’s good for all your important pages to have significant and unique original content.
• Updates, feeds, pinging. Frequent content updates both site-wide and on individual pages can significantly improve the crawl rate. Further, exporting RSS feeds and implementing automated search engine pinging have a beneficial effect. Pinging resources include pingomatic.com and pingler.com.
• Social Media. Links from social media, although they are nofollow, help Google discover and index new content. Including sharing buttons on your pages and promoting them on social media sites can help get your pages into the index faster.
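
As an illustration of the XML sitemaps item above, here is a minimal Python sketch that builds a small sitemap with <changefreq> and <priority> values, of the kind you might dedicate to fresh content. The function name, the page list and the example.com URLs are hypothetical; the namespace is the one defined by the sitemaps.org protocol. Treat it as a starting point under those assumptions, not as a replacement for a full sitemap generator.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML for (loc, changefreq, priority) tuples."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, changefreq, priority in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "changefreq").text = changefreq
        # The protocol expects priority between 0.0 and 1.0.
        ET.SubElement(url, "priority").text = f"{priority:.1f}"
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical fresh pages for a small, dedicated sitemap.
fresh_pages = [
    ("http://example.com/", "daily", 1.0),
    ("http://example.com/new-article.html", "hourly", 0.8),
]
xml_text = build_sitemap(fresh_pages)
print(xml_text)
```

Save the output as a file such as sitemap-fresh.xml (a made-up name) and submit it via webmaster tools alongside your regular sitemaps.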

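To find candidates for the canonicalization methods listed above, you can group the URL variants that probably point to the same page. The Python sketch below does this under simplifying assumptions of my own: it treats the www. and non-www forms as the same host, ignores query strings and fragments entirely, and ignores a trailing slash. These are not Google’s rules, and dropping the query string is only safe when your parameters do not change the page content. The function names and URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def canonical_form(url):
    """Reduce a URL variant to one simplified canonical form."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    # Keep "/" for the home page, drop the trailing slash elsewhere.
    path = parts.path.rstrip("/") or "/"
    # Assumption: query string and fragment never change the content.
    return urlunsplit((parts.scheme.lower(), host, path, "", ""))


def group_duplicates(urls):
    """Map each canonical form to its variants, keeping only real duplicates."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonical_form(url)].append(url)
    return {canon: found for canon, found in groups.items() if len(found) > 1}


crawled = [
    "http://www.example.com/file1.html",
    "http://example.com/file1.html?sessionid=42",
    "http://example.com/other.html",
]
duplicates = group_duplicates(crawled)
print(duplicates)
```

Each group it reports is a set of URLs that you would then 301-redirect to, or mark with rel="canonical" pointing at, the one version you choose as canonical.
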
Technical Note
The most important update to Google’s indexing system has been Google Caffeine, first launched in August 2009 and completing its search index on June 8, 2010. It replaced the old multi-layered static index with a system that crawls and indexes the web dynamically, in manageable segments, and practically in real time. Caffeine started paying attention to signals from Facebook and Twitter.

FURTHER DETAIL
I have dated most of the sources below. As far as I know, the information in them is still current and accurate. If you find new relevant information, please let me know and I’ll update this post.

Google’s official explanation of crawling and indexing basics.

Google’s official tips for troubleshooting crawling and indexing.
support.google.com/webmasters/answer/34441?hl=en

Matt Cutts covers the basics of crawling and indexing here and goes into some interesting details:
youtube.com/watch?v=KyCYyoGusqs
(April 2012)

It’s a common myth that you must disallow the crawling of your scripts and CSS. Matt Cutts says don’t disallow the crawling of your JavaScript and CSS:
youtube.com/watch?v=LW3pjQeCqqk
(March 2013)

Google officially announced on Nov 1, 2013 that it was starting to index mobile applications as websites.
googlewebmastercentral.blogspot.com/2013/10/indexing-apps-just-like-websites.html

And here is a Matt Cutts video in which he covers some aspects of mobile indexing, and puts to rest worries about “duplicate content” arising from the current state of mobile indexing.
youtube.com/watch?v=mY9h3G8Lv4k

Here he goes over the basics of crawling and covers details of the Google cache date:
youtube.com/watch?v=8lmZS7TknQc
(April 2011)