How Google Index Is Updated

Updated on July 7, 2010

Search Engine Results Index Updates

Whether it is a webpage ranked number one in Google's search results or a tiny junk page languishing in a dark corner of the Internet on a test server, every entry in the SERPs comes from the same place: the index built by programs called spiders.

Search Engine Results Pages, or SERPs, are the result of ranking the billions of webpages found all across the web. Each page is discovered the same way: by small automated programs that follow links from here to there and from there to here. They follow links everywhere.

Getting indexed by the search engines will happen eventually, but making it happen faster requires some basic knowledge of where search engine results come from and how webpages and link hubs get found and indexed by Google and others.

When a user goes to Google.com or Bing.com and runs a search by typing keywords into the search box, the computers in the Google datacenters run an algorithm against an enormous database of all the webpages that have been indexed. The top results from that algorithm are reported on the first page of Google search results. The main goal of every search engine company is that the #1 result on any given search is the "best" or most relevant webpage for the specific search that was run.

To achieve this goal, the search engines must first build that database of webpages and all of the content on them so they can run their ranking algorithms against it. This database of online content is known as the index, and it is built by specialized programs called spiders.

How Indexing Search Spiders Work

Spiders are small automated bots that crawl across the Internet by following the links they encounter as they index individual webpages. For example, if a search engine spider were crawling the homepage of the freelance writing business Arctic Llama, it would (theoretically) follow all of the links from www.arcticllama.com. After following those links, the spider would read those webpages, add them to the index, and then follow all the links on each of those pages as well.

The idea is that eventually, the search engine spiders would find every webpage and website on the Internet, so long as at least one link somewhere out there points to the page.
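This crawl-by-following-links process is essentially a graph traversal. Here is a minimal sketch of the idea, using a hypothetical in-memory link graph in place of real pages (the page names below are invented for illustration; a real spider would fetch and parse live HTML):

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
WEB = {
    "www.arcticllama.com": ["page-a", "page-b"],
    "page-a": ["page-c"],
    "page-b": ["page-a"],
    "page-c": [],
    "orphan-page": [],  # no inbound links anywhere, so a spider never finds it
}

def crawl(start):
    """Follow every link reachable from `start`, indexing each page once."""
    index = set()
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if url in index:
            continue          # already indexed on an earlier visit
        index.add(url)        # "read" the page into the index
        queue.extend(WEB.get(url, []))  # then follow all of its links
    return index

print(sorted(crawl("www.arcticllama.com")))
# → ['page-a', 'page-b', 'page-c', 'www.arcticllama.com']
```

Note that `orphan-page` never makes it into the index: with no link pointing at it, the spider has no path to discover it, which is exactly the "at least one link somewhere" condition described above.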

In reality, the Internet is too big to be indexed in this manner in a useful way. Spiders following links from wherever they start until they find no more links would take decades to do their jobs and would eventually overwhelm not only the servers they were crawling, but also the search engines' own datacenters.

Instead, spiders like Google's search bots are programmed with limited intelligence and autonomy. They do go out and follow links while cataloging all the content and text they find on each webpage, but they do not continue down the linking superhighway forever. Instead, they skip some links entirely and follow others only one level deep.

That means that in order to get a webpage indexed, it has to land in the path of a running spider.

Spiders crawl frequently updated pages more often, so regular updates help get pages indexed. Spiders are also more likely to crawl pages with more incoming links, partly for purely mathematical reasons (the more links pointing at a page, the more likely one is to be followed at random) and partly because pages with more incoming links are seen as more important or authoritative.

When a spider has run its predetermined course, it terminates, allowing new spiders to be sent out to investigate what the original spider reported back during its run. The process repeats indefinitely, indexing the entire Internet along the way, if only for a few seconds at a time.