Thursday, December 14, 2006

Before you can understand how Google works, it helps to have a basic idea of how the Web works. When you visit a website, your browser contacts a web server, a computer whose job is to deliver web pages. When you click a link, your browser contacts the server and says, "Send me this page." The server takes the request and sends the page back to the browser, which displays it on your computer.

Key Term: Server. A computer whose job is to perform a specialized task and deliver information. For example, a web server serves up websites, while an email server sends or receives email.

Okay, now that you have that basic background down, let's see how Google works.

At a basic level, Google works just like other search engines. Like all of them, it is composed of three parts:

• A spider, also called a crawler. This spider "crawls" the Web and finds content on web pages.

• An indexer. This software takes all the information the spider gives it and creates a giant index that can be searched.

• A query engine. This is what takes your search request, sends it to the indexer, and reports the results to you.

Key Term: Search engine. A site that allows you to search the Web.

The Spider

The spider part of the Google search engine is an automated piece of software, also called a robot, that requests many thousands of pages from hundreds of websites simultaneously. When it finds links on a page, it follows them and requests those pages as well.

The main Google spider is Googlebot, which crawls the Web roughly once a month. Many sites change more often than that, so Google also has a crawler named FreshBot that crawls pages constantly.
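To make the link-following idea concrete, here is a minimal sketch of a crawler in Python. It is not Google's code. It assumes a tiny made-up "web" stored in a dictionary (`FAKE_WEB`, with hypothetical example URLs) instead of real HTTP requests, and it visits pages one at a time rather than requesting thousands simultaneously.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML. A real spider would fetch over HTTP.
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">B</a> <a href="http://c.example/">C</a>',
    "http://b.example/": '<a href="http://c.example/">C</a>',
    "http://c.example/": '<a href="http://a.example/">A</a>',
}


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch=FAKE_WEB.get):
    """Breadth-first crawl: request a page, then follow each link found on it."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page does not exist in our fake web
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # never request the same page twice
                seen.add(link)
                queue.append(link)
    return pages


crawled = crawl("http://a.example/")
print(sorted(crawled))
```

Starting from a single page, the crawler reaches every page that is linked, directly or indirectly, from the starting point. The `seen` set keeps it from looping forever when pages link back to each other.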

The Indexer

The spiders send information about all the pages they find to the indexer part of the search engine. The indexer then does a pretty amazing job: it creates an index of every word on every page sent to it by the Google spider. Not only does it index every word and every URL, it also keeps a record of where every word is on every page.

Multiple copies of this index are kept on various Google servers, because a single server wouldn't be able to keep up with all the search requests Google receives.
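An index that records every word, every URL, and every word position is usually called a positional inverted index. Here is a small sketch of one in Python. The function name `build_index`, the sample pages, and the simple word-splitting rule are all assumptions made for illustration, and nothing here reflects how Google actually stores its index.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each word to {url: [positions]} so presence and location are both recorded."""
    index = defaultdict(lambda: defaultdict(list))
    for url, text in pages.items():
        # Lowercase the text and split it into runs of letters and digits.
        for position, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index[word][url].append(position)
    return index


# Hypothetical page contents, standing in for what a spider would deliver.
sample_pages = {
    "http://a.example/": "Google crawls the web",
    "http://b.example/": "The web is big and the web is growing",
}

index = build_index(sample_pages)
print(dict(index["web"]))
```

Looking up a word returns every page that contains it along with the word's positions on that page, which is what lets a search engine later check things like whether a term appears near the top.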

The Query Engine

The only part of Google that you see is the query engine, and you only see part of that. It's the public face of Google: that inviting search box at the top of Google pages.

When you type a search term, a Google web server sends your request to the indexer, which is housed on multiple index servers. The index servers look through the index and match what they find with your request. They then send that information to document servers, which retrieve the correct information and format it so your browser can understand it. That formatted information is then sent to your browser.

And it all happens in a fraction of a second.
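The three hops described above (web server, index servers, document servers) can be sketched as three small Python functions. This is only an illustration under stated assumptions: the data in `INDEX` and `DOCUMENTS` is made up, the function names are hypothetical, and everything runs in one process rather than on separate machines.

```python
# Hypothetical data: a tiny prebuilt index (word -> document ids) and a document store.
INDEX = {
    "google": {1, 2},
    "spider": {2, 3},
    "index": {1, 3},
}
DOCUMENTS = {
    1: ("How Google indexes", "http://a.example/"),
    2: ("The Google spider", "http://b.example/"),
    3: ("Spiders and indexes", "http://c.example/"),
}


def index_lookup(query):
    """Index-server step: find ids of documents that contain every query word."""
    words = query.lower().split()
    if not words:
        return set()
    matches = [INDEX.get(word, set()) for word in words]
    return set.intersection(*matches)


def fetch_and_format(doc_ids):
    """Document-server step: retrieve each document and format it for display."""
    return [f"{DOCUMENTS[i][0]} <{DOCUMENTS[i][1]}>" for i in sorted(doc_ids)]


def search(query):
    """Web-server step: pass the request through both stages and return the result."""
    return fetch_and_format(index_lookup(query))


print(search("google spider"))
```

Splitting lookup from formatting mirrors the division of labor in the text: the index only has to answer "which documents match?", and the document store only has to answer "what should be shown for this document?".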

Google's Special Sauce

All this search engine logic is nothing new or revolutionary. This technology has been around for years, long before Google was a glimmer in its founders' eyes.

So why is Google so good at what it does?

Google uses better algorithms than any other search engine, and constantly refines them. Algorithms are sets of rules for performing a particular task. In Google's case, its algorithms are responsible for taking your search request and deciding which results to show you.

Key Term: Algorithm. A set of rules for performing a task. In Google's case, algorithms determine which pages it says match your search request.

Google's algorithms aren't particularly easy for mere mortals to understand, they're changing all the time, and they're not made public. Google uses more than 100 factors in its algorithms. For every search you do, it considers all of those factors and then calculates a score for every possible matching page. The page with the highest score is the first search result. The page with the second-highest score is the second search result, and so on.

Some of the metrics are fairly obvious: the search term needs to appear on a page, for example. Google's algorithms also factor in the number of times the term appears on a page, whether the term appears on a prominent part of a page, whether it appears in the title of a page, and many other factors.

None of this is particularly revolutionary, either. Many search engines do the same thing.
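Here is a rough sketch of that kind of rule-based scoring in Python. Google's real factors and weights are not public, so the three factors used here are just the ones named above, and the weights (1 point per occurrence, 5 for a title match, 2 for appearing in the first ten words) are invented purely for illustration.

```python
def score_page(term, title, body):
    """Combine a few illustrative factors into one score. The weights are made up."""
    term = term.lower()
    words = body.lower().split()
    count = words.count(term)
    if count == 0:
        return 0.0  # the term must appear on the page at all
    score = float(count)  # factor 1: number of occurrences
    if term in title.lower().split():
        score += 5.0  # factor 2: term appears in the title
    if term in words[:10]:
        score += 2.0  # factor 3: term appears early, a stand-in for "prominent"
    return score


def rank(term, pages):
    """Return page names ordered from highest score to lowest, dropping non-matches."""
    scored = [(score_page(term, title, body), name) for name, (title, body) in pages.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]


# Hypothetical pages: name -> (title, body text).
example_pages = {
    "a": ("Cooking pasta", "pasta recipes and more pasta"),
    "b": ("Travel notes", "we ate pasta in Rome"),
    "c": ("Gardening", "tomatoes and basil"),
}

print(rank("pasta", example_pages))
```

Each matching page gets a single number, and the results list is simply the pages sorted by that number, highest first.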

Google's real brilliance is in harnessing the collective intelligence of the Web to figure out what is truly relevant, instead of relying only on these kinds of rules. Google gives a great deal of weight to the number and kinds of pages that link to a web page. For example, Google figures that if a web page has many sites linking to it, the odds are very good that the page is an important one. And if important sites are linking to that page, it's even more important.

So Google calculates a page rank for each page, and that page rank becomes a very important part of the calculation as well. For example, suppose you do a search, and Google finds your search term five times on a page to which hardly any pages link, but three times on a page (such as one on the New York Times website) that has many sites linking to it. The more important page (the New York Times page) appears higher on the search results list, even though the search term appears on it less frequently.
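The five-versus-three example can be worked through with a simplified, textbook-style version of the link-ranking idea. This sketch is not Google's production algorithm. The link graph is made up, the damping value of 0.85 is just a commonly quoted choice, and multiplying the term count by the rank is an assumed combination rule chosen only to show the effect.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Simplified page rank: repeatedly pass each page's rank along its outgoing links.

    `links` maps each page to the list of pages it links to.
    Every page must appear as a key, even if its list is empty.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: several pages link to "news"; nothing links to "obscure".
links = {
    "news": ["blog1"],
    "blog1": ["news"],
    "blog2": ["news"],
    "blog3": ["news"],
    "obscure": ["news"],
}
term_counts = {"news": 3, "obscure": 5}  # occurrences of the search term on each page

ranks = page_rank(links)
# Assumed combination rule, for illustration only: term count times link-based rank.
scores = {page: term_counts[page] * ranks[page] for page in term_counts}
best = max(scores, key=scores.get)
print(best)
```

Even though the term appears more often on the "obscure" page, the heavily linked "news" page ends up with the higher combined score, which matches the behavior described above.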