Google and Yahoo Search Engine Technology Comparison

With so much talk about relevance these days, I thought I’d introduce you to some of the technology behind the search engines, and what the potential differences between them could be. There are some interesting takes on search technology, from pre-ranking results on the fly to neural networks to community-based searching.

Comparing the ‘Big Four’

In these articles, I will focus on ‘The Big Four.’ These are the engines considered to own the search space. They are Google, Yahoo!, MSN, and Ask Jeeves. First up will be Google and Yahoo!

Google – Google is probably the most well-known search engine. When they launched they were considered the most relevant.

How Google Works

They determined relevancy primarily with their PageRank algorithm. PageRank essentially says that a site with more inbound links than its competitors is likely a better site and therefore should rank higher, with links from pages that themselves rank well counting for more. Webmasters soon realized this, and also realized that all they had to do to rank highly was build more links than their competitors. Google of course reacted by changing the ranking algorithm somewhat. Now there are elements of authority and relevancy applied to the PageRank algorithm.
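To make the PageRank idea concrete, here is a minimal sketch in Python of the published, textbook form of the calculation. It is not Google's production code. The function name pagerank, the damping value of 0.85 (the commonly cited one), the iteration count, and the four-page toy web are all my own assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by repeated redistribution of rank.

    `links` maps each page to the list of pages it links to.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with rank spread evenly over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its rank over everyone.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy web: three pages link to "hub"; nothing links to "c".
web = {
    "hub": ["a", "b"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
}
ranks = pagerank(web)
```

In this toy web, "hub" ends up with the highest value because it has the most inbound links, and "c" ends up with the lowest because nothing links to it. That is also why simply piling up links worked so well for webmasters in the early days.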

Google employs thousands of servers to calculate these rankings. They look at hundreds of factors – both on the page and off the page (such as inbound links). They use hundreds of algorithms to perform these calculations. Essentially there should be one algorithm per factor. The algorithms weight the pages, and assign their values. These values are then stored for later use.

When a user performs a query, yet another set of algorithms weighs the previously calculated values against one another to determine overall relevance. Results are then sent to the user’s browser.
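The split between values calculated ahead of time and the weighing done at query time can be sketched roughly as follows. This is only my guess at the general shape, not Google's actual design. The factor names (links, title, body), the weights, the URLs, and the STORED table are all hypothetical.

```python
# Hypothetical values calculated offline and stored, keyed by search term.
# Each page has one stored value per ranking factor.
STORED = {
    "dog": {
        "pets.example/dogs": {"links": 0.9, "title": 1.0, "body": 0.6},
        "blog.example/my-dog": {"links": 0.2, "title": 1.0, "body": 0.9},
        "shop.example/leashes": {"links": 0.6, "title": 0.0, "body": 0.3},
    },
}

# Hypothetical weights: how much each factor counts at query time.
WEIGHTS = {"links": 0.5, "title": 0.3, "body": 0.2}


def search(term, weights=WEIGHTS, top_n=10):
    """Combine the stored factor values into one score per page and sort."""
    candidates = STORED.get(term, {})

    def score(item):
        _url, factors = item
        return sum(weights[name] * factors.get(name, 0.0) for name in weights)

    ordered = sorted(candidates.items(), key=score, reverse=True)
    return [url for url, _factors in ordered[:top_n]]


results = search("dog")
```

The expensive work (filling in STORED) happens before anyone searches. The query itself only looks up stored numbers and does a little arithmetic, which is one plausible reason results can come back so quickly.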

As one can imagine, the processing power this requires must be huge. In addition, given how fast Google returns results, not much of the data can be coming off the hard drives of the individual servers at query time. Therefore, one must assume that most of the Google index, or at least the part that is served to users, resides in memory.

The next time you perform a search, look at how quickly Google returns results. I searched for “serach engine” (I intentionally misspelled it) and it returned 68,900 results. In addition, the engine returned some sponsored results across the side of the page, as well as a spelling suggestion. All in 0.36 seconds.

And for popular queries the engine is even faster. For example, searches for Hurricane Katrina or MTV awards (both recent events) took less than 0.2 seconds each.

And Google is famous for decentralization and redundancy. For every single cached page there are likely 2-3 copies stored, perhaps even more. Google breaks the index into very small parts, as small as 2 megabytes each, and these 2 megabyte sections are stored all over the Google infrastructure. Each 2 megabyte section may be stored next to an unrelated section. For example, there may be a few pages from a pet site next to pages from a blog, next to pages from an e-commerce site.
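Here is a rough sketch of what cutting an index into 2 megabyte sections and storing three copies of each might look like. Nobody outside Google knows how they actually place the data, so I have swapped in a simple hash-based placement in the style of rendezvous hashing. The function names, page sizes, and server names are hypothetical.

```python
import hashlib

SHARD_BYTES = 2 * 1024 * 1024  # the 2 megabyte section size described above


def split_into_shards(pages, shard_bytes=SHARD_BYTES):
    """Pack (url, size_in_bytes) pairs into sections, in arrival order."""
    shards, current, used = [], [], 0
    for url, size in pages:
        # Start a new section when the next page would overflow this one.
        if current and used + size > shard_bytes:
            shards.append(current)
            current, used = [], 0
        current.append(url)
        used += size
    if current:
        shards.append(current)
    return shards


def place_shard(shard_id, servers, replicas=3):
    """Choose `replicas` distinct servers for a section by hashing."""
    def weight(server):
        key = f"{shard_id}:{server}".encode()
        return hashlib.sha256(key).hexdigest()

    # Rank every server by its hash for this section and take the first few.
    return sorted(servers, key=weight)[:replicas]


# Hypothetical pages from unrelated sites, arriving in crawl order.
pages = [
    ("pets.example/a", 900_000),
    ("blog.example/b", 900_000),
    ("shop.example/c", 900_000),
]
servers = [f"server-{i}" for i in range(10)]

shards = split_into_shards(pages)
placement = {i: place_shard(i, servers) for i in range(len(shards))}
```

Because sections are filled in whatever order pages arrive, a pet site and a blog can easily end up in the same section. And because each section lives on three different machines, losing one machine does not lose any part of the index.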

While each data center acts independently of the others, there is likely some overlap in tasks.

Imagine a room with thousands of computers running in unison with each other. Now imagine that same room copied over and over to all the other data centers spread out throughout North America.

It is because of these different data centers, each acting separately, but with the same end goal, that we used to experience the “Google Dance” monthly. The Google Dance was that period of time when Google would update their search results across the data centers. Further, each data center would update on its own, so pages that may have ranked #1 in one data center may not have appeared in the top 30 on other data centers.

Of course the factors Google has used to rank pages have changed over time. They are placing less emphasis on PageRank, but it is still important. It’s important to note that moving different factors around within the calculation can greatly impact a site’s rankings. For example, if the site has a high PageRank but a low keyword density, it may rank #1 if PageRank affects the calculation later; however, the site may disappear from the results if PageRank is considered earlier.

And this is probably what is happening now: Google has essentially moved the PageRank factor to somewhere else in the final calculation. Remember, there are likely hundreds of factors affecting rankings. Rearranging the order in which they are applied to the final rankings can have a dramatic impact on overall placement on the search results page.
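To show how the order of the factors alone can move a site, here is a toy two-stage model in Python. It is purely my own assumption about how staging might work, not a description of Google's calculation. The first factor trims the field, and the second factor decides the order the user sees. The site names, factor values, and the keep and visible cutoffs are all made up.

```python
# Hypothetical pages: "target.example" has high PageRank, low keyword density.
PAGES = {
    "target.example": {"pagerank": 0.9, "keyword_density": 0.2},
    "rival-a.example": {"pagerank": 0.5, "keyword_density": 0.8},
    "rival-b.example": {"pagerank": 0.4, "keyword_density": 0.7},
    "rival-c.example": {"pagerank": 0.3, "keyword_density": 0.6},
    "rival-d.example": {"pagerank": 0.6, "keyword_density": 0.1},
}


def staged_rank(pages, early_factor, late_factor, keep=4, visible=2):
    """Trim the field by the early factor, then order it by the late factor."""
    # Stage 1: only the best `keep` pages on the early factor survive.
    survivors = sorted(
        pages, key=lambda url: pages[url][early_factor], reverse=True
    )[:keep]
    # Stage 2: the late factor decides the order among the survivors.
    ordered = sorted(
        survivors, key=lambda url: pages[url][late_factor], reverse=True
    )
    # Only the top `visible` results are shown to the user.
    return ordered[:visible]


pagerank_late = staged_rank(PAGES, "keyword_density", "pagerank")
pagerank_early = staged_rank(PAGES, "pagerank", "keyword_density")
```

With PageRank applied late, target.example survives the first cut and comes out on top. With PageRank applied early, the same site survives the first cut but is then ordered by keyword density, where it loses, and it drops out of the visible results entirely. Nothing about the site changed, only the order of the calculation.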

Google also appears to have moved from a once per month update to a more perpetually updating index. We only rarely notice the changes happen, but they do happen on a more incremental level, with more major updates happening less frequently.

I guess one could view Google as a series of layers – each layer building on the work performed by the layer before. The uppermost layer is the only one we are exposed to via the browser, however that page that you see would not exist without the work performed by the lower layers.

Now, Let’s Look at Yahoo

Yahoo! – While no one other than Yahoo!’s engineers knows for sure, we can speculate that Yahoo! search technology works very similarly to Google’s.

The reason Yahoo! is so difficult to gauge is that they haven’t really built a search engine from the ground up like Google or MSN. Of course the Yahoo! search you see is unique unto itself; however, Yahoo! has built its search on the backs of other technologies they have purchased in previous years.

It was just around Christmas 2002 when Yahoo! purchased search service Inktomi. Up until then Yahoo! had received their search results either from Inktomi or more recently Google. In fact, up until the time they purchased Inktomi there was speculation that Yahoo! would buy Google.

It was just a few months after this that Overture (a pay-per-click advertising company) purchased AltaVista, one of the first and strongest search engines out there. Then, just a few weeks after that, Overture purchased Alltheweb.com from FAST.

It was clear that Overture was going to move into the algorithmic search space.

But shortly after this, rumblings began that Yahoo! may be interested in purchasing some or all of Overture’s technology. And in July 2003 Yahoo! did indeed buy Overture.

We didn’t hear much about Yahoo! search until February 2004, when the company launched its own version of algorithmic search. And it wasn’t what many expected. Some thought that they’d simply rebrand Inktomi, while others thought they would rebrand one of the Overture purchases and turn either AltaVista or Alltheweb search into Yahoo! search.

But that isn’t what happened. Yahoo! built their own search, cobbling together features from all the technology they owned.

They had the super-fast Inktomi and AltaVista crawlers, as well as the surprisingly good Alltheweb and AltaVista ranking algorithms. So they mashed that all together to get Yahoo! Search.

Yahoo! Search isn’t much different from Google. Their own website says that they analyze pages using many factors to determine relevance to a search query, and the results of that analysis are what the user sees when they perform a query.

Of course Yahoo!, like all the other engines, has spent the past year or more working to improve its ranking algorithms. When they first came out, it seemed that they placed a lot of emphasis on the home page of a given site, with less emphasis on inbound links, or even the other site pages.

However, over the past few months we’ve noticed a subtle shift from homepage-only rankings to other site pages ranking where the home page once ranked.

In addition, they tend to report inbound links differently than Google. When you perform a link check on Google and the same check on Yahoo!, the Google count is almost always lower. Google says this is because they only show a snapshot of the “relevant” links, whereas Yahoo! shows them all regardless of relevance.

And there are other differences as well, but there are too many to go through in this article.

Suffice it to say that Google and Yahoo! use roughly the same technology to return similar results. Granted, you will see differences in the rankings, but this is due to many things. For example, Yahoo! appears to update less frequently than Google. I’ve worked with sites that have new pages indexed and ranking in Google within days of creation, while it can sometimes take months for Yahoo! to do the same.

Essentially what I’m saying is this: If all you are concerned with is rank – then optimizing for Google will get you decent rankings in Yahoo! but it may just take longer for you to show up in Yahoo! search results. That is because, in the end, the technology behind both Yahoo! and Google is very similar.

Tomorrow, however, I will introduce you to two unique engines. One that claims to use Neural Network technology and one that uses Community as the basis for its rankings.