Analysis

Last week, Google announced the launch of its new search index, but why would the world’s leading search engine even need such a thing? Doesn’t Google already control a gigantic proportion of the search business, as well as its Siamese twin, the phenomenally lucrative search-related keyword advertising business? The biggest clue is in the name of the new system: Caffeine. Google needed a boost to help it respond faster to a world that’s become increasingly real-time. Not just because it wants to — because it has to.

The Problem: The Real-Time Web

Google first announced it was working on an update to its index last August. The announcement came, coincidentally or not, at about the same time as news of a search partnership between Microsoft and Yahoo, in which the software giant’s Bing search engine would power the results at all of Yahoo’s properties. But that deal wasn’t the threat that prompted the shift to Caffeine. The biggest push came from the simple fact that the web is speeding up all around us, thanks largely to the skyrocketing popularity of social media sites like Twitter and Facebook, as well as real-time publishing protocols such as PubSubHubbub.

Google’s previous indexing system worked in large batches. The engine’s automated bots “crawled” websites and pages every few weeks to detect changes, and the resulting updates accumulated in a pool. But no page in that pool could be reached by searchers until the entire batch had finished processing. That meant large quantities of results were up to several weeks old, even though newer versions were already sitting in the pending update.

While search results that were days or even weeks old might have been fine a year or two ago, the web has become far more real-time since then, thanks to the volume of status updates, photos and other information coming from social networks such as Facebook and Twitter. Facebook has more than 500 million users, many of whom are posting updates, links and photos multiple times a day, and Twitter’s COO Dick Costolo recently estimated that his social network sees more than 65 million messages posted every day.

Google responded to that pressure in part by licensing the full “firehose” feed of updates from Twitter and adding those, along with updates from Facebook, to its search results under a separate tab in its recent redesign. The pressure comes from more than just those social networks, however: more media outlets are publishing via blogging platforms that push out new pages dozens of times throughout the day, rather than just once, and sites that aggregate news and other content are doing the same.

That deluge of information places increasing pressure on a search engine like Google to become more real-time in its results. Other search players are trying to solve that problem as well, including OneRiot and Socialmention, and while none has approached the comprehensiveness of Google so far, the search giant likely doesn’t want to lose any more ground to such upstarts than it has to. Microsoft’s Bing is also an ongoing threat; the Redmond giant took some time to get its technology in order, but it has been improving (and gaining market share) steadily.

Google’s Solution: Get Caffeinated

With Caffeine, Google decided to make smaller but more frequent updates to the index, so that at any given moment more of the results are fresh. In fact, the company says Caffeine’s results are 50 percent fresher than those of the previous system.
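To see why smaller, more frequent updates translate into fresher results, consider a rough sketch in Python. It is a toy model, not Google’s implementation: the function name, the one-crawl-per-hour schedule and the two batch sizes (a two-week batch versus an hourly one) are all assumptions chosen for illustration. The only rule taken from the description above is that nothing in a batch becomes searchable until the whole batch is finished.

```python
import math


def average_wait_hours(crawl_times, batch_hours):
    """Average hours between a page being crawled and becoming searchable.

    Assumption: a crawled page is held back until the batch it belongs to
    closes, and batches close every `batch_hours` hours.
    """
    waits = []
    for crawled_at in crawl_times:
        # The batch closes at the next multiple of batch_hours.
        batch_closes_at = math.ceil(crawled_at / batch_hours) * batch_hours
        waits.append(batch_closes_at - crawled_at)
    return sum(waits) / len(waits)


# Hypothetical workload: one page crawled every hour (on the half hour)
# for two weeks.
crawl_times = [hour + 0.5 for hour in range(24 * 14)]

old_style = average_wait_hours(crawl_times, batch_hours=24 * 14)  # one big batch
new_style = average_wait_hours(crawl_times, batch_hours=1)        # small batches

print(f"one two-week batch: {old_style} hours on average")
print(f"hourly batches:     {new_style} hours on average")
```

Under these made-up numbers, the average wait drops from 168 hours (a full week) to half an hour. The real figures for Google’s index are not public at this level of detail; the point is only that the wait shrinks in step with the batch size.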

The Caffeine update effectively makes most of the search engine’s results almost real-time. “When you search Google, you’re not searching the live web. Instead, you’re searching Google’s index of the web which, like the index in the back of a book, helps you pinpoint exactly the information you need,” Google software engineer Carrie Grimes explained, adding that “expectations for search are higher than they used to be. Searchers want to find the latest relevant content and publishers expect to be found the instant they publish.”

Grimes said the Caffeine system analyzes hundreds of thousands of web pages each second in parallel and adds new information to the index at a rate of hundreds of thousands of gigabytes per day. Nearly 100 million gigabytes of data are stored in one Caffeine database. Google not only has to consume that large a quantity of information as quickly as possible, it also has to filter it, find connections between pages and content and rank those using its PageRank algorithm, so it can place them in order of importance.

The latter part of this equation is where Microsoft is hoping to chip away at Google’s hold on the market. Bing’s response to the explosion of real-time web content has been to focus on context for search results rather than simply their freshness; a recently launched ad campaign pushes the “Bing and decide” tagline, and some observers have noted that the site’s UI puts a stronger focus on e-commerce activities.

Why It Matters

Grappling with real-time matters to Google not only because its stated mission is to “organize the world’s information,” and that information is arriving faster all the time, but also because the advertising Google counts on depends to a large extent on that same information. The challenge for Google is that social networks and the information shared through them are becoming more and more of a competitive threat to the company, as Om Malik noted in a report last year. If people find the information they need through their friends and connections on social networks, why do they need Google and its ads?

The key to both search and advertising is relevance. Search results work because they are relevant to what a user is looking for, and the keyword-related ads that Google shows alongside its search results are similarly most effective when they are relevant. Google’s hypothesis with Caffeine is that relevance is also a function of time: the more timely the results, the more closely they will match what a user is interested in or searching for.

The bottom line is that the increasing demand for real-time information, whether from social networks or just the broader web, puts pressure not just on Google, but on anyone whose business involves information and dealing with users or customers who have access to that information.

The issue for many companies and industries — whether they are marketing-related or manufacturing or distribution-based — is that their business processes and information systems are fundamentally not designed to function in real-time. In most cases, information flows slowly through a chain of command that is several layers deep, and then decisions take days or even weeks to make, and the outcome of those decisions is similarly delayed as it trickles down to the front lines.

In other words, many companies operate the same way that Google’s index did before it was re-engineered. Those firms and industries will need to ingest their own jolts of caffeine before they can take full advantage of the real-time nature of the web.