Google Caffeine jolts worldwide search machine

Web colossus claims 50% more freshness

Google has completed the roll-out of its next-generation search infrastructure, the indexing system overhaul known as "Caffeine." According to the company, the new setup provides "50 per cent fresher" results than its previous system.

Mountain View rolled out a public test of the Caffeine index in a single data center last August, and a wider deployment has been expected since the company completed this test in early November. Google tells us Caffeine is now backing all user searches worldwide, and presumably it has been rolled out across the company's global network of data centers.

"Caffeine is the web-indexing technology that powers all of Google search," the company tells The Reg. "It is available in all countries and languages where we offer search."

This spring, Google Research head Peter Norvig said the company was updating its index every ten seconds. But as the company said today in a blog post announcing the completion of Caffeine, its index was previously separated into layers, with some layers updated faster than others. The main layer wouldn't be updated for a good two weeks. Caffeine takes a more holistic approach, continuously updating the entire index.

"To refresh a layer of the old index, we would analyze the entire web, which meant there was a significant delay between when we found a page and made it available to you," the blog post says. "With Caffeine, we analyze the web in small portions and update our search index on a continuous basis, globally. As we find new pages, or new information on existing pages, we can add these straight to the index. That means you can find fresher information than ever before — no matter when or where it was published."

In essence, Google has moved from a batched indexing system to a system that updates on the fly. "Our technology enables us to add pages to the index as soon as we crawl them," the company tells us. "In the past, we would index pages in large batches (often billions of documents) because we would analyze the entire web each time we updated the index. With Caffeine we can analyze the web in small portions, so we can update the index continuously.

"Another way to think about this is that we’ve gone from indexing batches of billions of documents to processing billions of 'batches' (each with one document)."
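To make the batch-versus-continuous distinction concrete, here is a toy sketch in Python. It is not Google's code, and every name in it (`tokenize`, `build_batch_index`, `IncrementalIndex`) is our own invention for illustration. It assumes a simple in-memory inverted index, which is nothing like a system spread across data centers.

```python
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_batch_index(pages):
    """Batch style: rebuild the whole index from every known page."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


class IncrementalIndex:
    """Continuous style: fold in one crawled page at a time."""

    def __init__(self):
        self.index = defaultdict(set)   # word -> set of URLs
        self.words_by_url = {}          # URL -> words seen on last crawl

    def add_page(self, url, text):
        """Add a new page, or update an existing one, without a rebuild."""
        new_words = set(tokenize(text))
        old_words = self.words_by_url.get(url, set())
        # Drop postings for words that vanished from the page.
        for word in old_words - new_words:
            self.index[word].discard(url)
        # Add postings for every word now on the page.
        for word in new_words:
            self.index[word].add(url)
        self.words_by_url[url] = new_words

    def search(self, word):
        """Return the URLs containing the word, in sorted order."""
        return sorted(self.index.get(word.lower(), set()))
```

In the batch version, a changed page only becomes searchable after the next full rebuild. In the incremental version, a call to `add_page` makes it searchable at once, which is the "billions of batches, each with one document" idea in miniature.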

According to Google, Caffeine processes hundreds of thousands of pages each second across its famously distributed infrastructure. "If this were a pile of paper," the blog post reads, "it would grow three miles taller every second." At the moment, the index takes up nearly 100 million gigabytes of storage in one (distributed) database, and new information is added at a rate of hundreds of thousands of gigabytes each day.

For some reason, Google prefers to think of all this data in terms of Jobsian music players. "You would need 625,000 of the largest iPods to store that much information. If these were stacked end-to-end they would go for more than 40 miles," it says.

No doubt, the company will soon be chastised by an army of fanbois for building its pointless data storage analogies around outdated hardware. How many iPads would you need to run Caffeine? 1,562,500, and they would stretch for 235.75 miles. That's almost 240 miles of no Adobe Flash.
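For anyone who wants to check the sums, here is a quick sketch. The index size and iPod count are Google's. The 64GB iPad capacity and 9.56-inch iPad length are our assumptions, chosen because they reproduce the figures above.

```python
INDEX_GB = 100_000_000    # "nearly 100 million gigabytes", per Google
IPOD_COUNT = 625_000      # Google's figure
IPAD_GB = 64              # assumption: largest iPad capacity
IPAD_LENGTH_IN = 9.56     # assumption: iPad length in inches
INCHES_PER_MILE = 63_360

# Capacity per iPod implied by Google's own numbers.
ipod_gb = INDEX_GB / IPOD_COUNT

# iPads needed to hold the index, and how far they stretch end to end.
ipad_count = INDEX_GB / IPAD_GB
ipad_miles = ipad_count * IPAD_LENGTH_IN / INCHES_PER_MILE

print(f"Implied iPod capacity: {ipod_gb:.0f}GB")
print(f"iPads needed: {ipad_count:,.0f}")
print(f"Length end to end: {ipad_miles:.2f} miles")
```

Under those assumptions, Google's analogy implies a 160GB iPod, and the iPad count comes out at 1,562,500, stretching a little under 236 miles.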

With Caffeine, Google has rolled out more than just a new indexing system. It has also debuted a revamped software architecture that will likely underpin all of its online applications for years to come. Last year, über-Googler Matt Cutts confirmed with The Reg that Caffeine is built atop a complete overhaul of the company's custom-built Google File System. At least informally, Google refers to this file system redux as GFS2, and it was two years in the making.

"There are a lot of technologies that are under the hood within Caffeine, and one of the things that Caffeine relies on is next-generation storage," Cutts said. "Caffeine certainly does make use of the so-called GFS2."

Asked whether Caffeine also includes updates to MapReduce, Google's distributed number-crunching platform, or BigTable, its distributed real-time database, Cutts declined to comment. But he played down the possibility of major updates to these pieces of its back-end infrastructure. He did say, however, that with Caffeine, Google is testing multiple platforms that could be applied across its entire infrastructure.

"I wouldn't get caught up on next-generation MapReduce and next-generation BigTable. Just because we have next-generation GFS does not automatically imply that we've got other next-generation implementations of platforms we've publicly talked about," he said. "But certainly, we are testing a lot of pieces that we would expect to — or hope to — migrate to eventually."

This includes brand-new back-end technologies Google hasn't publicly discussed. "There are certainly new tools in the mix," he said.

But at the moment, Google has merely updated its search indexing system — i.e., the system that builds a database of all known websites, complete with all the metadata needed to describe them. With Caffeine, Google has rewritten this system from the ground up.

"Caffeine is a fundamental re-architecting of how our indexing system works," Cutts said. "It's larger than a revamp. It's more along the lines of a rewrite. And it's really great. It gives us a lot more flexibility, a lot more power. The ability to index more documents. Indexing speeds — that is, how quickly you can put a document through our indexing system and make it searchable — is much much better."

But Google is concerned with more than just short-term speed. Caffeine is meant to accommodate the web as it continues to expand. "We've built Caffeine with the future in mind," reads today's blog post. "Not only is it fresher, it's a robust foundation that makes it possible for us to build an even faster and comprehensive search engine that scales with the growth of information online." Google promises more infrastructure updates "in the months to come". ®