Thursday, December 09, 2010

Papers on specialized databases at Google

Googlers have published two papers recently at academic conferences detailing new specialized databases that are heavily used within Google.

The first paper is an OSDI 2010 paper describing Percolator. Percolator is the database powering Caffeine, Google's new system for providing fresher search results by adding new documents and document updates to the search index in near real-time.

The Percolator paper is titled "Large-scale Incremental Processing Using Distributed Transactions and Notifications" (PDF). An excerpt:

We have built Percolator, a system for incrementally processing updates to a large data set, and deployed it to create the Google web search index. By replacing a batch-based indexing system with an indexing system based on incremental processing using Percolator, we process the same number of documents per day, while reducing the average age of documents in Google search results by 50%.

Percolator is built on top of the Bigtable distributed storage system .... Percolator was built specifically for incremental processing and is not intended to supplant existing solutions for most data processing tasks. Computations where the result can't be broken down into small updates (sorting a file, for example) are better handled by MapReduce. Also, the computation should have strong consistency requirements; otherwise, Bigtable is sufficient. Finally, the computation should be very large in some dimension (total data size, CPU required for transformation, etc.); smaller computations not suited to MapReduce or Bigtable can be handled by traditional DBMSs.

Percolator is a specialized database that adds new consistency guarantees (as well as triggers, which they call "observers") to Bigtable. What is interesting is how specialized this system is. Percolator, for example, is in no way intended for online operations and can, in some cases, delay a transaction for tens of seconds due to stray locks. But that is fine for the near real-time search index update task for which it is designed.
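
To make the observer idea concrete, here is a minimal Python sketch of the pattern (the real API is C++ and far richer; the toy Bigtable and all the names below are made up for illustration): a write that carries a notification wakes an observer, which runs a small piece of work that reads the changed column and writes derived columns, possibly notifying observers further downstream.

    # Hypothetical sketch of Percolator-style observers layered on Bigtable.
    # Not the real API; just illustrating the notify-then-react pattern.

    class FakeBigtable:
        """Toy stand-in for Bigtable: row -> column -> value, plus a notification queue."""
        def __init__(self):
            self.rows = {}
            self.notifications = []   # (row, column) pairs with pending work

        def write(self, row, column, value, notify=False):
            self.rows.setdefault(row, {})[column] = value
            if notify:
                self.notifications.append((row, column))

        def read(self, row, column):
            return self.rows.get(row, {}).get(column)

    def document_observer(table, row):
        """Observer body: runs when 'raw:html' changes; writes a derived column."""
        html = table.read(row, "raw:html")
        table.write(row, "index:terms", sorted(set(html.lower().split())))
        # Passing notify=True here would wake any observer watching 'index:terms'.

    # The crawler delivers a new page; the write carries a notification.
    table = FakeBigtable()
    table.write("com.example/page1", "raw:html", "New Shiny Page", notify=True)

    # A worker loop drains notifications and runs the registered observer.
    while table.notifications:
        row, column = table.notifications.pop()
        if column == "raw:html":
            document_observer(table, row)

    print(table.read("com.example/page1", "index:terms"))
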

The second paper is a VLDB 2010 paper on Dremel. Dremel is a column store database designed to be orders of magnitude faster than MapReduce for some interactive queries.

This paper is titled "Dremel: Interactive Analysis of Web-Scale Datasets" (PDF). An excerpt:

Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing.

Dremel can execute many queries over such data that would ordinarily require a sequence of MapReduce (MR) jobs, but at a fraction of the execution time. Dremel is not intended as a replacement for MR and is often used in conjunction with it to analyze outputs of MR pipelines or rapidly prototype larger computations .... Dremel provides a high-level, SQL-like language to express ad hoc queries. In contrast to layers such as Pig and Hive, it executes queries natively without translating them into MR jobs.
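
To give a rough feel for what a multi-level execution tree buys you, here is a toy Python sketch of the idea (not Dremel's code; the shards, servers, and query here are all made up for illustration): the root fans an aggregation out across groups of leaves, each leaf scans only the queried column of its shard, and partial aggregates are merged at each level on the way back up.

    # Toy sketch of a multi-level aggregation tree for AVG over one column.

    def leaf_scan(shard):
        """Leaf server: scan one shard's column and return a partial aggregate."""
        count, total = 0, 0
        for value in shard:           # only the queried column is read
            count += 1
            total += value
        return count, total

    def merge(partials):
        """Intermediate or root server: merge partial aggregates."""
        partials = list(partials)
        count = sum(c for c, _ in partials)
        total = sum(t for _, t in partials)
        return count, total

    # Toy data: 6 shards of a single numeric column, grouped under 2 intermediates.
    shards = [[1, 2, 3], [4, 5], [6], [7, 8], [9], [10, 11, 12]]
    intermediates = [shards[:3], shards[3:]]

    partials = [merge(leaf_scan(s) for s in group) for group in intermediates]
    count, total = merge(partials)
    print("AVG =", total / count)
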

The paper includes some fun motivational examples describing how people use Dremel for rapid prototyping of new ideas. There is a huge advantage in spending just seconds rather than hours to examine the potential of a new feature for a classifier or a new signal for relevance rank. Dremel lets Googlers twiddle a feature several times in just a few minutes to optimize it, then run a big, multi-hour MapReduce job to get the final data, while rivals might take days to do the same investigation.

Dremel, like all column stores, works best when selecting just a few columns from the data, but that is a very common case and well worth optimizing for. One fun aside briefly mentioned in the paper is that they see another order of magnitude or two of speedup if they can stop after looking at only 98% or so of the data, because waiting for straggler chunks causes big slowdowns. So, if you are willing to accept (usually slightly) inaccurate results, you can get huge boosts in speed by stopping queries early. That is also not new, but again it is a very common case and worth thinking about.
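
To illustrate that early-termination trade-off, here is a small simulated sketch (my toy numbers, not figures from the paper): if most chunks finish quickly but a few straggle, stopping after 98% of chunks have reported and scaling the partial aggregate up gives a nearly exact answer in a small fraction of the time.

    # Toy simulation of stopping a query after 98% of chunks instead of
    # waiting for stragglers. All delays and data are made up.

    import random

    def scan_chunk(chunk_id):
        """Pretend to scan one chunk; roughly 2% of chunks are slow stragglers."""
        delay = 100.0 if random.random() < 0.02 else random.random()  # simulated seconds
        values = [random.randint(0, 9) for _ in range(1000)]
        return delay, sum(values)

    def run_query(num_chunks=500, stop_fraction=0.98):
        # Results ordered by simulated completion time ("fastest chunks first").
        results = sorted(scan_chunk(i) for i in range(num_chunks))
        full_total = sum(total for _, total in results)
        full_time = results[-1][0]                   # must wait for the slowest chunk

        kept = results[:int(num_chunks * stop_fraction)]
        early_total = sum(total for _, total in kept) / stop_fraction  # scale up
        early_time = kept[-1][0]
        return full_total, full_time, early_total, early_time

    full_total, full_time, early_total, early_time = run_query()
    print(f"exact:  {full_total}  after {full_time:.1f}s (simulated)")
    print(f"approx: {early_total:.0f}  after {early_time:.2f}s (simulated)")
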

In both of these papers, what I find so remarkable is how willing Google is to build specialized databases to speed up tasks. Percolator can only be used for tasks that have huge data, require strong consistency, and can tolerate occasional latencies of tens of seconds, but that is perfect for near real-time search index updates. Dremel can only be used for selecting a few columns of data, but it is extremely common for a MapReduce job to want to ignore almost all the data in each record. It reminds me of the specialized search index structures and kernel twiddles Google does, which are other interesting examples of the lengths Google is willing to go to maximize the usefulness of its internal tools and the performance of its systems.