This is an open forum for discussion of papers
presented at OSDI 2006. Please add your comments
to these postings. We invite comments from anyone
who has read the paper or heard the presentation;
please note that the papers themselves are not available for free online access until 2007.

Abstract

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.

How do you deal with server expansion? Will tablets be redistributed immediately?

It depends on the load on the rest of the tabletservers in the system. If many servers are significantly overloaded, tablets will migrate quickly (perhaps 15 seconds or so). If things are not overloaded, migration will happen more slowly (many minutes or tens of minutes).

How do you handle incoming update requests when a global compaction is being conducted?

A major compaction for a tablet grabs all the existing SSTables, plus any existing uncompacted log segments, and uses those to generate a new, single compacted SSTable. Any updates that come in during the compaction are added to the end of the commit log (not covered by the active compaction) and are buffered in memory, as happens with mutations even without an active compaction. Lookups continue to serve from the in-memory state plus the existing SSTables until the major compaction is completed.
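The merge-and-discard logic described above can be sketched roughly as follows. This is purely illustrative Python, not Bigtable's actual code: the dict-based maps, the `TOMBSTONE` marker, and the function name are our assumptions.

```python
# Illustrative sketch only: names and structures are ours, not Bigtable's.
TOMBSTONE = object()  # deletion marker written by delete operations

def major_compaction(memtable, sstables):
    """Merge the in-memory write buffer and every existing SSTable for a
    tablet into one new SSTable. Because a *major* compaction sees all
    older data, deletion markers can be dropped entirely."""
    merged = {}
    # Iterate oldest source first so that newer values overwrite older ones;
    # the caller passes `sstables` in oldest-to-newest order.
    for source in sstables + [memtable]:
        merged.update(source)
    # Emit a single sorted map, discarding tombstones.
    return {k: v for k, v in sorted(merged.items()) if v is not TOMBSTONE}
```

Minor compactions, by contrast, cannot drop tombstones, since a deletion may shadow a value in an SSTable that is not part of the merge.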

How do you update the Bloom filter after rows are deleted? Do you need to reconstruct the Bloom filter periodically?

A Bloom filter is associated with each SSTable, and is generated during the compaction that produced the SSTable (so Bloom filter updating happens as part of the compaction process).
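In other words, the filter is write-once: it is populated as the SSTable's keys are written out, and it is discarded together with the SSTable when a later compaction replaces it. A minimal sketch (class and function names are ours; the real implementation and hash choices are internal to Bigtable):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter sketch. Membership tests may return false
    positives but never false negatives, which is safe: a false
    positive only costs an unnecessary SSTable read."""
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits)

    def _positions(self, key):
        # Derive k independent bit positions from a cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))

def write_sstable(sorted_entries):
    """During compaction, populate the filter as keys are written out;
    the filter is stored alongside the immutable SSTable."""
    bf = BloomFilter()
    for key, _value in sorted_entries:
        bf.add(key)
    return bf
```

Because SSTables are immutable, there is never a need to remove a key from a filter; deletions simply produce a new SSTable (with a new filter) at the next compaction.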

Is GFS cohosted with Bigtable, or does GFS run on separate hosts behind Bigtable? Which layer is responsible for tablet replication? It would be lovely to understand their different responsibilities.

We don't have a requirement one way or the other, but our typical configuration is that tabletservers run on the same machines as GFS chunkserver processes for the underlying GFS cell. In some cases this allows us to avoid one network transfer for reads and writes (if one of the chunk replicas for an underlying SSTable is stored on the local chunkserver).

We don't allow replication of a tablet in our system, so at any given time, a tablet is being served by a single tabletserver. The master and tabletservers cooperate to make tablet migration decisions to hand off responsibility for a tablet from one tabletserver to another, and also to assign a new tabletserver to serve a tablet when a tabletserver fails.

What happens when a tablet/chunk server crashes? If some tablets are only available on that server, you will have lost those tablets. That does not sound right.

The persistent state of a tablet is stored on GFS. Each file in GFS has multiple replicas of each chunk in the file (typically 3 replicas per chunk), stored on three separate machines. So failure of a single machine (or two machines) doesn't make the data inaccessible. The Bigtable system decides on a new tabletserver to take responsibility for serving a tablet, and that tabletserver reads the necessary state (just the tail of the uncompacted log entries) from GFS into its in-memory state for the tablet and starts serving the tablet, accessing the SSTables in the tablet's state from GFS by reading from the other chunk replicas.
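The recovery step is cheap precisely because the SSTables are immutable files in GFS and need no repair; only the uncompacted tail of the commit log must be replayed. A rough sketch (all names and the sequence-number scheme are our assumptions):

```python
def recover_tablet(sstable_paths, log_entries, last_compacted_seq):
    """Illustrative only: rebuild a tablet's serving state after its
    previous tabletserver failed. The SSTables need no recovery; only
    log entries written after the last compaction are replayed into a
    fresh in-memory write buffer (memtable)."""
    memtable = {}
    for seq, key, value in log_entries:
        if seq > last_compacted_seq:   # skip entries already in SSTables
            memtable[key] = value
    return sstable_paths, memtable
```

Once the memtable is rebuilt, the new tabletserver can serve reads from the memtable plus the SSTables, exactly as the old one did.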

Meanwhile, the GFS master initiates re-replication for the chunks stored on the failed chunkserver machine, to bring all the file chunks up to their desired level of replication (e.g. from 2 replicas to 3 replicas, typically).


Do you have one write buffer for each tablet? If so, how do you handle 100s of tablets per server?

Yes, we have one write buffer per tablet. If a tabletserver is currently responsible for 100s of tablets, it'll have 100s of write buffer data structures.

How do you deal with tablet split when some data are still in write buffer?

The write buffer is implemented as a mutable sorted map from key->value (think binary tree). When we split a tablet, the write buffer is logically split into two separate write buffers for the two split ranges.
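The logical split described above can be sketched like this (the real structure is a balanced tree or similar sorted map, not a Python dict; the function name is ours):

```python
def split_write_buffer(buffer, split_key):
    """Logically split one tablet's write buffer (a sorted key->value
    map) at split_key: keys below the split stay with the first
    tablet, keys at or above it go to the second. No data is copied
    to disk; only the in-memory map is partitioned."""
    low = {k: v for k, v in buffer.items() if k < split_key}
    high = {k: v for k, v in buffer.items() if k >= split_key}
    return low, high
```

With a tree-based map this is a range split rather than a scan, so splitting is fast even for large buffers.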

I understand that Bigtable doesn't aim to provide all the functionalities of a database system. However, for applications that do want to enforce data integrity constraints such as referential integrity -- how are these handled in Bigtable?

For cases where the client has designed their schema in such a way that both the source and the destination of the data are in different columns in the same row, these checks can be done using a single-row transaction. Other solutions include background processing of data in the table to identify and fix up cases that violate the application's desired constraints.
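To make the single-row-transaction approach concrete, here is a toy sketch. The `ToyTable` class, the column names, and the transaction API are all hypothetical; only the underlying guarantee (mutations to a single row are atomic) comes from Bigtable:

```python
class ToyTable:
    """In-memory stand-in for a table; the transaction method mimics
    Bigtable's guarantee that mutations to one row are atomic."""
    def __init__(self):
        self.rows = {}

    def single_row_transaction(self, row_key, mutate):
        row = dict(self.rows.get(row_key, {}))
        mutate(row)                    # all-or-nothing within one row
        self.rows[row_key] = row

def set_link(table, row_key, url):
    # Constraint: the hypothetical 'link:url' and 'anchor:url' columns
    # must always agree. Updating both inside one single-row
    # transaction preserves the invariant with no cross-row
    # coordination, which Bigtable does not provide.
    def mutate(row):
        row["link:url"] = url
        row["anchor:url"] = url
    table.single_row_transaction(row_key, mutate)
```

Constraints that span rows have no such shortcut, which is why the answer falls back to background repair jobs for those cases.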

Further, since the tablets store only a subset of columns for any given row, does enforcing these constraints across columns that span multiple tablets lead to large overheads?

A given tablet contains all columns for a particular row (i.e. tablets are always at row boundaries). So enforcing constraints across columns in the same row isn't usually that expensive.

I am just trying to imagine a scenario where there is a need for the 3 most recent versions of a URL (or a data item) and a need for millisecond granularity for timestamps.

This is more of a client application question, rather than a BigTable question, but, for our crawling application example, it's often useful to be able to look at multiple versions of a page over time to see if the page has changed at all, whether it has changed in relatively minor ways (e.g. a last modified date at the bottom of the page), or if the content on the page has changed substantially. Allowing storage of multiple versions in our data model enables clients to do analyses like this relatively easily.

Our timestamps are actually at microsecond granularity, not millisecond granularity. Some applications don't need that much resolution and are happy with just storing timestamps that are accurate to 1 second or something, but other applications have found it useful to have very fine-grained timestamps.

The speaker mentioned that existing parallel databases were insufficient. Yet, he also admitted that the techniques used arose from the database community. So, is latency the reason why you chose to write BigTable and provide a low-level, table-based API to applications (as opposed to SQL) ? While 800TB is large, it is not unheard-of for vendors like Teradata, so I cannot imagine that data-size is the reason why BigTable was built. Why did databases designed to scale, not scale for your purposes ?

I suspect that there are some parallel database solutions that could scale from an absolute performance/size perspective, but if you talk to them about pricing for storing 800 TB of data with the ability to handle millions of operations per second, I suspect that the performance/$ will not be as good as what we can achieve with BigTable. Furthermore, it might not run on our commodity hardware, which might make the system more difficult to manage from an operational standpoint (since we like to have a mostly homogeneous hardware infrastructure). Finally, if you then had a different 800 TB dataset to store, I suspect you'd end up with a significantly higher marginal cost for the second system with a commercial database system than with our more home-grown system.

On a related note, BigTable provides versioning via the timestamp attribute. (Something that databases don't do well). Could you please comment on how and how heavily this dimension is used ? Also, please elaborate on the complications that arise as a result of overuse of the timestamp dimension.

(I haven't run detailed stats, so these are just rough characterizations.) Perhaps half of our applications don't use the timestamp dimension at all. The rest use it in a variety of ways. Perhaps another quarter don't want/need to keep multiple versions of the same row/columns, but do want to have a timestamp value on the data that they do store so that they can tell when it was generated/updated. The remaining 25% of applications keep some history information for some or all of their columns, with policies ranging from "keep all versions" to "keep a couple of months of history" to "keep the last K versions".
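The retention policies mentioned above can be modeled as simple garbage-collection rules over a cell's version list. A sketch (the policy encoding and function name are ours; Bigtable's actual per-column-family GC settings are internal):

```python
def apply_gc_policy(cells, policy):
    """cells: list of (timestamp_us, value) versions for one
    row/column. Returns the versions that survive garbage collection,
    newest first. Purely illustrative."""
    cells = sorted(cells, reverse=True)        # newest version first
    if policy["kind"] == "keep_all":
        return cells
    if policy["kind"] == "keep_last_k":        # e.g. "last 3 versions"
        return cells[: policy["k"]]
    if policy["kind"] == "keep_since":         # e.g. "last two months"
        return [c for c in cells if c[0] >= policy["cutoff_us"]]
    raise ValueError("unknown policy kind")
```

In practice such rules run as part of compactions, so expired versions are dropped when SSTables are rewritten rather than eagerly.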

So, it's not useful for all applications, but we do have multiple client applications that have used timestamps to good effect.

When tables in Bigtable are used as inputs/outputs to a Map/Reduce job, does the map/reduce job use BigTable APIs to read/write them? Or does it work on the GFS files below the SSTables?

MapReductions that read from BigTable use the BigTable API: they don't read the SSTables directly (the set of SSTables is constantly changing as tabletservers do compactions, and the particular way that data is encoded in SSTables is internal to the bigtable code).
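Conceptually, the MapReduce input reader just drives the client API's scanner and feeds rows to the map function. A minimal sketch (the names `scan_rows` and `map_fn` are hypothetical; the real Bigtable/MapReduce glue is internal to Google):

```python
def map_over_table(scan_rows, map_fn):
    """Sketch of a MapReduce input reader layered on the Bigtable
    client API: it iterates rows via a scanner (here, any iterable of
    (row_key, columns) pairs) and never touches SSTable files, so it
    is insulated from compactions and the on-disk encoding."""
    for row_key, columns in scan_rows:
        yield from map_fn(row_key, columns)
```

Writing works the same way in reverse: the reduce output is applied through the client API as ordinary mutations.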

Are columns and locality groups defined during table creation? Just wondering whether new columns and/or locality groups can be added several weeks after table creation.

Yes, it's fine to add a column family or locality group anytime after a table has been created, and it's also fine to change the column family to locality group mapping at any time: the system transparently handles this as part of its normal operation.

Would you share the API for creating tables, locality groups, etc.?

The paper should give sufficient details about these interfaces: there's nothing unexpected there.

Wilson Hsieh described Bigtable, a storage system designed for in-house use at Google. Bigtable developed in response to a need to store information on over 10 billion URLs, with wide variety in the size of objects associated with a given URL, and variety in usage patterns. Commercial databases are obviously unsuitable due to the scale (petabytes of data on billions of objects, with thousands of clients and servers). Bigtable stores data in a three-dimensional, sparse sorted map. Data values are located by their row, column, and timestamp.

Since these tables can be quite large, Bigtable allows dynamic partitioning of rows into tablets, which are distributed over many Bigtable servers. Clients can manage locality by choosing row keys in such a way that data that should be grouped together achieves spatial locality. A given tablet is owned by one server, but load balancing can shift tablets around. Similarly to tablets, locality groups are partitions of columns instead of rows. These locality groups, however, segregate data within a tablet. All data is stored as files in the Google File System (GFS), with different locality groups stored as different files in GFS. One master Bigtable server controls the many tablet servers. Clients access a tablet by requesting a handle from the Chubby lock service (described in another OSDI talk), then doing reads/writes directly with the tablet server that owns a given tablet. A client only talks to the master when it wants to manipulate metadata (create tables, manipulate ACLs, et cetera). Currently, Bigtable is deployed on over 24 thousand machines in approximately 400 clusters, and is used by over 60 projects at Google.
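A concrete example of managing locality through row-key choice is the reversed-hostname scheme used in the paper's crawling example: pages from the same domain sort adjacently and thus tend to land in the same tablets. A small sketch (the helper function is ours):

```python
def row_key(url):
    """Turn a URL into a row key whose lexicographic order groups
    pages from the same domain together, as in Bigtable's webtable
    example (www.google.com -> com.google.www)."""
    host, _, path = url.partition("/")
    reversed_host = ".".join(reversed(host.split(".")))
    return reversed_host + "/" + path

# Pages from google.com end up adjacent in row-key order:
keys = sorted(row_key(u) for u in
              ["maps.google.com/index.html",
               "www.google.com/index.html",
               "www.example.org/about"])
```

Because tablets are contiguous row ranges, a scan over one domain touches only a few tablets instead of being scattered across the table.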

The first question concerned the poor random read performance noted in the paper, and noted that they attributed this to shared network links. The questioner added that it seems an easy optimization to move tablet servers closer to the GFS storage that the tablets actually reside on. Wilson answered that, by default, if a GFS server is collocated with the tablet server, the tablet's data will be stored on that GFS server. Atul Adya from Microsoft Research asked how they deal with hierarchical data: does this generate thousands of columns in their table paradigm? Wilson noted that Bigtable is not a database, and such hierarchical data situations are not well suited to Bigtable. He noted that their clients need to conform to how Bigtable works, not the other way around. George Candea from EPFL asked if they had any technical insights to offer based on their experiences. Wilson said that, if you have the freedom to do so, build something with a custom API that matches the needs of both clients and users.