Tuesday, May 16, 2006

C-Store and Google BigTable

I came across an interesting VLDB 2005 paper, "C-Store: A Column-oriented DBMS" (PDF).

What attracted me to this paper, other than that Mike Stonebraker is lead author, was that the goals seem to have a lot in common with what appeared to motivate Google's BigTable.

C-Store is column-oriented (values for a column are stored contiguously) instead of row-oriented like most databases. It is optimized for reads. It is designed for sparse table structures and compresses data. It is designed for high availability on a large cluster. It has relaxed consistency on reads to minimize lock contention. It is extremely fast, two orders of magnitude faster than normal row-oriented databases on reads in their preliminary tests.
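To make "values for a column are stored contiguously" concrete, here is a minimal sketch in Python. It is not from the paper: the table, the column names, and the `to_columns` helper are all made up for illustration, and it says nothing about how C-Store actually lays data out on disk.

```python
# Hypothetical three-column table; names and values are illustrative only.
rows = [
    {"id": 1, "region": "east", "sales": 100},
    {"id": 2, "region": "west", "sales": 250},
    {"id": 3, "region": "east", "sales": 175},
]


def to_columns(rows):
    """Pivot a list of row dicts into one contiguous list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every full row is visited even though only one field is needed.
total_from_rows = sum(row["sales"] for row in rows)

# Column layout: only the "sales" column is scanned.
total_from_columns = sum(columns["sales"])
```

Both totals are the same; the difference is how much unrelated data a read-mostly query has to touch to compute them.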

Google's BigTable is also column-oriented (storing compressed <row, column, timestamp> triples in the SSTable structures). It is optimized for reads. It is designed for sparse table structures and compresses data. It has relaxed consistency. It is extremely fast.

There are some big differences. BigTable is not designed to support arbitrary SQL; it is a very large, distributed map. BigTable emphasizes massive data and high availability on very large clusters more than C-Store. BigTable is designed to support historical queries (e.g. get data as it looked at time X). BigTable does not require explicit table definitions and strings are the only data type.
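The "very large, distributed map" with historical queries can be pictured with a toy, single-process sketch. This is only my assumption about the general shape; `VersionedMap` and its methods are hypothetical names, not BigTable's API, and nothing here is distributed or persistent.

```python
import bisect


class VersionedMap:
    """Toy sparse map keyed by (row, column) that keeps every timestamped version."""

    def __init__(self):
        # (row, column) -> (ascending timestamps, values in the same order)
        self._cells = {}

    def put(self, row, column, timestamp, value):
        timestamps, values = self._cells.setdefault((row, column), ([], []))
        index = bisect.bisect_right(timestamps, timestamp)
        timestamps.insert(index, timestamp)
        values.insert(index, value)

    def get(self, row, column, as_of=None):
        """Return the newest value at or before as_of, or None if there is none."""
        cell = self._cells.get((row, column))
        if cell is None:
            return None  # sparse: absent cells take no space
        timestamps, values = cell
        if as_of is None:
            return values[-1]
        index = bisect.bisect_right(timestamps, as_of)
        return values[index - 1] if index > 0 else None


table = VersionedMap()
table.put("com.example", "contents", 10, "first crawl")
table.put("com.example", "contents", 20, "second crawl")

latest = table.get("com.example", "contents")             # "second crawl"
at_15 = table.get("com.example", "contents", as_of=15)    # "first crawl"
```

No table definition is declared up front, and every value is a string, which matches the two properties mentioned above.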

These unusual database implementations are fascinating. I am not familiar with any other very large scale, high availability, distributed map like BigTable, nor have I heard of any RDBMS with the same very large scale, high availability, read-optimized goals of C-Store.

Update: Speaking of Michael Stonebraker, his new startup, Vertica, just raised $16.5M to build a new database that "provides extremely fast ad hoc SQL query performance, even for very large databases." The underlying technology apparently is based on C-Store.

Hi, Curt. I was referring to what appears to be relaxed read consistency described in sections 6 and 6.1. From the paper:

In C-Store, we isolate read-only transactions using snapshot isolation. Snapshot isolation works by allowing read-only transactions to access the database as of some time in the recent past, before which we can guarantee that there are no uncommitted transactions ... We call the most recent time in the past at which snapshot isolation can run the high water mark (HWM) ... There is also a low water mark (LWM) which is the earliest effective time at which a read-only transaction can run.
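A rough way to picture the quoted passage is the sketch below. It is my own simplification, not C-Store's implementation: `SnapshotStore` and its method names are hypothetical, commit times are plain integers, and the caller is trusted to advance the high water mark only when no earlier transaction is still uncommitted.

```python
class SnapshotStore:
    """Toy multi-version store where readers pick a time between LWM and HWM."""

    def __init__(self):
        self._versions = {}       # key -> list of (commit_time, value), ascending
        self.low_water_mark = 0   # earliest time a read-only transaction may use
        self.high_water_mark = 0  # latest time with no uncommitted transactions

    def commit(self, key, value, commit_time):
        """Record a committed write. Assumes commit times only increase per key."""
        self._versions.setdefault(key, []).append((commit_time, value))

    def advance_high_water_mark(self, time):
        """Declare that nothing at or before `time` is still uncommitted."""
        self.high_water_mark = max(self.high_water_mark, time)

    def read(self, key, as_of):
        """Read the value of `key` as it was at time `as_of`, without locking."""
        if not self.low_water_mark <= as_of <= self.high_water_mark:
            raise ValueError("snapshot time must lie between LWM and HWM")
        result = None
        for commit_time, value in self._versions.get(key, []):
            if commit_time <= as_of:
                result = value
        return result


store = SnapshotStore()
store.commit("x", "a", 1)
store.commit("x", "b", 3)
store.advance_high_water_mark(2)

seen = store.read("x", as_of=2)  # "a": the write at time 3 is not yet visible
```

The reader gets a consistent but slightly stale view, which is the sense in which I called the read consistency relaxed.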

I forget the acronym and don't feel like chasing down the reference, but I recall there being a paper that compared traditional relational storage, column-oriented storage, and a hybrid that used pages as in relational storage but was column-oriented inside each page.

The hybrid has most of the cache benefits of column storage, but doesn't suffer the penalty of row reconstruction.

Column-oriented DB is a rather old idea, Greg. Sybase has been doing it for something like 10 years. KDB is in the same playing field too (although they do not support SQL). I would be interested to see a comparison between column-oriented DBs, because comparing column-oriented DBs with row-oriented ones doesn't make much sense (it's the type of workload that will decide for one or the other).

The bigtable papers do not remove all doubt, but it appears that the natural order of the data is {row,colname,timestamp}. This is NOT a column oriented store in the same sense as C-Store, which stores data in column stripes, "C-Store physically stores a collection of columns, each sorted on some attribute(s)." (from http://db.csail.mit.edu/projects/cstore/vldb.pdf). bigtable is essentially a distributed btree using a variable length byte[] key whose components are chosen as {row, colname, timestamp}.

bigtable does have the notion of column families and latency control by column family, which suggests that the total key order might be {colfam,row,colname,timestamp}. That order would naturally group all data in the same column family together in the same region of the index, and would therefore support better latency control by simply wiring the index segments into memory for the index partition(s) backing the key range for the column family. One of the stated goals for column families was to move heavyweight data (e.g., web page contents) out of the way for better performance on scans and such that do not need to resolve the large objects.

Another uncertainty that I have about bigtable column families is whether bigtable provides ACID update for a "row" across all column families -- I suspect not.

Right, it is similar but not identical to column-oriented databases. BigTable organizes and compresses data using "locality groups", which are related to their "column families". From the paper:

Bigtable locality groups realize similar compression and disk read performance benefits observed for other systems that organize data on disk using column-based rather than row-based storage, including C-Store and commercial products such as Sybase IQ, SenSage, KDB+, and the ColumnBM storage layer in MonetDB/X100.

I think BigTable does support an atomic update on data in different column families in a row. From the paper:

Bigtable supports single-row transactions, which can be used to perform atomic read-modify-write sequences on data stored under a single row key. Bigtable does not currently support general transactions across row keys, although it provides an interface for batching writes across row keys at the clients.
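As a sketch of what an atomic read-modify-write under a single row key could look like, assuming nothing about Bigtable's real interface: `RowStore` and `mutate_row` are hypothetical names, and a per-row in-memory lock stands in for whatever mechanism Bigtable actually uses.

```python
import threading


class RowStore:
    """Toy store where each row key maps to a dict of column -> value."""

    def __init__(self):
        self._rows = {}
        self._locks = {}
        self._guard = threading.Lock()  # protects the lock table itself

    def _lock_for(self, row_key):
        with self._guard:
            return self._locks.setdefault(row_key, threading.Lock())

    def mutate_row(self, row_key, update):
        """Apply update(columns) to a copy of the row and commit it atomically.

        If update raises, the stored row is left unchanged.
        """
        with self._lock_for(row_key):
            working_copy = dict(self._rows.get(row_key, {}))
            update(working_copy)
            self._rows[row_key] = working_copy
            return dict(working_copy)


def record_crawl(columns):
    # Columns in two different (hypothetical) families change together.
    columns["contents:html"] = "<html>...</html>"
    columns["meta:crawl_count"] = str(int(columns.get("meta:crawl_count", "0")) + 1)


store = RowStore()
store.mutate_row("com.example", record_crawl)
row = store.mutate_row("com.example", record_crawl)
```

Either both columns in the row change or neither does, but two different row keys are never updated as one unit, which mirrors the limitation described in the quote.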

However, I am not sure this means Bigtable is ACID compliant for a row update. If I understand it correctly, Bigtable has relaxed consistency that violates the ACID rules in some situations.

Google Bigtable is ACID compliant, but it does not provide full HA: even though there are 5 replicas of Chubby, that is its SPOF. Google should do some publishing on Bigtable so people like us could stop speculating about what it can or can't do.