Column family databases are probably most known because of Google’s BigTable implementation. The are very similar on the surface to relational database, but they are actually quite different beast. Some of the difference is storing data by rows (relational) vs. storing data by columns (column family databases). But a lot of the difference is conceptual in nature. You can’t apply the same sort of solutions that you used in a relational form to a column database.

That is because column databases are not relational, for that matter, they don’t even have what a RDBMS advocate would recognize as tables.

Nitpicker corner: this post is about the concept, I am going to ignore actual implementation details where they don’t illustrate the actual concepts.

Note: If you want more information, I highly recommend this post, explaining about data modeling in a column database.

The following concepts are critical to understand how column databases work:

Column family

Super columns

Column

Columns and super columns in a column database are spare, meaning that they take exactly 0 bytes if they don’t have a value in them. Column families are the nearest thing that we have for a table, since they are about the only thing that you need to define upfront. Unlike a table, however, the only thing that you define in a column family is the name and the key sort options (there is no schema).

Personally, I think that column family databases are probably the best proof of leaky abstractions. Just about everything in CFDB (as I’ll call them from now on) is based around the idea of exposing the actual physical model to the users so they can make efficient use of that.

Column families – A column family is how the data is stored on the disk. All the data in a single column family will sit in the same file (actually, set of files, but that is close enough). A column family can contain super columns or columns.

A super column is a dictionary, it is a column that contains other columns (but not other super columns).

A column is a tuple of name, value and timestamp (I’ll ignore the timestamp and treat it as a key/value pair from now on).

It is important to understand that when schema design in a CFDB is of outmost importance, if you don’t build your schema right, you literally can’t get the data out. CFDB usually offer one of two forms of queries, by key or by key range. This make sense, since a CFDB is meant to be distributed, and the key determine where the actual physical data would be located. This is because the data is stored based on the sort order of the column family, and you have no real way of changing the sorting (except choosing between ascending or descending).

The sort order, unlike in a relational database, isn’t affected by the columns values, but by the column names.

Let assume that in the Users column family, in the row “@ayende”, we have the column “name” set to “Ayende Rahine” and the column “location” set to “Israel”. The CFDB will physically sort them like this in the Users column family file:

@ayende/location = “Israel”
@ayende/name = “Ayende Rahien”

This is because the sort “location” is lower than “name”. If we had a super column involved, for example, in the Friends column family, and the user “@ayende” had two friends, they would be physically stored like this in the Friends column family file:

@ayende/friends/arava= 945
@ayende/friends/rose = 14

Remember that, this property is quite important to understanding how things work in a CFDB. Let us imagine the twitter model, as our example. We need to store: users and tweets. We define three column families:

Users – sorted by UTF8

Tweets – sorted by Sequential Guid

UsersTweets – super column family, sorted by Sequential Guid

Let us create the user (a note about the notation: I am using named parameters to denote column’s name & value here. The key parameter is the row key, and the column family is Users):

In this case, the key doesn’t matter, but it does matter that it is sequential, because that will allow us to sort of it later.

Both rows have different data columns on them.

We don’t actually have any way to associate a user to a tweet.

That last bears some talking about. In a relational database, we would define a column called UserId, and that would give us the ability to link back to the user. Moreover, a relational will allow us to query the tweets by the user id, letting us get the user’s tweets. A CFDB doesn’t give us this option, there is no way to query by column value. For that matter, there is no way to query by column (which is a familiar trick if you are using something like Lucene).

Instead, the only thing that a CFDB gives us is a query by key. In order to answer that question, we need the UsersTweets column family:

And now we need more explanation about the notation. Here we insert into the UsersTweets column family, to the row with the key: “@ayende”, to the super column timeline two columns, the name of each column is a sequential guid, which means that we can sort by it. What this actually does is create a single row with a single super column, holding two columns, where each column name is a guid, and the value of each column is the key of a row in the Tweets table.

Question: Couldn’t we create a super column in the Users’ column family to store the relationship? Well, yes, we could, but a column family can contain either columns or super columns, it cannot contain both.

In essence, we execute two queries, one on the UsersTweets column family, requesting the columns & values in the “timeline” super column in the row keyed “@ayende”, then execute another query against the Tweets column family to get the actual tweets.

Because the data is sorted by the column name, and because we choose to sort in descending order, we get the last 25 tweets for this user.

What would happen if I wanted to show the last 25 tweets overall (for the public timeline)? Well, that is actually very easy, all I need to do is to query the Tweets column family for tweets, ordering them by descending key order.

Nitpicker corner: No, there is not such API for a CFDB for .NET that I know of, I made it up so it would be easier to discuss the topic.

Why is a CFDB so limiting?

You might have noticed how many times I noted differences between RDBMS and a CFDB. I think that it is the CFDB that is the hardest to understand, since it is so close, on the surface to the relational model. But it seems to suffer from so many limitations. No joins, no real querying capability (except by primary key), nothing like the richness that we get from a relational database. Hell, Sqlite or Access gives me more than that. Why is it so limited?

The answer is quite simple. A CFDB is designed to run on a large number of machines, and store huge amount of information. You literally cannot store that amount of data in a relational database, and even multi-machine relational databases, such as Oracle RAC will fall over and die very rapidly on the size of data and queries that a typical CFDB is handling easily.

Do you remember that I noted that CFDB is really all about removing abstractions? CFDB is what happens when you take a database, strip everything away that make it hard to run in on a cluster and see what happens.

The reason that CFDB don’t provide joins is that joins require you to be able to scan the entire data set. That requires either someplace that has a view of the whole database (resulting in a bottleneck and a single point of failure) or actually executing a query over all machines in the cluster. Since that number can be pretty high, we want to avoid that.

CFDB don’t provide a way to query by column or value because that would necessitate either an index of the entire data set (or just in a single column family) which in again, not practical, or running the query on all machines, which is not possible. By limiting queries to just by key, CFDB ensure that they know exactly what node a query can run on. It means that each query is running on a small set of data, making them much cheaper.

It requires a drastically different mode of thinking, and while I don’t have practical experience with CFDB, I would imagine that migrations using them are… unpleasant affairs, but they are one of the ways to get really high scalability out of your data storage.

Waiting expectantly to the commenters who would say that relational databases are the BOMB and that I have no idea what I am talking about and that I should read Codd and that no one really need to use this sort of stuff except maybe Google and even then only because Google has no idea how RDBMS work (except maybe the team that worked on AdWords).

something that is still an enigma to me is how the data is "synchronized" across machines so the results are "consistent". I quote the terms because part of nosql is letting go of 100% synchronization and consistency.

take a service like google or social networking. If I search "ayende" I expect to find this website in the top 3 results. If I log into facebook or LinkedIn I expect that searching for someone would return the same results no matter what machine I was using or where I was physically located.

if the information is sharded across machines how is this information retrieved, correlated and presented in mere seconds with high accuracy? It can't query all the machines and the data cannot be duplicated across all machines.

Are results not consistent? if so why does the information appear consistent to me?

is all the data duplicated within a geographic location where by users in the USA hit cluster 1 while users in Europe would hit cluster 2? cluster 1 and 2 would eventually update each other but a user in user in USA would not query cluster 2.

the concept of how data is stored makes sense. The code/query to access the data makes sense. The missing piece is how the software and hardware interact if we are talking about multiple application servers communicating with multiple database servers.

The most exposure I have to physically distributed machines is reviewing Rhino.DHT configuration. Would these massive applications/services use a similar technique where by the the application is configured to keep in contact with multiple machines to resemble some form of consistency?

The whole world is not relational... There are plenty of cases where a non relational model would fit just fine. I am not quite sure why people are so obsessed over fitting that square peg into a round hole. Human nature I guess.

As in previous articles you seem to be confusing a DBMS's storage engine with it's surfaced data model.

Relational databases don't don't deal with rows, they deal with RELATIONS. How that is stored on disk is up to the implementer.

Column oriented data stores have been around since the 70's many of them are relational.

A relational database can store data in rows or columns or whatever the implementers desire, although most modern RDBMS use row based storage.

The first commercial column oriented data store was SybaseIQ, which also happens to be an ANSI compliant SQL server. You can do selects,joins,inserts,updates. It is relational and just so happens to use a column oriented store. BigTables research paper references SybaseIQ and C-Store as previous column oriented dbms.

This is directly from Google: "C-Store and Bigtable share many characteristics: both systems use a shared-nothing architecture and have two different data structures, one for recent writes, and one

for storing long-lived data, with a mechanism for moving

data from one form to the other. The systems differ

signicantly in their API: C-Store behaves like a

relational database, whereas Bigtable provides a lower

level read and write interface and is designed to support

many thousands of such operations per second per server.

C-Store is also a read-optimized relational DBMS,

whereas Bigtable provides good performance on both

read-intensive and write-intensive applications."

So how is it that column databases are not relational, when Google themselves say they can be?

In this article you are not describing column database concepts, you are simply describing Bigtables specific data model, which is a multi dimensional map that is implemented on a column based storage engine.

Still waiting explanation on how to turn MySQL's "non-relational mode" on, that supposedly Google is using for ad-words, since a relational db can't possible scale up that well

I haven't been able to find much information about C-Store, but it seems to be a research project focusing on performance. That indicate to me that it doesn't consider things like what happen when some machine fails.

There is a reason that BigTable and other CFDB went with their reduce feature model, because that allows them to avoid hitting CAP head on.

The difference between BigTable and C-store is one is relational and one is not, but they are both column oriented, does that article dispute something I described, because it seems to affirm it?

The guys who developed C-store went on to make Vertica, a commercial column oriented RDBMS that is actively sold today.

CAP is a red herring, it has nothing to do with the relational model or relational scaling. CAP defines limits on ANY distributed computer system.

A relational DBMS can give up any aspect of CAP to not be limited by it, just like a NoSQL db might, this does not break the relational model. Conversely a NoSQL db can adhere to all three tenets of CAP and be limited by it.

Again CAP != Relational those are separate concerns.

Ok so you made up a new new term "Column Family Databases" and then proceed to define what that term means. Google doesn't call Bigtable a column family database, but if you want to go ahead.

Heres is Google's definition of their data model:

"Data Model

A Bigtable is a sparse, distributed, persistent multidimensional

sorted map. The map is indexed by a row

key, column key, and a timestamp; each value in the map

is an uninterpreted array of bytes."

Column Families are one of Bigtables dimensions, so are Rows and Times Stamps, yet you are not calling it a Timestamp Db are you?

Sorry to nitpick, as a software engineer I tend to pay attention to small details like what the relational model is and what it is not.