In this blog, I will provide you with some basic information about column stores. Nothing I am writing here is vendor specific IP, but merely taken from the papers published throughout history. One of the best papers that serves as an introduction is by Stonebraker:

The idea is older than that though, with the first papers published in the 1970’ies.

Shamefully standing on the shoulders of giants, I will walk you through a simple example which illustrate one of the key principles of column stores: Run Length Encoding (RLE).

The Row Format

Let us first assume we have an imaginary bar, which sells alcohol to a few clients: Thomas, Christian and Alexei. The transactions for their night out look like this in row based format:

This is our “baseline” input data for the example.

A few things to notice already: We COULD have optimized this table “Kimball Style” by replacing the columns with corresponding integer keys. But that would have made the example harder to follow, so for now, just assume that we have chosen a binary representation that is optimal for the keys and which COULD (if we so wished) join out to smaller lookup tables to reconstruct the string values.

Turning rows directly into columns – a bad idea

There is another way to represent the table in row based format. What if we gave each row a unique number and vertically partitioned the table, starting like this:

And proceeding until we got this:

Obviously, this is a very poor implementation of the data.

Side Note: Once we are at this point we COULD proceed towards a 6th Normal form model where each unique entry in each of the vertically partitioned tables is directly linked (through a m-n relation) to the corresponding rows in the Date column (date would then collapse into a single row). This would essentially be what is known as anchor modeling. We could also proceed towards a Data Vault style model and normalise Customer and Product into separate hub/sat tables and link the data to them.

However, if we did this, reconstructing the original row would require a lot of joins. And we don’t like paying join premiums on large tables.

However, there is a way to proceed from here that is not only compressed, but also very CPU/cost efficient.

Turning columns into Run Lengths

Consider again the product column. There is a more compressed way to store it, namely this:

Notice something interesting here. We have NOT normalised the data. For example, there are still 2 x Vodka entries in there. The representation in Product’ is what we call a Run Length Encoding

What we have done is to store how many repetitions exist in the data for a given sort order of the data (namely the order in which we numbered the rows).

Already now dear reader, you can derive your first easy column store optimisation from the above: Don’t store the from/to row numbers, just store the number of repetitions. If we do that, we could potentially store each run length (for example “1-2, Beer”) as only two integers: one for the key of beer (with the character value “beer” put into a dimension table) and another for the number of repetitions.

If we proceed with the un-optimised example, we can collapse the table into this structure:

There are STILL repetitions in the data, the data is NOT being taken to a new normal form here. However, we only store the repetitions that are required for efficient processing of the data when we query it, as we shall see later.

Because the typical case for large tables is that they DO have a lot of repetitions, column stores can often achieve significant compression benefits. The quoted compression rates from different vendors (as compared with the raw format) is typically in the range of 5-50x – depending on what data you store of course.

Restoring rows from Columns

How can we then restore the columns into rows again? We proceed by a form of “pseudo merge join”

Notice that because of the way data is stored (with duplicates added in the right places), this “join” can proceed in a linear, O ( n ), highly parallelised manner, no matter which column we select. This data structure also has the property of making it easy to sequentially access memory. I have described why this is important here: The effect of CPU Caches and Memory Access Patterns.

Side Note: Loops joins are notoriously CPU inefficient (because they crawl memory in a random manner, paying CPU cycles to wait for memory every time a new level of a B-tree is accessed).

The storage formats furthermore allows quick filtering of data by simply jumping over things that don’t have the right values.

A lot of other interesting things can be done to optimise column stores even more – but that is the secret sauce of individual database vendors and an area for fierce competition these days.

Summary

I hope this blog has helped you better understand what a column store is and how this storage format assists with fast reconstruction of rows through linear scans of the data.

I hope it is also clear why such a storage format lends itself particularly well to very wide tables where only a subset of the columns are requested for each query (fact tables for example). With the above picture of what a run length encoding is, you should be able to reason more clearly about what is happening inside a column store database.

Interesting thoughts, but you left out one major statement: run-length encoding means the following:
1) REQUIRES the sequences be in order, otherwise you have fragmented ID cases to deal with – which causes nested sub-query loop iterations to resolve
2) Encoding data in run-length for storage greatly increases the ETL complexity by orders of magnitude.
3) no matter how hard you try, you STILL have to have joins somewhere, your query left out this particular notion and point blank produces a Cartesian product – all rows x all rows.
4) The join penalty you are avoiding must still be coded, with between clauses and join matches on where clauses. Otherwise you cannot reconstruct what the source data truly looks like.

1) Ordering is indeed implicit in column stores (as I wrote with: “for a given sort order of the data”). This means the data will always be traversed in that order, optionally with a bitmap indexes optimisation ruling out some disk blocks. Could you provide an example of the fragmentation you describe and what you believe is the issue?

2) Not really, no. Every modern column stores does this for you automatically in the database, not in the ETL tool. Some products, like Vertica, even does the compression in the background (keeping the newly added data in a hybrid and doing the merge).

3) I think you use the terms “join” and “Cartesian product” a bit vaguely here. Because data is always traversed in the same order, you do NOT get the cartesian explosion for a column store. In the same way that a merge join on a key/FK does not create a cartesian explosion either. In fact, because the data is always traversed in an ordered manner, you can provide upper time and space bounds (a very attractive options for scalable design) on the price of star joins.

4) I am not sure what your point is here. It is true that the relational store includes code (which is executed) that does the column to row reconstruction. But this happens as part of the query execution engine, it is not something the user does. As it happens, column store execution is also orders of magnitude faster than the equivalent loop join constructs required for traversing a traditional B-trees (like the ones DB2, Oracle and SQL Server uses). Any modern database worth its salt will do the row reconstruction for you, without the user needing to write any code. I am not sure, but are are you saying the user needs to do something or simply pointing out that the relational store must do some CPU work to reconstruct the rows? Also note that filters can be pushed down MUCH more efficiently in a column store than in a row store, as you only have to do the compares ONCE per run length. A similar optimisation is available for some aggregates (like SUM, COUNT, AVERGE). Even on raw scan of the data (without aggregates) and reconstructing all the rows, the price of tuple reconstruction is typically drowned by the I/O advantages of the linear column stores scan.

While I appreciate the compliment on the vision, there is not really any vision in here. That column stores are typically faster than traditional rows stores for all but the most wide row requests is pretty much just day to day reality and pretty well established. In the context you may find the following papers informative: