Vertica and Projections

I read a fascinating paper here by folks involved with the original C-Store project that was the genesis of Vertica (thanks to this blog for providing the link to this paper). I hadn’t realized that Vertica essentially sprang right out of academia, and it’s really interesting to see how the original research concepts and principles translated to production code. Most interesting is that many of the principles held up!

One aspect of Vertica that I read mixed reviews about was the notion of a Projection. ParAccel says that projections are a crutch that Vertica uses to improve performance, Vertica says that projections are a misunderstood core feature. This confusion about projections is so rampant that Vertica wrote a three-part blog post to explain what projections are and why they’re good.

The summary on projections is as follows. Think about a regular relational table. In most relational databases, a plain table will be stored as a heap, which is an unordered pile of rows. Fast to insert into (just throw the inserted row on the pile), generally slow to query from (look at every row to find what you need). To access the data quicker in this standard scenario, you could:

1) Store the data in some sort of ordered manner instead of in a heap. The ordering will give most queries a shot at not having to scan every row.

and/or

2) Build indexes on top of the data that order the data in various ways that are useful to common queries.

#1 above is what is called a Clustered Index in SQL Server or an Index Organized Table in Oracle, and it is a good technique for improving the performance of most queries. There’s a penalty during data insertion because the trees need to be split and rebalanced, but the payoff usually outweighs the cost.

#2 above is the standard approach of ordering the data and storing pointers into the heap for data retrieval. There are variations of this, such as including additional columns in the index, that allow index access to fully satisfy the query without needing to follow the pointers to the actual data rows.

With this background, we’re close to being able to talk about Vertica’s projections. The final piece is to talk about Materialized Views. MV’s are queries (views) whose results are persisted in the database for future access. They are very powerful in relational databases because they can act as aggregate tables, denormalizations, and other precalculations without the original query directly accessing them – the optimizer simply uses them if they contain the data has been requested.

So we’ve got three notions: data stored in a heap with indexes for fast access, a Clustered Index or IOT where the data is stored in an optimized way so that indexes may not be needed, and materialized views where data is stored in a heap/CI/IOT and then stored again in a form that supports querying. How do Vertica’s projections fit in here?

Projections are the actual data storage mechanisms in Vertica. They’re not indexes on top of data, they are the data. They’re like clustered indexes in that the data is stored in sorted order to enable excellent compression and fast retrieval. They’re like materialized views in that the collection of columns they contain is arbitrarily defined by the admin. They’re like covering indexes in that excellent projections contain all the columns a query needs.

At one extreme, each column of data is stored just once. At the other extreme, various columns are stored multiple times due to their membership in different projections. Let’s look at an example. Below is a transaction table consisting of three projections:

Together, all the columns in the table are represented in at least one projection. The projections themselves are useful subsets of the overall set of columns in the table, such that the query optimizer could choose which projection contains most or all of the columns that need to be processed by the query. Each projection stores the actual data of the columns it contains, so all projections take up space.

Vertica diverges from theory in the sense that it defines a fourth (or zeroeth?) projection, the superprojection. The superprojection contains every column in the table and ensures that there’s a baseline storage and retrieval mechanism for Vertica to use without the data modeler having to explicitly define it. All the other projections are in addition to the super projection.

And now, finally, we can circle back to the mixed reviews on Vertica’s projections.

“The idea that Vertica needs a special projection for every query in order to get good performance is just wrong.” (from the link above)

This statement is in response to ParAccel and others telling prospects that Vertica is compensating for slow queries by building special ‘materialized views’ that pre-store only the data that is needed for queries. Instead, the argument is that a few well-chosen projections will result in very good performance for the majority of queries, and the standard superprojection will handle everything else. This is no different than indexes on SQL Server handling the majority of queries and the clustered index handling the rest.

There is an argument to be made that Vertica’s redundant storage of column data in the projections makes them like materialized views, but the counter to this statement is that the data compression is so significant that you can afford to carry around a few copies of a column.

Ultimately, it feels like Vertica’s core storage mechanism, the projection, is being used against it pretty well by competitors. I myself like the whole projection idea, although I think it would be ideal if the superprojection could be replaced by a complete set of projections that together represented each column in the table once and only once. This would allow you to provide an optimal ‘starter’ set of projections that would handle most queries efficiently without redundant column storage or need for additional projections in many cases.

Like this:

Post navigation

One thought on “Vertica and Projections”

Great article. In response to your wish that superprojections be “replaced by a complete set of projections that together represented each column in the table once and only once” – this was actually how Vertica was originally implemented but as described in their VLDB paper it required indexes to rejoin them which made that model more expensive than superprojections:
“C-Store uses a data structure called a join index to reconstitute
tuples from the original table using different partial
projections. While the authors expected only a few join indices
in practice, Vertica does not implement join indices
at all, instead requiring at least one super projection containing
every column of the anchoring table. In practice
and experiments with early prototypes, we found that the
costs of using join indices far outweighed their benefits. Join
indices were complex to implement and the runtime cost of
reconstructing full tuples during distributed query execution
was very high. In addition, explicitly storing row ids consumed
significant disk space for large tables. The excellent
compression achieved by our columnar design helped keep
the cost of super projections to a minimum and we have no
plans to lift the super projection requirement.”
– http://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf