This blog post focuses on some of the important details of loading data into Columnstore Indexes and the things that have changed in SQL Server 2016.

Delta-Store Compression

One of the key ways of loading data fast into a table is to keep as few additions over the basic structure as possible – no indexes and no compression, if getting data in is all that matters to you.

One of the key implementation limitations that I have never managed to understand was the fact that in Clustered Columnstore Indexes for SQL Server 2014 the Delta-Stores were compressed with Page Compression.
I somehow believe that the original decision was made based on the amount of space that the Delta-Stores would occupy, but that was the wrong factor: no matter how fast you load the data, it will get compressed quite soon after the loading is finished, and Columnstore compression runs circles around any other compression type that you might find in SQL Server.
What the decision of using Page Compression on Delta-Stores did affect was the speed at which data gets into the Columnstore Index, because of those precious CPU cycles being spent on optimising and compressing the data.

SQL Server 2016 brings a great range of improvements for data loading, such as Parallel Data Insertion, but there is one particular improvement, added in CTP 2.4 quite under the hood, that could potentially make a good difference in a number of cases, especially if you are using very wide tables with a lot of textual columns.
The improvement is the removal of compression on the Delta-Stores, and let me share with you that I am extremely happy about it.

Let’s explore a simple example, based on the free ContosoRetailDW database.

Let’s restore a fresh copy of this free database, dropping the primary key and then adding a Clustered Columnstore Index on the FactOnlineSales table:
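A sketch of the setup could look like this (the backup and file paths, as well as the constraint and index names, are assumptions for your environment):

```sql
USE master;
-- Restore a fresh copy of the free ContosoRetailDW database
-- (adjust the backup and data file paths to your environment)
RESTORE DATABASE ContosoRetailDW
    FROM DISK = N'C:\Install\ContosoRetailDW.bak'
    WITH MOVE N'ContosoRetailDW2.0' TO N'C:\Data\ContosoRetailDW.mdf',
         MOVE N'ContosoRetailDW2.0_log' TO N'C:\Data\ContosoRetailDW.ldf',
         REPLACE;
GO

USE ContosoRetailDW;
-- Drop the primary key (the constraint name may differ in your copy)
ALTER TABLE dbo.FactOnlineSales
    DROP CONSTRAINT PK_FactOnlineSales_SalesKey;

-- Add the Clustered Columnstore Index
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactOnlineSales
    ON dbo.FactOnlineSales;
```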

Now we can safely load some data into the Delta-Stores. In order to make use of the Delta-Stores, we shall use rather small transactions of just 100,000 rows (staying below 102,400 rows guarantees that we are not loading data directly into the Columnstore compressed Segments).
For this test I will load the data in 22 batches:
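Something along these lines should do the trick (it assumes dbo.FactOnlineSales_CCI is an empty copy of the fact table that already carries a Clustered Columnstore Index and has no identity column):

```sql
SET NOCOUNT ON;
DECLARE @batch INT = 1;

WHILE @batch <= 22
BEGIN
    -- 100,000 rows stay below the 102,400-row threshold,
    -- so every batch lands in a Delta-Store
    INSERT INTO dbo.FactOnlineSales_CCI
    SELECT TOP (100000) *
      FROM dbo.FactOnlineSales
     ORDER BY OnlineSalesKey;

    SET @batch += 1;
END;
```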

This procedure took on average around 34 seconds on my test VM with SQL Server 2016 CTP 2.3, where data was loaded the same way as in SQL Server 2014 – Delta-Stores compressed with Page compression.

In SQL Server 2016 we have a couple of new DMVs which allow us a deeper insight into the internals of the Columnstore Indexes – one of them is sys.internal_partitions. Let's consult its output to check on the compression applied to the Delta-Stores that we have used:
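A query like the following should do (the table name is assumed from the examples above):

```sql
SELECT OBJECT_NAME(object_id) AS TableName,
       internal_object_type_desc,
       row_group_id,
       rows,
       data_compression_desc
  FROM sys.internal_partitions
 WHERE object_id = OBJECT_ID('dbo.FactOnlineSales_CCI');
```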

Notice that you will need to do it relatively fast, unless you want the Tuple Mover to compress your Delta-Store before you even realise it :)
You can clearly see, without any usage of unsupported DBCC commands, what kind of compression is applied here – Page compression.

Let's do the same operation on the most recent SQL Server version available at the moment – SQL Server 2016 CTP 2.4: this time it took just around 16 seconds, less than 50% of the time spent on the previous SQL Server versions (even SQL Server 2016 CTP 2.3!).
We can clearly verify this situation through the new sys.internal_partitions DMV: no compression is being used on the Delta-Stores, which makes loading data into them really fast:

As you can see on the image, the difference between using and not using Page Compression on the Delta-Stores is absolutely huge. Bear in mind that the table I am loading is not very wide, and that the impact on a table with a lot of columns can be enormous.

That was all very good news, but let's compare the impact of the compression on the disk size. For that purpose I will use the following query, taking advantage of the sys.dm_db_column_store_row_group_physical_stats DMV, another new kid on the block in SQL Server 2016:
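The query could look roughly like this, listing every Row Group and Delta-Store of our test table and summing their sizes:

```sql
SELECT OBJECT_NAME(object_id) AS TableName,
       row_group_id,
       state_desc,
       total_rows,
       CAST(size_in_bytes / 1024. / 1024 AS DECIMAL(9,2)) AS SizeInMB
  FROM sys.dm_db_column_store_row_group_physical_stats
 WHERE object_id = OBJECT_ID('dbo.FactOnlineSales_CCI')
 ORDER BY row_group_id;

-- Total size of the table's Row Groups and Delta-Stores
SELECT CAST(SUM(size_in_bytes) / 1024. / 1024 AS DECIMAL(9,2)) AS TotalSizeInMB
  FROM sys.dm_db_column_store_row_group_physical_stats
 WHERE object_id = OBJECT_ID('dbo.FactOnlineSales_CCI');
```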

For SQL Server 2016 CTP 2.3 we have quite a nice result, totalling around 86 MB:

For SQL Server 2016 CTP 2.4, where our Delta-Stores are not compressed at all, things go quite crazy: the total occupied space jumps to more than 350 MB!

A picture is worth 1000 words: it is very visible that the occupied space went up ballistically, taking more than 4 times the space we are used to. This has a big impact on the overall system, meaning that the system should be well prepared for receiving much bigger amounts of data – which could present a problem in some less prepared cases, because people have already gotten used to the way the system functions.

I believe that it is important to have an option for controlling this behaviour. While in most cases loading data into an uncompressed Delta-Store will have a huge positive impact on the loading times, being able to say "I have enough time to spend on the loading process, but space is the issue" is definitely very important.
I hope that Microsoft will consider this option.

Together with parallelisation of the data insertion, I think that upgrading to SQL Server 2016 from SQL Server 2014 might bring a lot of happy emotions on the faces of Business Intelligence professionals :)

Vertipaq compression optimisation by Tuple Mover

In SQL Server 2016 CTP 2.4 I have noticed something that I had spotted in the online documentation a couple of months ago – the sys.dm_db_column_store_row_group_physical_stats DMV has a very interesting column, has_vertipaq_optimization, which was documented back then but was absent from the view until now.
In Columnstore Indexes – part 56 ("New DMV's in SQL Server 2016"), I gave a basic description of this column, but now it's time to put it to the test.

There are 2 ways of getting data from Delta-Stores into compressed Segments:
- Letting the automated Tuple Mover find closed Delta-Stores and compress them
- Invoking the ALTER INDEX ... REORGANIZE command, which will compress all closed Delta-Stores
(note that we can also use an undocumented hint, described in Columnstore Indexes – part 33 ("Tuple Mover that closes open Delta-Stores"), to close even the open Delta-Stores)
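For reference, the hint from part 33 looks like this (the index name is an assumption from the earlier examples):

```sql
-- Compresses all Row Groups, including the still-open Delta-Stores
ALTER INDEX CCI_FactOnlineSales_CCI
    ON dbo.FactOnlineSales_CCI
    REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);
```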

Also, in SQL Server 2016 we can have just one single Clustered Columnstore Index on a table but multiple Nonclustered B-Tree Indexes alongside it, so we need to take those scenarios into account as well.
Regarding Operational Analytics, I will be blogging very heavily about its internals in the next 2 weeks, and so I will revisit this scenario after explaining them.

For this test, let's reload our test table dbo.FactOnlineSales_CCI with the same amount of data and see what changes we can notice by invoking the ALTER INDEX ... REORGANIZE command and also by letting the automatic Tuple Mover do its work.

Now, let's wait for a couple of minutes (giving the Tuple Mover 5 minutes to do its job on the closed Delta-Stores).
We can take a look at the sys.dm_db_column_store_row_group_physical_stats DMV and check if the Vertipaq optimisation was applied, while measuring its results by consulting the final size of the Row Groups:
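The check could be as simple as (table name as in the previous examples):

```sql
SELECT row_group_id,
       state_desc,
       total_rows,
       has_vertipaq_optimization,
       CAST(size_in_bytes / 1024. / 1024 AS DECIMAL(9,2)) AS SizeInMB
  FROM sys.dm_db_column_store_row_group_physical_stats
 WHERE object_id = OBJECT_ID('dbo.FactOnlineSales_CCI')
 ORDER BY row_group_id;
```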

The automated Tuple Mover takes advantage of the Vertipaq optimisations automatically, compressing Row Groups down to ~8 MB of space from the original 180 MB. This is the behaviour one would expect from a system with the default configuration. Notice that only compressed Row Groups have a bit value set for the Vertipaq optimisation; if we are dealing with a Delta-Store, the value will be NULL.

Let's compare this behaviour with the manually invoked Tuple Mover. The main difference between the two is the amount of resources and parallelism given to the manual operation, while the automated one runs very lightly in a single-threaded mode, with a sleeping delay of 15 seconds after processing every Row Group:
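The manual invocation itself is just the Reorganize command (index name assumed from the earlier examples):

```sql
-- Manually invoke the Tuple Mover on all closed Delta-Stores
ALTER INDEX CCI_FactOnlineSales_CCI
    ON dbo.FactOnlineSales_CCI
    REORGANIZE;
```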

Let's consult the DMV again, to see if the manual invocation would take advantage of the Vertipaq optimisations: we can clearly see the usage of the Vertipaq compression, especially taking into account the space that our compressed Row Groups are occupying. This is what everyone expects from the manual Tuple Mover – it spends all available resources and is as effective as the engine could possibly be.

Clustered Columnstore Indexes with Nonclustered B-Tree Indexes

After determining how Vertipaq compression is used on a pure Clustered Columnstore table, let's consider a table with a Clustered Columnstore Index plus a secondary Nonclustered B-Tree Index, and let's call it dbo.FactOnlineSales_CCI_Ind:
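A sketch of such a table could look like this (the index names and the column choice for the B-Tree index are assumptions):

```sql
-- Create a copy of the fact table
SELECT *
  INTO dbo.FactOnlineSales_CCI_Ind
  FROM dbo.FactOnlineSales;

-- Add the Clustered Columnstore Index
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactOnlineSales_CCI_Ind
    ON dbo.FactOnlineSales_CCI_Ind;

-- Add a secondary Nonclustered B-Tree Index
CREATE NONCLUSTERED INDEX IX_FactOnlineSales_CCI_Ind_OnlineSalesKey
    ON dbo.FactOnlineSales_CCI_Ind (OnlineSalesKey);
```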

This time you can clearly see that the Vertipaq optimisation was not activated: the column "has_vertipaq_optimization" is set to 0, and the total size of the compressed Row Groups is definitely bigger, compared to the size of the Row Groups that were compressed with Vertipaq optimisations.

Even though the size of a Row Group went from 8 MB to 15 MB, the compression is still very successful in my opinion, lowering the total size per Row Group around 10 times – from around 180 MB of uncompressed Delta-Stores to 15-16 MB of compressed data – very fast.
The Vertipaq optimisation step is skipped in order to lower the impact on the Mapping Index (the connection element between the Columnstore and RowStore structures), and by skipping this step we get improved compression times.
At the moment we do not have any control whatsoever over whether we keep it the way it is implemented now, or, if we have enough time, compress the Columnstore Row Groups to the maximum.

Final Thoughts

Think twice before adding Nonclustered B-Tree Indexes to your Clustered Columnstore table, because your Columnstore Indexes will definitely suffer the impact – the compression gets less effective, and keeping the Mapping Index up to date will consume more resources than on a pure Clustered Columnstore table.
You want to have Foreign Keys because they are so cool – beware that some things will get slower with the addition of new structures, simply because there is no free ride.

Will it stay like that forever? Can it be optimised back?
Just rebuild the table/partition:
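For example (names assumed from the earlier examples; the PARTITION clause applies to partitioned tables):

```sql
-- Rebuild the whole table...
ALTER TABLE dbo.FactOnlineSales_CCI_Ind REBUILD;

-- ...or rebuild the Columnstore Index for a single partition:
ALTER INDEX CCI_FactOnlineSales_CCI_Ind
    ON dbo.FactOnlineSales_CCI_Ind
    REBUILD PARTITION = 1;
```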

You can see that every single compressed Row Group is taking advantage of the Vertipaq optimisation. Notice that the engine decided to shuffle the data between Row Groups for the FactOnlineSales_CCI table, and now it has 2 Row Groups with ~415,000 rows, but the number of rows in the table is the same and the total size is very similar.
Also, once again, notice that the dm_db_column_store_row_group_physical_stats DMV gives us precise reasons for the Row Group trimming – something that in SQL Server 2014 was only available through the usage of Extended Events.

This is possible, but we shall apply Segment Clustering for that with MAXDOP = 1, and even in this case Vertipaq reorders the rows within a segment, so the gains are quite marginal.
I have seen gains in local aggregation because of the Segment Clustering, which results in lower processing times, but still, this internal segment ordering is something that seems to be under the total control of Vertipaq.