Columnstore Indexes – part 84 (“Practical Dictionary Cases”)

In this blog post I want to focus on two practical aspects of Columnstore dictionary compression – ones I have come across in the past months and years of working with Columnstore Indexes.
The information found in this blog about dictionary behaviour applies to SQL Server 2012-2016 and, where applicable, to both clustered and nonclustered columnstore indexes. Naturally, before SQL Server 2016 you can't update a nonclustered columnstore index, and to get the same results in SQL Server 2012 you would need to copy & reinsert the data in the same order as before.

The Sequences

In mathematics, a sequence is an enumerated collection of objects in which repetitions are allowed. Some examples of sequences are the prime numbers, the Catalan numbers, or the Fibonacci numbers.
A sequence does not have to have a constant delta between its values – a good example is the Fibonacci sequence, where the sum of the previous two numbers produces the next number in the sequence.

In relational databases, sequences are represented in a number of different ways – they can be generated manually or with built-in functions. In SQL Server we can generate simple sequences with the help of two features: IDENTITY and SEQUENCE.

Consider the script below, which creates a table with a single integer column holding an ever-increasing sequence of values, thanks to the IDENTITY property specified in the definition.
After filling 2 full Row Groups, the script finishes with the creation of a Clustered Columnstore Index (on SQL Server 2012 you would need to create a Nonclustered Columnstore Index instead):
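A minimal sketch of such a setup script could look like this (the table and index names are my assumptions, not the original script):

```sql
-- Sketch: a single-column table with an ever-increasing IDENTITY sequence.
-- Table and index names are assumptions.
create table dbo.BatchTest (
	c1 int identity(1,1) not null );

-- Fill 2 full Row Groups (2 x 1,048,576 rows)
set nocount on;
declare @i int = 0;
begin transaction;
while @i < 2 * 1048576
begin
	insert into dbo.BatchTest default values;
	set @i += 1;
end;
commit;

create clustered columnstore index CCI_BatchTest
	on dbo.BatchTest;
```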

As you can see in the results, the engine does not recognise the sequence, nor does it apply any special compression to it. Unlike some other engines, such as Hive ORC, there is nothing that compresses simple integer sequences, which is a pity. What is more, the RLE compression that the engine does apply is not very effective, as the script below will show.

Let's create the same sequence of values for 2 Row Groups, this time through a manual process, and set the data type to bigint to see if it affects the compression in any way.
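A sketch of this manual variant (the table dbo.BatchTest2 is referenced later in the post; the rest of the names are my assumptions):

```sql
-- Sketch: the same 2-Row-Group sequence, inserted manually as bigint.
create table dbo.BatchTest2 (
	c1 bigint not null );

set nocount on;
declare @i bigint = 1;
begin transaction;
while @i <= 2 * 1048576
begin
	insert into dbo.BatchTest2 (c1) values (@i);
	set @i += 1;
end;
commit;

create clustered columnstore index CCI_BatchTest2
	on dbo.BatchTest2;
```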

As you can see below, the results are exactly the same as with the automated IDENTITY:

No matter whether we use integer or bigint, the engine performs a smart base compression: it subtracts the base value from the beginning of the sequence and stores & compresses the difference from it. You can find the base value exposed in the sys.column_store_segments DMV.
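A query along these lines (a sketch, not the original script) exposes the base value and value range per segment:

```sql
-- Base value and value range per segment (adjust the table name to your own):
select s.segment_id, s.encoding_type, s.base_id, s.magnitude,
       s.min_data_id, s.max_data_id, s.on_disk_size
from sys.column_store_segments s
	inner join sys.partitions p
		on s.hobt_id = p.hobt_id
where p.object_id = object_id('dbo.BatchTest2')
order by s.segment_id;
```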

Even though we are talking about just 2.6 MB of occupied space in this case, I think the opportunity to compress these values should not be given up so easily.
Let's see how the engine behaves if we convert those values directly to a variable-length character field (10 characters will cover the maximum value of an integer):

alter table dbo.BatchTest2
alter column c1 varchar(10) not null;

It takes a couple of seconds to run the above statement on my VM, because the engine removes all Segments from all Row Groups and creates new compressed Segments for all of them. This means that any operation on a big Columnstore table that affects the data type of a number of columns will take a significant time to execute.

Let's see if this change from bigint to varchar had any effect:
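A simple sizing query (a sketch under my assumed table name) lets you compare the occupied space before and after the conversion:

```sql
-- Total segment size for the column, to compare before vs. after (sketch):
select s.column_id,
       cast(sum(s.on_disk_size) / 1024. / 1024 as decimal(9,2)) as size_in_mb
from sys.column_store_segments s
	inner join sys.partitions p
		on s.hobt_id = p.hobt_id
where p.object_id = object_id('dbo.BatchTest2')
group by s.column_id;
```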

There are still no dictionaries – and trying to rebuild this table will have no effect at all – but take a look at the size of the segments: it was lowered by almost 40%, to ~1.6 MB!

This technique is very effective if you are compressing columns that you access rarely – it should be considered for log tables, for example.
Also notice that Columnstore Archival compression will not bring any significant change here: the original 2.6 MB drops to 2.42 MB, while the varchar column gets no further improvement, putting the difference at around 32%.

Warning: do NOT use this technique without understanding the consequences – processing such columns will be less effective, since Predicate Pushdown will work only in a very limited way, and Segment Elimination will not work at all.

I have previously expressed my opinion on Twitter about sequences and their compression, and I still believe it would be an interesting piece of functionality to include in the engine.

Default Values

For a long time now, SQL Server has allowed us to add new NULL and NOT NULL columns with default values to existing tables, and I thought it was important to share an example of what I have observed.

We can add a new column with a default value with the following statement:
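A sketch of such a statement (the column and constraint names are my assumptions):

```sql
-- Add a NOT NULL column with a default constraint to the columnstore table.
alter table dbo.BatchTest2
	add c2 int not null
		constraint DF_BatchTest2_c2 default (0);
```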

What happens under the hood is that the SQL Server engine adds a new global dictionary for the default constraint we have specified for the new column, and depending on whether the column is NULL or NOT NULL, it will or will not initialise the compressed segments with the dictionary values.

To prove that point, here are the details of the segments for our table:
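Queries along these lines (a sketch; adjust the table name) show the segments and dictionaries per column:

```sql
-- Segments per column after adding the new column (sketch):
select s.column_id, s.segment_id, s.row_count, s.on_disk_size
from sys.column_store_segments s
	inner join sys.partitions p
		on s.hobt_id = p.hobt_id
where p.object_id = object_id('dbo.BatchTest2')
order by s.column_id, s.segment_id;

-- Dictionaries per column, including the new global dictionary (sketch):
select d.column_id, d.dictionary_id, d.type, d.entry_count, d.on_disk_size
from sys.column_store_dictionaries d
	inner join sys.partitions p
		on d.hobt_id = p.hobt_id
where p.object_id = object_id('dbo.BatchTest2');
```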

You can see that adding the new column has created new segments for each of the existing row groups, so if you are adding new columns to a huge columnstore table you will need to plan accordingly.

Another interesting technique to consider is initialising the column with the (vastly predominant) same value at the end of the ETL process, and ensuring afterwards that the rest of the values are correctly set. This way the data loading process goes significantly faster, without hammering the transaction log. Of course, you will pay by having all of your data moved into the Delta-Store, so you need to be extremely careful when doing the calculations.