SSAS Tabular – Fact Table Key Columns Datatype Impact

Intro

I’m working on a solution where the warehouse uses string codes as the keys for dimensions. As a result, the fact table is a multi-column table in which most of the key columns are strings (nvarchar(30)). We have around a billion records and would like to extend the solution, but we are running into memory issues during processing. One possible reason is simply that we use nvarchar strings, as data has to sit uncompressed in memory for some time – at least the size of two segments, since one segment is being compressed while another waits in the read queue. The questions I wanted to test:

Is there a difference in speed and memory consumption when we use varchar instead of nvarchar?

What is the difference in memory, duration and model size when we re-code the dimensions and use surrogate integer keys?
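The second question implies re-coding every string business key into an integer surrogate key before loading the fact table. A toy sketch of that re-coding step (the codes and amounts are made up; in practice this happens in the warehouse ETL, not in Python):

```python
# Toy illustration of replacing string business keys with surrogate integers.
facts = [("AB-001", 10), ("XY-930", 20), ("AB-001", 30)]  # (code, amount), made-up data

surrogate = {}                 # string code -> integer surrogate key
for code, _ in facts:
    surrogate.setdefault(code, len(surrogate))

dim = [(sk, code) for code, sk in surrogate.items()]       # the new dimension table
fact_int = [(surrogate[code], amount) for code, amount in facts]
print(dim)        # [(0, 'AB-001'), (1, 'XY-930')]
print(fact_int)   # [(0, 10), (1, 20), (0, 30)]
```

The fact table then carries only small integers, and the string codes live once per distinct value in the dimension.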

For the test I created artificial data with the value distribution described below. The test is deliberately extreme, so that any potential difference shows up clearly.

There is not much of a difference between nvarchar and varchar – this is basically expected, as SSAS Tabular does not actually have a varchar type and internally handles everything as Unicode strings.

Regarding memory consumption, there is not much of a difference between INT and NVARCHAR. This was a surprise for me: I would expect the uncompressed segments held in memory to follow a byte-based calculation, so nvarchar should have much higher consumption. Any thoughts on this?
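For reference, the naive byte-based expectation looks like this (all figures are assumptions: the SSAS Tabular default segment size of 8,388,608 rows, a single key column, nvarchar(30) fully used at 2 bytes per character vs. a 4-byte int):

```python
# Naive byte-based estimate of uncompressed memory per segment
# (assumed SSAS Tabular default segment size: 8,388,608 rows).
SEGMENT_ROWS = 8 * 1024 * 1024

nvarchar_seg = SEGMENT_ROWS * 30 * 2   # 30 chars * 2 bytes/char -> ~480 MiB
int_seg = SEGMENT_ROWS * 4             # fixed 4-byte int        -> ~32 MiB
print(f"nvarchar: {nvarchar_seg / 1024**2:.0f} MiB, int: {int_seg / 1024**2:.0f} MiB per segment")
```

By this arithmetic a fully used nvarchar(30) key should need roughly 15× the memory of an int key per segment, which makes the observed parity puzzling.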

Processing is much faster for the INT-based fact table. The magnitude of the difference surprised me, but further investigation confirmed it was mostly down to the speed of the data load on my virtual machine. When tested on our prod-like environment with a wider table, the difference was only about 10%, purely from the data-load-speed perspective (int vs. nvarchar).

Dictionary size can differ between processing runs – once 300 MB, once 450 MB – and I’m not sure about the exact reason. I wouldn’t be surprised if the data size differed with each processing run because a different sort order was chosen, but why the dictionary size?

There is not much of a difference in model size between INT and nvarchar. On the other hand, the encoding of the INT fact key column is value-based, which means a near-zero-size dictionary. (This needs to be requested via an encoding hint, available in SSAS 2017 only; otherwise you will get hash encoding despite the INT datatype.) The effect would be more visible with a bigger fact table, so an INT-based model should ultimately be smaller.
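For a model at compatibility level 1400 (SSAS 2017), the hint is set per column in the model definition (TMSL) via the `encodingHint` property. A fragment of what the fact key column definition might look like – the column name here is made up:

```json
{
  "name": "ProductKey",
  "dataType": "int64",
  "sourceColumn": "ProductKey",
  "encodingHint": "Value"
}
```

Note the hint is only a request: the engine can still fall back to hash encoding if value encoding turns out not to fit the data.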

There is a significant saving in DB storage when using smaller datatypes. No surprise here.
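A rough per-key-column estimate for the billion-row table illustrates the scale (assumptions: nvarchar(30) fully used at 2 bytes per character plus a 2-byte length prefix, vs. a fixed 4-byte int; row overhead, indexes and compression ignored):

```python
# Rough per-column storage estimate for 1e9 rows in SQL Server
# (row overhead, indexes and compression deliberately ignored).
ROWS = 1_000_000_000
AVG_CHARS = 30                                 # assuming nvarchar(30) keys fully used

nvarchar_bytes = ROWS * (AVG_CHARS * 2 + 2)    # 2 bytes/char + 2-byte length prefix
int_bytes = ROWS * 4                           # fixed 4-byte int
print(f"nvarchar(30): ~{nvarchar_bytes / 1e9:.0f} GB per key column")
print(f"int:          ~{int_bytes / 1e9:.0f} GB per key column")
```

That is roughly 62 GB vs. 4 GB per key column, multiplied by however many key columns the fact table has.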

It seems order matters – processing the fact table first and then the satellite tables is slightly faster (tested several times).

The Analysis Services process is not willing to give up all memory after processing finishes. (This is why I restarted the service after each test and measured the difference between memory before processing started and memory after the restart, assuming the delta is the real model size in memory.)

Next time I’ll try to compare query duration between the nvarchar- and INT-based models to see the potential impact on queries. Feel free to leave a comment or contact me if you have any input or experience with this topic.