
First of all I have to thank Marco Russo for his blog post on Optimizing High Cardinality Columns in VertiPaq and also for his great session at SQL PASS Rally Nordic in Stockholm last year, which taught me a lot about columnar storage in general. I highly recommend reading both of these resources before continuing here. Most of the ideas presented in this post are based on those concepts and require at least basic knowledge of columnar storage.

When writing one of my recent posts I ended up with a Power Pivot model of roughly 40M rows. It contained internet logs from Wikipedia: how often a given page was clicked per month and how many bytes were downloaded. As you can imagine, those values can vary heavily, especially the number of bytes downloaded. So in the Power Pivot model we end up with a lot of distinct values in the column that we use for our measure. As you know from Marco's posts, the allocated memory, and therefore also the performance, of columnar storage systems is directly related to the number of distinct values in a column: the more distinct values, the worse. Marco already described an approach to split up a single column with a lot of distinct values into several columns with fewer distinct values in order to optimize storage. These concepts can also be applied to columns that contain measures, or numeric values in general. Splitting numeric values is quite easy: assuming your values range from 1 to 1,000,000, you can split the column into two by using integer division by 1,000 for the first column and MOD 1,000 for the second. Instead of one column with the value 123,456 you end up with two columns with the values 123 and 456. In terms of storage this means that instead of 1,000,000 distinct values we only need to store 2 x 1,000 distinct values. Nothing new so far.

The trick is to combine those columns again at query time to get the same results as before the split. For some aggregations like SUM this is pretty straightforward, others are a bit trickier. In general, though, the formulas are not very complex and can easily be adapted to handle any number of columns:
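As a minimal sketch, assuming the original [Value] column was split into the two hypothetical columns [Value_High] and [Value_Low] of a table called Facts, the basic aggregations could be rebuilt with measures along these lines (not necessarily the most optimized variants):

    // SUM can simply be rebuilt from the two partial sums
    Sum of Value := SUM ( Facts[Value_High] ) * 1000 + SUM ( Facts[Value_Low] )

    // COUNT and AVERAGE follow directly from SUM and the row count
    Count of Value := COUNTROWS ( Facts )
    Average of Value := DIVIDE ( [Sum of Value], [Count of Value] )

    // MIN and MAX are the trickier ones: they have to recombine the value row by row
    Min of Value := MINX ( Facts, Facts[Value_High] * 1000 + Facts[Value_Low] )
    Max of Value := MAXX ( Facts, Facts[Value_High] * 1000 + Facts[Value_Low] )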

As you can see, you can still mimic most kinds of aggregations even though the [Value] column is split up.

Though, don't overdo splitting your columns: too many may be inconvenient to handle and may negate the effect, resulting in worse performance. Marco already showed that you can get a reduction of up to 90% in size, and during my simple tests I came up with about the same numbers. However, it very much depends on the number of distinct values that you actually have in your column!

I would definitely not recommend always using this approach for all your measure columns! First check how many distinct values your data/measures contain and decide afterwards. For 1 million distinct values it is probably worth it; for 10,000 you may want to reconsider. Most importantly, test this pattern with your own data, data model and queries! Test it in terms of size and of course also in terms of performance. Splitting up columns may be faster, but it may also be slower, and the outcome may differ for each query that you execute against the Tabular model / Power Pivot. Again, test with your own data, data model and queries to get representative results!

Here is a little test that you can run on your own to check this behavior. Simply create the following Power Query using M, load the result into Power Pivot and save the workbook. It creates a table with 1 million distinct values (0 to 999,999) and splits this column into two. You can then copy the workbook, remove the last step “Remove Columns” and save it again to get the “original” workbook and Power Pivot model.
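A minimal sketch of such a query, assuming the two split columns are simply called Value_High and Value_Low, could look like this:

    let
        // generate one million distinct values: 0 to 999,999
        Source = List.Numbers(0, 1000000),
        ToTable = Table.FromList(Source, Splitter.SplitByNothing(), {"Value"}),
        ChangedType = Table.TransformColumnTypes(ToTable, {{"Value", Int64.Type}}),
        // split the value into a high and a low part with 1,000 distinct values each
        AddHigh = Table.AddColumn(ChangedType, "Value_High", each Number.IntegerDivide([Value], 1000), Int64.Type),
        AddLow = Table.AddColumn(AddHigh, "Value_Low", each Number.Mod([Value], 1000), Int64.Type),
        // the last step drops the original column; remove this step to get the "original" model
        #"Remove Columns" = Table.RemoveColumns(AddLow, {"Value"})
    in
        #"Remove Columns"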

Here we also get a reduction in size of 90%! Though, this is a special scenario… In the real world, taking my previous scenario with the Wikipedia data as an example, I ended up with a reduction to about 50% of the original size, which is still very good. But as you can see, the reduction factor varies a lot.

Some time ago Bob Duffy blogged about how to use Power Pivot to analyze the disk usage of multidimensional Analysis Services models (here). He uses a VBA macro to pull metadata like filename, path, extension, etc. from the file system, or to be more specific, from the data directory of Analysis Services. Analysis Services stores all its data in different files with specific extensions, so it is possible to link those files to multidimensional objects such as attributes, facts, aggregations, etc. Based on this data we can analyze how our data is distributed. Are our dimensions too big? Which attribute uses the most space? Do our facts consume most of the space (very likely)? If yes, how much of it is real data and how big are the aggregations (if they are processed at all)? These are very common and important things for an Analysis Services developer to know.

So Bob Duffy's solution can be really useful. The only thing I did not like about it was that it uses a VBA macro to get the data. This made me think, and I came up with the idea of using Power Query to get this data instead. By the way, make sure to check out the latest Power Query release; there have been a lot of improvements recently!

With Power Query you have the option to load multiple files from a folder, including its subfolders. When we do this on our Analysis Services data directory, we get a list of ALL files together with their full path, filename, extension and, most importantly in this case, their size, which can be found by expanding the Attributes record:
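A minimal sketch of such a query, using an example path for the data directory (adjust it to your own instance), could look like this:

    let
        // Folder.Files also returns the files of all subfolders
        Source = Folder.Files("C:\Program Files\Microsoft SQL Server\MSAS12.MSSQLSERVER\OLAP\Data"),
        // keep only the columns we are interested in
        SelectColumns = Table.SelectColumns(Source, {"Folder Path", "Name", "Extension", "Attributes"}),
        // the file size is a field of the Attributes record
        ExpandSize = Table.ExpandRecordColumn(SelectColumns, "Attributes", {"Size"}, {"Size"})
    in
        ExpandSize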

The final Power Query also does a lot of other things to prepare the data so it can later be joined to our FileExtensions table, which holds detailed information for each file extension. This table currently looks like the one below but can be extended with any other columns that may be necessary and/or useful for you:

| FileType | FileType_Description | ObjectType | ObjectTypeSort | ObjectTypeDetails |
|---|---|---|---|---|
| ahstore | Attribute Hash Store | Dimensions | 20 | Attribute |
| asstore | Attribute String Store | Dimensions | 20 | Attribute |
| astore | Attribute Store | Dimensions | 20 | Attribute |
| bsstore | BLOB String Store | Dimensions | 20 | BLOB |
| bstore | BLOB Store | Dimensions | 20 | BLOB |
| dstore | Hierarchy Decoding Store | Dimensions | 20 | Hierarchy |
| khstore | Key Hash Store | Dimensions | 20 | Key |
| ksstore | Key String Store | Dimensions | 20 | Key |
| kstore | Key Store | Dimensions | 20 | Key |
| lstore | Structure Store | Dimensions | 20 | Others |
| ostore | Order Store | Dimensions | 20 | Others |
| sstore | Set Store | Dimensions | 20 | Others |
| ustore | ustore | Dimensions | 20 | Others |
| xml | XML | Configuration | 999 | Configuration |
| fact.data | Basedata | Facts | 10 | Basedata |
| fact.data.hdr | Basedata Header | Facts | 10 | Basedata |
| fact.map | Basedata Index | Facts | 10 | Basedata |
| fact.map.hdr | Basedata Index Header | Facts | 10 | Basedata |
| rigid.data | Rigid Aggregation Data | Facts | 10 | Aggregations |
| rigid.data.hdr | Rigid Aggregation Data Header | Facts | 10 | Aggregations |
| rigid.map | Rigid Aggregation Index | Facts | 10 | Aggregations |
| rigid.map.hdr | Rigid Aggregation Index Header | Facts | 10 | Aggregations |
| flex.data | Flexible Aggregation Data | Facts | 10 | Aggregations |
| flex.data.hdr | Flexible Aggregation Data Header | Facts | 10 | Aggregations |
| flex.map | Flexible Aggregation Index | Facts | 10 | Aggregations |
| flex.map.hdr | Flexible Aggregation Index Header | Facts | 10 | Aggregations |
| string.data | String Data (Distinct Count?) | Facts | 10 | Basedata |
| cnt.bin | Binary | Configuration | 999 | Binaries |
| mrg.ccmap | mrg.ccmap | DataMining | 999 | DataMining |
| mrg.ccstat | mrg.ccstat | DataMining | 999 | DataMining |
| nb.ccmap | nb.ccmap | DataMining | 999 | DataMining |
| nb.ccstat | nb.ccstat | DataMining | 999 | DataMining |
| dt | dt | DataMining | 999 | DataMining |
| dtavl | dtavl | DataMining | 999 | DataMining |
| dtstr | dtstr | DataMining | 999 | DataMining |
| dmdimhstore | dmdimhstore | DataMining | 999 | DataMining |
| dmdimstore | dmdimstore | DataMining | 999 | DataMining |
| bin | Binary | Configuration | 999 | Binaries |
| OTHERS | Others | Others | 99999 | Others |

As you can see, the extension may consist of 1, 2 or 3 parts; the more parts, the more specific the file extension. If you check the result of the Power Query, it also contains 3 columns: FileExtension1, FileExtension2 and FileExtension3. To join the two tables we first need to load both of them into Power Pivot. The next step is to create a proper column on which we can base our relationship: if the 3-part extension is found in the FileExtensions table we use it, otherwise we check the 2-part extension and then the 1-part extension, and in case nothing matches we use “OTHERS”:
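A calculated column with this logic could look roughly like the following DAX (the column name and the table name Files for the Power Query output are my own assumptions):

    // use the most specific extension that exists in the FileExtensions table, otherwise "OTHERS"
    FileExtensionKey =
    IF ( CONTAINS ( FileExtensions, FileExtensions[FileType], Files[FileExtension3] ), Files[FileExtension3],
      IF ( CONTAINS ( FileExtensions, FileExtensions[FileType], Files[FileExtension2] ), Files[FileExtension2],
        IF ( CONTAINS ( FileExtensions, FileExtensions[FileType], Files[FileExtension1] ), Files[FileExtension1],
          "OTHERS" ) ) )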

Then we can create a relationship between our Power Query table and our FileExtensions table. I also created some other calculated columns, hierarchies and measures for usability.
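The measures can be kept very simple; for example, assuming again a table called Files with the Size column taken from the expanded Attributes record, a size measure in megabytes could be defined as:

    // total size of the selected files, converted from bytes to MB
    Size MB := SUM ( Files[Size] ) / 1024 / 1024
    File Count := COUNTROWS ( Files )

And this is the final outcome: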

You can very easily see how big your facts are, the distribution between base data and aggregations, and the dimension sizes, and you can drill down to each individual file! You can of course also create a Power View report if you want to. All visualizations are up to you; this is just a very simple example of a report.

Enjoy playing around with it!

Downloads: (please note that I added a filter on the Database name as a last step of the Power Query to show only Adventure Works databases! In order to get all databases you need to remove this filter!)