Geek City: Build a Big Table with a Columnstore Index

I was looking all over for a big table to use for my columnstore
examples in my SQL Server 2012 book, and the only one I found was from someone
at Microsoft and was not publicly available. When I demonstrate code in my
writings, I like the code to be available for anyone to use, so that was no
good.

Finally I realized I was just going to have to do it myself. I based
the script on some work by the awesome Benjamin Nevarez (blog | twitter), but I needed to make a
couple of changes to the table structure, and then the loop for adding the data
took some tweaking to make sure that uniqueness was possible, in case you need
to test how the metadata changes when a columnstore index is built on a
table with a unique clustered index, compared to a nonunique clustered index,
compared to a heap.

I have just finished the chapter on indexes, and decided to make this script
available. The initial table data is based on the data in the Microsoft
AdventureWorksDW2012 sample database, which you can download here.
(Note that the DW version is the first one under “Other Available Downloads”;
it’s not the one under “Recommended Download”.)

Here is the section of the script that populates most of the
FactInternetSalesBIG table. (The original 60K rows were simply copied
from the original FactInternetSales table.) Note that all the statements
have to be run in a single batch, because the local variable’s scope is the
batch, and because the GO 9 applies to the single batch that precedes it. Of
course, you can change the number of iterations to end up with a different size
table; mine ends up at just over 30 million rows. In addition to modifying the
SalesOrderNumber value on each iteration, I also changed the value of the
Revision column to indicate which pass through the insert loop was being
executed.
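The shape of that batch might look something like the sketch below. This is not the exact downloadable script: the column names other than SalesOrderNumber are abbreviated, the base row count is rounded, and the index and column names are my own placeholders. It just illustrates why everything must sit in one batch with GO 9 after it: GO 9 reruns the whole preceding batch nine times, and since a local variable lives only for one batch, each pass has to recompute its own pass number from the rows already inserted.

```sql
-- Hedged sketch only; the real downloadable script differs in details.
DECLARE @Pass varchar(10);

-- Each completed pass adds one copy of the ~60K base rows, so the
-- current row count divided by the base count gives the pass number.
SELECT @Pass = CAST(COUNT(*) / 60000 AS varchar(10))
FROM dbo.FactInternetSalesBig;

INSERT INTO dbo.FactInternetSalesBig
        (SalesOrderNumber, RevisionNumber /*, <other columns> */)
SELECT   SalesOrderNumber + '-' + @Pass,  -- append pass number so values stay unique
         @Pass                            -- record which pass inserted the row
         /*, <other columns> */
FROM dbo.FactInternetSales;
GO 9   -- rerun the entire batch above nine times
```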

Make sure you have enough log space. My log grew from 1GB to just over 8GB
when running the script to build the 30 million row table and columnstore index,
with no clustered index.

The downloadable script has commented options to build either a clustered or a
nonclustered index before you build the columnstore index. I suggest building
the clustered index before the columnstore index, because building a
clustered index will rebuild any existing nonclustered indexes, including the
columnstore index. Building the columnstore index can take a while; on my system
it was about 10 minutes. (Building the original 30+ million row table took even longer.)

After I finish the chapter on partitioning, I'll have a script to partition this table, but feel free to do that yourself!

Thank you very much for your post! I did everything you suggested and then ran this query (over the LAN, with the columnstore index in place; SQL Server 2012 is running on a remote machine with a quad-core CPU and 16GB RAM):

SELECT [SalesOrderNumber], SUM([SalesAmount])
FROM [AdventureWorksDW2012].[dbo].[FactInternetSalesBig]
GROUP BY [SalesOrderNumber]

It successfully returns 14161408 rows in 3 minutes.

When I ran the same query against my local copy of the AdventureWorksDW2012 database (but without a columnstore index and with no LAN involved; everything local on my workstation, including SQL Server 2012), it took 6 minutes. My workstation is also a quad-core with 16GB RAM.

Just to compare apples to apples, I reran the same query locally on the server I mentioned above (so no LAN involved). It returned those 14161408 rows in less than 100 seconds, almost twice as fast as over the LAN.

We managed to get star-join optimizations appearing in our BI query execution plans at work after much database redesign. Attempting to compare their performance against columnstore indexes was a nightmare, though. With no decent documentation to go on, it's impossible to reverse engineer the query optimizer's logic as to when, how, and why it makes best use of columnstore indexes.

Really looking forward to seeing your book when it comes out. I've read all your previous ones and they are the backbone of my skillset. I just passed the Microsoft Certified Master of SQL Server 2008 knowledge exam, thanks in no small part to your books.

Kalen: I have a question for you that unfortunately is likely to have an 'it depends' answer. That being said, I'd like your first blush, off the cuff answer nonetheless.

I am working in an environment where, for better or worse, the BI queries run against our DW are entirely ad hoc (ROLAP via MicroStrategy). We have used columnstore indexes in a fact table partitioned at the day grain, and this is working very nicely for us.

The question I have is regarding our dimension tables. Because the queries are entirely ad hoc, every attribute in our dimensions is a key/value pair, and every one of the keys (integers) is indexed in a separate index. Upserts take a long time because of the sheer number of attributes and indexes. Read-only considerations due to the use of columnstore indexes notwithstanding, do you think we'd see better performance if we basically truncated a dimension, did straight inserts, and then added a columnstore index over all the key columns, instead of relying on more 'traditional' approaches to dimension maintenance/ETL?