Under The Hood With SQL Server Batch Mode Part 2: Column Store and Batch Mode Fundamentals

In my last post I covered material that was largely SQL Server agnostic; in this post I will elaborate on why and how it relates to SQL Server. Firstly, what difference does batch mode make:

A parallel plan with a degree of parallelism of 2 results in a 12x performance improvement compared to a serial plan. The whole point of the slide which discussed latency, ranging from accessing the on-CPU cache (a heartbeat) all the way through to primary storage (undertaking a degree course), was to illustrate the CPU cycles that are wasted when we drop out of the L2/L3 caches, main memory, etc. Using xperf we can actually quantify how much more efficient a batch mode iterator is versus its row mode equivalent:

To join the dots from my previous post: if you remember the slide on which I mentioned how data passes from iterator to iterator row by row in row mode, in batch mode data passes between iterators in batches of around 900 rows. The significance of this number is that a batch is sized such that, on average, it will fit inside a CPU's L2/L3 cache. The unit of transfer between disk and memory for a column store is the segment; segments, which contain roughly a million rows each, are pipelined into the L2/L3 cache as per this next slide:
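Segment sizes can be inspected directly; a sketch, assuming a column store index already exists on a FactInternetSales table (the table name is illustrative), using the sys.column_store_segments catalog view:

```sql
-- One row per column per segment; row_count shows how close each
-- segment is to the ~1,048,576 row maximum.
SELECT p.object_id,
       s.column_id,
       s.segment_id,
       s.row_count,      -- rows encoded in this segment
       s.on_disk_size    -- bytes the segment occupies on disk
FROM   sys.column_store_segments AS s
JOIN   sys.partitions AS p
       ON s.partition_id = p.partition_id
WHERE  p.object_id = OBJECT_ID(N'dbo.FactInternetSales')
ORDER  BY s.column_id, s.segment_id;
```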

At roughly the same time that I put this material together, Intel announced that it would incorporate the 128MB "Crystal Well" on-package L4 cache into its latest Xeon line-up; you would expect the latency involved in accessing this to fall between that of accessing the L3 cache and that of main memory. Theoretically, the pipelining of rows could be turned into a two-stage process in order to achieve further performance improvements: segments could be pipelined into the L4 cache in batches larger than the standard batch size, and then pipelined from there into the L2/L3 cache in conventionally sized batches.

Remus Rusanu's blog post "Understanding how SQL Server Executes a Query" mentions that each execution plan iterator has a "get next row" method which its parent iterator invokes in order to pull data through the plan. Parallel iterators add a further nuance to this in that they have a producer side and a consumer side, through which data is pushed rather than pulled; refer to Paul White's (@SQLKiwi) blog post "Parallel Plan – Branches and Threads" for more information on this. Interestingly, the repartition streams, gather streams and distribute streams iterators only work in row mode at the time of writing.

The call stack on the right-hand side of the slide illustrates how, at the top of the stack, SQLDK.dll (SQL OS) allocates pages of memory into which batches are loaded. In the lower part of the trace we can actually see segments being retrieved from disk. The two most salient points regarding the movement of rows around in batches, besides the fact that batches are sized to fit into the L2/L3 cache, are:

In terms of machine instructions per row this is highly efficient; it is often referred to as "vectorized processing".

It results in rows being stored contiguously, which makes for efficient use of the CPU's L2/L3 cache. According to the Microsoft research paper "Enhancements to SQL Server Column Stores", SQL Server 2014 introduces a new blob cache:

2.2 Caching and I/O
Column segments and dictionaries are brought into memory as needed during query processing. They are not stored in the buffer pool but in a new cache for large objects. Each object is stored contiguously on adjacent memory pages. This simplifies and speeds up scanning of a column because there are no "page breaks" to worry about.

This design feature has been included because sequential memory page scans are faster than random page scans, as per this slide from the last post:
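The new large-object cache can be observed through the memory clerk DMV; a minimal sketch, assuming SQL Server 2014, where the cache appears under its own clerk type:

```sql
-- pages_kb shows how much memory the column store object pool
-- (the new large-object cache) currently holds.
SELECT type, name, pages_kb
FROM   sys.dm_os_memory_clerks
WHERE  type = N'CACHESTORE_COLUMNSTOREOBJECTPOOL';
```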

Recurring values encountered going down the column are stored in dictionaries.

Column store dictionaries act as lookup tables whereby dictionary entry ids and run lengths are used to encode the actual data. There are two types of dictionary: local (to each segment) and global (to the entire object). The column store index build process in SQL Server 2014 differs from that in 2012 in that the more recent release uses a two-phase process in which the global dictionary is created first. Better performance is obtained if as much of the compression metadata as possible is stored in local dictionaries; if the global dictionary grows beyond a certain size it can spill out of the L2/L3 cache.
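The dictionaries themselves can be inspected via the sys.column_store_dictionaries catalog view; a sketch, again assuming a column store index on an illustrative FactInternetSales table:

```sql
-- One row per dictionary; dictionary_id 0 denotes the global
-- dictionary for a column, non-zero ids are local dictionaries.
SELECT d.column_id,
       d.dictionary_id,   -- 0 = global dictionary
       d.entry_count,     -- number of distinct entries encoded
       d.on_disk_size
FROM   sys.column_store_dictionaries AS d
JOIN   sys.partitions AS p
       ON d.partition_id = p.partition_id
WHERE  p.object_id = OBJECT_ID(N'dbo.FactInternetSales')
ORDER  BY d.column_id, d.dictionary_id;
```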

By its nature a column store index cannot be used to perform an index seek; however, because segment metadata stores the lowest and highest value of each column within the segment, whole segments can be skipped during column store scan operations. From SQL Server 2014 onwards, deleted values, which are recorded in the deletion bit map, are also skipped. The deletion bit map is stored as a b-tree structure on disk, but becomes an actual bit map once materialised into memory. Rows are uniquely identified within a row group via a tuple id (new to SQL Server 2014).

Inserts into column store indexes are treated in one of two ways in SQL Server 2014: bulk inserts, "bulk" referring to inserts of a binary million (1,048,576) rows or more, are encoded and stored as segments; inserts of fewer rows are stored in delta stores. When a delta store becomes full it is marked as closed, and the tuple mover will encode and compress closed delta stores. For the best possible scan performance, the amount of data held in delta stores should be kept as low as possible. Rebuilding a column store index effectively forces the tuple mover to run on demand and empties all closed delta stores. A blog post on the tuple mover by one of the members of the SQL Server development team can be found here.
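The state of delta stores and the size of the deletion bit map can be observed via sys.column_store_row_groups (new in SQL Server 2014); a sketch, with the table name being illustrative:

```sql
-- One row per row group: OPEN/CLOSED rows are delta stores,
-- COMPRESSED rows are fully encoded segments.
SELECT object_id,
       row_group_id,
       state_description,
       total_rows,
       deleted_rows       -- rows flagged in the deletion bit map
FROM   sys.column_store_row_groups
WHERE  object_id = OBJECT_ID(N'dbo.FactInternetSales');

-- Rebuilding compresses any remaining delta store rows on demand
-- and drops deleted rows from the index.
ALTER INDEX ALL ON dbo.FactInternetSales REBUILD;
```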

SQL Server 2014 introduces a further level of compression known as "column store archive" compression. The aforementioned Microsoft Research paper on enhancements to column store indexes in SQL Server 2014 claims that this can result in a further increase in compression of between 37% and 66%. Column store archive compression works by subjecting segments to a Microsoft implementation of the LZ77 compression algorithm before storing them on disk. The results for column store archive compression using my contrived FactInternetSales table were disappointing compared to those for conventional column store compression. When I first presented this material at a SQL Server user group in Newcastle, I made the comment that, without someone lending me their data warehouse, I was constrained to synthetic data; one of the attendees mentioned that you can download some of the Stack Exchange data. Therefore, to illustrate the often-used disclaimer that "your (compression) mileage will vary" depending on your data, I looked at how well data compresses with an artificially inflated FactInternetSales table from the AdventureWorksDW data warehouse . . .

compared to a Posts table containing posting data (minus comments, a blob column which will not compress well) from the four largest Stack Exchange sites. I should both accredit and thank Stack Exchange for making this data publicly available.

In terms of "real world" data the Stack Exchange data beats my inflated FactInternetSales table hands down. Unfortunately, the Stack Exchange data is not modelled as a Kimball star schema; also, because there is not that much of it, a full table scan of my contrived Posts table from a cold buffer cache, using my slow spinning-disk-based storage, completes in a second, nowhere near long enough to 'stress' batch mode.
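For reference, the archive compression figures above were obtained by rebuilding the index with the archive option; a sketch of applying and then reverting it, the table name again being illustrative:

```sql
-- Apply column store archive compression (LZ77 over the segments):
ALTER INDEX ALL ON dbo.FactInternetSales
REBUILD WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE);

-- Revert to conventional column store compression:
ALTER INDEX ALL ON dbo.FactInternetSales
REBUILD WITH (DATA_COMPRESSION = COLUMNSTORE);
```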

In my next post in the series I will delve into sequential scan performance using column stores and different types of storage, join performance and how well a simple fact dimension table join scales as the degree of parallelism is increased.