Category Archives: SQL Server 2016

I am heavily relying on virtualization in my home lab. I prefer to build separate SQL Server environments for different projects keeping them clean and isolated from each other. I can work with multiple SQL Server versions side by side without worrying how some components in my setup affect others.

This setup is also great for presentations and trainings I deliver. I just need to reset VM to the “starting” checkpoint and everything would be in the known and healthy state. Believe me, this saves you from a lot of pain and embarrassment when you are staying in front of people. Troubleshooting of non-working demos is not fun when you are on the stage.

There is the problem with this setup, however. It is much easier and faster to checkpoint existing VM rather than provision the new one for the new setup. In the end, unless you have self-discipline I lack, you may end up with something like shown in Figure 1 below. This is the screen shot of one of my VMs checkpoints.

01. Mess in VM configuration

On the surface, this setup may provide you required isolation between environments. However, there is the problem with maintainability. Consider how much efforts it would take to apply new SP or CU to all of your checkpoints. In the end, you will be in the situation when you always want to rebuild your lab and you never have time to do it properly. Thus, I decided to look at other options and evaluate Docker and SQL Server in containers when I recently had to setup the new travel laptop. I found that containers are better suited to my needs comparing to VM-based virtualization and I want to share my experience with you.

As the disclaimer, I am going to focus on the infrastructure side and maintainability and not going to discuss underlying technologies and containerization. You can read about it here if you want.

While both, containers and VMs rely on virtualization, they are conceptually different on the infrastructure side. Figure 2 illustrates the general virtual machine. .

02. VM and Checkpoints

Each virtual machine may consist of multiple checkpoints/snapshots, which fork it to the multiple states. However, only one state may be active at any giving point of time. Neither could you run multiple instances of the single VM simultaneously – you’ll need to clone them creating separate VMs for that.

Containers, on the other hand, operate differently as shown in Figure 3. The environment consists of multiple images, which conceptually similar to checkpoints/snapshots. However, every image may have multiple containers (running or stopped), which are the active instances/environments. It is worth noting that Docker API allows you to checkpoint a container, but I see little benefits of doing that.

03. Images and Containers

That conceptual difference force you to use different methods to solve problems. Think about the situation when you need to setup individual development environments for each member of the team. With VM-based virtualization, you will have to create separate VMs. With containers, you need to create the base image and start multiple containers from this image.

Both approaches have pros and cons. The containers will use less disk space than VMs – you will have only one image on disk. They are also easier to maintain and refresh – you need to build the single image. VMs, on the other hand, provide true separation. If one developer is working on the complex feature in its own branch, there is no dependency on the shared image – you can refresh other VMs keeping his one intact.

Nothing, obviously, prevents you from having multiple images achieving the same degree of separation with containers. You just need to manage them a bit differently. I, personally, believe that containers will be easier to maintain and lead to the cleaner setup in the long run.

Anyway, let’s do the setup. Mind you, I do not consider myself as the Docker expert – I am sure that there are the better ways to achieve the goals. Nevertheless, I want to share several tips and tricks I did in my environment.

I am going to use SQL Server 2017 in Windows containers setting everything up on Windows 10 machine (you need to have Anniversary Update Build 1607 or later) . I am sure that you can use similar approach in different OS and/or with SQL Server in Linux.

As the first step, you need to install Docker. You can download it from here – you can read the instruction – it is straightforward. Just make sure to choose Windows Containers during setup. After Docker setup is done, you may test your setup using Hello World image as outlined here. I did not do it – after all, we are about to run SQL Server anyway.

As the next step, let’s pull SQL Server 2017 Developer Windows image. Make sure that Docker is running, open PowerShell and run the following command.

docker pull microsoft/mssql-server-windows-developer:latest

It will take some time to download but then you are pretty much done – you have SQL Server “installed”. The following command will display you the list of images you have installed on your machine:

docker image ls

Figure 4 shows the output you would see.

04. SQL Server Image

As the side note, you can remove images using one of the following commands (obviously do not run it now). In some cases, for example if you had an error when you built the image, you may end up with the images that do not have names and you’ll need to use ID to clean them up.

docker image rm <name>
docker image rm <image_id>

Now, let’s create the container with SQL Server 2017 by running the following command. This will create the container with name sqlserver.

The -p switch sets up the port mapping between local instance and the container. In our case, both ports are the same and SQL Server in container will work (and look like) as if you have SQL Server default instance installed locally. All requests host received on port 1433 would be redirected to the container. Obviously, you need to change the port mapping if you already have SQL Server installed on your box and/or if you have multiple containers running. Also, do not forget to open the port in the firewall if you accessing the image remotely.

Now, you can connect to SQL Server by using localhost or localhost,1433 as the server name and sa/Pas!w0rd12345 for the authentication.

Alternatively, you can find IP address of the container. First, let’s run the following command to see list of active containers in the system

docker container ls -all

Figure 5 shows the output – as you see, we have just one running container in the system. You can use either name or container_id during container management.

05. Running Containers

Next, let’s run the command below to get container information.

docker inspect sqlserver

Figure 6 illustrates the partial output of the statement – the configuration is presented in JSON format. You can look at IPAddress element there.

06. Inspect Container Configuration

You can also run the following command to parse the output of docker inspect obtaining IP address without any additional information:

In my environment, I prefer to connect to containers using IP Addresses rather than port mapping. By default, Docker assigns random IP Address every time you start container, which is inconvenient. You can workaround it by assigning specific IP to the container when you start it.

Let’s locate the subnet used by the Docker. You can do it by using the commands below. In Windows environment and with the current docker build, you should look at nat network.

docker network list
docker network inspect nat

Figure 7 shows the output of the commands. You can pick any IP address from the available range.

Now, we can create another container with the command below. In addition to IP Address, there are a couple of other useful parameters. -m indicates how much memory you want to allocate to container. –cpu-count shows number of cores to allocate.

As I already mentioned, you can start multiple containers from the single image. Just use different IP, Name and port mapping for each of them.

Building the Image

Obviously, in the real life, we do not work with vanilla SQL Server installation. We need to customize it by changing SQL Server settings and logins, creating and/or restoring the databases and do other actions. There are a couple of ways how you can do that.

The first approach is customizing existing container manually and creating the image from it using docker container commitcommand. After that, you can start the new containers from created image the same way as we already discussed. We will cover a couple ways to move data to and from containers later.

There is the better way, however. You can automate this process by utilizing docker buildcommand. The process is very simple. You just need to define DockerFile, which contains the reference to the main image and specifies the build actions. You can copy scripts and database backups into the image, run SQLCMD, BCP and PowerShell scripts there – you, pretty much, have the full control. Internally, Docker runs every command inside deployment containers (creating and destroying them during the process) saving the final one as the target image.

Let’s look at the simple DockerFile below. As you can see, it specifies our SQL Server 2017 image as the base for the build. Next, it copies files from /Scripts folder into the image/deployment container and run Build.ps1 PowerShell script there.

Obviously, you have all flexibility here. For example, you can copy and restore database backup instead of recreating it with scripts.

You can run docker build command as shown below from the same folder where DockerFile is located. You may notice that I am using -m parameter – by default, Docker allocates very limited amount of resources to deployment containers during the build. You may consider increasing the memory for the container to avoid any potential issues if you run heavy scripts.

docker build -t internalsimage . -m 2g

There is the small caveat here – security. Docker runs the build using Windows Administrator/sysadmin user, which will be removed after process is completed. It should not be a problem for majority of the cases, unless some of your code changes security context with EXECUTE AS OWNER clause and/or relies on the proper database ownership. As the workaround, you can create another sysadmin SQL Login in the beginning of the build process and use it when you run the scripts or restoring the databases later.

The beauty of this setup is maintainability. Consider the situation when you want to refresh the environment after Microsoft published newer SQL Server image in the repo. The only thing you need to do is changing the base image reference in DockerFile and re-run the build process. All images will be recreated with minimum work involved.

You can use your own images as the base images during the build. Consider the example with individual development environments we discussed earlier. Let’s assume that we want to refresh the image with production data on the weekly basis and support the situation when we need to keep some old images to run containers for complex feature development.

In this case, you may want to create the setup as shown in Figure 8 with one image that contains all configuration settings without the database using it as the base image for weekly refreshes.

08. Possible Images/Containers Configuration for Dev Teams

A Few Other Tips

There are a couple of other commands I want to mention.

The docker execcommand allows you to run a command within the container. Listing below starts interactive PowerShell session in internals container. You can create folders, run PS commands and do whatever you need to do there.

docker exec -it internals PowerShell

You can move files in between host and container using docker cpcommand. This is the perfect way to restore the database from the backup in the container.

Wrapping Up

While Docker-based setup works great in majority of the cases, there are a few caveats I need to mention besides security context issue during the docker build. You would also have some limitations if you rely on Windows Authentication. You cannot make containers the members of domain. You can use Group Managed Service accounts (gMSA) to handle Windows Authentication but, obviously, this setup is much more complicated. Some details are available here.

Next, this setup is not necessarily the best for performance-related demos and/or load testing. I found that under heavy OLTP load, the bottleneck shifts towards network layer invalidating the results. Also, I did not find a way to connect to in-container SQL Server performance counters with PerfMon application. You can workaround it with sys.dm_os_performance_counters view, but it is less convenient.

Neither would it work very well for High Availability-related scenarios. HA in containers world be achieved in the different way and VM-based virtualization is the better choice if you need to test AlwaysOn Availability Groups and Failover Clusters.

Fortunately, you are not limited to the single technology. You can use VMs for some and containers for other scenarios. The choice is yours.

Docker PS cheatsheet and docker build demo is available for download from here.

A long time ago in a galaxy far, far away, I had to troubleshoot interesting performance issue in SQL Server. Suddenly, the CPU load on the server started to climb up. Nothing changed in terms of workload. The system still processed the same amount of requests. The execution plans of the critical queries stayed the same. Nevertheless, the CPU usage grew up slowly and steadily by a few percent per hour.

Eventually, we nailed it down. The problem occured in very busy OLTP system with very volatile data. We noticed that system performed much more I/O (logical and physical) than before. It was very strange, because nothing should have changed that day. Finally, we found that we have large number of deleted rows in the database that had not been cleaned up by ghost cleanup task.

It was the most impactful for the few “queue” tables in the system that handled hundreds of inserts and deletes per second. The tables had millions of data pages despite that they stored just a handful of rows. The tables supposed to be small and there were bunch of queries that performed Clustered Index Scan over them burning a lot of CPU resources to perform mullions of logical and physical reads.

The output of sys.dm_db_index_physical_stats looked similar to Figure 1 below. Very little record_count with very large number of page_count and version_ghost_record_count. The latter one indicates how many ghosted rows cannot be cleaned up due to row-versioning transactions in the system.

01. Output of sys.dm_db_index_physical_stats

We looked at the active transaction and discovered, that one of our readable secondaries ran terrible SELECT that cross-join several hundred-million row tables. Unfortunately, SQL Server allowed it to execute and this select was running for more than 12 hours deferring ghost and version store clean-up on primary node. The bottom line – non-optimized workload on the readable secondary node can affect your primary. Heh.

You can read why it happens in my recent blog post at Apress.com and in my upcoming book on locking, blocking and concurrency (should be published in October). However, I also wanted to mention it here and provide you the demo scripts and video that demonstrates it.

Remember about this overhead. Secondary nodes are great to scale your read workload but they are by no means “set it and forget about it” kind of solution.

If you worked with SQL Server for a while, you should have noticed how landscape changed over the years. We are dealing with the different problems now. Five years ago, majority of the issues I saw were related to non-optimized queries. There were the huge scans with a lot of physical I/O and bad performance.

You do not see them as often nowadays. It is very cheap to solve the problems by upgrading the server. Put a couple hundred GBs of RAM and cache all the data; add more CPUs and problems magically disappear. The root-cause has not been fixed but who cares?

Surprisingly, there is one category of the issues that did not went away – concurrency. It even becomes worse. Modern servers handle more users and problems that did not exist with 50 concurrent users may put the server to its knees with 5000 users. I’ve been constantly involved in the troubleshooting of various concurrency issues and, in fact, I see more and more of them overtime.

Troubleshooting of the blocking and concurrency issues is, in the nutshells, a simple process. You need to identify the processes involved in blocking conditions or deadlocks and analyze why those processes acquire the locks on the same resources. In majority of cases, you need to analyze queries and their execution plans identifying possible inefficiencies that led to excessive number of locks being acquired.

Collecting this information is not a trivial task. The information is exposed through DMVs (you can download the set of scripts here); however, it requires you to run the queries at time when blocking occurred. Fortunately, SQL Server allows you to capture blocking and deadlock conditions with the blocked process report and deadlock graph, analyzing them later.

There is the caveat though. Neither blocked process report nor deadlock graph provide you execution plans of the statements. Nor do they always include affected statements in the plain text. You may need to query plan cache and other DMVs to get this information and longer you wait lesser is the chance that the information is available. Moreover, SQL Server may generate enormous number of blocked process reports in cases of prolonged blocking and complex blocking chains, which complicates the analysis.

This analysis may become very time consuming, especially if you need to deal with the large number of servers. Over the years, I have created the set of routines, which simplify it for me. I have been thinking to publish my scripts for a while, but I’ve never had time to polish them enough for public consumption. Until now – and I am very happy to share my collection with all of you. So allow me introduce the Blocking Monitoring Framework, which I am using with majority of my servers!

This framework is using Event Notifications. It captures blocked process report and deadlock graphs and parses them at time when event occurred and all data is still available in the system. All information is persisted in the set of tables for the further analysis.

The first version is available for download. I also promise you that I am going to support and enhance it in the future publishing the new versions on the regular basis.

Please feel free to contact me in case of any questions. I would also appreciate if you provide me any blocked process reports and deadlock graphs that framework was unable to parse. I will address the issues as quickly as I could.

As strange as it sounds, I have never considered transaction management to be the topic that requires explanation. However, it seems that some aspects of it are confusing and may benefit from the separate, long overdue, blog post.

Transaction Types

There are three types of transactions in SQL Server – explicit, autocommitted and implicit.

Explicit transactions are explicitly controlled by the code. You can start them by using BEGIN TRAN statement. They will remain active until you explicitly call COMMIT or ROLLBACK in the code.

In case, when there are no active transactions present, SQL Server would use autocommitted transactions – starting transactions and committing them for each statement it executes. Autocommitted transactions work on per-statement rather than per-module level. For example, when a stored procedure consists of five statements; SQL Server would have five autocommitted transactions executed. Moreover, if this procedure failed in the middle of execution, SQL Server would not roll back previously committed autocommitted transactions. This behavior may lead to logical data inconsistency in the system.

For the logic that includes multiple data modification statements, autocommitted transactions are less efficient than explicit transactions due to the logging overhead they introduce. In this mode, every statement would generate transaction log records for implicit BEGIN TRAN and COMMIT operations, which leads to the large amount of transaction log activity and degrade performance of the system.

Figure 1 illustrates this condition. INSERT_1, UPDATE_1 and DELETE_1 operations run in autocommitted transactions generating additional log records and forcing log buffer to flush on each COMMIT. Alternatively, INSERT_2, UPDATE_2 and DELETE_2 operations run in implicit transaction, which leads to more efficient logging.

01. Transaction Logging with Autocommitted and Explicit Transactions

You can run the code below to see this overhead in action. It performs INSERT/UPDATE/DELETE sequence 10,000 times in the loop in autocommitted and explicit transactions respectively, measuring execution time and transaction log throughput with sys.dm_io_virtual_file_statsview.

In my environment and this test, explicit transactions are about two times faster and generated three less log activity than autocommitted ones as shown in Figure 2.

02. Explicit and Autocommitted Transaction Performance

As the side note, SQL Server 2014 and above allows you to improve transaction log throughput by using delayed durability. In this mode, SQL Server does not flush log buffers when COMMIT log records are generated. This reduces the number of disk writes at cost of potential small data loss in case of disaster.

SQL Server also supports implicit transactions, which you can enable with SET IMPLICIT_TRANSACTION ON statement. When this option is enabled, SQL Server starts the new transaction when there is no active explicit transactions present. This transaction stays active until you explicitly issue COMMIT or ROLLBACK statement.

Implicit transactions may make transaction management more complicated and they are rarely used in production. However, there is the caveat – SET ANSI_DEFAULT ON option also automatically enables implicit transactions. This behavior may lead to unexpected concurrency issues in the system.

Error Handling

The error handling in SQL Server is the tricky subject especially with transactions involved. SQL Server would handle exceptions differently depending on error severity, active transaction context and several other factors.

Let’s look how exceptions affect control flow during execution. Listing below creates two tables- dbo.Customers and dbo.Orders – and populates them with the data. Note the existence of foreign key constraint defined in dbo.Orders table.

Let’s run three DELETE statements in one batch as shown below. The second statement will trigger a foreign key violation error. @@ERROR system variable provides the error number for the last T-SQL statement executed (0 means no errors).

As you can see in Figure 4, the error was caught in the CATCH block and the third deletion statement has not been executed.

04. Running Three Autocommitted Transactions in TRY..CATCH block

Remember that non-critical exceptions do not automatically rollback explicit or implicit transactions regardless if TRY..CATCH block is present. You still need to commit or rollback transaction after the error.

Depending on severity of the error, transaction in which error occurred may be committable or become uncommittable and doomed. SQL Server would not allow you to commit uncommittable transactions and you must roll it back to complete it.

The XACT_STATE() function allows you to analyze the state of transaction and it returns one of three values:

0 indicates that there is no active transactions present.

1 indicates that there is an active and committable transaction present. You can perform any actions and data modifications committing transactions afterwards.

-1 indicates that there is an active uncommittable transaction present. You cannot commit such transaction.

There is very important SET option- XACT_ABORT– which allows you to control error-handling behavior in the code. When this option is set to ON, SQL Server treats every run-time error as severe, making transaction uncommittable. This prevents you from accidentally committing transactions when some data modifications failed with non-critical errors.

When XACT_ABORT is enabled, any error would terminate the batch when TRY..CATCH block is not present. For example, if you run the code from the second code sample above again using SET XACT_ABORT ON, the third DELETEstatement would not be executed and only the row with CustomerId=1 will be deleted. Moreover, SQL Server would automatically rollback doomed uncommitted transaction after the batch completes.

The code below shows this behavior. The stored procedure dbo.GenerateError sets XACT_ABORT to ON and generates an error within the active transaction. @@TRANCOUNT variable returns the nested level of transaction (more on it later) and non-zero value indicate that transaction is active.

create proc dbo.GenerateError
as
begin
set xact_abort on
begin tran
delete from dbo.Customers where CustomerId = 2; -- Error
select 'This statement will never be executed';
end
go
exec dbo.GenerateError;
select 'This statement will never be executed';
go
-- Another batch
select XACT_STATE() as [XACT_STATE()], @@TRANCOUNT as [@@TRANCOUNT];
go

Figure 5 illustrates the output of the code. As you can see, batch execution has been terminated and transaction has been automatically rolled back at the end of the batch.

505. XACT_ABORT behavior (no TRY..CATCH block)

TRY..CATCH block, however, will allow you to capture the error even with XACT_ABORT set to ON.

As you can see in Figure 6, exception has been trapped in the CATCH block with transaction still remain active there.

06. XACT_ABORT Behavior (with TRY..CATCH block)

Consistent error handling and transaction management strategies are extremely important and allow to avoid data consistency errors and improve data quality in the system. I would recommend the following approach as the best practice:

Always use explicit transactions in the code during data modifications. This would guarantee data consistency in transactions that consists of multiple operations. It is also more efficient comparing to individual autocommitted transactions.

Set XACT_ABORT to ON before data modifications. This would guarantee “all-or-nothing” behavior of the transaction preventing SQL Server from ignoring non-severe errors and committing partially completed transactions.

Use proper error handling with TRY..CATCH blocks and explicitly rollback transactions in case of exceptions. This helps to avoid unforeseen side effects in case of the errors.

The choice between client-side and server-side transaction management depends on application architecture. Client-side management is required when data modifications are done in the application code, for example changes are generated by ORM frameworks. On the other hand, stored procedure-based data access tier may benefit from server-side transaction management.

The code below provides the example of the stored procedure that implements server-side transaction management.

Nested Transactions

SQL Server technically supports nested transactions; however, they are primarily intended to simplify transaction management during nested stored procedure calls. In practice, it means that the code needs to explicitly commit all nested transactions and the number of COMMIT calls should match the number of BEGIN TRAN calls. The ROLLBACK statement, however, rolls back entire transaction regardless of the current nested level.

The code below demonstrates this behavior. As I already mentioned, system variable @@TRANCOUNT returns the nested level of the transaction.

select @@TRANCOUNT as [Original @@TRANCOUNT];
begin tran
select @@TRANCOUNT as [@@TRANCOUNT after the first BEGIN TRAN];
begin tran
select @@TRANCOUNT as [@@TRANCOUNT after the second BEGIN TRAN];
commit
select @@TRANCOUNT as [@@TRANCOUNT after nested COMMIT];
begin tran
select @@TRANCOUNT as [@@TRANCOUNT after the third BEGIN TRAN];
rollback
select @@TRANCOUNT as [@@TRANCOUNT after ROLLBACK];
rollback; -- This ROLLBACK generates the error

You can see the output of the code in Figure 7 below.

07. Nested Transactions

You can save the state of transaction and create a savepoint by using SAVE TRANSACTION statement. This will allow you to partially rollback a transaction returning to the most recent savepoint. The transaction will remain active and needs to be completed with explicit COMMIT or ROLLBACK statement later.

It is worth noting that uncommittable transactions with XACT_STATE() = -1 cannot be rolled back to savepoint. In practice, it means that you cannot rollback to savepoint after an error if XACT_ABORT is set to ON.

The code below illustrates savepoints in action. The stored procedure creates the savepoint when it runs in active transaction and rolls back to this savepoint in case of committable error.

The test triggered foreign key violation during the second dbo.TryDeleteCustomer call. This is non-critical error and, therefore, the code is able to commit after it as shown in Figure 8.

08. Transaction Has Been Committed After Rollback to Savepoint

It is worth noting that this example is shown for demonstration purposes only. From efficiency standpoint, it would be better to validate referential integrity and existence of the orders before deletion occurred rather than catching exception and rolling back to savepoint in case of an error.

I hope that those examples provided you the good overview of transaction management and error handling strategies in the system. If you want to dive deeper, I would strongly recommend you to read the great article by Erland Sommarskog, which provides you much more details on the subject.

Last week we discussed how to implement Sliding Window pattern in the tables with columnstore indexes. As I promised, today we are going to focus on partition management in such tables.

As you will remember, SQL Server does not allow you to split and merge non-empty columnstore partitions. I personally do not consider it as the huge limitation – perhaps because I rarely have reasons to do it in columnstore tables. However, there are still some cases when it may be required.

As one of examples, think about a table, partitioned by date, which may be populated with incorrect data from the future. Perhaps due to some issues in ETL processes. Some rows may be placed to incorrect (future) partition, which would prevent its split.

The common, by the book approach recommends dropping columnstore index, splitting or merging partitions and recreating the index afterwards. As you can imagine, it would lead to extremely inefficient process with huge amount of unnecessary overhead on large tables. After all, you have to drop and recreate columnstore index, converting table to Heap, while just subset of the partitions needs to be rebuilt. Fortunately, you can minimize the overhead with simple workaround:

Switch partition(s) to split or merge to the separate staging table

Split or merge partition(s) in the main table. You can do that because partitions will be empty after the previous step

Drop columnstore index in the staging table, split/merge partition(s) there and recreate the index afterwards

Switch partition(s) back from staging to the main table.

Let’s look at the process in details. I am going to recreate the tables I used last week with the script below.

Figure 1 demonstrates data distribution across partitions in the dbo.Orders table. As you can see, we have 13 non-empty partitions started from November 2016 up to end of November 2017.

01.Initial Data Distribution

Let’s assume that we would like to merge November and December 2016 together, which are partitions 2 and 3 in the tables.

As the first step, we will create the staging tables. They need to be partitioned in the same way as the main tables. It is better to use separate partition function to decouple merge and split operations in main and staging tables from each other. Remember that SQL Server would hold schema modification (Sch-M) lock on all tables that use partition function for the duration of merge and split operations.

Partition function on the staging tables does not need to include all partitions from the main tables. It just need to include partition(s) you are about it split or merge and two adjacent partitions on the left and right sides. This will guarantee that partitions in the main and staging tables are aligned before and after the operation. It would also support partition switch without requirement of creating extra CHECK constraints on the staging tables. As you can see in the code below, pfOrdersStaging function consists of four partitions:

As you can see, with just a few extra operators, we were able to remove the overhead of rebuilding columnstore index on entire table.

Unfortunately, this is offline approach. The data in the main tables would become inconsistent as soon as you switch partitions to the staging tables. Fortunately, if data is static, you can address it with some coding.

Instead of switching partitions to the staging tables, you can copy the data there. After split or merge is done in the staging tables, you can empty source partitions by switching them to another, temporary tables, and switch data back from the staging tables.

As you can see, this implementation is relatively simple and considering that data in columnstore indexes is often static, you may be able to utilize it to perform partition maintenance transparently to the users.

It’s been a while since my last blog post. Many things happened, including another great PASS Summit where I presented the session of Data Partitioning (by the way, you can download the slides and demos from my Presentation page). As usual, I ran out of time 🙂

As many of you know, Data Partitioning is very big and complex subject, which is impossible to cover even with half-day session. One of the topics I left out of scope is how it plays with columnstore indexes. Even though partitioned columnstore tables are similar to B-Tree ones, there are some differences between them.

The biggest difference resides in partition function split and merge behavior. With B-Tree indexes, you can split and merge non-empty partitions. SQL Server would split or merge the data automatically, granted with the schema-modification (Sch-M) table lock held in place. Other sessions would be unable to access the table but at least split and merge would work.

This is not the case with columnstore indexes where you would get the error when you try to split or merge non-empty partitions. There are several reasons for this limitation. Without diving very deep into columnstore index internals, I could say that the only option of doing split of merge internally is rebuilding columnstore index on affected partitions. Fortunately, you can split and merge empty columnstore partitions, which allow you to workaround the limitation and also implement Sliding Window pattern and use partitioning to purge the data.

The Sliding Window pattern stands for scenario when we want to retain the data for some period of time. Consider the situation when we need to keep last 12 months of data. In the beginning of each month, we may purge the data that is older than 12 months, basically having a window on the data that slides purging the oldest data, based on a given schedule.

Data partitioning is great in this scenario. It allows to keep the purge process on metadata- and minimally logged-levels by switching the old-data partition to staging table truncating it afterwards. This helps to avoid log-intensive DELETE operations and reduce blocking in the system.

Let’s look at implementation. First, I would create partition function and scheme considering that now is November 2017 and we would need to retain 12-month of data starting November 2016.

This code will create 16 partitions. The left-most partition will host the data prior 2016-11-01 and it will be empty. The two right-most partitions will host the data for December 2017 and everything starting on 2018-01-01. They also will be empty at this point.

There is the simple reason why we want to have two right-most partitions empty. Let’s assume that we will run the monthly purge process in December and December’s partition will already store some data. As part of the purge process, we would like to create the new partition for the “next month” data. It is beneficial to perform this and split partition function on empty partition. With B-Tree tables, it is nice to have implementation. It allows to avoid any data scans and movements reducing the time for schema modification lock being held. For columnstore indexes, this is actually the must have part of design – you would be unable to split non-empty partition at all.

If you ever implemented Sliding Window pattern with B-Tree indexes, you would be aware of empty right-most partition. However, you might notice that in our example, there is also empty left-most partition present. As I already mentioned, columnstore indexes would not allow you to merge non-empty partitions and you need to have an extra empty partition to perform the merge after old data is purged.

Let’s look at the process in details. As the first step, I will create two tables that will share the partition function and scheme. We will populate them with some data and create the columnstore indexes on the tables.

Let’s create another empty partition by splitting partition function. The right-most partition is still empty and operation would succeed. As result, we will have separate empty partitions for January, 2018 and another partition that will store all data starting 2018-02-01. We will split the right-most empty partition again during our January’s 2018 purge process.

The process of purging old data would consist of several steps. First, we need to create the staging tables to which we will switch old November 2016 partitions. Those tables need to be empty, have exactly the same schema with the main tables and reside on the same filegroup.

If you look at dbo.Orders allocation information again, you would see that two left-most partitions are now empty as shown in Figure 4

04. Data Distribution And Allocation Information After Purge

Now they can be merged, which will complete our implementation.

alter partition function pfOrders()
merge range('2016-11-01');

As you can see, implementation of Sliding Window pattern with columnstore indexes is very similar to B-Tree tables. The only differences are:

You must have empty right-most partition pre-allocated to perform the split. I’d like to reiterate that even though it is not required with B-Tree indexes, such empty partition would reduce I/O overhead and table locking during split operation there.

You must have another empty left-most partition to perform the merge. This is not required nor needed with B-Tree indexes.

I’d like to thank Ned Otter (blog) for his suggestions and feedback. They convinced me to blog on the subject.

As you know, SQL Server uses range locks to protect the range of the index keys. This usually happens in SERIALIZABLE isolation level. This level prevents phantom and non-repeatable reads phenomena and it guarantees that queries executed in transaction will always obtain the same set of data every time they were executed.

As the simple example, consider the table with ID column and two rows with ID = 1 and ID = 10. In SERIALIZABLE isolation level, the query that selects data from the table should always obtain those and only those two rows if you run it multiple times. SQL Server uses shared range lock protecting ID key range interval of (1..10), which guarantees that other transactions would not be able to update or delete existing rows nor insert any new rows into the interval.

In contrast, REPEATABLE READ isolation level uses row locks rather than range locks. They would prevent modifications of existing rows- 1 and 10- but would not prevent other transactions from inserting rows into the interval and introducing phantom read phenomena.

The range locks are usually acquired only in SERIALIZABLE isolation level; however, there is another, pretty much undocumented case, when SQL Server can use those locks. It happens even in READ UNCOMMITTED and READ COMMITTED SNAPSHOT modes when you have nonclustered indexes that have IGNORE_DUP_KEY=ON option. In that case rows with the duplicated index keys would not raise an error but rather being ignored. SQL Server would not insert then into the table.

This behavior leads to very hard to explain cases of blocking and even deadlocks in the system. Let’s look at the example and create the table with a few rows as shown below. As you see, nonclustered index on the table has IGNORE_DUP_KEY option enabled.

As you can see in Figure 1, INSERT statement acquired and held two exclusive (X) locks on the rows inserted into clustered and nonclustered indexes. It also obtained Range (RangeS-U) lock on nonclustered index. RangeS-U means that the key range is protected with the shared (S) lock and SQL Server uses update (U) scan within the range.

01. Locks Held by the Session

You may ask the obvious question – why the range lock is required? The reason is the way how SQL Server handles modifications of the data. The data is always inserted into or updated in the clustered index first followed by nonclustered index updates. With IGNORE_DUP_KEY=ON, SQL Server should prevent the situation when the duplicated keys were inserted into nonclustered index simultaneously after clustered index insert was done and, therefore, clustered index insert needs to be rolled back. Thus, it locks nonclustered index key range before any data modifications preventing other sessions from inserting any rows there.

You can confirm it by looking at lock_acquired xEvent events as it is show in Figure 2. As you can see, the RangeS-U lock is acquired before exclusive (X) locks on the resources.

02. lock_acquired Events During Insert

The key problem here, however, is that RangeS-U lock behaves the same way as in SERIALIZABLE isolation level and it is held until the end of transaction. Moreover, RangeS-U locks are incompatible with each other. That can lead to very unpleasant and hard to understand deadlocks.

Let’s run the code shown below in another session. The first INSERT would succeed (it is in the different key range in the index). The second, however, would be blocked due to RangeS-U/RangeS-U lock incompatibility.

If we checked the locks held by the both sessions now, we would see the picture shown in Figure 3. You can see that session 2 successfully acquired the first range lock but the second range lock request is blocked due to incompatible range lock on the same key interval held by the session 1.

03. Locks Held by Both Sessions

Finally, if we run another INSERT in the session 1 into the range locked by the session 2, it would be also blocked with the typical deadlock condition.

Unfortunately, there is very little you can do about that. The only way to address the problem is removing IGNORE_DUP_KEY option from the index handling duplicates in the different ways. It may or may not work for you.

Finally, it is worth mentioning, that SQL Server does not use range locks in case of clustered indexes with IGNORE_DUP_KEY=ON option. Clustered indexes are modified first and SQL Server could detect duplicated keys at this stage without any extra range locking required.

I think many of us felt quite excited and the same time disappointed with In-Memory OLTP in SQL Server 2014. It was the great and promising technology but it had way too many limitations. The biggest one for me, perhaps, was inability to use data types that required off-row storage. It was possible to address that by changing database schema, implementing data partitioning or performing other tricks; however, it required complex development efforts and often made In-Memory OLTP migration cost ineffective.

SQL Server 2016 removes this and many other limitations. Now we can create tables with (max) columns and with data rows that exceed 8,060 bytes. There is the catch, however. Off-row storage in In-Memory OLTP works very differently comparing to on-disk tables and incorrect design decisions could and would affect performance of the system. Today we will look at that in details.

As all of us know, In-Memory OLTP does not use the data pages. Well, it uses data pages in nonclustered range indexes but the data rows are stored as the separate in-memory objects. They are linked into the row chains through the regular memory pointers. Every index in In-Memory OLTP adds another pointer and creates another chain of the rows.

There are two types of indexes supported in In-Memory OLTP – hash and nonclustered (range) indexes. I do not want to dive into all the details but hash index, in the nutshell, consists of in-memory hash table where each element stores the pointer to the data row chain with the same hash value of the key. You can see the simplified version in Figure 1, which shows the table with two hash indexes on Name and City columns and assumes that hash function generates the hash based on the first letter of the string.

01. Hash Indexes

Each data row has two timestamps that indicate its lifetime. They store the Global Transaction Timestamp of the transactions that inserted and deleted them. For example, you can see two “Ann” rows in the diagram. The first one, with City = “New York” has been created by a transaction with timestamp of 10. The city was updated to Cincinnati by transaction with timestamp of 50, which deleted the old and created the new versions of the row.

The second In-Memory OLTP index type – range index is very similar to B-Tree index defined on on-disk table. The range index consists of the data pages on root, intermediate and leaf levels. On root and intermediate levels, every index row points to the data page on the next level. On the leaf level, index rows point to the actual data rows with the same value of index key. The data pages in the index reference each other through another array-life structure called the Mapping Table as it illustrated in Figure 2. For example, the index row Kevin on the root page references the first element (PID = 1) in the mapping table, which, in turn, stores the pointer to intermediate data page with the highest key value of Kevin.

02. Nonclustered (Range) Indexes

One of very key elements in this schema is that every index, in the nutshell, is covering. Every memory pointer references the actual data row structure regardless of how many columns were included to the index keys.

Every In-Memory OLTP object uses memory and is called a memory consumer. Memory Consumers allocate memory from the structures called varheaps – one varheap per In-Memory OLTP object. You can think about varheaps as the data structures that respond to and track memory allocation requests and can grow and shrink in size when needed. You can track detail memory-allocation information per-memory consumer using sys.dm_db_xtp_memory_consumers view.

Let’s look at the example and create the table with one hash and one nonclustered indexes and look at memory consumers as shown below.

You can see the output of the query in Figure 3. The xtp_object_id column represents internal In-Memory OLTP object_id, which is different than SQL Server object_id.

03. Memory Consumers (In-Row Storage Only)

As you can see, the table has three memory consumers. The Range Index Heap stores internal and leaf pages of nonclustered index. The Hash Index Heap stores the hash table of the index. Finally, the Table Heap stores actual table rows. Figure 4 illustrates that.

04. Memory Consumers

Now let’s see what will happen if we alter our table and add row-overflow and LOB columns using ALTER TABLE statement shown below. This alteration is offline operation, which rebuilds the table in the background.

Now, if you get the list of memory consumers using the query from the first listing again, you would see the output as shown in Figure 5.

05. Memory Consumers with Off-Row Storage

As you can see, both off-row columns introduce their own Range Index Heap and Table Heap memory consumers. In addition, LOB column adds LOB Page Allocator memory consumer (more about it later). The minor_id column indicates the column_id in the table to which memory consumers belong.

As you can guess from the output, SQL Server 2016 stores both, row-overflow and LOB columns in the separate internal tables. Those tables consist of 8-byte artificial primary key implemented as nonclustered index and off-row column value. The main row references off-row column through that artificial key, which is generated when row is created. It is worth repeating that this reference is done though the artificial value rather than the memory pointer.

This approach allows In-Memory OLTP to decouple off-row columns from the main row using different lifetime for them. For example, if you update the main row data without touching off-row columns, SQL Server would not generate new versions of off-row column rows avoiding large memory allocations. Vice versa, when only off-row data is modified, the main row stays intact.

In-Memory OLTP stores LOB data in the memory provided by LOB Page Allocator. That consumer is not limited to 8,060-byte row allocations and can allocate large amount of memory to store the data. The rows in the Table Heap of LOB columns contains pointers to the row data in LOB Page Allocator.

Let’s assume that we run several DML statements with imaginary Global Transaction Timestamp values as shown below.

Figure 6 illustrates the state of the data and links between the rows. It is omitting hash table and nonclustered index structures in the main table along with internal pages of nonclustered indexes for off-row columns for simplicity sake.

06. In-Row and Off-Row Rows – Decoupled

Decoupling of in-row and off-row data reduces the overhead of creating extra row versions during data modifications. However, it will add additional overhead when you insert and delete the data. SQL Server should create several row objects on insert stage and update end timestamp of multiple rows during deletion. It also needs to maintain nonclustered indexes for off-row column tables.

There is also considerable overhead in terms of memory usage. Every non-empty off-row value adds 50+ bytes of the overhead regardless of its size. Those 50+ bytes consist of three artificial ID values (in-row, off-row in data row and leaf-level of the range index) and off-row data row structure. It is even larger in case of LOB columns where data is stored in LOB Page Allocator.

One of the key points to remember that decision which columns go off-row is made based on the table schema. This is very different from on-disk tables, where such decision is made on per-row basis and depends on the data row size. With on-disk tables, data is stored in row when it fits on the data page.

In-Memory OLTP works in the different way. (Max) columns are always stored off-row. For other columns, if the data row size in the table definition can exceed 8,060 bytes, SQL Server pushes largest variable-length column(s) off-row. Again, it does not depend on amount of the data you store there.

Let’s look at the example and create two tables of the similar schema. One of the tables has twenty varchar(3) columns while another uses twenty varchar(max) columns. Let’s populate those tables with 100,000 rows with 1-character value in each column. The code is shown in listing below.

Figure 7 illustrates memory consumers in this scenario (in-row at top and off-row at the bottom). As you can see, every varchar(max) column adds another internal table to the mix.

07. Test Table Memory Consumers

The execution times of INSERT statements n my environment are 153 and 7,722 milliseconds respectively. With off-row storage, In-Memory OLTP needs to add data to twenty other internal tables, which is 40 times slower comparing to in-row storage.

Let’s look at the total memory usage of both tables using the queries below.

As you can see in Figure 8, off-row storage uses about 252MB of RAM comparing to 12MB of RAM with in-row storage.

08. Test Tables Memory Usage

There is another important implication. Indexes defined on the table are not covering off-row data. SQL Server needs to traverse nonclustered indexes on off-row column tables to obtain their values. Conceptually, it looks very similar to Key Lookup operations in on-disk tables done in reverse direction – from clustered to nonclustered indexes. Even though the overhead is significantly smaller comparing to on-disk tables, it is still the overhead you’d like to avoid.

You can see this overhead by running SELECT statements shown below. In case of off-row data, SQL Server will have to go through every internal table for each row.

select count(*)
from dbo.DataInRow
where Col1='0' and Col2='0' and Col3='0' and Col4='0' and Col5='0' and Col6='0' and Col7='0' and Col8='0' and Col9='0' and Col10='0' and Col11='0' and Col12='0' and Col13='0' and Col14='0' and Col15='0' and Col16='0' and Col17='0' and Col18='0' and Col19='0' and Col20='0';
select count(*)
from dbo.DataOffRow
where Col1='0' and Col2='0' and Col3='0' and Col4='0' and Col5='0' and Col6='0' and Col7='0' and Col8='0' and Col9='0' and Col10='0' and Col11='0' and Col12='0' and Col13='0' and Col14='0' and Col15='0' and Col16='0' and Col17='0' and Col18='0' and Col19='0' and Col20='0';

Execution time in my environment is 73 milliseconds for in-row data and 1,662 milliseconds for off-row data.

Finally, let’s look what happen when we delete the data from the tables by using the code below.

delete from dbo.DataInRow;
delete from dbo.DataOffRow;

As before, in-row storage is significantly faster – 28 milliseconds vs. 1,290 milliseconds. As you can guess In-Memory OLTP had to update end timestamp in every row in off-row tables.

You should remember about this behavior and avoid off-row storage unless you have legitimate reasons to use such columns. It is clearly the bad idea to define text columns as (n)varchar(max)just in case – when you do not store large amount of data there. As you see, off-row storage comes at very high cost.