Klaus Aschenbrenner provides independent SQL Server Consulting Services across Europe and the US. Klaus works with the
.NET Framework and especially with the SQL Server 2005/2008 from the very beginnings. In the years 2004 - 2005 Klaus
was entitled with the MVP award from Microsoft for his tremendous support in the .NET Community. Klaus has also
written the book Pro SQL Server 2008 Service Broker which was published by Apress in the Summer of 2008. Further
information about Klaus you can find on his homepage at http://www.SQLpassion.at. He also twitters at
http://twitter.com/Aschenbrenner.

My complete last week was dedicated to XTP (Extreme Transaction Processing), better known as Hekaton, that is introduced with SQL Server 2014. My main goal of the week was to get a better understanding of Hekaton, the current limitations, and the scalability of this amazing new In-Memory technology. This blog posting covers the following things about Hekaton:

Main Overview

Scalability

Limitations

1. Main Overview

Let’s start with a brief overview of XTP. The main goal of an In-Memory technology like XTP is very easy: use the hardware of our current server systems as efficient as possible. When we look nowadays at the hardware of modern server systems you can see the following problems/limitations:

Traditional (rotational) storage is very slow, and enterprise ready SSD storage is very expensive. On the other hand main memory (RAM) is very cheap, it just costs you a few 100 dollars to hit the 64 GB limit of the Standard Edition of SQL Server.

The CPU speed is not scaling anymore. We are currently stuck at around 3 – 4 GHz, anything more is not possible (expect when you are over-clocking your systems)

Traditional RDBMS systems can’t scale linearly, mainly because of the internal Locking, Blocking, and Latching (In-Memory locking of data structures, when they are accessed for reading and writing) mechanism

Executes RDBMS queries as fast as possible and with the least possible amount of CPU instructions to use the current limited clock speed of CPUs as cost-effective as possible

Avoids entirely Locking/Blocking, and Latching when executing a Read/Write workload against your RDMBS

These 3 points are just the 3 main pillars of Extreme Transaction Processing in SQL Server 2014:

With XTP you are able to cache complete tables in in memory (so-called Memory Optimized Tables), Stored Procedures can be compiled down to native C code, and for Memory Optimized Tables the complete Locking/Blocking, and Latching mechanism disappears, because XTP is based on the principles of MVCC (Multi-Version Concurrency Control). Let’s have a more detailed look on those 3 pillars.

In-Memory Storage

RAM for server systems is incredibly cheap. You just need a few hundred dollars to equip your server with 64 GB, and you have already hit the limit that can be used with the Standard Edition of SQL Server. Therefore XTP tables (In-Memory Optimized Tables) are completely stored in RAM. From a SQL Server perspective the whole data of the In-Memory Optimized tables live in a FILESTREAM File Group, and during the startup of SQL Server they read the data from the file group, and rebuild all your indexes on the fly.

This has also a huge impact on your RTO (Recovery Time Objective), because your database is only in the ONLINE state, as soon as ALL your indexes are rebuild during startup. The speed of the storage system where your FILESTREAM File Group is stored, has therefore a direct influence on your RTO. For that reason you can also place multiple containers inside the FILESTREAM File Group, so that you can spread the I/O during startup over multiple storage systems to get your database as fast as possible into the ONLINE state.

In the CTP1 XTP only supports so-called Hash-Indexes, which are stored entirely in Hash Tables in memory. SQL Server is currently able to Seek and Scan Hash Indexes. Beginning with the CTP2 Microsoft will also introduce so-called Range Indexes, where you can also make Range Queries very, very effective. Range Indexes are based on the principles of a so-called Bw-Tree.

Every In-Memory Optimized table is also compiled down to native C code. For every table you get one DLL, that is compiled through cl.exe (the Microsoft C Compiler, shipped with SQL Server 2014). The generated DLL is afterwards loaded into the process space of sqlservr.exe and can be seen in sys.dm_os_loaded_modules. The compilation itself is done in a separate process, which means you can see cl.exe in Task Manager as long as you are fast enough and catch it The following listing shows a typical C code that describes your table.

As you can see, the code itself is not very intuitive, but you can see how your table structure is described through the use of a struct in C. The nice thing about XTP is the complete natural integration of In-Memory Optimized Tables into the rest of the relational engine of SQL Server. So you can query these tables from traditional T-SQL Code, the whole story about Backup/Restore just works, and you also have the complete integration into the HA/DR technologies – Microsoft has done here an amazing amount of work in the area of integration. Besides the In-Memory storage, In-Memory Optimized tables are also avoiding completely Locking/Blocking, and Latching because they are based on the principles of MVCC. We will talk about that later in the section about Lock/Latch Free Data Structures.

The most important thing that you have to be aware of is the fact that you should only move your most performance critical tables to memory, NOT YOURCOMPLETE database! We will talk later in the section about Scalability about use cases where XTP make sense. Normally you have 95% of your database stored with traditional disk based tables, and the remaining 5% are stored with In-Memory Optimized Tables.

Native Compilation

With the In-Memory storage of table data Microsoft addresses the first problem of our current hardware systems: traditional, rotational storage is just slow. The second problem that needs to be addressed is the fact that the clock speed of processors is currently stuck at 3 – 4 GHz. We can’t go further up, because of the high heating that would be introduced. Therefore the current clock cycles must be managed as effectively as possible. This is a huge problem with the current implementation of T-SQL, because T-SQL is just an interpreted language.

During Query Optimization the Query Optimizer of SQL Server produces a so-called Query Tree, and that Query Tree is interpreted during runtime from the top operator down to all siblings. Paul White describes that behavior in an awesome blog posting. This introduces a massive amount of additional CPU instructions that must be executed for every executed Execution Plan in SQL Server. Additionally every operator (a so-called Iterator) is implemented as a C++ class, which also means that so-called Virtual Function Calls are used when executing the various operators. Virtual Function Calls again are very expensive regarding the CPU instructions that are needed for execution. In summary, an Execution Plan that is interpreted during runtime, produces a huge amount of CPU instructions, which means the current CPUs are not as effective as possible. You are just waisting valuable CPU cycles, that could spend otherwise in a more better way and improve the throughput of your workload.

Because of these reasons and limitations inside the Query Engine, SQL Server introduces with XTP so-called Native Compiled Stored Procedures. The idea behind it is very simple: the whole body of a Stored Procedure is just compiled down to native C code, and the result is again a DLL that is loaded into the process space of sqlservr.exe. Therefore you have no interpretation during runtime, and Virtual Function Calls are also eliminated completely. The result is that you can do the same amount of work with fewer CPU instructions, which means the throughput of your workload will be higher, because more work can be done with the available CPU cycles.

During TechEd North America 2013 indicated that they were able to reduce for some specific Stored Procedures the needed CPU instructions from 1.000.000 down to around 40.000. Imagine that performance gain: that’s an improvement of 25x! But as you will see later when we talk about the current limitations in XTP, this improvement is not for free… The following listing shows a typical C code that was generated for a simple Stored Procedure.

As you can see there are a lot of GOTO statements, which leads to a nice spaghetti code. But that’s a different story…

Lock/Latch Free Data Structures

When we have talked earlier about the In-Memory Storage of data in XTP, I have also said that SQL Server implements for In-Memory Optimized Tables Lock and Latch Free Data Structures. This just means that there is no waiting because of Locks and Latches involved when you want to read and write your data. A traditional RDBMS like SQL Server always acquires Exclusive Locks (X) for write operations, and Shared Locks (S) for read operations. Both locks are incompatible to each other.

This means readers are blocking writers, and writers are blocking readers. How long Shared Locks are hold are controlled through the various Transaction Isolation Levels. This approach is called Pessimistic Concurrency. With the release of SQL Server 2005 Microsoft also introduced a new concurrency model: Optimistic Concurrency. With Optimistic Concurrency readers are not acquiring Shared Locks anymore. They are reading their data transparently from the Version Store that is persisted in TempDb.

With the new Isolation Level Read Committed Snapshot Isolation (RCSI) you are getting back the version of the record which was valid at the start of your statement, and with the Isolation Level Snapshot Isolation you are getting back the version of the record which was valid at the start of your transactions, which means you get Repeatable Reads in Snapshot Isolation.

Setting one of these new Isolation Levels can have a huge positive impact on the throughput of your workload. But there are still problems that must be solved:

Writers are still acquiring Exclusive Locks, which means parallel Write Operations are still blocking each other.

When accessing data structures in memory (Data Pages, Index Pages), these structures must be latched, which means they can be only accessed Single-Threaded. That’s a traditional Multi-Threading Concurrency issue that has to be solved this way.

For that reason XTP introduces Hash Indexes that are based on the principles of MVCC (Multi Version Concurrency Control). With MVCC there is no need for Locks (even no Exclusive Locks!) and Latches anymore. Writers are not blocking writers anymore, and because Hash Indexes are not build around pages, the access to the data (that is internally stored in a Hash Table with Hash Buckets) is also Latch free. There is no blocking in memory anymore. This means that an awesome throughput is guaranteed when you are working with XTP. But when your In-Memory access is not latched anymore, you are mainly shifting your performance bottlenecks, mainly to the Transaction Log, as we will see in the next section about Scalability with XTP.

One side effect of MVCC is the so-called Writer-Writer conflict. You can have multiple write activities to the same record in XTP, and they are not blocking. The 1st writer just wins, and all the other concurrent write transactions are just failing. This means that you need in your application code changes that are catching this specific error, and are just retrying your transaction. It’s almost the same as Deadlock Handling. If you have already implemented Deadlock Handling in your application, it should be very easy also to handle the Writer-Writer conflict transparently to your end-users.

2. Scalability

By now you have a general idea about the main concepts and reasons behind XTP, but the big question is, in which scenarios XTP makes sense. In my opinion XTP is NOT a technology that has to be deployed everywhere. You need a specific use case where XTP makes sense. And trust me: most of the current SQL Server problems that we are facing today, are just Indexing problems, or problems with bad configured hardware (especially in the area of SANs).

I would NEVER EVER recommend moving to XTP, when you are facing such a problem! Always analyze the root cause, and try to resolve that one in the first step. XTP should be your last resort, because when you are using and implementing parts of your database with XTP, you have to think completely different. And there are a huge amount of different limitations that you have to live with. XTP is wicked fast (I would say *really damm fast*), but it’s not an all-time wonder, which can be applied everywhere

Microsoft positions XTP mainly for overcoming Latch Contention problems. As you have already seen earlier, Latching occurs all the time in memory, when you are accessing pages for read and write activities. When your workload goes up and up and up, at some point in time you introduce Latch Contention because of the Single-Threaded access to those pages in memory. The most traditional example here is the so-called Last Page Insert Latch Contention. If you want to learn more about removing Latch Contention in TempDb, I suggest watching my SQL Server Quickie about it.

This problem is very easy to reproduce: just follow the best practice and create a Clustered Key which is ever increasing to avoid hard Page Splits in your Clustered Index. Your workload will not scale – trust me! The problem here is that you have during INSERT statements a single hot-spot in your Clustered Index – the last page. The following picture gives you can overview of that specific problem.

As you can see from the picture, SQL Server has to traverse the right hand side of the Clustered Index from the Index Root Page down to the Leaf Level to insert a new record on the trailing edge of the Clustered Index. So you have a Single-Threaded access to the page in the Leaf Level, which means Single-Thread INSERT operations. This will hurt your performance massively! The following picture shows a simple INSERT statement in table with an INT IDENTITY column (ever increasing, leads to the Last Page Latch Contention!) that I have executed with ostress.exe (part of the RML Utilities from Microsoft) with a different amount of users on a 16 core machine (HP DL180 G6, 2 Quad-Core Sockets with HT enabled, 24 GB RAM, OCZ RevoDrive SSD):

As you can see in the picture, the throughput of the workload decreases as soon as you add additional users, and your Latch Waits just increases – traditional Last Page Insert Latch Contention with an ever increasing value. The contention occurs when accessing Index- and Data-Pages in memory, because of the Latching.

There are several ways to overcome the Last Page Insert Latch Contention:

Use a random Clustered Key, like a UNIQUEIDENTIFIER to distribute the INSERTs across the whole Leaf Level of the Clustered Index

Implement Hash Partitioning

In the first step you are thinking that you are doing it completely wrong when using a UNIQUEIDENTIFIER as your Clustered Key, but the throughput of your workload will increase tremendously. Hash Partitioning of the table is another option that you can deploy. Hash Partitioning just means that you create for every CPU core a different partition, and your Partition Function distributes the records with the Modulo operator across the different partitions. The following picture shows this approach.

You are just distributing the INSERT statement across different B-Tree structures in different partitions, so you get ultimately parallel INSERTS in the specific table. But there are several drawbacks, e.g. you need the Enterprise Edition of SQL Server, you can’t partition your table in another way, and you can’t use Partition Elimination effectively. The following picture shows the throughput of the test workload when the Hash Partitioning approach was used for that specific table:

As you can see from the picture the throughput scales up to 64 users, and with 128 users Latch Contention occurs again, and the throughput just decreases.

Imagine now what happens when you move that performance critical table to memory? Your throughput on that table will be wicked fast – XTP is REALLY DAMM FAST, because there is no Locking and Latching anymore! Let’s have a look on the following screenshot about the Batch Requests in SQL Server.

In this scenario I was able to run the simple INSERT statement with 200 users in a native compiled Stored Procedure and produced around 25.500 Batch Request/sec – with ZERO Waiting Tasks and ZERO Database I/O! Everything was happening within memory. But: the test was running this time in a Virtual Machine, because I have not deployed SQL Server 2014 CTP1 on my physical server. The VM had 8 Cores, and 20 GB of RAM assigned, and the VM and all corresponding SQL Server files were stored on a PCI Express based SSD.

The question is now how I have achieved that throughput with XTP? It was not really straightforward. The first thing that you have to think about is again your Clustered Key. IDENTITY columns are not supported with XTP. For that reason Microsoft suggests using a Sequence Object. Sequences are perfect, you can use caches, but XTP is so really damm fast, that you immediately hit contention in the Sequence Generator of SQL Server. SQL Server stores the Sequence values on page 132 of your primary data file. Page 132 is part of the system table sysobjvalues. When reading or writing to that page, SQL Server again has to latch that page, and your Latch Contention is back, but just on a different area in your database. You also can’t avoid latching this page, because system tables are still stored on traditional disk based tables. So from that perspective it seems that Sequences are not the best solution for XTP, if you want to scale your workload infinitely.

So let’s just go back to our good old friend UNIQUEIDENTIFIER. There can’t be any contention when generating UNIQUEIDENTIFIERs because they are just generated through an algorithm. The bad thing: the function NEWID() is in CTP1 of SQL Server 2014 “not yet implemented” But this doesn’t matter, because you can write a traditional T-SQL “Wrapper” Stored Procedure, generate the UNIQUEIDENTIFER in that one, and pass in the generated one to your native compiled one through a parameter value. Problem solved, and I was able to scale the workload up to 25.500 Batch Requests/sec. From my findings a UNIQUEIDENTIFIER is much better than a sequence, because there is no shared resource that has to be coordinated and latched. That’s one of my most important lessons learned from the previous week.

So the question is now what limits the throughput to 25.500 Batch Request/sec? Mainly 2 things: the Transaction Log and CPU usage!

Let’s have in the first step a look at the Transaction Log. Microsoft has done a tremendous amount of optimizations for the Transaction Log in XTP scenarios, e.g. there are no UNDO records. Microsoft just tries to minimize the amount of Transaction Log Records that gets written into the Transaction Log. The less you write, the faster it will be. To overcome the limits of the Transaction Log XTP offers you 2 kinds of In-Memory Optimized Tables:

SCHEMA_AND_DATA

SCHEMA_ONLY

SCHEMA_AND_DATA means that the table schema and the data is persisted, therefore XTP has to write Transaction Log Records into the Transaction Log as soon as you commit your transaction. During the execution of your transaction, XTP NEVER writes Transaction Log Records into the Transaction Log, because there is always the possibility that your transaction is rolled-back. SCHEMA_ONLY means that only the table schema is persisted, and the data in the table is just transient: when you restart your SQL Server, the data is lot. Huh??? You really have to think carefully about that option, when it makes sense, and when not. Microsoft positions that option mainly for 2 scenarios:

ASP.NET Session State Database

ETL Scenarios

The ASP.NET Session State Database makes sense, because you are not storing mission critical data here. Only session state information about your web users. For ETL scenarios it also make sense, because in the case of a failure it should be very easy to reconstruct the data. The 25.500 Batch Requests/sec from above were achieved with the option SCHEMA_ONLY, so I have cheated here a little bit. With SCHEMA_AND_DATA the Transaction Log throughput was my main bottleneck, and it limited me to about 15.000 Batch Requests/sec. As I have already said, everything was running on a PCI Express based SSD (Transaction Log, Data Files, Virtual Machine), so I should get a better throughput when I run everything on a bare metal installation with the Transaction Log stored on a dedicated SSD.

So when you deploy your In-Memory Optimized Tables with SCHEMA_AND_DATA, you just need one thing: an amazingly fast Transaction Log – as usual

When you are using SCHEMA_ONLY the CPU will be your bottleneck, because there is nothing else. As you can see in the picture above the average CPU consumption was about 85% within a Virtual Machine. I assume that would be able to double that throughput, when I deploy everything on a bare metal installation, because I was only able to assign 8 cores out of 16 cores to my Virtual Machine that was running the workload. On the host level I had an average CPU consumption of 50%, so doubling the throughput should not be the problem (I think so…). This would be then 50.000 Batch Requests/sec on a very low cost hardware…

3. Limitations

By now everything was great about XTP, and it seems to be a great solution to some specific problems in SQL Server. But XTP has a very high price – a long list with limitations, especially with the CTP1. Here we go with the things that I have found out over the last week (of course there is more documented in Books Online, when you read through everything):

The whole table definition must be described in one CREATE TABLE DDL statement (incl. Indexes, Constraints)

You can’t insert data from a different DB directly into a Memory-Optimized Table, you need a intermediary Staging Table. That’s a huge bummer for the above mentioned ETL scenarios

…

Besides all these limitations, there are also still bugs in CTP1, and I was able to completely crash my database, so that it was not able to recover anymore. Just have a look on the following screen shot:

Don’t ask me how this was possible, I had also not the chance to reproduce it…

Conclusion

XTP is really a wicked fast new technology in SQL Server. The only thing that you have to be aware of is the fact that you need an amazingly fast storage for your Transaction Log, and that it is very easy to hit other bottlenecks in your system, as soon as you deploy a solution based on XTP. I hope that I gave you with this blog posting a good introduction for XTP, and I hope you will enjoy as much as I have done it over the last week.