SQL Server Blog

Today I thought I’d try out some of the new, highly-touted columnstore improvements in SQL Server 2014. First off, to figure out what’s new I used three main sources: one is the “canon”, the whitepaper, “Enhancements to SQL Server Column Stores“, presented at Sigmod 2013. The second source is the SQL Server 2014 CTP1 Books Online page, that documents and illustrates the DDL to define and work with clustered (and nonclustered) columnstore indexes. The third source is Niko Neugebauer’s excellent blog post series and experiments on Clustered Columnstore Indexes, that starts here.

Rather than repeat what those folks have already implemented, documented, and experimented with respectively, I went off on a different tangent. I’d like to find out how the new clustered columnstore index affects query plans, and look for changes in queries between SQL Server 2012 and 2014 CTP1. I started by restoring AdventureWorks2014 to a SQL Server 2014 instance and a SQL Server 2012 SP1 instance. And taking the FactInternetSales table up to over a million (about 1.15 million) rows using a combination of select…into and insert…select.

On the SQL Server 2012 instance a made in copy of the table without a nonclustered columnstore index and one with a nonclustered columnstore index. The nonclustered columnstore index contained all of the columns in the table. SQL Server 2014 had a third copy of this table, one that used a clustered columnstore index.

With each table (named FactInternetSalesNew, FactInternetSalesNewNCI, and FactInternetSalesNewCCI so I could keep trace of them), I did the relevant variation of the following query:

select ProductAlternateKey, CalendarYear, SUM(SalesAmount) as SumSales
from FactInternetSalesNew fact
join DimDate on fact.OrderDateKey = DateKey
join DimProduct on DimProduct.ProductKey = fact.ProductKey
where CalendarYear BETWEEN 2006 AND 2007
and ProductAlternateKey LIKE ‘BK%’
group by ProductAlternateKey, CalendarYear;

The query plans were interesting is their differences and similarities.

1.The “traditional” row-based storage used a HashJoin with a Bitmap for one dimension (DimDate) and a MergeJoin to join in the Product dimension. This was the same in SQL Server 2012 SP1 and 2014. This is a different plan shape than any of the columnstore plans, with the dimensions being joined in a different order and the aggregation happening at a different point in the plan.
2.The SQL Server 2012 nonclustered columnstore index used BatchHashTableBuild iterators to build hash tables for each dimension, then HashJoins for the join with each dimension. The scan of the fact table, BatchHashTableBuilds and HashJoins, and a HashMatch (partial aggregation) iterator after the joins used Batch Mode; the rest of the iterators used Row mode.
3. The SQL Server 2014 nonclustered columnstore index did NOT use BatchHashTableBuild iterators, instead the join between the Fact table and DateDim used a HashMatch. However, the HashMatch iterator did contain the new (for 2014) property “BitmapCreator=True”. The HashMatch iterator to join in the Product dimension did use a HashMatch, but the BitmapCreator property was not present. All iterators EXCEPT the Exchange, DimensionTable scans, and the last HashMatch (aggregate) used Batch Mode.
4. The SQL Server 2014 clustered columnstore index plan was essentially the same as the nonclustered columnstore index plan.

For clarity’s sake, I’ll continue the series and see how the cardinality estimation works with columnstore indexes in the next post.
For expediency’s sake, the plans are posted as an attachment. I’ll hopefully get back to replace them with pictures soon.

So, to summarize:
1. The query plan for nonclustered columnstore index changes for the example query between SQL Server 2012 SP1 and 2014 CTP1. All experiments are performed on my 2-proc VM.
2. In SQL Server 2014 CTP1, the plan for nonclustered and clustered columnstore indexes are the same. The savings of the clustered columnstore index is that you don’t have to store your actual data twice.

According to the whitepaper “Mission-Critical Performance and Scale with SQL Server and Windows Server” there’s a new cardinality estimator in SQL Server 2014. The whitepaper doesn’t say a lot about how it works, but does mention “trace information will be provided”.

That’s XEvent trace of course, and the trace event is “query_optimizer_estimate_cardinality”. I decided to see what kind of information there is. Created an event session with that single event, and added an action of sql_text and named it, unsurprisingly, “Cardinality”.

Before starting it up, you need to make sure you’re using the new cardinality estimator on your queries; you can check this by looking in the query plan at the value of property “CardinalityEstimationModel110″. You should confirm that the value is “ON”. (Note: this a strange to me because, if the new estimator was introduced in SQL Server 2014, shouldn’t it be CardinalityEstimationModel120?) Databases that are restored from earlier versions’ backups don’t seem to have that property on, databases created and populated “from scratch” on SQL Server 2014 will. There’s also a trace flag 2312 to enable the new estimation model. You also need to be creating a plan, so make sure you don’t already have the plan in cache. Not getting the new model with restored databases from previous versions lets you compared the cardinality estimates and query plans between 2014 CTP1 and previous SQL Server versions.

Note: This post is already way too long to answer the question “What does a cardinality estimator in a database do?” To get a feel for it type “Cardinality Estimation SQL Server” into your favorite search engine and I’ll meet you back here in a while…(hint…it estimates cardinality).

This event fires, naturally, multiple times during the creation of a query plan. I started off with a “simple” 5-table join with a GROUP BY clause and received 54 of these events. Each event contains three major pieces of diagnostic information:
–InputRelation – Logical operator and part of a parse tree. Some ops contain cardinality numbers.
–Calculator – Information about the cardinality calculators used.
–StatsCollection – Stats read for this estimation step and estimates.

After switching back and forth between the chunks of XML using SSMS, I decided to factor the most useful pieces of information into columns, for easier reading. However the XML can be arbitrarily complex, so I left the original XML columns intact as well. Some calculators provide different estimate information, so these should be factored out into a case-based columns. After reading these sets of rows for a while, things become as intuitive as when you first started reading and understanding query plans.

My first row for this query was for the global aggregate logical operator (LogOp_GbAgg). I sifted out the following pieces of information:
From InputRelation:
–Operator: LogOp_GbAgg
–Table1: Order Details
–Cardinality1: 2155
From Calculator:
–CalculatorType: DistinctCountCalculator
–CalculatorName: CDVCPlanLeaf
–SingleColumnStat: OrderID
–CoveringStatDensity: 2155
From StatsCollection:
–StatsCollectionName: CStCollGroupBy
–StatsCollectionCardinality: 830
–StatsInfoObject1: Order Details
–StatsId: 1

Which means that we expect to read 2155 rows from “Order Details” and group it by OrderID into 830 groups. This is what’s estimated for the query plan and works out to be exactly the same as the acual number of rows. Comparing it to the “old” (pre-2014) estimate, its exactly the same. For this one. Other estimates that I’ve looked at are usually closer in the “new” estimate than the “old” estimate and, in many cases result in different query plans. As in every query optimizer change, there are also some regressions. I’ve only looked at less than 20 queries so far.

I’m providing the query I used to take apart the XEvent trace along with this article. You’re welcome to add more columns, and if you come up with more interesting ways to present the information, please let me know. BTW, the SQL query heavily uses SQL Server XQuery, so it’s going to be slow for large trace files unless you materialize intermediate results as tables and add XML indexes.

I also “consulted the binary” for lists of enumerated values of Operator, CalculatorType, CalculatorName, and StatsCollectionName, but they are too many to list here; they’re included in the zip file. It’s interesting to see what kind of query clauses result in a special calculator (e.g. CSelCalcTrieBased or CSelCalcExpressionComparedToExpression) or special statistics collection names (my favorite amusing ones are CStCollFudge and CStCollBlackBox).

Every new version of SQL Server includes new or expanded features that effect query plans. Most (sometimes not all) of the query plan changes (and most of the functionality changes) can be discovered by “diff-ing” the XML schema for query plans between the current/new version and the previous version. I’ve done this before to enumerate the plan changes from SQL Server 2008 R2 to SQL Server 2012.

First, the BaseStatementInfoType (these show up as the properties on the far left iterator (e.g. SELECT iterator) has some new attributes. Note that not all of them show up in the CTP1 plans I’ve looked at, the ones I’ve seen in real plans are marked with an asterisk.
CardinalityEstimationModel110* – does this use the new SQL Server 2014 cardinality estimate model or the “legacy” model?
StatementSqlHandle
DatabaseContextSettingsId
ParentObjectId
StatementParameterizationType

There’s also a new reason for non-parallelized plan: NoParallelForMemoryOptimizedTables. I didn’t find this in the diff, but noticed this in real CTP1 plans while looking for the other attributes.

In the QueryPlanType structure, there’s an additional attribute, EffectiveDegreeOfParallelism. This is the maximum parallelism used by a columnstore index build. Parallel columnstore index builds is a new feature.

New Attribute on HashType: BitmapCreator
New Attribute on RelOpType: StatsCollectionId

Note that, as I mentioned before, I haven’t actually seen all these new attributes yet and the XML Schema doesn’t always make discovering where I should look for them easy. But I now know approximately what to look for and looking at query plans in XML format (rather than “picture” format) will help. As time goes on, we’ll find out what causes these attributes to surface. For example, the CardinalityEstimationModel110 attribute appears in most/every query plan, but the EffectiveDegreeOfParallelism requires collecting a query plan for a columnstore index build.

Feel free to let me know if you find any more items. Happy splunking…
Bob

It’s stated in SQL Server 2014 Books Online that “The following isolation levels are supported for transactions accessing memory-optimized tables: SNAPSHOT, REPEATABLE READ, and SERIALIZABLE”. But later on, on the same page, it’s stated the “The isolation level READ COMMITTED is supported for memory optimized tables with autocommit transactions”. “Autocommit transaction” is just another way to say that individual SQL statements are always atomic, and so an “autocommit transaction” is a single SQL statement that either completely commits or completely rolls back. There’s no way, for example, for a single SQL UPDATE statement to update 10 rows successfully but fail to update 1 row. Either all 11 rows are updated or none of them are.

I had trouble reconciling these two seemingly conflicting statements about Hekaton tables’ isolation level and set out to find out if Hekaton tables can implement “traditional” SQL Server read committed semantics, or if they *support* read committed, but implement it with snapshot semantics. Read committed means that “the latest committed version of a row is read”.

A well-known behavior of SQL Server tables under read committed is known as “statement-level inconsistency”. You can see this behavior by executing a single SELECT statement in a client program running under a debugger, reading a few rows and then stopping the program, and, in a different session, updating, deleted, or inserting rows that have not yet been read (but remember that SQL Server uses 8000 byte buffers for reading and sending rows to the client side). The final resultset usually will reflect the changes that happen *during* the execution of the statement, which includes the reading of the rows from table to buffer to client. If you haven’t seen this behavior demonstrated, come to a class or place where I’m speaking and ask.

Back to Hekaton tables. There are four use cases, and let’s assume the session’s isolation level is read committed, the default.
1. Running in a compiled stored procedure – the isolation level must be specified in the stored procedure definition. Read committed is not supported.
2. Explicit user transactions (i.e. BEGIN TRAN…etc) – attempting to perform a SELECT statement on a Hekaton table without an isolation level hint or with a query hint of readcommitted produces an error.
3. Implicit user transactions (i.e. SET IMPLICIT_TRANSACTIONS ON) – work the same as explicit user transactions with respect to Hekaton tables.
4. Autocommit transactions – attempting to perform a SELECT on a Hekaton table without an isolation level hint succeeds. You aren’t allowed to use the readcommitted query hint even in an autocommit transaction.

We’re interested in use case 4. And, with that use case, in CTP1, using a SELECT statement without a hint produces SNAPSHOT semantics, rather than READ COMMITTED semantics. There is no statement-level inconsistency for Hekaton tables Also with READ COMMITTED SNAPSHOT isolation, the behavior of the READ COMMITTED isolation level (RCSI) is indistinguishable from SNAPSHOT.

Given those experimental results (and perhaps I need to do more experimentation with queries with more complex plans), I’m prepared to conclude until convinced otherwise that Hekaton tables support READ COMMITTED (and RCSI) with autocommit transactions, but *implement* that level with SNAPSHOT semantics. What’s the difference?

With a multi-version storage engine, each version is read as of a well-defined logical read time. With SNAPSHOT isolation, the logical read time is the beginning of the transaction. READ COMMITTED isolation would read the appropriate version with a logical read time equal to the current time, according to the whitepaper “High-Performance Concurrency Control Mechanisms for Main-Memory Databases“. In autocommit transactions, the logical read time always appears to be the beginning of the (one and only) SQL statement. SNAPSHOT and READ COMMITTED and (RCSI) are indistinguishable.

To wrap this up, most folks would consider the Hekaton behavior to be an improvement, as READ COMMITTED statement-level inconsistancy is not usually considered to be a desirable behavior. But it was worth asking the question… So do Hekaton tables really implement only “SNAPSHOT, REPEATABLE READ, and SERIALIZABLE” isolation?

As you can probably tell by the last couple of blog postings, I’ve be doing a bit of experimenting with the SQL Server 2014 Hekaton feature in the past few weeks. I’ll be doing a hopefully demo-heavy (but I do have a few slides) talk next Thursday July 25 at the Portland SQL Server User Group at 6:30 to share some observations. If you’re at all interested (and you’re nearby, this is Portland Oregon, after all) show up with your questions and your intuition and we’ll all share, maybe try out some things I haven’t thought of yet. Hope to see you there.

In reading the whitepaper about “High-Performance Concurrency Control Mechanisms for Main-Memory Databases”, I was intrigued by the discussion of speculative reads and transaction dependencies. It’s not always good to use information from an academic whitepaper as details of an implementation, because the real implementation might be differ slightly from the description of the whitepaper’s implementation. In addition, bear in the that the observations here are based on the CTP1 version of SQL Server 2014, and the details may change by RTM.

The whitepaper describes two implementations of the multiversion storage engine, one using optimistic concurrency with no locking and one using locking. Because the description of the Hekaton feature mentioned “lock-free structures” as one of the pillars, I looked at the optimistic implementation as possibly close to the CTP1 implementation.

To paraphrase the whitepaper, there are at least two ways the speculative reads can take place.
1. Transaction A determines if it can read Version V. Version V’s begin timestamp is another transaction’s (TB) ID. TB is in the preparing state. Use transaction B’s end timestamp (obtained right before TB-prepare) as V’s begin time. Speculatively read V. Transaction A has a commit dependency of Transaction B.
2. Transaction A determines if it can read Version V. Version V’s end timestamp is another transaction’s (TE) ID. TE is in the preparing state. Use transaction E’s end timestamp (obtained right before TE-prepare) as V’s end time. Speculatively ignore V. Transaction A has a commit dependency of Transaction E.

These descriptions mention dependencies happening in the transaction A’s Active phase, but in addition, there is mention that “Transaction A may acquire additional commit dependencies during validation but only if it speculatively ignores a version.” For information about transaction phases, reference the previous blog entry or the whitepaper.

After noticing the extended events dependency_acquiredtx_event and waiting_for_dependenciestx_event, I set out to look for those dependencies. Because tx A can only acquire a dependency on tx B when B is in the preparing phase (state) and, in most cases, the preparing state is usually pretty short, the dependency sounded almost like a race condition. Making the preparing state as long as possible would give me a better chance.

In the implementation of pessimistic concurrency, the whitepaper mentions two “read sets” that are checked during the prepare phase. The ReadSet contains pointers to every version read and the ScanSet stores information needed to repeat every scan. The whitepaper also describes WriteSet but that’s outside my scope. During the prepare phase ReadSets are checked to ensure consistency in isolation level Repeatable Read or Serializable. In addition, ScanSets are checked to guard against phantoms in isolation level serializable. Serializable isolation with a large ScanSet seemed to be the best choice to lengthen the prepare phase.

I declared a two column memory-optimized table with a primary key/hash index on one column (I called it id) and no index at all on the other column (c1). Added 250000 rows. And figured that running the following batch inside a “traditional transaction” (to slow things down even more compared to a compiled stored procedure) from multiple clients should produce the behavior.

Running betwen 5-10 of these clients simultaneously for 20 iterations of each client produced the “transaction dependency” behavior consistently. Tracing this with an extended events session that included:
dependency_acquiredtx_event
waiting_for_dependenciestx_event
before_changestatetx_event
after_changestatetx_event

Besides observing the behavior, I was able to make some interesting observations from the event session.
1. You can have “chains” of dependent transactions, e.g. tx A depends on tx B, tx B depends on tx C, etc.
2. You can have multiple dependent transactions on the same transaction, e.g. txs A, B, and C all depend on tx D.
3. You can have multiple dependency_acquiredtx_event for the same two transactions, e.g. two different occurrences of the event for the dependency tx A depends on tx B.

I also noticed one other interesting behavior. In my tests with 10 clients x 20 iterations (200 transactions total) between 60% and 80% of the transactions routinely abort. That’s not surprising, the test was set up to produce read conflicts. What was surprising is that, although transaction commit wasn’t the norm, every transaction that had dependency(ies) on it ended up committing. 22 of 200 transactions in one test. And the dependent ones also committed. That’s surprising, but perhaps the fact that transactions that have dependencies and those that they depend on all eventually commit is just an artifact of the test batch I chose.

So, from this test you can deduce that:
1. Transaction dependencies and speculative reads are real (can be observed) in the current implementation.
2. Large scans and transaction isolation levels higher than snapshot should only be used with memory-optimized tables when *absolutely* necessary. 3. That’s especially true for transaction isolation level serializable.

In the last blog post, I tried out a simple example to show the difference between the three transaction isolation levels supported by memory-optimized tables. I only used user-transactions in T-SQL, and only showed the behavior of transactions consisting of SELECT statements, when INSERT/UPDATE/DELETE activity was taking place in other sessions running concurrently. But of course, things are not always *that* simple, and there are differences regarding:
1. Transactions that do concurrent INSERT/UPDATE/DELETE in multiple sessions
2. Transactions running in or using compiled stored procedures (which must be defined as atomic and declare their isolation level)
3. Autocommit transactions (i.e. single atomic SQL statements)

Today, I’ll expand the discussion to cover some additional cases.

One thing I did mention in the previous post is WHEN a user transaction fails. The previous post originally mentioned (as a simplification) that transactions with memory-optimized tables “…fail the user transaction at commit time, if there’s a conflict detected. SQL Server’s SNAPSHOT against “traditional” tables fail during user transactions at the statement where the conflict is detected.”

That’s not always true with concurrent *actions* in a user transaction. User transactions can fail at different points in a transactions lifetime, that is:

1. Normal Processing phase – during the lifetime of a transaction
2. Preparation phase – transaction decides whether it can commit or must abort

The preparation phase concludes with writes to the transaction log, if the transaction commits. The postprocessing phase consists of timestamp fix up, whether the transaction commits or aborts. It’s also good to remember that, in a T-SQL user transaction, the transaction begins with the first statement after “BEGIN TRANSACTION” that touches data, rather than during the T-SQL “BEGIN TRANSACTION” statement.

All of the conditions I looked at last week failed in the preparation phase.

It’s obvious that, if two concurrent INSERTS with the same primary key happen, one of them must fail. If the later insert happens during a transaction, that transaction fails at commit time (in the preparation phase), rather than normal processing time.

begin tran insert dbo.t1 with (snapshot) values(5, ‘Fred’)
— insert row with same primary key in another session and commit other session (insert, commit in other session works)
commit — Msg 41325, Level 16, State 0, Line 66 The current transaction failed to commit due to a serializable validation failure.

However, WRITE conflicts (transaction updates or deletes a row already updated/deleted by another session AFTER the transaction starts) fail at normal processing time. This is what we’re used to with conflict detection in SNAPSHOT isolation level with “ordinary” SQL Server tables. The previous post is rephrased from the simplified original “…fail at commit time” to “…CAN fail at commit time” to indicate that write conflicts fail the way that we’re “used to”.

begin tran
select * from dbo.t2 with (snapshot) — starts the transaction
— update row 2 in another transaction and commit other session
update dbo.t1 with (snapshot) set c2 = ‘Sam’ where c1 = 2
— Msg 41302, Level 16, State 110, Line 120
— The current transaction attempted to update a record that has been updated since this transaction started. The transaction was aborted.
— Msg 3998, Level 16, State 1, Line 120
— Uncommittable transaction is detected at the end of the batch. The transaction is rolled back.
commit — transaction already rolled back (above)

begin tran
select * from dbo.t2 with (snapshot) — starts the transaction
— update row 2 in another transaction and commit
delete dbo.t1 with (snapshot) where c1 = 2
— Msg 41302, Level 16, State 111, Line 130
— The current transaction attempted to update a record that has been updated since this transaction started. The transaction was aborted.
— Msg 3998, Level 16, State 1, Line 130
— Uncommittable transaction is detected at the end of the batch. The transaction is rolled back.
— The statement has been terminated.
commit — transaction already rolled back (above)

Just to confirm what I mentioned before, the following shows that the transaction starts when the first data is touched, rather than at BEGIN TRANSACTION. If I remove the SELECT from (unrelated) table dbo.t2 and update and commit row 2 in another session before accessing data at all, the transaction is consistent (you’re just updated already updated/committed data after the transaction starts) and the UPDATE succeeds.

In traditional SNAPSHOT isolation, 2 INSERTers with the same primary key or 2 UPDATE/DELETEers when both sessions are in transactions cause the transaction with the last action to wait, but with multi-version memory-optimized tables, we don’t want to wait, and rather than wait transactions can take dependencies on other transactions, continuing but assuming the transactions “before” them will commit. These dependencies are resolved by waiting, if needed, during the preparation phase, rather than doing normal processing. A best practice with memory-optimized tables is to keep transactions short, to minimize waits for transaction dependencies. Refer to the whitepaper for a description of conditions that can cause transaction dependencies.

There are places where reading or writing inside a transaction can cause other sessions to wait, however. In repeatable read isolation level and above, but not in snapshot isolation level, a SELECT statement on a memory-optimized table WITH A T-SQL STATEMENT within a T-SQL user transaction will cause another session attempting to DROP the table to wait.

begin tran
select * from dbo.t1 with (snapshot)
— drop table t1 in the other session, it works
select * from dbo.t1 with (snapshot) — invalid object name dbo.t1
commit — OK

begin tran
select * from dbo.t1 with (repeatableread)
— drop table t1 in the other session, this waits
select * from dbo.t1 with (repeatableread) — OK
commit — OK, (and drop works after this tx committed)

However, reads from within a repeatable read isolation level COMPILED PROCEDURE and writes (INSERT/UPDATE/DELETE) from within a COMPILED PROCEDURE at any isolation level allow a DROP TABLE from another session statement to work, and cause a validation error at preparation time. Writes in a T-SQL statement at any isolation level cause the same validation error.

begin tran
execute dbo.insert_t1 5, ‘Fred’
— drop procedure dbo.insert_t1 and dbo.insert_t1 and table t1 in the other session (this works without hanging/waiting)
—
— note that we must drop the procedure (and any other procedures that reference table t1)
— first, because compiled procedures are schemabound.
— attempting to drop the table without dropping the procedure would fail with a schemabinding error.
commit — Msg 41305, Level 16, State 0, Line 211 The current transaction failed to commit due to a repeatable read validation failure.

So we’ve succeeded in muddying the waters a bit by illustrating that:
1. With memory-optimized tables, transactions can fail during normal procession as well as the preparation phase
2. Transactional behavior can cause different behavior in other sessions depending on whether we use a compiled procedure or use a T-SQL statement.

With memory-optimized tables, these are not your mother’s/father’s traditional transactions. AND…we haven’t covered all the differences and edge-cases yet, either.

I finished reading the Books Online and CTP1 whitepaper introduction to transactions and memory-optimized tables a couple of days ago. Although I know what the transaction isolation levels mean, I was sort of surprised by some of the descriptions and found them a bit confusing. So I decided to go back to basics and write some simple examples that demonstrated the transaction isolation levels.

Firstly though, I have to define some terms: Books online mentions differences with memory optimized tables between autocommit transactions and user/implicit transactions. “Autocommit transactions” refers to the fact that individual SQL statements are atomic by definition; they either commit or rollback as a unit. For example, if I issue a SQL UPDATE statement that updates 10 rows and 1 row fails update because of a constraint, the statement fails and 0 rows are updated at end-of-statement. In these examples, I’m using “user transactions” where 1-n statements are made part of a single transaction with the BEGIN TRANSACTION and the COMMIT/ROLLBACK TRANSACTION statements. Implicit transactions are another way to make multiple SQL statements transactional, I’m not using this here.

I’m using memory-optimized tables in a database named “simple”. To mention a bit more about the “environment”:

— The database simple snapshot_isolation_state and RCSI turned off at the database level (sys.databases)
— I’m running at default isolation level of read committed (dbcc useroptions)
— XACT_ABORT session setting is turned off (dbcc useroptions doesn’t have a row declaring it is on)

Memory-optimized tables support three transaction isolation levels in user transactions: SNAPSHOT, READCOMMITTED, and SERIALIZABLE. My first puzzle was the BOL describes SNAPSHOT as the “lowest” of the three where memory-optimized tables are concerned. I’ve always considered SQL Server’s SNAPSHOT isolation level as most like Oracle’s SERIALIZABLE isolation level (SQL Server couldn’t use the word serializable because that is implemented by locking in SQL Server). But in the world of memory-optimized tables and user transactions, SNAPSHOT is indeed a lower isolation level than READCOMMITTED. Hmmm.

Also, all three of these levels, when used with memory-optimized tables, can fail the user transaction at commit time, if there’s a conflict detected. SQL Server’s SNAPSHOT against “traditional” tables fail during user transactions at the statement where the conflict is detected.

So, on the examples on the simplest user-transaction behavior… complete script is at the end for you to try. I execute the statements in SQL Server Management Studio, one statement at a time. I’m not using compiled stored procedures.

The only “strange” thing here is that you can use SNAPSHOT against memory-optimized tables even though snapshot_isolation_state is OFF at a database level. That’s because memory-optimized tables don’t implement snapshot isolation level the same way as regular tables (e.g. no version store for them).

2. REPEATABLE READ
begin tran
select * from dbo.t1 with (repeatableread)
— update a row in another session
commit — Msg 41305, Level 16, State 0, Line 39 The current transaction failed to commit due to a repeatable read validation failure.

begin tran
select * from dbo.t1 with (repeatableread)
— insert a row in another session
select * from dbo.t1 with (repeatableread) — does not see new row
commit — OK

Repeatable read doesn’t block after reading rows (no locking for memory-optimized tables), but fails at commit time if repeatable read semantics have been violated during the transaction (e.g. any of the rows have changed since you read them). Note that the transaction starts at the first SQL statement, NOT at the same you issue BEGIN TRANSACTION.

begin tran
select * from dbo.t2 with (repeatableread) — transaction starts with first SQL statement
— update row in t1 in another session before reading
select * from dbo.t1 with (repeatableread) — sees the old value
commit — Msg 41305, Level 16, State 0, Line 72 The current transaction failed to commit due to a repeatable read validation failure.

3. SERIALIZABLE.

Note that the READCOMMITTED transaction failed if you changed a row you’d read, but did NOT fail if someone else inserted a row during your transaction. Failing a transaction when someone else inserts a row inside a keyrange you’ve read (table you’ve read, if you’re reading all the rows) only happens in serializable isolation level.

begin tran
select * from dbo.t1 with (repeatableread)
— insert a row in another session
select * from dbo.t1 with (repeatableread) — does not see new row
commit — OK, commits in repeatableread

begin tran
select * from dbo.t1 with (serializable)
— insert a row in another session
commit — Msg 41325, Level 16, State 0, Line 55 The current transaction failed to commit due to a serializable validation failure.

And there you have it!

However, it’s not always that simple. Here’s a couple of things that differ with respect to environment, right off the bat…
1. There are subtly different rules for “autocommit” transactions.
2. There are differences when you mix memory-optimzed and non-memory optimized tables in the same user transaction.

This is just meant to be a gentle introduction. The script is enclosed for your reference. For more information, refer to “Transactions in Memory Optimized Tables” topics in BOL.

Don’t get me wrong, there’s almost no one who enjoys a good SQL Profiler diagnostic trace more than I do. But, starting in the early CTPs of SQL Server 2008, I’ve become impressed with the diagnostic enablement afforded through Extended Events. Enough to hack through that first beta (sans docs, with only metadata) and blog about it for a week. XEvents’ biggest weakness was that there was no GUI, and a guy named Jonathan Kehayias took care of that with his SSMS add-in.

In SQL Server 2012, it was announced that the old trace would be deprecated in favor of XEvents. The sheer number of events (618 in SQL Server 2012 RTM, with 9 more in SQL Server 2012 SP1) dwarfed the profiler events, as well as a nice GUI built into SSMS and programmatic interfaces for session definition, metadata, and trace-reading (including realtime) with a component called LINQ to XEvents, things were sort of sealed. That’s not even mentioning the actions, the powerful predicate system, and the series of targets.

Still, XEvents seems daunting for some, mostly (I think) because there’s too much XML involved. But c’mon, XPath is, for the most part, like navigating through a file system. And basic XPath is all you need, for the most part, to reformat the event data. Goodness, the XEvent data doesn’t even use XML Namespaces, to keep the querying simple.

After looking at the SQL Server 2014 CTP1 metadata, the results are in; things are settled. The score is:

— 763 XEvents total (that’s 145 new ones) vs. 0 new trace events (although SQL Profiler/event system is still around)
— 4 new packages, including 3 related to Hekaton (internals, compiler, and runtime).
— 68 events in the “sqlserver” packages, including about 10 more xtp (Hekaton) events in these packages, events to trace interactions with Windows Azure storage, for the new cardinality estimating code, and new columnstore functionality.

In a release where there are actually 3 new Event Notification event types (for CREATE/DROP AUDIT and DDL_DATABASE_AUDIT_EVENTS, remember Event Notifications?) the fact that there’s no new trace capability means YOU need to start learning/experimenting with Extended Events now. if you haven’t already. Go to a class, or a presentation, or read blogs or something… And you may want to brush up on your XPath…

A good place to start would be with Jonathan’s “An XEvent A Day” blog series. Or his two courses on Pluralsight about them. But start NOW, before you find yourself in SQL 2012 or 2014 with your diagnostic hands tied behind your back. How else can you look at wait stats for a specific workload or work with the T-SQL stack?

And BTW, Here’s a list of all the new events ([package name].[event name]) in 2014 CTP1 (since 2012 SP1):

hkenginexepkg.after_changestatetx_event – Fires after transaction changes state
hkenginexepkg.alloctx_event –
hkenginexepkg.attempt_committx_event – Is raised when a transaction is asked to commit
hkenginexepkg.before_changestatetx_event – Fires before transaction changes state
hkenginexepkg.compute_oldest_active_tx_hint – Indicates that the oldest active transaction hint has been recalculated
hkenginexepkg.dependency_acquiredtx_event – Raised after transaction takes a dependency on another transaction
hkenginexepkg.endts_acquiredtx_event – Fires after transaction acquires an end timestamp
hkenginexepkg.gc_cycle_completed – Indicates that a GC notification has been enqueued.
hkenginexepkg.gc_notification_enqueued – Indicates that a GC notification has been enqueued.
hkenginexepkg.redo_single_hk_record – Redo on a HK log record
hkenginexepkg.update_oldest_active_tx_stats – Indicates that an attempt is being made to recalculate the GC base generation.
hkenginexepkg.waiting_for_dependenciestx_event – Raised when we have explicitly waited for dependencies to clear
hkenginexepkg.xtp_checkpoint_file_flush – Indicates the point at which a given file has been flushed to disk.
hkenginexepkg.xtp_checkpoint_file_flush_complete – Indicates the point at which all in-flight buffers have been flushed to disk.
hkenginexepkg.xtp_checkpoint_file_read – Indicates reading of a file in XTP checkpoint recovery.
hkenginexepkg.xtp_checkpoint_write_io – Indicates that the checkpointing subsystem has issued or completed a write IO.
hkenginexepkg.xtp_create_log_record – Fires when the XTP engine creates a log record.
hkenginexepkg.xtp_root_file_write – Indicates that the write of the checkpoint root file is complete.
qds.query_store_async_shutdown_failed – Fired when query store encounters an error during async query store shutdown
qds.query_store_background_task_creation_failed – Fired if the background task for QDS data persistence could not be created
qds.query_store_background_task_initialization_failed – Fired if the background task for QDS data persistence could not be initialized
qds.query_store_background_task_persist_failed – Fired if the background task for QDS data persistence is not completed successfully
qds.query_store_begin_persist_runtime_stat – Fired immediately before current runtime statistics for a query plan is persisted to the database.
qds.query_store_bloom_filter_false_positive – Fired if the Bloom filter for QDS statement texts gives a false positive result
qds.query_store_check_consistency_init_failed – Fired when check consistency task fails during initialization
qds.query_store_database_initialization_failed – Fired if initialization of the Query Store for database has failed. The Query Store will be disabled for this database
qds.query_store_disk_size_check_failed – Fired when a check against Query Store on-disk size limit fails
qds.query_store_disk_size_info – Fired when a check against QDS on-disk size is performed
qds.query_store_execution_runtime_info – Fired when runtime information is sent to the query store.
qds.query_store_execution_runtime_info_discarded – Fired when runtime information sent to the query store is discarded.
qds.query_store_execution_runtime_info_evicted – Fired when runtime information sent to the query store is evicted.
qds.query_store_flush_failed – Fired when query store failed to flush dirty data
qds.query_store_loaded – Fired when query store is loaded
qds.query_store_notify_dirty_shutdown_on_partition_startup – Fired when previous instance of query store for the partition is shutdown by force in order to allow creation of a new instance
qds.query_store_notify_force_failure_failed – Fired when query store failed to notify force failure
qds.query_store_persist_task_init_failed – Fired when persist task fails during initialization
qds.query_store_plan_forcing_failed – Occurs when forcing of plan from qds fail
qds.query_store_plan_persistence_failure – Fired if there’s a failure to persist plan
qds.query_store_plan_removal – Fired when plan is removed
qds.query_store_query_persistence_failure – Fired if there’s a failure to persist query
qds.query_store_read_write_failed – Fired if the read/write to Query Store internal tables failed
qds.query_store_statement_not_found – Fired in case when statement couldn’t be found due to race condition or ambiguous user request.
qds.query_store_unloaded – Fired when query store is unloaded from memory
sqlos.nonpreemptive_long_syncio – record long sync io operation in nonpreemptive worker
sqlos.stuck_dispatcher_callback_executed – Stuck dispatcher callback executed
sqlos.wait_completed – Occurs when there is a wait completed on a SQLOS controlled resource. Use this event to track wait completion.
sqlserver.after_natively_compiled_proc_entry_removal_on_drop – Fired after the procedure cache entry is flushed when dropping a natively compiled procedure.
sqlserver.before_natively_compiled_proc_entry_removal_on_drop – Fired before the procedure cache entry is flushed when dropping a natively compiled procedure.
sqlserver.before_redo_lsn_update – Occurs just prior to the update of the EOL LSN
sqlserver.buffer_pool_eviction_thresholds_recalculated – Lazywriter and/or worker clock has wrapped the BUF array and thresholds are re-calculated.
sqlserver.buffer_pool_extension_pages_evicted – Page is evicted from the buffer pool extension cache.
sqlserver.buffer_pool_extension_pages_read – Page is read from the buffer pool extension cache.
sqlserver.buffer_pool_extension_pages_written – Page or contiguous set of pages evicted into the buffer pool extension cache.
sqlserver.check_phase_tracing – Occurs when DBCC CHECK enters a new phase of the checking. Use this event to trace the phases of DBCC CHECK process.
sqlserver.check_thread_message_statistics – Occurs when a phase of DBCC CHECK is finished. Use this event to collect the number of messages a DBCC CHECK thread has sent or received.
sqlserver.check_thread_page_io_statistics – Occurs when a phase of DBCC CHECK is finished. Use this event to collect the number of logical, physical, and read-ahead IOs a DBCC CHECK thread has performed.
sqlserver.check_thread_page_latch_statistics – Occurs when a phase of DBCC CHECK is finished. Use This event to collect the number and time of page latch and IO latch waits.
sqlserver.clustered_columnstore_index_rebuild – Occurs when clustered index on the table was rebuilt. This event is raised three times for ALTER index rebuild operation on CCSI. The event is raised when the operation takes lock on index rebuild resource, when lock is taken on the table and when S lock on the table is upgraded to SCH_M lock to switch indexes in metadata.
sqlserver.column_store_index_build_low_memory – Occurs when Storage Engine detects low memory condition and the rowgroup size is reduced.
sqlserver.column_store_index_build_process_segment – Occurs when a segment is processed
sqlserver.column_store_index_build_throttle – Shows the statistics of columnstore index build parallelism throttling
sqlserver.columnstore_tuple_mover_begin_compress – Occurs when column store tuple mover starts compressing a deltastore.
sqlserver.columnstore_tuple_mover_end_compress – Occurs when column store tuple mover is done compressing a deltastore.
sqlserver.database_xml_deadlock_report – Produces a deadlock report for a victim, with information scoped to the victim’s database.
sqlserver.db_lock_acquired_from_cache – Occurs when a DB lock is acquired from the XactWorkspace DB lock cache.
sqlserver.db_lock_released_from_cache – Occurs when a DB lock is released from the XactWorkspace DB lock cache.
sqlserver.ddl_with_wait_at_low_priority – A DDL statement was executed using the WAIT_AT_LOW_PRIORITY options
sqlserver.diag_quantum_end – Occurs when the diag is notified of quantum end.
sqlserver.dyn_throttle_checkpoint – Occurs when checkpointing has been dynamically throttled with a new Checkpoint rate.
sqlserver.feature_extension – Occurs when received and parses data for a feature in feature extension.
sqlserver.file_read_enqueued – File read enqueued
sqlserver.file_read_throttled – File read throttled
sqlserver.file_write_enqueued – File write enqueued
sqlserver.file_write_throttled – File write throttled
sqlserver.hadr_tds_synchronizer_payload_skip – Hadron Tds Listener Synchronizer skipped a listener payload because there were no changes since the previous payload.
sqlserver.lock_request_priority_state – The priority state of a lock request
sqlserver.log_cache_write_block – Writing a log block to the log cache
sqlserver.logreader_start_scan – Outputs the replnextlsn value from the database table that the first replcmds scan uses as a starting point.
sqlserver.metadata_ddl_add_column – Occurs when an ALTER TABLE ADD column operation is updating base index.
sqlserver.metadata_ddl_alter_column – Occurs when an ALTER TABLE ALTER column operation is updating base index.
sqlserver.metadata_ddl_drop_column – Occurs when an ALTER TABLE DROP column operation is updating base index.
sqlserver.mixed_extent_activation – Track mixed extent activation and deactivation operations.
sqlserver.mixed_extent_allocation – Track mixed extent allocation operations
sqlserver.mixed_extent_deallocation – Track mixed extent deallocation operations.
sqlserver.mixed_page_allocation – Track mixed page allocation operations
sqlserver.mixed_page_deallocation – Track mixed page allocation operations
sqlserver.mixed_page_scan_file – Track the activity of SGAM scans for mixed page allocation
sqlserver.mixed_page_scan_page – Track the activity of SGAM scans for mixed page allocation
sqlserver.mixed_page_skipextent – Track the activity of SGAM scans for mixed page allocation
sqlserver.natively_compiled_proc_execution_started – Fired before a natively compiled procedure is executed.
sqlserver.optimizer_timeout – Occurs when the optimizer times out either due to spending too much time or hitting a memory limit. Use this event to look at all the queries that are impacted by the optimizer timeout in a particular workload. This can be very useful when tuning a particular workload.
sqlserver.plan_affecting_convert – Occurs when a type convert issue affects the plan. The expression value indicates the conversion that can cause inaccurate Cardinality Estimation or prevent from using a seek based plan in the query plan choice. If performance is affected, rewriting the query could help.
sqlserver.process_killed_by_abort_blockers – A process is killed by an ABORT = BLOCKERS DDL statement
sqlserver.query_execution_batch_hash_aggregation_finished – Occurs at the end of batch hash aggregation.
sqlserver.query_execution_batch_hash_children_reversed – Occurs each time when hash join reverses build and probe side while processing data spilled to disk.
sqlserver.query_execution_batch_hash_join_spilled – Occurs each time when hash join spills some data to disk in batch processing.
sqlserver.query_optimizer_estimate_cardinality – Occurs when the query optimizer estimates cardinality on a relational expression.
sqlserver.query_optimizer_force_both_cardinality_estimation_behaviors – Both traceflags 2312 and 9481 were enabled, attempting to force both old and new cardinality estimation behaviors at the same time. The traceflags were ignored. Disable one or both of the traceflags.
sqlserver.query_store_failed_to_capture_query – Fired if the Query Store failed to capture query. The Query Store will not track statistics for this query
sqlserver.query_store_failed_to_load_forced_plan – Fired if the query failed to load forced plan from QDS. Forcing policy will not be applied
sqlserver.query_store_persist_on_shutdown_failed – Occurs when SQL Server fails to store dirty entries in Query Store on database shutdown.
sqlserver.selective_xml_index_no_compatible_sql_type – Occurs when a value() method is used in a query, specifying a sql type that differs from the type specified during selective XML index creation for that path.
sqlserver.selective_xml_index_no_compatible_xsd_types – Occurs when the user specifies a different resulting XSD type in a query, than the one that was used to promote the path during selective XML index creation.
sqlserver.selective_xml_index_path_not_indexed – Occurs when the path given in the user query is not promoted in the selective XML index, and thus the index is not used.
sqlserver.selective_xml_index_path_not_supported – Occurs when selective XML index is not used for a XML query due to the user using an unsupported path type.
sqlserver.session_recoverable_state_change – Occurs when the server determines a state change in term of session recovery on a connectionresilency-enabled connection.
sqlserver.spatial_guess – Occurs when the optimizer cannot get enough join selectivity information. Use this event to look at the queries where spatial index could be used and the optimizer guesses the spatial selectivity.
sqlserver.unmatched_filtered_indexes – Occurs when the optimizer cannot use a filtered index due to parameterization. For each such index an event is fired. The unmatched_database_name, unmatched_schema_name, unmatched_table_name, unmatched_index_name fields give details of the filtered index that could not be used.
sqlserver.xfcb_blob_properties_obtained – Windows Azure Storage blob property is obtained from response header.
sqlserver.xfcb_failed_request – Failed to complete a request to Windows Azure Storage.
sqlserver.xfcb_header_obtained – Response header is obtained from request to Windows Azure Storage.
sqlserver.xfcb_read_complete – Read complete from Windows Azure Storage response.
sqlserver.xfcb_request_opened – A request is opened to Windows Azure Storage.
sqlserver.xfcb_send_complete – Request send to Windows Azure Storage is complete.
sqlserver.xfcb_write_complete – Request send to Windows Azure Storage is complete.
sqlserver.xtp_create_procedure – Occurs at start of XTP procedure creation.
sqlserver.xtp_create_table – Occurs at start of XTP table creation.
sqlserver.xtp_deploy_done – Occurs at completion of XTP object deployment.
sqlserver.xtp_matgen – Occurs at start of MAT generation.
sqlserver.xtp_offline_checkpoint_scan_start – Fired by XTP offline checkpoint when the checkpoint thread begins.
sqlserver.xtp_offline_checkpoint_scan_stop – Fired by XTP offline checkpoint when the checkpoint thread stops.
sqlserver.xtp_recover_done – Occurs at completion of log recovery of XTP table.
sqlserver.xtp_recover_table – Occurs at start of log recovery of XTP table.
sqlserver.xtp_storage_table_create – Occurs at just before the XTP storage table is created.
ucs.ucs_connection_rejected_by_proxy_whitelist – After a connection attempt to the UCS proxy endpoint is rejected by whitelist check
ucs.ucs_proxy_connect_next_hop – UCS proxy next hop connection
ucs.ucs_proxy_receive_proxy_connect_message – UCS proxy receive proxy connect message
ucs.ucs_proxy_route_add – UCS proxy route added
ucs.ucs_proxy_route_disable – UCS proxy route disabled
ucs.ucs_proxy_route_refresh – UCS proxy route refreshed
ucs.ucs_proxy_send_proxy_connect_message – UCS proxy send proxy connect message
XeXtpCompilePkg.cgen – Occurs at start of C code generation.
XeXtpCompilePkg.invoke_cl – Occurs prior to the invocation of the C compiler.
XeXtpCompilePkg.mat_export – Occurs at start of MAT export.
XeXtpCompilePkg.pit_export – Occurs at start of PIT export.
XeXtpCompilePkg.pitgen_procs – Occurs at start of PIT generation for procedures.
XeXtpCompilePkg.pitgen_tables – Occurs at start of PIT generation for tables.
XeXtpRuntimePkg.bind_md – Occurs prior to binding metadata for a memory optimized table.
XeXtpRuntimePkg.bind_tables – Occurs prior to binding tables for a natively compiled procedure.
XeXtpRuntimePkg.create_table – Occurs prior to creating memory optimized table.
XeXtpRuntimePkg.deserialize_md – Occurs prior to deserializing metadata.
XeXtpRuntimePkg.load_dll – Occurs prior to loading the generated DLL.
XeXtpRuntimePkg.recover_done – Occurs at completion of checkpoint recovery of a memory optimized table.
XeXtpRuntimePkg.recover_table – Occurs at start of checkpoint recovery of a memory optimized table.
XeXtpRuntimePkg.serialize_md – Occurs prior to serializing metadata.

When investigating new features, I’m prone to use what I call the Rumpelstiltskin method, that is, I’m happier if I know where these features “live” in the file system (not exactly Rumpelstiltskin, where the reward came by guessing his name, but hopefully I’m not stretching the analogy too far).

With this in mind, I started off to find where *exactly* Hekaton stores it’s compiled table code, it’s compiled procedure code, and it’s”backing store” data for persistent memory-optimized tables. For the experiment I used the Hekaton-ized Northwind sample database code from the SQL Server 2014 Books Online. I executed the code in stages, and looked at the appropriate file system locations, once I’d found them.

Creating or altering a database to contain a filegroup for memory-optimized data with a corresponding file produces the well-known “filestream directory structure”, as used by filestream (2008+) and filetable (2012+). With a file for a memory-optimized data filegroup, it contains the nested GUID-named subdirectories as though you had one table containing filestreams and one filestream-based column. However, the lowest level of directory isn’t empty; it contains 1 GB-worth of files. 8 of those files are 128 mb in size, and 14 other files are empty. And this is *before* adding any data. Or even any tables.

BTW, the filegroup for memory-optimized files can have more than on container (ADD FILE.. in DDL, directory as far of the file system goes). Adding a second “file” produces the corresponding “filestream” directory, but no GUI-named subdirectories (yet). So that’s where the data will live, because there has to be a persistent backing-store to populate the in-memory tables at next SQL Server startup time.

Now, let’s add a memory-optimized table and see what happens. I’d been browsing around the file system in the “SQL Server instance directory” (C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL in my case) and noticed an subdirectory of binn named “xtp”. That’s where the C compiler and linker live (in the xtp\bin directory) and where the support libraries live (in the xtp\lib directory). Great.

I also noticed an empty subdirectory of the “data” directory named “xtp”. Creating a single memory-optimized table (Employees, in this script) produced an subdirectory under …\data\xtp with the name ‘5’. Where 5 is the DBID of the database in question (hkNorthwind in my case). There are six files for each memory-optimized table, with names start begin with “xtp_t_5_277576027″. With memory-optimized tables:

xtp_t – is for table code

5 – refers to the DBID

277576027 – refers to the object_id of the table

The six files I saw had suffixes .c, .obj, .dll, .pdb (symbols), .out (intermediate compiler file), and .mat.xml. Not sure what .mat.xml is just yet (it contains an XML representation of the table metadata), but the other files are recognizable. Adding the other tables, I can see one set of these files for each memory-optimized table. Similarly, adding an native-compiled stored procedure produces the same six file types that begin with “xtp_p” (for procedure).

Interestingly, if I stop the instance and start it up again, these files only appear once I’ve *executed* the procedure. The files for tables appear after startup. So it sounds like these are lazy-instantiated and compiled when they are first used. This could be the case for the “table code” as well, but these would have to be instantiated at SQL startup, so that the tables could be (re)created and (re)populated in memory.

Turns out that I didn’t have to “browse-around” for the modules’ location. It shows up in sys.dm_os_loaded_modules, under the “name” column when the modules are loaded.

Back to data. All I do for now is what the script contains, populate the tables using INSERT statements. And this does nothing to the tables in the memory-optimized filegroup files. I’m assuming it writes to the 128 mb pre-allocated files. However, executing CHECKPOINT afterwards writes to one of the previously empty pre-allocated files, changing the size to 64mb. Bringing SQL Server (and the OS) down and up a few times populates more of these files to 64mb. And, after a while, I check back and I have 24 files total in the “filestream subdirectory”. All this appears to be consistent with the Books Online description under the “Storage/Defining durability for memory-optimized tables“. I’m not seeing any populated “Delta files” yet, because I haven’t UPDATEd or DELETEd any rows. That should work as advertised too.

So. Nothing too earth-shattering, but at least now I know “where everything is”. And Hekaton tables and procedures are less of an unknown “Rumpelstiltskin” today than they were yesterday.