T-SQL

Transactions are an area that I often find are used badly, or not used at all. Transactions without any error handling, nested transactions, named transactions named for no good reason or just plain lack of transactions where they should be used.

This is going to be the first of three posts looking at some aspects of transactions and transaction management. And deserving of the first post is that most misunderstood of things, the nested transaction.

Before getting into the details, I need to make one thing clear. Nested transactions are a lie. They do not exist in SQL Server.

A piece of code may appear to have multiple nested transactions in it. It may have multiple BEGIN TRANSACTION statement in it. It only has one transaction though.

An easy way to show this would be to look at the transaction log. Since transactions are there to make multiple statements atomic (committed or rolled back as a unit), the transactions must be reflected in the transaction log to allow SQL to know where to roll things back to or what statements need to be undone/redone as a unit.

I’m inserting a row before the transaction because allocating the first page to a table generates a lot of log entries. Those log records are not relevant to the subject at hand and will just confuse things.

One table, three transactions, three inserts, three commits. Let’s see what went into the transaction log.

What went into the log was a single begin transaction, a single commit and three inserts. The other two begin transaction statements, and the other two commit transaction statements were completely ignored for the purposes of the transaction log.

The other two begin transaction statements did do something. Not very much, and nothing that reflected in the transaction log, but they did do one thing, and if I make the code a little more complicated it’ll become clear what that one thing is.

Each BEGIN TRANSACTION incremented the value returned by @@TranCount, each COMMIT TRANSACTION decremented the value returned by @@TranCount. If I looked at the transaction log after each statement, the BEGIN_XACT log entry corresponds to the BEGIN TRANSACTION which incremented @@Trancount from 0 to 1, and the COMMIT_XACT log entry corresponds to the COMMIT TRANSACTION which decremented @@Trancount from 1 to 0.

Only the operations which changed @@Trancount to or from 0 are reflected in the transaction log, they’re the only ones which have any effect on what becomes part of the database’s durable state.

Does ROLLBACK then follow the same rules? Does it get ignored unless it’s the last statement of multiple COMMIT/ROLLBACK statements?

No. The ROLLBACK TRANSACTION set the value of @@Trancount directly to 0 and undid all of the inserts back to the outermost BEGIN TRANSACTION. The COMMIT TRANSACTION statements which ran after it all threw errors

This is the danger of nesting transactions, that one ROLLBACK, anywhere after the first BEGIN TRANSACTION will undo everything right the way back to that first BEGIN TRANSACTION. These are not autonomous transactions being run one inside another, there’s a single transaction and rollback anywhere will roll back everything.

This can be a debugging nightmare when the application creates a transaction and calls a procedure which begins a transaction which calls other procedures which begin and commit transactions. If something goes wrong in one procedure and that procedure rolls the transaction back, it can end up undoing a lot more work than the developer of that procedure may have expected.

In general I would recommend only starting and committing transactions at a single level, and making transaction handling part of the application’s design, not something that falls out of the implementation by accident. If there’s an intentional design around how transactions will be handled (and choosing not to have any is not a good idea), then it’s a lot less likely that there will be transaction-related bugs that need to be found and fixed.

And what happens when working with an existing app where this wasn’t done? I’ll look at that in the next post on transactions, on conditionally starting transactions and on the use of savepoints.

A well overdue feature introduced in Denali CTP 3 is that of the Try_Parse and Try_Convert functions. These are great for dealing with the something that’s frustrated SQL developers for years – data type conversions.

Let’s imagine a rather nasty case, a file with some values in it that once imported into SQL (as character data) looks something like this:

Ewww… that’s a mess. Let’s see if we can identify which of the values can be converted into a numeric data type. The function prior to Denali for that was ISNUMERIC.

SELECT ANumber, ISNUMERIC(ANumber)
FROM BadNumerics

Great, so other than the obvious one, they’re all numeric data. Time to get converting.

SELECT CAST(ANumber as Numeric(18,6))
FROM BadNumerics
WHERE ISNUMERIC(ANumber) = 1;

Each predicate (or set of predicates) combined with an OR must have a separate index

All of those indexes must be covering, or the row count of the concatenated result set low enough to make key lookups an option, as the optimiser does not apparent to consider the possibility of doing key lookups for a subset of the predicates before concatenating the result sets.

So what can be done if it’s not possible to meet those requirements?

The standard trick is to convert the query with ORs into multiple queries combined with UNION. The idea is that since OR predicates are evaluated separately and the result sets concatenated, we can do that manually by writing the queries separately and concatenating them using UNION or UNION ALL. (UNION ALL can only be safely used if the predicates are known to be mutually exclusive)

In this case, the OR can be replaced with a UNION and the results are the same. The Union form is slightly less efficient according to the execution plan’s costings (60% compared to the OR at 40%), and the two queries have the same general form, with two index seeks and some form of concatenation and remove duplicates.

So in that case it worked fine, although the original form was a little more efficient(more…)

If a transaction rolled back at the first failure, that final select would return no rows. But it doesn’t, it returns 3 rows. The failure of the individual statements was ignored and the transaction completed and committed. If that had been an important business process, not a made-up example, that could have some nasty consequences for transactional consistency of data.

What’s really going on here? Aren’t transactions supposed to be atomic? Isn’t SQL supposed to roll them back if they don’t complete successfully?

I thought I’d tackle a trio of table variable myths and partial truths.

Table Variables are memory-only

This one is pervasive and irritating. It typically goes like this:

You should use table variables rather than temp tables because table variables are memory only.

This myth can be broken down into two parts:

That table variables are not part of TempDB

That table variables are not written to disk

The first is easy to prove and has beendonerepeatedly. I’m not doing it again. I’m going to tackle the second portion only.

See, one could argue that, even though the table variable is created in the TempDB system tables and allocated pages within the TempDB data file, it is still kept entirely and only in memory. Let’s see if that’s true…

This is not a massively large table. 100000 rows at 204 bytes per row (excluding header). A query of sys.dm_db_index_physical_stats (which does work on temp tables and table variables) reveals a total page count of 2632. That’s a grand total of 20.6 MB. 20 Megabytes. The SQL instance I’m running this on is allowed to use up to 2 GB of memory. No way on earth is this table variable going to cause any form of memory pressure (and I promise there is nothing else running)

So, run that code and, while that waitfor is running, do something that should never be done to a SQL server that you care anything about. (more…)

Over the last several months I’ve had a look at IN, Exists, Join and their opposites to see how they perform and whether there’s any truth in the advice that is often seen on forums and blogs advocating replacing one with the other.

In this roundup post, I’m going to do multiple tests on the 6 query forms, with different numbers of rows, indexes, no indexes and, for the negative forms (NOT IN, NOT EXISTS), nullable and non-nullable join columns.

In the individual tests, I used 250000 rows in the first table and around 3000 rows in the secondary table. In this roundup, I’m going to use 3 different row counts, 1000000 rows, 50000 rows and 2500 rows. That should give a reasonable idea for performance at various table sizes. (Not much point in going smaller than 2500 rows. Everything’s fast on 100 rows)

Some notes on the tests.

The version of SQL is SQL Server 2008 SP1 x64 Developer Edition.

The tests were run on a laptop. Core-2 Duo, 3 GB memory. SQL limited to 1 processor, so no parallelism possible.

Each query will be run 10 times, reads, cpu and duration measured by profiler and averaged.

Each query will be run once before the tests start to ensure that the data is in cache and the execution plans are generated and cached.

I’m looking at NOT EXISTS and LEFT OUTER JOIN, as opposed to NOT IN and LEFT OUTER JOIN, because, as shown in the previous part of this series, NOT IN behaves badly in the presence of NULLs. Specifically, if there are any NULLs in the result set, NOT IN returns 0 matches.

The LEFT OUTER JOIN, like the NOT EXISTS can handle NULLs in the second result set without automatically returning no matches. It behaves the same regardless of whether the join columns are nullable or not. Seeing as NULL does not equal anything, any rows in the second result set that have NULL for the join column are eliminated by the join and have no further effect on the query.

It is important, when using the LEFT OUTER JOIN … IS NULL, to carefully pick the column used for the IS NULL check. It should either be a non-nullable column (the primary key is a somewhat classical choice) or the join column (as nulls in that will be eliminated by the join)

Just one note before diving into that. The examples I’m using are fairly simplistic and that’s intentional. I’m trying to find what, if any, are the performance differences in a benchmark-style setup. I’ll have some comments on more complex examples in a later post.

The most important thing to note about NOT EXISTS and NOT IN is that, unlike EXISTS and IN, they are not equivalent in all cases. Specifically, when NULLs are involved they will return different results. To be totally specific, when the subquery returns even one null, NOT IN will not match any rows.

The reason for this can be found by looking at the details of what the NOT IN operation actually means.

Let’s say, for illustration purposes that there are 4 rows in the table called t, there’s a column called ID with values 1..4

WHERE SomeValue NOT IN (SELECT AVal FROM t)

is equivalent to

WHERE (
SomeValue != (SELECT AVal FROM t WHERE ID=1)
AND
SomeValue != (SELECT AVal FROM t WHERE ID=2)
AND
SomeValue != (SELECT AVal FROM t WHERE ID=3)
AND
SomeValue != (SELECT AVal FROM t WHERE ID=4)
)

Let’s further say that AVal is NULL where ID = 4. Hence that != comparison returns UNKNOWN. The logical truth table for AND states that UNKNOWN and TRUE is UNKNOWN, UNKNOWN and FALSE is FALSE. There is no value that can be AND’d with UNKNOWN to produce the result TRUE

Hence, if any row of that subquery returns NULL, the entire NOT IN operator will evaluate to either FALSE or UNKNOWN and no records will be returned

So what about EXISTS?

Exists cannot return NULL. It’s checking solely for the presence or absence of a row in the subquery and, hence, it can only return true or false. Since it cannot return NULL, there’s no possibility of a single NULL resulting in the entire expression evaluating to UNKNOWN.

Hence, when the column in the subquery that’s used for comparison with the outer table can have nulls in it, consider carefully which of NOT EXISTS or NOT IN you want to use.

Ok, but say there are no nulls in the column. How do they compare speed-wise. I’m going to do two tests, one where the columns involved in the comparison are defined as NULL and one where they are defined as NOT NULL. There will be no NULL values in the columns in either case. In both cases, the join columns will be indexed. After all, we all index our join columns, right?

Radically different execution plans, radically different performance characteristics. The NOT IN took over 5 times longer to execute and did thousands of times more reads.

Why is that complex execution plan required when there may be nulls in the column? I can’t answer that one, probably only one of the query optimiser developers can, however the results are obvious. When the columns allow nulls but has none, the NOT IN performs significantly worse than NOT EXISTS.

So, take-aways from this?

Most importantly, NOT EXISTS and NOT IN do not have the same behaviour when there are NULLs involved. Chose carefully which you want.

Columns that will never contain NULL values should be defined as NOT NULL so that SQL knows there will never be NULL values in them and so that it doesn’t have to produce complex plans to handle potential nulls.

On non-nullable columns, the behaviour and performance of NOT IN and NOT EXISTS are the same, so use whichever one works better for the specific situation.

Often in forum threads discussing query performance I’ll see people recommending replacing an INNER JOIN with an IN or EXISTS (or recommending replacing an IN or EXISTS with an INNER JOIN) for performance reasons. I’ve previously looked at how the IN and EXISTS compared, now I’m going to investigate and see how IN compares with the join.

One very important thing to note right off is that they are not equivalent in all cases.

An inner join between two tables does a complete join, it checks for matches and returns rows. This means, if there are multiple matching rows in the second table, multiple rows will be returned. Also, when two tables are joined, columns can be returned from either. As a quick example:

This returns 7 rows and returns columns from both tables. Because the values in @SomeTable are duplicated, the matching rows from BigTable are returned twice.

With an IN, what is done is a semi-join, a join that checks for matches but does not return rows. This means if there are multiple matching tables in the resultset used for the IN, it doesn’t matter. Only one row from the first table will be returned. Also, because the rows are not returned, columns from the table referenced in the IN cannot be returned. As a quick example

So, that said, how does the performance of the two differ for the cases where the results are identical (no duplicates in the second table, no columns needed from the second table)? For that, I’m going to need larger tables to play with. (more…)

The EXISTS and IN clauses at first glance look fairly similar. They both use a subquery to evaluate rows, but they do it in a slightly different way

IN does a direct match between the column specified before the IN keyword and the values returned by the subquery. When using IN there can only be a single column specified in the select clause of the subquery