T-SQL

It’s no secret that I’m not overly fond of most user-defined functions. This isn’t just a pet hate; I have some good reasons for disliking them. All too often they’re performance bottlenecks, but that can be said about many things in SQL. The bigger problem is that they’re hidden performance bottlenecks that often go overlooked and ignored for too long.

I’m going to start with this fairly simple scalar function, created in the AdventureWorks database.
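The function itself isn’t shown in this excerpt, so here’s a plausible stand-in of the same shape, created against AdventureWorks (the name and logic are my assumptions):

CREATE FUNCTION dbo.ProductListPrice (@ProductID INT)
RETURNS MONEY
AS
BEGIN
    RETURN (SELECT ListPrice FROM Production.Product WHERE ProductID = @ProductID);
END
GO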

No problems so far. They both return 504 rows (in my copy of AW, which has been slightly padded out with more data). Now, let’s look at the execution characteristics by running them again with Statistics IO and Statistics Time on.
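The two queries being compared aren’t shown in this excerpt, but the measurement setup is simply this (the SELECT is an assumed example using the stand-in function above):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT ProductID, dbo.ProductListPrice(ProductID) AS ListPrice
FROM Production.Product;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;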

It’s fairly common to need to know the number of rows in a table, the number of rows that match a certain condition, or whether any rows match a condition at all. There are also a number of ways of doing so, some better than others. The problem is that counting is not a cheap operation, especially on big tables. It’s not as bad as a sort, but it can still be expensive.
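To make the last of those cases concrete before we look at the options: when all that matters is existence, EXISTS lets SQL stop at the first matching row rather than counting them all. A quick sketch against AdventureWorks:

IF EXISTS (SELECT 1 FROM Production.TransactionHistory WHERE ProductID = 978)
    PRINT 'Rows found';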

So, given that, let’s take a look at some of the ways.

Querying the metadata

If all that’s needed is the number of rows in the table, and it’s not 100% important that the value be completely accurate all the time, the system metadata can be queried. In SQL 2000 and below, that info was in sysindexes. In 2005 and higher it’s been moved into sys.partitions.

SELECT OBJECT_NAME(object_id) AS TableName, SUM(rows) AS TotalRows
FROM sys.partitions
WHERE index_id IN (0, 1)
AND object_id = OBJECT_ID('TableName')
GROUP BY object_id

The advantage of this approach is that it’s fast. Since it’s not actually counting anything and, in fact, isn’t even accessing the table being counted, it’s the fastest way to get the count of rows in the table.

The disadvantage is that it can only get the total number of rows in the table and cannot consider any criteria at all. It also may not be 100% accurate, depending on how and when the table’s rowcount metadata is updated by the SQL engine.

When I wrote about the catch-all queries, someone asked why the dynamic SQL that I offered wasn’t vulnerable to SQL injection. I thought I’d go into the whys and the wherefores of that in a little bit more detail.

The most important thing to realise with SQL injection (and with all other forms of command injection) is that it requires a user-supplied string to be incorporated as part of a command that’s going to be executed. Not as part of a parameter value, but as part of the command itself.
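A minimal sketch of that distinction, with a hypothetical dbo.Users table:

DECLARE @UserInput NVARCHAR(128);
DECLARE @sql NVARCHAR(MAX);
SET @UserInput = N'O''Brien'; -- imagine this arrived from the application

-- Vulnerable: the user's string is concatenated into the command text itself.
SET @sql = N'SELECT * FROM dbo.Users WHERE UserName = ''' + @UserInput + N'''';
EXEC (@sql);

-- Safe: the command text is fixed; the user's string is only ever a parameter value.
SET @sql = N'SELECT * FROM dbo.Users WHERE UserName = @UserName';
EXEC sp_executesql @sql, N'@UserName NVARCHAR(128)', @UserName = @UserInput;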

The query form that I refer to as ‘Catch-all’ typically results from search screens in the application where the user may enter any one (or more) of a number of optional parameters. One of the more common ways for such a query to be written in SQL is with multiple predicates in the where clause of the form (WHERE SomeColumn = @SomeVariable OR @SomeVariable IS NULL)

Now, this does work. The problem is that it works fairly inefficiently and, on large tables, can result in really poor query performance. I’m going to take a look at why that is the case and what the alternatives are.

Erland Sommarskog has written on this as well, and in a lot more detail than I’m going to. His article on dynamic search conditions is well worth reading, as are the rest of his articles.

A typical example of a ‘catch-all’ query would be this one, based on a table in the AdventureWorks database.
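The procedure itself isn’t included in this excerpt; this reconstruction is consistent with the EXEC calls that follow, though the exact column list is my assumption:

CREATE PROCEDURE SearchHistory
    @Product INT = NULL,
    @TransactionType NCHAR(1) = NULL,
    @Qty INT = NULL
AS
SELECT ProductID, TransactionType, Quantity, TransactionDate
FROM Production.TransactionHistory
WHERE (ProductID = @Product OR @Product IS NULL)
    AND (TransactionType = @TransactionType OR @TransactionType IS NULL)
    AND (Quantity = @Qty OR @Qty IS NULL);
GO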

Now, let’s say that I run that query and pass values for the ProductID and the Transaction type. Let’s further say that there’s a nonclustered index (called idx_TranHistory_TranTypeProductID) on those two columns.

EXEC SearchHistory @Product = 978, @TransactionType = 'W'

Now this returns 52 rows out of the 980,000 in the table, so we’d expect SQL to use an index seek operation on that index, followed by a bookmark lookup.

Nope. It’s using that index all right, but it’s doing a scan, not a seek. OK, not great, but not bad. Let me try a different set of parameters.

EXEC SearchHistory @Qty = 100

The plan’s exactly the same. No surprise; it was cached the first time and then reused. There’s a problem here though: the index that’s used is completely inappropriate, and there’s a bookmark lookup that ran almost a million times. No wonder this execution took 3 seconds and 2,949,715 IOs to return 29 rows.

The problem with these types of queries is that there is no stable plan. The optimal plan differs completely depending on what parameters are passed. The optimiser can tell that, so it plays safe and creates plans that will always work. That’s (one of the reasons) why in the first example it was an index scan, not an index seek.

The downside of the safe plan is that it’s highly unlikely to be a good plan and, even if it is, it won’t be good for all possible combinations of parameters.

So, how to handle this type of query? Well, there are typically two ways.

Recompile

This is only an option on SQL 2008. On 2008, if the query is specified with the OPTION (RECOMPILE) hint, the optimiser knows the plan will never be reused, so it doesn’t have to play safe. In fact, if I add that hint to the query in the first example, I get the expected index seek.
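In terms of the reconstructed procedure above, that just means appending the hint to the SELECT:

SELECT ProductID, TransactionType, Quantity, TransactionDate
FROM Production.TransactionHistory
WHERE (ProductID = @Product OR @Product IS NULL)
    AND (TransactionType = @TransactionType OR @TransactionType IS NULL)
    AND (Quantity = @Qty OR @Qty IS NULL)
OPTION (RECOMPILE);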

Dynamic SQL

The other option is to build up the query string dynamically, based on the parameters passed, and then use sp_executesql to run it. There are the usual downsides to dynamic SQL, but the performance improvement may well be worth it.
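Sketched against the same reconstructed procedure, the idea looks like this. Note that only fixed strings are concatenated; the user’s values travel as parameters, which is why this form isn’t open to SQL injection:

CREATE PROCEDURE SearchHistory_Dynamic
    @Product INT = NULL,
    @TransactionType NCHAR(1) = NULL,
    @Qty INT = NULL
AS
DECLARE @sql NVARCHAR(MAX);
SET @sql = N'SELECT ProductID, TransactionType, Quantity, TransactionDate
FROM Production.TransactionHistory
WHERE 1 = 1';

IF @Product IS NOT NULL
    SET @sql = @sql + N' AND ProductID = @Product';
IF @TransactionType IS NOT NULL
    SET @sql = @sql + N' AND TransactionType = @TransactionType';
IF @Qty IS NOT NULL
    SET @sql = @sql + N' AND Quantity = @Qty';

EXEC sp_executesql @sql,
    N'@Product INT, @TransactionType NCHAR(1), @Qty INT',
    @Product = @Product, @TransactionType = @TransactionType, @Qty = @Qty;
GO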

We get a set of data (perhaps in a temp table, perhaps in an XML document) that needs to be inserted into those tables. The source data will have multiple parent rows, each with multiple child rows. Those need to be inserted into the appropriate tables and the foreign keys have to be assigned correctly. In addition, the ID of the child rows, along with the value and the current date, must be written into an audit table, along with the login name of the current user.
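All names below are hypothetical, and this sketches only the audit half of the requirement: the OUTPUT clause can write the generated child IDs, the value, the current date and the current login into the audit table in the same statement as the insert itself.

INSERT INTO dbo.ChildTable (ParentID, SomeValue)
OUTPUT INSERTED.ChildID, INSERTED.SomeValue, GETDATE(), SUSER_SNAME()
    INTO dbo.ChildAudit (ChildID, SomeValue, AuditDate, LoginName)
SELECT p.ParentID, s.SomeValue
FROM #SourceRows s
INNER JOIN dbo.ParentTable p ON p.SourceKey = s.ParentSourceKey;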

I have the pleasure at the moment of doing a code review on some vendor code. No names will be mentioned. I’ve seen better; I’ve seen a lot better. I’m seeing very common mistakes in the code, so, in the interests of my sanity, I’m going to go over a couple of common T-SQL mistakes in the hope that the next batch of code I get to review doesn’t have these mistakes in…

1. Error Handling

Proper error handling is hard. SQL 2005 made it a lot easier with TRY…CATCH blocks, but it still means that everything that can throw an error must be wrapped inside a TRY block, with an appropriate CATCH block to handle any errors.
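A minimal skeleton of that shape (the table and column names are placeholders):

BEGIN TRY
    BEGIN TRANSACTION;

    UPDATE dbo.SomeTable SET SomeColumn = 1 WHERE SomeID = 42;

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;

    -- Re-raise the original error details (pre-2012, so RAISERROR rather than THROW)
    DECLARE @msg NVARCHAR(2048);
    SET @msg = ERROR_MESSAGE();
    RAISERROR(@msg, 16, 1);
END CATCH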

It was a lot harder on SQL 2000, when all we had to work with was @@Error. What I think was not well understood was which statements set and reset @@Error, and how long a non-zero value persists, leading to code constructs like the one sketched below.
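Something like this (the table is a placeholder). @@Error has to be captured immediately after each statement, because almost any subsequent statement, including the very IF that tests it, resets it:

DECLARE @err INT;

BEGIN TRANSACTION;

UPDATE dbo.SomeTable SET SomeColumn = 1 WHERE SomeID = 42;
SET @err = @@ERROR; -- capture immediately; later statements reset @@ERROR

IF @err <> 0
BEGIN
    ROLLBACK TRANSACTION;
    RETURN; -- assumes this runs inside a stored procedure
END

COMMIT TRANSACTION;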

Null seems to confuse a lot of people. The main issue with null is that it is not a value. It’s the absence of a value, a placeholder for unknown data. Hence, when it is used in a comparison, it returns results that look surprising. Take this as a quick example:
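Comparing null with anything, even with null itself, yields unknown rather than true or false:

SELECT CASE WHEN NULL = NULL THEN 'Equal'
            WHEN NULL <> NULL THEN 'Not equal'
            ELSE 'Neither' END;
-- Returns 'Neither': NULL = NULL is neither true nor false, but unknown.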

I’ve encountered a fair bit of confusion on various forums as to the differences between temporary tables and table variables. As a quick article (I’m knee-deep in some AI stuff at the moment), I thought I’d go over some points on temp tables and table variables.

Temporary Tables

Created using the CREATE TABLE syntax, preceding the table name with a ‘#’ for a local temp table and ‘##’ for a global temp table

Allocated storage space within the TempDB database and entered into the TempDB system tables

The table’s actual name is the name it was created with, plus a large number of underscores and a hash value, to prevent object name collisions if two connections create a temp table with the same name

Can have a primary key, defaults, constraints and indexes (however the names of these are not hashed, possibly leading to duplicate object errors for constraints and defaults)

May not have triggers.

Foreign keys are permitted, but are not enforced

Have column statistics kept on them. The algorithm for determining when to update is different to permanent tables

Exist until they are dropped, or the connection closes.

Are visible in any child procedures called from the one where the table was created. Are not visible to parent procedures

Are not persisted to disk unless there is memory pressure, or the table is too large to fit in the data cache
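A quick illustration of several of these points (the names are mine):

CREATE TABLE #OrderStaging (
    OrderID INT NOT NULL PRIMARY KEY,              -- primary keys are allowed
    OrderDate DATETIME NOT NULL DEFAULT GETDATE()  -- as are defaults
);

INSERT INTO #OrderStaging (OrderID) VALUES (1);

SELECT OrderID, OrderDate FROM #OrderStaging;

DROP TABLE #OrderStaging; -- otherwise it lives until the connection closes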

The datetime data type and the date and time functions within SQL are things that I see coming up time and time again in newsgroups and forums. Questions on how to get rid of the time portion, how to get the first day of the week, the last day of the month and so on. With the new date and time data types coming in SQL 2008, things will get easier, but until then we have to do things the hard way.

In systems I’ve worked on, I’ve seen several implementations of functions to find the first and last day of a week, a month or a quarter. Some have worked well, some have worked, and others, well, haven’t.
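For reference, one common idiom that the well-behaved implementations tend to be built on is the DATEADD/DATEDIFF pair:

SELECT
    DATEADD(dd, DATEDIFF(dd, 0, GETDATE()), 0) AS TodayWithoutTime,
    DATEADD(mm, DATEDIFF(mm, 0, GETDATE()), 0) AS FirstDayOfMonth,
    DATEADD(dd, -1, DATEADD(mm, DATEDIFF(mm, 0, GETDATE()) + 1, 0)) AS LastDayOfMonth;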

In Part 1 I discussed why some queries that look like they should work fail with a conversion error. In this part, I’m going to go over some solutions to the problem.

Probably the best solution, at least in the long term, is, as I mentioned previously, to correct the data type of the column and to remove the data that doesn’t fit. This is probably, however, the hardest of the solutions to implement, due to the potential impact.