Tuesday, December 1, 2015

Introduction

More than ten years ago, on November 5, 2005, Microsoft released SQL Server 2005 to manufacturing. Over that span, it has become one of the most venerable -- and solid -- releases in the product line. All good things must come to an end, though, and the same is true of SQL Server 2005. It has been in extended support since April 2011, but even that will end on April 12, 2016. See Extended support for SQL Server 2005 ends on April 12, 2016 for more information.

This article highlights some of the innovations that release brought us but, more importantly, surveys the developments since November 2005 through to SQL.Next: SQL Server 2016.

Highlights of SQL Server 2005

For nostalgia's sake, you might want to review the details at What's New in SQL Server 2005, but here are a few notable enhancements brought to us in that release:

Service Broker

CLR integration

The XML data type

Ranking functions

Common Table Expressions

Pivot and Unpivot

Database schemas

Partitioning

New architecture for Integration Services and dtsx packages

Reporting Services as a new feature

Multiple enhancements for Analysis Services

Replication enhancements including transactional, peer to peer and merge replication and support for heterogeneous environments

Since 2005 ...

SQL Server releases 2008, 2008 R2, 2012 and 2014 greatly extended the power of Microsoft's enterprise-grade RDBMS, and SQL Server 2016, due out in just a few months, promises to do the same. It's worth reviewing all the "What's New" pages, which you can find using these links:

Of course this is just a list of highlights that I happen to like! Please follow the links to get the full history of improvements.

I need to upgrade! But...

Upgrading from SQL Server 2005 to SQL Server 2014 is not an exercise for the faint of heart. To begin with, you cannot upgrade directly from RTM 2005 to 2014. You must have installed Service Pack 4 of SQL Server 2005 before you begin. Also, if you want to upgrade to version 2016, you will be facing a two-part upgrade, since 2016 no longer supports any version of 2005 as a base for upgrading.

In addition, before you plan your SQL Server migration, you may need to plan a Windows Server upgrade. Many SQL Server 2005 installations are still running on Windows Server 2003, but the minimum version of Windows Server supported for SQL Server 2014 is Windows Server 2008 R2 SP1; for SQL Server 2016, the minimum is Windows Server 2012 R2. Note that Windows Server 2003 extended support ended on July 14, 2015, so if you are still running that version and something breaks, you're on your own unless you opt for pay-as-you-go support.

For installations using mirroring or replication, special care must be taken to ensure minimum downtime and to avoid running "exposed" (that is, with only one active instance and no passive instances). A good discussion of this topic can be found in Minimize Downtime for Mirrored Databases When Upgrading Server Instances, written from a 2016 perspective, but the same principles apply to other versions.

Should I stay or should I go now?

When you begin to consider upgrading a SQL Server 2005 instance to SQL Server 2014/16 -- especially when you may need to upgrade Windows Server 2003 as well -- you may wonder if it is worth it. After all, it will cost you time and resources, not to mention possible changes in licensing costs. Considering the approaching end of extended support for the product (and the fact that all support for Windows Server 2003 has ended), it's probably a good idea to plan for the upgrade: build a realistic project plan with appropriate budgetary and resource support, get your team trained for the upgrade(s) and for the new world of SQL Server 2014/16, and set specific timelines.

On the other hand, if your SQL Server 2005 instance is not supporting business critical processes and you can afford to lose it entirely, you may want to stand pat. But then, why are you still running that system at all?

If you've followed along and implemented -- or at least thought about implementing -- this strategy, you may be wondering how to enforce it. After all, if the option still exists for some DBA to add an individual login to your instance, add individual users to a database, or grant specific permissions to specific users outside your custom database roles, you want to know if it's happening.

SQL Server Auditing to the rescue!

SQL Server Auditing

Auditing an instance of the SQL Server Database Engine or an individual database involves tracking and logging events that occur on the Database Engine. SQL Server Audit lets you create server audits, which can contain server audit specifications for server-level events and database audit specifications for database-level events. Audited events can be written to the Windows event logs or to audit files.

Setting up auditing is easy. Let's set up an audit to catch anyone violating the DA-PLP regime we've established. In SSMS, open the Security/Audits subtree. There, create a server audit for this purpose and a server audit specification hooked up to it, then enable them both. Graphically:

Now, simply enable the audit and audit specification (right-click each and select Enable) and you're ready to catch violators! Note that if you chose to go with a file, as I did, you can use the system function sys.fn_get_audit_file to view it in tabular format. There you will see anyone (with their login ID) who changed anything in these areas, and you can easily sort out legitimate changes from rule breakers.
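If you prefer to script it, the same setup can be sketched in T-SQL. The audit name, file path and the particular audited action groups below are assumptions -- adjust them to your environment:

```sql
USE [master];
GO
-- Create the server audit, writing to a file (path is an assumption)
CREATE SERVER AUDIT [DAPLP_Audit]
TO FILE (FILEPATH = N'C:\SQLAudit\');
GO
-- Create a specification capturing principal, role-membership and
-- permission changes at the server level
CREATE SERVER AUDIT SPECIFICATION [DAPLP_AuditSpec]
FOR SERVER AUDIT [DAPLP_Audit]
    ADD (SERVER_PRINCIPAL_CHANGE_GROUP),
    ADD (SERVER_ROLE_MEMBER_CHANGE_GROUP),
    ADD (DATABASE_PERMISSION_CHANGE_GROUP);
GO
-- Enable both
ALTER SERVER AUDIT [DAPLP_Audit] WITH (STATE = ON);
ALTER SERVER AUDIT SPECIFICATION [DAPLP_AuditSpec] WITH (STATE = ON);
GO
-- View the audit file in tabular format
SELECT event_time, server_principal_name, statement
FROM sys.fn_get_audit_file(N'C:\SQLAudit\*', DEFAULT, DEFAULT);
```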

Summary

With this article, we wrap up this series. We've discussed two important principles: Double Abstraction and the Principle of Least Privilege that together make our lives easier for managing security in SQL Server.

Saturday, August 8, 2015

Introduction

In the previous two posts, we discussed the concept of double-abstraction security for SQL Server. The motivation is that we do not want to administer permissions on an individual-user-to-individual-object basis. That approach does not scale well, and if you're anything like me, you'll make mistakes with it.

The double-abstraction approach makes it much easier.

Put users into groups that reflect their business needs

Create database roles to hold the permissions to fit those needs

Add groups to roles

In this section, we're going to put all three things together. We'll be using the same database as in parts I and II:

We need a new database role for this read-only group to use. We'll give it only SELECT and EXECUTE permissions in the SecureSchema:

USE [SecureDB]
GO
-- Create the new role
CREATE ROLE [SecureRoleReadOnly] AUTHORIZATION [SecureRole]
GO
-- Add the restricted group to the role
ALTER ROLE [SecureRoleReadOnly] ADD MEMBER [SecureGroupReadOnly]
GO
-- Grant EXECUTE and SELECT permissions on the secure schema
GRANT EXECUTE ON SCHEMA::[SecureSchema] TO [SecureRoleReadOnly]
GO
GRANT SELECT ON SCHEMA::[SecureSchema] TO [SecureRoleReadOnly]
GO

That's it! Now, all we have to do to manage access is to add/remove users from the Windows (or Active Directory) group!

Thursday, July 30, 2015

Introduction

In Part I of this series, we introduced the idea of Double Abstraction as a way to manage permissions with ease and consistency. The basic concepts are:

Use Server and Database custom roles to hold permission grants

Add Windows groups to the roles

Add users to the Windows Groups

We combine this with the principle of least privilege. That is, we ensure that only the permissions actually required to perform the duties of the role (in a business sense) are granted to the role.

In this post, we will work through an example to demonstrate some of these ideas. We'll create a new database, a database schema, a database role, and a SQL user to add to the role. In Part III, we'll extend the ideas to use Windows groups for management.

2. Switch to the context of the new database and create a database role:

-- Change the database context to the new database
USE SecureDB;
GO
-- Create a role in the new database
CREATE ROLE SecureRole;
GO
-- View the attributes of the new role
SELECT * FROM sys.database_principals WHERE name = 'SecureRole';
GO

3. Create a database schema and a user with access to it:

-- Create a schema to hold secure objects owned by the new role
CREATE SCHEMA SecureSchema AUTHORIZATION SecureRole;
GO
-- View attributes of new schema. Note that the principal_id matches the one in sys.database_principals
SELECT * FROM sys.schemas WHERE name = 'SecureSchema'
GO
-- Create a database user without a matching SQL Server Login.
CREATE USER SecureUser WITHOUT LOGIN WITH DEFAULT_SCHEMA = SecureSchema;
GO
-- View attributes of new user. Note default schema.
SELECT * FROM sys.database_principals where name = 'SecureUser'
-- Grant new user permission to create tables in the database
GRANT CREATE TABLE TO SecureUser;
GO

4. The new schema is owned by the new role and the user has database permissions to create tables. Let's try that:

EXECUTE AS user = 'SecureUser';
CREATE TABLE SecureSchema.SecureTable (SecureId INT);
GO
-- Switch the execution context back to the caller of the last EXECUTE AS statement.
REVERT;

5. Whoops! That didn't work! If you are following along on your own SQL Server instance, you should have seen an error message like this:

Msg 262, Level 14, State 1, Line 42

CREATE TABLE permission denied in database 'SecureDB'.

The thing is, the user also needs ALTER permission on the schema, in addition to the database-level CREATE TABLE permission. The database schema is owned by the custom role, so let's add the user to that role and try again:

-- Add the user to the role that owns the schema
-- (to undo: ALTER ROLE SecureRole DROP MEMBER SecureUser;)
ALTER ROLE SecureRole ADD MEMBER SecureUser;
GO
-- Set the execution context to the newly-created user and try to create a table
EXECUTE AS user = 'SecureUser'; SELECT SUSER_NAME()
CREATE TABLE SecureSchema.SecureTable (SecureId INT);
-- Switch the execution context back to the caller of the last EXECUTE AS statement.
REVERT; SELECT SUSER_NAME();

6. It worked! The table was created. Let's see if we can access it:

Success! We can access the table. Actually we can do anything to it. Since the new user is a member of the custom role that owns the schema, we have carte blanche.

Summary

In this section, we have shown how to use a database role to control who can do what. The role we created is the owner of the schema, so members of the role can do anything, provided other requirements are met.

Next time, we'll look at creating a restricted role with very limited access, applying the principle of least privilege.

Friday, July 17, 2015

Introduction

Securing a SQL Server instance is critical to every business. Whether to comply with regulatory requirements, pass external audits with flying colors or to ensure proper restrictions on access to sensitive or financial data, it is imperative that database resources are only accessed by those with the need to do so and in a manner matching requirements. This is the first in a series of articles on managing Security in SQL Server.

Challenges

Your database is accessed by a wide variety of people with different needs. You need to be able to secure database assets in a manageable way. All too often we see implementations where every user has serveradmin rights and/or database owner (dbo) rights on one or more databases in some SQL Server instance. This is a recipe for disaster! Even the most conscientious user will make mistakes and the careless, disgruntled and even criminally-oriented users can wreak havoc, destroy valuable data or even siphon it off to sell to the highest bidder.

We know that SQL Server has the tools to GRANT or DENY access to its objects, but all too often we get frustrated managing them user by user. This is where the double-abstraction approach shines.

Double Abstraction

The first part of Double Abstraction (DA for short) can be found by reversing the letters in the acronym. AD, or Active Directory, allows us to create groups so that birds of a feather can flock together -- that is, so that users with similar requirements can be managed as a unit by placing them into a suitable group. So the first step is to identify the sorts of actions different groups might need. A reporting team needs to be able to read some tables from one or more databases; an ETL job needs to be able to load tables; a web application needs to be able to query and possibly update certain data; auditors need to be able to monitor the security of the database; administrators at various levels of responsibility need to manage the server, create databases, backup/restore and so on. Each of these functions (and probably others, depending on your requirements) deserves its own group.

The second part of DA exploits database roles and server roles. SQL Server comes with many out of the box, but for our purposes it makes sense to define custom database roles and (since SQL Server 2012) custom server roles. Each role would be given just the privileges corresponding to the actual needs of the various groups: for example, access to certain schemas, EXECUTE permission on some stored procedures, the ability to create databases, tables and other objects, the ability to view all of these privileges, and so on. We set up these roles using the Principle of Least Privilege: each role has just the privileges actually required and no more.

To bring these together -- AD groups and DB/Server roles -- we use the ALTER ROLE ADD MEMBER command (or sp_addrolemember system stored procedure). Say that we have created an AD group Corp\SalesReports and a database role in the Sales database called SalesReaders. Then we can:

Add domain users (and other groups) to the Corp\SalesReports group

Add the AD group to the SalesReaders role like this:

ALTER ROLE SalesReaders
ADD MEMBER [Corp\SalesReports];

Next time, we'll dig deeper and add more examples.

Summary

Managing security in SQL Server is made easier with the principle of Double Abstraction: Add users to groups; create roles in the database to cover the functions; add the groups to roles.

Friday, July 10, 2015

Introduction

Many maintenance operations in SQL Server require us to iterate over some list and perform one or more operations against each item in the list. For example, perhaps you want to backup a number of databases or rebuild a number of indexes. Since the basic BACKUP and ALTER INDEX commands cannot take variables for database or index names, we often see a pattern like this:

DECLARE mycursor CURSOR FOR
SELECT ...
DECLARE @var1 ...
OPEN mycursor
FETCH NEXT FROM mycursor INTO @var1
WHILE @@FETCH_STATUS = 0
BEGIN
-- do something with @var1
FETCH NEXT FROM mycursor INTO @var1
END
CLOSE mycursor
DEALLOCATE mycursor

In Real Life

For a real-life example, consider ETL (Extract, Transform and Load). A typical sequence runs like this:

Extract rows from some source

Change/Add (transform) the data

Load the rows into some target table

Often, prior to step 3, it is helpful to disable the non-clustered indexes (NCIs), then rebuild them when the load is done. (As a side note, if your table has a clustered index and you ensure proper ordering of the data to be loaded, you want to keep the clustered index. You can't disable it anyway, since that effectively disables access to the table -- the clustered index is the table.)

Using a Cursor

Here is a script that will disable the NCIs for a table in the AdventureWorks database:
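A sketch of such a cursor-based script follows; the target table (Person.Address in AdventureWorks) is an assumption -- substitute your own load target:

```sql
-- Disable all non-clustered indexes on the assumed target table
DECLARE @IndexName sysname,
        @Sql       NVARCHAR(MAX);

DECLARE IndexCursor CURSOR FOR
    SELECT name
    FROM sys.indexes
    WHERE object_id = OBJECT_ID(N'Person.Address')
      AND type_desc = N'NONCLUSTERED';

OPEN IndexCursor;
FETCH NEXT FROM IndexCursor INTO @IndexName;
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @Sql = N'ALTER INDEX ' + QUOTENAME(@IndexName)
             + N' ON Person.Address DISABLE;';
    EXEC sys.sp_executesql @Sql;
    FETCH NEXT FROM IndexCursor INTO @IndexName;
END
CLOSE IndexCursor;
DEALLOCATE IndexCursor;
```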

This does the job quite handily. We can even use similar logic to rebuild the indexes when the load is done (or simply REBUILD ALL). What I'd like to show is a method to do the same without using a cursor.

Using a set-based operation

The basic idea is to build up the ALTER INDEX commands as a single NVARCHAR string, then use sp_executesql to run it. The code looks like this:
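A sketch of that set-based approach, using the same assumed table (Person.Address in AdventureWorks):

```sql
-- Build all the ALTER INDEX commands as one string, then run it once
DECLARE @Sql NVARCHAR(MAX) = N'';

SELECT @Sql = @Sql
            + N'ALTER INDEX ' + QUOTENAME(name)
            + N' ON Person.Address DISABLE;'
FROM sys.indexes
WHERE object_id = OBJECT_ID(N'Person.Address')
  AND type_desc = N'NONCLUSTERED';

EXEC sys.sp_executesql @Sql;
```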

As you can see, there is less code to write and debug, and I also find it less visually distracting. The net results are exactly the same, and so is the performance of the executed commands; the only difference is the front-end job of building the query string. As is usually the case, the set-based front end out-performs the cursor. However, since we are doing this for maintenance-type jobs, there are not millions of rows (perhaps only hundreds or thousands), so the run time of the front-end code is overwhelmed by the run time of the executed code.

Summary

For many small jobs where a cursor seems a natural way to go, there is an alternative using a set-based operation to construct a list of commands and running the list in one execution. Run times are essentially the same but the code can be more compact and done in just two statements (build the list, run the list) instead of the several statements required by a cursor-based approach.

Friday, July 3, 2015

In the previous post Set operations in SQL Server - Part 1, we looked at the basic set options Union, Intersection, Complement (or subtraction) and how they can be used and combined. We also looked at the problem of determining table equality in a relational database -- especially when duplicate rows are permitted. In this installment, we will look at various ways of testing set membership.

Set membership is a property of an item that may or may not be a member of a given set. In database terms, membership implies that there exists some row in a table that satisfies the membership test. This usually means that one or more columns contain data that satisfy some constraint such as equality or inequality. Testing for set membership is an essential operation in processing data in a relational database, such as SQL Server.

Starting easy

Let's start off with a simple example. Suppose you have a table that contains information on bicycles. Perhaps it looks something like this:
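A minimal sketch of such a table, with sample rows chosen to match the prices discussed below (the first model name is an assumption):

```sql
CREATE TABLE Bicycle (
    BicycleId INT IDENTITY(1,1) PRIMARY KEY,
    Model     NVARCHAR(50) NOT NULL,
    Cost      MONEY        NOT NULL
);

INSERT INTO Bicycle (Model, Cost)
VALUES (N'Trek 3500', 1000.01),   -- model name is an assumption
       (N'Giant 500',  600.00);
```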

Now, suppose this table was populated with various bike data and you want to see if there are any bicycles that cost over $1,000.00. A typical query might look like:

IF EXISTS(SELECT * FROM Bicycle WHERE Cost > 1000.00)
PRINT 'There is at least one bicycle costing over $1,000.00'

This is a simple test for membership that prints a message if the constraint is satisfied. Note that it uses the function EXISTS(). There are actually four functions available to help us with membership testing like this: EXISTS(), IN(), ANY() (a.k.a. SOME()) and ALL(). You're probably familiar with EXISTS() and IN() and may have used them before. Perhaps you are curious about the ANY() and ALL() functions and what they can do for you.

Using ANY()

Let's rewrite the sample query using the ANY() function instead:

IF 1000.00 < ANY (SELECT Cost FROM Bicycle)
PRINT 'There is at least one bicycle costing over $1,000.00'

This query returns the same result as the first and is a more direct translation of the requirement: "if 1000.00 is less than any Cost". Note the form: T-SQL requires the scalar expression on the left of the comparison operator, with ANY() and its subquery on the right. Putting the test value up front also makes the condition easy to spot when the subquery inside ANY() is long.

Using ALL()

Now that we've seen how ANY() operates, what about ALL()? It turns out that this is fairly intuitive. The function performs the desired comparison against all rows in the query and only returns True if the comparison is True for every member examined. Sticking with our bicycle example:

IF 1000 < ALL (SELECT Cost FROM Bicycle)
PRINT 'All bicycles cost more than $1,000.00'

If you try this against the sample data inserted above, you should receive no message, since there is at least one bicycle (the Giant 500) that costs less than $1,000.00.

Negating ANY() and ALL()

SQL allows for these operators when using ANY() and ALL():

{ = | <> | != | > | >= | !> | < | <= | !< }

However the first time you use the negative forms, you may be surprised at the results. For example, if you run this query:

IF 1000 !< ANY (SELECT Cost FROM Bicycle)
PRINT 'No bicycle costs less than $1,000.00'

you might be surprised to see that the message is indeed printed! If so, that is probably because you are reading the query in the ordinary English meaning of the command, "If there is no bicycle that costs less than $1,000.00, print the message." However, SQL sees it differently. You see,

<condition> ANY()

is really short hand for:

value <condition> first item OR value <condition> second item, etc.

With our sample data, negating that:

1000.00 !< ANY(...)

is executed as:

1000.00 !< 1000.01 OR 1000.00 !< 600.00

Clearly, 1000.00 !< 600.00 is True (1000.00 is not less than 600.00), so the expression evaluates to True and the message is printed. You might have expected behavior similar to:

IF NOT EXISTS(SELECT * FROM Bicycle WHERE Cost < 1000.00)
PRINT 'No bicycle costs less than $1,000.00'

which would result in no message, but NOT EXISTS() is not the same as !< ANY() as shown.
In fact, this is a case where using the alias SOME() can make the intent clearer. That is:

IF 1000 !< SOME (SELECT Cost FROM Bicycle)
PRINT 'No bicycle costs less than $1,000.00'

more clearly indicates what we're asking for.
What about negating the ALL() function? Well, similar to ANY(), ALL() is evaluated like this:

value <condition> first item AND value <condition> second item, etc.

So, this query:

IF 1000.00 !< ALL (SELECT Cost FROM Bicycle)
PRINT 'No bicycle costs less than $1,000.00'

is executed as:

1000.00 !< 1000.01 AND 1000.00 !< 600.00

which evaluates to False, so the message is not printed.

Note: We can rewrite the !< ANY() query (that didn't work) to achieve the desired result in this way:

IF NOT 1000 < ALL (SELECT Cost FROM Bicycle)
PRINT 'No bicycle costs less than $1,000.00'

which, when you think about it, makes perfect sense!

Summary

ANY() and ALL() are valuable functions in SQL Server that can be used to write succinct queries that reflect the business logic. However, we need to be careful when using them with negative comparisons, since their execution differs from the normal English reading.

Friday, June 26, 2015

Sets and set operations are fundamental to relational databases -- both in theory and in practical day-to-day operation. A table ("relation" in mathematical terms) represents a set (more properly, a multiset, or bag) and the rows of the table ("tuples") represent the members of a set. It's no surprise then, that SQL contains many operators designed to operate on sets. Some of these are familiar to anyone who recalls basic set theory from first-year college courses:

Union (UNION): Return a combination of two sets, eliminating duplicates

Intersection(INTERSECT): Return a set consisting of members present in both of two sets

Complement(EXCEPT): Return a set consisting of members present in one set but not the other

Cartesian product (A CROSS JOIN B): Set of all possible tuples from two sets

Cardinality (COUNT(*)): Return the number of members in a set

Note that I mentioned that in a relational database, sets are more properly called "multisets" or "bags". This is because an RDBMS typically allows duplicate rows (that is, when a primary key is not defined).

Testing set equality

One interesting problem is how to determine the equality of two sets in an RDBMS while allowing for duplicate rows. In set theory, two sets are equal if every member of set A is also a member of set B and every member of set B is also a member of set A. That means we must have:

A \ B = B \ A = {}, the empty set, where "\" is the operator for the complement operation. We can also look at the cardinality:

|A\B| = |B\A| = 0

For tables in an RDBMS with a primary key (which can have no duplicate rows), we can check equality simply:
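A sketch of that query, assuming two union-compatible tables A and B:

```sql
(SELECT * FROM A
 EXCEPT
 SELECT * FROM B)
UNION
(SELECT * FROM B
 EXCEPT
 SELECT * FROM A);
```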

which you might recognize as computing the symmetric difference. If this query returns no rows, the two tables are equal. (The parentheses are needed to overcome the default operator precedence in SQL Server. See EXCEPT and INTERSECT (Transact-SQL) for details)
For tables in an RDBMS without a primary key (so uniqueness is not enforced), we need to use the cardinality to account for cases where some rows are duplicated. In SQL we can write:
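A sketch reconstructing the example discussed here: table A holds {1, 1, 2} and table B holds {1, 2, 2}:

```sql
CREATE TABLE A (val INT);
CREATE TABLE B (val INT);
INSERT INTO A (val) VALUES (1), (1), (2);
INSERT INTO B (val) VALUES (1), (2), (2);

-- Both complements come back empty because EXCEPT removes duplicates,
-- so the cardinality test wrongly reports the tables as equal
SELECT COUNT(*) AS [|A\B|]
FROM (SELECT val FROM A EXCEPT SELECT val FROM B) AS d;
SELECT COUNT(*) AS [|B\A|]
FROM (SELECT val FROM B EXCEPT SELECT val FROM A) AS d;
```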

we can see that the results indicate equality. That is, the cardinalities of the two complements are indeed 0. Yet the tables are not equal, since table A has two rows of 1's while table B has two rows of 2's. What is happening here? SQL removes duplicates as part of the set operations UNION, EXCEPT and INTERSECT!

To solve this problem, we need to compare the two tables row by row. However, since there is no primary key, how should we match them up? One simple way is to impose an ordering on the tables and assign a surrogate key. In this case we simply use the ROW_NUMBER() function, e.g.:
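A sketch of the row-matching approach for the same single-column tables A and B:

```sql
WITH NumberedA AS (
    SELECT val, ROW_NUMBER() OVER (ORDER BY val) AS rn FROM A
), NumberedB AS (
    SELECT val, ROW_NUMBER() OVER (ORDER BY val) AS rn FROM B
)
SELECT a.val AS A_val, b.val AS B_val
FROM NumberedA AS a
FULL OUTER JOIN NumberedB AS b
    ON a.rn = b.rn AND a.val = b.val
WHERE a.val IS NULL OR b.val IS NULL;
-- Any rows returned are mismatches; an empty result means the two
-- multisets are truly equal, duplicates and all.
```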

Saturday, June 20, 2015

Introduction

Linked servers are an essential part of the SQL Server developer's tool kit. They are often necessary for pulling/pushing data from/to remote servers, whether those servers run SQL Server, Oracle, DB2, PostgreSQL or any other RDBMS. Linked servers are flexible and can adapt to many different types of "servers", including Excel workbooks and MS-Access databases, among others.

Creating a linked server is relatively easy. There is both a GUI method in SSMS (under Server Objects/Linked Servers) and a T-SQL method, using the system stored procedures sp_addlinkedserver and sp_addlinkedsrvlogin.
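The T-SQL method can be sketched like this; the server name, data source and login mapping are assumptions:

```sql
-- Create a linked server pointing at another SQL Server instance
EXEC sys.sp_addlinkedserver
    @server     = N'REMOTESQL',              -- assumed name
    @srvproduct = N'',
    @provider   = N'SQLNCLI',
    @datasrc    = N'remotehost\instance1';   -- assumed data source

-- Map a local login to a login on the remote server
EXEC sys.sp_addlinkedsrvlogin
    @rmtsrvname  = N'REMOTESQL',
    @useself     = N'FALSE',
    @locallogin  = N'CORP\USER1',
    @rmtuser     = N'remote_login',          -- assumed remote login
    @rmtpassword = N'********';
```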

Problem

Securing access to linked servers falls outside the normal SQL Server permissions controls. For example, to give some user -- say CORP\USER1 -- access to a linked server, you need to add that login id as a linked server login and map it to some login on the linked server. That means that, for every user who needs to access the linked server, you potentially need to add a new linked server login.

There are options available to mitigate this somewhat. If the remote user is in the same AD domain as the local user, you can specify that the connection be made using the caller's own credentials. This cannot work across domains unless those domains are in a trust relationship. You can also specify a catch-all remote login which is used for connections where there is no explicit mapping specified. Alternatively, you can simply deny access for any logins trying to connect that are not in the explicit login list. (I'm using the terms "user" and "login" rather loosely here; strictly speaking, linked servers only know about logins, since they are a server-level object.)

Solution

I find these options deficient and inflexible. I do not want to edit the linked server object every time I need to give a new login access or remove an old login. I'd rather use SQL Server's roles and credentials. Here's how I envision this working:

When you create a new login for a linked server, you should have additional options for the local login, specifically:

An AD group that is in the server logins. This would mean that any member of the group who can connect to the server (that is, is not denied) can use the linked server.

A SQL Server user-defined role. This is the most flexible. Any member of the role would have access to the linked server. Note that you can add AD groups to roles if those groups are included in the server login list.

In addition to specifying a specific remote login, you should have the ability to specify a Credential.

These additional options, should they be added, would make the management of linked servers easier and more consistent with the SQL Server security model. With these in place, permitting a login to use the linked server would be a simple matter of adding it to the AD group or directly to the server role, and adding or changing the remote login would be managed through credentials.

Conclusion

I'm convinced that it is time to align linked server management with the SQL Server roles and credentials security model. Let's see if we can raise this issue for SQL Server Next!

Friday, June 12, 2015

Part of my daily routine is to have a look at jobs that ran overnight to see which ones ran longer than seems reasonable. I know, "seems reasonable" is a little vague! What I mean is: how long did the job run relative to previous runs and relative to the amount of data processed?

The other day, while going through the overnight jobs, I found one that had been running between 90 minutes and 3.5 hours, yet only processed about 25,000 rows. I suddenly got very interested in that job! What was it doing that could possibly take so long?

Digging into SSIS

The job in question runs an SSIS package that loads a table into our data warehouse that is needed by many daily, weekly and monthly reports -- some of which go to C-level executives and above. One might say this is important! The job doesn't do anything very special: it gets data -- previously staged -- from the same server it loads it to. Sounds simple, right? Well, actually it is simple. So, why the long run time?

Digging into the package I discovered that the data flow uses an OLE DB source transformation, specifying Table or View as data access mode and referencing a view. "OK," I thought, "What's up with that view?" I opened the view definition in SSMS to look at it. It involved a join of six tables but nothing suspicious. Some of the tables could use extra indexes but the amount of data (25,000 rows) was small enough that I figured those indexes wouldn't make much difference. It turned out that I was right and that the problem was something else entirely.

It's the plan, dummy!

Before trying to run the view in SSMS, I took a look at the estimated execution plan. Essentially, it was a series of hash joins, which are usually a good choice. With a little trepidation, I ran the code in the view. To my surprise, it returned results in under 10 seconds! So, where did the 3.5 hours come from? Good question!

Next, I decided to rerun the package to see if it finished in 10 seconds, as my standalone query had. I waited and waited and waited some more. Finally I killed it. Something else was going on. I recalled that I had installed Adam Machanic's excellent sp_whoisactive stored procedure some time ago. You can find the current release version here: sp_whoisactive

This procedure has a number of options to control its output. I noticed a couple I thought I could use:

--If 1, gets the full stored procedure or running batch, when available
--If 0, gets only the actual statement that is currently running in the batch or procedure
@get_full_inner_text BIT = 0,
--Get associated query plans for running tasks, if available
--If @get_plans = 1, gets the plan based on the request's statement offset
--If @get_plans = 2, gets the entire plan based on the request's plan_handle
@get_plans TINYINT = 0,

So, I started up the package again, and while it was running, ran sp_whoisactive in a second session using those options. The procedure returned a row for my running package with a link to the query plan. Using that link, I opened the query plan to see what it looked like. It didn't look at all like the plan from my first, interactive session. Instead of hash joins the plan had nested loops! Given the sizes of the tables involved, no wonder it ran for 3.5 hours!

So, why the nested loops? Digging further, I discovered that the optimizer had timed out trying to find a good plan, leaving us with what it had at the moment. Not only that, but the estimated row counts for two of the tables were 1 -- that's right, one! I knew that those tables had thousands of rows, so nested loops would kill performance. However, the optimizer had given up before finding a good plan. Why? For that matter, why did it not simply reuse the plan from my (fast) interactive query?

ARITHABORT OFF messes up the plan

The next thing to look at was the settings in use when both plans were generated. They were the same except for one: ARITHABORT. The plan that caused the optimizer to time out was generated with SET ARITHABORT OFF in effect. Googling this setting led me to ARITHABORT and performance where it is stated clearly:

Client applications setting ARITHABORT to OFF can receive different
query plans making it difficult to troubleshoot poorly performing
queries.

Sooooo, I knew the reason. Now, what about the root cause? It turns out that the default setting for ARITHABORT in the model database is -- you guessed it! -- OFF. My database inherited that setting. I could have changed the database setting (and I still might do that) but I did not want to break any queries that might depend on it, so I changed the package instead.
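Had I chosen to fix it at the database level instead, the change would have been a one-liner (shown here with a hypothetical database name):

```sql
-- Hypothetical database name. This changes the database-level default,
-- which connections inherit unless they SET the option themselves.
ALTER DATABASE MyWarehouse SET ARITHABORT ON;
```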

The fix is in!

I changed the Data Access mode to SQL Statement and wrote:

SET ARITHABORT ON;
SELECT * FROM dbo.myview;

I deployed the package to the server, deleted the old plans from the cache and reran the package. It ran in under ten seconds! Worrying that it might be a fluke, I monitored the package for the next several days and not once did it run longer than a minute. Problem solved!
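By "deleted the old plans from the cache" I mean something along these lines (a sketch; flushing the whole cache is heavy-handed on a busy server, so finding the specific plan handle is gentler):

```sql
-- Heavy-handed option: flush every plan on the instance.
-- DBCC FREEPROCCACHE;

-- Gentler: locate the offending plan's handle first...
SELECT cp.plan_handle, st.text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE st.text LIKE '%dbo.myview%';

-- ...then evict just that plan:
-- DBCC FREEPROCCACHE(<plan_handle value from the query above>);
```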

Summary

The moral of the story? Don't always assume that the problem is the way the query is written or how the tables are indexed. The problem can be subtler than that, and you need to look at the query plan and the options used to generate it.

Wednesday, June 3, 2015

One cool new feature in SQL Server 2014 is Buffer Pool Extensions (BPE), which allow you to use your SSDs like an extension of your system RAM, albeit a tad slower. It's no secret that adding memory to a SQL Server can improve your system's performance. Thankfully, memory prices keep falling decade over decade and you can now get RAM for about ten bucks a gigabyte. Not bad! However, what if you've already maxed out your RAM and you still want more? BPE to the rescue!

How Buffer Pool Extensions Work

Buffer pool extensions work by using (fast) non-volatile storage devices, such as Solid State Disks, as an extension of your system RAM. Of course even the fastest SSD is not as fast as the slowest RAM available today, but SQL Server manages BPEs on SSDs in such a way that it's all gain and very little pain. With an available SSD, you can set up a BPE on that device with just a few T-SQL statements.
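Those few T-SQL statements boil down to a single ALTER SERVER CONFIGURATION command. A sketch, assuming an SSD volume mounted at E:\ (adjust the path and size for your environment):

```sql
-- Enable the buffer pool extension on a (hypothetical) SSD volume.
ALTER SERVER CONFIGURATION
SET BUFFER POOL EXTENSION ON
    (FILENAME = N'E:\SSDCACHE\bpe.bpe', SIZE = 64 GB);

-- To turn it off again:
-- ALTER SERVER CONFIGURATION SET BUFFER POOL EXTENSION OFF;
```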

Where Does BPE Help?

You will likely see good performance gains on OLTP workloads -- particularly if they are of the read-mostly type. You don't have to modify your applications. The feature is transparent to users. There's no risk of data loss, since BPE only works with clean pages (already committed to durable storage). Oh, and if the SSD fails, BPE automatically disables itself, though it can be re-enabled manually should you think it's just a hiccup.

Saturday, May 30, 2015

For almost ten years I've been an ardent devotee of the Python language, initially developed by Guido van Rossum at Centrum Wiskunde & Informatica (CWI) in Amsterdam, the Netherlands. I learned it from the ground up during my participation in the GRAMPS project, motivated by the unfolding history of my remarkable family.
A few years back, I was approached by Raul Chong of IBM's Big Data University to help author an introductory book on the language. That book is nearing completion!
Along with the book, I was invited to give a two-hour introductory class on Python at Ryerson University. That class went very well, with more than 100 students hanging on to the end. Naturally two hours is hardly enough to do more than scratch the surface, especially for non-Computer Science students, many of whom had little programming experience. Still, I was very much energized by the opportunity and the response.
I was able to use Visual Studio 2013 Express with Python support added via the Python Tools for Visual Studio extension as my demonstration IDE. Though there are some evident growing pains (especially with respect to debugging), it has the potential to become one of the best free Python IDEs available.

Sunday, May 24, 2015

Like many large organizations, the company I work for has a large, heterogeneous database environment. We have at least four RDBMSs in use (five, if you count MS Access) to support the various lines of business. That list includes multiple instances of SQL Server, Oracle, DB2 and MySql. Some of those came as part of certain vendor packages. This article concerns my adventures integrating data from MySql into SQL Server for reporting purposes.

One of the instances of MySql that I need to pull data from to populate a data warehouse on SQL Server is the database used by SANscreen, developed by Onaro and now part of the NetApp product suite. It provides information on storage allocations and usage in our SAN. Developing an ETL process for this data load has been challenging -- even frustrating at times -- but very satisfying as the barriers we encountered along the way began to fall.

The first thing we did was install the MySql ODBC drivers into our development instance of SQL Server and verify that we could at least connect to the SANScreen database. That worked and we could see the SANScreen tables under the Linked Servers node of System Objects in SSMS. On to the queries!

We learned right away that we could not use SQL Server's four-part naming to query the MySql tables, since the MySql ODBC driver does not support the required interfaces. So, we built views using OPENQUERY that work quite well. A typical view looks like this:
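(The original example is not available here; a minimal sketch of the pattern, with a hypothetical linked server name and table, might look like the following.)

```sql
-- Hypothetical names: SANSCREEN is the linked server, storage_alloc a MySql table.
-- OPENQUERY sends the inner string to MySql verbatim and returns its result set.
CREATE VIEW dbo.vStorageAlloc
AS
SELECT *
FROM OPENQUERY(SANSCREEN,
    'SELECT id, host, allocated_gb, txn_date FROM storage_alloc');
```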

No problem! We used the views to build ETL packages in Integration Services and the world was a happier place. After a few runs, however, it became obvious that we were pulling largely the same data every night for some of the biggest tables. With 20/20 hindsight we realized that this was not unexpected, since those tables are essentially data warehouse fact tables. Only a few hundred or thousand rows were new for any given run. The rest (millions of rows, in some cases) were from older transactions. An incremental load approach was needed.

Writing incremental loads in Integration Services means that you need some column you can use to define the start of the new data. That is commonly a column containing a transaction date or something that can be mapped to such a date. You want to be able to restrict your query to rows that are new with respect to that column. Then, you want to save the new, maximum value of that column to use for the next incremental load. We found columns that mapped to dates and decided to use those. Adding those dates to our query created a new challenge.

We didn't want to retrieve the whole table from MySql and then filter out the new rows. That would defeat one of our key goals in the incremental load approach: only pull data from MySql that we need. So, we needed to pass that date to the call to OPENQUERY. Unfortunately, OPENQUERY does not support a variable as its second parameter. Our nice, neat views would not do the job. We needed new stored procedures. The procedures have the general form:
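(The original procedure body is not shown here; a sketch of the general form, with illustrative names, follows. Since OPENQUERY requires a string literal, the whole statement is built dynamically and the start date spliced into the pass-through query before execution.)

```sql
-- Hypothetical sketch: SANSCREEN, storage_alloc and the column names are illustrative.
CREATE PROCEDURE dbo.usp_StorageAlloc
    @StartDate datetime
AS
BEGIN
    -- Declare the whole query up front, with a placeholder for the date,
    -- so it can be read at a glance before the placeholder is replaced.
    DECLARE @query nvarchar(max) = N'
        SELECT *
        FROM OPENQUERY(SANSCREEN, ''
            SELECT id, host, allocated_gb, txn_date
            FROM storage_alloc
            WHERE txn_date >= ''''{StartDate}''''
            /* LIMIT */'')';

    SET @query = REPLACE(@query, N'{StartDate}',
                         CONVERT(char(10), @StartDate, 120));
    EXEC (@query);
END
```

The /* LIMIT */ comment is inert on a normal run; MySql simply ignores it. It exists only as a marker inside the pass-through query.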

(Note that I like to declare a variable containing the whole query including place-holders for variables, then replace those variables before executing the query. This way, I can see the whole query at a glance. I find this easier to follow than alternatives that assemble the query from parts. YMMV.)

Voila! A stored procedure that calls MySql with a variable start date! Worked like a charm in all our unit tests. So, now we went on to build SSIS packages that use the new procs. More pain ensued...

Since we were no longer working with views, there had to be some way for Integration Services to figure out the metadata for the source adapter. Since we are using OLE DB Sources in our packages, we changed the input from Table or View to SQL Command. The commands look like:

EXEC usp_MySqlTable ?

and we put a variable holding a default starting date in the Parameters section. (As an aside: for some reason, when you do this, the Integration Services designer cannot preview the results. It throws an error, complaining that parameters were not supplied, even though they clearly are.) Once we had changed the view to the call to the stored procedure, I clicked on Columns to make sure that everything was still OK. We waited...and waited...and waited some more...until the designer threw an error saying that the query had timed out! What on earth was going on? To find out, we fired up SQL Server Profiler and found a sequence like this (reformatted for readability):

Of course, SSIS needs to know the metadata, hence the SET FMTONLY ON. But where did that weird date come from? We had not specified January 1, 1992 as the start date! (Note that this remains a mystery.) The consequence of the bogus date is that MySql is asked to return the entire table, all four million rows of it. Since this is inside an OPENQUERY, SQL Server cannot know what the metadata is until something is returned. MySql chose to process a large portion of the table before returning anything to SQL Server, hence the timeout.

What to do? Well, we thought that if we could somehow limit the results returned by MySql, in spite of the bogus date in 1992, we might be able to avoid the timeout and make the package design less frustrating as well. Reading through the MySql docs, we saw that there is a LIMIT clause available for MySql queries. That's it! we thought. Now we needed a way to include that in the query, but conditionally, since we only wanted it in effect when FMTONLY was ON. So how could we know that? BOL was no help, but some Internet searches turned up this gem: when FMTONLY is ON, SQL Server executes every conditional branch of the batch in order to work out the metadata, so a statement guarded by IF 1=0 runs only during the FMTONLY pass and never during a normal execution.

Note that adding a comment to the query passed to MySql does not have any ill effects. However, it gives us a place to insert a proper limit clause. So, before actually executing the dynamic SQL, we added this one line:

IF 1=0 SET @query = replace(@query, '/* LIMIT */', ' LIMIT 1');

The effect? It worked! Now, when SSIS does its FMTONLY run, only one row is returned from MySql, almost instantly. No more lengthy waits and time outs!

Friday, May 22, 2015

There's an interesting article on MSDN: OLE DB Source discussing, among other things, how to use parameters when calling a stored procedure in an OLE DB Source component. The section "Specifying Parameters by Using Ordinal Positions" (about halfway down the page), shows an example using the AdventureWorks database. Unfortunately, the example doesn't work, at least not if you try to Preview the results.

Note that I want to use Preview as a sanity check that I've set things up correctly. Instead, it's driving me insane!

I set up my test using exactly the setup described:

OLE DB Connection Manager to AdventureWorks (I used the 2012 version for the test)

Two variables created: User::StartProductId (int) and User::CheckDate (datetime), set to 0 and getdate(), respectively

Two parameters in the SQL query Parameters window:
Parameter0 mapped to StartProductId
Parameter1 mapped to CheckDate

Back in the OLE DB Source Editor, I clicked Preview ------->>> BOOM!

"No value given for one or more required parameters."

So, it seems that the documentation is wrong (or incomplete) on this point. Curious to see what was going on, I fired up Profiler to trace the calls. I was surprised to see this sequence:

Then I clicked Preview again. Same result ----> BOOM! Same error message, and it still didn't use my variable settings. So, this is totally messed up.

HOWEVER!
When I actually execute the task under debug in SSDT, it runs fine, uses the variable values I set, etc.

So, the big questions:

Why does Preview bomb? From what I can see in the trace, it shouldn't. That is, it properly prepares the call to the stored procedure (except for my variable values). FWIW, I copied the statements from Profiler and ran them fine in SSMS. So why does Preview fail in SSDT?

Why doesn't it use my variable values?

Where did the date '01-01-1992 01:01:01:000' come from? Is it just a historical reference to SQL-92?

Wednesday, April 29, 2015

Introduction

Don't Repeat Yourself (DRY) is an important principle of professional
programming practice (How's that for alliteration!) In SQL queries,
though, it is common to see repeated expressions. This article explores
how we can use the SQL APPLY operator to reduce repetition and make our
queries DRYer and easier to maintain.

Example of Repeated Expressions

A good example of repetition can be shown in the AdventureWorksDW2008R2 database. Consider the view [dbo].[vTimeSeries]:
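(An abbreviated version of the view, lightly condensed from the AdventureWorksDW2008R2 sample, conveys the repetition:)

```sql
CREATE VIEW [dbo].[vTimeSeries]
AS
SELECT
    CASE [Model]
        WHEN 'Mountain-100' THEN 'M200'
        WHEN 'Road-150'     THEN 'R250'
        ELSE LEFT([Model], 1) + RIGHT([Model], 3)
    END + ' ' + [Region]                                         AS [ModelRegion],
    (CONVERT(int, [CalendarYear]) * 100) + CONVERT(int, [Month]) AS [TimeIndex],
    SUM([Quantity])                                              AS [Quantity],
    SUM([Amount])                                                AS [Amount],
    [CalendarYear],
    [Month],
    [dbo].[udfBuildISO8601Date]([CalendarYear], [Month], 25)     AS [ReportingDate]
FROM [dbo].[vDMPrep]
GROUP BY
    CASE [Model]
        WHEN 'Mountain-100' THEN 'M200'
        WHEN 'Road-150'     THEN 'R250'
        ELSE LEFT([Model], 1) + RIGHT([Model], 3)
    END + ' ' + [Region],
    (CONVERT(int, [CalendarYear]) * 100) + CONVERT(int, [Month]),
    [CalendarYear],
    [Month],
    [dbo].[udfBuildISO8601Date]([CalendarYear], [Month], 25);
```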

There are three expressions in this query that are repeated. They are
given the aliases [ModelRegion], [TimeIndex] and [ReportingDate],
respectively. You can see these expressions first in the SELECT list
and later on in the GROUP by. We're going to eliminate this repetition
using the APPLY operator.

Using APPLY to Encapsulate Expressions

When you read the official documentation about the APPLY operator, it
is not immediately obvious how it can help us reduce repetition. (See "Using Apply" at https://msdn.microsoft.com/en-us/library/ms177634.aspx)
Many examples show how to use the APPLY operator to invoke a
table-valued function for each row in the rowset. However, since it is
possible to use SELECT queries in the APPLY context, it is also possible
to use an expression operating on columns in each row in the rowset.
How does this help us? Here's a simple example:
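A minimal sketch, using a table from the AdventureWorks sample database:

```sql
-- CROSS APPLY as an expression-aliasing device: the inner SELECT has no
-- FROM clause, so it simply evaluates once per row of the outer rowset.
SELECT p.FirstName, p.LastName, _.FullName
FROM Person.Person AS p
CROSS APPLY (SELECT p.FirstName + ' ' + p.LastName AS FullName) AS _;
```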

Note that the APPLY operator only contains an expression that uses a
column in the rowset and aliases the result. The alias is then used in
the main query. As an aside, note the use of '_' as the alias for the
APPLY rowset. This is handy when you don't care about the rowset alias
as in this case. It is a paradigm used in many programming languages as
well.

We can also use the results of one APPLY operation in a subsequent
APPLY. Note that the operations proceed top-to-bottom, left-to-right.
This means that the results of one APPLY are not available until the
operator appears in sequence. Let's expand our toy example a little:

Just like the toy example, above, the APPLY clause merely returns
aliased expressions using columns in the rowset. How will this make the
original query DRYer? We don't need to repeat these expressions!

Now, if the expressions need to change, we have only one place to change them. The query has become DRYer.

Summary

We can use the APPLY operator to encapsulate and alias expressions
that are repeated in our queries. This allows us to apply the principle
of "Don't Repeat Yourself" and make our queries easier to maintain.