SQL General

This isn’t a technical post about databases, but rather a discussion of a statistical paradox that I read about recently. Statistics and data often go hand in hand, and many of us who work with data often use statistics in our work – particularly if we cross over into BI, Machine Learning or Data Science.

So, let’s state the problem.

I think I have an average number of friends, but it seems like most people have more friends than me.

I can see this on facebook – even though facebook friendship isn’t the same thing as the real kind. A quick google tells me that the average number of facebook friends is 338 and the median is 200. My number is 238. That’s less than average but greater than the median – so according to those figures I do have more friends than most people (on facebook at least).

I could probably do with culling some of those though…

I’m going to pick ten of those facebook friends at random. I use a random number generator to pick a number from 1 to 20 and I’m going to pick that friend and the 20th friend after that, repeating until I get 10 people.

I get the number 15 to start so let’s start gathering some data. How many friends does my 15th friend have, my 35th, my 55th, 75th, 95th 115th etc.

Here’s the numbers sorted lowest to highest:

93
226
239
319
359
422
518
667
1547
4150

If I look at that list only two people have less friends than me – or 20%. 80% have more friends than me. Why do I suddenly feel lonely \ How can this be?

If I have more friends than the median, then I should be in the second half.

Friendship is based on random accidents, and I’ve picked people out of my friends list at random. Surely I’ve made a mistake.

I could repeat the sampling but I’m likely to get similar findings.

The answer is that it’s not all random, or at least not evenly so. The person with over 4,000 friends is over 40 times as likely to know me as the person with less than 100. Maybe they get out a lot more (actually they’re a musician).

I’m more likely to know people if they have a lot of friends than if they have fewer.

This is difficult to get your head round, but it’s important if you’re ever in the business of making inferences from sampled data. It’s called “The Inspection Paradox”.

It’s also one of many reasons why people may get feelings of inferiority from looking at social media.

I think it’s appropriate to give a shout-out to Microsoft at this point, because over the last few releases they’ve given us some of the items that are top of my list.

Recommending and setting MAXDOP during setup (coming with SQL 2019) will hopefully mean I no longer have to have arguments about why the out-of the-box setting isn’t appropriate.

The same with setting max memory in the setup (also with SQL 2019).

A more verbose error where string or binary data might be truncated – we got that in SQL 2017.

It’s the little things like these that make me happy – and make my job easier.

A few other little things I’d like – though I accept that something small to describe isn’t always small in its execution…

A change to the default cost threshold for parallelism – be it 20, be it 30, be it 50. I’d prefer any of those to 5.

I think it would also be great to have a cardinality optimizer hint, e.g. OPTIMIZE FOR 5 ROWS. Oracle has this, and it’s not good having to feel jealous of those working on the dark side 😉 You can do the equivalent but it’s convoluted and not clear what’s going on when the uninitiated see the code:

There is one big thing I’d like – but it’s never going to happen. Get rid of Enterprise Edition – or rather, make Enterprise Edition the standard. Enterprise is comparatively so expensive, it’s rare I’m going to recommend it. It’s interesting to see that in Azure SQLDB we only have one edition to work with – I’d love to see that in the box product. I understand that change would be a massive revenue loss so can’t see it happening.

If not that though, if we could just have at-rest encryption (i.e. TDE) in the Standard Edition that would make me very happy. In these days of security and privacy consciousness it seems that should be a core functionality.

UPDATE: TDE is going to be available on Standard Edition from SQL Server 2019. I get my wish!

Finally, I’d just like to upvote Brent’s idea that it would be great to be able to restore just a single table.

That all said, I’d like to go back to my first point. I love what MS is doing with SQL Server, and the continual improvements that are being made. I particularly love that we sometimes get them for free in service packs and cumulative updates – without having to upgrade to a new version.

Even administration, many of the core concepts are the same, understanding how security works, backups, high-availability. The main difference is often that some of these might be taken care of for you and you don’t need to worry about them any more.

Caveat – you still need to worry about them a bit!

The point is, most of what you already know, the experience you have gained over the years, is still totally valid. Learning about SQL Server on a new platform may feel like a big learning curve, but in reality, the new stuff you need to get to grips with is small compared to all the stuff you already know.

And in some cases, the skills you already have become even more valuable. People might not care if your query tuning on physical kit takes your CPU down from 50% to 10%. But tell them you’ve just reduced their cloud bill by 80% and they really care!

So don’t be intimidated, and don’t feel you need to learn every flavour. Have a play with SQL Server in the cloud, have a play with containers, set up SQL on Linux. You’ll quickly find it’s not that hard, and once it’s running – it’s pretty much the same as ever.

And remember, if someone comes to you with a question about why SQL is running slow, or why a query isn’t doing what they want – on RDS, Docker, Linux, or whatever. You don’t need to know that platform inside out to be able to help, you already know SQL Server and that’s the important bit.

To paraphrase a popular lyric :

If you’ve got SQL problems I can help you son. I’ve got 99 problems but SQL aint one.

One of the powerful aspects of Query Store is the ability to directly query the DMVs for details of historical executions and performance.

A key view for this is sys.query_store_runtime_stats (but also sys.query_store_runtime_stats_interval).

If you’re querying these views to look at what was happening during a particular time period then it’s important to understand that the dates and times are stored according to UTC time (which is equivalent to GMT with no adjustment for daylight savings times).

Though it will be more obvious to you if you’re in a time zone other than the UK.

The datetimes are stored as DATETIMEOFFSET which you can see from the +00:00 at the end of each entry. DATETIMEOFFSET allows you to store both the datetime as well as the timezone being used.

This means that if you’re querying against these columns you need to convert the values you are looking for to UTC first. You can do that my making sure you use GETUTCDATE() instead of GETDATE(), or if you are using specific local times you can use the AT TIME ZONE function e.g.

SELECT CAST('2019-08-21 11:50:40.400 +9:00' AS datetimeoffset) AT TIME ZONE('UTC');

I’m a big fan of storing all datetimes as the UTC version. It avoids things going out of sequence due to daylight saving changes – and can be helpful in mitigating problems when you have application users in multiple countries accessing the same database.

I’ll admit I might be biased though as – being based in the UK – UTC is the same as my local time half the year.

Last week a question came up about adding a column to a table, and giving that column a default constraint. Would that default value be assigned to all existing rows, and how much processing would be involved.

Unsurprisingly, the answer is that – “it depends”.

I’ve got a table with about a million rows that just has an identity column and a text column I’ve populated from sys.objects:

So whether our default value gets assigned to existing rows depends on whether your column is nullable or not, a nullable column will retain Null as the value. A non-nullable column will get assigned the new default value.

If you want to override that behaviour, and have your default assigned even where the column is nullable, you can use the WITH VALUES statement. First I’ll remove the constraint and column then add it again with values:

A few years back I started running regular SQL workshops in my workplace. Teaching beginners the basics of querying databases with SQL, as well as more advanced topics for the more advanced.

During one session we were discussing the issue of knowledge acquired being quickly lost when people didn’t get the chance to regularly practice what they’d learnt. One of the attendees suggested that I should be assigning them homework.

I could see from the faces of everyone else present that the word “homework” struck an unpleasant chord. Perhaps reminding them of school days struggling to get boring bookwork done when they’d rather be at relaxation or play.

Okay, so homework maybe wasn’t going to go down well, but I figured everyone likes a good puzzle. So every Friday I started creating and sharing a puzzle to be solved using SQL. This went on for the best part of a year, then other things got in the way and gradually I stopped.

This is my invitation to you this T-SQL Tuesday. Write a blog post combining puzzles and T-SQL. There’s quite a few ways you could approach this, so hopefully no-one needs be left out for lack of ideas:

Present a puzzle to be solved in SQL and challenge your readers to solve it.

Or give us a puzzle or quiz about SQL or databases.

Show the SQL solution to a classic puzzle or game.

Provide a method for solving a classic sort of querying puzzle people face.

Show how newer features in SQL can be used to solve old puzzles in new ways.

Tell us about a time you solved a problem or overcame a technical challenge that was a real puzzle.

Or just make your own interpretation of “puzzle” and go for it!

There’s some great stuff out there already. Itzik Ben-Gan’s done a bunch of them. There’s Kenneth Fisher’s crosswords. The SQL Server Central questions of the day. Pinal Dave’s SQL Puzzles. And there’s a few on my blog too if you take a look back:

Parallelism and MAXDOP

The pros and cons of parallelism have always been with us in SQL Server and I blogged about this a couple of years ago. This is an updated version of that post to include details of the new wait stat related to parallelism that was added in 2017 (CXCONSUMER), as well as to discuss the options available for cloud based SQL Server solutions.

There’s no doubt that parallelism in SQL is a great thing. It enables large queries to share the load across multiple processors and get the job done quicker.

However it’s important to understand that it has an overhead. There is extra effort involved in managing the separate streams of work and synchronising them back together to – for instance – present the results.

That can mean in some cases that adding more threads to a process doesn’t actually benefit us and in some cases it can slow down the overall execution.

We refer to the number of threads used in a query as the DOP (Degree of Parallelism) and in SQL Server we have the setting MAXDOP (Maximum Degree of Parallelism) which is the maximum DOP that will be used in executing a single query.

Out of the box, MAXDOP is set to 0, which means there is no limit to the DOP for an individual query. It is almost always worth changing this to a more optimal setting for your workload.

Cost Threshold for Parallelism

This is another setting available to us in SQL Server and defines the cost level at which SQL will consider a parallel execution for a query. Out of the box this is set to 5 which is actually a pretty low number. Query costing is based on Algorithm’s from “Nick’s machine” the box used by the original developer who benchmarked queries for Microsoft.

(Nick’s Machine)

Compared to modern servers Nick’s machine was pretty slow and as the Cost Threshold hasn’t changed for many years, it’s now generally considered too low for modern workloads/hardware. In reality we don’t want all our tiny queries to go parallel as the benefit is negligible and can even be negative, so it’s worth upping this number. Advice varies but generally recommendations say to set this somewhere in the range from 30 to 50 (and then tuning up and down based on your production workload).

There are many articles in the SQL Server community about how the out of the box setting is too low, and asking Microsoft to change it. Here’s a recent one:

CXPACKET and CXCONSUMER waits

Often in tuning a SQL Server instance we will look at wait stats – which tell us what queries have been waiting for when they run. CXPACKET waits are usually associated with parallelism and particularly the case where multi-threaded queries have been stuck waiting for one or more of the threads to complete – i.e. the threads are taking different lengths of time because the load hasn’t been split evenly. Brent Ozar talks about that here:

High CXPACKET waits can be – but aren’t necessarily – a problem. You can cure CXPACKET waits by simply setting MAXDOP to 1 at a server level (thus preventing parallelism) – but this isn’t necessarily the right solution. Though in some cases in can be, SharePoint for instance is best run with MAXDOP set to 1.

What you can definitely deduce from high CXPACKET waits however is that there is a lot of parallelism going on and that it is worth looking at your settings.

To make it easier to identify issues with parallelism, with SQL Server 2017 CU3 Microsoft added a second wait type related to parallelism – CXCONSUMER. This wait type was also added to SQL Server 2016 in SP2.

Waits related to parallelism are now split between CXPACKET and CXCONSUMER.

Here’s the original announcement from Microsoft regarding the change and giving more details:

In brief, moving forward CXPACKET waits are the ones you might want to worry about, and CXCONSUMER waits are generally benign, encountered as a normal part of parallel execution.

Tuning Parallelism

In tuning parallelism we need to think about how we want different sized queries to act on our server.

Small Queries

In general we don’t want these to go parallel so we up the Cost Threshold to an appropriate number to avoid this. As discussed above 30 is a good number to start with. You can also query your plan cache and look at the actual costs of queries that have been executed on your SQL Instance to get a more accurate idea of where you want to set this. Grant Fritchey has an example of how to do that here:

Often the answer is going to be simply to set it to 8 – but then experiment with tuning it up and down slightly to see whether that makes things better or worse.

Very Large Queries

If we have a mixed workload on our server which includes some very expensive queries – possibly for reporting purposes – then we may want to look at upping the MAXDOP for these queries to allow them to take advantage of more processors. One thing to consider though is – do we really want these queries running during the day when things are busy? Ideally they should run in quieter times. If they must run during the day, then do we want to avoid them taking over all the server power and blocking our production workload? In which case we might just let them run at the MAXDOP defined above.

If we decide we want to let them have the extra power then we can override the server MAXDOP setting with a query hint OPTION(MAXDOP n):

You will want to experiment to find the “best” value for the given query. As discussed above and as shown in Kendra Little’s article, just setting it to the maximum number of cores available isn’t necessarily going to be the fastest option.

Exceptions to the Rule

Regardless of the size, there are some queries that just don’t benefit from parallelism so you may need to assess them on an individual basis to find the right degree of parallelism to use.

With SQL server you can specify the MAXDOP at the server level, but also override it at the database level using a SCOPED CONFIGURATION or for individual queries using a query hint. There are even other ways you can control this:

Options in the Cloud

If your SQL Server is hosted in the cloud, then most of this still applies. You still need to think about tuning parallelism – it isn’t done for you, and the defaults are the same – so probably not optimal for most workloads.

There are in general two flavours of cloud implementation. The first is Infrastructure as a Service (IaaS) where you simply have a VM provided by your cloud provider and run an OS with SQL server on top of it in that VM. Regardless of your cloud provider (e.g. Azure, AWS etc.), if you’re using IaaS for SQL Server then the same rules apply, and you go about tuning parallelism in exactly the same way.

The other type of cloud approach is Platform as a Service (PaaS). This is where you use a managed service for SQL Server. This would include Azure SQL Database, Azure SQL Database Managed Instance, and Amazon RDS for SQL Server. In these cases, the rules still apply, but how you manage these settings may differ. Let’s look at that for the three PaaS options mentioned above.

Azure SQL Database

This is a single SQL Server database hosted in Azure. You don’t have access to server level settings, so you can’t change MAXDOP or the cost threshold. You can however specify MAXDOP at the database level e.g.

ALTER DATABASE SCOPED CONFIGURATION SET MAXDOP = 4;

Cost threshold for Parallelism however is unavailable to change in Azure SQL Database.

Azure SQL Database Managed Instance

This presents you with something that looks very much like the SQL Server you are used to, you just can’t access the box behind it. And similar to your regular SQL instance, you can set MAXDOP and the Cost threshold as normal.

Amazon RDS for SQL Server

This is similar to managed instance. It looks and acts like SQL Server but you can’t access the machine or OS. You access your RDS instance through an account that has permissions that are more limited than your usual sa account or sysadmin role allows. And one of the things you can’t do with your limited permissions is to change the parallelism settings.

Amazon have provided a way around this though and you can change both settings using something called a parameter group:

Closing Thoughts

Parallelism is a powerful tool at our disposal, but like all tools it should be used wisely and not thrown at every query to its maximum – and this is often what happens with the out of the box settings on SQL Server. Tuning parallelism is not a knee-jerk reaction to high CXPACKET waits, but something we should be considering carefully in all our SQL Server implementations.

Acknowledgements

I wanted to update my original article to include the cloud options noted above, but didn’t have access to an Azure SQL Database Managed Instance to check the state of play. Thanks to TravisGarland via Twitter (@RockyTopDBA) and Chrissy LeMaire via the SQL community slack (@cl) for checking this and letting me know!

When you drop a database from a SQL Server instance the underlying files are usually removed. This doesn’t happen however if you set the database to be offline first, or if you detach the database rather than dropping it.

The scenario with offline databases is the one that occurs most often in practice. I might ask if a database is no longer in use and whether I can remove it. A common response is that people don’t think it’s in use, but can I take it offline and we’ll see if anyone screams. I’ll often put a note in my calendar to remove it after a few weeks if no-one has complained. When I do come to remove it, hopefully I’ll remember to put it back online before I drop it so the files get removed, but sometimes I might forget, and in an environment where many people have permissions to create and drop databases you can end up with a lot of files left behind for databases that no longer exist – these are what I’m referring to as orphaned files.

Obviously this shouldn’t happen in production environments where change should be carefully controlled, but if you manage a lot of development and test environments this can certainly occur.

So I created a script I can run on an instance to identify any files in its default data and log directories that are not related to any databases on the instance. Here it is:

I wouldn’t say that you can just go delete these once you’ve identified them, but at least now you have a list and can investigate further.

By the way, you might notice a nasty join statement in the above query. This is to deal with instances where the default directories have been defined with a double backslash at the end. SQL Server setup allows this and it doesn’t cause any day-to-day problems, but can make this sort of automation challenging. I’ve included it in this query as I’ve encountered a few people having this issue. In general I’d avoid such joins like the plague.

Making things more complicated

One complication can be where you have multiple SQL Server instances on the same server. This isn’t greatly recommended in production, but is common in dev\test. Where a database has been migrated from one instance to another, it’s possible that hasn’t been done correctly and the files still exist under the directories for the old instance and you might then see them as orphaned. I previously posted a script to identify such files:

Combining these three techniques makes it relatively easy to identify files that are probably no longer needed. You can get a list of all files that don’t belong to databases on the instances they live under, correlate that to any files that are down the wrong path for any of your instances, then look at what’s left over.

The SEQUENCE object was added to T-SQL in SQL Server 2012. It’s reasonably well known to DBAs, but less so to developers or those new to SQL, so I thought I’d produce a quick post to demonstrate its use.

In basic terms, a SEQUENCE is a way of generating a sequence of numerical values. The following are examples of sequences you could generate:

1, 2, 3, 4, 5 6, 7, 8, 9…

1, 6 , 11, 16, 21, 26, 31…

1000, 1001, 1002, 1003, 1004…

You can pick a starting number (and an ending number if you want), a data type (which might produce a natural maximum to the size of the sequence) and an increment (i.e. how much you want to be added to produce the next number in the sequence). There are other options, but I’m going to focus on the simplest use case. You can find the full documentation here:

So, let’s define my use case. I have a table to hold customer orders. For each record want to define an Order Reference Number of the format ORN0000000001.

Now you could implement something using an IDENTITY column to manage this – but there may be times when that is not ideal, for instance your table may not already have a suitable identity to use (you might have a unique identifier as the primary key) and if you want to store the actual reference then you’d need to add an IDENTITY column in addition to the reference column. Or you might need a reference that is unique across multiple tables.

The SEQUENCE object is also designed to be faster than IDENTITY, creating less blocking when you have a lot of concurrent inserts.

First of all, creating the sequence to generate the numeric part of my reference is easy. Let’s say that a bunch of reference numbers have already been used so I want to start with ORN0000100001

You can see that the OrderReference is defined with a default using our sequence object.

I insert a bunch of rows into the table. For the sake of this rather contrived example, I only need to specify the CustomerId. I do that by generating a bunch of random unique identifiers – one for each row in the sys.objects table.

INSERT INTO dbo.Orders (CustomerId)SELECT NEWID() FROM sys.objects;

Let’s have a look at an extract from the table:

You can see I’ve got a nice series of ascending, non-duplicating reference numbers.

One thing to note is that, while the sequence will generally produce unique number, it is still worth enforcing that in your table definition with a unique constraint i.e.

This prevents someone from issuing an UPDATE command that might create a duplicate reference. Also, once the sequence runs out of numbers it will repeat back at the beginning unless you specify NO CYCLE in the defintion of the sequence – obviously in most applications this is unlikely to be an issue if you’re using a bigint for the sequence.

There was also a bug in some versions of SQL 2012 and 2014 that meant a duplicate could get created when your server was under memory pressure:

This was fixed with SQL Server 2012 SP2 CU4 and SQL Server 2014 CU6 – but it’s better to be safe than sorry.

As a final note, it’s worth remembering that with the GDPR, these sorts of references are defined as personal data.That’s one good reason not to ever consider using these sorts of references as the primary key of your table (there are many others) – but also a reason why – where you already have an identity based primary keys that you could use to generate the references – it may be worth decoupling them from the primary key and basing them on a separate sequence instead.