But I came across some problems. I assumed that the DATEDIFF function I wrote worked this way: Subtract the two dates to get a timespan value and then return the number of seconds (rounded somehow) in that timespan.

And it would be good for what I need except that DATEDIFF using MILLISECOND will overflow for any timespan over a month. For example,

SELECTDATEDIFF(millisecond, '2017-11-01', '2017-12-01')

SELECT DATEDIFF (millisecond, '2017-11-01', '2017-12-01')

gives this error:

Msg 535, Level 16, State 0, Line 1
The datediff function resulted in an overflow. The number of dateparts separating two date/time instances is too large. Try to use datediff with a less precise datepart.

SQL Server 2016 introduced DATEDIFF_BIG to get around this specific problem. But I’m not there yet.

Use DATEADD

I eventually realized that I don’t actually need to measure a timespan. I really just need to answer the question “Does a particular DateTime occur before one second ago?” And I can do that with DATEADD

Update: Adam Machanic points out another benefit to this syntax. The predicate AND MyDateTime < DATEADD(SECOND, -1, GETUTCDATE()) syntax is SARGable (unlike the DATEDIFF examples). Even though there might not be a supporting index or SQL Server might not choose to use such an index it in this specific case, I prefer this syntax even more.

So How About You?

Do you use DATEDIFF at all? Why? I'd like to hear about what you use it for. Especially if you rely on the datepart boundary crossing behavior.

November 10, 2017

Devops, Agile and Continuous Delivery

Sometimes I get asked about work and our software development practices. Often these questions use words like agile, devops, or continuous delivery as in “Are you agile?” or “Do you do continuous delivery?”. But these questions rarely have yes or no answers. I always want to answer “It’s a work in progress”.

One of the things I like best about the PASS Summit is the opportunity to talk to people to find out how they do development. After speaking with others, it seems like everyone’s processes are always going to be works-in-progress. But I’ve come to realize that at D2L, we’re pretty far along. Here are just some of the things we do:

Things We Do

Deployments to production are scheduled and almost completely automatic. This lets us deploy way more often than we used to.

All code – including procedures and table definitions – are checked in.

Automatic tests are run on every pull request and merge.

Safety is a huge concern. We use feature flags and other techniques, but it remains difficult to maintain large complicated systems safely.

The database layer is an exception to this blue-green technique. So it’s very important to be able to rollback any changes or configuration.

Sardines and Whales

This means we must also support thousands of copies of our main database. They’re used for client sites, test sites, qa sites, or whatever. So that leads to a variety of server configurations that I refer to as sardines and whales:

Look at those sardines. They’re quite happy where they are. The server can handle up to a thousand databases when there’s almost no activity.

But that whale is on a huge server and is extremely busy. Because of the high volume of transactions, we sometimes encounter tempdb contention due to our frequent use of table valued parameters. One technique I’ve been looking forward to evaluating is using memory optimized table types.

Maybe We Can Use In Memory OLTP?

I’m actually not very interested in memory optimized tables. I’m much more interested in the memory optimized table types. Those types can be used for table valued parameters. I can’t tell you how excited I was that it might solve my tempdb pet peeve.

But our dreams for the feature died

We’re leaving the feature behind for a few reasons. There’s an assumption we relied on for the sardine servers: Databases that contain no data and serve no activity should not require significant resources like disk space or memory. However, when we turned on In Memory OLTP by adding the filegroup for the memory-optimized data, we found that the database began consuming memory and disk (about 2 gigabytes of disk per database). This required extra resources for the sardine servers. So for example, 1000 databases * 2Gb = 2Tb for a server that should be empty.

Another reason is that checkpoints began to take longer. Checkpoints are not guaranteed to be quick, but on small systems they take a while which impacts some of our Continuous Integration workflows.

At the PASS Summit, I talked to a Hekaton expert panel. I also talked to a couple people in the Microsoft SQL Server clinic about some of my issues. They all recommended that we upgrade to SQL Server 2016 (which we can’t yet). Maybe I didn’t phrase my questions well, but I didn’t come away with any useful strategy to pursue.

I later talked to a Speaker Idol contestant Brian Carrig (@briancarrig) after hearing him talk briefly about his experiences with In Memory OLTP. He mentioned his own hard-fought lessons with In Memory OLTP including some uncomfortable outages.

The final nail in the coffin, as it were, is that once you turn on In Memory OLTP, you can’t turn it off. Once the In Memory OLTP filegroup is added, it’s there for good. Like I said, safety is a huge concern for us so we’re giving up on the feature for now.

Resurrecting the Feature?

The feature was designed for whales, not sardines. Maybe someday we will try to get those sardine servers to not fail with In Memory OLTP. Until then, the feature goes back on the shelf.

September 11, 2017

“Above all else, show the data” says Edwarde Tufte. He’s the data visualization expert who promotes a high data-ink ratio in data visualizations. He describes data-ink as “the non-erasable core of a graphic”. In other words, avoid chartjunk.

SQL Junk

I buy that. Less is More. And we can apply that idea to SQL. If SQL is going to be maintained by a human, it’s best to use a style that is easy on the reader. There’s a lot of syntax in SQL that is redundant. By keeping only the non-erasable syntax in SQL statements, the SQL gets easier to understand and maintain.

Square Brackets

I know I’m not the only one who thinks so. I was watching Kendra Little’s presentation SSMS Shortcuts & Secrets. When someone asked whether there was a shortcut for removing brackets. There isn’t. Most people search and replace “[” with “” and then do the same thing with “]”. Kendra wondered if there was a regular expression that allowed a user to do both. There is, but it’s awkward because brackets need to be escaped: [\[\]] (which is only slightly easier to remember than ¯\_(ツ)_/¯).

SQL Junk in SSMS Generated Scripts

SSMS’s scripting engine is great at taking an object like a table and giving you a CREATE statement that will perfectly recreate what you need. But it’s a huge contributor of SQL Junk. I’ve seen so much SSMS-generated SQL get checked in to repositories that the style is assumed to be a best practice.

Here’s something that shows just how much SQL Junk can be removed. I think the resulting SQL is so much more clear and so much easier to maintain.

August 2, 2017

We encountered a CPU issue which took over a month to understand and fix. Since then, it’s been another couple months and so I think it may be time for a debrief.

The cause was identified as a growing number of ghost records that SQL Server would not clean up no matter what. Our fix was ultimately to restart SQL Server.

Symptoms and Data Collection

Here’s what we found.

At time marked ‘A’ on the graph, we noticed that CPU increased dramatically. It was hard not to notice.

We used sp_whoisactive to identify a frequent query that was taking a large amount of CPU. That query had never been a problem before. It was a select against a queue table – a table whose purpose was to store data for an implementation of a work queue. This table had a lot of churn: many INSERTS and DELETES. But it was small, never more than 100 rows.

So next, we ran the query manually in Management Studio. SET STATISTICS IO, TIME ON gave us a surprise. A simple COUNT(*) of the table told us there were 30 rows in the table, but reading it took 800K logical page reads!

What pages could it be reading? It’s impossible for a table to be that fragmented, it would mean less than one row per page. To look at the physical stats we ran:

Interesting. The ghost records that remain are version_ghost_records, not ghost_records. Which sounds like we’re using some sort of snapshot isolation (which we’re not), or online reindexing (which we are), or something else that uses row versions.

Over time, those version_ghost_records would constantly accumulate. This ghost record accumulation was also present in all other tables, but it didn’t hurt as much as the queue table which had the most frequent deletes.

Mitigation – Rebuild Indexes

Does an index rebuild clean these up? In this case, yes. An index rebuild reduced the number of version ghost records for the table. At the time marked ‘B’ in the timeline, we saw that an index rebuild cleaned up these records and restored performance. But only temporarily. The version_ghost_records continued to accumulate gradually.

At time ‘C’ in the timeline, we created a job that ran every 15 minutes to rebuild the index. This restored performance to acceptable levels.

More investigation online

Kendra Little – Why Is My Transaction Log Growing In My Availability Group?
This is a great video. Kendra describes a similar problem. Long running queries on secondary replicas can impact the primary server in an Availability Group (AG). But we found no long running queries on any replica. We looked using sp_whoisactive and DBCC OPENTRAN. We didn’t see any open transactions anywhere.

Calling Microsoft Support

So we called Microsoft support. We didn’t really get anywhere. We spoke with many people over a month. We generated memory dumps, PSSDiag sessions and we conducted a couple screen sharing sessions. Everyone was equally stumped.

After much time and many diagnostic experiments. Here’s what we did find.

Availability Groups with readable secondaries are necessary (but not sufficient) to see the problem. This is where the version_ghost_records come from in the first place. Readable secondaries make use of the version store.

We ran an extended event histogram on the ghost_cleanup event. There was a ridiculous amount of events. Like millions per minute, but they weren’t actually cleaning up anything:

Microsoft let me down. They couldn’t figure out what was wrong. Each new person on the case had to be convinced that there were no open transactions. They couldn’t reproduce our problem. And on our server, we couldn’t avoid the problem.

Resolution

Ultimately the time came for an unrelated maintenance task. We had to do a rolling upgrade for some hardware driver update. We manually failed over the Availability Group and after that, no more problem!

It’s satisfying and not satisfying. Satisfying because the problem went away for us and we haven’t seen it since. After the amount of time I spent on this, I’m happy to leave this problem in the past.

But it’s not satisfying because we didn’t crack the mystery. And restarting SQL Server is an extreme solution for a problem associated with an “Always On” feature.

If you’re in the same boat as I was, go through the links in this post. Understand your environment. Look for long running queries on all replicas. And when you’ve exhausted other solutions, mitigate with frequent index rebuilds and go ahead and restart SQL Server when you can.

July 24, 2017

IBM holds a monthly math/programming challenge called “Ponder This”.This month’s problem (July 2017) is an interesting one because it can (mostly) be solved using SQL and highschool math:

“Find a way to assign a unique digit to each letter such that the 863,172nd digit after the decimal point of sin(PONDER) and of sin(THIS) are not zeros, but more than a fifth of the first million digits of sin(PONDER)*sin(THIS) are zeros.

“The parameters are in degrees, and we count the digits from 1, so for example, sin(100) = 0.9848077530122… so the fifth digit after the decimal point of sin(100) is zero.”

If you’re up to the challenge, use any technique you like, but you’ve only got a week. Next week (August 1st) after IBM publishes their solution, I’ll publish the SQL based solution that I used.

Update: My Solution

First I realized that PONDER and THIS represent integers greater than 360 degrees. That’s okay. All I need to do is find two numbers A and B between 1 and 360 with the desired properties and then take that solution and add 360 to either number until I get a six digit number and a four digit number with no shared digits.

So my guess is that SIN(A) * SIN(B) is a regular number.
I can look for that using SQL:

Where HasDuplicate is a CLR function I wrote that returns whether a number has repeat digits.
This gives me a few solutions, for example PONDER = 102978 and THIS = 3654.

I get a kick out of these puzzles and as an added bonus, I get to exercise my SQL skills.

Update 2

Stephen pointed out in the comments that I didn’t even check to see whether the 863,172nd digit of SIN(PONDER) or SIN(THIS) is nonzero. How did I do that? Well, I didn’t use SQL. I used google. Searching for “What’s the exact value of SIN(18)” gives me this page Exact Value of sin 18°. The internet’s amazing. So I trust that the exact value is sin(18) = ( -1 + sqrt(5) ) / 4 which has the same 863,172nd digit as sqrt(5) / 4. The same reasoning works for SIN(54) because sin(54) = ( 1 + sqrt(5) ) / 4.

Rather than figure out how to calculate the 863,172nd digit of sqrt(5) / 4, I searched for and found the first million digits of sqrt(5). The internet’s amazing. I know that when you divide a number by four, any digit in that number usually has a limited scope of influence on the digits in the answer. I divided the digits around the 863172 place by 4 to convince myself that the 863172nd digit of sqrt(5) / 4 is not 0.

In 2015, Postgres added support for ON CONFLICT DO UPDATE to their INSERT statements. Their docs say that using this syntax guarantees an atomic INSERT or UPDATE outcome; one of those two outcomes is guaranteed, even under high concurrency.

Now, that is one amazing feature isn’t it? Postgres worked really hard to include that feature and kudos to them. I would trade ten features like stretch db for one feature like this. It makes a very common use case simple. And it takes away the awkwardness from the developer.

But if you work with SQL Server, the awkwardness remains and you have to take care of doing UPSERT correctly under high concurrency.

I wrote a post in 2011 called Mythbusting: Concurrent Update/Insert Solutions. But since then, I learned new things, and people have suggested new UPSERT methods. I wanted to bring all those ideas together on one page. Something definitive that I can link to in the future.

The Setup

I’m going consider multiple definitions of a procedure called s_AccountDetails_Upsert. The procedure will attempt to perform an UPDATE or INSERT to this table:

I have no concurrency concerns here but because of this issue – not to mention performance concerns – it’s no longer my first preference.

Performance and Other Concerns

When describing each pattern so far, I haven’t paid attention to performance, just concurrency. There are many opportunities to improve the performance of these solutions. In particular, pay very close attention to your average use case. If your workload UPDATEs a row 99% of the time and INSERTs a row only 1% of the time, then optimal implementation will look different than if the UPDATE and INSERT frequency is reversed.

April 18, 2017

My friend Josh came up with the term “dark queries”. Just like “dark matter”, dark queries can’t be detected directly, but their effect can still be observed. He’s talking about queries that aren’t stored in cache. If your monitoring solution relies heavily on the statistics of cached queries, then you may not be capturing all the activity on your server.

but some queries will fall out of cache or don’t ever make it into cache. Those are the dark queries I’m interested in today. Today let’s look at query recompiles to shed light on some of those dark queries that maybe we’re not measuring.

By the way, if you’re using SQL Server 2016’s query store then this post isn’t for you because Query Store is awesome. Query Store doesn’t rely on the cache. It captures all activity and stores queries separately – Truth in advertising!

High Recompile Rate?

If you work with a high-frequency transactional workload like I do, then you can’t afford the CPU required for frequent recompiles. If you have sustained recompiles larger than a few hundred per second, that’s probably too much. It’s easy to check. Use the performance monitor to take a look at the SQL Re-Compilations/sec counter which is found in SQLServer:SQL Statistics/sec.

Drill Into Recompile Causes
You can drill into this a little further with an extended event session stored to a histogram like this:

Infrequent Recompiles?

Even if you don’t have frequent recompiles, it’s important to know what’s going on in the server. On your OLTP database, I bet you’re reluctant to let those BI folks run their analytical queries whenever they want. They may try to hide their shenanigans with an OPTION(RECOMPILE) hint. This extended events captures those query recompiles. Be sure to only capture a small sample by having the session run for a small amount of time. Or you can use the histogram above to make sure that the frequency is low.

February 22, 2017

I’ve got a program here that finds SQL in procedures with missing column specifications.

Update 10/26/2018: See comments for a quick TSQL-only solution

Table Columns Are Ordered

Even though they shouldn’t be.

Unlike mathematical relations, SQL tables have ordered columns, but please don’t depend on it.
In other words, try to treat these tables as the same because it’s super-awkward to turn one into the other:

CREATETABLE PEOPLE
(
LastName varchar(200),
FirstName varchar(200))

CREATE TABLE PEOPLE
(
LastName varchar(200),
FirstName varchar(200)
)

CREATETABLE PEOPLE
(
FirstName varchar(200),
LastName varchar(200))

CREATE TABLE PEOPLE
(
FirstName varchar(200),
LastName varchar(200)
)

Don’t Omit Column Specification

And don’t forget to specify the columns in your INSERT statement. No excuses.

You’re depending on the column ordering if you write INSERT statements like this:

Find Missing Column Specifications

Thomas LaRock recently encouraged DBAs to branch out horizontally. In that spirit, don’t be too afraid of the C#. I’ve got a program here that finds procedures with missing column specifications.

If for some reason, you don’t care about enforcing this rule for temp tables and table variables, then uncomment the line // visitor.TolerateTempTables = true;

It uses ScriptDom which you can get from Microsoft as a nuget package.

The performance is terrible in Visual Studio because ScriptDom uses Antlr which uses exceptions for flow control and this leads to lots of “first chance exceptions” which slows down debugging. Outside of Visual Studio, it’s just fine.

Another recap of a problem we recently tackled where we had to find out why a doomed transaction was still trying to do work. A doomed transaction is one that is about to roll back. In our case, it was caused by a rare set of circumstances:

Getting Error 3930

Msg 3930, Level 16, State 1, Line 14
The current transaction cannot be committed and cannot support operations that
write to the log file. Roll back the transaction.

Google pointed me to this Stackoverflow question where Remus Rusanu talks about doomed transactions. You’re not allowed to do work in a catch block if your transaction is doomed. He then points to more information about uncommittable transactions in the TRY/CATCH docs. According to the TRY/CATCH docs, you can get this kind of error when XACT_ABORT is turned ON. This is something we do and I wrote about why in Don’t Abandon Your Transactions.

So there’s a CATCH block catching some error in a TRY block. Now I wonder what that error is. The SqlException that got raised to the client only had the 3930 error above. It had nothing about the original error. I did have the procedure name though. I can look there.

Complicated Procedure

So I took a look at the procedure getting called. I saw that it was calling another procedure. And that procedure called other procedures. It was a big complicated mess. The calling tree looked like this:

Something Complicated

So the procedure was complicated and it used explicit transactions, but I couldn’t find any TRY/CATCH blocks anywhere! What I needed was a stack trace, but for T-SQL. People don’t talk about T-SQL stack traces very often. Probably because they don’t program like this in T-SQL. We can’t get a T-SQL stack trace from the SQLException (the error given to the client), so we have to get it from the server.

More Info Needed From the Server

So luckily, extended events do have T-SQL stack traces. I wanted to look at those stack traces whenever there was an error on the server. My session looked something like this:

Eventually It Failed Again

Woohoo!
First I looked for the error. The one where error_code = 3930. I did it using Management Studio to view the session’s data.
Then I looked at the error immediately before it. The one where is_intercepted = true. That looks something like this

The Original Error

Here’s what I found. The original error was a PK violation in a procedure using the Just Do It (JDI) pattern. It’s a technique which tackles upsert concurrency problems by inserting a row and catching and suppressing any PK errors that might be thrown.
So that’s where the TRY/CATCH block was.

Also, the stack trace told me that the procedure was called from a trigger. So my complicated procedure was actually more complicated than I thought. It actually looked like this

Something More Complicated

That’s why i didn’t see it before. I hate the way triggers can hide complexity. They’re sneaky.

The CATCH block catches a real error and the XACT_ABORT setting dooms the transaction. Because I didn’t do anything with the error, the transaction was allowed to continue. It was actually some query in s_ProcM that attempted to do work inside the doomed transaction.

December 22, 2016

Merry Christmas readers! It’s story time. This is about a problem I encountered a few weeks ago. We were looking at a production site using sp_whoisactive and we noticed a lot of blocking on one particular procedure. I’m going to explain how we tackled it.

In this case, I think it’s interesting that we were able to mitigate the problem without requiring sysadmin access.

The Symptoms

Using sp_whoisactive and other tools, we noticed several symptoms.

SQLException timeout errors were reported by the app when calling one procedure in particular.

Many sessions were executing that procedure concurrently. Or at least they were attempting to.

There was excessive blocking and the lead blocker was running the same procedure.

The lead blocker had been running the longest (about 29 seconds)

The blocking was caused by processes waiting on Sch-M locks for a table used by that query

Here’s what was going on:
SQL Server was struggling to compile the procedure in time and the application wouldn’t let it catch its breath. The query optimizer was attempting to create statistics automatically that it needed for optimizing the query, but after thirty seconds, the application got impatient and cancelled the query.

So the compilation of the procedure was cancelled and this caused two things to happen. First, the creation of the statistics was cancelled. Second, the next session in line was allowed to run. But the problem was that the next session had already spent 28 seconds blocked by the first session and only had two seconds to try to compile a query before getting cancelled itself.

The frequent calls to the procedure meant that nobody had time to compile this query. And we were stuck in an endless cycle of sessions that wanted to compile a procedure, but could never get enough time to do it.

Why was SQL Server taking so long to compile anyway?
After a bunch of digging, we found out that a SQL Server bug was biting us. This bug involved

SQL Server 2014

Trace flag 2389 and 2390

Filtered Indexes on very large base tables

Kind of a perfect storm of factors that exposed a SQL Server quirk that caused long compilation times, timeouts and pain.

What We Did About It

Well, in this case, I think that the traceflag 2389, 2390 kind of outlived its usefulness (the cure is worse than the disease and all that). So the main fix for this problem is to get rid of those traceflags. But it would be some time before we could get that rolled out.

So for the short term, we worked at getting that procedure compiled and into SQL Server’s cache.

We called the procedure ourselves in Management Studio. Our call waited about thirty seconds before it got its turn to run. Then it spent a little while to compile and run the procedure. Presto! The plan is in the cache now! And everything’s all better right? Nope. Not quite. The timeouts continued.

If you’ve read Erland Sommarskog’s Slow in the Application, Fast in SSMS you may have guessed what’s going on. When we executed the procedure in SSMS, it was using different settings. So the query plan we compiled couldn’t be reused by the application. Remember, all settings (including ARITHABORT) need to match before cached plans can be reused by different sessions. We turned ARITHABORT off in SSMS and called the procedure again.

After a minute, the query completed and all blocking immediately stopped. Whew! The patient was stable.

The whole experience was a pain. And an outage is an outage. Though the count of the snags for the year had increased …