A basic approach to SQL Server administration


One of my servers wasn’t rebooted when it should have been and I never realized it until after the outage was over. The big deal wasn’t that it didn’t reboot, it was that I never realized it. These are the things that bug me, so I used it as an excuse to learn a bit of PowerShell.

A little backstory to give the details a better chance of making sense to everyone… Instead of having a couple of overworked sysadmins doing Windows updates for every server, we each take care of our own, with a sysadmin verifying that all updates were applied. It’s the first I’ve heard of it being done this way, but I like it.

To keep things consistent, the database team has a policy of rebooting all database servers regardless of WSUS requiring it. One month most of the servers required a reboot and bounced automatically, except for one that didn’t require a reboot. After the first 10 or 15 rebooted, I got lazy and started just verifying the servers were up, SQL was responding, all updates were applied, and there was no pending reboot.

It wasn’t until I was reading the logs the next business day that I realized that server was missing all of the chatter from SQL services starting up. Like I said, the biggest issue was me being frustrated that it snuck by me, and I didn’t want that to happen again. I wanted my solution to be independent of SQL Server because not all of the computers I do updates on have a SQL engine.

Wanting an excuse to learn PowerShell better, I talked to our Exchange guy (they know their PowerShell) and got a basic script to determine a computer’s boot time given its name. I expanded that to cover an entire AD group, then a list of AD groups, and finally wildcards to find AD groups. Wildcards work best for me because I’m reorganizing my SUS groups and all of mine have “SQL” in the group’s name, even for the app servers I’m responsible for.

That resulted in the following code. Now I’m absolutely certain that a PowerShell or AD guru can poke a ton of holes in my logic, but that’s what we have gurus for! To be fair, I’d do the same to their T-SQL. If any of those gurus are out there reading this DBA’s blog, the comment section below is how you fix my code and stop me from spreading bad code like a virus.
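The script itself didn’t survive in this copy of the post, so here’s a minimal sketch of the kind of PowerShell involved. The “SQL*” filter and the specific cmdlets are my assumptions about the approach, not the original code:

```powershell
# Sketch: find computers in AD groups matching a wildcard and report last boot time.
# Assumes the ActiveDirectory module (RSAT) is installed; adjust -Filter to your naming.
Import-Module ActiveDirectory

Get-ADGroup -Filter 'Name -like "SQL*"' |
    Get-ADGroupMember |
    Where-Object { $_.objectClass -eq 'computer' } |
    ForEach-Object {
        $os = Get-CimInstance -ClassName Win32_OperatingSystem -ComputerName $_.Name
        [PSCustomObject]@{
            Computer     = $_.Name
            LastBootTime = $os.LastBootUpTime
        }
    } |
    Sort-Object LastBootTime |
    Format-Table -AutoSize
```

Any server whose LastBootTime predates your maintenance window stands out immediately.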

There are relatively equivalent things you can do in SQL Server, but that’s assuming you’re running a SQL Server engine on every server you care about. Even from the viewpoint of a DBA running the SUS updates on my own servers, this doesn’t hold up. I still care about my monitoring application server and several others.

There’s also some more bad news with these beyond requiring the SQL Server engine.

The create date of TempDB is useful, but TempDB is created on startup of SQL services, not the computer. This means if you restart SQL services you get a new date, even if the computer itself never rebooted, which defeats the purpose when your policy is to restart every computer during the SUS outage whether it’s required or not.
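You can see the service start time either way; a quick sketch:

```sql
-- TempDB is recreated when SQL services start, so this is service start, not boot time.
SELECT create_date AS SqlServicesStarted
FROM sys.databases
WHERE database_id = 2; -- TempDB

-- sys.dm_os_sys_info (SQL 2008+) reports the same thing more directly.
SELECT sqlserver_start_time
FROM sys.dm_os_sys_info;
```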

The WMIC call gets around that and tells you when the server came up, but it relies on xp_cmdshell. This extended procedure has a bad rep for being a security loophole, even though by default you need enough permissions to turn it on before you can run it. Because of that, it’s turned off in many environments, and flipping it on for a second can be frowned upon.
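For reference, the call I’m talking about looks like this; it returns the OS boot time rather than the service start time:

```sql
-- Requires xp_cmdshell to be enabled; WMIC reports when Windows last booted.
EXEC xp_cmdshell 'wmic os get lastbootuptime';
```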



I hinted at file growths in my previous posts about shrinking data and log files. Then I talked about growing log files in my post Database Log VLFs. However, there are still some unanswered questions about growing data files.

Should I use autogrowth? What should I set it to grow by?

Why do manual growths, and how do I know how much to grow by or when a growth is needed?

Can I and should I use Instant File Initialization? If I can and should, how do I even know if it’s turned on?

Leave Autogrowth On?

Yes, absolutely. This isn’t a question.

This could be worded as “Do you want a slight pause or do you want your application to fail?” The answer is a little more obvious when you word it that way. The only time I’ve seen an appropriate answer of “No” is when you have multiple data files in a filegroup and only want one to grow.

To answer how long this pause will be, you have to ask two questions. Are you able to use Instant File Initialization? If you can’t, how much are you growing by? I’ll discuss Instant File Initialization later in this post, but it’s basically the difference between “this space is yours immediately” and writing out zeros first to be paranoid about security.

There are competing factors as to how large you want autogrowth to be. The only real consensus is that the defaults are horrible. You need to balance how long a pause a growth will take, how long a pause your users will tolerate, how fast your database is growing, how much fragmentation smaller growths will cause, and how much fragmentation you will tolerate.

By default, data files grow by 1 MB and log files grow by 10%. I covered log file growths rather well in my post Database Log VLFs, so we’ll focus on the data files here. Think about it: if you have a 100 GB database that got there by the defaults, it got there in 102,400 pieces. Not good.

If the defaults are horrible, what’s good? Tiny is bad, huge is bad, and percentage-based growth just lets you automatically drift from tiny to huge. Do your balancing act I talked about above. If growing by 64 MB will keep you good for several weeks, set it to 64 MB. If 1 GB goes by rather quickly, you can’t use Instant File Initialization, and the users can tolerate the pause it takes to grow by 1 GB, then this might be the best growth rate for you. No single answer is right, but at least you have a better idea of what the variables are.

There are also cases where there are no good answers. Say you have an app where it wouldn’t be uncommon to grow by a couple GB a day, you can’t use Instant File Initialization, and pauses of more than a second wouldn’t go over very well. You don’t want the fragmentation of setting the growth to 64 MB, but you don’t want the pause of setting it to 1 GB. Maybe you’ll settle for 256 MB, but you’re not happy about it. Manual growths are the real answer, but you want a safety net that won’t kill you with pauses or fragmentation. Not everyone is in that situation, but some of us are.

If I’m going to suggest leaving Autogrowth on and suggest that you be aware of the size of these growths, then the least I can do is give you a script for that. This will tell you the size of every database file and the autogrowth rate. I’ll often use it to tweak the growth rates, so I script that change out in the results as well. If you read the post Database Log VLFs then it will make more sense why I use 1,000 MB here instead of 1,024 MB; it wasn’t a typo.
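The script itself is missing from this copy, so the sketch below is my reconstruction of what it does; the scripted ALTER with a 1,000 MB growth is an illustration, not the original:

```sql
-- Sketch: current size and autogrowth rate for every file, plus a scripted change.
-- The 1,000 MB suggestion (not 1,024) is explained in the Database Log VLFs post.
SELECT DB_NAME(mf.database_id) AS DatabaseName
    , mf.name AS LogicalFileName
    , mf.physical_name
    , mf.size / 128 AS SizeMB               -- size is stored in 8 KB pages
    , CASE WHEN mf.is_percent_growth = 1
           THEN CAST(mf.growth AS VARCHAR(10)) + '%'
           ELSE CAST(mf.growth / 128 AS VARCHAR(10)) + ' MB'
      END AS GrowthRate
    , 'ALTER DATABASE [' + DB_NAME(mf.database_id) + '] MODIFY FILE (NAME = ['
        + mf.name + '], FILEGROWTH = 1000MB);' AS ScriptedChange
FROM sys.master_files mf
ORDER BY DatabaseName, mf.type, mf.name;
```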

Manual Growths

If you want to be extreme about it, autogrowth is like asking someone about wearing a seatbelt. Do you want it to hurt a little or a lot if you “crash” by running out of allocated space? Well, you want it to hurt…wait…why do I have to crash?

This may seem over-the-top, but some of the applications may have databases like the last one I talked about in the autogrowth section. Typically we’d prefer not to have autogrowth handling everything to avoid the pause while it grows, but there are cases like that where it’s more than a slight preference to avoid the pause.

So the autogrowths are the seatbelts we put on just in case, while catching everything with manual growths is our real goal of avoiding that “crash”. However, how are you supposed to know when to grow a file or how much to grow it by? Seeing how large your files are now doesn’t tell you how much they’ll grow in the next couple of months; they could be static or growing exponentially.

The partial answer is to look at how much data was used before, how much data is being used now, and predict how much data will be used in the future using that trend. A better answer is to take the peak usage over a week and compare that to past weeks, which accounts for things that cause spikes in file usage such as index rebuilds or data loads.

If you have monitoring software, it may have that answer just waiting for you to query it. That’s the situation I find myself in right now, and I wrote Database Growth Trends – Idera Diagnostic Manager talking about how I handle it right now. However, I used to work with monitoring software that pretty much only told me OS-level info, so we watched this data manually as was discussed in my post Monitoring Database and Table Sizes.

To be clear, many databases are fine with autogrowth handling everything. Do what’s best for you for each individual database. I personally have a mix of databases where some I’d be upset if autogrowth kicked in because it would cause timeouts that the app handles extremely poorly, and others, such as my DBA database, where a pause would only affect my collection jobs or a query I’m personally running. Autogrowth is not inherently evil, so only be paranoid where paranoia is called for.

Performing Manual Growths

Manually growing the file is just setting a new initial size. To do that in SSMS, right-click on a database, properties, files, then change the initial size to how large you want it to be.

I’m not a big fan of having SSMS do things for me, so I’ll use “Script Action to New Query Window” and run it from there. This will give me a script like this:
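The scripted-out growth looks something like this; the database and logical file names are placeholders, and yours will differ:

```sql
-- SSMS scripts the new size in KB; 256000 KB here is 250 MB.
USE [master];
GO
ALTER DATABASE [MyDB]
    MODIFY FILE (NAME = N'MyDB_Data', SIZE = 256000KB);
GO
```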

One thing I wish I knew about years before I did is that the size in KB also accepts sizes in MB. I write just about everything in MB, so it’s more natural for me to write that script with Size = 250MB instead. I know I’m being petty, but I’m petty and comfortable.

This script could take a couple milliseconds or several seconds, even for 250MB. It all depends on if you have Instant File Initialization turned on and if you’re able to use it for this growth.

Instant File Initialization – What is it?

Short answer, this makes growths exponentially quicker by opening an often-acceptable small security hole.

Of course, I always have a long answer…

By default, all space used by SQL Server is zeroed out before control is handed over. This makes sure that it’s clean space for two basic reasons. First, the database requires that space to be clean because of how it plans to use it. Second, we don’t want to risk the possibility of special commands being able to pull back unencrypted sensitive data that used to be there.

Now, the first reason isn’t something we have much control or say over at this point in the conversation. If you’re growing a log file then it needs to be zeroed out, and SQL Server will zero out each VLF when it’s marked for reuse, too. For data files, if you’re using Transparent Data Encryption (TDE) then the way SQL Server stores the data also requires the space to be zeroed out. These things are outside of the conversation on file growths; it’s just what we have to deal with at this point, and SQL Server won’t use IFI for them even if it’s turned on.

The one thing that will be pulled into this discussion is the security aspect of using uninitialized space. Kimberly Tripp demonstrates this best in her post Instant Initialization – What, Why and How? in the “Why isn’t this on by default?” section.

I’ll try to make it short and sweet if you don’t want to read yet another post (I recommend you read it). You can ask SQL Server to read specific pages off the disk. If SQL Server never actually used that page then it contains what was there when it was given to SQL Server. Zero initialized disks are very boring info, just “0000000000….”. Not much fun.

Uninitialized space is more fun (depending on your meaning of the word) and could contain old data. If you had unencrypted sensitive data sitting there before, it could contain enough to steal someone’s identity. You never know. These commands will only work if you have the permissions to run them, but who has those permissions and how much do you trust them? Most organizations say it’s fine, but some won’t.

Instant File Initialization – How do I use it?

This is an odd feature that you can’t just see in a table in SQL Server. The easiest way is to run an xp_cmdshell command and look for it, but you can optionally turn on trace flags, grow or create a database, turn off the trace flags, then look in your error log.
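The trace-flag route looks roughly like this; flags 3004 and 3605 write zeroing activity to the error log, so a growth that logs no zeroing for the data file means IFI kicked in. This is a sketch of the technique with a hypothetical database name, so try it somewhere safe:

```sql
-- Log file-zeroing activity to the error log, grow a file, then check the log.
DBCC TRACEON(3004, 3605, -1);
GO
ALTER DATABASE [MyTestDB]   -- hypothetical database name
    MODIFY FILE (NAME = N'MyTestDB_Data', SIZE = 512000KB);
GO
DBCC TRACEOFF(3004, 3605, -1);
GO
-- If the error log shows "Zeroing" only for the log file, IFI is working.
EXEC sp_readerrorlog 0, 1, 'Zeroing';
```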

As for me, I’ll run the query below as a multiserver query in SSMS and look at the results. If that sounds foreign to you, check out my Multiserver Queries post, it’s a very simple and useful technique for specific tasks such as this one. In fact, I wrote that post so I could link to it in this post.
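The query itself is missing from this copy; a sketch of the common approach is below. It shells out to whoami /priv and looks for SeManageVolumePrivilege, which is the “Perform volume maintenance tasks” right that grants IFI:

```sql
-- Requires xp_cmdshell; checks whether the service account holds the IFI privilege.
DECLARE @Priv TABLE (OutputLine NVARCHAR(4000));

INSERT INTO @Priv
EXEC xp_cmdshell 'whoami /priv';

SELECT CASE WHEN EXISTS (SELECT 1 FROM @Priv
                         WHERE OutputLine LIKE '%SeManageVolumePrivilege%')
            THEN 'IFI available'
            ELSE 'IFI NOT available'
       END AS InstantFileInit;
```

Run as a multiserver query, this gives you one row per instance telling you where IFI is in play.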

However, even if IFI is turned on, it’s only going to work for data files that aren’t encrypted with TDE. Logs and TDE data files need the zeros. Also, if any database on your instance is using TDE then so is TempDB, which means its data files can’t use IFI either.

If IFI is turned off and it’s proper to turn it on for this server, here are your steps to turn it on:

Open Administrative Tools / Local Security Policy

Local Policies

User Rights Assignment

Open “Perform volume maintenance tasks”

Add the user for the SQL Server service account

If IFI can’t be used, here are your steps (which are also good even if IFI is being used):

Monitor database sizes for manual growths

Grow files manually

The SQL Server service account, if you don’t know which account it is, can be found in Administrative Tools / Services / SQL Server (InstanceName) under “Log On As”. There are easier ways to find the service account name in bulk if you’re doing this a lot, such as a slight variation of the query above.
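One such bulk-friendly variation, on SQL Server 2008 R2 SP1 and later, is the DMV that reports service accounts:

```sql
-- sys.dm_server_services shows the account each SQL-related service runs under.
SELECT servicename, service_account, status_desc
FROM sys.dm_server_services;
```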

xp_cmdshell Rant

Now, to be clear on xp_cmdshell which I used in the script above…. I do believe in security, but I don’t believe disabling this feature is the security you’re looking for. When it’s set up properly, disabling xp_cmdshell is like locking the door to your data center’s supply closet with a keyless lock instead of limiting access to the data center.

I do have issue with granting execute permissions on this to non-sysadmin accounts. I also have issue with granting sysadmin to anyone who asks, including vendors who insist that’s the only way their app will work. This means the only users that can use it are the very, very few with sysadmin. Also, any sysadmin can turn it on anyways, so the only extra security you get is a message in the error log saying it was turned on, but not even who did it.

My recommendation is that if you need it, leave it on. If you don’t, leave it off. If you have a one-time script that needs it, such as checking for IFI, turn it on just long enough to do what you need to do, then automate turning it off. The script follows this recommendation; it leaves xp_cmdshell on if it was already on, but turns it back off if it had been off.
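That on-use-off pattern is just a few sp_configure calls; here’s a sketch of the shape of it:

```sql
-- Capture current state, enable xp_cmdshell only if needed, do the work, then restore.
DECLARE @WasOn INT;

SELECT @WasOn = CAST(value_in_use AS INT)
FROM sys.configurations
WHERE name = 'xp_cmdshell';

IF @WasOn = 0
BEGIN
    EXEC sp_configure 'show advanced options', 1; RECONFIGURE;
    EXEC sp_configure 'xp_cmdshell', 1; RECONFIGURE;
END;

EXEC xp_cmdshell 'whoami /priv';  -- the one-time task

IF @WasOn = 0
BEGIN
    EXEC sp_configure 'xp_cmdshell', 0; RECONFIGURE;
END;
```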

It’s time for me to admit I’m going off on a tangent, so I’ll pass you off to Sean McCown’s (b|t) Security Theater post if you’re too fired up to stop now.

In Summary

Autogrowth – Yes. Just as a fail-safe if performance is important to you. Limit the size to limit the performance impact, but don’t cause undue fragmentation.

Manual Growth – Yes. Trend how fast you’re growing so you know how to stay ahead of it.

Instant File Initialization – Usually. Understand the limitations and security risks, and turn it on if it’s the right thing to do.

Entry-Level Content Challenge

This is my fourth post in Tim Ford’s Entry Level Content Challenge. I went through much of my career with people saying to manually grow files, but it was left at that. I had no clue how to see how much space was used without clicking through the GUI, and passed it off as being ridiculous. Now I’ve found ways to make it easy, and I’m betting there are a lot of people out there doing too much work or skipping this simply for not knowing how.

Read over Tim’s challenge and consider joining in. It’s a great way to help people get past barriers you had in the past while learning how to clear them a little more efficiently yourself. Anyone who’s reading this post already knows how to clear some barriers that others are viewing as an impassible brick wall, so you can make a difference for them.


How fast are your databases growing? Should I panic about a 1 TB database that has 100 GB free on disk? Am I safe with a 100 GB database that also has 100 GB free? Based on those sizes…I have no idea.

It’s possible that the 1 TB database is purging data as fast as it takes it in, so it’s been at 1 TB for a long time and may never use the 100 GB it has free. Looking at the 100 GB database, it may have been a new 1 MB database created two months ago and grew 1 MB at a time to get to where it’s at now. (Bonus points if you panicked reading that last sentence.)

The thing to look for is how big that database was before. What did you write down for how much space it used last week, last month, and several months ago? Remember all those stockbroker commercials saying “past performance doesn’t predict future results,” which is supposed to lead you to think “Yeah, but it’s the best indicator I have”? The same goes for database growths. So, you did write down what it was before, right?

My Default Monitoring Message

I’ll copy/paste this section on multiple posts, so feel free to skip it if you’ve seen it before.

The biggest issue is the word “before”. If you weren’t watching your servers before, then you have nothing to compare the current state of your server to. On the flip side, if you’re watching them consistently, you’ll be able to make an apples-to-apples comparison.

Your production servers should have some kind of monitoring software in place, and the monitoring software is already capturing this type of information to help make all the pretty charts and graphs you see through the GUI. You need this software to run so you can get alerted when things go wrong, and pulling this data will just make it more valuable.

Monitoring software doesn’t perform magic, so you can pull this data on your own if you don’t have any. However, that’s taking on a lot of responsibility, time, and effort. If something is collecting the data already, take advantage of that.

For me, I use Idera Diagnostic Manager, so my queries are going to look at the data collected using this tool. Other monitoring tools are going to collect the same basic data in a database you can read.

If anyone from one of these companies wants to provide me with scripts that fill the same need, I’ll write them up and give you credit for supporting your product like that. I can only write against what I can test against; it’s nothing personal.

What I’m Watching

I want to know when my files will hit an autogrowth and when they’ll run out of space.

Running out of space is pretty obvious. If you’re out of space, every command that needs more space will fail, and users will run to the supply closet for torches and pitchforks. It’s a bad day for the DBA.

Autogrowth is less of an issue and less obvious. That’s not saying it’s not an issue, especially for larger databases that can’t take advantage of IFI. The command that needs more space, and any that pile up behind it, will wait patiently (barring any timeouts) for the growth to take place. However, SQL Server can be more patient than your users or applications.

That being said, if I knew ahead of time that I’d need more space than is allocated, a simple script would let me add it when no one was waiting for the space. No complaints, just make the files larger and move on with my life. Oh, happy days.

To get all of this information I need to know a couple things:

How much space is used in each filegroup

How much space is allocated to each filegroup

How much can each filegroup grow according to disk free space and growth rates

Idera DM Captures That

Well, Idera does a pretty good job. It captures the data size (used), data expansion (growth by growth rate and disk free space), and data file size (allocated). The problem is that it captures this stuff on the database level, not the filegroup level. This means that it’s possible for a filegroup to run out of space while the database as a whole has tons of free space.

I’m currently working in an environment where all data files for a single database are on the same drive, so it’s less concerning to me. It is something to keep in mind, and may be very important to your environment. If this is critical to you, I touch on how to capture this data yourself below.

If separating filegroups isn’t an issue for you (it often isn’t), then all you have to do to trend growths with Idera is write a rather long script that hits SQLdmRepository..DatabaseSize a couple of times. My script goes a little like this:
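The actual script is missing from this copy; the sketch below shows the shape of the approach. The column names (DatabaseID, DataSizeInMB, UTCCollectionDateTime) are my guesses at the repository schema, so verify them against your own SQLdmRepository before trusting any numbers:

```sql
-- Sketch: compare the newest collected size to the size ~30 days back per database.
-- Table and column names are assumptions about the Idera DM repository schema.
SELECT d.DatabaseID
    , CurrentSize.DataSizeInMB AS CurrentMB
    , Past30.DataSizeInMB      AS MB30DaysAgo
    , CurrentSize.DataSizeInMB - Past30.DataSizeInMB AS GrowthLast30Days
FROM (SELECT DISTINCT DatabaseID FROM SQLdmRepository..DatabaseSize) d
    CROSS APPLY (SELECT TOP 1 s.DataSizeInMB
                 FROM SQLdmRepository..DatabaseSize s
                 WHERE s.DatabaseID = d.DatabaseID
                 ORDER BY s.UTCCollectionDateTime DESC) CurrentSize
    CROSS APPLY (SELECT TOP 1 s.DataSizeInMB
                 FROM SQLdmRepository..DatabaseSize s
                 WHERE s.DatabaseID = d.DatabaseID
                     AND s.UTCCollectionDateTime <= GETUTCDATE() - 30
                 ORDER BY s.UTCCollectionDateTime DESC) Past30;
```

The full version repeats that pattern for the 90, 180, and 360-day marks and folds in the expansion and file-size columns.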

What this does is look at the trends based on what the database used 30, 90, 180, and 360 days ago to try to predict the future. By “predict the future” I mean it’s not perfect, and can’t be perfect. However, it gives you very good data with a few false positives and very few false negatives.

Speaking of this not being perfect, it double-dips. If there are two databases growing on the same drive with 10 GB free, it says that each one has 10 GB of growth available. I was not able to rewrite this to take that into account while also accounting for databases with multiple files in a single filegroup spread across multiple drives.

Now I’m two weeks ahead of projected file growths and over three months ahead of projected out-of-space errors.

This is very basic information, and I’d have trouble imagining any SQL monitoring software not capturing this in a table you can query.

If I Didn’t Have Idera…

If I didn’t have Idera, or any other monitoring software I could query, I’d be going out at least once a day and pulling the allocated and used space from every file on every server. I may want to do it more often to find peak usage, such as when online index rebuilds are using twice their typical space over the weekends; I take advantage of Idera capturing this hourly. I’ll capture it using something like this:
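The capture query didn’t make it into this copy; a common version uses FILEPROPERTY and runs in each database (via sp_MSforeachdb or similar), roughly like this sketch:

```sql
-- Allocated vs. used space for every file in the current database.
-- FILEPROPERTY(...,'SpaceUsed') returns 8 KB pages, so divide by 128 for MB.
SELECT DB_NAME() AS DatabaseName
    , f.name AS LogicalFileName
    , f.physical_name
    , f.size / 128 AS AllocatedMB
    , CAST(FILEPROPERTY(f.name, 'SpaceUsed') AS INT) / 128 AS UsedMB
FROM sys.database_files f;
```

Dump the results into a table with a collection date and you have your own growth history.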

Then I’d need to see how much free space I have on each drive, so I’d go to Ed Wagner’s article Time and Space: How to Monitor Drive Space in SQL Server. Knowing that we use MountPoints in places, I’d have to change his script to use “volume” instead of “logicaldisk” (also mentioned in his comments). This minor change is about the best case scenario you’ll run across.

Remember, the scripts you find on the internet are typically designed to work perfectly on the AUTHOR’s system, not yours. It’s not Ed’s fault they don’t use MountPoints; in fact, he should probably be congratulated for not making his environment more complex than it needs to be. He should also be congratulated for not making a change to his article that he can’t properly test. At least you know you have to be extra careful testing this change instead of being led to believe the author did something they couldn’t do properly.

After getting the drive’s free space, I’d have to look at the file growth rates to see how large my files could grow. Keep in mind that growth in sys.database_files could be either a percentage or the number of 8kb pages the file will grow by, depending on the value in is_percent_growth. Static growths are easy, just divide by 128 and that’s how many MBs it will grow by. It’s easier (and better for your server) to change percentage growths to MB than it is to look up the logarithmic functions used to calculate compound interest that also apply here.
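Converting those growth settings to MB looks like this sketch:

```sql
-- Static growths: growth is in 8 KB pages, so dividing by 128 gives MB.
-- Percentage growths: flag them so you can change them to a fixed MB value.
SELECT DB_NAME(database_id) AS DatabaseName
    , name AS LogicalFileName
    , CASE WHEN is_percent_growth = 1
           THEN CAST(growth AS VARCHAR(10)) + '% (consider changing to MB)'
           ELSE CAST(growth / 128 AS VARCHAR(10)) + ' MB'
      END AS GrowthPerEvent
FROM sys.master_files;
```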

Now that you have the free space on the drive and the growth rates on the files, you can calculate the maximum size the file can grow to. Use this with trends on the used space to calculate when that size won’t be enough for you.

Make sure you test your results, you’ll have to support it.

So Your Database Is Growing…

Databases get larger, it’s what they do. A normal DBA then goes through and makes sure that there’s room for that growth. A better DBA will do that, too, but then follow it up by figuring out why the database is growing and try to justify it.

I have seen way too many databases out there that don’t purge old data that’s useless to the company. Sometimes the purge was never written; sometimes the code to do it exists but is never called, for several somewhat humorous reasons. Any way you look at it, the data is growing and it’s easy to fall into the trap of just adding space every month, forever.

The way I handle this is by watching my tables and seeing how they’re growing. These are treated just like databases where the largest one isn’t necessarily the growth issue, and looking at historical data to see the growth trend is a better solution.

In the past I wrote a post called Monitoring Database and Table Sizes to get this info yourself. In the future, I’ll write Table Growth Trends – Idera Diagnostic Manager. At the time of this writing, I’m planning on the Table Growth post coming out three weeks after this one. Until then, I put this script up on my Idera Diagnostic Manager Scripts page, which is where I keep these posts organized and where I put my scripts to wait for me to blog about them.


You can run queries against multiple servers at once, and it’s quite useful for a number of reasons. I use it to check settings, verify backups and DBCC ran recently on unmonitored servers, make sure all servers rebooted during the maintenance window, and many other reasons.

This is all done through registering servers on SQL Server and opening a new query for the group. I’ll walk you through that then run a couple queries that I find useful.

Unfortunately, this method is so easy that you’ll only ever have to see it once and won’t be back to visit this blog post again. Hopefully the scripts below will inspire a return visit.

Setting It Up

Registered servers are local to your SSMS install. You’re not changing server settings by setting this up, and this will be lost if you reinstall SSMS without exporting your settings.

First, open the Registered Servers pane by going to View / Registered Servers.

Add a new server group and call it anything you want. I’ll call mine blog because that’s what I’m using it for. I love the idea of having Prod and Non-Prod groups, especially since I can run a query against all my Non-Prod servers that aren’t monitored to verify backups and DBCC checks are being done.

It’s important to note at this point that you can have a server in more than one group and groups can be nested. So in my prod group I have groups for each data center, then the servers are at that level. I could also have the same servers in functional groups, such as Finance, IT, and Why_Me.

Right-click and do a New Server Registration, and the options should be pretty natural to you. For this example, I used aliases for “Registered Server Name”, but I stick to the default when doing this for myself.

At the most basic level, it should look like this.

That’s it, you’re set up.

Running Queries

This is easier than the setup.

Right-click on a group and click on New Query.

It opens a new query window, with the only oddity being that instead of a yellowish bar saying “Connected. (1/1)”, you now have a pink bar saying “Connected. (2/2)” along with the group name.

This will be connected to all servers directly in the group and in groups nested within it. There will be a long delay if one of the servers isn’t available, so it’s worthwhile to keep these groups cleaned up.

Now you run a query. Here’s how the results are grouped by default. Although they showed up in order for me, that was a coincidence. They’ll actually be in the order they came back, so a server that returned results in 1 second will come before a server that returned results in 2 seconds.

You can go to Tools / Options and change a couple things, but the defaults tend to work great.

Now that it’s set up, right-click on your group and go to Tasks / Export… to save this off somewhere. The only place this is saved right now is on your workstation, and those words will make any DBA cringe.

Practical Uses

Doing “SELECT 1” like I did above is always a blast, but it’s not all that productive. As always, I encourage you to play around with it yourself, but here are some things to get you started.

Server (services) last started

When I do Windows Updates I reboot all the database servers every month, even if the updates don’t say they need a reboot at that time. However, it can be difficult to tell which servers I rebooted and which ones just aren’t begging for a reboot. Since TempDB is recreated each time services start, I check to make sure that happened during the WSUS window.

SELECT create_date FROM sys.databases WHERE database_id = 2

If I were doing this for hundreds of servers, I’d simply change the query to add AND create_date < GETDATE()-1; then only my problem servers would return results.

DBCC and Backups up-to-date

My non-prod servers aren’t monitored, so things get missed sometimes. Backup jobs may be failing and I don’t know until I have time to check everything on all my non-prod servers, which unfortunately doesn’t happen very often. As with everything else in life, I cheat as much as possible to make it easier, this time by running this as a multiserver query:
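The query itself is missing from this copy; for the backup half, a sketch against msdb looks like this. The one-day threshold is my assumption, so tune it to your maintenance schedule:

```sql
-- Databases whose most recent diff or full backup is more than a day old (or missing).
SELECT d.name
    , MAX(b.backup_finish_date) AS LastDiffOrFull
FROM sys.databases d
    LEFT JOIN msdb.dbo.backupset b ON b.database_name = d.name
        AND b.type IN ('D', 'I')   -- D = full, I = differential
WHERE d.name <> 'tempdb'
    AND d.state_desc = 'ONLINE'
GROUP BY d.name
HAVING MAX(b.backup_finish_date) < GETDATE() - 1
    OR MAX(b.backup_finish_date) IS NULL;
```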

If I get any results from this, I have something to fix. This also works for daily full backups because it does diff or full, but you may want to change it around to fit your maintenance schedules.

Version Check

If you’re evaluating what version all of your servers are on so you can make plans to patch them, just run this:

SELECT @@VERSION

You have everything you need to compare it to SQLServerBuilds.BlogSpot.com. The only thing is that I tend to copy/paste the results into Excel so I can sort them there.

More Uses

I use registered servers and multiserver queries for more than this. In fact, I wrote this post so I could link to it in a future post on Instant File Initialization. Once I have that one here, I’ll have four examples of uses.

Other readers and I would benefit from hearing more uses of this feature. Let us hear the best uses you have, preferably with a script.

Entry-Level Content Challenge

This is my third post in Tim Ford’s Entry Level Content Challenge. I was tempted to write a post and just tell people to run a script as a multiserver query. After all, it’s very simple after you see it for the first time. However, you have to see it for the first time before you get to that point.

This is the exact purpose of his challenge. Read over it and consider joining in.


Recompiles can be a hidden bottleneck on your server, and it may not be too obvious. I should know, it happened to me. Unfortunately for me, that was before I was a proponent of tracing or using XEvents to watch for recompiles. Fortunately for you, the experience changed me, and I enjoy sharing my experiences.

What happened to me was a proc that was called WAY too often taking too long cumulatively. I knew the true root cause of my issue was in the application, which was calling the proc once per row, often for about a hundred rows, when loading a pretty important page on a web application. However, I had just found out the service desk had been getting complaints for a while, and asking the development team to help me rewrite how the page works ended in an answer that didn’t even sound like now. Well, I guess “no” does kinda sound like “now”…

The proc was pretty simple, taking on average about 0.150 seconds to run. Sounds great until you remember that it’s being called about a hundred times for a page load on a web application. 100 * 0.15 = 15 seconds for the page to load, not counting anything else running. Users were threatening to not be users anymore even though they were happy with most other aspects.

I got into it, reworded the query a little, created a perfect index for it, and got it down to 0.100 seconds to run. This was a large concession on my part because I HATE perfect indexes for a single query, and feel strongly that indexes should focus more on being reusable by many queries. However, even that only got the web page to load in 10 seconds.

Traces of the proc confused me. The statements were running in 0.010 seconds, and there was a 0.090 second gap between runs. That gap was 9 of my 10 seconds. At 10 seconds users avoid a website, but at 1 second users have to be prompted to say it’s not as fast as they’d like.

It took me longer than I’d like to admit to figure out that every run was kicking off a recompile that took 0.090 seconds. This is the point where I started tracing for the recompile reason, and found out it was because statistics were being updated. However, there were no stats in my database that were updated in the last 15 minutes, so I would have never guessed this was the reason without tracing on this event.

This proc loaded data into a temp table, then did some processing on that temp table, and I figured out that there were enough rows being loaded into the temp table to update the stats on it. Removing the temp table was a disaster, it raised my execution time by 0.010 seconds. Yeah, “disaster” can be relative…

Then I came up with a very odd solution that I’ve never seen or heard of before that point, and I haven’t used much since that point. The temp table was swapped out for a table variable. No stats to be updated, so no recompile. Although it assumed there would only be one record in the table variable (it always does), the execution plan was the same. The only difference was the total 0.100 second duration was dropped to 0.010 after losing the 0.090 recompile.
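The shape of that swap is easy to sketch. The proc and table names below are hypothetical stand-ins, not the actual code from that application, but the pattern is exactly what’s described above:

```sql
-- Hypothetical sketch of the swap described above; the real proc and
-- table/column names differed.
CREATE PROCEDURE dbo.GetOrderDetail_Sketch
    @OrderID int
AS
BEGIN
    SET NOCOUNT ON;

    /* Old version (recompile-prone): loading enough rows into a temp
       table updated its statistics, which forced a recompile on
       nearly every call.

    CREATE TABLE #Work (ItemID int, Qty int);
    INSERT INTO #Work (ItemID, Qty)
        SELECT ItemID, Qty FROM dbo.OrderLines WHERE OrderID = @OrderID;
    */

    -- New version: table variables have no statistics, so loading one
    -- never triggers a stats-based recompile.
    DECLARE @Work table (ItemID int, Qty int);

    INSERT INTO @Work (ItemID, Qty)
    SELECT ItemID, Qty
    FROM dbo.OrderLines
    WHERE OrderID = @OrderID;

    SELECT w.ItemID, w.Qty, p.ProductName
    FROM @Work w
    JOIN dbo.Products p ON p.ItemID = w.ItemID;
END;
```

The trade-off is the one mentioned above: without statistics, the optimizer assumes a very low row count in the table variable. Here the plan happened to be identical either way, which is why the swap was pure win; that won’t always be the case.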

User experience was now a web page that loaded up in just over 1 second, with most of it still being this query. There wasn’t anything more I could do except wait until the page could be rewritten by development where I could rewrite it to get all the results at once. That process took time, and customers were not impatiently watching to see when it would happen.

I wrote this talking about the users, and users care about duration. However, that’s not the whole story. The 0.090 recompile was pure CPU time, it happening 100 times per page call added up to 9 seconds of pure CPU time, and the users having to use this page hundreds of times a day added up to a noticeable impact on the server performance when it went away. Now I admit that the users only used this when they HAD to instead of when they should have, so it was balanced out a little by the users opening the page about twice as often. Even then, it was still obvious that it was gone when comparing long-term CPU usage.

The Management Response

Side note for all of you managers out there… My manager pulled me aside the next week saying “step into my office for a second.” I admit, that statement always makes you nervous, even when you KNOW everything is going great. He proceeded to tell me that before I started with the company they pulled developers and a DBA into a task force that spent tons of money to improve performance, upgrading servers, upgrading the SAN, upping the WAN speed, and looking into the database. Then he said that I did more in one day than they did in 6 months without spending any money, and that I should take my wife out for a nice dinner and bring him the receipt.

Now I know that the hardware upgrades helped out, that I only helped a single process that day, and many other things that could belittle what happened. I also know that spending $100 was very minor in terms of the effect that change made for the company. That’s not how I saw things.

What I saw was a response that stuck with me, a lot. Management stepped back to recognize a difference someone made, and made sure they knew they were appreciated. Then took it a step further bringing the family into it, having my wife feeling proud and appreciated at the same time.

I know the constraints management is under and all the things they can’t do. That being said, the only thing that frustrates me about this situation is that I don’t see others in management taking the same approach. I swore from that point that if I ever became a manager my one odd request would be that I would want a very small budget in writing to use at my discretion. Small things to show appreciation go a long way.

My Challenge

The challenging part for me, and where I felt I didn’t do as well as I should have done, is that this flew under my radar. I was tracing for anything that took over 1 second duration and was looking at the most expensive queries all the time. However, this one took 0.150 seconds and never accumulated time in the plan cache due to the recompiles.

Here’s the thing, I pride myself on being specialized in database performance, so I feel I should have seen this before having someone in customer service casually mention it to me in conversation. It wasn’t even a request to look into it; he just mentioned he’d gotten off a call with a customer about the page, and that the page had never worked right with no one being able to fix it.

After finding and resolving the root cause of the proc, I also wanted to find and resolve how this flew under my radar. This came down to two primary responses. First, the page was taking too long as a whole, so I requested stats on a per-page basis – how long it took to load each time and how long cumulatively. Second, I started watching for recompiles from time to time. Either one of these would have led me to this issue well before that point.

This post is not about monitoring performance on a website, so, although it’s a very important thing to do, I won’t be mentioning that part again.

Monitoring Recompiles

This post is about recompiles, how they can drag you down, and how you can find out about them. I covered how they dragged me down, and in past posts I casually mentioned you should use server-side traces or XEvents to monitor them. Then I left it up to you to figure out how. Let’s try a different approach today.

Recompiles aren’t something I’m watching for constantly, so I’m not going to run any kind of monitoring constantly for them. What I’ll do is run an XEvent session for 24 hours and analyze what it picked up. If it didn’t recompile often that day it almost definitely does not belong on my priority list, so this works for me.

Here’s my session. It starts immediately, writes to five 10MB rollover files, and will not start the next time your services restart.
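A sketch matching that description might look like the following. The session name and file name are my assumptions; the behaviors called out above (starts immediately, five 10MB rollover files, no startup state) are what the options control:

```sql
-- Sketch of a recompile-tracking session matching the description above.
-- The session name and target file name are assumptions.
CREATE EVENT SESSION [Recompiles] ON SERVER
ADD EVENT sqlserver.sql_statement_recompile
(
    -- recompile_cause comes with the event; these actions add context
    ACTION (sqlserver.database_name, sqlserver.sql_text)
)
ADD TARGET package0.event_file
(
    SET filename = N'Recompiles',  -- no path = SQL Server's default LOG folder
        max_file_size = (10),      -- MB per file
        max_rollover_files = (5)   -- five 10MB rollover files
)
WITH (STARTUP_STATE = OFF);        -- OFF = will not start when services restart
GO

-- Start it immediately
ALTER EVENT SESSION [Recompiles] ON SERVER STATE = START;
```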

Querying the Recompiles XEvent Session

Then I’ll use this query to look at my session. It’s a little awkward in two ways.

First, I only leave the session running for a day, but this script needs it to be running to read from it. The only reason for that is it pulls the file path from the running sessions. You can get around this by putting the path into the script manually. As for me, I usually query it and figure out my next steps while it’s running, then I stop the session.

Second, and this is quite humorous to me, it recompiles due to my temp tables every time it runs. It’s a great example of what to ignore because it’s a low-cost recompile that happens on a seldom-used ad-hoc query where performance is not a major priority. However, because temp table definitions are cached for proc plans, this goes away if you turn this script into a proc. I laugh every time…

When you get the results in the #Queries temp table at the end, they’re what you’re used to: just a table, with all the XML out of the way. The query at the end of the script views them all, but I’ll play around with it, doing aggregations and more. I encourage you to do the same. This is also why it drops the table if it exists at the beginning instead of dropping it at the end.
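The reading side can be sketched like this. It assumes the session name from the earlier description; the XML shredding is the standard pattern for event_file targets, and the first query is the part that requires the session to still be running:

```sql
-- Sketch of reading the session; session name is an assumption.
IF OBJECT_ID('tempdb..#Queries') IS NOT NULL
    DROP TABLE #Queries;

-- Pull the file path from the RUNNING session - this is why the session
-- has to be up to use this script as-is.
DECLARE @Path nvarchar(260);

SELECT @Path = CAST(t.target_data AS xml)
                   .value('(EventFileTarget/File/@name)[1]', 'nvarchar(260)')
FROM sys.dm_xe_sessions s
JOIN sys.dm_xe_session_targets t
    ON s.address = t.event_session_address
WHERE s.name = N'Recompiles'
  AND t.target_name = N'event_file';

-- Shred the events into a plain table with the XML out of the way.
SELECT x.event_xml.value('(event/@timestamp)[1]', 'datetime2')       AS EventTime,
       x.event_xml.value('(event/data[@name="recompile_cause"]/text)[1]',
                         'nvarchar(100)')                            AS RecompileCause,
       x.event_xml.value('(event/action[@name="sql_text"]/value)[1]',
                         'nvarchar(max)')                            AS SqlText
INTO #Queries
FROM sys.fn_xe_file_target_read_file(@Path, NULL, NULL, NULL) f
CROSS APPLY (SELECT CAST(f.event_data AS xml)) x (event_xml);

SELECT * FROM #Queries ORDER BY EventTime;
```

If you stop the session first, hard-code @Path instead (a wildcard like N'…\Recompiles*.xel' reads all the rollover files at once).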


What servers need the most attention when it comes to I/O? When I increase the memory on a server what effect does it have on I/O? What was it before the change and what was it afterwards?

These are tough questions, and I have a script that I hope will help you out.

Most monitoring tools, including Idera, are great at looking at what a single server is doing right now, but not so much with comparing performance and loads between multiple larger windows of time and seeing which servers are doing the most work. So I wrote a script to see how much I/O load I’m doing on each server and compare that to how much I did in that same time window in previous weeks.

Your servers will vary, a lot. They’ll vary from my servers, and they’ll vary from each other as well. Depending on the workload, specifically the difference between OLTP and OLAP, higher I/O may be expected and additional memory will have less of an impact. For OLAP, if you have 32 GB of memory and read a 64 GB table followed by a different 92 GB table then it’s all physical reads. If you up your memory to 48 GB, they’re still all physical reads. The cost to up this server (or VM) is the same as the cost to upgrade an OLTP server from 32 GB to 48 GB, but the story will end quite a bit differently there.

The problem is that this change, like any other change, should be measured before, measured afterwards, then have the costs of the change analyzed and justified. The results you’ll see below coupled with my Wait Stats Baseline Comparison using Idera post will help you figure out where to make a change like that and how to measure it.

There are other reasons you’d want to look at this data as well. Even a new index that dramatically dropped the physical reads for one query could have the opposite effect on your server as a whole. The point is, even something very straightforward needs to be verified that it’s doing what you expect it to. Looking at the details you see in the canned reports is great, but so is the ability to stand back and look at the big picture that I’ll get into here.

The solutions I’ve put into place using this type of query have varied from query tuning, timing of larger processes, adding memory like I mentioned above, and buying faster disks, to many other options, including acknowledging we were doing a ton of reads and leaving it be.

My Default Monitoring Message

I’ll copy/paste this section on multiple posts, so feel free to skip it:

The biggest issue is the word “before”. If you weren’t watching your servers before then you have nothing to compare the current state of your server to. Also, if you’re watching it consistently then you’ll be able to make an apples to apples comparison.

Your production servers should have some kind of monitoring software in place, and the monitoring software is capturing this type of information to help make all the pretty charts and graphs you see through the GUI. You need this software to run so you can get alerted when things go wrong, and pulling this data will just make it more valuable.

Monitoring software doesn’t perform magic, so you can pull this data on your own if you don’t have any. However, that’s taking on a lot of responsibility, time, and effort. If something is collecting the data already, take advantage of that.

For me, I use Idera Diagnostic Manager, so my queries are going to look at the data collected using this tool. Other monitoring tools are going to collect the same basic data in a database you can read.

If anyone from one of these companies wants to provide me with scripts that fill the same need, I’ll tack them on here and give you credit for writing them. I can only write against what I can test against, it’s nothing personal.

I/O Differences Script

What I look for are the true apples-to-apples comparisons, so I’m typically looking at a longer time range (I like 24 hours) during peak hours (12:00 to 21:00 UTC on weekdays for me, yours will differ, even server-to-server) comparing this week to the last couple weeks.
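The Idera script itself isn’t reproduced here, and I can only write against what I can test against. If you have no monitoring tool collecting this history for you, a rough DMV-only version of the same before/after idea can be sketched like this (table name and schedule are my assumptions):

```sql
-- NOT the Idera script: a rough DMV-only sketch of the same idea.
-- sys.dm_io_virtual_file_stats is cumulative since the service started,
-- so capture snapshots on a schedule (SQL Agent works) and diff them.
-- Counters reset on restart, so deltas can go negative across one.
IF OBJECT_ID('dbo.IoSnapshot') IS NULL
    CREATE TABLE dbo.IoSnapshot
    (
        CaptureTime  datetime2     NOT NULL DEFAULT SYSUTCDATETIME(),
        DatabaseName sysname       NOT NULL,
        Reads_GB     decimal(18,2) NOT NULL,
        Writes_GB    decimal(18,2) NOT NULL
    );

INSERT INTO dbo.IoSnapshot (DatabaseName, Reads_GB, Writes_GB)
SELECT DB_NAME(vfs.database_id),
       CAST(SUM(vfs.num_of_bytes_read)    / 1073741824.0 AS decimal(18,2)),
       CAST(SUM(vfs.num_of_bytes_written) / 1073741824.0 AS decimal(18,2))
FROM sys.dm_io_virtual_file_stats(NULL, NULL) vfs
GROUP BY DB_NAME(vfs.database_id);

-- Compare the newest snapshot to the same time last week
-- (assumes snapshots land on a fixed schedule).
SELECT n.DatabaseName,
       n.Reads_GB  - o.Reads_GB  AS Reads_GB_Delta,
       n.Writes_GB - o.Writes_GB AS Writes_GB_Delta
FROM dbo.IoSnapshot n
JOIN dbo.IoSnapshot o
    ON o.DatabaseName = n.DatabaseName
   AND o.CaptureTime  = DATEADD(DAY, -7, n.CaptureTime)
WHERE n.CaptureTime = (SELECT MAX(CaptureTime) FROM dbo.IoSnapshot);
```

That’s a lot of responsibility, time, and effort compared to querying data a monitoring tool already collects, which is the whole point of the default monitoring message above.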

The Results

This is, by default, going to show you your biggest I/O users that are being monitored. It’s great to know, but it’s not the whole picture. Some servers are going to be almost pure OLTP, others are going to be almost pure OLAP, and most will be some kind of mix. The more OLAP you do, the more physical reads (Reads_GB and ReadAhead_GB) should be expected. It’s really not a bad thing, it’s how your databases are used.

Then there’s the realistic version of the users’ expectation (not always the version they tell you) to consider. Are the queries coming back as fast as they should? Large annual reports can take a bit, but a website that takes more than a couple seconds isn’t going to be used.

This report shows the load you’re throwing at the disks, but user expectations are measured in seconds, not gigabytes. Yeah, more gigabytes means more seconds, but it’s not like there’s a solid ratio.

That being said, the true value of this information is when you combine it with the type of server it is, the reasonable expectations of the users, the amount of time the server is waiting on I/O, and so much more. To start you out, look at the Wait Stats Baseline Comparison using Idera, too.

If there are OLTP servers that stand out as doing a ton of physical reads and the wait stats show that it’s an issue as well, look into more memory and more time for tuning. Take it a step further and use all of the trending info (not just this report) to split out I/O load and plan for faster disks. There’s a lot of potential to use this info to do a lot of good.

My Idera Scripts

I keep all the scripts that query Idera Diagnostic Manager in one place, on my Idera Diagnostic Manager Scripts page. Before this script had its own post, it was posted there. As I make posts for each script, that page will live on as a directory of what scripts I have to get more use out of this product.

This is the first time I’m doing something like this, but I’ll eventually make other pages that pull together topics like this as well. It feels good to find a great way to take blogging to the next level like this.

Speaking of taking it to the next level, let me know your thoughts on this script, others you’d like to see, and anything else in the comments below.


Database servers have to wait on different resources, and these waits are huge to the performance of SQL Server. Sometimes something changes without our knowledge and is running differently than normal; this is the first place you look to see what’s going on. On the other side, you may have made a change that could have been anything from increasing memory, tuning a query, or moving to different hardware, to changing indexes. All of these things should reduce waits, but how effective were they?

The trick is to know what things looked like before so you have something to compare it to. If you know what it looked like before then you’ll know how much of a difference you made.

My Default Monitoring Message

I’ll copy/paste this section on multiple posts, so feel free to skip it if you’ve seen it before.

The biggest issue is the word “before”. If you weren’t watching your servers before then you have nothing to compare the current state of your server to. Also, if you’re watching it consistently then you’ll be able to make an apples to apples comparison.

Your production servers should have some kind of monitoring software in place, and the monitoring software is already capturing this type of information to help make all the pretty charts and graphs you see through the GUI. You need this software to run so you can get alerted when things go wrong, and pulling this data will just make it more valuable.

Monitoring software doesn’t perform magic, so you can pull this data on your own if you don’t have any. However, that’s taking on a lot of responsibility, time, and effort. If something is collecting the data already, take advantage of that.

For me, I use Idera Diagnostic Manager, so my queries are going to look at the data collected using this tool. Other monitoring tools are going to collect the same basic data in a database you can read.

If anyone from one of these companies wants to provide me with scripts that fill the same need, I’ll make a write up on them and give you credit for being awesome supporting your product like that. I can only write against what I can test against, it’s nothing personal.

The Script

There are 3 parameters at the start, and that’s typically all I ever change. A couple things stick out with these parameters. The times are in UTC, so I default to GetUTCDate(). I want the time range to be at least an hour so I’m not letting a single long-running query skew the results too much. Then I’m also converting my instance name to upper case because Idera DM may very well be your only database that’s case sensitive, so I cheat by putting everything in upper case.

My upper case trick isn’t a recommended trick. It causes scans and possible bad joins. In this case I know it’s against small tables and that I don’t have two servers with the same name using different cases. It’s safe here, but not everywhere.
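The parameter block described above can be sketched like this. I’m deliberately leaving the Idera table and column names out and only illustrating the pattern, so treat every name here as an assumption:

```sql
-- Sketch of the parameter pattern described above; the Idera repository
-- tables and columns are omitted, so all names here are assumptions.
DECLARE @EndTime      datetime = GetUTCDate(),        -- repository stores UTC
        @HoursBack    int      = 24,                  -- at least an hour
        @InstanceName sysname  = UPPER(@@SERVERNAME); -- the case-sensitivity cheat

DECLARE @StartTime datetime = DATEADD(HOUR, -@HoursBack, @EndTime);

-- The query would then filter on something like
--     UPPER(InstanceName) = @InstanceName
-- which is the part that causes the scans warned about above.
```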

The Results

The results pretty much speak for themselves: what were the top 10 things you waited on recently, and how much did you wait on them in the past? It’s great to have a view of it where you’re comparing totals side-by-side. Seeing the same thing in charts just doesn’t have the same effect for me.

These numbers are going to have natural variances to them. Did a user run a couple large queries during one interval but not another? Was the shared storage under different loads? Really anything can come up. The longer the period the more you smooth out these numbers.

So, the real question is what do these waits mean? I’m not going to pretend like I can cover all the wait stats, their meanings, and how to affect them in this post. What I can say is that there are good resources out there for you.

CXPACKET

One wait type I like to talk about, and discredit a bit, is CXPACKET. This wait type is just saying you used parallelism, nothing more. I first covered this topic in my post Why worry about CXPACKET.

Think of it in terms of a manager who delegates a large project to multiple employees. As the employees are hard at work accomplishing the task, the manager goes about their day and records their time as CXPACKET. The task is typically completed in less overall duration because it was a group effort, but the total amount of work done goes up some because there’s overhead and small duplicated tasks when you delegate to a group.

That less overall duration could make or break your application, so the idea of having CXPACKET isn’t necessarily bad (setting MAXDOP = 1 is rarely a good idea). However, a task being so large that it needed to be delegated to multiple employees may mean it’s not efficient, making query and index tuning a good approach. Also, your manager may be too eager to delegate their tasks, so possibly adjust your Cost Threshold for Parallelism as suggested in Jeremiah Peschka’s (b|t) post Five SQL Server Settings to Change.

As I said before, I can’t cover all the wait types here, so go to Paul Randal’s post I suggested above for a better list.

What I Skipped

This is looking specifically at WaitTimeInMilliseconds, which is one of two technically correct things to do. It does not look at ResourceWaitTimeInMilliseconds, which is the other technically correct thing to do. I wrote it one way and was happy with the results so I didn’t change it or clutter it with returning both.

Here’s the difference so you can decide if you want to change it or talk me into changing mine. It’s a conversation I’d enjoy…I’m strange like that.

ResourceWaitTimeInMilliseconds is how long you’re waiting on your actual resource. If you’re waiting for a lock, this is how long you waited for that lock to be made so you could get back in line for CPU time.

WaitTimeInMilliseconds also includes the time from when you get in line for CPU time until you’re using the CPU. This additional time is called your signal wait time.

The reason I like using WaitTimeInMilliseconds is that the reason you gave up the CPU and had to get back in line is due to the fact you weren’t ready due to what you were waiting on. In that way, it’s the whole picture.

The argument going the other way is that you were only waiting for the resource. It’s not the resource’s fault there was a line all queued up at the CPU when it got there. Why should I tack on extra time on a non-CPU wait for that?

I’m undecided on this part, but having the info both ways allows you to figure out the signal wait time on your entire instance and see if you have too much CPU contention. There’s no arguing about that being a good thing to know.
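If you want that instance-wide signal wait number without going through a monitoring tool, the DMVs expose both columns directly. This is a generic sketch against sys.dm_os_wait_stats (cumulative since the service started), not the Idera repository query:

```sql
-- Instance-wide signal wait share from the DMVs (not the Idera repository).
-- signal_wait_time_ms is the time spent back in line waiting for CPU,
-- so a high percentage suggests CPU contention.
SELECT SUM(signal_wait_time_ms)                 AS SignalWait_ms,
       SUM(wait_time_ms - signal_wait_time_ms) AS ResourceWait_ms,
       CAST(100.0 * SUM(signal_wait_time_ms)
            / NULLIF(SUM(wait_time_ms), 0) AS decimal(5, 2)) AS SignalWaitPct
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP',
                        N'BROKER_TASK_STOP', N'XE_TIMER_EVENT');
-- The exclusion list is a small sample; benign wait types vary by version,
-- so extend it with whatever your preferred wait stats reference filters out.
```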

Steal my query, make it your own, and write that part the way that works best for you.

When Idera Finds Out What I’m Doing….

Someone asked me for queries pulling from Diagnostic Manager, and I posted on Twitter when I made my Idera Diagnostic Manager Scripts page. The outcry and rage from this was…very positive.

Vicky Harp (t), who is now Idera’s Corporate Strategist and wrote pieces of Diagnostic Manager and its schema, took time out of her day to meet with me and talk about the details of querying the data they collect.

Scott Stone (t), who is Diagnostic Manager’s Product Manager, also jumped in very quickly to see if I’d like to be a beta tester for them, bringing real power users into the loop to make their products better.

This is the level of support you want from your monitoring software.

While I don’t officially endorse or recommend any specific software at this time, I do recommend you look for responses like this.



Consulting

I am now working for Plus Consulting on a team that provides database performance, business intelligence, and database administration consulting services. In the past there have been requests for me to help during business hours that I had to decline, and I'm glad to say that's no longer the case.

Category Pages

I wrote a lot about these categories. Since I had trouble finding my own posts, I organized them and wrote a quick sentence or two about each one for my personal use. Then I decided I should share these pages, too.

Two pieces of advice:

Give thanks

If you benefit from something, say something. Also, if you would speak up if something wasn't done right, you should do the same if it is done right. While saying something is easy to do, for some reason it's more difficult to follow through with than it should be.

I'm personally working on this. Specifically, if I'd visit someone's cube when they didn't resolve a ticket as quickly as I expected, then I'll also visit their cube to say thank you when it's done quicker or better than expected.

Give back

My career benefited greatly from the work others made public, so I thought I should do the same...it was a selfless concept. Then I realized how much I learned by putting my knowledge together in posts and presentations. I have yet to write a post, prep a presentation, or deliver a presentation where I didn't learn more, even on topics I thought I knew well enough to just do everything off the top of my head.