Proactive Database Administration

Server Management

These three issues are all related, yet very different. When we deal with reports of problems, it’s essential to clearly identify what is actually happening. Sometimes, we need clarification by asking about the symptoms, instead of accepting what the user says without delving any deeper. The root cause of a problem may not even be what the user is experiencing directly, or what they claim they’re experiencing.

In this post, we’ll cover the similarities and differences of timeouts, blocking, and deadlocks, how they manifest themselves to an end user, and how they may cause, or lead to, each other.

Blocking

Blocking (also known as live locking) occurs when a process attempts to acquire a lock on a resource, but has to wait in the locking queue before the lock is granted.

From the outside, it appears that the process is doing nothing, when in fact it’s waiting for the other process(es) to release the lock(s) ahead of it in the queue.

If a process is blocked for too long, this can lead to timeouts. If locks are acquired in a specific order, blocking can lead to deadlocks.

The thing to remember is that blocking occurs all the time to synchronize access to resources. It only becomes a problem when the amount of time a process spends waiting becomes excessive, either because it’s causing timeouts, or because it’s causing a user to wait for more time than they expect is normal. Blocking may also be experienced as a “frozen” application. Users may not complain about blocking until their work is impeded enough to frustrate them into taking action to get the problem fixed.

Timeouts

Timeouts occur when a client application makes a request for a resource, and while waiting for a response to be returned, decides things have taken too long, and stops waiting. This usually results in an error being raised by the client API.

Timeouts occur for a variety of reasons, including blocking, the request needing to do a ton of work on the server, or maybe the network is saturated or simply really slow. There are millions of reasons, all ultimately because the client was waiting, and gave up after a certain period of time (the timeout threshold).

An important thing to note about timeouts is that they do not cause other problems (except if the error that’s raised is handled incorrectly). In other words, timeouts are only a symptom of another problem; to solve the timeout issue, solve the issue that caused the timeout.

Since timeouts always depend on something else happening (or not happening), they can be difficult to reproduce in a testing environment. Without sufficient tracing/logging mechanisms in place, it can be difficult to track down the source of the problem. Users will usually be able to tell you that something is timing out, and probably what they were doing at the time, but finding the cause often requires some digging.

Deadlocks

Deadlocks occur when two or more processes hold locks on resources (i.e., block each other), and also try to take locks on resources held by the other process(es).

This creates a situation such that neither process can continue unless one of them is terminated by another external process. In SQL Server, this external process is called the Deadlock Monitor, and upon detecting a deadlock, it summarily rolls back one or more of the queries to resolve the situation.

I demonstrated how deadlocks are created in a demo video in a previous post. As part of setting up the demo, I showed how blocking works in relation to deadlocks.

By definition, deadlocks are caused in part by blocking, and in part by the order in which the locks were acquired. Usually it’s very clear to users when deadlocks occur because of the error messages. In some cases, deadlocks can lead to timeouts if the external monitoring process takes too long to pick up on the fact there is a deadlock in the system (the deadlocked processes are blocked and waiting).

Fixing deadlocks is often an exercise in information gathering, because they normally aren’t reproducible except under specific circumstances. The key is setting up a tracing/logging solution to record the pertinent information when deadlocks happen, so you can analyze and fix the problems later. I explained how to do that in a previous post as well. A more proactive solution would be to determine the “hot” areas of the application, and ensure that locks are acquired in the same order in every piece of code that accesses that area.

While well-written applications may attempt to retry the operation in progress when a deadlock occurs, the question of why the deadlock occurred in the first place must still be asked and answered; tracing/logging or other solutions still apply.

As you can see, there are relationships between blocking, timeouts, and deadlocks. The next time you deal with an error report that involves one of these three phenomenon, I encourage you to dig deeper, and put in place processes that either prevent the problem from occurring in the first place, or simply record enough information to be able to fully solve the problem later.

Most of the time when we talk about tuning, it’s related to improving the performance of a production workload. Maybe (hopefully) we’ve gone as far as tuning more “hidden” code like stored procedures, triggers, and the black boxes that are user-defined functions. But what about the other queries we run against our production servers?

Maybe we use 3rd-party management/monitoring software, or in-built tools like Activity Monitor or Management Data Warehouse (aka Data Collector) that periodically checks up on our servers. In these cases, if we identify these systems as a performance problem, we’re pretty much stuck with either reducing the polling frequency, abandoning the use of the system entirely, or pestering the vendor until the problem is fixed. That’s pretty straight-forward.

There’s another class of queries we run against our production servers, though: diagnostic queries. If a server is in performance trouble, or we need to go troubleshooting for blocking, or even if we’re proactively searching for issues, we automatically whip out our trusty folder of scripts and start running queries to start to narrow things down; queries that could potentially make the problem worse by either adding extra load to the server, or even adding unnecessary time to the troubleshooting process.

It’s impossible, of course, to eliminate the need for the database engine to do some processing to return the results of our queries (the observer effect), but how much impact are our queries having on the server? Maybe in the grand scheme of things, not a lot. During a troubleshooting session, however, things are very different: people are anxious for answers, the problem needs to be solved now, and we’re working as hard as we can to not only fix the problem, but above all else, to not make things worse.

Now, I’m going to give an example, and I have to apologize in advance to Paul Randal and/or Glenn Barry because I’m about to single them out. Sorry, guys.

A good while back, Paul posted an excellent diagnostic query to analyze wait stats (note: I never saw Glenn’s original query), and I had stolenborrowed used it to troubleshoot some issues in the environment I work in. Of course, I played around with it before that, to see it was indeed a useful script to keep around. But when I ran it against the production server in peril, I was floored because it took so long to return the results I needed to see. I was also puzzled that after I reset the wait stats and then ran the query a few seconds later, the results… just didn’t look right for what I was expecting. What was going on? It was just a relatively simple SELECT statement from a DMV, which is usually pretty quick. I turned on STATISTICS IO and STATISTICS TIME and ran the query again:

Much better. (You can find the modified script on the Scripts & Code Resources Page. Note that it doesn’t do exactly the same thing as the original.)

So what can we learn from this?

While DMVs are usually fast, doing further analysis on the results can be expensive, depending on what we do, of course.

Test diagnostic queries for performance just as much as for functionality.

Fully test diagnostic queries long before they get thrust into the heat of battle. This is sword-sharpening for DBAs.

For the scripts that just… perform poorly… and there’s nothing more we can do about it, add comments to the script body that say as much. Whoever is about to run the script should be made aware that the query will take some time to finish. The warning will prevent situations like worrying about whether or not it’s working, or it even may alert the person to not run the script at all and use a different/similar method instead.

Whether you manage one SQL Server instance or hundreds, you need to have a way to keep track of them all.

Just like best-practices when it comes to security, it would be nice to create logical groups of instances and manage them based on their group membership. Perhaps also inspect, or perform actions against groups of instances:

Run queries, including management-type activities like creating or running jobs

Check if the settings match a pre-determined configuration, and if not, fix the settings

While there isn’t really an inbuilt group-based way to manage these kinds of things, there are two out-of-the-box solutions to mimic many of the advantages, and which help you accomplish these tasks. These solutions are, of course, Registered Servers, and Central Management Server.

In terms of requirements, Registered Servers was available starting with the Management Studio packaged with SQL Server 2005, and support for registering an instance as a Central Management Server was added in SQL Server 2008. I don’t believe there’s any limitation on version of the instances you can register within either system (and I can’t find it in the documentation); off the top of my head I would say SQL Server 2000 and higher sounds pretty safe. From personal experience, 2005 and higher definitely works.

The UI to access both of these features is through the View | Registered Servers menu item in Management Studio. While it might be a bit strange that both of them are contained in one UI, we’ll see in a second that that’s actually an advantage, because it’s likely you’ll need to use both in your environment.

What is a Registered Server?

While Registered Servers and Central Management Server have different names, both ultimately do the same thing: store a user-defined hierarchy of SQL Server instance connection strings. In other words, a tree view that lists and categorizes your instances. The behind-the-scenes way this is accomplished is very different between the two systems, and this means there are advantages and disadvantages to each.

Although both names contain the word “servers,” what we’re actually dealing with are instances of SQL Server. I guess it was a design decision by the SQL Server team to call them Registered Servers instead of Registered Instances. It’s also a bit confusing because Registered Servers refers to one user-defined hierarchy, while Central Management Servers (note the plural) would refer to multiple hierarchies.

Differences

The main difference is that a Registered Servers list is only visible to one Windows user, while a Central Management Server list is publicly available to anyone with enough permissions to view the list.

The best way to show all the differences is probably just to put them in a table, so let’s do that:

Registered Servers

Central Management Server

Storage Method

XML file in Windows roaming profile

msdb database

List Visibility

Current user only

Everyone who is authorized (view nothing, view all, edit all)

Security of Instance List

Based on Windows user context

Managed by SQL Server permissions, and msdb database roles

Types of Instances

Database engine, SSAS, SSIS, SSRS

Database engine only

Connection String Authentication

Windows Authentication and SQL Server authentication

Windows Authentication only

So now you can see the big tradeoffs between the two systems, and why I said it’s likely you’ll need to use both. Central Management Servers should in theory be able to support all types of instances, but sadly, it just isn’t implemented for some reason. My fingers are crossed that it’s implemented in the next version.

Setup

Since the storage method of the two system is very different, there are very different concerns when it comes to setting them up. Really there isn’t much to Registered Servers because of the fact that the list is isolated to a single user. With a Central Management Server, however, there are many more parts in play — the price we have to pay for power and flexibility — and so there are many more guidelines.

Briefly, the way I like to set things up is as follows:

Use Registered Servers for instances that I keep on my local machine (for development, or playing around) that aren’t usually accessible from the rest of the network, and for SSAS/SSIS/SSRS because there’s currently no other choice.

Set up a dedicated SQL Server instance just to be the Central Management Server instance, and register all publicly-available instances within a given domain. (One Central Management Server instance per domain is a good idea, particularly if that boundary separates development and production, for example.) The instance doesn’t have to be powerful in terms of either CPU or memory, but it does need to be very reliable because we will be using it to manage our servers in good times and bad. Even Express Edition can be used to set up a Central Management Server, but for obvious reasons, you will want to take regular backups of at least the msdb database by some means.

I recommend installing a Central Management Server instance on a separate management machine (again, probably one per domain), and not on a production machine. (I wrote a bit about this in the comments on my answer here). Definitely do not register a Central Management Server on an existing production instance — not only is that a bad practice in terms of separation of concerns for many reasons, but there’s also a limitation that a Central Management Server instance cannot appear in the hierarchy within “itself” as would happen in this case.

(If necessary) Configure security to determine who can view, and who can edit a Central Management Server list. As eluded to in the table, this is accomplished by assigning membership to the msdb fixed database roles ServerGroupReaderRole and ServerGroupAdministratorRole (MSDN page) as appropriate. It may also be desirable to use multiple Central Management Server instances to separate tiers of registered servers, such as development, test/QA, and production, because there isn’t any other way to control instance-level visibility besides the aforementioned roles.

Organizing the List

One of the great features of both Registered Servers and Central Management Server is that registered instances can appear in the hierarchy multiple times. Why is that great? Because it allows us to organize the instances in multiple ways within the same hierarchy.

We may want to group our instances together by:

Version and/or edition

Criticality (24/7, business working day, etc.)

Business area or department (development, test/QA, production, etc.)

And I’m sure you can think of a few others for your environment as well.

In fact, we may wish to categorize by more than one of these things for frequently-used parts of the hierarchy. When arranging the list for maximum usefulness, think about how to organize it such that if a single group is selected in the tree, changes would happen to all the instances within that group folder.

If you’re setting this up for the first time, and assuming there aren’t too many instances in play, it may just be a good idea to start with a relatively flat hierarchy, and build the groups as you find they’re needed. The reason why this strategy isn’t a good idea with many servers is because the ability to quickly move servers between groups in the UI is extremely limited. For a Central Management Server, the system tables in msdb could probably be edited directly, but I can’t say I’d recommend that, and I haven’t tried it myself either.

In a future post (likely a video demo), we’ll dive into using and leveraging Registered Servers and Central Management Server to their full potential through the UI.

To leverage Central Management Servers externally, such as through PowerShell or an external application, we can run queries against the following two views in msdb: dbo.sysmanagement_shared_registered_servers and dbo.sysmanagement_shared_server_groups, or use the various stored procedures to manage the list.

PANIC! A customer clicked through four layers of warning messages and accidentally deleted a bunch of data from the database, but didn’t bother to tell you about it until today.

Great. The database is in FULL or BULK_LOGGED recovery, and we have a full set of transaction log backups that contain all the transactions in the database for the time when things “happened.” Okay… now what? Log backups seem very opaque, as we can’t just open them up in Notepad and expect things to be human-readable.

Enter the undocumented table-valued function: fn_dump_dblog.

This function reads a transaction log backup file and returns a human-readable geek-readable description of the physical log records from the backup.

(The sister function fn_dblog does the same sort of thing, except it operates only on the active transaction log of a database, not a transaction log backup.)

Paul Randal wrote a detailed blog entry on the basics of how to use both of these functions, so I won’t cover that here. Where this post differs is in consuming the output of the function so it’s much more usable.

The first step is to read the entire backup and dump the output into a temporary table. This will make querying the same set of transactions (or more transactions if you load additional backups into the table) much faster, as the log reading part of things is rather slow.

Now that we have the transactions available for querying more readily, let’s show what we need to see in an easy-to-consume format. This works best if you’ve restored a copy of the database in STANDBY mode to a point in time before the time of interest. If the script is run in the context of that database, the code will show you the names of the tables affected, the login SID of who made the change, and also proactively return a DBCC PAGE command for when you want to look at the raw data values. This makes it really easy to inch through the transaction log to figure out what changed using out-of-the-box tools. (Yes, there are 3rd-party tools that do this, too.)

When we manage databases in either the FULL or BULK_LOGGED recovery models, we need to pay particular attention to the size of the transaction log files. If our processes aren’t optimal, we can see log files grow either out of control, or beyond what we think is a reasonable size.

Virtual Log Files

As I mentioned in a previous post, the transaction log contains an ordered sequence of the physical operations that occur on a database.

What I didn’t mention is that the physical transaction log file also contains logical sections, called virtual log files (VLFs). It’s the VLFs which actually contain the physical operations I mentioned. The purpose of VLFs is to allow SQL Server to more efficiently manage the log records; specifically, to know which portions of the transaction log are used, and which aren’t.

Knowing the portions of the log that are used is important when we go to take a transaction log backup, which creates a copy of all the transaction log records since the last transaction log backup. After a transaction log backup, as long as all the log records within a VLF are not part of an active transaction (or are required for some other feature, such as replication), the VLF can be marked as unused/inactive. This allows SQL Server to reuse that portion of the log file. This process is called many names, including clearing the log and log truncation. It does not affect the physical size of the log file.

Problems only start happening if all the VLFs end up being used, and none are available1. This means the log file (the physical file) has to grow to accommodate more transactions. A physical log file growth automatically creates new unused VLFs, and so the transactions can continue.

What is causing the log to grow after I shrank it?

Any write operation in the database can potentially cause the log file to grow. The real question is: why are all the VLFs used up when that’s not what is expected?

Here are some of the problems you might encounter:

Not taking transaction log backups frequently enough, or not taking them at all. The only way to mark VLFs as inactive is to take a transaction log backup (again, only for FULL and BULK_LOGGED databases). If transaction log backups are only taken once/day, besides exposing yourself to a full day of data loss, you also need a transaction log file large enough to hold all the transactions that occur for a full day. That could be quite a bit of space! The solution is obvious, and it’s a win for recoverability, too: take transaction log backups more frequently.

Overly aggressive and/or too frequent index maintenance. I’m certainly a proponent of index maintenance, but it’s very easy to get carried away. Maintenance plans that rebuild shiny new indexes for you every night are both ridiculously easy to set up, and also ridiculously bad for the size of the transaction log. Typically, rebuilding all the indexes of a database takes more log space than the data files actually take up. If you’re dealing with databases of even small-to-moderate sizes (say, 1 GB and larger), this can add up to a tremendous amount of wasted storage space, because all that transaction log has to be backed up at some point, and the log file will likely end up larger than the data files. What I strongly recommend doing is putting in place a much more selective index maintenance process, such as Ola Hallengren’s index maintenance scripts, which I use in production.

The database is in FULL or BULK_LOGGED when it actually should be in SIMPLE. Before you go flipping the settings around (which can potentially cause data loss), go read about choosing the right recovery model for the database. Typically, I see this happen when a production backup is restored to a development environment, and a backup gets taken for some reason. And then the transaction log ends up growing, and growing, and growing as developers test and roll back their work, right up until things screech to a halt when the drive that contains the log file has filled up.

A SQL Server feature, such as replication or mirroring, is enabled. Until the transactions have been copied to the secondary server, the records cannot be cleared from the transaction log. If the secondary is slow applying changes, or if you don’t even have a secondary any more and things weren’t cleaned up correctly, this could lead to infinite transaction log growth. To solve this really depends on where the specific problem is — solve that first, and then see if there is still an issue with the transaction log.

If none of those cover your scenario, you can run the following query, which will tell you why the transaction log is still in use.

1 Extra empty log space is pre-allocated such that there is always enough space for active transactions to be able to roll back. Regardless, that space is “used” in the sense that it’s been spoken for already.