Proactive Database Administration

Technical analysts often find themselves reaching beyond their original job description to solve problems. Perhaps none more so than the database administrator, who is tasked with managing what is often the focal point of a business — data — a point where every technical discipline can converge.

I suppose it would be helpful to take a step back and ask the question: what does a database administrator do?

There are as many answers to that question as there are database administrators. You’ve probably had to think about answering that question for a layperson before, and come to the realization that there just isn’t a general answer. You probably ended up answering the question “what do I do as a database administrator?” to avoid giving some sort of cop-out answer like “manage databases” after having a puzzled look on your face for several seconds.

Maybe all you do is make sure all the data is backed up and recoverable, and respond to alerts when those processes are in trouble. Maybe all you do is analyze traces to make performance improvements. Maybe all you do is write report queries for the people wearing suits.

More typically, though, you do many different things at different times — or all at once — to solve a problem. Analyze a performance trace to determine why a report query is slow, and find out that while the code could be improved, there’s a latency issue with the storage network. That sounds a lot more like it. Sometimes the lines drawn between technical disciplines become so blurry they disappear altogether.

Everyone has their own set of likes and dislikes, and of course that isn’t limited only to your career. It’s a set that’s constantly changing — sometimes gradually, sometimes rapidly — as you experience things for yourself.

In the field of information technology, there are so many different areas to explore, even when only considering what’s involved with database administration specifically. Programming, hardware, storage, networking, and more — the expanse of knowledge is vast and growing every single day, and so it’s becoming more and more difficult (if not impossible) to be an expert in all of them.

Practically speaking, I think abandoning the idea of becoming that all-knowing expert is the only reasonable approach. That isn’t to say that goals should be abandoned as well; my suggestion is simply to focus, or re-focus. Concentrate on the one or two areas where your natural aptitude and positive attitude intersect, and hold an interest in many of the other major areas, even if you don’t consider some of them to be particularly interesting.

While that sounds a bit like punishment, the broader your exposure to different subjects, the better you'll be able to solve a wider variety of problems, and the more effectively you'll be able to communicate with everyone in your IT department, which is truly valuable.

Conversations about VLANs and ORMs take on real, practical meaning as they affect the database, your primary responsibility. Even a small amount of understanding of just the subject jargon can go a long way.

Those who become database administrators often start in a peripheral field, usually either software development or system administration. Just as the answer to "what does a database administrator do?" varies, so do the paths to becoming one.

It's somewhat ironic that people struggle with the question "how do I become a database administrator?" (there is no database administrator college program), yet because database administration borders so many technical fields, it's possible to transition into it from a wide variety of traditional roles.

I think of it more as building a foundation to rely on, so it should be a very strong foundation. Not necessarily as a fallback (although that's probably a good plan), but as your area of deep expertise, as I mentioned. Above all else, though, you have to bring your problem-solving and information-synthesizing skills to the table. Peripheral knowledge is far less useful without the ability to apply it practically.

I suppose the point of this post is to encourage you to reach out into some of the peripheral subjects of database administration. It will only help your career.

Make it a habit, too, even if only once a week. A great way to do this proactively is to subscribe to a newsletter such as Database Weekly, which delivers a wide cross-section of database-related topics directly to your inbox.

Timeouts, blocking, and deadlocks are all related, yet very different. When we deal with reports of problems, it's essential to clearly identify what is actually happening. Sometimes we need to ask clarifying questions about the symptoms, instead of accepting what the user says without delving any deeper. The root cause of a problem may not even be what the user is experiencing directly, or what they claim they're experiencing.

In this post, we’ll cover the similarities and differences of timeouts, blocking, and deadlocks, how they manifest themselves to an end user, and how they may cause, or lead to, each other.

Blocking

Blocking occurs when a process attempts to acquire a lock on a resource, but has to wait in the locking queue before the lock is granted.

From the outside, it appears that the process is doing nothing, when in fact it’s waiting for the other process(es) to release the lock(s) ahead of it in the queue.

If a process is blocked for too long, this can lead to timeouts. If different processes acquire locks in conflicting orders, blocking can lead to deadlocks.

The thing to remember is that blocking occurs all the time to synchronize access to resources. It only becomes a problem when the amount of time a process spends waiting becomes excessive, either because it’s causing timeouts, or because it’s causing a user to wait for more time than they expect is normal. Blocking may also be experienced as a “frozen” application. Users may not complain about blocking until their work is impeded enough to frustrate them into taking action to get the problem fixed.
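The "frozen application" symptom is easy to reproduce in miniature. Here's a sketch in Python, using in-process threading locks as stand-ins for database locks (an analogy only, not how SQL Server implements locking): one thread holds a lock while it "works", and a second thread appears to do nothing while it's actually queued behind it.

```python
import threading
import time

lock = threading.Lock()          # stand-in for a lock on a database resource
wait_seconds = {}

def long_transaction():
    with lock:
        time.sleep(0.3)          # holds the lock while doing "work"

def quick_query():
    start = time.monotonic()
    with lock:                   # appears idle here -- it's queued on the lock
        wait_seconds["quick"] = time.monotonic() - start

t1 = threading.Thread(target=long_transaction)
t2 = threading.Thread(target=quick_query)
t1.start()
time.sleep(0.05)                 # ensure the long transaction acquires first
t2.start()
t1.join()
t2.join()
```

The quick query does nothing wrong, yet it waits almost the entire duration of the long transaction, which is exactly how harmless-looking blocking turns into user complaints.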

Timeouts

Timeouts occur when a client application makes a request for a resource, and while waiting for a response to be returned, decides things have taken too long, and stops waiting. This usually results in an error being raised by the client API.

Timeouts occur for a variety of reasons: blocking, a request that requires a huge amount of work on the server, a saturated or simply slow network, and so on. There are countless possibilities, all ultimately coming down to the same thing: the client was waiting, and gave up after a certain period of time (the timeout threshold).

An important thing to note about timeouts is that they do not cause other problems (except if the error that’s raised is handled incorrectly). In other words, timeouts are only a symptom of another problem; to solve the timeout issue, solve the issue that caused the timeout.

Since timeouts always depend on something else happening (or not happening), they can be difficult to reproduce in a testing environment. Without sufficient tracing/logging mechanisms in place, it can be difficult to track down the source of the problem. Users will usually be able to tell you that something is timing out, and probably what they were doing at the time, but finding the cause often requires some digging.
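The client-side mechanics can be sketched in a few lines of Python. Here a threading lock stands in for a busy server resource, and QueryTimeout is a made-up exception name, not a real client API; the point is that the client, not the server, decides when to stop waiting.

```python
import threading

class QueryTimeout(Exception):
    """Stand-in for the error a real client API raises when it gives up."""

resource = threading.Lock()
resource.acquire()               # some other "session" is holding the resource

def run_query(timeout_seconds):
    # The client waits up to its timeout threshold, then stops waiting
    # and raises an error -- the server-side work is unaffected.
    if not resource.acquire(timeout=timeout_seconds):
        raise QueryTimeout(f"gave up after {timeout_seconds}s")
    try:
        return "rows"
    finally:
        resource.release()
```

Notice that nothing about the underlying contention is fixed by the timeout itself, which is why solving a timeout issue always means solving whatever made the wait long.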

Deadlocks

Deadlocks occur when two or more processes hold locks on resources (i.e., block each other), and also try to take locks on resources held by the other process(es).

This creates a situation such that neither process can continue unless one of them is terminated by another external process. In SQL Server, this external process is called the Deadlock Monitor, and upon detecting a deadlock, it summarily rolls back one or more of the queries to resolve the situation.

I demonstrated how deadlocks are created in a demo video in a previous post. As part of setting up the demo, I showed how blocking works in relation to deadlocks.

By definition, deadlocks are caused in part by blocking, and in part by the order in which the locks were acquired. Usually it’s very clear to users when deadlocks occur because of the error messages. In some cases, deadlocks can lead to timeouts if the external monitoring process takes too long to pick up on the fact there is a deadlock in the system (the deadlocked processes are blocked and waiting).

Fixing deadlocks is often an exercise in information gathering, because they normally aren’t reproducible except under specific circumstances. The key is setting up a tracing/logging solution to record the pertinent information when deadlocks happen, so you can analyze and fix the problems later. I explained how to do that in a previous post as well. A more proactive solution would be to determine the “hot” areas of the application, and ensure that locks are acquired in the same order in every piece of code that accesses that area.
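The lock-ordering fix can be sketched in a few lines (Python, with a hypothetical acquire_in_order helper; the same idea applies to the order in which your stored procedures touch tables):

```python
import threading

def acquire_in_order(locks):
    # Acquiring every lock in one global order (here, by object id)
    # removes the circular wait that deadlocks require.
    ordered = sorted(locks, key=id)
    for lock in ordered:
        lock.acquire()
    return ordered

a, b = threading.Lock(), threading.Lock()
finished = []

def worker(name, locks):
    held = acquire_in_order(locks)   # the declared order no longer matters
    finished.append(name)
    for lock in reversed(held):
        lock.release()

threads = [
    threading.Thread(target=worker, args=("t1", [a, b])),
    threading.Thread(target=worker, args=("t2", [b, a])),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Even though the two workers declare their locks in opposite orders, both run to completion, because the helper imposes a single acquisition order.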

While well-written applications may attempt to retry the operation in progress when a deadlock occurs, the question of why the deadlock occurred in the first place must still be asked and answered; tracing/logging or other solutions still apply.

As you can see, there are relationships between blocking, timeouts, and deadlocks. The next time you deal with an error report that involves one of these three phenomena, I encourage you to dig deeper, and put in place processes that either prevent the problem from occurring in the first place, or simply record enough information to be able to fully solve the problem later.

I was watching a car review on YouTube the other day, and the host made the point that the voice recognition system only understood names as they’re pronounced, according to the locale of the infotainment software (U.S. English in this case). Moreover, he said the user should think about how this process works in order for it to be more usable.

Now, that’s a great idea, and it kinda-sorta works, but it’s not very user-friendly. Users shouldn’t be expected to understand how a system functions internally for it to be usable. Many people know how to drive, yet they may not understand exactly how a car works.

These are the kinds of issues we encounter when developing globalized applications. How do we bridge the gap between what the user understands, and what the computerized system understands?

I think it comes down to assumptions. In other words, the computerized system assumes something, while the user either isn’t aware of that assumption, or doesn’t provide enough input (or is never prompted for input) for the system to do anything except make an assumption.

In the case of the voice recognition system, the system assumes the user’s contact names are pronounced based on the locale of the system. This sounds like a reasonable assumption, and for the most part, it works. When thinking about it in the context of an increasingly globalized society, though, it’s becoming a less and less valid assumption. If the system is only single-locale-aware, there’s no other choice for it to make, just given a list of names.

While giving the user the ability to choose a different default locale for the system probably helps quite a bit, it doesn’t solve the problem of the inevitable exceptions to the default.

If we think about the data available, is there any way we could help the system better understand the exceptions?

We could add a property to each name that specifies the locale in which to interpret it. This would solve the problem of complete misinterpretation as shown in the car review. It doesn't, however, solve the problem of idiosyncrasies within a locale (e.g., "Smith" and "Smythe" could be pronounced the same, and each in more than one way), and it would make the software more complex (although it could potentially save on costs, because the same bits could be deployed to all locales, just with different default settings).

Another approach would be to allow the user to input a phonetic version of the name, in order for the system to understand the pronunciation based only on the single default locale. This would solve the problem of exceptions, and the issue of same-pronunciations-different-spelling as mentioned in the previous point. While the phonetic data is assumed to be of the same locale as the locale of the names themselves, this is probably an acceptable drawback for most applications.
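Combining the two approaches, the data model might look something like this (a Python sketch; the field names are hypothetical, not taken from any real infotainment system). Each contact carries a per-name locale and an optional phonetic override, and the recognizer prefers the most specific information available:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Contact:
    name: str
    locale: str = "en-US"            # locale to interpret the spelling in
    phonetic: Optional[str] = None   # user-supplied pronunciation override

def pronunciation_key(contact: Contact) -> str:
    # Prefer an explicit phonetic spelling; otherwise fall back to the
    # name itself, interpreted under the contact's locale.
    return contact.phonetic or contact.name

smythe = Contact("Smythe", phonetic="smith")
```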

With globalization, other factors come into play, too:

The characters accepted by the system. For example, if the system’s default locale is Chinese, is it going to allow the entry of names using English characters, and how would the system understand a phonetic spelling of that name?

What if there are differences between languages where some soundings of syllables may not exist in both?

How would efficient searching of the same data be accomplished for multiple locales?

How much extra effort (i.e., cost and time) is required to add a reasonably globalized solution for the specific application?

As you can probably tell, these are not necessarily problems that are easily and effortlessly solved, and I’m not going to claim for a second that either of the two approaches I mentioned will solve all application issues related to globalization. Since every application is different, though, they may offer a good solution, or at least a reasonable compromise — either way, an improvement in usability, which was the original goal.

As database designers, we must be aware of how globalization affects which pieces of data we store, how they are arranged/sorted for efficient searching, and how users will be interacting with the data. It’s also important to think about designing database structures such that they can easily be enhanced to accommodate a globalized application, even if the application will not support that kind of scenario in Version 1. It’s coming; be prepared for it.

The recovery model of a database is a critical setting that determines how the transaction log both records and retains information about transactions that occur in the database. It’s important to remember that the setting always applies to the entire database.

As we'll explore in this post, there are three recovery models in SQL Server to choose from: FULL, BULK_LOGGED, and SIMPLE; each has advantages, disadvantages, and tradeoffs. Only you can decide which model is appropriate for your particular situation and goals; we'll go through some of those tradeoffs and a few typical scenarios.

First of all, let’s take a quick step back and ask why we care about what gets recorded in the transaction log.

What is recorded in the transaction log?

The transaction log contains an ordered sequence of physical operations that occur within a database. Each physical operation is written to the transaction log before the change itself is applied to the data page(s). This is called write-ahead logging, and it occurs regardless of the recovery model chosen.

Usually, the physical operations written to the transaction log also contain enough information to perform the inverse operation of the change (i.e., to undo a delete, do an insert). This, combined with the write-ahead logging mechanism, ensures transactions succeed or fail as a whole (atomicity) by giving SQL Server the ability to both undo and redo transactions. Hopefully those two words, undo and redo, hint at how logging is intimately tied to backup, restore, and several other features in SQL Server; we'll get to those in a minute.
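A toy sketch of the idea (Python, with a dict standing in for a data page; real log records are far richer than this): each change is appended to the log, carrying its before and after values, before the data itself is touched, and that before/after information is exactly what makes both undo and redo possible.

```python
log = []                     # the "transaction log"
page = {"balance": 100}      # stand-in for a data page

def apply_change(key, new_value):
    # Write-ahead: record the operation (with undo/redo info) first...
    log.append({"key": key, "old": page.get(key), "new": new_value})
    # ...then apply the change to the "page".
    page[key] = new_value

def undo_last():
    # The logged "old" value is enough to reverse the change.
    entry = log.pop()
    page[entry["key"]] = entry["old"]

apply_change("balance", 150)   # the log knows old=100, new=150
undo_last()                    # balance is back to 100
```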

Minimal Logging

There are a handful of operations we can do in a database that can be minimally logged — this means only the information required to undo the transaction is logged. In the case of a BULK INSERT, for example, all that's logged are page allocations — we can undo those by deallocating the pages, but we can't redo them because the actual contents of the pages aren't recorded in the log.

We gain speed by writing to the transaction log less, but lose the ability to perform point-in-time recovery during the operation.

You might be wondering how a database restored using only transaction log backups can be complete after a point in time where an operation such as a BULK INSERT took place. How does SQL Server know what to put on those data pages, since only the allocations were logged?

The answer is in what is recorded by the first transaction log backup taken after the minimally-logged operation. SQL Server keeps track of extents that are modified by minimally-logged operations in a special metadata page (the ML map, or minimally-logged map, which contains 1 bit per extent), and copies those extents into the transaction log backup, along with the required portion of the transaction log itself. Yes, transaction log backups can contain data!
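The bookkeeping can be pictured as a simple bitmap (a Python sketch of the concept only; the actual page layout is internal to SQL Server): one bit per extent, set when a minimally-logged operation touches that extent, and scanned by the next log backup to determine which extents to copy.

```python
ml_map = bytearray(8)        # hypothetical map covering 64 extents

def mark_extent(n):
    # A minimally-logged operation modified extent n: set its bit.
    ml_map[n // 8] |= 1 << (n % 8)

def extents_to_copy():
    # The next log backup copies every extent whose bit is set.
    return [n for n in range(len(ml_map) * 8)
            if ml_map[n // 8] & (1 << (n % 8))]

mark_extent(3)
mark_extent(10)
```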

The net result of minimal logging isn’t that we get the operation completely for free; we just defer a good chunk of the work to the backup subsystem for an immediate gain in speed.

Armed with this knowledge, let’s take a look at the features of each of the three recovery models.

FULL

Only transaction log backups internally clear the transaction log, so a good backup strategy is a must.

Required for certain features, such as Database Mirroring and Availability Replicas.

Typically used for OLTP databases, and where data protection is a high priority.

BULK_LOGGED

Full logging, except when the required conditions are met for minimally-logged operations.

Supports point-in-time recovery (again, you have to take transaction log backups) for all transaction log backups that do not contain minimally-logged operations. This means you should take a transaction log backup immediately before, and immediately after the minimally-logged operation(s), to maximize your ability to do a point-in-time restore.

Used for performing bulk operations on a database that would otherwise prefer to be in FULL recovery. A database can be switched freely between FULL and BULK_LOGGED without breaking the transaction log backup chain.

SIMPLE

Full logging, except when the required conditions are met for minimally-logged operations.

Does not support point-in-time recovery, and the transaction log cannot be backed up.

A database CHECKPOINT internally clears the transaction log as part of the process (since you can’t take transaction log backups), which means you don’t necessarily have to manage the transaction log as much as the other recovery models.

Usually used for databases that frequently handle large quantities of data (e.g., data warehouses), or databases where point-in-time recovery is not important.

Switching a database from FULL or BULK_LOGGED to SIMPLE breaks the transaction log backup chain. The transaction log backup chain can be restarted by switching from SIMPLE back to FULL or BULK_LOGGED, and taking a full or differential backup.

A good place to start choosing a recovery model would be by answering the following questions. Remember that these are only guidelines, and your specific situation will determine the best course of action.

Does the database require point-in-time recovery? (Yes = FULL or BULK_LOGGED, No = SIMPLE)

Are special features going to be used on this database? (Yes = FULL or BULK_LOGGED — usually FULL — depending on the feature)

Will large quantities of data be imported into, or manipulated within this database? (Yes = BULK_LOGGED or SIMPLE)
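Those guidelines can be condensed into a small decision function (a Python sketch of the rules of thumb above, not a substitute for evaluating your own situation; in particular, some features accept BULK_LOGGED even though FULL is typical):

```python
def suggest_recovery_model(point_in_time: bool,
                           special_features: bool,
                           bulk_operations: bool) -> str:
    # Features such as Database Mirroring or Availability Replicas
    # typically require FULL.
    if special_features:
        return "FULL"
    # No point-in-time requirement: SIMPLE minimizes log management
    # and still allows minimally-logged bulk operations.
    if not point_in_time:
        return "SIMPLE"
    # Point-in-time recovery plus bulk loads: switch to BULK_LOGGED
    # around the loads; otherwise stay in FULL.
    return "BULK_LOGGED" if bulk_operations else "FULL"
```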