Practicing Recovery

Microsoft escalation engineers bring a vast amount of experience to the Product Support Services (PSS) groups. In Tuesday's special edition of SQL Server Magazine UPDATE, I told you about the role of Bob Ward, an escalation engineer with SQL Server PSS, and explained the evolution of PSS. This week, let's look at some practical advice that Ward gave me.

I had hoped that Ward, who knows more about SQL Server than most other people in the world, would give me a "silver bullet" that would let me quickly solve all my SQL Server support problems. However, Ward says that no such silver bullet exists. (OK, I didn't really think there was a silver bullet. But you can't blame me for asking!) Just as you and I do, escalation engineers need a solid understanding of the technology, a tight problem statement, and good troubleshooting skills to solve problems.

But Ward did share a few thoughts about three common problems that many database customers bring to PSS:

My backups are worthless; I still can't restore my data!

My system seems slow, but I have no idea what it looks like when it's fast!

Gee, I wonder how or when that configuration parameter changed?

Ward says that you can avoid these problems by adhering to basic SQL Server best practices. This advice won't come as a surprise, but interestingly, most people who ignore these suggestions know that they should follow them. This week, let's examine the first of the best practices that Ward recommends (developing and testing a recovery plan) and look at the remaining two in next week's commentary.

Most DBAs know that a great backup is worthless if you can't restore the data, and they know that they can help ensure successful data recovery by following the best practice of developing and testing a recovery plan. But Ward says he can't count the times that he's had to help a customer recover missing data after the customer couldn't restore what he thought was a good backup. We can't blame Microsoft for this kind of failure. SQL Server has effective, tightly integrated backup tools. The inability to restore data doesn't happen because the tools fail; it happens because a DBA hasn't fully thought through and tested the recovery plan. In fact, the early results of this week's SQL Server Magazine Instant Poll (at http://sqlmag.com) show that a surprising number of respondents don't even have a recovery plan.

Most DBAs pay lip service to the idea that, in the event of a failure, the ability to restore data is what counts. But I'm amazed at the number of database customers (not just SQL Server customers) who don't test their recovery plans. The worst time to test your recovery plan is during a critical outage. Unfortunately, most customers write a recovery plan, test the backup portion, and perform only a limited and trivial test of the plan's restore component. I admit that testing a backup is easier than testing a restore. Testing a restore requires creativity and agility to fake an outage and then recover the data, especially if you don't have an extra production-quality recovery server lying around. But testing your recovery plan is a crucial step in maintaining the readiness of your production database environment.
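
To make the idea of a restore test a little more concrete, here is a minimal sketch, in Python, of a script that generates the T-SQL for one rehearsal: verify the backup file, restore it under a throwaway name on a test server, and then check the restored copy. This is my own illustration, not something Ward prescribed. The class, the function names, and every database name, logical file name, and path are hypothetical; you can list the logical file names in a real backup with RESTORE FILELISTONLY. The script only builds the statements. Running them, and timing how long the restore takes, is left to whatever tool you already use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RestoreDrill:
    """Inputs for one restore rehearsal. Every value is a hypothetical example."""
    production_db: str   # database the backup was taken from
    backup_file: str     # path to an existing full backup
    scratch_db: str      # throwaway database name for the test restore
    data_logical: str    # logical data file name stored in the backup
    log_logical: str     # logical log file name stored in the backup
    scratch_dir: str     # folder on the test server for the restored files


def quote_name(name: str) -> str:
    """Bracket-quote an identifier, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def quote_text(value: str) -> str:
    """Quote a Unicode string literal, doubling any single quote."""
    return "N'" + value.replace("'", "''") + "'"


def drill_statements(d: RestoreDrill) -> List[str]:
    """Return the T-SQL statements for one rehearsal, in the order to run them."""
    # REPLACE overwrites an existing database of the same name, so refuse
    # to generate a restore that targets the production database name.
    if d.scratch_db.lower() == d.production_db.lower():
        raise ValueError("scratch_db must differ from production_db")

    folder = d.scratch_dir.rstrip("\\")
    data_path = folder + "\\" + d.scratch_db + ".mdf"
    log_path = folder + "\\" + d.scratch_db + "_log.ldf"
    disk = quote_text(d.backup_file)

    return [
        # 1. Check that the backup set is complete and readable.
        f"RESTORE VERIFYONLY FROM DISK = {disk};",
        # 2. Restore under a scratch name, relocating the files.
        f"RESTORE DATABASE {quote_name(d.scratch_db)} FROM DISK = {disk} "
        f"WITH MOVE {quote_text(d.data_logical)} TO {quote_text(data_path)}, "
        f"MOVE {quote_text(d.log_logical)} TO {quote_text(log_path)}, "
        f"REPLACE, RECOVERY;",
        # 3. Check the integrity of the restored copy.
        f"DBCC CHECKDB ({quote_text(d.scratch_db)}) WITH NO_INFOMSGS;",
    ]


if __name__ == "__main__":
    drill = RestoreDrill(
        production_db="Sales",
        backup_file="D:\\backup\\Sales_full.bak",
        scratch_db="Sales_drill",
        data_logical="Sales_data",
        log_logical="Sales_log",
        scratch_dir="E:\\scratch",
    )
    for statement in drill_statements(drill):
        print(statement)
```

The first statement alone is the "limited and trivial" kind of test: it confirms that the file is readable, not that you can bring a database back from it. The second and third statements are what exercise the restore half of the plan.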

Creating and testing a restore plan is entirely the responsibility of the SQL Server user, but Microsoft can do a few things to make it easier for us to adhere to the other two best practices that Ward listed. Next week, I'll discuss the difficulty involved in setting performance baselines and monitoring changes to your database configuration and other settings, and I'll offer some suggestions for how Microsoft can make these tasks easier for the SQL Server community to implement.